Amazon EMR - Getting Started
The examples use Talend Studio with Big Data. In addition, they use licensed products provided by Amazon, including Amazon EC2 and Amazon EMR.
To perform the steps listed below, you must have an AWS account. If you do not have one, follow the instructions in the Creating an Amazon Web Services Account video.
Launch and Connect to an Amazon EC2 instance
- Connect to your Amazon account and then navigate to your AWS Management Console.
- In the Amazon Web Services list, click EC2.
- To start a new EC2 instance, click Launch Instance.
Recommended settings of an EC2 instance
It is recommended to configure your EC2 instance as follows.
For more explanations about the EC2 settings, see Launching an Instance in the Amazon documentation.
- You may choose any resource Location; however, it is recommended not to use the EU (Frankfurt) region.
- You may use any Windows or Linux OS Amazon image, as long as it is supported by Talend Studio. Please refer to the Compatible Platforms and Java environments documentation.
- For the instance type, we recommend at least 2 vCPU and 7.5 GiB of RAM (m3.large or m4.large). For more information about this topic, please refer to the Instance Types web page.
- Keep the default network configuration (VPC and Subnet) in the Configure Instance tab.
- For storage, allow at least 60 GiB.
- You can create a new security group or use your own. The purpose of the security group is to control the inbound and outbound traffic allowed for instances attached to this particular security group.
- To launch your instance and connect to it, you will need a key pair. Following the instructions in the Amazon EC2 Key Pairs documentation page, create your key pair and save it in a secure place. In the Review and Launch window, click Launch and use the key pair you just created.
- Once your instance has started, it is assigned a public and a private IP address. You will connect to your instance using the public IP address and the key pair you generated previously.
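For a Linux instance, the connection step above can be sketched as follows. This is a minimal illustration, not part of the original procedure: the `build_ssh_command` helper, the `ec2-user` login name, the key file name, and the IP address are all assumptions, and the login name in particular depends on the image you selected. The sketch only assembles the OpenSSH command line; it does not open a connection.

```python
import shlex


def build_ssh_command(public_ip, key_path, user="ec2-user"):
    """Return the OpenSSH argument list for connecting to a Linux instance.

    The default user name is an assumption; it varies with the chosen image.
    """
    return ["ssh", "-i", key_path, f"{user}@{public_ip}"]


# Hypothetical values: replace them with your instance's public IP
# and the path to the private key file of your key pair.
command = build_ssh_command("203.0.113.10", "my-key-pair.pem")
print(shlex.join(command))
```

The printed line can be pasted into a terminal on a machine that has an OpenSSH client installed.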
Install and Start the Talend Studio
Before you begin
Install a Java JRE and set the environment variables. Please refer to the Installation Requirements page.
- Install Talend Studio as described in the section about installing the Studio in the Installation Guide.
- Once installed, start your Studio and create a new project, as described in the section about creating a project in the Talend Studio User Guide.
Launch an Amazon EMR cluster from the Talend Studio
Getting your Amazon Credentials
- Follow the instructions in Managing Access Keys for IAM Users to get your Amazon credentials, and make sure that you get an access key and a secret access key that do not contain the “/” symbol. This avoids bugs and unexpected behavior when using the services from Talend Studio.
- In the Repository, right-click Contexts and click Create context group to save your credentials as context variables in the Studio.
- Provide a name, such as AmazonCredentials.
- Add two context variables named AccessKey and SecretKey and set the corresponding values.
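The “/” check on the credentials mentioned above is easy to automate before you paste the values into the context group. The following is a minimal sketch; the `keys_are_slash_free` function and the sample strings are hypothetical and are not real credentials.

```python
def keys_are_slash_free(access_key, secret_key):
    """Return True when neither credential contains a "/" character."""
    return "/" not in access_key and "/" not in secret_key


# Placeholder strings for illustration only.
print(keys_are_slash_free("EXAMPLEACCESSKEY", "exampleSecretKey123"))   # True
print(keys_are_slash_free("EXAMPLEACCESSKEY", "example/Secret/Key123"))  # False
```

If the check returns False, generating a new access key in the AWS Management Console until the pair passes is one way to proceed.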
Define roles in Amazon EMR
- Follow the instructions in the Default IAM roles for Amazon EMR documentation to define the roles. This will create the EMR_DefaultRole and the EMR_EC2_DefaultRole that will be used to launch an Amazon EMR cluster from Talend Studio.
- From your AWS Management Console, navigate to the Identity & Access Management service.
- In the left menu, click Roles and check that your roles have been successfully created.
Start an Amazon EMR cluster
- Create a new Standard Job.
- In the Contexts view of your Job, add the AmazonCredentials context that is stored in the Repository.
- Add a tAmazonEMRManage component and open the Component view.
- In the Access Key and Secret key fields, use the context variables you created previously, context.AccessKey and context.SecretKey respectively.
- From the Action list, select Start. This list also allows you to select Stop to stop the cluster.
- In the Region list, select the region to be used.
- Give a name to your cluster.
- Select an EMR 4.0.0 distribution with All Applications. This will allow you to work with Core Hadoop Services and with Spark as well.
- Select the Use EC2 key pair check box and provide your EC2 key pair name.
- In the Instance configuration area, specify the number of nodes you want. At runtime, one instance is designated as the master and the others are designated as slaves. You can also specify the instance type for the master node and the slave nodes.
- Press F6 to run the Job.
A new cluster is launched. You can verify it from the Amazon EMR home page.
You can also check the status from the EC2 instances list.
In the Studio, the console in the Run view shows the following message:
Your cluster is now ready.
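To make the settings above concrete, here is a rough equivalent written with the AWS SDK for Python (boto3). It is a sketch under stated assumptions, not a description of how tAmazonEMRManage works internally: the helper names, the default instance types, and the short application list (standing in for All Applications) are illustrative choices. The role names and the EMR 4.0.0 release come from the steps above.

```python
def build_cluster_request(name, key_pair_name, node_count,
                          master_type="m3.xlarge", slave_type="m3.xlarge"):
    """Assemble the keyword arguments for the EMR RunJobFlow call.

    The instance type defaults are assumptions made for this example.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",
        # Illustrative subset of the applications available in the release.
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": master_type,
            "SlaveInstanceType": slave_type,
            "InstanceCount": node_count,  # one master, the rest slaves
            "Ec2KeyName": key_pair_name,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


def start_cluster(access_key, secret_key, region, request):
    """Send the request to Amazon EMR and return the new cluster ID."""
    import boto3  # third-party package, imported here so the builder runs without it

    client = boto3.client(
        "emr",
        region_name=region,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    return client.run_job_flow(**request)["JobFlowId"]


# Hypothetical cluster and key pair names.
request = build_cluster_request("my-talend-cluster", "my-key-pair", node_count=3)
print(request["Instances"]["InstanceCount"])
```

Calling `start_cluster` with your credentials and a region launches a billable cluster, so the sketch only builds the request by default.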
Update the hosts file
Once started, each EC2 instance is assigned a public and a private IP address, and a public and a private DNS name.
The cluster nodes are configured using the private DNS name. Therefore, you must update the hosts file on the instance running Talend Studio with the private DNS name and private IP address of your master node.
- From the Amazon EMR cluster list, expand your running cluster, then click View cluster details.
- Expand the Hardware section to see your master and slave nodes.
- Click the hyperlink corresponding to the master node. The master node details are displayed, such as its public and private IPs, and its public and private DNS.
- Open your hosts file: on a Windows instance, navigate to C:\Windows\System32\drivers\etc\ and open the hosts file; on a Linux instance, open the /etc/hosts file.
Edit the hosts file as follows:
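As a sketch of the entry involved, assuming the standard hosts file layout of an IP address followed by a host name, the line to add pairs the master node's private IP with its private DNS name. The `add_hosts_entry` helper and the sample address and name below are hypothetical; use the values shown in your master node details.

```python
def add_hosts_entry(hosts_text, private_ip, private_dns):
    """Return hosts_text with a "<private IP> <private DNS>" line appended.

    The text is returned unchanged when the same mapping is already present.
    """
    for line in hosts_text.splitlines():
        # Ignore trailing comments when comparing existing entries.
        if line.split("#", 1)[0].split() == [private_ip, private_dns]:
            return hosts_text
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{private_ip} {private_dns}\n"


# Hypothetical master node values.
existing = "127.0.0.1 localhost\n"
updated = add_hosts_entry(existing, "10.0.0.12", "ip-10-0-0-12.ec2.internal")
print(updated)
```

The function works on the file content as a string, so writing the result back to the hosts file (which requires administrator rights) is left to you.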