Amazon EMR - Getting Started

author
Talend Documentation Team
EnrichVersion
6.4
EnrichProdName
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Big Data Platform
Talend Open Studio for Big Data
Talend Big Data
task
Design and Development > Designing Jobs > Hadoop distributions > Amazon EMR
EnrichPlatform
Talend Studio

This article shows how to get started managing an Amazon EMR cluster using Talend Studio.

Environment:

The examples use Talend Studio with Big Data. In addition, they use the following licensed products provided by Amazon:

  • Amazon EC2

  • Amazon EMR

To perform the steps listed below, you must have an AWS account. If you do not have one, follow the instructions in the Creating an Amazon Web Services Account video.

Launch and Connect to an Amazon EC2 instance

The easiest way to leverage Amazon Web Services from the Talend Studio is to have a Studio instance available in an Amazon EC2 instance.

Procedure

  1. Connect to your Amazon account and then navigate to your AWS Management Console.
  2. In the Amazon Web Services list, click EC2.
  3. To start a new EC2 instance, click Launch Instance.

Recommended settings of an EC2 instance

Install and Start the Talend Studio

Once connected to your EC2 instance, you should install Talend Studio.

Before you begin

Install a Java JRE and set the required environment variables. For details, refer to the Installation Requirements page.

Procedure

  1. Install Talend Studio as described in the section about how to install the Studio in the Installation Guide.
  2. Once installed, start your Studio and create a new project, as described in the section about how to create a project in the Talend Studio User Guide.

Launch an Amazon EMR cluster from the Talend Studio

To launch an Amazon EMR cluster from the Talend Studio, you can use the tAmazonEMRManage component.

Getting your Amazon Credentials

To access the Amazon Services, you will need your Amazon credentials (access key and secret access key).

Procedure

  1. Follow the instructions in Managing Access Keys for IAM Users to get your Amazon credentials, and make sure that neither the access key nor the secret access key contains a “/” symbol. This avoids errors and unexpected behavior when using the services from Talend Studio.
  2. In the Repository, right-click Contexts and click Create context group to save your credentials as context variables in the Studio.
  3. Provide a name, such as AmazonCredentials.
  4. Add two context variables named AccessKey and SecretKey and set the corresponding values.
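
The check for “/” symbols in step 1 can be done programmatically before the values are saved as context variables. The following is a minimal sketch in Python and is not part of Talend Studio; the function name find_slash_problems and the sample values are hypothetical.

```python
def find_slash_problems(access_key: str, secret_key: str) -> list[str]:
    """Return the names of the credentials that contain a "/" character."""
    problems = []
    if "/" in access_key:
        problems.append("AccessKey")
    if "/" in secret_key:
        problems.append("SecretKey")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only; never hard-code real credentials.
    print(find_slash_problems("AKIAEXAMPLEKEY", "abc/def"))
```

If the returned list is not empty, generate a new pair of keys and run the check again before storing them in the AmazonCredentials context group.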

Define roles in Amazon EMR

You need to define the Service role and Job flow role in Amazon EMR before starting an Amazon EMR cluster.

Procedure

  1. Follow the instructions in the Default IAM roles for Amazon EMR documentation to define the roles. This will create the EMR_DefaultRole and the EMR_EC2_DefaultRole that will be used to launch an Amazon EMR cluster from Talend Studio.
  2. From your AWS Management Console, navigate to the Identity and Access Management (IAM) service.
  3. In the left menu, click Roles and check that your roles have been successfully created.
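
The check in step 3 can also be scripted. The sketch below assumes an IAM client object that exposes a list_roles method with the response shape used by the AWS SDK for Python (for example, a client obtained with boto3.client("iam")). The use of boto3 is an assumption, not something this article requires, and the function name missing_roles is hypothetical.

```python
EXPECTED_ROLES = ("EMR_DefaultRole", "EMR_EC2_DefaultRole")


def missing_roles(iam_client, expected=EXPECTED_ROLES):
    """Return the expected role names that the IAM client does not list.

    iam_client is assumed to behave like boto3.client("iam").
    """
    existing = set()
    kwargs = {}
    while True:
        # list_roles is paginated: keep following Marker while IsTruncated is true.
        page = iam_client.list_roles(**kwargs)
        existing.update(role["RoleName"] for role in page["Roles"])
        if not page.get("IsTruncated"):
            break
        kwargs["Marker"] = page["Marker"]
    return [name for name in expected if name not in existing]
```

An empty result means that both default roles exist in the account. Any name that is returned still needs to be created before you launch the cluster.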

Start an Amazon EMR cluster

In the Talend Studio, you create a new Standard Job to launch an Amazon EMR cluster.

Procedure

  1. Create a new Standard Job.
  2. In the Contexts view of your Job, add the AmazonCredentials context that is stored in the Repository.
  3. Add a tAmazonEMRManage component and open the Component view.
  4. In the Access Key and Secret key fields, use the context variables you created previously, context.AccessKey and context.SecretKey respectively.
  5. From the Action list, select Start. This list also allows you to select Stop to stop the cluster.
  6. In the Region list, select the region to be used.
  7. Give a name to your cluster.
  8. Select an EMR 4.0.0 distribution with All Applications. This will allow you to work with Core Hadoop Services and with Spark as well.
  9. Select the Use EC2 key pair check box and provide your EC2 key pair name.
  10. In the Instance configuration area, specify the number of nodes you want. At runtime, one instance is designated as the master and the others are designated as slaves. You can also specify the instance type for the master node and for the slave nodes.
  11. Press F6 to run the Job.
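
For comparison, the settings entered in the component correspond roughly to the parameters of a cluster launch request in the AWS SDK for Python (the run_job_flow call of the EMR client). The sketch below only builds such a request as a dictionary and does not contact AWS. It is an illustration under stated assumptions, not a description of how tAmazonEMRManage works internally. The function names and the default instance type m3.xlarge are hypothetical, and the application selection is left out.

```python
def build_cluster_request(name, key_pair, instance_count,
                          master_type="m3.xlarge", slave_type="m3.xlarge"):
    """Build keyword arguments for an EMR run_job_flow call (illustrative only)."""
    if instance_count < 1:
        raise ValueError("A cluster needs at least one instance (the master).")
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",  # EMR 4.0.0 distribution, as in step 8
        "Instances": {
            "MasterInstanceType": master_type,  # assumed type, adjust as needed
            "SlaveInstanceType": slave_type,    # assumed type, adjust as needed
            "InstanceCount": instance_count,    # total number of nodes
            "Ec2KeyName": key_pair,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


def slave_count(instance_count):
    """One instance is the master, so the remaining instances are slaves."""
    return instance_count - 1
```

With an instance count of 3, for example, the request describes one master node and two slave nodes.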

Results

A new cluster is launched. You can verify it from the Amazon EMR home page.

You can also check the status from the EC2 instances list.

In the Studio, the console in the Run view shows the following message:

Your cluster is now ready.
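
The cluster state can also be read programmatically. The sketch below assumes an EMR client object with a describe_cluster method, as provided by the AWS SDK for Python (for example, boto3.client("emr")). Using boto3 is an assumption, and the function names are hypothetical. Treating the RUNNING and WAITING states as "ready" is a simplification chosen for this example.

```python
READY_STATES = {"RUNNING", "WAITING"}


def cluster_state(emr_client, cluster_id):
    """Return the state string reported for the given cluster ID."""
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    return response["Cluster"]["Status"]["State"]


def is_ready(emr_client, cluster_id):
    """True if the cluster has finished starting and can accept work."""
    return cluster_state(emr_client, cluster_id) in READY_STATES
```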

Update the hosts file

Once started, each EC2 instance is assigned a public and a private IP address, and a public and a private DNS name.

The cluster nodes are configured using the private DNS. Therefore, you need to update the hosts file of the Talend Studio instance with the private DNS and private IP of your master node.
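
A hosts file entry pairs an IP address with a host name on one line. The following minimal sketch formats such an entry from the private IP and private DNS of the master node. The function name hosts_entry and the sample address and DNS name are hypothetical values used only for illustration.

```python
import ipaddress


def hosts_entry(private_ip: str, private_dns: str) -> str:
    """Format one hosts file line: the IP address, a tab, then the host name."""
    ip = ipaddress.ip_address(private_ip)  # raises ValueError if malformed
    if not ip.is_private:
        raise ValueError(f"{private_ip} is not a private IP address")
    return f"{ip}\t{private_dns}"


if __name__ == "__main__":
    # Sample values only; use the ones shown for your own master node.
    print(hosts_entry("10.0.0.12", "ip-10-0-0-12.ec2.internal"))
```

The resulting line is appended to the hosts file of the machine running the Studio, which is /etc/hosts on Linux and C:\Windows\System32\drivers\etc\hosts on Windows.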

Procedure

  1. From the Amazon EMR cluster list, expand your running cluster, then click View cluster details.
  2. Expand the Hardware section to see your master and slave nodes.