Amazon EMR - Getting Started - 7.3

author
Talend Documentation Team
EnrichVersion
7.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Open Studio for Big Data
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Hadoop distributions > Amazon EMR
EnrichPlatform
Talend Studio

Amazon EMR - Getting Started

This article shows how to get started managing an Amazon EMR cluster using Talend Studio.

Environment:

The examples use a Talend Studio with Big Data. In addition, they use these licensed products provided by Amazon:

  • Amazon EC2

  • Amazon EMR

To perform the steps listed below, you must have an Amazon AWS account. If you don’t have an Amazon AWS account, please follow the instructions in the Creating an Amazon Web Services Account video.

Launch and Connect to an Amazon EC2 instance

The easiest way to leverage Amazon Web Services from the Talend Studio is to have a Studio instance available in an Amazon EC2 instance.

Procedure

  1. Connect to your Amazon account and then navigate to your AWS Management Console.
  2. In the Amazon Web Services list, click EC2.
  3. To start a new EC2 instance, click Launch Instance.

Recommended settings of an EC2 instance

It is recommended to configure your EC2 instance as follows.

For more explanations about the EC2 settings, see Launching an Instance in the Amazon documentation.

Procedure

  1. You may choose any resource Location; however, it is recommended not to use the EU (Frankfurt) region.
  2. You may use any Windows or Linux OS Amazon image, as long as it is supported by Talend Studio. Please refer to the Compatible Platforms and Java environments documentation.
  3. For the instance type, we recommend at least 2 vCPU and 7.5 GiB of RAM (m3.large or m4.large). For more information about this topic, please refer to the Instance Types web page.
  4. Keep the default network configuration (VPC and Subnet) in the Configure Instance tab.
  5. For storage purposes, allow at least 60 GiB.
  6. You can create a new security group or use your own. The purpose of the security group is to control the inbound and outbound traffic allowed for instances attached to this particular security group.
  7. To launch your instance and connect to it, you will need a key pair. Following the instructions in the Amazon EC2 Key Pairs documentation page, create your key pair and save it in a secure place. In the Review and Launch window, click Launch and use the key pair you just created.
  8. Once your instance is started, it is assigned a public and a private IP address. You will connect to your instance using the public IP address and the key pair you generated earlier.
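The recommended settings above can also be written down as launch parameters. The following is a minimal sketch and not part of the Talend procedure: it only builds a dictionary whose keys follow the request shape of the `run_instances` call in the AWS SDK for Python (boto3). The AMI ID, key pair name, security group ID, and root device name are hypothetical placeholders that you would replace with your own values.

```python
def build_ec2_launch_params(image_id, key_name, security_group_id,
                            instance_type="m4.large", volume_gib=60,
                            device_name="/dev/sda1"):
    """Return keyword arguments shaped for boto3's EC2 run_instances call."""
    # The article recommends allowing at least 60 GiB of storage.
    if volume_gib < 60:
        raise ValueError("allow at least 60 GiB of storage")
    return {
        "ImageId": image_id,                      # any image supported by Talend Studio
        "InstanceType": instance_type,            # at least 2 vCPU and 7.5 GiB of RAM
        "KeyName": key_name,                      # key pair used to connect later
        "SecurityGroupIds": [security_group_id],  # controls inbound and outbound traffic
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {"DeviceName": device_name, "Ebs": {"VolumeSize": volume_gib}}
        ],
    }


# Hypothetical placeholder identifiers, for illustration only.
params = build_ec2_launch_params("ami-EXAMPLE", "my-key-pair", "sg-EXAMPLE")
print(params["InstanceType"], params["BlockDeviceMappings"][0]["Ebs"]["VolumeSize"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("ec2").run_instances(**params)`. The console procedure described above remains the documented path.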

Install and Start the Talend Studio

Once connected to your EC2 instance, you should install Talend Studio.

Before you begin

Install a Java JRE and set the required environment variables. Please refer to the Installation Requirements page.

Procedure

  1. Install Talend Studio as described in the section about how to install the Studio in the Installation Guide.
  2. Once installed, start your Studio and create a new project, as described in the section about how to create a project in the Talend Studio User Guide.

Launch an Amazon EMR cluster from the Talend Studio

To launch an Amazon EMR cluster from the Talend Studio, you can use the tAmazonEMRManage component.

Getting your Amazon Credentials

To access the Amazon Services, you will need your Amazon credentials (access key and secret access key).

If the security policy of your organization does not allow you to explicitly expose the credentials in a client application such as a Job, skip this section and use the Inherit credentials from AWS role check box, which is explained later in this article.

Procedure

  1. Follow the instructions in Managing Access Keys for IAM Users to get your Amazon credentials, and make sure that your access key and secret access key do not contain the “/” symbol. This avoids unexpected behavior when using the services from Talend Studio.
  2. In the Repository, right-click Contexts and click Create context group to save your credentials as context variables in the Studio.
  3. Provide a name, such as AmazonCredentials.
  4. Add two context variables named AccessKey and SecretKey and set the corresponding values.
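Before storing the keys as context variables, you may want to check them for the “/” symbol mentioned in step 1. The snippet below is a small illustrative sketch. The function name and the sample values are hypothetical, and the sample values are not real credentials.

```python
def keys_without_slash(access_key, secret_key):
    """Return True when neither key contains a "/" character."""
    return "/" not in access_key and "/" not in secret_key


# Hypothetical placeholder values, not real credentials.
print(keys_without_slash("EXAMPLEACCESSKEY", "abc123EXAMPLEsecret"))   # True
print(keys_without_slash("EXAMPLEACCESSKEY", "abc/123EXAMPLEsecret"))  # False
```

If the check returns False, generating a new access key pair from the AWS console is one way to obtain keys without the symbol.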

Define roles in Amazon EMR

You need to define the Service role and Job flow role in Amazon EMR before starting an Amazon EMR cluster.

Procedure

  1. Follow the instructions in the Default IAM roles for Amazon EMR documentation to define the roles. This will create the EMR_DefaultRole and the EMR_EC2_DefaultRole that will be used to launch an Amazon EMR cluster from Talend Studio.
  2. From your AWS Management Console, navigate to the Identity & Access Management service.
  3. In the left menu, click Roles and check that your roles have been successfully created.

Start an Amazon EMR cluster

In Talend Studio, create a new Standard Job to launch an Amazon EMR cluster.

Procedure

  1. Create a new Standard Job.
  2. Add a tAmazonEMRManage component and open the Component view.
  3. Provide your Amazon credentials to the Job.
    • If you have set up the context for these credentials in the previous steps, do the following:
      1. In the Contexts view of your Job, add the AmazonCredentials context that is stored in the Repository.
      2. In the Access Key and Secret key fields, use the context variables you created previously, context.AccessKey and context.SecretKey respectively.
    • If the security policy of your organization does not allow you to expose the credentials in a Job, select the Inherit credentials from AWS role check box to obtain AWS security credentials from your EMR instance metadata. To use this option, the S3 system to be used must be S3A and an IAM role must have been configured to manage temporary credentials for client applications. For more information, see Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances.
  4. From the Action list, select Start. This list also allows you to select Stop to stop a running cluster.
  5. In the Region list, select the region to be used.
  6. Give a name to your cluster.
  7. Select an EMR distribution with All Applications. This will allow you to work with Core Hadoop Services and with Spark as well.
  8. Select the Use EC2 key pair check box and provide your EC2 key pair name.
  9. In the Instance configuration area, specify the number of nodes you want. At runtime, one instance is designated as the master and the others are designated as slaves. You can also specify the instance type for the master node and the slave nodes.
  10. Press F6 to run the Job.
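For readers who want to see how these component settings relate to the underlying Amazon EMR request, here is a minimal sketch. It is not a description of what tAmazonEMRManage does internally. It only builds a dictionary whose keys follow the `run_job_flow` request shape of the AWS SDK for Python (boto3). The cluster name, key pair name, and instance types are hypothetical placeholders, and the application list is reduced to Hadoop and Spark for brevity rather than mirroring the All Applications option.

```python
def build_emr_cluster_params(name, release_label, key_name, node_count,
                             master_type="m4.large", slave_type="m4.large"):
    """Return keyword arguments shaped for boto3's EMR run_job_flow call."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Simplified application list; an assumption made for this sketch.
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceCount": node_count,        # one master, the rest are slaves
            "MasterInstanceType": master_type,
            "SlaveInstanceType": slave_type,
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running
        },
        # Default roles created in the "Define roles in Amazon EMR" section.
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


# Hypothetical placeholder values, for illustration only.
cluster = build_emr_cluster_params("my-talend-cluster", "emr-4.0.0", "my-key-pair", 3)
print(cluster["Instances"]["InstanceCount"], cluster["ServiceRole"])
```

With boto3 installed and credentials configured, such a dictionary could be passed as `boto3.client("emr").run_job_flow(**cluster)`. In this article, the same result is obtained by running the Job.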

Results

A new cluster is launched. You can verify it from the Amazon EMR home page.

You can also check the status from the EC2 instances list.

In the Studio, the console in the Run view shows the following message:

Your cluster is now ready.

Update the hosts file

Once started, each EC2 instance is assigned a public and a private IP address, and a public and a private DNS name.

The cluster nodes are configured using the private DNS. Therefore, you must update the hosts file of the Talend Studio instance with the private DNS and private IP of your master node.

Procedure

  1. From the Amazon EMR cluster list, expand your running cluster, then click View cluster details.
  2. Expand the Hardware section to see your master and slave nodes.
  3. Click the hyperlink corresponding to the master node. The master node details are displayed, such as its public and private IPs, and its public and private DNS.
  4. Open your hosts file: on a Windows instance, navigate to C:\Windows\System32\drivers\etc\ and open the hosts file; on a Linux instance, open the /etc/hosts file.
  5. Edit the hosts file using the public IP addresses and the private DNS names of your cluster nodes.

    Because the DNS names and IP addresses of the nodes change each time you start a new Amazon EMR cluster, you must update this file accordingly.

  6. Save and close the file.
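Because the hosts file must be updated for every new cluster, the edit can be scripted. The following is a minimal sketch that works on the file content as text: it removes any earlier mapping for the same host name and appends the new one. The function name, the IP addresses, and the DNS name are hypothetical placeholders. Use the values shown in your master node details.

```python
def add_hosts_entry(hosts_text, ip_address, hostname):
    """Return hosts_text with one entry for hostname, replacing any old one."""
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        # Drop a previous mapping for the same host name (stale cluster).
        if hostname in fields[1:]:
            continue
        kept.append(line)
    kept.append(f"{ip_address}\t{hostname}")
    return "\n".join(kept) + "\n"


# Hypothetical placeholder values, for illustration only.
current = "127.0.0.1\tlocalhost\n10.0.0.5\tip-10-0-0-12.ec2.internal\n"
updated = add_hosts_entry(current, "10.0.0.12", "ip-10-0-0-12.ec2.internal")
print(updated)
```

To apply it, you would read the hosts file at the path given in step 4, pass its content to the function, and write the result back with administrator rights.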

Create the cluster metadata in the Studio

It is recommended to create Hadoop cluster metadata in the Repository. This allows you to easily reuse the connection information for your EMR cluster.

Procedure

  1. In the Import option window, select the options for the Amazon EMR 4.0.0 distribution, then select Enter manually Hadoop services and click Finish.
  2. In the window that pops up, replace localhost and 0.0.0.0 with the private DNS of the master node.
  3. In the User name field, enter hadoop.