Managing an Amazon EMR cluster - 6.3

Talend Open Studio for Big Data Components Reference Guide

Here's an example of using Talend components to manage an Amazon EMR cluster.

Creating an Amazon EMR cluster management Job

Create a Job that starts a new Amazon EMR cluster, resizes the cluster, and then lists the ID and name of each instance group in the cluster.

  1. Create a new Job and add a tAmazonEMRManage component, a tAmazonEMRResize component, a tAmazonEMRListInstances component, and a tJava component by typing their names in the design workspace or dropping them from the Palette.

  2. Link the tAmazonEMRManage component to the tAmazonEMRResize component using a Trigger > OnSubjobOk connection.

  3. Link the tAmazonEMRResize component to the tAmazonEMRListInstances component using a Trigger > OnSubjobOk connection.

  4. Link the tAmazonEMRListInstances component to the tJava component using a Row > Iterate connection.
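The control flow that these links set up can be sketched in plain Java. This is only an illustration of the OnSubjobOk behavior (the next subjob starts only when the previous one finishes without error), not the code that the Studio generates; the class `EmrJobFlowSketch`, the `Subjob` interface, and the `runChain` method are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Job's control flow; not Talend-generated code.
class EmrJobFlowSketch {

    /** The work performed by one subjob. */
    interface Subjob {
        void run() throws Exception;
    }

    /**
     * Runs subjobs in order, mimicking OnSubjobOk: the next subjob starts
     * only if the previous one finished without throwing.
     * Returns the names of the subjobs that completed.
     */
    static List<String> runChain(String[] names, Subjob[] subjobs) {
        List<String> completed = new ArrayList<>();
        for (int i = 0; i < subjobs.length; i++) {
            try {
                subjobs[i].run();
                completed.add(names[i]);
            } catch (Exception e) {
                // A failed subjob does not fire OnSubjobOk, so the chain stops here.
                break;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        String[] names = {"tAmazonEMRManage", "tAmazonEMRResize", "tAmazonEMRListInstances"};
        Subjob ok = () -> { };
        System.out.println(runChain(names, new Subjob[] {ok, ok, ok}));
    }
}
```

In the Job itself, the tJava component is not a separate link in this chain: the Iterate connection runs it once for each instance group found by tAmazonEMRListInstances.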

Starting a new Amazon EMR cluster

Configure the tAmazonEMRManage component to start a new Amazon EMR cluster.

  1. Double-click the tAmazonEMRManage component to open its Basic settings view.

  2. In the Access Key and Secret Key fields, enter the AWS credentials used to authenticate with Amazon EMR.

  3. From the Action list, select Start to start a cluster.

  4. Select the AWS region from the Region drop-down list. In this example, it is Asia Pacific (Tokyo).

  5. In the Cluster name field, enter the name of the cluster to be started. In this example, it is talend-doc-emr-cluster.

  6. From the Cluster version and Application drop-down lists, select the version of the cluster and the application to be installed on the cluster.

  7. Select the Enable log check box and in the field displayed, specify the path to a folder in an S3 bucket where you want Amazon EMR to write the log data. In this example, it is s3://talend-doc-emr-bucket.
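Before entering the log path, it can help to confirm that the value has the s3://bucket/folder shape. The following is a minimal sketch using only the Java standard library; the class `S3LogPathCheck` and its methods are hypothetical helpers introduced for illustration and are not part of Talend or the AWS SDK. It checks only the form of the value, not whether the bucket exists.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of Talend or the AWS SDK.
class S3LogPathCheck {

    /** Parses the value and insists on the s3://<bucket>[/<folder>] shape. */
    static URI parse(String s3Uri) {
        URI uri;
        try {
            uri = new URI(s3Uri);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + s3Uri, e);
        }
        if (!"s3".equals(uri.getScheme()) || uri.getAuthority() == null) {
            throw new IllegalArgumentException("Expected s3://<bucket>[/<folder>]: " + s3Uri);
        }
        return uri;
    }

    /** Returns the bucket name, which is the authority part of the URI. */
    static String bucketOf(String s3Uri) {
        return parse(s3Uri).getAuthority();
    }

    /** Returns the folder after the bucket without the leading slash ("" if none). */
    static String folderOf(String s3Uri) {
        String path = parse(s3Uri).getPath();
        return path.startsWith("/") ? path.substring(1) : path;
    }

    public static void main(String[] args) {
        // The log path used in this example.
        String logPath = "s3://talend-doc-emr-bucket";
        System.out.println("bucket=" + bucketOf(logPath) + ", folder=\"" + folderOf(logPath) + "\"");
    }
}
```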

Resizing the Amazon EMR cluster by adding a new task instance group

Configure the tAmazonEMRResize component to resize a running Amazon EMR cluster by adding a new task instance group.

  1. Double-click the tAmazonEMRResize component to open its Basic settings view.

  2. In the Access Key and Secret Key fields, enter the AWS credentials used to authenticate with Amazon EMR.

  3. From the Action drop-down list, select Add task instance group to resize the cluster by adding a new task instance group.

  4. In the Cluster id field, enter the ID of the cluster to be resized. This example uses the CLUSTER_FINAL_ID global variable returned by the preceding tAmazonEMRManage component.

    To retrieve the global variable, press Ctrl + Space and select it from the list.

  5. In the Group name field, enter the name of the task instance group to be added to the cluster. In this example, it is talend-doc-instance-group.

  6. In the Instance count field, specify the number of instances to be created.

  7. From the Task instance type drop-down list, select the type of the instances to be created.
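The value placed in the Cluster id field is a Java expression that reads the global variable from globalMap. Assuming the first component keeps its default name, tAmazonEMRManage_1 (an assumption; use the name shown in your Job), the expression would typically look like `((String)globalMap.get("tAmazonEMRManage_1_CLUSTER_FINAL_ID"))`. The sketch below reproduces that lookup outside the Studio with an ordinary map standing in for globalMap; the class `ClusterIdLookupSketch`, its methods, and the cluster ID value are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the globalMap lookup; not Talend-generated code.
class ClusterIdLookupSketch {

    /**
     * Builds the globalMap key for a component's return variable,
     * following the <componentName>_<VARIABLE> pattern.
     */
    static String globalKey(String componentName, String variable) {
        return componentName + "_" + variable;
    }

    /** Reads the cluster ID the way a Cluster id field expression would. */
    static String clusterId(Map<String, Object> globalMap, String manageComponent) {
        return (String) globalMap.get(globalKey(manageComponent, "CLUSTER_FINAL_ID"));
    }

    public static void main(String[] args) {
        // Stand-in for Talend's globalMap; in a real Job the Studio provides it.
        Map<String, Object> globalMap = new HashMap<>();
        // Placeholder value; a real run stores the ID returned by Amazon EMR.
        globalMap.put("tAmazonEMRManage_1_CLUSTER_FINAL_ID", "j-EXAMPLE");

        System.out.println(clusterId(globalMap, "tAmazonEMRManage_1"));
    }
}
```

If the component name in the key does not match the component in the Job, the lookup returns null, which is a common cause of an empty Cluster id at run time.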

Listing the instance groups in the Amazon EMR cluster

Configure the tAmazonEMRListInstances component and the tJava component to retrieve and display the ID and name information of all instance groups in a running cluster.

  1. Double-click the tAmazonEMRListInstances component to open its Basic settings view.

  2. In the Access Key and Secret Key fields, enter the AWS credentials used to authenticate with Amazon EMR.

  3. Select the AWS region from the Region drop-down list. In this example, it is Asia Pacific (Tokyo).

  4. Clear the Filter master and core instances check box to list all instance groups, including the Master, Core, and Task type instance groups.

  5. In the Cluster id field, enter the ID of the cluster whose instance groups you want to list. This example uses the CLUSTER_FINAL_ID global variable returned by the preceding tAmazonEMRManage component.

  6. Double-click the tJava component to open its Basic settings view.

  7. In the Code field, enter the following code to print the ID and Name information of each instance group in the cluster.

    System.out.println("\r\n===== Instance Group =====");
    System.out.println("Instance Group ID:    " + (String)globalMap.get("tAmazonEMRListInstances_1_CURRENT_GROUP_ID"));
    System.out.println("Instance Group Name:  " + (String)globalMap.get("tAmazonEMRListInstances_1_CURRENT_GROUP_NAME"));
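To see how the Iterate connection drives this code, the loop can be simulated outside the Studio. In the sketch below, an ordinary map stands in for globalMap and the two CURRENT_GROUP variables are overwritten before each pass, which is what makes the same tJava code print a different group every time. The class `ListInstanceGroupsSketch`, the method `describeCurrentGroup`, and the group IDs and names are hypothetical; real values come from Amazon EMR at run time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of the Iterate loop; not Talend-generated code.
class ListInstanceGroupsSketch {

    /** Formats the current group using the same labels as the tJava code. */
    static String describeCurrentGroup(Map<String, Object> globalMap) {
        String id = (String) globalMap.get("tAmazonEMRListInstances_1_CURRENT_GROUP_ID");
        String name = (String) globalMap.get("tAmazonEMRListInstances_1_CURRENT_GROUP_NAME");
        return "===== Instance Group =====\n"
                + "Instance Group ID:    " + id + "\n"
                + "Instance Group Name:  " + name;
    }

    public static void main(String[] args) {
        // Made-up groups for illustration only.
        String[][] groups = {
            {"ig-EXAMPLE1", "Master instance group"},
            {"ig-EXAMPLE2", "Core instance group"},
            {"ig-EXAMPLE3", "talend-doc-instance-group"},
        };
        Map<String, Object> globalMap = new HashMap<>();
        for (String[] group : groups) {
            // Each iteration replaces the two variables before the tJava code runs.
            globalMap.put("tAmazonEMRListInstances_1_CURRENT_GROUP_ID", group[0]);
            globalMap.put("tAmazonEMRListInstances_1_CURRENT_GROUP_NAME", group[1]);
            System.out.println(describeCurrentGroup(globalMap));
        }
    }
}
```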

Executing the Job to manage the Amazon EMR cluster

After setting up the Job and configuring its components to manage the Amazon EMR cluster, execute the Job and verify the result.

  1. Press Ctrl + S to save the Job and then F6 to execute the Job.

    The Job starts the Amazon EMR cluster, resizes it, and then prints the ID and name of each instance group in the cluster.

  2. View the cluster details on the Amazon EMR Cluster List page to validate the Job execution result.