tAmazonEMRManage - 6.3

Talend Open Studio for Big Data Components Reference Guide


Function

tAmazonEMRManage launches or terminates a cluster on Amazon EMR (Elastic MapReduce).

Purpose

tAmazonEMRManage allows you to manage Amazon EMR (Elastic MapReduce) clusters.

tAmazonEMRManage properties

Component family

Cloud/Amazon/EMR

Basic settings

Access key and Secret key

Specify the access keys (the access key ID in the Access Key field and the secret access key in the Secret Key field) required to access the Amazon Web Services. For more information about AWS access keys, see Access keys (access key ID and secret access key).

To enter the secret key, click the [...] button next to the secret key field, and then in the pop-up dialog box enter the secret key between double quotes and click OK to save the settings.

 

Inherit credentials from AWS role

Select this check box to leverage the instance profile credentials. These credentials can be used on Amazon EC2 instances, and are delivered through the Amazon EC2 metadata service. To use this option, your Job must be running within Amazon EC2 or other services that can leverage IAM Roles for access to resources. For more information, see Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances.

 

Assume role

Select this check box and specify the values for the following parameters used to create a new assumed role session.

  • Role ARN: the Amazon Resource Name (ARN) of the role to assume.

  • Role session name: an identifier for the assumed role session.

  • Session duration (minutes): the length of time, in minutes, for which the assumed role session remains active.

For more information about assuming roles, see AssumeRole.

Configuration

Action

Select an action to be performed from the list, either Start or Stop.

  • Start: launch an Amazon EMR cluster.

  • Stop: terminate an Amazon EMR cluster.

Region

Specify the AWS region by selecting a region name from the list or entering a region between double quotation marks (for example "us-east-1"). For more information about how to specify the AWS region, see Choose an AWS Region.

Cluster name

Enter the name of the cluster.

Cluster version

Select the version of the cluster.

Application

Select the applications to be installed on the cluster.

This list is available only when an EMR version is selected from the Cluster version list.

Service role

Enter the IAM (Identity and Access Management) role for the Amazon EMR service. The default role is EMR_DefaultRole. To use this default role, you must have already created it.

Job flow role

Enter the IAM role for the EC2 instances that Amazon EMR manages. The default role is EMR_EC2_DefaultRole. To use this default role, you must have already created it.

Enable log

Select this check box to enable logging and in the field displayed specify the path to a folder in an S3 bucket where you want Amazon EMR to write the log data.

Use EC2 key pair

Select this check box to associate an Amazon EC2 (Elastic Compute Cloud) key pair with the cluster and in the field displayed enter the name of your EC2 key pair.

Predicate

Specify the cluster(s) that you want to stop:

  • All running clusters: all running clusters will be stopped.

  • All running clusters with predefined name: all running clusters with a given name will be stopped. In the Cluster name field displayed, you need to specify the name of the cluster(s) to be stopped.

  • Running cluster with predefined id: the running cluster with a given ID will be stopped. In the Cluster id field displayed, you need to specify the ID of the cluster to be stopped.

This list is available only when Stop is selected from the Action list.
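The three Predicate options can be read as filters over the set of running clusters. The following is a minimal sketch of that reading, not the component's actual code: the class, enum, and method names are hypothetical, the cluster IDs and names are placeholders, and it assumes that a name can match more than one running cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the three Predicate options; not the component's actual code.
class StopPredicateSketch {

    enum Predicate { ALL_RUNNING, RUNNING_WITH_NAME, RUNNING_WITH_ID }

    // Minimal stand-in for a cluster summary (ID, name, running flag).
    static final class Cluster {
        final String id;
        final String name;
        final boolean running;

        Cluster(String id, String name, boolean running) {
            this.id = id;
            this.name = name;
            this.running = running;
        }
    }

    // Returns the IDs of the clusters that the chosen predicate would select.
    // 'value' is the cluster name or ID and is ignored for ALL_RUNNING.
    static List<String> select(List<Cluster> clusters, Predicate predicate, String value) {
        List<String> ids = new ArrayList<>();
        for (Cluster c : clusters) {
            if (!c.running) {
                continue; // only running clusters are candidates
            }
            boolean match;
            switch (predicate) {
                case RUNNING_WITH_NAME: match = c.name.equals(value); break;
                case RUNNING_WITH_ID:   match = c.id.equals(value);   break;
                default:                match = true;
            }
            if (match) {
                ids.add(c.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Cluster> clusters = new ArrayList<>();
        clusters.add(new Cluster("id-1", "nightly", true));   // placeholder values
        clusters.add(new Cluster("id-2", "adhoc", true));
        clusters.add(new Cluster("id-3", "nightly", false));
        System.out.println(select(clusters, Predicate.RUNNING_WITH_NAME, "nightly"));
    }
}
```

In this sketch a stopped cluster is never selected, even when its name or ID matches, which mirrors the "running" qualifier in each option label.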

Instance Configuration

Instance count

Enter the number of Amazon EC2 instances to initialize.

Master instance type

Select the type of the master instance to initialize.

Slave instance type

Select the type of the slave instance to initialize.

Advanced settings

STS Endpoint

Select this check box and, in the field displayed, specify the AWS Security Token Service endpoint from which session credentials are retrieved.

This check box is available only when the Assume role check box is selected.

Wait for cluster ready

Select this check box to let your Job wait until the launch of the cluster is completed.

Visible to all users

Select this check box to make the cluster visible to all IAM users.

Termination Protect

Select this check box to enable termination protection to prevent instances in the cluster from shutting down due to errors or issues during processing.

Enable debug

Select this check box to enable the debug mode.

Subnet id

Specify the identifier of the Amazon VPC (Virtual Private Cloud) subnet where you want the job flow to launch.

Availability Zone

Specify the availability zone for your cluster's EC2 instances.

Master security group

Specify the security group for the master instance.

Additional master security groups

Specify additional security groups for the master instance and separate them with a comma, for example, gname1, gname2, gname3.

Slave security group

Specify the security group for the slave instances.

Additional slave security groups

Specify additional security groups for the slave instances and separate them with a comma, for example, gname1, gname2, gname3.

Bootstrap

Actions

Specify the bootstrap actions associated with the cluster, by clicking the [+] button below the table to add as many rows as needed, each row for a bootstrap action, and setting the following parameters for each action:

  • Name: enter the name of the bootstrap action.

  • Script location: specify the location of the script run by the bootstrap action, for example, s3://ap-northeast-1.elasticmapreduce/bootstrap-actions/run-if.

  • Arguments: enter the list of command line arguments (separated by commas) passed to the bootstrap action script, for example, "arg0","arg1","arg2".

For more information about the bootstrap actions, see BootstrapActionConfig.

Step Configuration

Steps

Specify the job flow step(s) to be invoked on the cluster after its launch, by clicking the [+] button below the table to add as many rows as needed, each row for a step, and setting the following parameters for each step:

  • Name: enter the name of the job flow step.

  • Action on Failure: click the cell and from the drop-down list select the action to take if the job flow step fails.

  • Main Class: enter the name of the main class in the specified JAR file. If not specified, the JAR file should specify a Main-Class in its manifest file.

  • Jar: enter the path to the JAR file run during the step, for example, "s3://inputjar/test.jar".

  • Args: enter the list of command line arguments (separated by commas) passed to the JAR file's main function when executed, for example, "arg0","arg1","arg2".

For more information about the job flow steps, see StepConfig.
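Both the Arguments column of a bootstrap action and the Args column of a step take a comma-separated list of quoted values such as "arg0","arg1","arg2". If the arguments are held in a Java list, that form could be produced along the following lines. This is only a sketch: the class and method names are hypothetical, and the escaping of embedded quotes and backslashes assumes that the field follows Java string literal rules.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Talend or the AWS SDK.
class ArgsFieldSketch {

    // Wraps each argument in double quotes and joins the results with commas,
    // producing the "arg0","arg1","arg2" form used in the Arguments and Args columns.
    static String toArgsField(List<String> args) {
        StringBuilder sb = new StringBuilder();
        for (String arg : args) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            // Assumption: backslashes and embedded quotes are escaped as in a Java string literal.
            String escaped = arg.replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toArgsField(Arrays.asList("arg0", "arg1", "arg2")));
    }
}
```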

Keep alive after steps complete

Select this check box to keep the job flow alive after completing all steps.

Wait for steps to complete

Select this check box to let your Job wait until the job flow steps are completed.

This check box is available only when the Wait for cluster ready check box is selected.

Properties

Specify the classification and property information supplied to the configuration object of the EMR cluster to be created, by clicking the [+] button below the table to add as many rows as needed, each row for a property, and setting the following parameters:

  • Classification: specify the classification of the configuration.

  • Key: enter the key of the property.

  • Value: enter the value of the property.
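An Amazon EMR configuration object pairs one classification with a set of key/value properties, so rows that share a classification conceptually belong to the same object. The sketch below only illustrates that grouping. How the component builds the request internally is not described here, the class and method names are hypothetical, and the sample rows are illustrative values rather than recommended settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the Properties table; not the component's actual code.
class EmrPropertiesSketch {

    // Groups (classification, key, value) rows into one key/value map per classification,
    // preserving the order in which the rows were entered.
    static Map<String, Map<String, String>> group(String[][] rows) {
        Map<String, Map<String, String>> byClassification = new LinkedHashMap<>();
        for (String[] row : rows) {
            Map<String, String> props = byClassification.get(row[0]);
            if (props == null) {
                props = new LinkedHashMap<>();
                byClassification.put(row[0], props);
            }
            props.put(row[1], row[2]); // a repeated key overwrites the earlier value
        }
        return byClassification;
    }

    public static void main(String[] args) {
        // Illustrative rows only: classification, key, value.
        String[][] rows = {
            {"core-site", "io.file.buffer.size", "65536"},
            {"core-site", "fs.trash.interval", "60"},
            {"spark-defaults", "spark.executor.memory", "2g"}
        };
        System.out.println(group(rows));
    }
}
```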

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Global Variables

CLUSTER_FINAL_ID: the ID of the cluster. This is an After variable and it returns a string.

CLUSTER_FINAL_NAME: the name of the cluster. This is an After variable and it returns a string.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.
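In Java code, these variables are read from the Job's globalMap under a key made of the component name and the variable name, which is the pattern used by the tJava code later in this section. The following is a minimal, self-contained sketch of that lookup. A plain HashMap stands in for globalMap, the component name tAmazonEMRManage_1 is an assumed label, the helper names are hypothetical, and the cluster ID is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper around a globalMap-style lookup.
class AfterVariableSketch {

    // Builds the key for a component variable, for example "tAmazonEMRManage_1_CLUSTER_FINAL_ID".
    static String key(String componentName, String variable) {
        return componentName + "_" + variable;
    }

    // Reads a string variable and returns the fallback when the variable has not been set,
    // for instance because the component has not finished executing yet.
    static String read(Map<String, Object> globalMap, String componentName,
                       String variable, String fallback) {
        Object value = globalMap.get(key(componentName, variable));
        return value == null ? fallback : (String) value;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>(); // stand-in for the Job's globalMap
        globalMap.put("tAmazonEMRManage_1_CLUSTER_FINAL_ID", "<cluster-id>"); // placeholder value

        System.out.println(read(globalMap, "tAmazonEMRManage_1", "CLUSTER_FINAL_ID", "n/a"));
        System.out.println(read(globalMap, "tAmazonEMRManage_1", "ERROR_MESSAGE", "no error"));
    }
}
```

Because CLUSTER_FINAL_ID is an After variable, a lookup like this is only meaningful in a component that runs after tAmazonEMRManage has completed, for example one linked through an OnSubjobOk trigger.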

Usage

tAmazonEMRManage is usually used as a standalone component.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Managing an Amazon EMR cluster

Here's an example of using Talend components to manage an Amazon EMR cluster.

Creating an Amazon EMR cluster management Job

Create a Job to start a new Amazon EMR cluster, then resize the cluster, and finally list the ID and name information of the instance groups in the cluster.

  1. Create a new Job and add a tAmazonEMRManage component, a tAmazonEMRResize component, a tAmazonEMRListInstances component, and a tJava component by typing their names in the design workspace or dropping them from the Palette.

  2. Link the tAmazonEMRManage component to the tAmazonEMRResize component using a Trigger > OnSubjobOk connection.

  3. Link the tAmazonEMRResize component to the tAmazonEMRListInstances component using a Trigger > OnSubjobOk connection.

  4. Link the tAmazonEMRListInstances component to the tJava component using a Row > Iterate connection.

Starting a new Amazon EMR cluster

Configure the tAmazonEMRManage component to start a new Amazon EMR cluster.

  1. Double-click the tAmazonEMRManage component to open its Basic settings view.

  2. In the Access Key and Secret Key fields, enter the authentication credentials required to access Amazon Web Services.

  3. From the Action list, select Start to start a cluster.

  4. Select the AWS region from the Region drop-down list. In this example, it is Asia Pacific (Tokyo).

  5. In the Cluster name field, enter the name of the cluster to be started. In this example, it is talend-doc-emr-cluster.

  6. From the Cluster version and Application drop-down lists, select the version of the cluster and the application to be installed on the cluster.

  7. Select the Enable log check box and in the field displayed, specify the path to a folder in an S3 bucket where you want Amazon EMR to write the log data. In this example, it is s3://talend-doc-emr-bucket.

Resizing the Amazon EMR cluster by adding a new task instance group

Configure the tAmazonEMRResize component to resize a running Amazon EMR cluster by adding a new task instance group.

  1. Double-click the tAmazonEMRResize component to open its Basic settings view.

  2. In the Access Key and Secret Key fields, enter the authentication credentials required to access Amazon Web Services.

  3. From the Action drop-down list, select Add task instance group to resize the cluster by adding a new task instance group.

  4. In the Cluster id field, enter the ID of the cluster to be resized. In this example, the returned value of the global variable CLUSTER_FINAL_ID of the previous tAmazonEMRManage component is used.

    Note that you can retrieve the global variable by pressing Ctrl + Space and selecting the relevant global variable from the list.

  5. In the Group name field, enter the name of the task instance group to be added in the cluster. In this example, it is talend-doc-instance-group.

  6. In the Instance count field, specify the number of the instances to be created.

  7. From the Task instance type drop-down list, select the type of the instances to be created.

Listing the instance groups in the Amazon EMR cluster

Configure the tAmazonEMRListInstances component and the tJava component to retrieve and display the ID and name information of all instance groups in a running cluster.

  1. Double-click the tAmazonEMRListInstances component to open its Basic settings view.

  2. In the Access Key and Secret Key fields, enter the authentication credentials required to access Amazon Web Services.

  3. Select the AWS region from the Region drop-down list. In this example, it is Asia Pacific (Tokyo).

  4. Clear the Filter master and core instances check box to list all instance groups, including the Master, Core, and Task type instance groups.

  5. In the Cluster id field, enter the ID of the cluster for which to list the instance groups. In this example, the returned value of the global variable CLUSTER_FINAL_ID of the previous tAmazonEMRManage component is used.

  6. Double-click the tJava component to open its Basic settings view.

  7. In the Code field, enter the following code to print the ID and Name information of each instance group in the cluster.

    System.out.println("\r\n===== Instance Group =====");
    System.out.println("Instance Group ID:    " + (String)globalMap.get("tAmazonEMRListInstances_1_CURRENT_GROUP_ID"));
    System.out.println("Instance Group Name:  " + (String)globalMap.get("tAmazonEMRListInstances_1_CURRENT_GROUP_NAME"));
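Outside the Studio, the effect of the Iterate connection can be approximated with a loop that fills a map before each print. The sketch below assumes a plain HashMap as a stand-in for globalMap and uses placeholder group IDs and names. The class and method names are hypothetical, and the "(not set)" fallback is an addition that the code above does not have (there, a missing variable would print as null).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone approximation of the tJava code run once per instance group.
class InstanceGroupPrintSketch {

    static final String ID_KEY = "tAmazonEMRListInstances_1_CURRENT_GROUP_ID";
    static final String NAME_KEY = "tAmazonEMRListInstances_1_CURRENT_GROUP_NAME";

    // Formats the two lines printed for the current instance group,
    // substituting "(not set)" when a variable is missing from the map.
    static String describe(Map<String, Object> globalMap) {
        Object id = globalMap.get(ID_KEY);
        Object name = globalMap.get(NAME_KEY);
        return "Instance Group ID:    " + (id == null ? "(not set)" : id) + "\n"
             + "Instance Group Name:  " + (name == null ? "(not set)" : name);
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>(); // stand-in for the Job's globalMap

        // Placeholder groups; in the Job these values come from tAmazonEMRListInstances.
        String[][] groups = {
            {"<group-id-1>", "talend-doc-instance-group"},
            {"<group-id-2>", "<group-name-2>"}
        };

        for (String[] group : groups) { // each pass stands in for one Iterate execution
            globalMap.put(ID_KEY, group[0]);
            globalMap.put(NAME_KEY, group[1]);
            System.out.println("\n===== Instance Group =====");
            System.out.println(describe(globalMap));
        }
    }
}
```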

Executing the Job to manage the Amazon EMR cluster

After setting up the Job and configuring its components to manage the Amazon EMR cluster, you can execute the Job and verify the execution result.

  1. Press Ctrl + S to save the Job and then F6 to execute the Job.

    The Job starts and resizes the Amazon EMR cluster, and then lists all instance groups in the cluster.

  2. View the cluster details on the Amazon EMR Cluster List page to validate the Job execution result.