Defining the EMR connection parameters - Cloud

Complete the EMR connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

Enter the basic configuration information:

Use local timezone

Select this check box to let Spark use the local time zone provided by the system.

Note:

If you clear this check box, Spark uses the UTC time zone.
Some components also have the Use local timezone for date check box. If you clear the check box in the component, the component inherits the time zone from the Spark configuration.
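
The effect of this choice can be illustrated outside Spark. The following minimal Python sketch is not how Talend Studio or Spark implements the check box; it only shows how the same instant renders differently in UTC and in a local zone. The UTC+01:00 offset, the sample date, and the `render` helper are hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# A fixed instant, expressed in UTC (hypothetical sample value).
instant = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)

# Assumption: the "system" time zone is UTC+01:00 for this example.
local_zone = timezone(timedelta(hours=1))


def render(moment, use_local_timezone):
    """Format the instant in the local zone or in UTC, depending on the flag."""
    zone = local_zone if use_local_timezone else timezone.utc
    return moment.astimezone(zone).strftime("%Y-%m-%d %H:%M:%S")


print(render(instant, use_local_timezone=False))  # 2024-01-15 12:00:00
print(render(instant, use_local_timezone=True))   # 2024-01-15 13:00:00
```

The instant itself does not change; only its textual representation depends on the zone in use.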

Use dataset API in migrated components

Select this check box to let the components use Dataset (DS) API instead of Resilient Distributed Dataset (RDD) API:

If you select the check box, the components inside the Job run with DS, which improves performance.
If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.

This check box is selected by default. However, if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD.

Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.

Use timestamp for dataset components

Select this check box to use `java.sql.Timestamp` for dates.

Note: If you leave this check box clear, `java.sql.Timestamp` or `java.sql.Date` is used, depending on the pattern.

Enter the basic connection information to EMR:

Yarn client

Talend Studio runs the Spark driver to orchestrate how the Job should be performed and then sends the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.

If you are using the Yarn client mode, you need to set the following parameters in their corresponding fields (if you leave the check box of a service clear, the configuration of this parameter in the Hadoop cluster to be used is ignored at runtime):

In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.
Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.
Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.
Select the Set staging directory check box and enter the directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files of your distribution, such as yarn-site.xml or mapred-site.xml.
If you are accessing a Hadoop cluster running with Kerberos security, select the Use Kerberos authentication check box, then enter the Kerberos principal names for the Resource Manager service and the Job History service in the displayed fields. This enables you to use your username to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution, such as yarn-site.xml and mapred-site.xml.

If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. Enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field. This keytab file must be stored on the machine on which your Job actually runs, for example, on a Talend JobServer.

Note that the user who executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, if the username you are using to execute a Job is user1 and the principal to be used is guest, ensure that user1 has the right to read the keytab file to be used.
The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login username for your distribution. If you leave it empty, the username of the machine hosting Talend Studio is used.
If the Spark cluster cannot recognize the machine on which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine to find the Job and thus its driver.

Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file.
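
The addresses, staging directory, and principals mentioned above are usually looked up in the site files of the distribution. As a rough sketch, assuming Hadoop-style `<configuration>` XML, the following Python snippet extracts the properties of interest. The host names, realm, and values in the sample are hypothetical, and the properties are gathered into one sample string only for brevity; in a real cluster they are typically spread across yarn-site.xml and mapred-site.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content; real values come from your cluster's site files.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm.example.internal:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>rm.example.internal:8030</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>jh.example.internal:10020</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>
  <property>
    <name>yarn.resourcemanager.principal</name>
    <value>yarn/_HOST@EXAMPLE.COM</value>
  </property>
</configuration>"""


def read_site_properties(xml_text):
    """Return a dict mapping each property name to its value."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


properties = read_site_properties(SAMPLE_SITE_XML)
for key in sorted(properties):
    print(f"{key} = {properties[key]}")
```

The printed values are the kind of information expected in the Resource manager, scheduler address, jobhistory address, staging directory, and principal fields.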

Yarn cluster

The Spark driver runs in your Yarn cluster to orchestrate how the Job should be performed.

If you are using the Yarn cluster mode, you need to define the following parameters in their corresponding fields (if you leave the check box of a service clear, the configuration of this parameter in the Hadoop cluster to be used is ignored at runtime):

In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.
Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.
Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.
Select the Set staging directory check box and enter the directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files of your distribution, such as yarn-site.xml or mapred-site.xml.
Set path to custom Hadoop configuration JAR: if you are using connections defined in Repository to connect to your Cloudera or Hortonworks cluster, you can select this check box in the Repository wizard and, in the field that is displayed, specify the path to the JAR file that provides the connection parameters of your Hadoop environment. Note that this file must be accessible from the machine where your Job is launched.
This kind of Hadoop configuration JAR file is automatically generated when you build a big data Job from Talend Studio. This JAR file is by default named with this pattern:
```
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
```
You can also download this JAR file from the web console of your cluster or simply create a JAR file yourself by putting the configuration files in the root of your JAR file. For example:
```
hdfs-site.xml
core-site.xml
```
The parameters from your custom JAR file override the parameters you put in the Spark configuration field. They also override the configuration you set in configuration components such as tHDFSConfiguration or tHBaseConfiguration when the related storage system, such as HDFS, HBase or Hive, is native to Hadoop. However, they do not override the configuration set in the configuration components for third-party storage systems, such as tAzureFSConfiguration.
If you are accessing a Hadoop cluster running with Kerberos security, select the Use Kerberos authentication check box, then enter the Kerberos principal names for the Resource Manager service and the Job History service in the displayed fields. This enables you to use your username to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution, such as yarn-site.xml and mapred-site.xml.

If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. Enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field. This keytab file must be stored on the machine on which your Job actually runs, for example, on a Talend JobServer.

Note that the user who executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, if the username you are using to execute a Job is user1 and the principal to be used is guest, ensure that user1 has the right to read the keytab file to be used.
The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login username for your distribution. If you leave it empty, the username of the machine hosting Talend Studio is used.
Select the Wait for the Job to complete check box to make Talend Studio or, if you use Talend JobServer, your Job JVM keep monitoring the Job until its execution is over. Selecting this check box sets the spark.yarn.submit.waitAppCompletion property to true. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to keep it clear when running a Spark Streaming Job.
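
For the custom Hadoop configuration JAR described in the list above, creating the file yourself amounts to placing the configuration files at the root of an archive. The following Python sketch relies on the fact that a JAR file is a ZIP archive and builds one in memory with the standard `zipfile` module. The `build_conf_jar` helper and the placeholder file contents are hypothetical; in practice you would package the real site files of your cluster and write the result to a `.jar` file.

```python
import io
import zipfile


def build_conf_jar(files):
    """Return the bytes of a JAR (ZIP) archive holding each file at its root."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as jar:
        for name, content in files.items():
            # A bare file name (no directory part) places the entry at the root.
            jar.writestr(name, content)
    return buffer.getvalue()


# Placeholder contents; use the actual site files of your cluster instead.
conf_files = {
    "hdfs-site.xml": "<configuration></configuration>",
    "core-site.xml": "<configuration></configuration>",
}

jar_bytes = build_conf_jar(conf_files)

# Reopen the archive to check that both files sit at the root.
with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
    entries = sorted(jar.namelist())

print(entries)  # ['core-site.xml', 'hdfs-site.xml']
```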

Ensure that the username in the Yarn client mode is the same one you put in tS3Configuration, the component used to provide S3 connection information to Spark.

With the Yarn client mode, the Property type list is displayed to allow you to select an established Hadoop connection from the Repository, on the condition that you have created this connection in the Repository. Then Talend Studio will reuse that set of connection information for this Job.
If you need to launch the Job from Windows, it is recommended to specify where the winutils.exe program to be used is stored.
- If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.
- Otherwise, leave this check box clear. Talend Studio then generates a winutils.exe file and automatically uses it for this Job.
In the Spark "scratch" directory field, enter the local directory in which Talend Studio stores temporary files, such as the JAR files to be transferred. If you launch the Job on Windows, the default disk is C:. As a result, if you leave /tmp in this field, this directory is C:/tmp.
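
The Windows behavior of the scratch directory can be pictured with a small Python sketch. It does not reproduce Talend Studio's own logic; it only uses the standard `ntpath` module (Windows path rules, usable on any platform) to show how a rooted path without a drive letter combines with the default drive. The `resolve_scratch_dir` helper is hypothetical.

```python
import ntpath


def resolve_scratch_dir(field_value, default_drive="C:"):
    """Combine a drive-less, rooted path with the default Windows drive."""
    drive, _ = ntpath.splitdrive(field_value)
    if drive:
        # The field already names a drive, so it is kept as entered.
        return field_value
    return ntpath.join(default_drive, field_value)


print(resolve_scratch_dir("/tmp"))      # C:/tmp
print(resolve_scratch_dir("D:/spark"))  # D:/spark
```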

Results

After the connection is configured, you can optionally tune Spark performance by following the process explained in:
- Tuning Spark for Apache Spark Batch Jobs for Spark Batch Jobs.
- Tuning Spark for Apache Spark Streaming Jobs for Spark Streaming Jobs.
It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run view of your Spark Job, in order to help debug and resume your Spark Job when issues arise:
- Logging and checkpointing the activities of your Apache Spark Job.

Defining the EMR connection parameters - Cloud - 8.0
