Defining the HD Insight connection parameters

Complete the HD Insight connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of cluster.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

Enter the basic connection information to Microsoft HD Insight:

Livy configuration	The Hostname of Livy is the URL of your HDInsight cluster. This URL can be found in the Overview blade of your cluster. Enter this URL without the https:// part. The default Port is 443. The Username is the one defined when creating your cluster. You can find it in the SSH + Cluster login blade of your cluster. For further information about the Livy service used by HD Insight, see Submit Spark jobs using Livy.
Job status polling configuration	In the Poll interval when retrieving Job status (in ms) field, enter the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job. For example, this status could be Pending or Running. In the Maximum number of consecutive statuses missing field, enter the maximum number of times Talend Studio should retry to get a status when there is no status response.
HDInsight configuration	Enter the address and the authentication information of the Microsoft HD Insight cluster to be used. For example, the address could be `your_hdinsight_cluster_name.azurehdinsight.net` and the authentication information is your Azure account name: `ychen`. Talend Studio uses this service to submit the Job to the HD Insight cluster. In the Job result folder field, enter the location in which you want to store the execution result of a Job in the Azure Storage to be used.
Windows Azure Storage configuration	Enter the address and the authentication information of the Azure Storage or ADLS Gen2 account to be used. In this configuration, you do not define where to read or write your business data but define where to deploy your Job only. In the Container field, enter the name of the container to be used. You can find the available containers in the Blob blade of the Azure Storage account to be used. In the Deployment Blob field, enter the location in which you want to store the current Job and its dependent libraries in this Azure Storage account. In the Hostname field, enter the Primary Blob Service Endpoint of your Azure Storage account without the https:// part. You can find this endpoint in the Properties blade of this storage account. In the Username field, enter the name of the Azure Storage account to be used. In the Password field, enter the access key of the Azure Storage account to be used. This key can be found in the Access keys blade of this storage account.
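
The interaction between the two Job status polling fields can be pictured with a short Python sketch. This is not Talend Studio's implementation: the function name `poll_job_status`, the terminal status names, and the exact retry semantics (giving up once the number of consecutive missing statuses exceeds the configured maximum) are assumptions made for illustration only.

```python
import time
from typing import Callable, Optional

# Hypothetical terminal statuses; the source only names "Pending" and "Running".
TERMINAL_STATUSES = {"Finished", "Failed"}


def poll_job_status(
    fetch_status: Callable[[], Optional[str]],
    poll_interval_ms: int,
    max_consecutive_missing: int,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status is seen.

    fetch_status returns a status string, or None when no status came back.
    """
    missing = 0
    while True:
        status = fetch_status()
        if status is None:
            # No status response: count it and give up once the limit is exceeded.
            missing += 1
            if missing > max_consecutive_missing:
                raise RuntimeError("too many consecutive missing statuses")
        else:
            # Any valid response resets the counter of consecutive misses.
            missing = 0
            if status in TERMINAL_STATUSES:
                return status
        # Wait for the poll interval (given in milliseconds) before asking again.
        sleep(poll_interval_ms / 1000.0)
```

In this sketch, a larger poll interval reduces the number of status requests, while a larger maximum makes the polling more tolerant of temporary gaps in responses.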

Enter the basic configuration information:

Use local timezone	Select this check box to let Spark use the local time zone provided by the system. Note: If you clear this check box, Spark uses the UTC time zone. Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the time zone from the Spark configuration.
Use dataset API in migrated components	Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API: If you select the check box, the components inside the Job run with DS, which improves performance. If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility. This check box is selected by default, but if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD. Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
Use timestamp for dataset components	Select this check box to use java.sql.Timestamp for dates. Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
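
The effect of the Use local timezone option can be illustrated with a small Python sketch that renders one instant either in UTC or in a local zone. It does not use Spark or Talend APIs; the function name `render_instant` and the fixed UTC+2 offset standing in for "the local time zone provided by the system" are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone: a fixed UTC+2 offset stands in for the system time zone.
LOCAL_ZONE = timezone(timedelta(hours=2))


def render_instant(instant: datetime, use_local_timezone: bool) -> str:
    """Format the same instant in the local zone or in UTC."""
    zone = LOCAL_ZONE if use_local_timezone else timezone.utc
    return instant.astimezone(zone).strftime("%Y-%m-%d %H:%M:%S")


instant = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
print(render_instant(instant, use_local_timezone=False))  # rendered in UTC
print(render_instant(instant, use_local_timezone=True))   # rendered in UTC+2
```

The instant itself is identical in both calls; only the zone used to display it changes, which is why the same data can show different wall-clock values depending on this setting.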

In the Spark "scratch" directory field, enter the directory on the local system in which Talend Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. As a result, if you leave /tmp in this field, this directory is C:/tmp.
Select the Wait for the Job to complete check box to make Talend Studio (or, if you use Talend JobServer, your Job JVM) keep monitoring the Job until its execution is over. Selecting this check box sets the spark.yarn.submit.waitAppCompletion property to true. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to keep it clear when running a Spark Streaming Job.
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in the file system of the cluster in which Spark stores the context data of the computations, such as the metadata and the generated RDDs.
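
Two of the settings above can be sketched in Python to make their effect concrete: how a drive-less path such as /tmp resolves against the default disk on Windows, and which Spark property the Wait for the Job to complete check box corresponds to. The helper names are hypothetical, and the assumption that a cleared check box maps to `false` is not stated in this page; only the property name spark.yarn.submit.waitAppCompletion and the C:/tmp result come from the text.

```python
from pathlib import PureWindowsPath
from typing import Dict


def resolve_scratch_dir_on_windows(field_value: str, default_drive: str = "C:") -> str:
    """Resolve the scratch directory field against the default Windows disk.

    A rooted path without a drive (such as /tmp) keeps the default drive;
    a path that names its own drive replaces it.
    """
    return PureWindowsPath(default_drive + "/", field_value).as_posix()


def wait_for_completion_property(wait_for_job: bool) -> Dict[str, str]:
    """Map the check box state to the Spark property named in this section.

    Assumption: a cleared check box corresponds to the value "false".
    """
    return {"spark.yarn.submit.waitAppCompletion": "true" if wait_for_job else "false"}


print(resolve_scratch_dir_on_windows("/tmp"))
print(wait_for_completion_property(True))
```

Using `PureWindowsPath` keeps the sketch runnable on any operating system, because it applies Windows path rules without touching the local file system.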

Results

After the connection is configured, you can optionally tune the Spark performance by following the process explained in:
- Tuning Spark for Apache Spark Batch Jobs for Spark Batch Jobs.
- Tuning Spark for Apache Spark Streaming Jobs for Spark Streaming Jobs.
It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run view of your Spark Job, in order to help debug and resume your Spark Job when issues arise:
- Logging and checkpointing the activities of your Apache Spark Job.
