Defining HDInsight connection parameters with Spark Universal - Cloud - 8.0

Talend Studio User Guide

Version: Cloud 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Cloud, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development
Last publication date: 2024-04-16
Available in: Big Data, Big Data Platform, Cloud Big Data, Cloud Big Data Platform, Cloud Data Fabric, Data Fabric, Real-Time Big Data Platform

Complete the HDInsight connection configuration with Spark Universal in the Spark configuration tab of the Run view of your Spark Batch Job. This configuration is effective on a per-Job basis.

Procedure

  1. Enter the basic configuration information to connect to HDInsight:
    1. Username: enter your HDInsight cluster username.
    2. Password: enter your HDInsight cluster password.
  2. Enter the basic configuration information for Livy:
    1. Hostname: enter the URL of your HDInsight cluster.
    2. Port: enter the port number. The default one is 443.
    3. Username: enter the username you defined when creating your cluster. You can find it in the SSH + Cluster login blade of your cluster.
  3. Enter the Job status polling configuration:
    1. Poll interval when retrieving Job status (in ms): enter the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job.
    2. Maximum number of consecutive statuses missing: enter the maximum number of times Talend Studio should retry getting the status when it receives no status response.
  4. Enter the configuration information for Windows Azure Storage:
    Primary storage: select from the drop-down list the type of storage where you want to deploy your Job:
      • ADLS Gen2
      • Azure Storage
    Authentication mode: select from the drop-down list the authentication type you want to use:
      • Azure Active Directory
      • Secret key
    Hostname: enter the Primary Blob Service Endpoint of your Azure Storage account. You can find this endpoint in the Properties blade of the storage account.
    Container: enter the name of the container to be used. You can find the available containers in the Blob blade of the Azure Storage account to be used.
    Directory ID: enter the directory ID.
    Application ID: enter the application ID.
    Client key: enter the client key.
    Deployment Blob: enter the location in which you want to store the current Job and its dependent libraries in the storage account.
  5. Enter the basic configuration information:
    Define the Hadoop home directory: if you need to launch your Spark Job from Windows, specify where the winutils.exe program to be used is stored:
      • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.
      • Otherwise, leave the Define the Hadoop home directory check box clear; Talend Studio generates one itself and automatically uses it for this Job.
    Use local timezone: select this check box to let Spark use the local time zone provided by the system.
      Note:
      • If you clear this check box, Spark uses the UTC time zone.
      • Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the time zone from the Spark configuration.
    Use dataset API in migrated components: select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
      • If you select the check box, the components inside the Spark Batch Job run with DS, which improves performance.
      • If you clear the check box, the components inside the Spark Batch Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
      This check box is selected by default. However, if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD.
      Important: If your Spark Batch Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Use timestamp for dataset components: select this check box to use java.sql.Timestamp for dates.
      Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date is used depending on the pattern.
  6. Select the Set tuning properties check box to define the tuning parameters, by following the process explained in Tuning Spark for Apache Spark Batch Jobs.
    Important: You must define the tuning parameters; otherwise, you could get an error (400 - Bad request).
  7. In the Spark "scratch" directory field, enter the local directory in which Talend Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, the directory is C:/tmp.
  8. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the Checkpoint directory field, enter the directory in which Spark stores, in the file system of the cluster, the context data of the computations such as the metadata and the generated RDDs of this computation.
  9. In the Advanced properties table, add any Spark properties you need in order to override the defaults used by Talend Studio.
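
The two Job status polling parameters from step 3 can be pictured with a small sketch. This is not Talend Studio's implementation. It is a minimal Python illustration that assumes the count of missing statuses resets whenever a response arrives and that polling gives up once the count exceeds the configured maximum. The names `poll_job_status` and `fetch_status` and the set of terminal states are hypothetical.

```python
import time
from typing import Callable, Optional


def poll_job_status(
    fetch_status: Callable[[], Optional[str]],
    poll_interval_ms: int,
    max_missing: int,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status arrives or too many polls return nothing.

    fetch_status is a hypothetical callable that returns a status string,
    or None when no status response was received.
    """
    terminal = {"success", "dead", "killed"}  # assumed terminal states
    missing = 0
    while True:
        status = fetch_status()
        if status is None:
            missing += 1
            if missing > max_missing:
                raise TimeoutError(
                    f"no status after {missing} consecutive attempts"
                )
        else:
            missing = 0  # assumption: any response resets the counter
            if status in terminal:
                return status
        # Wait for the poll interval (given in milliseconds) before asking again
        sleep(poll_interval_ms / 1000.0)
```

In this sketch, a short poll interval reports the final status sooner at the cost of more requests, while a higher maximum of missing statuses tolerates longer gaps in responses before the Job is considered unreachable.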

Results

The connection details are complete. You are ready to schedule executions of your Spark Job or to run it immediately.
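
To make the relationship between the storage parameters in step 4 of the procedure concrete, the sketch below combines Primary storage, Hostname, Container, and Deployment Blob into a Hadoop-style storage URI. It assumes the standard Hadoop `wasbs://` scheme for Azure Storage and `abfss://` scheme for ADLS Gen2. How Talend Studio builds the deployment location internally is not stated in this page, and `deployment_uri` and the sample account values are hypothetical.

```python
from urllib.parse import urlparse

# Assumed mapping from the Primary storage choice to a Hadoop URI scheme
SCHEMES = {"Azure Storage": "wasbs", "ADLS Gen2": "abfss"}


def deployment_uri(
    primary_storage: str, hostname: str, container: str, deployment_blob: str
) -> str:
    """Build a container-qualified URI for the Job deployment location."""
    scheme = SCHEMES[primary_storage]  # raises KeyError for an unknown type
    # Accept either a full endpoint URL or a bare host name
    host = urlparse(hostname).netloc or hostname.strip("/")
    path = deployment_blob.strip("/")
    return f"{scheme}://{container}@{host}/{path}"


# Hypothetical values, for illustration only
print(deployment_uri(
    "Azure Storage",
    "https://myaccount.blob.core.windows.net/",
    "jobs",
    "/talend/deploy",
))
```

The sketch only shows how the four values relate to one another. The authentication parameters (Directory ID, Application ID, Client key) are deliberately left out, because they are credentials rather than parts of the location.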