Defining the HD Insight connection parameters - 7.1

Spark Streaming

author: Talend Documentation Team
EnrichVersion: 7.1
EnrichProdName: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Open Studio for Big Data, Talend Real-Time Big Data Platform
task: Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
EnrichPlatform: Talend Studio

Complete the HD Insight connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of cluster.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It does not apply to Talend Open Studio for Big Data users.

Procedure

  1. Enter the basic connection information for Microsoft HD Insight:

    Livy configuration

    • The Hostname of Livy is the URL of your HDInsight cluster. This URL can be found in the Overview blade of your cluster. Enter this URL without the https:// part.
    • The default Port is 443.
    • The Username is the one defined when creating your cluster. You can find it in the SSH + Cluster login blade of your cluster.
    For further information about the Livy service used by HD Insight, see Submit Spark jobs using Livy.
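
The Livy fields above can be illustrated with a minimal Python sketch. It strips the https:// part from a cluster URL, as the Hostname field expects, and builds a Livy batch endpoint from the hostname and the default port 443. The cluster name mycluster, the function names, and the /livy/batches path are assumptions made for illustration; the Studio itself only asks for the hostname, port and username.

```python
from urllib.parse import urlsplit


def livy_hostname(cluster_url: str) -> str:
    """Return the cluster URL without its scheme, as the Hostname field expects."""
    # urlsplit only recognizes the host when a scheme or "//" is present,
    # so prepend "//" when the value was pasted without one.
    if "://" not in cluster_url:
        cluster_url = "//" + cluster_url
    return urlsplit(cluster_url).hostname or ""


def livy_batches_url(hostname: str, port: int = 443) -> str:
    """Build a Livy batch endpoint URL (the /livy/batches path is an assumption)."""
    return f"https://{hostname}:{port}/livy/batches"


if __name__ == "__main__":
    # "mycluster" is a hypothetical cluster name.
    host = livy_hostname("https://mycluster.azurehdinsight.net/")
    print(host)
    print(livy_batches_url(host))
```

The value returned by livy_hostname is what you would type in the Hostname field; the full URL is shown only to make clear why the scheme is left out of that field.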

    HDInsight configuration

    • The Username is the one defined when creating your cluster. You can find it in the SSH + Cluster login blade of your cluster.
    • The Password is defined when creating your HDInsight cluster for authentication to this cluster.

    Windows Azure Storage configuration

    Enter the address and the authentication information of the Azure Storage account to be used. In this configuration, you define only where to deploy your Job, not where to read or write your business data. Therefore, always use the Azure Storage system for this configuration.

    In the Container field, enter the name of the container to be used. You can find the available containers in the Blob blade of the Azure Storage account to be used.

    In the Deployment Blob field, enter the location in which you want to store the current Job and its dependent libraries in this Azure Storage account.

    In the Hostname field, enter the Primary Blob Service Endpoint of your Azure Storage account without the https:// part. You can find this endpoint in the Properties blade of this storage account.

    In the Username field, enter the name of the Azure Storage account to be used.

    In the Password field, enter the access key of the Azure Storage account to be used. This key can be found in the Access keys blade of this storage account.
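
As a rough illustration of how the five Azure Storage fields relate, the following Python sketch groups them in one structure and removes the https:// part from a pasted endpoint. The class name, the sample values, and the deployment_uri helper (which composes a wasbs:// URI) are hypothetical; this documentation does not state how the Studio combines these fields internally.

```python
from dataclasses import dataclass


@dataclass
class AzureStorageConfig:
    hostname: str         # Primary Blob Service Endpoint, without https://
    container: str        # name of the container to be used
    deployment_blob: str  # location for the Job and its dependent libraries
    username: str         # name of the Azure Storage account
    password: str         # access key of the Azure Storage account

    def __post_init__(self) -> None:
        # Accept a pasted endpoint such as "https://acct.blob.core.windows.net/"
        # and reduce it to the bare host name.
        host = self.hostname.strip()
        for scheme in ("https://", "http://"):
            if host.lower().startswith(scheme):
                host = host[len(scheme):]
        self.hostname = host.rstrip("/")

    def deployment_uri(self) -> str:
        """Hypothetical helper: a wasbs:// URI for the deployment location."""
        path = self.deployment_blob.strip("/")
        return f"wasbs://{self.container}@{self.hostname}/{path}"


if __name__ == "__main__":
    # All values below are placeholders, not real account details.
    cfg = AzureStorageConfig(
        hostname="https://myaccount.blob.core.windows.net/",
        container="jobs",
        deployment_blob="/talend/deploy",
        username="myaccount",
        password="<access-key>",
    )
    print(cfg.hostname)
    print(cfg.deployment_uri())
```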

  2. In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the JAR files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, the directory is C:/tmp.
  3. Select the Wait for the Job to complete check box to make your Studio or, if you use Talend Jobserver, your Job JVM keep monitoring the Job until its execution is over. Selecting this check box sets the spark.yarn.submit.waitAppCompletion property to true. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to leave it clear when running a Spark Streaming Job, since a Streaming Job typically runs until it is explicitly stopped.
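
The last two steps can be sketched in Python as follows. The first function shows how a scratch directory of /tmp resolves against a Windows drive, and the second shows the Spark property that corresponds to the check box. The function names and the drive parameter are hypothetical; only the spark.yarn.submit.waitAppCompletion property name and the C:/tmp example come from this procedure.

```python
from pathlib import PureWindowsPath


def windows_scratch_dir(scratch: str = "/tmp", drive: str = "C:/") -> str:
    """Resolve a drive-less scratch directory against a Windows drive."""
    # Joining a rooted path that has no drive keeps the drive of the first part.
    return PureWindowsPath(drive, scratch).as_posix()


def wait_conf(wait_for_completion: bool) -> dict:
    """Return the Spark property that mirrors the check box state."""
    return {
        "spark.yarn.submit.waitAppCompletion": str(wait_for_completion).lower()
    }


if __name__ == "__main__":
    print(windows_scratch_dir())   # scratch directory when launched on Windows
    print(wait_conf(True))         # check box selected (typical for Spark Batch)
    print(wait_conf(False))        # check box clear (typical for Spark Streaming)
```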

Results