Configuring your Job to run on the Hadoop cluster - Cloud - 8.0

This section explains how to configure your Job to run directly on the Hadoop cluster.

Procedure

  1. Click Spark Configuration on the Run tab.
  2. Add the required Advanced properties; a typical entry is sketched after this procedure.
    The value is specific to the distribution and version of Hadoop. This tutorial uses Hortonworks 2.4 V3, whose full version string is 2.4.0.0-169. If you use a different distribution or version, your value will differ.
    Note: When running the Job on the cluster, make sure that the two systems can communicate freely. In this example, the Hortonworks cluster must be able to reach your instance of Talend Studio, because Spark, even though it runs on the cluster, still needs to reference the Spark drivers shipped with Talend. Likewise, if you deploy a Spark Job to a production environment, it runs from a Talend JobServer (edge node), which also needs unrestricted communication with the cluster.

    For more information on the ports needed by each service, see the Spark Security documentation.

  3. Select the Advanced settings tab and add a new JVM argument that indicates the version of Hadoop, as shown in the example after this procedure.
    Its value is the same version string you used in the previous step.
  4. Select the Basic Run tab, then click Run.
    When the Job completes, a message indicating success is displayed.
  5. Navigate to the HDFS directory (using Ambari in this example) to verify that the model was created and persisted to HDFS. A command-line alternative is shown after this procedure.
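
For step 2, a typical Advanced property on a Hortonworks cluster passes the hdp.version system property to the Spark application master. The property name and value below are an assumption based on a common HDP configuration, not a value taken from this tutorial; confirm the correct entry for your cluster:

    Property: spark.yarn.am.extraJavaOptions
    Value:    -Dhdp.version=2.4.0.0-169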
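
For step 3, the JVM argument typically carries the same version string that you used as the property value, for example:

    -Dhdp.version=2.4.0.0-169

Here 2.4.0.0-169 corresponds to Hortonworks 2.4 V3; substitute the version string of your own distribution.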
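
If you prefer the command line to Ambari for step 5, you can list the output directory from any cluster node with the standard HDFS shell. The path below is a placeholder; substitute the directory your Job writes the model to:

    hdfs dfs -ls /user/talend/models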