Configuring your Job to run on the Hadoop cluster - 7.3

This section explains how to configure your Job to run directly on the Hadoop cluster.

Procedure

  1. In the Run view, click the Spark Configuration tab.
  2. Add the required Advanced properties (see the hedged example after this procedure).
    The value is specific to the distribution and version of Hadoop. This tutorial uses Hortonworks 2.4 V3, whose version string is 2.4.0.0-169. Your entry for this parameter will differ if you do not use Hortonworks 2.4 V3.
    Note: When running the code on the cluster, it is crucial that the two systems can communicate without restriction. In this example, the Hortonworks cluster must be able to reach your instance of Talend Studio, because Spark, even though it runs on the cluster, still needs to reference the Spark drivers shipped with Talend. Likewise, if you deploy a Spark Job to a production environment, it runs from a Talend JobServer (edge node), and that server and the cluster must also be able to communicate without restriction.

    For more information on the ports needed by each service, see the Spark Security documentation.

  3. Click the Advanced settings tab and add a new JVM argument that indicates the Hadoop version. Its value is the same version string you entered in the previous step (see the example after this procedure).
  4. Click the Basic Run tab, then click Run.
    When the Job completes, a message indicates that it ran successfully.
  5. Navigate to the HDFS directory (through Ambari in this case) to verify that the model was created and persisted to HDFS. You can also check from the command line, as in the second example below.
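
Example for steps 2 and 3: on a Hortonworks cluster, the advanced property typically used to pass the platform version to Spark is spark.yarn.am.extraJavaOptions, paired with a matching -Dhdp.version JVM argument in the Studio Advanced settings. The property name is an assumption based on common Hortonworks setups, and the value shown applies only to Hortonworks 2.4 V3; adjust both for your own distribution and version.

  Spark Configuration > Advanced properties (assumed property name):
    Property: "spark.yarn.am.extraJavaOptions"
    Value:    "-Dhdp.version=2.4.0.0-169"

  Run > Advanced settings > JVM argument (same version string):
    -Dhdp.version=2.4.0.0-169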
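
Example for step 5: if you prefer the command line to the Ambari file view, you can list the output directory with the standard HDFS shell. The path below is only a placeholder; replace it with the directory your Job writes the model to.

  hdfs dfs -ls /user/talend/models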