Defining the Cloudera connection parameters - 7.3

Spark Batch

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-02-21

Complete the Cloudera connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

If the Cloudera or Hortonworks version you need is not in the Version drop-down list, you can add your distribution through the dynamic distribution settings in the Studio.
  • In the version list of the distributions, some versions are labelled Builtin. Talend added these versions through the Dynamic distribution mechanism and delivered them with the Studio at release. They are certified by Talend, and therefore officially supported and ready to use.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

  1. Enter the Knox configuration information:
    Use Knox: if you are using Knox, set the following parameters in their corresponding fields:
    • Knox URL: enter the Knox URL in the following format: https://<host>/<datahub>/cdp-proxy-api. You can find the Knox URL on the Cloudera Management Console, in the Endpoints section of your Data Hub, under Livy Server.
      Important: The URL should not include /livy or any other suffix after cdp-proxy-api.
    • Knox user: enter your Workload User Name from Cloudera Management Console.
    • Knox password: enter your Workload Password from Cloudera Management Console.
    • Knox session timeout: specify the amount of time to wait for the Job to reconnect to the cluster via Knox.
    • Webhdfs directory: type in the location storing the loaded file in HDFS.
    • Poll interval when retrieving Job status (in ms): enter, without the quotation marks, the time interval (in milliseconds) at the end of which you want the Studio to ask Spark for the status of your Job. For example, this status could be Pending or Running.

      The default value is 30000, meaning 30 seconds.

    • Maximum number of consecutive statuses missing: enter the maximum number of times the Studio should retry getting a status when there is no status response.

      The default value is 10.

    These options are available for CDP 7.1 and onwards in YARN cluster mode for Spark Batch and Spark Streaming Jobs.
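Two of the Knox settings above can be sketched in a few lines of Python. This is only an illustration, not how the Studio is implemented: the function names, the get_status callable, the "Finished" status and the exact give-up rule (stop once the number of consecutive missing responses exceeds the maximum) are all assumptions. The defaults of 30000 ms and 10 come from the field descriptions above.

```python
import time
from urllib.parse import urlparse


def is_valid_knox_url(url):
    """Check the https://<host>/<datahub>/cdp-proxy-api shape, with no suffix."""
    parsed = urlparse(url)
    # "/dh/cdp-proxy-api" splits into ["", "dh", "cdp-proxy-api"]
    parts = parsed.path.split("/")
    return (
        parsed.scheme == "https"
        and bool(parsed.hostname)
        and len(parts) == 3
        and parts[1] != ""
        and parts[2] == "cdp-proxy-api"
    )


def wait_for_final_status(get_status, poll_interval_ms=30000, max_missing=10,
                          sleep=time.sleep):
    """Poll get_status() until it returns something other than Pending or Running.

    get_status is a hypothetical callable that returns a status string,
    or None when no status response was received.
    """
    consecutive_missing = 0
    while True:
        status = get_status()
        if status is None:
            consecutive_missing += 1
            if consecutive_missing > max_missing:
                raise TimeoutError("no status after %d retries" % max_missing)
        else:
            consecutive_missing = 0  # any response resets the counter
            if status not in ("Pending", "Running"):
                return status
        sleep(poll_interval_ms / 1000.0)
```

For example, is_valid_knox_url rejects a URL that ends in /cdp-proxy-api/livy, which matches the Important note above. Passing a no-op function as sleep makes the polling loop easy to try out without waiting 30 seconds between calls.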

  2. Enter the basic configuration information:
    Use local timezone: select this check box to let Spark use the local timezone provided by the system.
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear the check box in a component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components: select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backwards compatibility.
    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: By default, Jobs newly created in 7.3 use DS and Jobs imported from 7.3 or earlier use RDD. However, not all components have been migrated from RDD to DS, so it is recommended to clear the check box to avoid errors.
    Use timestamp for dataset components: select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
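To see why the Use local timezone setting matters, the following Python sketch renders one instant in UTC and in a local timezone. It does not use Spark or the Studio; the UTC+02:00 offset stands in for a hypothetical system timezone, and the date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# One fixed instant, expressed in UTC.
instant = datetime(2024, 2, 21, 23, 30, tzinfo=timezone.utc)

# Hypothetical system timezone of the machine: UTC+02:00.
local_zone = timezone(timedelta(hours=2))

as_utc = instant.astimezone(timezone.utc)
as_local = instant.astimezone(local_zone)

print(as_utc.isoformat())    # 2024-02-21T23:30:00+00:00
print(as_local.isoformat())  # 2024-02-22T01:30:00+02:00
```

Both values describe the same instant, but the calendar date differs, so date-based results can change depending on which timezone is in effect.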
  3. Select the type of the Spark cluster you need to connect to.

    Standalone

    The Studio connects to a Spark-enabled cluster to run the Job from this cluster.

    If you are using the Standalone mode, you need to set the following parameters:

    • In the Spark host field, enter the URI of the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the location of the Spark executable installed in the Hadoop cluster to be used.

    • If the Spark cluster cannot recognize the machine in which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file.
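A Spark standalone master is normally addressed with a URI of the form spark://<host>:<port>, with 7077 as the usual default port. Assuming the Spark host field expects that form, a value can be sanity-checked before it is entered. The function name and the example host below are hypothetical.

```python
from urllib.parse import urlparse


def parse_spark_master(uri):
    """Split a spark://host:port URI into (host, port), or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "spark" or not parsed.hostname or parsed.port is None:
        raise ValueError("expected spark://<host>:<port>, got %r" % uri)
    return parsed.hostname, parsed.port


host, port = parse_spark_master("spark://spark-master.example.com:7077")
print(host, port)
```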

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be performed and then sends the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.

    If you are using the Yarn client mode, you need to set the following parameters in their corresponding fields (if you leave the check box of a service clear, then at runtime, the configuration about this parameter in the Hadoop cluster to be used will be ignored):

    • In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.

    • Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.

    • Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

    • Select the Set staging directory check box and enter this directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing a Hadoop cluster running with Kerberos security, select this check box, then enter the Kerberos principal names for the ResourceManager service and the JobHistory service in the displayed fields. This enables you to use your username to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution, such as yarn-site.xml and mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field. This keytab file must be stored in the machine in which your Job actually runs, for example, on a Talend Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the username you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

    • The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login username for your distribution. If you leave it empty, the username of the machine hosting the Studio will be used.

    • If the Spark cluster cannot recognize the machine in which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file.
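The keytab requirement described above, that the user executing the Job can read the keytab file, can be checked on the machine where the Job runs. The sketch below is a minimal check under stated assumptions: the helper name and the guest.keytab file are hypothetical, and the demo creates an empty placeholder file rather than a real keytab.

```python
import os
import tempfile


def keytab_is_readable(keytab_path):
    """Return True if the file exists and the current user may read it."""
    return os.path.isfile(keytab_path) and os.access(keytab_path, os.R_OK)


# Demo with a throwaway placeholder; a real keytab comes from your Kerberos admin.
with tempfile.TemporaryDirectory() as demo_dir:
    keytab = os.path.join(demo_dir, "guest.keytab")
    before = keytab_is_readable(keytab)   # the file does not exist yet
    open(keytab, "wb").close()            # create the empty placeholder
    after = keytab_is_readable(keytab)    # now present and readable

print(before, after)
```

Run the check as the same operating system user that executes the Job (user1 in the example above), because the result depends on that user's file permissions.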

    Yarn cluster

    The Spark driver runs in your Yarn cluster to orchestrate how the Job should be performed.

    If you are using the Yarn cluster mode, you need to define the following parameters in their corresponding fields (if you leave the check box of a service clear, then at runtime, the configuration about this parameter in the Hadoop cluster to be used will be ignored):

    • In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.

    • Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.

    • Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

    • Select the Set staging directory check box and enter this directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • Set path to custom Hadoop configuration JAR: if you are using connections defined in Repository to connect to your Cloudera or Hortonworks cluster, you can select this check box in the Repository wizard and, in the field that is displayed, specify the path to the JAR file that provides the connection parameters of your Hadoop environment. Note that this file must be accessible from the machine where your Job is launched.

      This kind of Hadoop configuration JAR file is automatically generated when you build a Big Data Job from the Studio. This JAR file is by default named with this pattern:
      hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
      You can also download this JAR file from the web console of your cluster or simply create a JAR file yourself by putting the configuration files in the root of your JAR file. For example:
      hdfs-site.xml
      core-site.xml

      The parameters from your custom JAR file override the parameters you put in the Spark configuration field. They also override the configuration you set in configuration components such as tHDFSConfiguration or tHBaseConfiguration when the related storage system, such as HDFS, HBase or Hive, is native to Hadoop. But they do not override the configuration set in the configuration components for third-party storage systems, such as tAzureFSConfiguration.

    • If you are accessing a Hadoop cluster running with Kerberos security, select this check box, then enter the Kerberos principal names for the ResourceManager service and the JobHistory service in the displayed fields. This enables you to use your username to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution, such as yarn-site.xml and mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field. This keytab file must be stored in the machine in which your Job actually runs, for example, on a Talend Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the username you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

    • The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login username for your distribution. If you leave it empty, the username of the machine hosting the Studio will be used.

    • Select the Wait for the Job to complete check box to make your Studio or, if you use Talend Jobserver, your Job JVM keep monitoring the Job until the execution of the Job is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to keep this check box clear when running a Spark Streaming Job.

    Ensure that the username in the Yarn client mode is the same one you put in tHDFSConfiguration, the component used to provide HDFS connection information to Spark.
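The custom Hadoop configuration JAR mentioned for the Yarn cluster mode can be created by hand, because a JAR file is a ZIP archive. The following Python sketch places the configuration files at the root of the archive, as described above. The function name, the output file name and the placeholder XML content are assumptions for illustration; in practice you would pass the real files from your cluster.

```python
import os
import tempfile
import zipfile


def build_conf_jar(jar_path, config_files):
    """Write each configuration file at the root of a new JAR (ZIP) archive."""
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        for path in config_files:
            # arcname drops the directories so the file sits at the JAR root
            jar.write(path, arcname=os.path.basename(path))
    return jar_path


# Demo with placeholder files; real ones come from your distribution.
with tempfile.TemporaryDirectory() as work_dir:
    sources = []
    for name in ("hdfs-site.xml", "core-site.xml"):
        source = os.path.join(work_dir, name)
        with open(source, "w") as handle:
            handle.write("<configuration/>\n")
        sources.append(source)

    jar_file = build_conf_jar(os.path.join(work_dir, "hadoop-conf-demo.jar"), sources)
    with zipfile.ZipFile(jar_file) as jar:
        entries = sorted(jar.namelist())

print(entries)  # ['core-site.xml', 'hdfs-site.xml']
```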

  4. With the Yarn client mode, the Property type list is displayed to allow you to select an established Hadoop connection from the Repository, on the condition that you have created this connection in the Repository. Then the Studio will reuse that set of connection information for this Job.
  5. If you need to launch from Windows, it is recommended to specify where the winutils.exe program to be used is stored.
    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.

    • Otherwise, leave this check box clear. The Studio then generates a winutils.exe file itself and automatically uses it for this Job.

  6. In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the JAR files to be transferred. If you launch the Job on Windows, the default disk is C:. As a result, if you leave /tmp in this field, this directory is C:/tmp.
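The way /tmp maps to C:/tmp on Windows can be reproduced with Python's pathlib, which applies the Windows rule that a path without a drive letter keeps the current drive. This only illustrates the path resolution and is not the Studio's own code; the function name and the D:/spark-tmp example are hypothetical.

```python
from pathlib import PureWindowsPath


def resolve_scratch_dir(scratch_dir, default_drive="C:/"):
    """Resolve a scratch directory against the default Windows drive."""
    # A rooted path without a drive ("/tmp") keeps the default drive;
    # a path with its own drive ("D:/...") replaces it entirely.
    return PureWindowsPath(default_drive, scratch_dir).as_posix()


print(resolve_scratch_dir("/tmp"))          # C:/tmp
print(resolve_scratch_dir("D:/spark-tmp"))  # D:/spark-tmp
```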

Results