Defining Cloudera Data Engineering connection parameters with Spark Universal - Cloud - 8.0

Talend Data Fabric Studio User Guide

About this task

Talend Studio connects to a Cloudera Data Engineering (CDE) service to run your Spark Job on this cluster.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It is not applicable to Talend Open Studio for Big Data users.

Complete the Spark Universal connection configuration with Kubernetes either on Spark 3.1.x or Spark 3.2.x in the Spark configuration tab of the Run view of your Spark Job. This configuration is effective on a per-Job basis.

Procedure

  1. Click the Run view beneath the design workspace, then click the Spark configuration view.
  2. Select Built-in from the Property type drop-down list.
    If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can easily reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to be used.
    Tip: Setting up the connection in the Repository allows you to avoid configuring that connection each time you need it in the Spark configuration view of your Jobs. The fields are automatically filled.
  3. Select Universal from the Distribution drop-down list, Spark 3.1.x or Spark 3.2.x from the Version drop-down list, and Cloudera Data Engineering from the Runtime mode/environment drop-down list.
  4. If you need to launch your Spark Job from Windows, specify where the winutils.exe program to be used is stored:
    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.

    • Otherwise, leave the Define the Hadoop home directory check box clear. The Studio then generates a winutils.exe file itself and automatically uses it for this Job.
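
Before you enter a directory in the Define the Hadoop home directory field, you may want to confirm that it really contains winutils.exe. The following is a minimal sketch, not part of the Studio: the function name find_winutils is hypothetical, and the extra lookup in a bin subfolder is an assumption based on the common Hadoop layout.

```python
from pathlib import Path
from typing import Optional


def find_winutils(hadoop_home: str) -> Optional[Path]:
    """Return the path of winutils.exe found in hadoop_home, or None.

    Looks in the directory itself and, as an assumption about the usual
    Hadoop layout, in its bin subfolder.
    """
    root = Path(hadoop_home)
    for candidate in (root / "winutils.exe", root / "bin" / "winutils.exe"):
        if candidate.is_file():
            return candidate
    return None
```

If the function returns None for your directory, leaving the check box clear lets the Studio generate the file instead.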

  5. Enter the basic Configuration information:
    Use local timezone: select this check box to let Spark use the local timezone provided by the system.
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components: select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backwards compatibility.
    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: Jobs newly created in 7.3 or later use DS by default, and Jobs imported from 7.3 or earlier use RDD by default. However, not all components have been migrated from RDD to DS, so it is recommended to clear the check box to avoid errors.
    Use timestamp for dataset components: select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
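
To illustrate what the Use local timezone check box changes, the sketch below renders the same instant once in UTC and once in a local zone. It only demonstrates the concept in plain Python and is not how Spark or the Studio implements the option. The UTC+02:00 offset and the function name render are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone of the machine, assumed to be UTC+02:00.
LOCAL_ZONE = timezone(timedelta(hours=2), "UTC+02:00")


def render(moment: datetime, use_local_timezone: bool) -> str:
    """Format an instant in the local zone or in UTC, depending on the flag."""
    zone = LOCAL_ZONE if use_local_timezone else timezone.utc
    return moment.astimezone(zone).strftime("%Y-%m-%d %H:%M:%S")


instant = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(render(instant, use_local_timezone=False))  # 2024-01-15 12:00:00
print(render(instant, use_local_timezone=True))   # 2024-01-15 14:00:00
```

The instant itself is identical in both cases; only its textual representation shifts by the zone offset.
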
  6. Complete the CDE configuration parameters:
    CDE API endpoint: enter the CDE API endpoint. You can find the URL in the JOBS API URL link.
    CDE API token: enter the CDE token you use for API authentication. The URL used to obtain the token must respect the following format: [BASE_URL]/gateway/authtkn. For more information, see CDE API access token in the Cloudera documentation.

    This property is available only when the Auto generate token check box is cleared.

    Auto generate token: select this check box to create a new token before a Job is submitted.
    • CDE token endpoint: enter the CDE token endpoint you want to use.
    • Workload user: enter the CDP workload user you want to use to generate a new token. For more information, see CDP workload user in the Cloudera documentation.
    • Workload password: enter the password associated with the workload user.
    Enable client debugging: select this check box to enable debug logging for the CDE API client.
    Override dependencies: select this check box to override the dependencies needed for Spark.
    Job status/logs polling interval (in ms): enter the time interval (in milliseconds) at the end of which you want the Studio to ask Spark for the status of your Job.
    Fetch driver logs: select this check box to fetch the driver logs at runtime. You can choose to fetch the following information by selecting the corresponding check box:
    • Standard output
    • Error output
    Advanced parameters: select this check box to enter the following CDE API advanced parameters:
    • Number of executors: enter the number of executors.
    • Driver cores: enter the number of driver cores.
    • Driver memory: enter the allocation size of memory to be used by the driver.
    • Executor cores: enter the number of executor cores.
    • Executor memory: enter the allocation size of memory to be used by each executor.
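
Two of the parameters above lend themselves to a short illustration: the [BASE_URL]/gateway/authtkn format and the polling interval expressed in milliseconds. The sketch below is a hedged example, not the Studio's code. The function names, the status strings in TERMINAL_STATES, and the example host are all hypothetical, and fetch_status stands in for whatever call actually asks the service for the Job status.

```python
import time
from typing import Callable

# Hypothetical status names; the real service may report different values.
TERMINAL_STATES = {"succeeded", "failed", "killed"}


def token_endpoint(base_url: str) -> str:
    """Build a URL of the form [BASE_URL]/gateway/authtkn."""
    return base_url.rstrip("/") + "/gateway/authtkn"


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_ms: int,
    max_polls: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Ask for the status every interval_ms milliseconds until it is terminal."""
    status = fetch_status()
    polls = 1
    while status not in TERMINAL_STATES and polls < max_polls:
        sleep(interval_ms / 1000.0)  # the field is in ms, sleep() takes seconds
        status = fetch_status()
        polls += 1
    return status


# Example host only; replace it with your own base URL.
print(token_endpoint("https://cde.example.com/"))
```

A shorter interval gives fresher status and logs at the cost of more requests to the service.
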
  7. In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, the directory used is C:/tmp.
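
The Windows behavior described in this step, where /tmp ends up as C:/tmp, can be reproduced with Python's pathlib. This is only a sketch of the path resolution rule; the function name resolve_scratch_dir is hypothetical and the Studio may resolve the path differently internally.

```python
from pathlib import PureWindowsPath


def resolve_scratch_dir(field_value: str, default_drive: str = "C:") -> str:
    """Resolve a scratch directory against the default Windows disk.

    A value without a drive letter, such as /tmp, is placed on default_drive.
    A value that already names a drive is kept as it is.
    """
    resolved = PureWindowsPath(default_drive + "/") / field_value
    return resolved.as_posix()


print(resolve_scratch_dir("/tmp"))     # C:/tmp
print(resolve_scratch_dir("D:/work"))  # D:/work
```
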
  8. Select the Activate checkpointing check box to enable the Spark checkpointing operation.
  9. In the Advanced properties table, add any Spark properties you need to use to override their default counterparts used by the Studio.
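
The override behavior of the Advanced properties table can be pictured as a dictionary merge in which your entries win over the defaults. The sketch below uses real Spark property names, but the default values attributed to the Studio are hypothetical, as is the function name effective_properties.

```python
from typing import Dict


def effective_properties(
    defaults: Dict[str, str], overrides: Dict[str, str]
) -> Dict[str, str]:
    """Return the defaults with every overriding property applied on top."""
    merged = dict(defaults)   # copy so the defaults stay untouched
    merged.update(overrides)  # entries from the table take precedence
    return merged


# Hypothetical defaults; the values actually used by the Studio may differ.
studio_defaults = {"spark.executor.memory": "1g", "spark.executor.cores": "1"}
advanced_properties = {
    "spark.executor.memory": "4g",
    "spark.sql.shuffle.partitions": "64",
}

print(effective_properties(studio_defaults, advanced_properties))
```

Properties that you do not list keep their default values, and properties that have no default are simply added.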

Results

The connection details are complete. You are ready to schedule executions of your Job or to run it immediately.