Defining Dataproc connection parameters with Spark Universal - Cloud - 8.0

Talend Data Fabric Studio User Guide

About this task

Talend Studio connects to a Dataproc cluster to run the Job from this cluster. Talend Studio is compatible with Dataproc version 2.0.x.

Complete the Spark Universal connection configuration with Dataproc on Spark 3.1.x in the Spark configuration tab of the Run view of your Spark Job. This configuration is effective on a per-Job basis.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It is not applicable to Talend Open Studio for Big Data users.

Procedure

  1. Click the Run view beneath the design workspace, then click the Spark configuration view.
  2. Select Built-in from the Property type drop-down list.
    If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to be used.
    Tip: Setting up the connection in the Repository allows you to avoid configuring that connection each time you need it in the Spark configuration view of your Jobs. The fields are automatically filled.
  3. Select Universal from the Distribution drop-down list, Spark 3.1.x from the Version drop-down list and Dataproc from the Runtime mode/environment drop-down list.
  4. Enter the basic configuration information:
    Use local timezone: Select this check box to let Spark use the local timezone provided by the system.
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear the check box in the component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: By default, Jobs newly created in 7.3 or later use DS, and Jobs imported from 7.3 or earlier use RDD. However, not all components have been migrated from RDD to DS, so clearing the check box is the recommended default to avoid errors.
    Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used, depending on the pattern.
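
To make the effect of the Use local timezone option concrete, the following minimal Python sketch renders one instant in UTC and in a local zone. It is an illustration only, not code generated by Talend Studio: the instant and the fixed UTC+02:00 offset are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical instant close to midnight, expressed in UTC.
instant = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)

# Hypothetical "local" zone: a fixed UTC+02:00 offset, for illustration only.
local_zone = timezone(timedelta(hours=2))

# Render the same instant the way a UTC-based and a local-time-based run would see it.
utc_text = instant.strftime("%Y-%m-%d %H:%M")
local_text = instant.astimezone(local_zone).strftime("%Y-%m-%d %H:%M")

print("UTC:  ", utc_text)
print("Local:", local_text)
```

Near midnight, the two renderings fall on different calendar dates, which is why the setting matters for Jobs that process dates.
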
  5. Complete the Dataproc parameters:
    Project ID: Enter the ID of your Google Cloud Platform project.
    Cluster ID: Enter the ID of the Dataproc cluster to be used.
    Region: Enter the name of the Google Cloud region to be used.
    Google Storage staging bucket: Because a Talend Job needs its dependent jar files for execution, specify the Google Storage directory to which these jar files are transferred so that your Job can access them at execution time.
    Provide Google Credentials: Leave this check box clear when you launch your Job from a machine on which the Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform. In this situation, this machine is often your local machine.
    Credential type: Select the mode to be used to authenticate to your project:
    • Service account: authenticate using a Google account that is associated with your Google Cloud Platform project. When you select this mode, the parameter to be defined is Path to Google Credentials file.
    • OAuth2 Access Token: authenticate using OAuth credentials. When you select this mode, the parameter to be defined is OAuth2 Access Token.
    Service account: Enter the path to the credentials file associated with the user account to be used. This file must be stored on the machine on which your Talend Job is actually launched and executed.
    OAuth2 Access Token: Enter an access token.
    Important: The token is only valid for one hour. Talend Studio does not refresh the token, so you must generate a new one once the hour has elapsed.

    You can generate an OAuth Access Token on Google Developers OAuth Playground by going to BigQuery API v2 and choosing all the needed permissions (bigquery, devstorage.full_control, and cloud-platform).
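
Because Talend Studio does not refresh the token, it can help to check the age of a token before launching a Job. The sketch below is a hypothetical helper, not part of Talend Studio or of any Google library: the function name token_is_expired, the issued_at timestamp (which you would record yourself when generating the token), and the five-minute safety margin are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# One-hour validity, as stated in the Important note above.
TOKEN_LIFETIME = timedelta(hours=1)


def token_is_expired(issued_at, now=None, safety_margin=timedelta(minutes=5)):
    """Return True if the token should be regenerated before running the Job.

    issued_at: timezone-aware datetime recorded when the token was generated.
    now: timezone-aware datetime to compare against (defaults to the current time).
    safety_margin: assumed buffer so a Job is not started with an almost-expired token.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= issued_at + TOKEN_LIFETIME - safety_margin


# Usage with a hypothetical issue time.
issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(token_is_expired(issued, now=issued + timedelta(minutes=30)))
print(token_is_expired(issued, now=issued + timedelta(minutes=56)))
```

The margin is a design choice: a Job that starts just before the hour elapses could otherwise lose access while it is still running.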

  6. In the Spark "scratch" directory field, enter the local directory in which Talend Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, the directory is C:/tmp.
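
The Windows behavior described in this step can be reproduced with Python's pathlib, which applies the same drive rule when a rooted path without a drive letter is combined with a drive. This is a sketch for illustration only: the function name resolve_on_windows is hypothetical, and C: is taken as the default drive as in the step above.

```python
from pathlib import PureWindowsPath


def resolve_on_windows(scratch_dir, default_drive="C:/"):
    """Show where a scratch directory value ends up under Windows path rules.

    A rooted path without a drive letter (such as /tmp) keeps the default
    drive; a path that names its own drive replaces it.
    """
    return PureWindowsPath(default_drive, scratch_dir)


# PureWindowsPath performs no file system access, so this runs on any OS.
print(resolve_on_windows("/tmp").as_posix())
print(resolve_on_windows("D:/work/tmp").as_posix())
```

To keep temporary files off the C: drive, enter a path that includes the drive letter explicitly.
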
  7. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the Checkpoint directory field, enter the directory in the file system of the cluster in which Spark stores the context data of the computations, such as the metadata and the generated RDDs.
  8. In the Advanced properties table, add any Spark properties you need in order to override the defaults used by Talend Studio.
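
The override behavior can be pictured as a dictionary merge in which entries from the Advanced properties table win over the defaults. In this sketch the default values are hypothetical, since this section does not list the values Talend Studio applies. spark.executor.memory and spark.sql.shuffle.partitions are standard Spark property names used only as examples.

```python
# Hypothetical defaults; the values Talend Studio actually applies are not
# listed in this section.
studio_defaults = {
    "spark.executor.memory": "1g",
    "spark.sql.shuffle.partitions": "200",
}

# Rows as they might be entered in the Advanced properties table (example values).
advanced_properties = {
    "spark.executor.memory": "4g",
}

# Later entries win, so a property set in the table replaces the default,
# while properties that are not overridden keep their default value.
effective = {**studio_defaults, **advanced_properties}

for name, value in sorted(effective.items()):
    print(f"{name}={value}")
```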

Results

The connection details are complete. You are ready to schedule executions of your Spark Job or to run it immediately.