Defining the Dataproc connection parameters - 7.3

Spark Streaming

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
Last publication date: 2024-02-21

Complete the Google Dataproc connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of cluster.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

  1. Enter the basic connection information for Dataproc:

    Use local timezone

    Select this check box to let Spark use the local timezone provided by the system.

    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear the check box in the component, the component inherits the timezone from the Spark configuration.

    Use dataset API in migrated components

    Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.

    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.

    Note: By default, Jobs newly created in 7.3 use DS, and Jobs imported from 7.3 or earlier use RDD. However, not all components have been migrated from RDD to DS, so it is recommended to clear this check box to avoid errors.

    Use timestamp for dataset components

    Select this check box to use java.sql.Timestamp for dates.

    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
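
The difference between the local timezone and UTC, as chosen with the Use local timezone check box, can be pictured with a small Python sketch. It is not Talend or Spark code. It only shows that one instant produces different wall-clock values depending on the zone. The fixed +02:00 offset and the name render are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone for illustration: a fixed UTC+02:00 offset.
LOCAL_ZONE = timezone(timedelta(hours=2))


def render(epoch_seconds: int, use_local_timezone: bool) -> str:
    """Format an instant in the assumed local zone or in UTC."""
    zone = LOCAL_ZONE if use_local_timezone else timezone.utc
    moment = datetime.fromtimestamp(epoch_seconds, tz=zone)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


instant = 1_700_000_000  # a single point in time, expressed as epoch seconds
utc_text = render(instant, use_local_timezone=False)
local_text = render(instant, use_local_timezone=True)
print(utc_text)    # 2023-11-14 22:13:20
print(local_text)  # 2023-11-15 00:13:20
```

In this example the same instant falls on two different calendar days, which is why the timezone choice matters for date columns.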

    Project identifier

    Enter the ID of your Google Cloud Platform project.

    If you are not certain about your project ID, check it on the Manage Resources page of your Google Cloud Platform services.

    Cluster identifier

    Enter the ID of your Dataproc cluster to be used.

    Region

    From this drop-down list, select the Google Cloud region to be used.

    Google Storage staging bucket

    A Talend Job needs its dependent jar files at execution. Specify the Google Storage directory to which these jar files are transferred so that your Job can access them when it runs.

    The directory must end with a slash (/). If the directory does not exist, it is created on the fly, but the bucket to be used must already exist.
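
The trailing-slash rule for the staging directory can be checked before you paste the value into the Studio. The following Python sketch is only an illustration: it assumes the directory is written as a gs:// location, and the function name check_staging_directory and the bucket name are hypothetical. It does not verify that the bucket exists.

```python
def check_staging_directory(path: str) -> str:
    """Return the path unchanged if it looks valid, otherwise raise ValueError.

    Assumption: the staging directory is written as gs://<bucket>/<directory>/.
    """
    prefix = "gs://"
    if not path.startswith(prefix):
        raise ValueError("expected a location starting with gs://")
    # The bucket name is everything between the prefix and the first slash.
    bucket = path[len(prefix):].partition("/")[0]
    if not bucket:
        raise ValueError("the bucket name is missing")
    if not path.endswith("/"):
        raise ValueError("the directory must end with a slash (/)")
    return path


# Hypothetical bucket and directory names.
staging = check_staging_directory("gs://my-bucket/talend/jars/")
```

A value such as gs://my-bucket/talend/jars (no trailing slash) raises ValueError in this sketch.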

  2. Provide the authentication information to your Google Dataproc cluster:

    Provide Google Credentials in file

    Leave this check box clear when you launch your Job from a machine on which Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform. This machine is often your local machine.

    When you launch your Job from a remote machine, such as a Jobserver, select this check box and, in the Path to Google Credentials file field that is displayed, enter the directory in which this JSON file is stored on the Jobserver machine. You can also click the [...] button and then browse for the JSON file in the pop-up dialog box.

    For further information about this Google Credentials file, contact the administrator of your Google Cloud Platform or see the Google Cloud Platform Auth Guide.
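
Before pointing the Studio at a credentials file, it can help to confirm that the JSON looks like a service account key. The Python sketch below is an assumption-laden illustration, not part of Talend: the function name is hypothetical, and the listed keys are those commonly present in service account key files downloaded from Google Cloud. Confirm the exact file contents with your administrator.

```python
import json

# Keys commonly present in a Google service account key file (assumed subset).
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def missing_service_account_keys(credentials_json: str) -> list:
    """Return the expected keys that are absent from the JSON text."""
    data = json.loads(credentials_json)
    return [key for key in REQUIRED_KEYS if key not in data]


# Placeholder content only; a real key file holds real values.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "<redacted>",
    "client_email": "job-runner@my-project.iam.gserviceaccount.com",
})
missing = missing_service_account_keys(sample)
print(missing)  # []
```

To check a file on disk, read its text first, for example with open(path, encoding="utf-8"), and pass the content to the function.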

    Credential type

    Select the mode to be used to authenticate to your project:
    • Service Account: authenticate using a Google account that is associated with your Google Cloud Platform project. When you select this mode, the parameters to be defined in the Basic settings view are Path to Google Credentials file and, optionally, Use P12 credentials file format and Service Account Id.
    • OAuth2 Access Token: authenticate the access using OAuth credentials. When you select this mode, the parameter to be defined in the Basic settings view is OAuth2 Access Token.

    This field is only available for the Dataproc 1.4 distribution.

    OAuth2 Access Token

    Enter an access token.

    Important: The token is only valid for one hour. Talend Studio does not refresh the token, so you must generate a new one once the one-hour limit has passed.

    You can generate an OAuth Access Token on the Google Developers OAuth Playground by going to BigQuery API v2 and choosing all the permissions you need.

    This field is only available when you select OAuth2 Access Token from the Credential type drop-down list.

    This field is only available for the Dataproc 1.4 distribution.
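
Because the Studio does not refresh the token, you need to track the one-hour validity yourself. The Python sketch below shows one way to decide when a new token is due. The function name and the five-minute safety margin are assumptions made for illustration. Only the one-hour lifetime comes from the text above.

```python
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(hours=1)  # validity period stated above


def token_needs_renewal(issued_at: datetime, now: datetime,
                        margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True once the token is expired or within the safety margin."""
    return now >= issued_at + TOKEN_LIFETIME - margin


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(token_needs_renewal(issued, issued + timedelta(minutes=30)))  # False
print(token_needs_renewal(issued, issued + timedelta(minutes=56)))  # True
```

A Job that is expected to run longer than the remaining validity should be started with a freshly generated token.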

    Use P12 credentials file format

    When the Google credentials file to be used is in P12 format, select this check box and then in the Service Account Id field that is displayed, enter the ID of the service account for which this P12 credentials file has been created.

    This field is only available when you select Service Account from the Credential type drop-down list.

    This field is only available for the Dataproc 1.4 distribution.

  3. With the Yarn client mode, the Property type list is displayed so that you can select an established Hadoop connection from the Repository, provided that you have already created this connection in the Repository. The Studio then reuses that set of connection information for this Job.
  4. In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. As a result, if you leave /tmp in this field, this directory is C:/tmp.
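
The Windows behavior described for the scratch directory can be sketched in Python as follows. This is an illustration of the rule stated above, not the Studio's actual implementation. The function name resolve_scratch_dir is hypothetical, and the default drive C: is taken from the text.

```python
import ntpath


def resolve_scratch_dir(path: str, default_drive: str = "C:") -> str:
    """Prefix the default drive when a Windows path has no drive letter."""
    drive, rest = ntpath.splitdrive(path)
    if drive:
        # The path already names a drive, so it is kept as entered.
        return path
    return default_drive + rest


print(resolve_scratch_dir("/tmp"))          # C:/tmp
print(resolve_scratch_dir("D:/spark/tmp"))  # D:/spark/tmp
```

Entering a full path with a drive letter avoids any ambiguity about where the temporary files are written.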

Results