Defining the Dataproc connection parameters

Spark Batch

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Real-Time Big Data Platform
Talend Big Data
Talend Data Fabric
Talend Big Data Platform
task
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
EnrichPlatform
Talend Studio

Complete the Google Dataproc connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of cluster.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it is not applicable to Talend Open Studio for Big Data users.

Procedure

  1. Enter the basic connection information to Dataproc:

    Project identifier

    Enter the ID of your Google Cloud Platform project.

    If you are not certain about your project ID, check it in the Manage Resources page of your Google Cloud Platform services.

    Cluster identifier

    Enter the ID of your Dataproc cluster to be used.

    Region

    Enter the geographic region in which the computing resources are used and your data is stored and processed. If you do not need to specify a particular region, leave the default value global.

    For further information about the available regions and the zones each region groups, see Regions and Zones.

    Google Storage staging bucket

    A Talend Job requires its dependent jar files at execution time. Specify the Google Storage directory to which these jar files are transferred so that your Job can access them when it runs.

    The directory to be entered must end with a slash (/). If the directory does not exist, it is created on the fly, but the bucket to be used must already exist. A quick way to verify this bucket is sketched after this procedure.

  2. Provide the authentication information to your Google Dataproc cluster:

    Provide Google Credentials in file

    Leave this check box clear when you launch your Job from a machine on which the Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform. In this situation, this machine is often your local machine.

    When you launch your Job from a remote machine, such as a Jobserver, select this check box and, in the Path to Google Credentials file field that is displayed, enter the directory in which this JSON file is stored on the Jobserver machine. A quick way to validate this file is sketched after this procedure.

    For further information about this Google Credentials file, contact the administrator of your Google Cloud Platform or visit Google Cloud Platform Auth Guide.

  3. In the Yarn client mode, the Property type list is displayed so that you can select an established Hadoop connection from the Repository, provided that you have created this connection in the Repository. The Studio then reuses that set of connection information for this Job.
  4. In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, this directory is C:/tmp.
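
The staging bucket entered in step 1 must exist before the Job runs, while the staging directory itself can be created on the fly. If you want to confirm this outside the Studio, the following is a minimal sketch using the google-cloud-storage Python client; the project ID, bucket name, and staging directory are placeholder values, not ones read from your Job configuration.

# Minimal sketch (assumed values): verify the Google Storage staging bucket.
from google.cloud import storage

BUCKET = "my-staging-bucket"       # hypothetical bucket name
STAGING_DIR = "talend/jars/"       # hypothetical staging directory; must end with "/"

# Uses the credentials available on this machine (for example, the Google Cloud SDK login).
client = storage.Client(project="my-gcp-project")  # hypothetical project ID

# The bucket itself must already exist; the staging directory is created on the fly.
if client.lookup_bucket(BUCKET) is None:
    raise SystemExit("Bucket does not exist: create it before running the Job.")

if not STAGING_DIR.endswith("/"):
    raise SystemExit("The staging directory must end with a slash (/).")

print(f"gs://{BUCKET}/{STAGING_DIR} is usable as a staging location.")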
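
Similarly, when the Job is launched from a remote machine such as a Jobserver (step 2), you can check that the Google Credentials JSON file stored on that machine loads correctly. This is a minimal sketch using the google-auth Python library; the file path is a hypothetical example, not the path used by your Job.

# Minimal sketch (assumed path): confirm that the credentials JSON file loads.
from google.oauth2 import service_account

CREDENTIALS_PATH = "/opt/talend/credentials/gcp-service-account.json"  # hypothetical path

creds = service_account.Credentials.from_service_account_file(CREDENTIALS_PATH)
print(f"Loaded credentials for service account: {creds.service_account_email}")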

Results

After the connection is configured, you can tune the Spark performance, although this is not required, by following the process explained in: