Defining the Dataproc connection parameters

Complete the Google Dataproc connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of cluster.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

Enter the basic connection information to Dataproc:

Use local timezone	Select this check box to let Spark use the local timezone provided by the system. Note: If you clear this check box, Spark uses the UTC timezone. Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the timezone from the Spark configuration.
Use dataset API in migrated components	Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API: If you select the check box, the components inside the Job run with DS, which improves performance. If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility. Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box. Note: Newly created Jobs in 7.3 use DS and imported Jobs from 7.3 or earlier use RDD by default. However, not all components have been migrated from RDD to DS, so it is recommended to clear the check box by default to avoid errors.
Use timestamp for dataset components	Select this check box to use `java.sql.Timestamp` for dates. Note: If you leave this check box clear, `java.sql.Timestamp` or `java.sql.Date` can be used depending on the pattern.
Project identifier	Enter the ID of your Google Cloud Platform project. If you are not certain about your project ID, check it in the Manage Resources page of your Google Cloud Platform services.
Cluster identifier	Enter the ID of your Dataproc cluster to be used.
Region	From this drop-down list, select the Google Cloud region to be used.
Google Storage staging bucket	A Talend Job needs its dependent jar files at execution time. Specify the Google Storage directory to which these jar files are transferred so that your Job can access them. The directory you enter must end with a slash (/). If the directory does not exist, it is created on the fly, but the bucket to be used must already exist.
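
The staging directory rules above (an existing bucket, a trailing slash) can be checked before a value is entered in the field. The following is a minimal sketch and is not part of Talend Studio: the function name `normalize_staging_dir` is hypothetical, and it assumes the value is written as a bucket name followed by a directory path, optionally prefixed with `gs://`. The exact form expected by the field is not stated here, so verify it against your Studio version.

```python
from typing import Tuple


def normalize_staging_dir(value: str) -> Tuple[str, str]:
    """Split a staging location into (bucket, directory) and make sure
    the directory part ends with a slash.

    Hypothetical helper for illustration only. It does not contact
    Google Cloud Storage, so it cannot tell whether the bucket exists.
    """
    path = value.strip()
    # Assumption: an optional gs:// scheme may precede the bucket name.
    if path.startswith("gs://"):
        path = path[len("gs://"):]
    bucket, _, directory = path.partition("/")
    if not bucket:
        raise ValueError("a bucket name is required; the bucket must already exist")
    # The directory must end with a slash (/).
    if directory and not directory.endswith("/"):
        directory += "/"
    return bucket, directory


# "my-bucket" and "talend/jars" are placeholder names.
print(normalize_staging_dir("gs://my-bucket/talend/jars"))
```

The helper only normalizes the string. Because the bucket is not created on the fly, it still has to be created in Google Cloud Storage beforehand.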

Provide the authentication information to your Google Dataproc cluster:

Provide Google Credentials in file	Leave this check box clear when you launch your Job from a machine on which Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform. This machine is typically your local machine. When you launch your Job from a remote machine, such as a Jobserver, select this check box and, in the Path to Google Credentials file field that is displayed, enter the path to the credentials JSON file stored on the Jobserver machine. You can also click the [...] button and browse for the JSON file in the pop-up dialog box. For further information about the Google Credentials file, contact the administrator of your Google Cloud Platform or visit Google Cloud Platform Auth Guide.
Credential type	Select the mode to be used to authenticate to your project. Service Account: authenticate using a Google account that is associated with your Google Cloud Platform project. When you select this mode, the parameters to be defined in the Basic settings view are Path to Google Credentials file and, optionally, Use P12 credentials file format and Service Account Id. OAuth2 Access Token: authenticate using OAuth credentials. When you select this mode, the parameter to be defined in the Basic settings view is OAuth2 Access Token. This field is only available for the Dataproc 1.4 distribution.
OAuth2 Access Token	Enter an access token. Important: The token is only valid for one hour. Talend Studio does not refresh the token, so you must generate a new one once the one-hour limit has passed. You can generate an OAuth Access Token on Google Developers OAuth Playground by going to BigQuery API v2 and choosing all the needed permissions. This field is only available when you select OAuth2 Access Token from the Credential type drop-down list. This field is only available for the Dataproc 1.4 distribution.
Use P12 credentials file format	When the Google credentials file to be used is in P12 format, select this check box and then in the Service Account Id field that is displayed, enter the ID of the service account for which this P12 credentials file has been created. This field is only available when you select Service Account from the Credential type drop-down list. This field is only available for Dataproc 1.4 distribution.
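
Because Talend Studio does not refresh the OAuth2 access token, it can help to track when a token was generated and replace it before the one-hour limit. The sketch below is an external helper, not a Studio feature: the function name `token_needs_refresh` and the five-minute safety margin are assumptions made for illustration, and only the one-hour lifetime comes from the description above.

```python
from datetime import datetime, timedelta, timezone

# The token is only valid for one hour.
TOKEN_LIFETIME = timedelta(hours=1)


def token_needs_refresh(issued_at: datetime, now: datetime,
                        margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when a token issued at `issued_at` has expired or
    will expire within `margin` (a hypothetical safety margin)."""
    return now >= issued_at + TOKEN_LIFETIME - margin


# Placeholder issue time, used only to demonstrate the check.
issued = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(token_needs_refresh(issued, issued + timedelta(minutes=30)))  # False
print(token_needs_refresh(issued, issued + timedelta(minutes=58)))  # True
```

When the check returns True, generate a new token and paste it into the OAuth2 Access Token field before launching the Job.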

With the Yarn client mode, the Property type list is displayed to allow you to select an established Hadoop connection from the Repository, provided that you have already created this connection in the Repository. The Studio then reuses that connection information for this Job.
In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, the directory used is C:/tmp.
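
The way a drive-less value such as /tmp ends up on the C: disk can be illustrated with Python's `pathlib`. This is only a sketch of the path arithmetic, not of how the Studio resolves the field internally: the function name `resolve_scratch_dir` is hypothetical, and the default drive is passed in explicitly as an assumption.

```python
from pathlib import PureWindowsPath


def resolve_scratch_dir(field_value: str, default_drive: str = "C:/") -> str:
    """Show how a drive-less path such as /tmp maps onto a Windows drive.

    Hypothetical helper. PureWindowsPath only manipulates strings and
    never touches the file system, so this runs on any platform.
    """
    # Joining a rooted, drive-less path keeps the drive of the left operand;
    # a value that carries its own drive replaces the default entirely.
    return (PureWindowsPath(default_drive) / field_value).as_posix()


print(resolve_scratch_dir("/tmp"))          # C:/tmp
print(resolve_scratch_dir("D:/spark/tmp"))  # D:/spark/tmp
```

To keep temporary files off the C: disk, enter a path that includes the drive letter.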

Results

After the connection is configured, you can optionally tune Spark performance by following the process explained in:
- Tuning Spark for Apache Spark Batch Jobs for Spark Batch Jobs.
- Tuning Spark for Apache Spark Streaming Jobs for Spark Streaming Jobs.
It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run view of your Spark Job, in order to help debug and resume your Spark Job when issues arise:
- Logging and checkpointing the activities of your Apache Spark Job.