Complete the Google Dataproc connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.
Only the Yarn client mode is available for this type of cluster.
Enter the basic connection information to Dataproc:
Enter the ID of your Google Cloud Platform project.
If you are not certain about your project ID, check it in the Manage Resources page of your Google Cloud Platform services.
Enter the ID of your Dataproc cluster to be used.
From this drop-down list, select the Google Cloud region to be used.
Google Storage staging bucket
A Talend Job needs its dependent jar files at execution time. Specify the Google Storage directory to which these jar files are transferred so that your Job can access them when it runs.
The directory you enter must end with a slash (/). If the directory does not exist, it is created on the fly, but the bucket to be used must already exist.
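If the staging directory value is built by a script or taken from a context variable, the trailing-slash rule above can be checked before the Job is launched. The following is a minimal sketch only: `normalize_staging_dir` is a hypothetical helper name and `my-bucket/talend/jars` is an illustrative value. It enforces the trailing slash and cannot tell whether the bucket itself exists.

```python
def normalize_staging_dir(path: str) -> str:
    """Return the staging directory with exactly one trailing slash.

    Hypothetical helper: it only enforces the trailing-slash rule and
    does not contact Google Storage to verify the bucket.
    """
    cleaned = path.strip()
    if not cleaned.strip("/"):
        raise ValueError("staging directory must not be empty")
    # Drop any trailing slashes, then add back a single one.
    return cleaned.rstrip("/") + "/"


print(normalize_staging_dir("my-bucket/talend/jars"))
```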
Provide the authentication information to your Google Dataproc cluster:
Provide Google Credentials in file
Leave this check box clear when you launch your Job from a machine on which Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform. This machine is often your local machine.
When you launch your Job from a remote machine, such as a Jobserver, select this check box and, in the Path to Google Credentials file field that is displayed, enter the path to this JSON file on the Jobserver machine.
For further information about this Google Credentials file, contact the administrator of your Google Cloud Platform or see the Google Cloud Platform Auth Guide.
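Before entering a path in the Path to Google Credentials file field, it can help to confirm on the remote machine that the file is present and is well-formed JSON. The sketch below assumes nothing about the keys inside the file; `check_credentials_file` is a hypothetical helper name, and it does not verify that the credentials are actually authorized for your project.

```python
import json
from pathlib import Path


def check_credentials_file(path: str) -> dict:
    """Load a credentials JSON file and return its parsed content.

    Raises FileNotFoundError if the path is not a file, and ValueError
    if the content is not valid JSON or is not a JSON object.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"no credentials file at {file_path}")
    # json.JSONDecodeError is a subclass of ValueError.
    data = json.loads(file_path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("credentials file must contain a JSON object")
    return data
```

Run it on the Jobserver machine with the same path you plan to enter in the field; an exception means the Job would not be able to read the file either.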
- With the Yarn client mode, the Property type list is displayed so that you can select an established Hadoop connection from the Repository, provided that you have already created this connection in the Repository. The Studio then reuses that set of connection information for this Job.
- In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, the directory used is C:/tmp.
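The Windows behavior described for the scratch directory follows ordinary Windows path rules: a path without a drive letter is placed on the default drive. The sketch below illustrates that rule with Python's `pathlib`. It is not the Studio's own implementation, and `resolve_scratch_dir` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath


def resolve_scratch_dir(scratch: str, default_drive: str = "C:") -> str:
    """Show how a drive-less path such as /tmp maps onto a Windows drive.

    A path that already carries its own drive letter is left on that drive.
    """
    return PureWindowsPath(default_drive, scratch).as_posix()


print(resolve_scratch_dir("/tmp"))  # C:/tmp
```

To keep temporary files off the system disk, enter a path that includes its own drive letter in the field.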
After the connection is configured, you can optionally tune Spark performance by following the process explained in:
for Spark Batch Jobs.
for Spark Streaming Jobs.
It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run view of your Spark Job, in order to help debug and resume your Spark Job when issues arise: