Defining the AWS Qubole connection parameters for Spark Jobs - 7.3

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
Last publication date: 2024-02-21

Complete the Qubole connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Qubole is supported only in the traditional data integration framework (the Standard framework) and in the Spark frameworks.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Before you begin

  • You have properly set up your Qubole cluster on AWS. For further information about how to do this, see Getting Started with Qubole on AWS from the Qubole documentation.
  • Ensure that the Qubole account to be used has the proper IAM role that is allowed to read/write to the S3 bucket to be used. For further details, contact the administrator of your Qubole system or see Cross-account IAM Role for QDS from the Qubole documentation.
  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.

Procedure

  1. Enter the basic configuration information:
    Use local timezone: Select this check box to let Spark use the local timezone provided by the system (see the timezone sketch after this procedure).
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear that check box in the component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API (see the Dataset sketch after this procedure):
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged and backward compatibility is preserved.
    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: Jobs created in 7.3 use DS by default, while Jobs imported from 7.3 or earlier use RDD by default. However, not all components have been migrated from RDD to DS, so it is recommended to clear this check box to avoid errors.
    Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date is used, depending on the pattern.
  2. Enter the basic connection information to Qubole:

    Connection configuration

    • Click the ... button next to the API Token field to enter the authentication token generated for the Qubole user account to be used. For further information about how to obtain this token, see Manage Qubole account from the Qubole documentation.

      This token allows you to specify the user account you want to use to access Qubole. Your Job automatically uses the rights and permissions granted to this user account in Qubole.

    • Select the Cluster label check box and enter the name of the Qubole cluster to be used. If you leave this check box clear, the default cluster is used.

      If you need details about your default cluster, ask the administrator of your Qubole service. You can also refer to the Qubole documentation for more information about configuring a default Qubole cluster.

    • Select the Change API endpoint check box and select the region to be used. If you leave this check box clear, the default region is used.

      For further information about the Qubole Endpoints supported on QDS-on-AWS, see Supported Qubole Endpoints on Different Cloud Providers.

  3. Configure the connection to the S3 file system to be used to temporarily store the dependencies of your Job so that your Qubole cluster has access to these dependencies.
    This configuration is used for your Job dependencies only (see the S3 sketch after this procedure). Use a tS3Configuration component in your Job to write your actual business data to S3 with Qubole. Without tS3Configuration, this business data is written to the Qubole HDFS system and destroyed once you shut down your cluster.
    • Access key and Secret key: enter the authentication information required to connect to the Amazon S3 bucket to be used.

      To enter the password, click the [...] button next to the password field, and then in the pop-up dialog box enter the password between double quotes and click OK to save the settings.

    • Bucket name: enter the name of the bucket in which you want to store the dependencies of your Job. This bucket must already exist on S3.
    • Temporary resource folder: enter the directory in which you want to store the dependencies of your Job. For example, enter temp_resources to write the dependencies in the /temp_resources folder in the bucket.

      If this folder already exists at runtime, its contents are overwritten by the upcoming dependencies; otherwise, this folder is automatically created.

    • Region: specify the AWS region by selecting a region name from the list. For more information about the AWS Region, see Regions and Endpoints.
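
The Use local timezone option in step 1 determines which timezone Spark applies when handling date and time values. The following sketch is plain Spark Java code, not code generated by Talend Studio; it only illustrates how the standard spark.sql.session.timeZone property produces the same effect when you work against the Spark API directly. The application name, the local[*] master, and the exact property mapping used by the Studio are assumptions made for illustration.

import org.apache.spark.sql.SparkSession;

public class TimezoneSketch {
    public static void main(String[] args) {
        // Build a Spark session; "local[*]" is used here only for illustration.
        SparkSession spark = SparkSession.builder()
                .appName("timezone-sketch")
                .master("local[*]")
                // Equivalent of clearing "Use local timezone": force UTC for date/time handling.
                .config("spark.sql.session.timeZone", "UTC")
                .getOrCreate();

        // Equivalent of selecting "Use local timezone": use the JVM's default timezone instead.
        // spark.conf().set("spark.sql.session.timeZone", java.util.TimeZone.getDefault().getID());

        spark.sql("SELECT current_timestamp() AS now").show(false);
        spark.stop();
    }
}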
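
The Use dataset API in migrated components option decides whether components run on the RDD API or on the Dataset API. The sketch below is not Studio-generated code; it simply contrasts the two Spark APIs in plain Java to make the performance and compatibility trade-off concrete. The sample values, application name, and local[*] master are illustrative assumptions.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class RddVersusDatasetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdd-vs-dataset-sketch")
                .master("local[*]")
                .getOrCreate();

        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);

        // RDD API (check box cleared): low-level, untyped transformations,
        // not optimized by the Catalyst engine.
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<Integer> evenRdd = jsc.parallelize(values).filter(v -> v % 2 == 0);
        System.out.println("RDD result: " + evenRdd.collect());

        // Dataset API (check box selected): typed, declarative operations that
        // Spark can optimize, which is where the performance gain comes from.
        Dataset<Integer> evenDs = spark.createDataset(values, Encoders.INT())
                .filter((FilterFunction<Integer>) v -> v % 2 == 0);
        System.out.println("Dataset result: " + evenDs.collectAsList());

        spark.stop();
    }
}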
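
When the Job is submitted, the S3 fields described in step 3 end up as S3 connection settings for the underlying file system layer. The sketch below sets the standard Hadoop s3a properties by hand, assuming the s3a connector is used and that the hadoop-aws module is on the classpath; the credentials, endpoint, bucket, and folder names are placeholders, and the configuration actually generated by the Studio may differ.

import org.apache.spark.sql.SparkSession;

public class S3DependencySketch {
    public static void main(String[] args) {
        // Placeholder credentials; replace with your own values.
        String accessKey = "MY_ACCESS_KEY";
        String secretKey = "MY_SECRET_KEY";

        SparkSession spark = SparkSession.builder()
                .appName("s3-dependency-sketch")
                .master("local[*]")
                // Standard Hadoop s3a credentials, the counterpart of the
                // Access key and Secret key fields.
                .config("spark.hadoop.fs.s3a.access.key", accessKey)
                .config("spark.hadoop.fs.s3a.secret.key", secretKey)
                // Region-specific endpoint, the counterpart of the Region list;
                // "s3.eu-west-1.amazonaws.com" is only an example.
                .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")
                .getOrCreate();

        // Write a small test dataset under the temporary resource folder of the bucket
        // (bucket and folder names are hypothetical).
        spark.range(10)
                .write()
                .mode("overwrite")
                .parquet("s3a://my-bucket/temp_resources/connection_check");

        spark.stop();
    }
}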

Results

  • After the connection is configured, you can optionally tune Spark performance by following the Spark performance tuning process described in the Talend documentation.
  • If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in the cluster's file system in which Spark stores the context data of the computation, such as its metadata and generated RDDs (see the checkpointing sketch below).

    For more information about the Spark checkpointing operation, see the official Spark documentation.
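
For reference, outside the Studio the Spark checkpointing operation amounts to pointing the streaming context at a reliable directory and letting Spark persist the computation's metadata and generated RDDs there. The sketch below uses the JavaStreamingContext API with a hypothetical checkpoint path and a placeholder socket source; in a Talend Job, the directory you enter in the Spark configuration tab plays the same role.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical checkpoint location; in practice this should be a reliable,
        // cluster-visible file system such as HDFS or S3.
        final String checkpointDir = "hdfs:///user/talend/checkpoints/my_streaming_job";

        // getOrCreate either restores the context from existing checkpoint data
        // or builds a fresh one with the supplied factory function.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            // local[2] so the receiver and the processing each get a thread; illustrative only.
            SparkConf conf = new SparkConf()
                    .setAppName("checkpoint-sketch")
                    .setMaster("local[2]");
            JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(10));

            // Placeholder source and output so the context has a computation to checkpoint;
            // replace with your real streaming logic.
            context.socketTextStream("localhost", 9999).print();

            // Enable checkpointing of metadata and generated RDDs.
            context.checkpoint(checkpointDir);
            return context;
        });

        jssc.start();
        jssc.awaitTermination();
    }
}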