Complete the Qubole connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.
Qubole is supported only in the traditional data integration framework (the Standard framework) and in the Spark frameworks.
The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it is not applicable to Talend Open Studio for Big Data users.
Before you begin
- You have properly set up your Qubole cluster on AWS. For further information about how to do this, see Getting Started with Qubole on AWS from the Qubole documentation.
- Ensure that the Qubole account to be used has the proper IAM role that is allowed to read/write to the S3 bucket to be used. For further details, contact the administrator of your Qubole system or see Cross-account IAM Role for QDS from the Qubole documentation.
- Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system; a quick way to verify this access is sketched after this list.
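The following is a minimal sketch of how such S3 access could be verified, assuming the boto3 library, already-configured AWS credentials, and a hypothetical bucket name; it is not part of the Talend configuration itself.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-qubole-bucket"  # hypothetical bucket name

# Write a test object, read it back, then delete it to confirm
# read/write permissions on the bucket.
s3.put_object(Bucket=bucket, Key="talend-permission-check", Body=b"ok")
body = s3.get_object(Bucket=bucket, Key="talend-permission-check")["Body"].read()
assert body == b"ok"
s3.delete_object(Bucket=bucket, Key="talend-permission-check")
print("Read/write access to", bucket, "confirmed")
```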
Procedure
Results
After the connection is configured, you can optionally tune Spark performance by following the process explained in the documents below (a sketch of commonly tuned properties follows the list):
- Tuning Spark for Apache Spark Batch Jobs for Spark Batch Jobs.
- Tuning Spark for Apache Spark Streaming Jobs for Spark Streaming Jobs.
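As an illustration only, this minimal sketch shows a few commonly tuned Spark settings, assuming PySpark. The property names are standard Spark settings, but the values are placeholders, not recommendations; in Talend Studio such properties are typically entered as key/value pairs in the Spark configuration tab rather than in code.

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("qubole_tuning_sketch")          # hypothetical application name
    .set("spark.executor.memory", "4g")          # memory allocated per executor
    .set("spark.executor.cores", "2")            # cores used by each executor
    .set("spark.sql.shuffle.partitions", "200")  # parallelism of shuffle stages
)
sc = SparkContext(conf=conf)
```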
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable Spark checkpointing. In the field that is displayed, enter the directory in which Spark stores, in the file system of the cluster, the context data of the computations, such as the metadata and the generated RDDs.
For more information about the Spark checkpointing operation, see the official Spark documentation.
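For orientation, the following minimal sketch shows what an equivalent checkpointing setup looks like in plain PySpark; the Activate checkpointing check box and its directory field play the same role in the Job. The application name and the S3 checkpoint directory are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="checkpoint_sketch")  # hypothetical application name

# Directory on the cluster's file system (for example an S3 location on
# Qubole) where Spark stores checkpoint data such as metadata and
# generated RDDs; the bucket name is hypothetical.
sc.setCheckpointDir("s3://my-qubole-bucket/spark-checkpoints")

rdd = sc.parallelize(range(10)).map(lambda x: x * x)
rdd.checkpoint()  # mark the RDD for checkpointing
rdd.count()       # an action triggers the actual checkpoint write
```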