Defining the Databricks-on-AWS connection parameters for Spark Jobs - 7.1

Amazon S3

Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
Talend Studio

Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it does not apply to Talend Open Studio for Big Data users.

Before you begin

  • Ensure that only one Job runs on a given Databricks cluster at a time, and do not launch another Job before the current one finishes. Because each run automatically restarts the cluster, Jobs launched in parallel interrupt each other and cause execution failures.

  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.


Enter the basic connection information to Databricks on AWS.


  • In the Endpoint field, enter the URL of your Databricks workspace on AWS. For example, this URL could look like https://<your_endpoint>.

  • In the Cluster ID field, enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property on the properties list in the Environment tab in the Spark UI view of your cluster.

    You can also easily find this ID from the URL of your Databricks cluster. It is present immediately after cluster/ in this URL.
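The URL-based lookup described above can be sketched as a small helper. This is a minimal illustration, not part of Talend Studio: it assumes the cluster ID appears immediately after a cluster/ (or clusters/) segment in the URL path or fragment, and the example URL and ID below are made up.

```python
from urllib.parse import urlparse

def cluster_id_from_url(url: str) -> str:
    """Extract the Databricks cluster ID from a cluster page URL.

    Assumes the ID appears right after a 'cluster/' or 'clusters/'
    segment, as described above; raises ValueError otherwise.
    """
    parsed = urlparse(url)
    # Databricks UI URLs often carry the cluster path in the fragment
    # (after '#'), so check both the path and the fragment.
    for part in (parsed.path, parsed.fragment):
        segments = part.strip("/").split("/")
        for marker in ("clusters", "cluster"):
            if marker in segments:
                idx = segments.index(marker)
                if idx + 1 < len(segments):
                    return segments[idx + 1]
    raise ValueError(f"no cluster ID found in {url!r}")

# Hypothetical URL for illustration only.
print(cluster_id_from_url(
    "https://dbc-example.cloud.databricks.com"
    "/#setting/clusters/0123-456789-abcde123/configuration"
))
```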

  • Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Token management from the Databricks documentation.
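If you want to check the endpoint, cluster ID, and token outside Talend Studio before running the Job, you can query the Databricks REST API (GET /api/2.0/clusters/get), which authenticates with the same personal access token via a Bearer header. The sketch below only builds the request; the workspace URL, token, and cluster ID shown are hypothetical placeholders.

```python
import urllib.request

def build_cluster_get_request(endpoint: str, token: str,
                              cluster_id: str) -> urllib.request.Request:
    """Build a GET request against the Databricks clusters/get API,
    authenticated with a personal access token."""
    url = f"{endpoint.rstrip('/')}/api/2.0/clusters/get?cluster_id={cluster_id}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )

# Hypothetical values for illustration only.
req = build_cluster_get_request(
    "https://dbc-example.cloud.databricks.com",
    "dapiXXXXXXXX",
    "0123-456789-abcde123",
)
print(req.full_url)

# To actually verify the token, send the request (requires network access
# to your workspace); a 403 response indicates an invalid token:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```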

  • In the DBFS dependencies folder field, enter the directory used to store your Job-related dependencies on the Databricks Filesystem (DBFS) at runtime, ending the path with a slash (/). For example, enter /jars/ to store the dependencies in a folder named jars. If this folder does not exist, it is created on the fly.

    This directory stores your Job dependencies on DBFS only. In your Job, use tS3Configuration, tDynamoDBConfiguration or, in a Spark Streaming Job, the Kinesis components, to read or write your business data to the related systems.
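The trailing-slash convention for the DBFS dependencies folder can be enforced with a trivial normalization helper. This is an illustrative sketch only; Talend Studio does not expose such a function:

```python
def normalize_dbfs_folder(path: str) -> str:
    """Ensure a DBFS dependencies folder path is absolute
    and ends with a slash, as the field above requires."""
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path += "/"
    return path

print(normalize_dbfs_folder("jars"))  # -> /jars/
```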


If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in which Spark stores the context data of the computations, such as the metadata and the generated RDDs, in the file system of the cluster.

For further information about the Spark checkpointing operation, see the Apache Spark documentation.