Defining the Databricks-on-AWS connection parameters for Spark Jobs - 7.1

Amazon S3

Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
Talend Studio

Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it does not apply to Talend Open Studio for Big Data users.

Before you begin

  • Ensure that only one Job runs on a given Databricks cluster at a time, and do not launch another Job before the current one finishes. Because each run automatically restarts the cluster, Jobs launched in parallel interrupt each other and cause execution failures.

  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.


Enter the basic connection information to Databricks on AWS.


  • In the Endpoint field, enter the URL of your Databricks workspace on AWS. For example, this URL could look like https://<your_endpoint>.

  • In the Cluster ID field, enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property on the properties list in the Environment tab in the Spark UI view of your cluster.

    You can also easily find this ID from the URL of your Databricks cluster. It is present immediately after cluster/ in this URL.
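The URL-based lookup described above can be sketched as a small helper. This is a minimal illustration, not part of Talend Studio: it assumes the cluster ID appears immediately after a cluster/ (or clusters/) segment in the URL path or fragment, and the example URL and ID below are made up.

```python
from urllib.parse import urlparse

def cluster_id_from_url(url: str) -> str:
    """Extract the Databricks cluster ID from a cluster page URL.

    Assumes the ID appears right after a 'cluster/' or 'clusters/'
    segment, as described above; raises ValueError otherwise.
    """
    parsed = urlparse(url)
    # Databricks UI URLs often carry the cluster path in the fragment
    # (after '#'), so check both the path and the fragment.
    for part in (parsed.path, parsed.fragment):
        segments = part.strip("/").split("/")
        for marker in ("clusters", "cluster"):
            if marker in segments:
                idx = segments.index(marker)
                if idx + 1 < len(segments):
                    return segments[idx + 1]
    raise ValueError(f"no cluster ID found in {url!r}")

# Hypothetical URL for illustration only.
print(cluster_id_from_url(
    "https://dbc-example.cloud.databricks.com"
    "/#setting/clusters/0123-456789-abcde123/configuration"
))
```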

  • Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Token management from the Databricks documentation.
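If you want to check the endpoint, cluster ID, and token outside Talend Studio before running the Job, you can query the Databricks REST API (GET /api/2.0/clusters/get), which authenticates with the same personal access token via a Bearer header. The sketch below only builds the request; the workspace URL, token, and cluster ID shown are hypothetical placeholders.

```python
import urllib.request

def build_cluster_get_request(endpoint: str, token: str,
                              cluster_id: str) -> urllib.request.Request:
    """Build a GET request against the Databricks clusters/get API,
    authenticated with a personal access token."""
    url = f"{endpoint.rstrip('/')}/api/2.0/clusters/get?cluster_id={cluster_id}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )

# Hypothetical values for illustration only.
req = build_cluster_get_request(
    "https://dbc-example.cloud.databricks.com",
    "dapiXXXXXXXX",
    "0123-456789-abcde123",
)
print(req.full_url)

# To actually verify the token, send the request (requires network access
# to your workspace); a 403 response indicates an invalid token:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```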

  • In the DBFS dependencies folder field, enter the directory used to store your Job-related dependencies on the Databricks Filesystem (DBFS) at runtime, ending the path with a slash (/). For example, enter /jars/ to store the dependencies in a folder named jars. If this folder does not exist, it is created on the fly.

    This directory stores your Job dependencies on DBFS only. In your Job, use tS3Configuration, tDynamoDBConfiguration or, in a Spark Streaming Job, the Kinesis components, to read or write your business data to the related systems.
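The trailing-slash convention for the DBFS dependencies folder can be enforced with a trivial normalization helper. This is an illustrative sketch only; Talend Studio does not expose such a function:

```python
def normalize_dbfs_folder(path: str) -> str:
    """Ensure a DBFS dependencies folder path is absolute
    and ends with a slash, as the field above requires."""
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path += "/"
    return path

print(normalize_dbfs_folder("jars"))  # -> /jars/
```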


If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in which Spark stores the context data of the computations, such as the metadata and the generated RDDs, in the file system of the cluster.

For further information about the Spark checkpointing operation, see the Apache Spark documentation.