Defining the Databricks-on-AWS connection parameters for Spark Jobs - 7.1

Databricks

author: Talend Documentation Team
EnrichVersion: Cloud, 7.1
EnrichProdName: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
task: Design and Development > Designing Jobs > Hadoop distributions > Databricks; Design and Development > Designing Jobs > Serverless > Databricks
EnrichPlatform: Talend Studio

Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it is not applicable to Talend Open Studio for Big Data users.

Before you begin

  • Ensure that only one Job is sent to run on the same Databricks cluster at a time, and do not send another Job before the current one finishes running. Because each run automatically restarts the cluster, Jobs launched in parallel interrupt each other and cause execution failures.

  • Ensure that the AWS account to be used has the proper read/write permissions on the S3 bucket to be used. For this purpose, contact the administrator of your AWS system. Both prerequisites can be verified programmatically, as sketched after this list.
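
The following Python sketch shows one way to run these pre-flight checks outside Talend Studio. It is a minimal example, assuming the Databricks REST API 2.0 Clusters endpoint and the AWS boto3 SDK; the endpoint URL, token, cluster ID and bucket name are placeholders that you must replace with your own values.

```python
# Minimal pre-flight checks before submitting a Job (sketch, not part of Talend Studio).
# Assumes the Databricks REST API 2.0 and the boto3 AWS SDK; all values below are placeholders.
import requests
import boto3

ENDPOINT = "https://<your_endpoint>.cloud.databricks.com"
TOKEN = "<your_databricks_token>"
CLUSTER_ID = "<your_cluster_id>"
BUCKET = "<your_s3_bucket>"

# 1. Check the state of the target cluster so that you do not launch Jobs in parallel on it.
resp = requests.get(
    f"{ENDPOINT}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()
state = resp.json().get("state")
print(f"Cluster state: {state}")  # e.g. RUNNING, TERMINATED, RESTARTING

# 2. Check that the AWS credentials in use can reach the S3 bucket.
s3 = boto3.client("s3")
s3.head_bucket(Bucket=BUCKET)  # raises a ClientError if access is denied or the bucket is missing
print(f"S3 bucket {BUCKET} is reachable with the current credentials.")
```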

Procedure

Enter the basic connection information for Databricks on AWS.

Standalone

  • In the Endpoint field, enter the URL of your Databricks workspace on AWS. For example, this URL could look like https://<your_endpoint>.cloud.databricks.com.

  • In the Cluster ID field, enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property on the properties list in the Environment tab in the Spark UI view of your cluster.

    You can also find this ID easily in the URL of your Databricks cluster: it appears immediately after cluster/ in that URL.

  • Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Token management from the Databricks documentation.

  • In the DBFS dependencies folder field, enter the directory used to store your Job-related dependencies on the Databricks Filesystem (DBFS) at runtime, putting a slash (/) at the end of this directory. For example, enter /jars/ to store the dependencies in a folder named jars. This folder is created on the fly if it does not already exist.

    This directory stores your Job dependencies on DBFS only; it is not meant for your business data. In your Job, use tS3Configuration, tDynamoDBConfiguration or, in a Spark Streaming Job, the Kinesis components to read or write your business data to the related systems. A quick way to verify the endpoint, token and DBFS folder outside the Studio is sketched after this list.
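
Once the Endpoint, Token and DBFS dependencies folder fields are filled in, you can confirm that their values are valid before running the Job. The following Python sketch is a minimal check, assuming the Databricks REST API 2.0 DBFS endpoint; the endpoint URL, token and folder path are placeholders corresponding to the fields described above.

```python
# Sanity check of the endpoint, token and DBFS dependencies folder (sketch, placeholders only).
# Assumes the Databricks REST API 2.0 DBFS endpoint.
import requests

ENDPOINT = "https://<your_endpoint>.cloud.databricks.com"
TOKEN = "<your_databricks_token>"
DBFS_DEPENDENCIES_FOLDER = "/jars/"  # same value as in the DBFS dependencies folder field

resp = requests.get(
    f"{ENDPOINT}/api/2.0/dbfs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": DBFS_DEPENDENCIES_FOLDER},
)

if resp.status_code == 404:
    # The folder does not exist yet; it is created on the fly at the first run of the Job.
    print(f"{DBFS_DEPENDENCIES_FOLDER} does not exist yet; it will be created at runtime.")
else:
    resp.raise_for_status()
    # List the dependencies already uploaded to the folder, if any.
    for entry in resp.json().get("files", []):
        print(entry["path"])
```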

Results