Defining Databricks connection parameters with Spark Universal - Cloud

Talend Studio User Guide

Version: Cloud, 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Cloud, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development
Last publication date: 2024-04-16
Available in: Big Data, Big Data Platform, Cloud Big Data, Cloud Big Data Platform, Cloud Data Fabric, Data Fabric, Real-Time Big Data Platform

About this task

Talend Studio connects to an all-purpose Databricks cluster to run the Job from this cluster.

Procedure

Click the Run view beneath the design workspace, then click the Spark configuration view.
Select Built-in from the Property type drop-down list.
If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to use.
Tip: Setting up the connection in the Repository allows you to avoid configuring that connection each time you need it in the Spark configuration view of your Jobs. The fields are automatically filled.
Select Universal from the Distribution drop-down list, the Spark version from the Version drop-down list, and Databricks from the Runtime mode/environment drop-down list.

Enter the basic configuration information:

Parameter	Usage
Use local timezone	Select this check box to let Spark use the local time zone provided by the system. Note: If you clear this check box, Spark uses the UTC time zone. Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the time zone from the Spark configuration.
Use dataset API in migrated components	Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API. If you select the check box, the components inside the Job run with DS, which improves performance. If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged and backward compatibility is preserved. This check box is selected by default, but if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD. Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
Use timestamp for dataset components	Select this check box to use `java.sql.Timestamp` for dates. Note: If you leave this check box cleared, `java.sql.Timestamp` or `java.sql.Date` can be used depending on the pattern.
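
The effect of the Use local timezone check box can be illustrated with a small Python sketch. This is not how Talend Studio or Spark implements the option; it only shows that the same instant renders differently in UTC and in a local zone. The names `render` and `component_uses_local`, the sample instant, and the fixed UTC+02:00 offset standing in for the system time zone are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def render(instant, use_local_timezone, local_tz):
    """Format an instant in the local zone if the box is selected, else in UTC."""
    tz = local_tz if use_local_timezone else timezone.utc
    return instant.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")


def component_uses_local(component_checked, spark_checked):
    """Assumed reading of the note: a cleared component check box
    falls back to the Spark configuration setting."""
    return component_checked or spark_checked


# Hypothetical values: one instant, and a fixed offset standing in
# for the time zone provided by the system.
instant = datetime(2024, 4, 16, 22, 30, tzinfo=timezone.utc)
local_tz = timezone(timedelta(hours=2))

print(render(instant, True, local_tz))   # 2024-04-17 00:30:00
print(render(instant, False, local_tz))  # 2024-04-16 22:30:00
```

In this example the calendar date itself changes between the two settings, which is why the choice matters for date columns.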

Complete the Databricks configuration parameters:

Parameter	Usage
Cloud provider	Select the cloud provider to use: AWS, Azure, or GCP.
Run mode	Select the mode used to run your Job on the Databricks cluster when you execute it from Talend Studio. With Create and run now, a new Job is created on Databricks and run immediately. With Runs submit, a one-time run is submitted without creating a Job on Databricks.
Use pool	Select this check box to use a Databricks pool. If you do, you must enter the Pool ID instead of the Cluster ID, and you must select Job clusters from the Cluster type drop-down list.
Endpoint	Enter the URL address of your workspace.
Cluster ID	Enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property on the properties list in the Environment tab in the Spark UI view of your cluster.
Token	Enter the authentication token generated for your Databricks user account.
DBFS dependencies folder	Enter the directory used to store your Job-related dependencies on the Databricks Filesystem at runtime, ending the path with a slash (/). For example, enter /jars/ to store the dependencies in a folder named jars. This folder is created on the fly if it does not already exist.
Project ID	Enter the ID of the Google Cloud Platform project where your Databricks project is located. This field is only available when you select GCP from the Cloud provider drop-down list.
Bucket	Enter the name of the Google Cloud Platform bucket you use for Databricks. This field is only available when you select GCP from the Cloud provider drop-down list.
Workspace ID	Enter the ID of your Google Cloud Platform workspace, using the following format: `databricks-workspaceid`. This field is only available when you select GCP from the Cloud provider drop-down list.
Google credentials	Enter the directory on the Talend JobServer machine in which the JSON file containing your service account key is stored. This field is only available when you select GCP from the Cloud provider drop-down list.
Poll interval when retrieving Job status (in ms)	Enter the interval (in milliseconds) at which Talend Studio asks Spark for the status of your Job.
Cluster type	From the drop-down list, select the type of cluster you want to use. For more information, see About Databricks clusters.
Do not restart the cluster when submitting	Select this check box to prevent Talend Studio from restarting the cluster when it submits your Jobs. However, if you make changes to your Jobs, clear this check box so that Talend Studio restarts the cluster and takes these changes into account.
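
Several of the rules in the table above can be checked mechanically before you run a Job. The following Python sketch is only an illustration of those rules, not part of Talend Studio: the `DatabricksSettings` class, its field names, the `check_settings` function, and the sample values are hypothetical. The `databricks-` prefix check is one reading of the `databricks-workspaceid` format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatabricksSettings:
    """Hypothetical container mirroring a few parameters of the table."""
    cloud_provider: str               # "AWS", "Azure" or "GCP"
    cluster_type: str                 # value of the Cluster type list
    dbfs_dependencies_folder: str     # for example "/jars/"
    use_pool: bool = False
    cluster_id: Optional[str] = None
    pool_id: Optional[str] = None
    workspace_id: Optional[str] = None  # GCP only


def check_settings(s: DatabricksSettings) -> List[str]:
    """Return a list of rule violations; an empty list means no issue found."""
    problems: List[str] = []
    if s.cloud_provider not in ("AWS", "Azure", "GCP"):
        problems.append("Cloud provider must be AWS, Azure or GCP")
    if s.use_pool:
        # With a pool, the Pool ID replaces the Cluster ID and the
        # cluster type must be Job clusters.
        if not s.pool_id:
            problems.append("Use pool is selected: a Pool ID is required")
        if s.cluster_type != "Job clusters":
            problems.append("Use pool is selected: Cluster type must be Job clusters")
    if not s.dbfs_dependencies_folder.endswith("/"):
        problems.append("DBFS dependencies folder must end with a slash (/)")
    if s.cloud_provider == "GCP" and s.workspace_id is not None:
        # Assumed interpretation of the databricks-workspaceid format.
        if not s.workspace_id.startswith("databricks-"):
            problems.append("Workspace ID must look like databricks-workspaceid")
    return problems


# Hypothetical sample values.
ok = DatabricksSettings("AWS", "Job clusters", "/jars/", use_pool=True, pool_id="pool-123")
bad = DatabricksSettings("GCP", "All-purpose", "/jars", use_pool=True, workspace_id="12345")

print(check_settings(ok))
for problem in check_settings(bad):
    print(problem)
```

The sketch covers only constraints stated explicitly in the table; it does not verify that the endpoint, token, or IDs are valid on Databricks.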

In the Spark "scratch" directory field, enter the local directory in which Talend Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:, so if you leave /tmp in this field, the directory used is C:/tmp.
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the Checkpoint directory field, enter the directory in the file system of the cluster in which Spark stores the context data of the computations, such as the metadata and the generated RDDs.
In the Advanced properties table, add any Spark properties whose default values, as used by Talend Studio, you want to override.
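
Two of the steps above lend themselves to a short Python sketch: how a scratch directory such as /tmp resolves on Windows, and how Advanced properties take precedence over defaults. This is an illustration under stated assumptions, not Talend Studio code. The function names are hypothetical, and the default property values are made up for the example (the property names themselves are standard Spark settings).

```python
from pathlib import PureWindowsPath


def resolve_scratch_dir_on_windows(scratch_dir, default_drive="C:"):
    """Resolve a scratch directory against the default Windows drive.

    A rooted path without a drive, such as /tmp, keeps the default
    drive; a path that names its own drive is returned unchanged.
    """
    return (PureWindowsPath(default_drive + "/") / scratch_dir).as_posix()


def effective_spark_properties(defaults, advanced):
    """Merge properties so that entries from the Advanced properties
    table override their default counterparts."""
    merged = dict(defaults)
    merged.update(advanced)
    return merged


print(resolve_scratch_dir_on_windows("/tmp"))  # C:/tmp

# Hypothetical default values, for illustration only.
defaults = {"spark.executor.memory": "1g", "spark.sql.shuffle.partitions": "200"}
advanced = {"spark.executor.memory": "4g"}
print(effective_spark_properties(defaults, advanced))
```

`PureWindowsPath` is used so that the path logic behaves the same way on any operating system running the sketch.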

Results

The connection details for the Databricks cluster are complete. You are ready to schedule executions of your Job or to run it immediately from this cluster.