
Configuring the Spark cluster on Azure Databricks

Procedure

  1. On the Configuration tab of your Databricks cluster page, scroll down to the Spark tab at the bottom of the page.
  2. Click Edit to make the fields on this page editable.
  3. Since your Spark cluster uses tAzureFSConfiguration to connect to the ADLS Gen1 folder from which you move data to Gen2, enter in this Spark tab the Spark properties that carry the credentials to be used to access that ADLS Gen1 folder.
    spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
    spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
    spark.hadoop.dfs.adls.oauth2.credential <your_client_secret>
    spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token

    Add these ADLS Gen1 related properties, one property per line.
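As a sanity check before pasting into the Spark tab, the four properties above can be assembled programmatically. This is a minimal sketch; the app_id, client_secret, and tenant_id values are placeholders, not real credentials.

```python
# Assemble the ADLS Gen1 OAuth Spark properties as a dict, then render
# them in the "key value" one-per-line form the Spark tab expects.
app_id = "<your_app_id>"
client_secret = "<your_client_secret>"
tenant_id = "<your_app_TENANT-ID>"

spark_properties = {
    "spark.hadoop.dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "spark.hadoop.dfs.adls.oauth2.client.id": app_id,
    "spark.hadoop.dfs.adls.oauth2.credential": client_secret,
    "spark.hadoop.dfs.adls.oauth2.refresh.url":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# One "key value" pair per line, as required by the Spark tab.
spark_tab_text = "\n".join(f"{key} {value}" for key, value in spark_properties.items())
print(spark_tab_text)
```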

  4. Restart your Spark cluster.
  5. In the Spark UI tab of your Databricks cluster page, click Environment to display the list of properties and verify that each of the properties you added in the previous steps is present on that list.
  6. In the Spark Configuration tab of the Run view of your Job, enter the basic connection information to Databricks.

    Standalone

    • Use pool: you can select this check box to leverage a Databricks pool. If you do, you must indicate the pool ID instead of the cluster ID in the Spark Configuration. You must also select Job clusters from the Cluster type drop-down list.

    • In the Endpoint field, enter the URL address of your Azure Databricks workspace. This URL can be found in the Overview blade of your Databricks workspace page on your Azure portal. For example, this URL could look like https://adb-$workspaceId.$random.azuredatabricks.net.

    • In the Cluster ID field, enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property on the properties list in the Environment tab in the Spark UI view of your cluster.

      You can also easily find this ID from the URL of your Databricks cluster. It is present immediately after cluster/ in this URL.

    • If you selected the Use pool option, in the Pool ID field, enter the ID of the Databricks pool to be used. This ID is the value of the DatabricksInstancePoolId key of your pool. You can find this key under Tags in the Configuration tab of your pool. It is also available in the tags of the clusters that are using the pool.

      You can also easily find this ID from the URL of your Databricks pool. It is present immediately after cluster/instance-pools/view/ in this URL.
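The two URL rules above (the cluster ID follows cluster/ and the pool ID follows cluster/instance-pools/view/) can be sketched as a small helper. The sample URLs below are made-up illustrations, not real workspace addresses, and the exact path layout may vary between Databricks releases.

```python
# Extract the ID that immediately follows a known marker in a Databricks URL.

def id_after(url: str, marker: str) -> str:
    """Return the path segment that immediately follows marker in url."""
    _, _, rest = url.partition(marker)
    return rest.split("/")[0].split("?")[0]

# Hypothetical URLs, shaped after the rules described above.
cluster_url = "https://adb-1234567890123456.7.azuredatabricks.net/#/setting/cluster/0630-191345-leap375/configuration"
pool_url = "https://adb-1234567890123456.7.azuredatabricks.net/#/setting/cluster/instance-pools/view/0630-191345-pool42"

cluster_id = id_after(cluster_url, "cluster/")
pool_id = id_after(pool_url, "cluster/instance-pools/view/")
```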

    • Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Personal access tokens from the Azure documentation.

    • In the DBFS dependencies folder field, enter the directory used to store your Job-related dependencies on the Databricks Filesystem at runtime, ending this directory with a slash (/). For example, enter /jars/ to store the dependencies in a folder named jars. This folder is created on the fly if it does not exist.
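The trailing-slash requirement is easy to forget, so a small normalization helper can be used to check the value before entering it. This is a sketch; the folder name /jars/ mirrors the example above, and any other path is hypothetical.

```python
# Normalize a DBFS dependencies folder so it starts and ends with a slash,
# as the field above expects (for example, /jars/).

def normalize_dbfs_folder(path: str) -> str:
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path += "/"
    return path
```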

    • Poll interval when retrieving Job status (in ms): enter the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job. For example, this status could be Pending or Running.

      The default value is 300000, that is, 5 minutes. This interval is recommended by Databricks to correctly retrieve the Job status.
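Since the field takes milliseconds, converting a candidate value to a readable form helps avoid off-by-a-factor mistakes; 300000 ms, the default, works out to 5 minutes. A minimal sketch:

```python
# Convert a poll interval given in milliseconds to a "Xm Ys" string,
# to double-check values before entering them in this field.

def poll_interval_readable(ms: int) -> str:
    seconds = ms / 1000
    minutes, remainder = divmod(seconds, 60)
    return f"{int(minutes)}m {remainder:g}s"

print(poll_interval_readable(300000))  # -> 5m 0s
```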

    • Cluster type: select the type of cluster to be used between Job clusters and All-purpose clusters.

      The custom properties you defined in the Advanced properties table are automatically taken into account by the job clusters at runtime.

      1. Autoscale: select or clear this check box to define the number of workers to be used by your job cluster.
        1. If you select this check box, autoscaling is enabled. Then define the minimum number of workers in Min workers and the maximum number of workers in Max workers. Your job cluster is scaled up and down within this range based on its workload.

          According to the Databricks documentation, autoscaling works best with Databricks runtime version 3.0 or later.

        2. If you clear this check box, autoscaling is deactivated. Then define the number of workers a job cluster is expected to have. This number does not include the Spark driver node.
      2. Node type and Driver node type: select the node types for the workers and the Spark driver node. These types determine the capacity of your nodes and their pricing by Databricks.

        For details about these node types and the Databricks Units they use, see Supported Instance Types from the Databricks documentation.

      3. Elastic disk: select this check box to enable your job cluster to automatically scale up its disk space when its Spark workers are running low on disk space.

        For more details about this elastic disk feature, search for the section about autoscaling local storage from your Databricks documentation.

      4. SSH public key: if SSH access has been set up for your cluster, enter the public key of the generated SSH key pair. This public key is automatically added to each node of your job cluster. If no SSH access has been set up, ignore this field.

        For further information about SSH access to your cluster, see SSH access to clusters from the Databricks documentation.

      5. Configure cluster log: select this check box to define where to store your Spark logs for the long term. This storage system can be S3 or DBFS.
    • Do not restart the cluster when submitting: select this check box to prevent Talend Studio from restarting the cluster when submitting your Jobs. However, if you make changes in your Jobs, clear this check box so that Talend Studio restarts your cluster to take these changes into account.
  7. Press F6 to run this Job to start the migration.
