Defining the Databricks-on-AWS connection parameters for Spark Jobs - 7.3

Databricks

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development > Designing Jobs > Hadoop distributions > Databricks; Design and Development > Designing Jobs > Serverless > Databricks
Last publication date: 2024-02-21

Complete the Databricks connection configuration in the Spark Configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Before you begin

    1. When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
    2. When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box; otherwise, since each run automatically restarts the cluster, Jobs launched in parallel interrupt each other and cause execution failures.
  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.

Procedure

  1. Enter the basic configuration information:
    Use local timezone: select this check box to let Spark use the local timezone provided by the system (see the sketch at the end of this step).
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have a Use local timezone for date check box. If you clear that check box in the component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components: select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: Jobs newly created in 7.3 use DS by default, and Jobs imported from 7.3 or earlier use RDD by default. However, not all components have been migrated from RDD to DS, so it is recommended to clear this check box to avoid errors.
    Use timestamp for dataset components: select this check box to use java.sql.Timestamp for dates.
    Note: If you clear this check box, either java.sql.Timestamp or java.sql.Date can be used, depending on the pattern.
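    For illustration, the timezone choice corresponds to Spark's session timezone setting. The following minimal PySpark sketch (not Talend-generated code; the configuration key is standard Spark, but exactly how the Studio applies it is an assumption here) shows the UTC behavior you get when the check box is cleared:

      # Minimal PySpark sketch of the session timezone controlled by the
      # "Use local timezone" check box. How the Studio applies this setting
      # is assumed here; spark.sql.session.timeZone itself is standard Spark.
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = (
          SparkSession.builder
          .appName("timezone_demo")
          # Check box cleared: timestamps are interpreted in UTC.
          .config("spark.sql.session.timeZone", "UTC")
          # Check box selected: the system timezone is used instead, for example:
          # .config("spark.sql.session.timeZone", "Europe/Paris")
          .getOrCreate()
      )

      df = spark.createDataFrame([("2024-02-21 10:00:00",)], ["ts_string"])
      df.select(F.to_timestamp("ts_string").alias("ts")).show(truncate=False)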
  2. From the Cloud provider drop-down list, select AWS.
  3. From the Run mode drop-down list, select the method you want to use to run your Job on Databricks:
    • Create and run now: a new Job is created and run immediately. With this method, you can retrieve your Job by its ID in your Databricks workspace. For more information, see Run now, from the official Databricks documentation.
    • Runs submit: a one-time run is submitted without creating a Job. With this method, nothing is displayed in the user interface and no Job ID is created in your Databricks workspace. For more information, see Runs submit, from the official Databricks documentation.
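    These two run modes correspond to operations of the Databricks Jobs REST API. The following Python sketch shows them directly with the requests library so the difference is visible; the workspace URL, token, and cluster ID are placeholders, and this is an illustration of the API operations rather than the Studio's own code:

      # Illustrative calls to the Databricks Jobs REST API (Jobs API 2.1).
      # HOST, the token, and the cluster ID are placeholders for your own values.
      import requests

      HOST = "https://<your-workspace>.cloud.databricks.com"
      HEADERS = {"Authorization": "Bearer <your-personal-access-token>"}

      task = {
          "task_key": "talend_spark_job",
          "existing_cluster_id": "<your-cluster-id>",
          "spark_jar_task": {"main_class_name": "org.example.MyTalendJob"},
      }

      # "Create and run now": the Job is created first (it gets a Job ID that is
      # visible in the workspace), then triggered.
      job = requests.post(f"{HOST}/api/2.1/jobs/create", headers=HEADERS,
                          json={"name": "talend_job", "tasks": [task]}).json()
      requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS,
                    json={"job_id": job["job_id"]})

      # "Runs submit": a one-time run with no persistent Job definition, so no
      # Job ID appears in the workspace.
      requests.post(f"{HOST}/api/2.1/jobs/runs/submit", headers=HEADERS,
                    json={"run_name": "talend_one_time_run", "tasks": [task]})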
  4. Enter the basic connection information to Databricks on AWS:

    Standalone

    • Use pool: you can select this check box to leverage a Databricks pool. If you do, you must indicate the pool ID instead of the cluster ID in the Spark Configuration. You must also select Job cluster from the Cluster type drop-down list.

    • In the Endpoint field, enter the URL address of the workspace of your Databricks on AWS. For example, this URL could look like https://<your-workspace-deployment-name>.cloud.databricks.com.

    • In the Cluster ID field, enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property on the properties list in the Environment tab in the Spark UI view of your cluster.

      You can also easily find this ID from the URL of your Databricks cluster. It is present immediately after cluster/ in this URL.
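      If you prefer, the Clusters REST API also returns this ID. The following sketch lists the cluster names and IDs of your workspace; the workspace URL and token are placeholders:

        # List clusters and their IDs through the Databricks Clusters API.
        # HOST and the token are placeholders.
        import requests

        HOST = "https://<your-workspace>.cloud.databricks.com"
        HEADERS = {"Authorization": "Bearer <your-personal-access-token>"}

        clusters = requests.get(f"{HOST}/api/2.0/clusters/list", headers=HEADERS).json()
        for c in clusters.get("clusters", []):
            print(c["cluster_name"], c["cluster_id"])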

    • If you selected the Use pool option, in the Pool ID field, enter the ID of the Databricks pool to be used. This ID is the value of the DatabricksInstancePoolId key of your pool. You can find this key under Tags in the Configuration tab of your pool. It is also available in the tags of the clusters that are using the pool.

      You can also easily find this ID from the URL of your Databricks pool. It is present immediately after cluster/instance-pools/view/ in this URL.

    • Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For more information, see Personal access tokens from the Databricks documentation.

    • In the DBFS dependencies folder field, enter the directory that is used to store your Job-related dependencies on Databricks Filesystem at runtime, putting a slash (/) at the end of this directory. For example, enter /jars/ to store the dependencies in a folder named jars. This folder is created on the fly if it does not already exist.

      This directory stores your Job dependencies on DBFS only. In your Job, use tS3Configuration, tDynamoDBConfiguration or, in a Spark Streaming Job, the Kinesis components, to read or write your business data to the related systems.
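      To check what has been uploaded to this dependencies folder, you can list it through the DBFS REST API, as in the following sketch (the /jars path matches the example above; the workspace URL and token are placeholders):

        # List the contents of the DBFS dependencies folder (/jars in the
        # example above). HOST and the token are placeholders.
        import requests

        HOST = "https://<your-workspace>.cloud.databricks.com"
        HEADERS = {"Authorization": "Bearer <your-personal-access-token>"}

        listing = requests.get(f"{HOST}/api/2.0/dbfs/list", headers=HEADERS,
                               params={"path": "/jars"}).json()
        for f in listing.get("files", []):
            print(f["path"], f["file_size"])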

    • Poll interval when retrieving Job status (in ms): enter, without the quotation marks, the time interval (in milliseconds) at the end of which you want the Studio to ask Spark for the status of your Job. For example, this status could be Pending or Running.

      The default value is 300000, that is, 5 minutes. This interval is recommended by Databricks to correctly retrieve the Job status.
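      In effect, this interval drives a status loop equivalent to the following sketch, which queries the Jobs API for the run state at the configured interval (the run ID, workspace URL, and token are placeholders, and the Studio's own polling code may differ):

        # Poll the status of a Databricks run every POLL_INTERVAL_MS milliseconds.
        # All identifiers below are placeholders.
        import time
        import requests

        HOST = "https://<your-workspace>.cloud.databricks.com"
        HEADERS = {"Authorization": "Bearer <your-personal-access-token>"}
        POLL_INTERVAL_MS = 300000  # value of the Spark Configuration field

        run_id = "<your-run-id>"
        while True:
            run = requests.get(f"{HOST}/api/2.1/jobs/runs/get", headers=HEADERS,
                               params={"run_id": run_id}).json()
            state = run["state"]["life_cycle_state"]  # e.g. PENDING, RUNNING, TERMINATED
            print(state)
            if state not in ("PENDING", "RUNNING"):
                break
            time.sleep(POLL_INTERVAL_MS / 1000)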

    • Cluster type: select the type of cluster to be used, either Job clusters or All-purpose clusters.

      The custom properties you defined in the Advanced properties table are automatically taken into account by the job clusters at runtime. The job cluster settings below are also summarized in the sketch at the end of this procedure.

      1. Use policy: select this check box to enter the name of the policy to be used by your job cluster. You can use a policy to limit the ability to configure clusters based on a set of rules. For more information about cluster policies, see Manage cluster policies from the official Databricks documentation.
      2. Autoscale: select or clear this check box to define the number of workers to be used by your job cluster.
        1. If you select this check box, autoscaling is enabled. Then define the minimum number of workers in Min workers and the maximum number of workers in Max workers. Your job cluster is scaled up and down within this range based on its workload.

          According to the Databricks documentation, autoscaling works best with Databricks runtime version 3.0 or later.

        2. If you clear this check box, autoscaling is deactivated. Then define the number of workers a job cluster is expected to have. This number does not include the Spark driver node.
      3. Node type and Driver node type: select the node types for the workers and the Spark driver node. These types determine the capacity of your nodes and their pricing by Databricks.

        For more information about these node types and the Databricks Units they use, see Supported Instance Types from the Databricks documentation.

      4. Elastic disk: select this check box to enable your job cluster to automatically scale up its disk space when its Spark workers are running low on disk space.

        For more details about this elastic disk feature, search for the section about autoscaling local storage from your Databricks documentation.

      5. SSH public key: if an SSH access has been set up for your cluster, enter the public key of the generated SSH key pair. This public key is automatically added to each node of your job cluster. If no SSH access has been set up, ignore this field.

        For more information about SSH access to your cluster, see SSH access to clusters from the official Databricks documentation.

      6. Configure cluster log: select this check box to define where to store your Spark logs for the long term. The storage system can be S3 or DBFS.
    • Do not restart the cluster when submitting: select this check box to prevent the Studio from restarting the cluster when submitting your Jobs. However, if you make changes to your Jobs, clear this check box so that the Studio restarts your cluster to take these changes into account.
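  For reference, the job cluster settings described in the Cluster type section above map onto the cluster specification that the Databricks Jobs API expects. The following sketch shows a hypothetical specification using standard fields of that API; the values are placeholders and the exact payload the Studio builds may differ:

    # Hypothetical job cluster specification using standard Databricks Jobs API
    # fields; values are placeholders and the Studio's actual payload may differ.
    new_cluster = {
        "spark_version": "9.1.x-scala2.12",                 # Databricks runtime version
        "node_type_id": "i3.xlarge",                        # Node type (workers)
        "driver_node_type_id": "i3.xlarge",                 # Driver node type
        "autoscale": {"min_workers": 2, "max_workers": 8},  # Autoscale selected
        # "num_workers": 4,                                 # Autoscale cleared instead
        "enable_elastic_disk": True,                        # Elastic disk
        "policy_id": "<your-cluster-policy-id>",            # Use policy
        "instance_pool_id": "<your-pool-id>",               # Use pool
        "ssh_public_keys": ["ssh-rsa AAAA... user@host"],   # SSH public key
        "cluster_log_conf": {                               # Configure cluster log
            "s3": {"destination": "s3://<your-bucket>/cluster-logs",
                   "region": "us-east-1"}
        },
    }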

Results

If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in which Spark stores, in the file system of the cluster, the context data of the computation, such as the metadata and the generated RDDs of this computation (see the sketch below).

For more information about the Spark checkpointing operation, see the official Spark documentation.
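A minimal PySpark sketch of what checkpointing amounts to in Spark terms is shown below; the directory is a placeholder standing in for the value of the field described above:

  # Minimal PySpark sketch of Spark checkpointing: set the checkpoint directory
  # and checkpoint an RDD. The path is a placeholder.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("checkpoint_demo").getOrCreate()
  sc = spark.sparkContext
  sc.setCheckpointDir("/checkpoints")  # directory on the cluster's file system

  rdd = sc.parallelize(range(10)).map(lambda x: x * x)
  rdd.checkpoint()      # truncates the lineage; data is written to the directory
  print(rdd.collect())  # triggers the computation and the checkpoint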