Creating Databricks run profiles - Cloud

Talend Cloud Management Console for Pipelines User Guide

author: Talend Documentation Team
EnrichVersion: Cloud
EnrichProdName: Talend Cloud
task:
  Administration and Monitoring > Managing projects
  Administration and Monitoring > Managing users
  Deployment > Deploying > Executing Tasks
  Deployment > Scheduling > Scheduling Tasks
EnrichPlatform: Talend Management Console

Before you begin

  • You must have Operator or Administrator rights for Talend Cloud Pipeline Designer.
  • You must set up the Remote Engine Gen2 or ensure that your subscription allows the use of the Cloud Engine for Design. For instructions on setting up the Remote Engine, see the Talend Remote Engine Gen2 Quick Start Guide.

About this task

Currently, only Long Term Support (LTS) versions of the Databricks runtime are supported.

Procedure

  1. Go to the Engines tab.
  2. Click the name of the engine on which you want to configure the run profile.
  3. Click the Run profiles tab on the Engine details page.
  4. Click ADD PROFILE.
  5. Select the engine to which you want to apply the run profile.
    The current engine is selected by default.
  6. Select the Databricks run profile type.
  7. Enter the name of the profile.
  8. Optional: Enter the description of the run profile.
  9. Select your cloud provider from the drop-down list.
    • AWS
    • Azure
  10. Enter your Databricks API endpoint.
    The expected syntax of the endpoint is https://<DatabricksAccount>.cloud.databricks.com.
  11. Optional: Enter your Databricks API token.
    Your token can be found in the User Settings > Access Tokens menu of your Databricks account.
  12. Enter the path of the staging directory in your Databricks File System (DBFS).
    The path must start with dbfs:/, for example, dbfs:/tpd-staging/. This folder stores all the dependencies of the connectors used in Talend Cloud Pipeline Designer.

  13. In the Basic configuration section, enter the micro-batch interval in milliseconds.
    The default value is 5000.
  14. Select the type of target cluster to use from the drop-down list.
    • New cluster
    • Existing cluster
    If you select Existing cluster, you only need to enter the cluster ID.
  15. If you are using a new cluster, configure the following attributes:
    1. Enter the node type ID.
      This field determines the size of the machine for the Spark nodes. For more information on Amazon node types, refer to the Amazon documentation.
    2. Specify the DBFS folder in which to collect the logs.
    3. Specify the number of machines to use.
  16. In the Advanced configuration section, click ADD PARAMETER to create a parameter.

    Example

    To set the amount of memory to use per executor process, enter spark.executor.memory in the parameter key field and 16g in the value field.
  17. Click SAVE.
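
The constraints stated in the procedure can be checked before you fill in the form. The following Python sketch is only an illustration: the DatabricksRunProfile class, its field names, and the validate function are hypothetical and do not correspond to any Talend or Databricks API. It also assumes that all three new-cluster attributes are required, which the procedure does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatabricksRunProfile:
    """Hypothetical container mirroring the form fields of the procedure."""
    name: str
    cloud_provider: str                     # "AWS" or "Azure"
    api_endpoint: str
    staging_directory: str                  # must start with dbfs:/
    micro_batch_interval_ms: int = 5000     # default value given in step 13
    api_token: Optional[str] = None         # optional in the form
    existing_cluster_id: Optional[str] = None
    node_type_id: Optional[str] = None      # new cluster only
    log_folder: Optional[str] = None        # new cluster only
    num_machines: Optional[int] = None      # new cluster only
    advanced_parameters: Dict[str, str] = field(default_factory=dict)


def validate(profile: DatabricksRunProfile) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if not profile.name.strip():
        problems.append("The profile name is empty.")
    if profile.cloud_provider not in ("AWS", "Azure"):
        problems.append("The cloud provider must be AWS or Azure.")
    if not profile.api_endpoint.startswith("https://"):
        problems.append("The API endpoint must start with https://.")
    if not profile.staging_directory.startswith("dbfs:/"):
        problems.append("The staging directory must start with dbfs:/.")
    if profile.micro_batch_interval_ms <= 0:
        problems.append("The micro-batch interval must be a positive number of milliseconds.")
    if profile.existing_cluster_id is None:
        # Assumption: a new cluster needs all three attributes from step 15.
        if not profile.node_type_id:
            problems.append("A new cluster needs a node type ID.")
        if not profile.log_folder:
            problems.append("A new cluster needs a DBFS log folder.")
        if profile.num_machines is None or profile.num_machines < 1:
            problems.append("A new cluster needs at least one machine.")
    return problems


profile = DatabricksRunProfile(
    name="databricks-profile",
    cloud_provider="AWS",
    api_endpoint="https://<DatabricksAccount>.cloud.databricks.com",
    staging_directory="dbfs:/tpd-staging/",
    existing_cluster_id="<ClusterID>",
    advanced_parameters={"spark.executor.memory": "16g"},
)
print(validate(profile))
```

With an existing cluster ID and the example values from the procedure, the list of problems is empty.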

Results

The new run profile is listed on the Engines > RUN PROFILES page in Talend Cloud Management Console. In Talend Cloud Pipeline Designer, the same run profile is available in the run profile drop-down list of the pipeline.

Note: The first execution of a pipeline on the cluster takes longer than subsequent executions because the dependencies are first deployed to the Databricks File System (DBFS). To manually upload these dependencies to DBFS and significantly reduce the first execution duration, follow this procedure.
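
To see whether dependencies are already present in the staging directory, you can list its content through the Databricks REST API. The following Python sketch is a minimal illustration, not part of the Talend procedure: it assumes the DBFS list endpoint (/api/2.0/dbfs/list) with a bearer token, and the function names build_dbfs_list_request and list_staging_files are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_dbfs_list_request(api_endpoint: str, token: str,
                            staging_directory: str) -> urllib.request.Request:
    """Build (without sending) a request that lists a DBFS directory."""
    # The DBFS REST API expects an absolute path, so the dbfs: scheme is dropped.
    path = staging_directory
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    query = urllib.parse.urlencode({"path": path})
    url = f"{api_endpoint.rstrip('/')}/api/2.0/dbfs/list?{query}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_staging_files(api_endpoint: str, token: str,
                       staging_directory: str) -> List[str]:
    """Send the request and return the paths found in the staging directory."""
    request = build_dbfs_list_request(api_endpoint, token, staging_directory)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # An empty directory yields a response without a "files" entry.
    return [entry["path"] for entry in payload.get("files", [])]


# Building the request needs no network access, so it can be inspected first.
request = build_dbfs_list_request(
    "https://example.cloud.databricks.com",  # placeholder endpoint
    "my-token",                              # placeholder token
    "dbfs:/tpd-staging/",
)
print(request.full_url)
```

Calling list_staging_files requires a reachable workspace and a valid token. If the directory does not exist yet, the request is expected to fail with an HTTP error, which suggests that no dependencies have been deployed so far.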