Before you begin
- You must have Operator or Administrator rights for Talend Cloud Pipeline Designer.
- You must set up the Remote Engine Gen2 or ensure that your subscription allows the use of the Cloud Engine for Design. For instructions on setting up the Remote Engine, see the Talend Remote Engine Gen2 Quick Start Guide.
Currently only Long Term Support (LTS) Databricks runtime versions are supported.
About this task
- Go to the Engines tab.
- Click the name of the engine on which you want to configure the run profile.
- Click the Run profiles tab on the Engine details page.
- Click ADD PROFILE.
- Select the engine to which you want to apply the run profile. The current engine is selected by default.
- Select the Databricks run profile type.
- Enter the name of the profile.
- Optional: Enter the description of the run profile.
- Select your cloud provider from the drop-down list.
- Enter your Databricks API endpoint. The expected syntax of the endpoint is https://<DatabricksAccount>.cloud.databricks.com.
- Optional: Enter your Databricks API token. Your token can be found in the menu of your Databricks account.
- Enter the address of your Databricks File System's staging directory. The path must start with dbfs:/, for example dbfs:/tpd-staging/. This folder is used to store all the dependencies of the connectors used in Talend Cloud Pipeline Designer.
- In the Basic configuration section, enter the micro-batch interval. The default value is 5000.
- Select the type of target cluster to use from the drop-down list:
- New cluster
- Existing cluster
If you choose to use an existing cluster, you only need to enter its ID.
- If you are using a new cluster, configure the following attributes:
- Enter the node type ID. This field determines the size of the machine for the Spark nodes. For more information on Amazon node types, refer to the Amazon documentation.
- Define in which DBFS folder to collect the logs.
- Specify the number of machines to use.
- In the Advanced configuration section, click ADD PARAMETER to create a parameter.
Example: To set the amount of memory to use per executor process, enter spark.executor.memory in the parameter key field and 16g in the value field.
- Click SAVE.
The created run profile is listed on the Talend Cloud Management Console. In Talend Cloud Pipeline Designer, the same run profile appears in the run profile drop-down list of the pipeline.
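As a rough illustration of how the steps above fit together, the sketch below assembles the new-cluster settings (node type ID, number of machines, DBFS log folder, advanced Spark parameters) into a cluster specification dictionary. This is a hedged sketch, not Talend's internal format: the field names follow the public Databricks Clusters API, and all values shown (node type, worker count, paths) are placeholders.

```python
import json

# Hypothetical sketch: how the run profile settings above could map onto a
# Databricks cluster specification. Field names follow the public Databricks
# Clusters API; the values are placeholders, not Talend defaults.
def build_cluster_spec(node_type_id, num_workers, log_dir, spark_params):
    """Assemble a cluster specification dictionary from run profile settings."""
    return {
        "node_type_id": node_type_id,         # size of the Spark node machines
        "num_workers": num_workers,           # number of machines to use
        "cluster_log_conf": {
            "dbfs": {"destination": log_dir}  # DBFS folder collecting the logs
        },
        "spark_conf": spark_params,           # advanced key/value parameters
    }

spec = build_cluster_spec(
    node_type_id="i3.xlarge",                 # example Amazon node type
    num_workers=2,
    log_dir="dbfs:/cluster-logs/",
    spark_params={"spark.executor.memory": "16g"},  # the advanced parameter example
)
print(json.dumps(spec, indent=2))
```

The advanced parameters entered in the run profile end up as plain key/value pairs under spark_conf, which is why the key must be a valid Spark property name such as spark.executor.memory.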
Note: The first execution of a pipeline on the cluster takes more time than the following ones because dependencies are deployed on Databricks File System (DBFS). To manually upload these dependencies to DBFS and significantly reduce the first execution duration, follow this procedure.
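One way to upload a dependency to DBFS manually is through the Databricks REST API. The sketch below builds the request for the single-shot /api/2.0/dbfs/put endpoint; it is an assumption-laden illustration (the endpoint URL, token, and file path are placeholders, and build_dbfs_put_request is a hypothetical helper, not part of Talend or Databricks tooling). Note that for large files the DBFS API expects a create/add-block/close streaming sequence rather than a single put.

```python
import base64
import json

# Hedged sketch: preparing a single-shot DBFS upload request against the
# Databricks REST API (/api/2.0/dbfs/put). All identifiers below are
# placeholders for illustration only.
def build_dbfs_put_request(endpoint, token, dbfs_path, data):
    """Return the URL, headers, and JSON body for a DBFS put request."""
    url = f"{endpoint}/api/2.0/dbfs/put"
    headers = {"Authorization": f"Bearer {token}"}
    body = json.dumps({
        "path": dbfs_path,                            # absolute DBFS path, e.g. /tpd-staging/dep.jar
        "contents": base64.b64encode(data).decode(),  # file contents, base64-encoded
        "overwrite": True,                            # replace an existing file
    })
    return url, headers, body

url, headers, body = build_dbfs_put_request(
    "https://<DatabricksAccount>.cloud.databricks.com",  # endpoint from the run profile
    "dapi-example-token",                                # hypothetical API token
    "/tpd-staging/my-dependency.jar",                    # matches the staging directory above
    b"...jar bytes...",
)
```

The request could then be sent with any HTTP client; the point of the sketch is only to show that the staging directory configured in the run profile is an ordinary DBFS path that you can populate ahead of the first pipeline execution.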