Setting up the Databricks File System (DBFS) staging - Cloud

Talend Cloud Management Console for Pipelines User Guide

author
Talend Documentation Team

Before you create your cluster, you need to create some DBFS resources using the Databricks REST API so that the cluster is compatible with Talend Cloud Pipeline Designer.

Procedure

  1. Follow the Databricks documentation to set up REST API authentication for AWS or for Azure.
  2. Check that you are able to browse the Databricks File System using the REST API.
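
As a minimal sketch of this check, the DBFS list endpoint of the Databricks REST API 2.0 can be called on the root path. The helper names dbfs_list_url and dbfs_list and the variables DATABRICKS_HOST and DATABRICKS_TOKEN are illustrative assumptions, not names taken from the Talend or Databricks documentation, and the host follows the URL pattern used in the steps below.

```shell
#!/bin/bash
# Hypothetical helpers: function and variable names are assumptions for illustration.

# Build the URL of the DBFS "list" endpoint for a given host and DBFS path.
# The path is inserted as-is, so it must not contain characters that need URL encoding.
dbfs_list_url() {
  local host="$1" path="$2"
  printf 'https://%s/api/2.0/dbfs/list?path=%s\n' "$host" "$path"
}

# List a DBFS path (defaults to the root). With --fail, curl returns a non-zero
# exit code on an HTTP error such as a rejected token.
dbfs_list() {
  local path="${1:-/}"
  curl --silent --show-error --fail \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    "$(dbfs_list_url "${DATABRICKS_HOST}" "${path}")"
}

# Example usage (requires network access and a valid token):
#   export DATABRICKS_HOST='<account>.cloud.databricks.com'
#   export DATABRICKS_TOKEN='MY_TOKEN'
#   dbfs_list /
```

If the call returns a JSON document rather than an authentication error, the token and host are usable for the remaining steps.
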
  3. Create the root staging directory:
    curl -n -H 'Authorization:Bearer MY_TOKEN' -d '{"path" : "/DBFS_STAGING_DIRECTORY_NAME"}' "https://<account>.cloud.databricks.com/api/2.0/dbfs/mkdirs"

    where DBFS_STAGING_DIRECTORY_NAME corresponds to the name of your new DBFS staging directory and <account> corresponds to your Databricks account name.

  4. Create the Init Scripts directory:
    curl -n -H 'Authorization:Bearer MY_TOKEN' -d '{"path" : "/DBFS_STAGING_DIRECTORY_NAME/scripts"}' "https://<account>.cloud.databricks.com/api/2.0/dbfs/mkdirs"
  5. Copy the following content into a file and save it on your local machine:
    #!/bin/bash
    
    #
    # Talend patches for Databricks Spark 2.4.X
    #
    
    # Pin everything from com.fasterxml.jackson to 2.9.4
    rm /databricks/jars/spark--maven-trees--spark_2.4--com.fasterxml.jackson.core--jackson-core--com.fasterxml.jackson.core__jackson-core__2.6.7.jar
    wget --quiet -O /databricks/jars/jackson-core-2.9.4.jar https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.9.4/jackson-core-2.9.4.jar
    
    rm /databricks/jars/spark--maven-trees--spark_2.4--com.fasterxml.jackson.core--jackson-databind--com.fasterxml.jackson.core__jackson-databind__2.6.7.1.jar
    wget --quiet -O /databricks/jars/jackson-databind-2.9.4.jar https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.9.4/jackson-databind-2.9.4.jar
    
    rm /databricks/jars/spark--maven-trees--spark_2.4--com.fasterxml.jackson.core--jackson-annotations--com.fasterxml.jackson.core__jackson-annotations__2.6.7.jar
    wget --quiet -O /databricks/jars/jackson-annotations-2.9.4.jar https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.9.4/jackson-annotations-2.9.4.jar
    
    rm /databricks/jars/spark--maven-trees--spark_2.4--com.fasterxml.jackson.module--jackson-module-scala_2.11--com.fasterxml.jackson.module__jackson-module-scala_2.11__2.6.7.1.jar
    wget --quiet -O /databricks/jars/jackson-module-scala_2.11-2.9.4.jar https://repo1.maven.org/maven2/com/fasterxml/jackson/module/jackson-module-scala_2.11/2.9.4/jackson-module-scala_2.11-2.9.4.jar
    
    rm /databricks/jars/spark--maven-trees--spark_2.4--com.fasterxml.jackson.dataformat--jackson-dataformat-cbor--com.fasterxml.jackson.dataformat__jackson-dataformat-cbor__2.6.7.jar
    wget --quiet -O /databricks/jars/jackson-dataformat-cbor-2.9.4.jar https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-cbor/2.9.4/jackson-dataformat-cbor-2.9.4.jar
    
    rm /databricks/jars/spark--maven-trees--spark_2.4--com.fasterxml.jackson.datatype--jackson-datatype-joda--com.fasterxml.jackson.datatype__jackson-datatype-joda__2.6.7.jar
    wget --quiet -O /databricks/jars/jackson-datatype-joda-2.9.4.jar https://repo1.maven.org/maven2/com/fasterxml/jackson/datatype/jackson-datatype-joda/2.9.4/jackson-datatype-joda-2.9.4.jar
    
    rm /databricks/jars/spark--maven-trees--spark_2.4--com.fasterxml.jackson.module--jackson-module-paranamer--com.fasterxml.jackson.module__jackson-module-paranamer__2.6.7.jar
    wget --quiet -O /databricks/jars/jackson-module-paranamer-2.9.4.jar https://repo1.maven.org/maven2/com/fasterxml/jackson/module/jackson-module-paranamer/2.9.4/jackson-module-paranamer-2.9.4.jar
    
    # Remove the buggy jackson-module-jaxb-annotations-2.9.4-shaded-for-mysql-cdc.jar, which has a wrong service loader definition, until Databricks fixes it.
    rm /databricks/jars/spark--versions--2.4--com.fasterxml.jackson.module__jackson-module-jaxb-annotations__2.9.4_shaded-for-mysql-cdc.jar
    
  6. Upload the file to your Init Scripts directory:
    curl -n -H 'Authorization:Bearer MY_TOKEN' -F contents=@LOCAL_PATH_TO_SAVED_FILE -F path="/DBFS_STAGING_DIRECTORY_NAME/scripts/databricks_spark_2.4.X_patches.sh" "https://<account>.cloud.databricks.com/api/2.0/dbfs/put"