Adding Azure-specific properties to access the Azure storage system from Databricks - Cloud - 8.0

Spark Streaming

Version: Cloud 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
Last publication date: 2024-02-20

Add the Azure-specific properties to the Spark configuration of your Databricks cluster so that the cluster can access Azure Storage.

You need to do this only when you want your Talend Jobs for Apache Spark to use Azure Blob Storage or Azure Data Lake Storage with Databricks.
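For example, once configured for Azure Blob Storage, the Spark configuration of the cluster contains entries similar to the following, with placeholder values (the exact properties to add are detailed in the procedure below):

  spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net <key>
  spark.serializer org.apache.spark.serializer.KryoSerializer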

Before you begin

  • Ensure that your Spark cluster in Databricks has been properly created, is running, and that its version is supported by Talend Studio. If you use Azure Data Lake Storage Gen2, only Databricks 5.4 is supported.

    For further information, see Create Databricks workspace in the Azure documentation.

  • You have an Azure account.
  • The Azure Blob Storage or Azure Data Lake Storage service to be used has been properly created and you have the appropriate permissions to access it. For further information about Azure Storage, see Azure Storage tutorials in the Azure documentation.
  • If you are using a Machine Learning component or tMatchPredict, ensure that you have set the Databricks Runtime Version setting to X.X LTS ML.

Procedure

  1. On the Configuration tab of your Databricks cluster page, scroll down to the Spark tab at the bottom of the page.

  2. Click Edit to make the fields on this page editable.
  3. In the Spark tab, enter the Spark properties that hold the credentials to be used to access your Azure Storage system. Add the properties for the storage service you use; a reference URI pattern for each service is given after its property list.
    Azure Blob Storage

    When you need to use Azure Blob Storage with Azure Databricks, add the following Spark properties:

    • The parameter that provides the account key:

      spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net <key>

      Ensure that the account to be used has the appropriate read/write rights and permissions.

    • If you need to append data to an existing file, add this parameter:

      spark.hadoop.fs.azure.enable.append.support true
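    For reference, Spark typically addresses Azure Blob Storage through the wasb:// scheme, or wasbs:// over SSL, with paths of the following form (the container and account names are placeholders):

      wasbs://<container>@<storage_account>.blob.core.windows.net/<path>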
    Azure Data Lake Storage (Gen1)

    When you need to use Azure Data Lake Storage Gen1 with Databricks, add the following Spark properties, one per line:
    spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
    spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
    spark.hadoop.dfs.adls.oauth2.credential <your_authentication_key>
    spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token
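    For reference, Spark typically addresses Azure Data Lake Storage Gen1 through the adl:// scheme, with paths of the following form (the account name is a placeholder):

      adl://<storage_account>.azuredatalakestore.net/<path>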
    Azure Data Lake Storage (Gen2)

    When you need to use Azure Data Lake Storage Gen2 with Databricks, add the following Spark properties, one per line:

    • The parameter that provides the account key:

      spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net <key>

      This key is associated with the storage account to be used. You can find it in the Access keys blade of this storage account. Two keys are available for each account, and by default either of them can be used for this access.

      Ensure that the account to be used has the appropriate read/write rights and permissions.

    • If the ADLS file system to be used does not exist yet, add the following parameter:

      spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization true
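    For reference, Spark typically addresses Azure Data Lake Storage Gen2 through the abfs:// scheme, or abfss:// over TLS, with paths of the following form (the file system and account names are placeholders):

      abfss://<file_system>@<storage_account>.dfs.core.windows.net/<path>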
    For further information about how to find the application ID and authentication key used in the Gen1 properties above, see Get application ID and authentication key in the Azure documentation. In the same documentation, you can also find details about how to find your tenant ID at Get tenant ID.
  4. If you need to run Spark Streaming Jobs with Databricks, add the following property in the same Spark tab to define the default Spark serializer. If you do not plan to run Spark Streaming Jobs, you can ignore this step.
    spark.serializer org.apache.spark.serializer.KryoSerializer
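    If your Streaming Jobs serialize large objects, you may also need to raise the Kryo buffer limit. spark.kryoserializer.buffer.max is a standard Spark property, not specific to Talend, and the value below is only illustrative:

      spark.kryoserializer.buffer.max 256m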
  5. Restart your Spark cluster.
  6. In the Spark UI tab of your Databricks cluster page, click Environment to display the list of properties and verify that each of the properties you added in the previous steps is present on that list.
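    For example, if you configured access to Azure Data Lake Storage Gen2, the Environment list should contain an entry such as the following, with your storage account name in place of the placeholder:

      spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net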