Moving data from ADLS Gen1 to ADLS Gen2 using Azure Databricks - 7.2

author
Talend Documentation Team
EnrichVersion
7.2
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Cloud storages > Azure components > Azure Data Lake Store components
Data Quality and Preparation > Third-party systems > Cloud storages > Azure components > Azure Data Lake Store components
Design and Development > Designing Jobs > Hadoop distributions > Databricks
Design and Development > Designing Jobs > Serverless > Databricks
Design and Development > Third-party systems > Cloud storages > Azure components > Azure Data Lake Store components
EnrichPlatform
Talend Studio

Moving data from ADLS Gen1 to ADLS Gen2 using Azure Databricks

Mount your Azure Data Lake Storage Gen2 (ADLS Gen2) filesystem to DBFS and use a Talend Job to move your data from ADLS Gen1 to ADLS Gen2.

This article demonstrates how to mount ADLS Gen2 to DBFS and then design a Job to accomplish this move. If you need details about how to mount ADLS Gen1, see Mounting ADLS Gen1 in the Azure Databricks documentation.

Granting your application access to your ADLS Gen2 filesystem

Before you begin

An Azure subscription is required.

Procedure

  1. Create your Azure Data Lake Storage Gen2 account if you do not have it yet.
  2. Create an Azure Active Directory application on your Azure portal. For more details about how to do this, see the "Create an Azure Active Directory application" section in Azure documentation: Use portal to create an Azure Active Directory application.
  3. Obtain the application ID, object ID and the client secret of the application to be used from the portal.
    1. On the list of the registered applications, click the application you created and registered in the previous step to display its information blade.
    2. Click Overview to open its blade, and from the top section of the blade, copy the Object ID and the application ID displayed as Application (client) ID. Keep them somewhere safe for later use.
    3. Click Certificates & secrets to open its blade and then create the authentication key (client secret) to be used on this blade in the Client secrets section.
  4. Go back to the Overview blade of the application to be used, click Endpoints at the top of this blade, copy the value of OAuth 2.0 token endpoint (v1) from the endpoint list that appears, and keep it somewhere safe for later use.
  5. Set the read and write permissions to the ADLS Gen2 filesystem to be used for the service principal of your application.
    It is very likely that the administrator of your Azure system has included your account and your applications in the group that has access to a given ADLS Gen2 storage account and a given ADLS Gen2 filesystem. In this case, ask your administrator to ensure that you have the proper access and then ignore this step.
    1. Start your Microsoft Azure Storage Explorer and find your ADLS Gen2 storage account on the Storage Accounts list.
      If you have not installed Microsoft Azure Storage Explorer, you can download it from the Microsoft Azure official site.
    2. Expand this account and the Blob Containers node under it; then click the ADLS Gen2 hierarchical filesystem to be used under this node.

      Example

      The filesystem in this image is for demonstration purposes only. Create the filesystem to be used under the Blob Containers node in your Microsoft Azure Storage Explorer, if you do not have one yet.

    3. On the blade that is opened, click Manage Access to open its wizard.
    4. At the bottom of this wizard, add the object ID of your application to the Add user or group field and click Add.
    5. Select the object ID just added from the Users and groups list and select all the permissions for Access and Default.
    6. Click Save to validate these changes and close this wizard.

Mounting the Azure Data Lake Storage Gen2 filesystem to DBFS

Before you begin

  • Ensure that you have granted your application the read and write permissions to your ADLS Gen2 filesystem.

Procedure

  1. Download the Databricks CLI and install it as described in this documentation: Databricks Command-Line Interface.
  2. Use this Databricks CLI to create a Databricks-backed secret scope. For example, name this scope talendadlsgen2. The command to be used is:

    Example

    databricks secrets create-scope --scope talendadlsgen2 --initial-manage-principal users

    This command grants all users access to this secret scope.

  3. Add a secret to this scope using the following command:

    Example

    databricks secrets put --scope talendadlsgen2 --key adlscredentials

    In this command, talendadlsgen2 is the name of the secret scope created in the previous step; adlscredentials is the secret to be created.

  4. Once the command in the previous step is run, a text editor opens automatically. Paste the value of the adlscredentials secret in this editor, then save and exit the editor. In this step, this value is the client secret of the application registered to access your ADLS Gen2 storage account.

    Example

    # ----------------------------------------------------------------------
    # Do not edit the above line. Everything below it will be ignored.
    # Please input your secret value above the line. Text will be stored in
    # UTF-8 (MB4) form and any trailing new line will be stripped.
    # Exit without saving will abort writing secret.

    This value must be added above this line.

  5. Repeat this process to add the following secrets, one at a time, to the talendadlsgen2 secret scope:
    • adlsclientid: the value of this secret is the application ID of the application registered for your ADLS Gen2 storage account.
    • adlsendpoint: the value of this secret is the OAuth 2.0 token endpoint you copied for this application.
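
    Optionally, you can check from a notebook cell that the scope now contains the three secrets used by the mount code later in this procedure. This is a minimal sketch, assuming the scope and key names used in this article; Databricks redacts secret values, so only the key names can be listed.

    Example

    // List the keys registered in the talendadlsgen2 scope and report any that are missing.
    val expected = Set("adlscredentials", "adlsclientid", "adlsendpoint")
    val registered = dbutils.secrets.list("talendadlsgen2").map(_.key).toSet
    println(s"Missing secrets: ${expected.diff(registered)}")
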
  6. On your Azure Databricks portal, create a Databricks cluster from the Azure Databricks Workspace. The version of this cluster must be among those supported by Talend.
  7. Once the cluster is created and running, switch back to the Azure Databricks Workspace and click Create a Blank Notebook.

    Example

  8. Add the following Scala code to this Notebook and replace <file-system-name>, <storage-account-name> and <mount-name> with their actual values:

    Example

    // OAuth configuration read from the talendadlsgen2 secret scope created above.
    val configs = Map(
      "fs.azure.account.auth.type" -> "OAuth",
      "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id" -> dbutils.secrets.get(scope = "talendadlsgen2", key = "adlsclientid"),
      "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "talendadlsgen2", key = "adlscredentials"),
      "fs.azure.account.oauth2.client.endpoint" -> dbutils.secrets.get(scope = "talendadlsgen2", key = "adlsendpoint")
    )

    // Mount the ADLS Gen2 filesystem to DBFS under /mnt/<mount-name>.
    // Optionally, you can add <directory-name> to the source URI of your mount point.
    dbutils.fs.mount(
      source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
      mountPoint = "/mnt/<mount-name>",
      extraConfigs = configs)
  9. Optionally, append the following lines to the code added in the previous step:

    Example

    val df = spark.read.text("/mnt/<mount-name>/<file-location>")
    df.show()

    These lines allow you to access files in your ADLS Gen2 filesystem as if they were in DBFS.
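
    If you need to run the mount command from step 8 again, for example after correcting a secret, note that dbutils.fs.mount fails when the mount point already exists. The following lines are an optional sketch, assuming the same <mount-name> placeholder, that unmounts the filesystem and lists the mounts that remain registered.

    Example

    // Remove the previously created mount point, then list the mounts that are still registered.
    dbutils.fs.unmount("/mnt/<mount-name>")
    dbutils.fs.mounts().foreach(m => println(s"${m.mountPoint} -> ${m.source}"))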

Results

If the ADLS Gen2 filesystem to be mounted contains files, run this Notebook. You should then see the data stored in the file specified by <file-location> in the output of the lines appended in the last step.
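
If you also want to confirm that the mounted filesystem is writable before running the migration Job, you can write a small test file from the same Notebook and read it back. This is a minimal sketch, assuming the <mount-name> placeholder used above; the test file name is arbitrary.

Example

// Write a small test file to the mounted ADLS Gen2 filesystem, then print its content back.
dbutils.fs.put("/mnt/<mount-name>/talend_mount_test.txt", "mount write test", true)
println(dbutils.fs.head("/mnt/<mount-name>/talend_mount_test.txt"))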

Granting the application to be used access to your ADLS Gen1 folder

Procedure

  1. Create an Azure Active Directory application on your Azure portal to access your ADLS Gen1 folder. For more details about how to do this, see the "Create an Azure Active Directory application" section in Azure documentation: Use portal to create an Azure Active Directory application.
  2. Obtain the application ID and the client secret (authentication key) from the portal.
    1. On the list of the registered applications, click the application you created and registered in the previous step to display its information blade.
    2. In the Essentials area, copy its application ID.
    3. Click All settings to display the Settings blade and click Required permissions on that blade.
    4. On the Required permissions blade, click Windows Azure Active Directory to display the Enable Access blade.
    5. Select the permissions to be granted to your application and click Save to close the Enable Access blade. You may need the consent of the administrator of your Azure portal for the grant to take effect.
    6. Still on the Required permissions blade of your application, click Add and on the Add API access blade, click Select an API.
    7. Click Azure Data Lake and then click Select to validate your selection and automatically open the Enable Access blade of this API.
    8. Select the permission to be granted and click Select to close the Enable Access blade.
    9. On the Add API access blade, click Done to return to the Settings blade of your application.
    10. Click Keys to open the Keys blade.
    11. In the Password area, enter the description of your key, define its validity duration and then click Save to display the value of your key.
    12. Copy the key value and keep it somewhere safe, because you will not be able to retrieve the key once you leave this blade.
  3. Go back to the list of the Azure Data Lake Storage services, select the Data Lake Storage Gen1 account to be used and then click Data Explorer.
  4. On the blade that is opened, click Access to open the Access blade.
  5. Click Add and on the Select User or Group blade, search for your application, select it and click the Select button to open the Select Permission blade.
  6. Select the permission to be assigned to your application and click OK.
    In this example, select all the permissions.
  7. Obtain the Azure OAUTH 2.0 token endpoint by proceeding as follows:
    1. Click Azure Active Directory and on the blade that is displayed, click App registrations.
    2. On the App registrations blade, click Endpoints and on the Endpoints blade, copy the value of the OAUTH 2.0 TOKEN ENDPOINT field.

Creating a Job to move data from ADLS Gen1 to Gen2

Before you begin

  • A Talend Studio with Big Data is started and the Integration perspective is active.
  • Your Databricks cluster is running.

Procedure

  1. Right-click the Big Data Batch node under Job Designs and select Create Big Data Batch Job from the contextual menu.
  2. In the New Job wizard, give a name to the Job you are going to create and provide other useful information if needed.
  3. Click Finish to create your Job.
    An empty Job is opened in the Studio.
  4. In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tAzureFSConfiguration, tFileInputDelimited and tFileOutputDelimited.
  5. Connect tFileInputDelimited to tFileOutputDelimited using the Row > Main link.
    In this example, the data to be migrated is assumed to be delimited data. For this reason, the components specific to delimited data are used.
  6. Leave tAzureFSConfiguration alone without any connection.
  7. Double-click tAzureFSConfiguration to open its Component view.
    Spark uses this component to connect to your ADLS Gen1 storage account from which you migrate data to the mounted ADLS Gen2 filesystem.
  8. From the Azure FileSystem drop-down list, select Azure Datalake Storage.
  9. In the Datalake storage account field, enter the name of the Data Lake Storage account you need to access.
  10. In the Client ID and the Client key fields, enter, respectively, the authentication ID and the authentication key generated upon the registration of the application used to access ADLS Gen1.

    Ensure that the application to be used has appropriate permissions to access Azure Data Lake. You can check this on the Required permissions view of this application on Azure. For further information, see Azure documentation Assign the Azure AD application to the Azure Data Lake Storage account file or folder.

  11. In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you can obtain from the Endpoints list accessible on the App registrations page on your Azure portal.
  12. Double-click tFileInputDelimited to open its Component view.

    Example

  13. Select the Define a storage configuration component check box to use the ADLS Gen1 connection configuration from tAzureFSConfiguration.
  14. In the Folder/File field, enter the directory in which the data to be migrated is stored in your ADLS Gen1 folder.
  15. Click the [...] button next to Edit schema to define the schema of the data to be migrated and accept the propagation of the schema to the component that follows, that is to say, tFileOutputDelimited.

    Example

    This image is for demonstration purposes only. In this example schema, the data has only two columns: FirstName and LastName.

  16. In Row separator and Field separator, enter the separators used in your data, respectively.
  17. Double-click tFileOutputDelimited to open its Component view.

    Example

  18. Clear the Define a storage configuration component check box to use the DBFS system of your Databricks cluster.
  19. In the Folder field, enter the directory to be used to store the migrated data in the mounted ADLS Gen2 filesystem. For example, in this /mnt/adlsgen2/fromgen1 directory, adlsgen2 is the mount name specified when the filesystem was mounted and fromgen1 is the folder to be used to store the migrated data.
  20. From the Action drop-down list, select Create if the folder to be used does not exist yet on Azure Data Lake Storage; if this folder already exists, select Overwrite.

Configuring the Spark cluster on Azure Databricks

Procedure

  1. On the Configuration tab of your Databricks cluster page, scroll down to the Spark tab at the bottom of the page.
  2. Click Edit to make the fields on this page editable.
  3. As your Spark cluster uses tAzureFSConfiguration to connect to the ADLS Gen1 folder from which you move data to Gen2, enter, in this Spark tab, the Spark properties for the credentials to be used to access that ADLS Gen1 folder:
    spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
    spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
    spark.hadoop.dfs.adls.oauth2.credential <your_client_secret>
    spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token

    Add each of these ADLS Gen1 related properties on a separate line.

  4. Restart your Spark cluster.
  5. In the Spark UI tab of your Databricks cluster page, click Environment to display the list of properties and verify that each of the properties you added in the previous steps is present on that list.
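
    Alternatively, you can run a quick check from a notebook attached to the restarted cluster: Spark strips the spark.hadoop. prefix and copies these values into the Hadoop configuration of the cluster. The following lines are a minimal sketch, assuming the property names entered in step 3; the client secret is deliberately not printed.

    Example

    // Print the ADLS Gen1 OAuth settings as seen by the cluster's Hadoop configuration.
    Seq("dfs.adls.oauth2.access.token.provider.type",
        "dfs.adls.oauth2.client.id",
        "dfs.adls.oauth2.refresh.url")
      .foreach(k => println(s"$k = ${spark.sparkContext.hadoopConfiguration.get(k)}"))
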
  6. In the Spark configuration tab of the Run view of your Job, enter the basic connection information to Databricks.

    Standalone

    • In the Endpoint field, enter the URL address of your Azure Databricks workspace. This URL can be found in the Overview blade of your Databricks workspace page on your Azure portal. For example, this URL could look like https://westeurope.azuredatabricks.net.

    • In the Cluster ID field, enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property on the properties list in the Environment tab in the Spark UI view of your cluster.

      You can also easily find this ID from the URL of your Databricks cluster. It is present immediately after cluster/ in this URL.

    • Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Token management from the Azure documentation.

    • In the DBFS dependencies folder field, enter the directory that is used to store your Job related dependencies on Databricks Filesystem at runtime, putting a slash (/) at the end of this directory. For example, enter /jars/ to store the dependencies in a folder named jars. This folder is created on the fly if it does not already exist.

    • Poll interval when retrieving Job status (in ms): enter, without the quotation marks, the time interval (in milliseconds) at the end of which you want the Studio to ask Spark for the status of your Job. For example, this status could be Pending or Running.

      The default value is 300000, that is, 5 minutes. This interval is recommended by Databricks to correctly retrieve the Job status.

    • Use transient cluster: you can select this check box to leverage the transient Databricks clusters.

      The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.

      1. Autoscale: select or clear this check box to define the number of workers to be used by your transient cluster.
        1. If you select this check box, autoscaling is enabled. Then define the minimum number of workers in Min workers and the maximum number of workers in Max workers. Your transient cluster is scaled up and down within this scope based on its workload.

          According to the Databricks documentation, autoscaling works best with Databricks runtime versions 3.0 or onwards.

        2. If you clear this check box, autoscaling is deactivated. Then define the number of workers a transient cluster is expected to have. This number does not include the Spark driver node.
      2. Node type and Driver node type: select the node types for the workers and the Spark driver node. These types determine the capacity of your nodes and their pricing by Databricks.

        For details about these node types and the Databricks Units they use, see Supported Instance Types from the Databricks documentation.

      3. Elastic disk: select this check box to enable your transient cluster to automatically scale up its disk space when its Spark workers are running low on disk space.

        For more details about this elastic disk feature, search for the section about autoscaling local storage from your Databricks documentation.

      4. SSH public key: if SSH access has been set up for your cluster, enter the public key of the generated SSH key pair. This public key is automatically added to each node of your transient cluster. If no SSH access has been set up, ignore this field.

        For further information about SSH access to your cluster, see SSH access to clusters from the Databricks documentation.

      5. Configure cluster log: select this check box to define where to store your Spark logs for the long term. This storage system could be S3 or DBFS.
    • Do not restart the cluster when submitting: select this check box to prevent the Studio from restarting the cluster when the Studio is submitting your Jobs. However, if you make changes in your Jobs, clear this check box so that the Studio restarts your cluster to take these changes into account.
  7. Press F6 to run this Job to start the migration.
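
    Once the Job finishes successfully, you can verify the migration from a Databricks Notebook by listing the output folder in the mounted ADLS Gen2 filesystem. This is a minimal sketch, assuming the /mnt/adlsgen2/fromgen1 folder used earlier in this example.

    Example

    // List the files the Job wrote to the mounted ADLS Gen2 filesystem.
    dbutils.fs.ls("/mnt/adlsgen2/fromgen1").foreach(f => println(s"${f.path}  ${f.size} bytes"))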