Before you begin
- Ensure that your Spark cluster in Databricks has been properly created and is running. For further information, see Create Databricks workspace in the Azure documentation.
- Ensure that you have added the Spark properties holding the credentials to be used to access your Azure Data Lake Storage Gen1 system, one property per line:
spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
spark.hadoop.dfs.adls.oauth2.credential <your_authentication_key>
spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token
- You have an Azure account.
- The Azure Data Lake Storage service to be used has been properly created and your Azure Active Directory application has the appropriate permissions to access it. You can ask the administrator of your Azure system to confirm this, or follow the procedure described in the section called Granting the application to be used the access to your ADLS Gen1 folder in Moving data from ADLS Gen1 to ADLS Gen2 using Azure Databricks.
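If you prefer to assemble these cluster properties programmatically rather than typing them in the Databricks Spark configuration field, the four entries can be built with a small helper. This is only a sketch: the property names come from the prerequisites above, while the helper name and the placeholder argument values are illustrative.

```python
def adls_gen1_spark_properties(app_id, auth_key, tenant_id):
    """Build the four Spark properties used for ADLS Gen1 OAuth access."""
    return {
        "spark.hadoop.dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "spark.hadoop.dfs.adls.oauth2.client.id": app_id,
        "spark.hadoop.dfs.adls.oauth2.credential": auth_key,
        "spark.hadoop.dfs.adls.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# Print the properties in the "name value" form expected by the cluster UI.
props = adls_gen1_spark_properties("my-app-id", "my-key", "my-tenant")
for name, value in props.items():
    print(f"{name} {value}")
```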
Double-click tAzureFSConfiguration to open its Component view.
- From the Azure FileSystem drop-down list, select Azure Datalake Storage. The parameters specific to ADLS Gen2 are displayed.
- In the Client ID and the Client key fields, enter, respectively, the authentication ID and the authentication key (client secret) generated upon the registration of the application that the current Job you are developing uses to access Azure Data Lake Storage.
- In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you can obtain from the Endpoints list accessible on the App registrations page on your Azure portal.
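These three values are the inputs to a standard OAuth 2.0 client-credentials exchange against the token endpoint. The sketch below illustrates the shape of that exchange; the payload field names follow the Azure AD v1 token flow, the placeholder values are illustrative, and no request is actually sent here.

```python
# Placeholder tenant ID, matching the refresh URL in the prerequisites.
tenant_id = "my-tenant"
token_endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"

# Form body of a client-credentials grant: the client ID and client key
# (client secret) are the two values entered in the component, and the
# resource identifies the storage service being accessed.
payload = {
    "grant_type": "client_credentials",
    "client_id": "my-client-id",
    "client_secret": "my-client-key",
    "resource": "https://storage.azure.com/",
}

# A real exchange would POST this payload to the endpoint, e.g.:
# requests.post(token_endpoint, data=payload).json()["access_token"]
```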
Double-click the tFileOutputParquet component that receives the accepted records.
Its Basic settings view is opened in the lower part of the Studio.
- Select the Define a storage configuration component check box to reuse the configuration provided by tAzureFSConfiguration, so as to connect to the ADLS Gen2 file system to be used.
- In the Folder/File field, enter the directory in which you need to write the result. In this scenario, it is /ychen/movie_library, which receives the records that contain the names of the movie directors.
- Select Overwrite from the Action drop-down list. This way, the target directory is overwritten if it exists.
- Repeat the same operations to configure the other tFileOutputParquet component used to receive the reject link, but set the directory, in the Folder/File field, to /ychen/movie_library/reject.
In the Run view, click the Spark Configuration tab to open its view.
- Clear the Use local mode check box.
- From the Property Type drop-down list, select Repository, then click the ... button and from the Repository Content list, select the movie_library connection metadata you defined previously in Setting up the connection to your Big Data platform.
- Click OK to validate your choice. The fields in the Spark Configuration tab are automatically filled with parameters from this connection metadata.
- Press F6 to run the Job.
The Run view is automatically opened in the lower part of the Studio.
Once the Job has completed, you can check, for example in your Microsoft Azure Storage Explorer, that the output has been written to the ADLS Gen2 file system.