Two output components are configured to write the expected movie data and
the rejected movie data to different directories in an Azure ADLS Gen1
folder.
Before you begin
- Ensure that your Spark cluster in Databricks has been properly created and is running. For further information, see Create a Databricks workspace in the Azure documentation.
- Ensure that you have added the Spark properties regarding the credentials to be used to access your Azure Data Lake Storage Gen1 system, one per line:
spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
spark.hadoop.dfs.adls.oauth2.credential <your_authentication_key>
spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token
- You have an Azure account.
- The Azure Data Lake Storage service to be used has been properly created and your Azure Active Directory application has the appropriate permissions to access it. You can ask the administrator of your Azure system to be certain of this, or follow the procedure described in the section called Granting the application to be used the access to your ADLS Gen1 folder in Moving data from ADLS Gen1 to ADLS Gen2 using Azure Databricks.
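The four Spark properties listed above can also be assembled programmatically before pasting them into the cluster's Spark configuration box, which helps avoid typos in the long property names. A minimal sketch in Python, assuming placeholder values for the application ID, authentication key, and tenant ID:

```python
# Build the ADLS Gen1 OAuth properties for the Databricks cluster's
# Spark configuration. The three values below are placeholders;
# replace them with your own registration details.
app_id = "<your_app_id>"
auth_key = "<your_authentication_key>"
tenant_id = "<your_app_TENANT-ID>"

properties = {
    "spark.hadoop.dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "spark.hadoop.dfs.adls.oauth2.client.id": app_id,
    "spark.hadoop.dfs.adls.oauth2.credential": auth_key,
    "spark.hadoop.dfs.adls.oauth2.refresh.url":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# One "key value" pair per line, as expected by the Spark config box.
config_text = "\n".join(f"{key} {value}" for key, value in properties.items())
print(config_text)
```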
Procedure
- Double-click tAzureFSConfiguration to open its Component view.
- From the Azure FileSystem drop-down list, select Azure Datalake Storage. The parameters specific to Azure ADLS Gen1 are displayed.
- In the Client ID and the Client key fields, enter, respectively, the authentication ID and the authentication key (client secret) generated upon the registration of the application that the current Job you are developing uses to access Azure Data Lake Storage.
- In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you can obtain from the Endpoints list accessible on the App registrations page on your Azure portal.
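The OAuth 2.0 token endpoint follows a fixed pattern based on your tenant ID, matching the refresh URL from the prerequisites, so it can also be derived directly rather than copied from the portal. A sketch, assuming a placeholder tenant ID:

```python
# Derive the OAuth 2.0 token endpoint for a given Azure AD tenant.
# The tenant ID below is a placeholder assumption.
tenant_id = "<your_app_TENANT-ID>"
token_endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
print(token_endpoint)
```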
- Double-click the tFileOutputParquet component which receives the out1 link. Its Basic settings view is opened in the lower part of the Studio.
- Select the Define a storage configuration component check box to reuse the configuration provided by tAzureFSConfiguration, so as to connect to the ADLS Gen1 file system to be used.
- In the Folder/File field, enter the directory in which you need to write the result. In this scenario, it is /ychen/movie_library, which receives the records that contain the names of the movie directors.
- Select Overwrite from the Action drop-down list. This way, the target directory is overwritten if it exists.
- Repeat the same operations to configure the other tFileOutputParquet component used to receive the reject link, but set the directory, in the Folder/File field, to /ychen/movie_library/reject.
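The split performed upstream of the two tFileOutputParquet components, between records that carry a director name (the out1 flow) and those that do not (the reject flow), can be illustrated with a small Python sketch; the sample records below are hypothetical:

```python
# Hypothetical movie records; only those with a non-empty director
# name go to the expected flow (out1), the rest to the reject flow.
movies = [
    {"title": "Movie A", "director": "Jane Doe"},
    {"title": "Movie B", "director": ""},
    {"title": "Movie C", "director": "John Smith"},
]

out1 = [m for m in movies if m["director"]]
reject = [m for m in movies if not m["director"]]

# out1 would be written to /ychen/movie_library,
# reject to /ychen/movie_library/reject.
print(len(out1), len(reject))
```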
- In the Run view, click the Spark Configuration tab to open its view.
- Clear the Use local mode check box.
- From the Property Type drop-down list, select Repository, then click the ... button and, from the Repository Content list, select the movie_library connection metadata you defined previously in Setting up the connection to your Big Data platform.
- Click OK to validate your choice. The fields in the Spark Configuration tab are automatically filled with the parameters from this connection metadata.
- Press F6 to run the Job.
Results
The Run view is automatically opened in the lower part of the Studio.
Once done, you can check, for example in your Microsoft Azure Storage Explorer, that the output has been written in the ADLS Gen1 folder.