Using HDFS components to work with Azure Data Lake Store (ADLS)
This scenario applies only to a Talend solution with Big Data.
This Job uses the following components:
tFixedFlowInput: provides the sample data to the Job.
tLibraryLoad: loads the required libraries into the Job.
tHDFSOutput: writes the sample data to Azure Data Lake Store.
tHDFSInput: reads the sample data back from Azure Data Lake Store.
tLogRow: displays the output of the Job in the console of the Run view.
Configuring your Azure Data Lake Store
An Azure subscription is required.
- Create your Azure Data Lake Store account. For more details about how to do this, see Azure documentation: Create an Azure Data Lake Store account.
- Create an Azure Active Directory application on your Azure portal. For more details about how to do this, see the "Create an Azure Active Directory application" section in Azure documentation: Use portal to create an Azure Active Directory application.
Obtain the application ID and the authentication key from the portal.
- On the list of the registered applications, click the application you created and registered in the previous step to display its information blade.
- In the Essentials area, copy its application ID.
- Click All settings to display the Settings blade and click Required permissions on that blade.
- On the Required permissions blade, click Windows Azure Active Directory to display the Enable Access blade.
- Select the permissions to be granted to your application and click Save to close the Enable Access blade. The administrator of your Azure portal may need to give consent before the grant takes effect.
- Still on the Required permissions blade of your application, click Add and on the Add API access blade, click Select an API.
- Click Azure Data Lake and then click Select to validate your selection and automatically open the Enable Access blade of this API.
- Select the permission to be granted and click Select to close the Enable Access blade.
- On the Add API access blade, click Done to return to the Settings blade of your application.
- Click Keys to open the Keys blade.
- In the Password area, enter the description of your key, define its duration of validity and then click Save to display the value of your key.
- Copy the key value and store it somewhere safe, because you cannot retrieve it once you leave this blade.
- Go back to the list of Data Lake Store services, select the Data Lake Store you created at the beginning of this procedure and then click Data Explorer.
- On the blade that is opened, click Access to open the Access blade.
- Click Add and on the Select User or Group blade, search for your application, select it and click the Select button to open the Select Permission blade.
- Select the permissions to be assigned to your application and click OK. In this example, select all the permissions.
Obtain the Azure OAuth 2.0 token endpoint as follows:
- Click Azure Active Directory and on the blade that is displayed, click App registrations.
- On the App registrations blade, click Endpoints and on the Endpoints blade, copy the value of the OAUTH 2.0 TOKEN ENDPOINT field.
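To check that the application ID, key and token endpoint you copied actually work together, you can request a token outside of Talend using the OAuth 2.0 client-credentials grant. The sketch below only builds the request; the endpoint URL and the values in angle brackets are placeholders for what you copied from the portal, not values from this document.

```python
# Sketch only: building a client-credentials token request against the
# OAuth 2.0 token endpoint copied from the Azure portal. All angle-bracket
# values are placeholders; this is not part of the Talend Job itself.
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_ENDPOINT = "https://login.microsoftonline.com/<tenant-id>/oauth2/token"

body = urlencode({
    "grant_type": "client_credentials",
    "client_id": "<application-id>",       # application ID copied earlier
    "client_secret": "<application-key>",  # key value you saved
    "resource": "https://datalake.azure.net/",
})
request = Request(TOKEN_ENDPOINT, data=body.encode("ascii"))
# urllib.request.urlopen(request) would POST this form and return a JSON
# response whose "access_token" field is the bearer token.
```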
Creating an HDFS Job in the Studio
- In the Integration perspective, drop the following components from the Palette onto the design workspace: tFixedFlowInput, tHDFSOutput, tHDFSInput, tLogRow and three tLibraryLoad components.
- Connect tFixedFlowInput to tHDFSOutput using a link.
Do the same to connect tHDFSInput to tLogRow.
- Double-click one of the three tLibraryLoad components to open its Component view.
- Click the [...] button to open the Module wizard and select the library to be loaded.
In this example, load azure-data-lake-store-sdk-2.1.4.jar, one of the libraries required by the HDFS components to work with Azure Data Lake Store. You can download this jar from the Maven repository, for example by searching for Azure Data Lake Store Java Client SDK.
- Do the same with the other two tLibraryLoad components to load the other two libraries.
In this example, these libraries are hadoop-azure-datalake-2.6.0-cdh5.12.1.jar and jackson-core-2.8.4.jar.
Configuring the HDFS components to work with Azure Data Lake Store
- Double-click tFixedFlowInput to open its Component view and provide the sample data to the Job.
The sample data to be used contains only one row with two columns: id and name.
- Click the [...] button next to Edit schema to open the schema editor.
- Click the [+] button to add the two columns and rename them to id and name.
- Click OK to close the schema editor and validate the schema.
- In the Mode area, select Use single table.
The id and name columns automatically appear in the Value table; enter the values you want, within double quotation marks, in the Value column for the two schema columns.
- Double-click tHDFSOutput to open its Component view.
- In the Version area, leave the options as they are. These options do not impact your Job.
- In the NameNode URI field, enter the location of the NameNode service of your application. This location is actually the address of your Data Lake Store.
For example, if your Data Lake Store name is data_lake_store_name, the NameNode URI to be used is adl://data_lake_store_name.azuredatalakestore.net.
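The URI format described above can be sketched as a one-line helper; the store name used here is just the example placeholder from the text, not a real account.

```python
# Minimal sketch of how the NameNode URI is formed from the store name.
# "data_lake_store_name" is the placeholder name used in this example.
def adls_namenode_uri(store_name: str) -> str:
    """Build the adl:// URI expected in the NameNode URI field."""
    return "adl://{}.azuredatalakestore.net".format(store_name)

uri = adls_namenode_uri("data_lake_store_name")
```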
- In the Advanced settings tab, add the following parameters to the Hadoop properties table, each value being put in double quotation marks:
  - The application ID you obtained in the previous steps.
  - The authentication key you obtained in the previous steps.
  - The Azure OAuth 2.0 token endpoint you obtained in the previous steps.
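The text above does not spell out the property names to use in the Hadoop properties table. As a reference, the standard hadoop-azure-datalake connector (Hadoop 2.6+) reads these values under the dfs.adls.oauth2.* keys shown below, together with a provider-type switch; verify these names against your Hadoop distribution before use. The angle-bracket values are placeholders for what you copied from the portal.

```
dfs.adls.oauth2.access.token.provider.type   "ClientCredential"
dfs.adls.oauth2.client.id                    "<your application ID>"
dfs.adls.oauth2.credential                   "<your application key>"
dfs.adls.oauth2.refresh.url                  "<your OAuth 2.0 token endpoint>"
```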
- Do the same configuration for tHDFSInput.
- If you run your Job on Windows, follow this procedure to add the winutils.exe program to your Job.
- Press F6 to run your Job.