HDFS components and Azure Data Lake Store (ADLS)

EnrichVersion
6.5
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Open Studio for Big Data
Talend Big Data
task
Data Quality and Preparation > Third-party systems > File components (Integration) > HDFS components
Design and Development > Third-party systems > File components (Integration) > HDFS components
Data Governance > Third-party systems > File components (Integration) > HDFS components
Data Governance > Third-party systems > Cloud storages > Azure components > Azure Data Lake Store components
Design and Development > Third-party systems > Cloud storages > Azure components > Azure Data Lake Store components
Data Quality and Preparation > Third-party systems > Cloud storages > Azure components > Azure Data Lake Store components
EnrichPlatform
Talend Studio

Using HDFS components to work with Azure Data Lake Store (ADLS)

This scenario describes how to use the HDFS components to read data from and write data to Azure Data Lake Store.

This scenario applies only to Talend products with Big Data.

The Job in this scenario uses the following components:

  • tFixedFlowInput: provides the sample data to the Job.

  • tLibraryLoad: loads the required libraries into the Job.

  • tHDFSOutput: writes the sample data to Azure Data Lake Store.

  • tHDFSInput: reads the sample data back from Azure Data Lake Store.

  • tLogRow: displays the output of the Job in the console of the Run view.

Configuring your Azure Data Lake Store

Before you begin

An Azure subscription is required.

Procedure

  1. Create your Azure Data Lake Store account. For more details about how to do this, see Azure documentation: Create an Azure Data Lake Store account.
  2. Create an Azure Active Directory application on your Azure portal. For more details about how to do this, see the "Create an Azure Active Directory application" section in Azure documentation: Use portal to create an Azure Active Directory application.
  3. Obtain the application ID and the authentication key from the portal.
    1. In the list of registered applications, click the application you created and registered in the previous step to display its information blade.
    2. In the Essentials area, copy its application ID.
    3. Click All settings to display the Settings blade and click Required permissions on that blade.
    4. On the Required permissions blade, click Windows Azure Active Directory to display the Enable Access blade.
    5. Select the permissions to be granted to your application and click Save to close the Enable Access blade. The administrator of your Azure portal may need to give consent before the grant takes effect.
    6. Still on the Required permissions blade of your application, click Add, then on the Add API access blade, click Select an API.
    7. Click Azure Data Lake, then click Select to validate your selection and automatically open the Enable Access blade of this API.
    8. Select the permission to be granted and click Select to close the Enable Access blade.
    9. On the Add API access blade, click Done to return to the Settings blade of your application.
    10. Click Keys to open the Keys blade.
    11. In the Password area, enter a description of your key, set its validity duration and click Save to display the value of your key.
    12. Copy the key value and store it somewhere safe, because you cannot retrieve it once you leave this blade.
  4. Go back to the list of Data Lake Store services, select the Data Lake Store you created at the beginning of this procedure and click Data Explorer.
  5. On the blade that is opened, click Access to open the Access blade.
  6. Click Add and on the Select User or Group blade, search for your application, select it and click the Select button to open the Select Permission blade.
  7. Select the permission to be assigned to your application and click OK.
    In this example, select all the permissions.
  8. Obtain the Azure OAuth 2.0 token endpoint by proceeding as follows:
    1. Click Azure Active Directory and on the blade that is displayed, click App registrations.
    2. On the App registrations blade, click Endpoints and on the Endpoints blade, copy the value of the OAUTH 2.0 TOKEN ENDPOINT field. You now have the application ID, the authentication key and the token endpoint; a quick way to verify them together is sketched after this procedure.
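
If you want to check the three values before building the Job, you can request a token directly, which is essentially the same client-credential exchange the ADLS connector performs at runtime. The following standalone Java sketch is not part of the scenario: TOKEN_ENDPOINT, APP_ID and APP_KEY are placeholders for the values you just copied, and the resource URL is, to the best of our knowledge, the one the Azure Data Lake Store Java SDK uses for client-credential requests.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class AdlsCredentialCheck {

        // Placeholders: paste the values copied from the Azure portal.
        static final String TOKEN_ENDPOINT = "https://login.microsoftonline.com/<tenant_id>/oauth2/token";
        static final String APP_ID = "<application_id>";
        static final String APP_KEY = "<authentication_key>";

        public static void main(String[] args) throws Exception {
            // Client-credential grant, the same exchange the ClientCredential
            // token provider performs when the Job runs.
            String body = "grant_type=client_credentials"
                    + "&resource=" + URLEncoder.encode("https://management.core.windows.net/", "UTF-8")
                    + "&client_id=" + URLEncoder.encode(APP_ID, "UTF-8")
                    + "&client_secret=" + URLEncoder.encode(APP_KEY, "UTF-8");

            HttpURLConnection conn = (HttpURLConnection) new URL(TOKEN_ENDPOINT).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }

            // HTTP 200 with a JSON payload containing "access_token" means the
            // application ID, the key and the endpoint are consistent.
            int status = conn.getResponseCode();
            System.out.println("HTTP " + status);
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    status == 200 ? conn.getInputStream() : conn.getErrorStream(), "UTF-8"))) {
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);
                }
            }
        }
    }

A 200 response confirms the credentials; an error payload usually points at the value that is wrong, for example a mistyped key or an endpoint copied from another tenant.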

Creating an HDFS Job in the Studio

Procedure

  1. In the Integration perspective, drop the following components from the Palette onto the design workspace: tFixedFlowInput, tHDFSOutput, tHDFSInput, tLogRow and three tLibraryLoad components.
  2. Connect tFixedFlowInput to tHDFSOutput using a Row > Main link.
  3. Do the same to connect tHDFSInput to tLogRow.
  4. Double-click one of the three tLibraryLoad components to open its Component view.
  5. Click the [...] button to open the Module wizard and select the library to be loaded.

    In this example, load azure-data-lake-store-sdk-2.1.4.jar. This is one of the libraries required by the HDFS components to work with Azure Data Lake Store. You can find this JAR in the MVN repository, for example on its Azure Data Lake Store Java Client SDK page.

  6. Do the same to use the other two tLibraryLoad components to load the other two libraries.

    In this example, these libraries are hadoop-azure-datalake-2.6.0-cdh5.12.1.jar and jackson-core-2.8.4.jar.

Configuring the HDFS components to work with Azure Data Lake Store

Procedure

  1. Double-click tFixedFlowInput to open its Component view and define the sample data to provide to the Job.

    The sample data to be used contains only one row with two columns: id and name.

  2. Click the [...] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the two columns and rename them to id and name.
  4. Click OK to close the schema editor and validate the schema.
  5. In the Mode area, select Use single table.

    The id and name columns automatically appear in the Value table, and you can enter the values you want, within double quotation marks, in the Value column for these two columns.

  6. Double-click tHDFSOutput to open its Component view.
  7. In the Version area, select the distribution to be used and the version of this distribution.
  8. In the NameNode URI field, enter the URI of your Data Lake Store.

    For example, if your Data Lake Store account is named my_app, the NameNode URI to use is adl://my_app.azuredatalakestore.net.

  9. In the Advanced settings tab, add the following properties to the Hadoop properties table, each name and value enclosed in double quotation marks:

    dfs.adls.oauth2.access.token.provider.type: ClientCredential
    fs.adl.impl: org.apache.hadoop.fs.adl.AdlFileSystem
    fs.AbstractFileSystem.adl.impl: org.apache.hadoop.fs.adl.Adl
    dfs.adls.oauth2.client.id: the application ID you obtained in the previous steps
    dfs.adls.oauth2.credential: the authentication key you obtained in the previous steps
    dfs.adls.oauth2.refresh.url: the Azure OAuth 2.0 token endpoint you obtained in the previous steps
    dfs.adls.oauth2.access.token.provider: org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider

    A standalone Java equivalent of this configuration is sketched after this procedure.

  10. Do the same configuration for tHDFSInput.
  11. Press F6 to run your Job.
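
For reference, the settings above map one-to-one onto a plain Hadoop Configuration. The sketch below is a minimal standalone equivalent of what the Job does, not the code the Studio generates. It assumes the three libraries loaded by the tLibraryLoad components, plus the Hadoop client libraries, are on the classpath, that my_app is the Data Lake Store name used earlier, and that the placeholder values are the ones obtained from the Azure portal; the path /user/talend/sample.csv is made up for the example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AdlsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The same name/value pairs as the Hadoop properties table of
            // tHDFSOutput and tHDFSInput.
            conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
            conf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");
            conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential");
            conf.set("dfs.adls.oauth2.access.token.provider",
                    "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider");
            conf.set("dfs.adls.oauth2.client.id", "<application_id>");
            conf.set("dfs.adls.oauth2.credential", "<authentication_key>");
            conf.set("dfs.adls.oauth2.refresh.url", "<oauth_2.0_token_endpoint>");

            // The NameNode URI of the Job: the Data Lake Store account.
            FileSystem fs = FileSystem.get(new URI("adl://my_app.azuredatalakestore.net"), conf);
            Path file = new Path("/user/talend/sample.csv");

            // tFixedFlowInput -> tHDFSOutput: write the one-row sample data.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("1;sample_name\n");
            }

            // tHDFSInput -> tLogRow: read the file back and print it.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), "UTF-8"))) {
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);
                }
            }
        }
    }

If the Job or the sketch fails with an authentication error, recheck the three OAuth values; if it fails with a permission error, recheck the access you granted to the application on the Data Lake Store in the first procedure.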