Scenario: Iterating on an HDFS directory - 6.1

Talend Components Reference Guide


This scenario uses a two-component Job to iterate on a specified directory in HDFS and retrieve the files it contains into a local directory.

Preparing the data to be used

  • Create the files to be iterated on in the HDFS you want to use. In this scenario, two files are created in the directory: /user/ychen/data/hdfs/out.

    You can design a Job in the Studio to create the two files. For further information, see tHDFSPut or tHDFSOutput.
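Outside the Studio, the preparation step amounts to placing two files in the source directory. The sketch below simulates this with `java.nio`, using a local folder named `out` as a stand-in for the HDFS path `/user/ychen/data/hdfs/out`; the file names and contents are arbitrary examples, not part of the scenario.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PrepareSampleFiles {

    public static void main(String[] args) throws IOException {
        // Local stand-in for the HDFS directory /user/ychen/data/hdfs/out
        Path dir = Paths.get("out");
        Files.createDirectories(dir);

        // Two sample files for tHDFSList to iterate on; names and
        // contents are hypothetical
        Files.write(dir.resolve("file1.txt"), "sample content 1\n".getBytes());
        Files.write(dir.resolve("file2.txt"), "sample content 2\n".getBytes());
    }
}
```

In the actual scenario these files live in HDFS, so you would create them with a Studio Job (tHDFSPut or tHDFSOutput) as noted above.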

Linking the components

  1. In the Integration perspective of Talend Studio, create an empty Job, named HDFSList for example, from the Job Designs node in the Repository tree view.

    For further information about how to create a Job, see the Talend Studio User Guide.

  2. Drop tHDFSList and tHDFSGet onto the workspace.

  3. Connect them using the Row > Iterate link.

Configuring the iteration

  1. Double-click tHDFSList to open its Component view.

  2. In the Version area, select the Hadoop distribution you are connecting to and its version.

  3. In the Connection area, enter the values of the parameters required to connect to the HDFS.

    In real-world practice, you can use tHDFSConnection to create a connection and reuse it in the current component. For further information, see tHDFSConnection.

  4. In the HDFS Directory field, enter the path to the folder where the files to be iterated on are. In this example, as presented earlier, the directory is /user/ychen/data/hdfs/out/.

  5. In the FileList Type field, select File.

  6. In the Files table, click the [+] button to add one row and enter * between the quotation marks to iterate on all existing files.
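The iteration configured above selects every file in the directory whose name matches the * filemask. A minimal local sketch of that selection logic, using a `java.nio` glob over an ordinary directory rather than Talend's actual HDFS listing, could look like this (the class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FileMaskIteration {

    // Returns the regular files in dir whose names match the glob
    // filemask, e.g. "*" to select every file, mirroring the Files
    // table of tHDFSList.
    public static List<Path> listFiles(Path dir, String filemask) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, filemask)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    matches.add(p);
                }
            }
        }
        return matches;
    }
}
```

Each match then drives one pass of the Iterate link, which is how the two files created earlier are handed to tHDFSGet one by one.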

Selecting the files

  1. Double-click tHDFSGet to open its Component view.

  2. In the Version area, select the Hadoop distribution you are connecting to and its version.

  3. In the Connection area, enter the values of the parameters required to connect to the HDFS.

    In real-world practice, you may have used tHDFSConnection to create a connection; you can then reuse it in the current component. For further information, see tHDFSConnection.

  4. In the HDFS directory field, enter the path to the folder holding the files to be retrieved.

    To do this with the auto-completion list, place the cursor in this field, then press Ctrl+Space to display the list and select the tHDFSList_1_CURRENT_FILEDIRECTORY variable to reuse the directory you defined in tHDFSList. In this variable name, tHDFSList_1 is the label of the component; if you have labeled it differently, select the corresponding variable.

    Once this variable is selected, the field reads, for example, ((String)globalMap.get("tHDFSList_1_CURRENT_FILEDIRECTORY")).

    For further information about how to label a component, see the Talend Studio User Guide.

  5. In the Local directory field, enter the path, or browse to the folder you want to place the selected files in. This folder will be created if it does not exist. In this example, it is C:/hdfsFiles.

  6. In the Overwrite file field, select always.

  7. In the Files table, click the [+] button to add one row and enter * between the quotation marks in the Filemask column in order to retrieve all existing files.
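The Ctrl+Space variable used in step 4 works because a Talend Job shares state between components through a `Map<String, Object>` called globalMap: on each iteration, tHDFSList publishes the directory of the current file under the key `<label>_CURRENT_FILEDIRECTORY`, and the Studio generates the cast-and-get expression shown above. A self-contained sketch of that mechanism, with the value hard-coded for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalMapDemo {

    // The expression the Studio generates when you pick the variable
    // from the Ctrl+Space auto-completion list: a get() on globalMap
    // followed by a cast to String.
    public static String resolveDirectory(Map<String, Object> globalMap) {
        return ((String) globalMap.get("tHDFSList_1_CURRENT_FILEDIRECTORY"));
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        // In a real Job, tHDFSList sets this on each iteration; the
        // value below is the example directory from this scenario.
        globalMap.put("tHDFSList_1_CURRENT_FILEDIRECTORY", "/user/ychen/data/hdfs/out");
        System.out.println(resolveDirectory(globalMap));
    }
}
```

Because the key embeds the component label, relabeling tHDFSList_1 changes the variable name, which is why the note above tells you to select the variable matching your own label.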

Executing the Job

  • Press F6 to execute this Job.

Once done, you can check the files created in the local directory.
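The retrieval performed by tHDFSGet can be summed up as: create the local directory if it does not exist, then copy every matching file into it, overwriting any existing copy. A local `java.nio` sketch of those semantics (directory-to-directory copy standing in for the HDFS-to-local transfer; the class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class GetFiles {

    // Copies every regular file from source to target, creating the
    // target directory if needed and overwriting existing files. This
    // mirrors the tHDFSGet settings used above (Filemask "*",
    // Overwrite file: always). Returns the number of files copied.
    public static int getFiles(Path source, Path target) throws IOException {
        Files.createDirectories(target); // created if it does not exist
        int copied = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(source)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    Files.copy(p, target.resolve(p.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        return copied;
    }
}
```

With the scenario's settings, re-running the Job simply replaces the previously retrieved files, so the local directory always reflects the current contents of /user/ychen/data/hdfs/out/.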