Scenario: Computing data with Hadoop distributed file system - 6.3

Talend Open Studio for Big Data Components Reference Guide

Version: 6.3
Product: Talend Open Studio for Big Data
Tasks: Data Governance, Data Quality and Preparation, Design and Development
Platform: Talend Studio

The following scenario describes a simple Job that creates a file in a defined local directory, loads it into HDFS, retrieves it from HDFS into another local directory, and reads it at the end of the Job.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tFileOutputDelimited, tHDFSPut, tHDFSGet, tFileInputDelimited and tLogRow.

  2. Connect tFixedFlowInput to tFileOutputDelimited using a Row > Main connection.

  3. Connect tFileInputDelimited to tLogRow using a Row > Main connection.

  4. Connect tFixedFlowInput to tHDFSPut using an OnSubjobOk connection.

  5. Connect tHDFSPut to tHDFSGet using an OnSubjobOk connection.

  6. Connect tHDFSGet to tFileInputDelimited using an OnSubjobOk connection.

Configuring the input component

  1. Double-click tFixedFlowInput to define the component in its Basic settings view.

  2. Set the Schema to Built-In and click the three-dot [...] button next to Edit Schema to describe the data structure you want to create from internal variables. In this scenario, the schema contains one column: content.

  3. Click the plus button to add the parameter line.

  4. Click OK to close the dialog box and accept to propagate the changes when prompted by the studio.

  5. In the Basic settings view, select the Use Single Table option in the Mode area and define the corresponding value. In this scenario, the value is "Hello world!".

Configuring the tFileOutputDelimited component

  1. Double-click tFileOutputDelimited to define the component in its Basic settings view.

  2. Click the [...] button next to the File Name field and browse to the output file you want to write the data to, in.txt in this example.
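
If you want to reproduce the result of this first subjob outside the Studio, the following minimal Java sketch writes the same one-row delimited file; the directory C:/hadoopfiles/putFile/ is the one used later by tHDFSPut in this scenario, and the class name is only illustrative.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Writes the single "Hello world!" row produced by tFixedFlowInput into
    // the delimited file that tFileOutputDelimited creates in this scenario.
    public class WriteLocalInput {
        public static void main(String[] args) throws Exception {
            Path out = Paths.get("C:/hadoopfiles/putFile/in.txt");
            Files.createDirectories(out.getParent());
            Files.write(out, "Hello world!\n".getBytes(StandardCharsets.UTF_8));
        }
    }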

Loading the data from the local file

  1. Double-click tHDFSPut to define the component in its Basic settings view.

  2. Select, for example, Apache 0.20.2 from the Hadoop version list.

  3. In the NameNode URI, the Username and the Group fields, enter the connection parameters to the HDFS.

  4. Next to the Local directory field, click the three-dot [...] button to browse to the folder with the file to be loaded into the HDFS. In this scenario, the directory has been specified while configuring tFileOutputDelimited: C:/hadoopfiles/putFile/.

  5. In the HDFS directory field, type in the intended location in HDFS to store the file to be loaded. In this example, it is /testFile.

  6. Click the Overwrite file field to expand the drop-down list.

  7. From the menu, select always.

  8. In the Files area, click the plus button to add a row in which you define the file to be loaded.

  9. In the File mask column, enter "*.txt" (keeping the quotation marks) to replace the default newLine value, and leave the New name column as it is. This allows you to load all the .txt files in the specified directory into HDFS without changing their names. In this example, the file is in.txt.
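
For reference only, a rough equivalent of this tHDFSPut configuration written directly against the Hadoop Java API might look as follows. The NameNode URI, port and configuration property are assumptions; replace them with the connection parameters of your cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies every .txt file from the local directory into /testFile on HDFS,
    // overwriting any existing copy, as the "always" overwrite option does.
    public class PutToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode URI
            FileSystem fs = FileSystem.get(conf);

            java.io.File localDir = new java.io.File("C:/hadoopfiles/putFile/");
            for (java.io.File f : localDir.listFiles((dir, name) -> name.endsWith(".txt"))) {
                fs.copyFromLocalFile(false, true,              // delSrc, overwrite
                        new Path(f.getAbsolutePath()),
                        new Path("/testFile/" + f.getName()));
            }
            fs.close();
        }
    }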

Getting the data from the HDFS

  1. Double-click tHDFSGet to define the component in its Basic settings view.

  2. Select, for example, Apache 0.20.2 from the Hadoop version list.

  3. In the NameNode URI, the Username and the Group fields, enter the connection parameters to the HDFS.

  4. In the HDFS directory field, type in the HDFS location where the loaded file is stored. In this example, it is /testFile.

  5. Next to the Local directory field, click the three-dot [...] button to browse to the folder intended to store the files that are extracted out of the HDFS. In this scenario, the directory is: C:/hadoopfiles/getFile/.

  6. Click the Overwrite file field to expand the drop-down list.

  7. From the menu, select always.

  8. In the Files area, click the plus button to add a row in which you define the file to be extracted.

  9. In the File mask column, enter "*.txt" (keeping the quotation marks) to replace the default newLine value, and leave the New name column as it is. This allows you to extract all the .txt files from the specified directory in HDFS without changing their names. In this example, the file is in.txt.
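
Again purely for illustration, the same retrieval can be sketched with the Hadoop Java API; the NameNode URI and configuration property are assumptions, while the HDFS and local paths follow this scenario.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies every .txt file found under /testFile back to the second local
    // directory, mirroring what tHDFSGet does in this Job.
    public class GetFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode URI
            FileSystem fs = FileSystem.get(conf);

            for (FileStatus status : fs.globStatus(new Path("/testFile/*.txt"))) {
                fs.copyToLocalFile(status.getPath(),
                        new Path("C:/hadoopfiles/getFile/" + status.getPath().getName()));
            }
            fs.close();
        }
    }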

Reading data from the HDFS and saving the data locally

  1. Double-click tFileInputDelimited to define the component in its Basic settings view.

  2. Set the Property Type to Built-In.

  3. Next to the File Name/Stream field, click the three-dot button to browse to the file you have obtained from the HDFS. In this scenario, the file is C:/hadoopfiles/getFile/in.txt.

  4. Set Schema to Built-In and click Edit schema to define the data to pass on to the tLogRow component.

  5. Click the plus button to add a new column.

  6. Click OK to close the dialog box and accept to propagate the changes when prompted by the studio.
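
The last subjob only reads the retrieved file and displays its rows. A minimal plain-Java sketch of that behaviour, assuming the file location used above, is:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Reads the file retrieved from HDFS and prints each row to the console,
    // which is roughly what tFileInputDelimited feeding tLogRow does here.
    public class ReadAndLog {
        public static void main(String[] args) throws Exception {
            for (String row : Files.readAllLines(
                    Paths.get("C:/hadoopfiles/getFile/in.txt"), StandardCharsets.UTF_8)) {
                System.out.println(row); // expected output: Hello world!
            }
        }
    }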

Executing the Job

Save the Job and press F6 to execute it.

The in.txt file is created and loaded into the HDFS.

The file is also extracted from the HDFS by tHDFSGet, read by tFileInputDelimited, and displayed in the console by tLogRow.