Writing the output to HDFS

Talend Data Fabric Getting Started Guide

Two output components are configured to write the expected movie data and the rejected movie data to different directories in HDFS.

Before you begin

  • Ensure that the client machine on which the Talend Jobs are executed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/host name mapping entries for the services of that Hadoop cluster to the hosts file of the client machine.

    For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.

  • The Hadoop cluster to be used has been properly configured and is running.

  • The administrator of the cluster has granted the username to be used read/write permissions on the relevant data and directories in HDFS. (A minimal programmatic check of these prerequisites is sketched after this list.)
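
If you want to check these prerequisites programmatically, the following Java sketch verifies that the NameNode host name resolves from the client machine and that the username can create a file under the target directory. It uses the Hadoop FileSystem API; the NameNode port (8020) is an assumption based on a typical CDH setup, and the host name, username, and path are the ones used in this scenario. Adjust all of them to your cluster.

    import java.net.InetAddress;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAccessCheck {
        public static void main(String[] args) throws Exception {
            String namenodeHost = "talend-cdh550.weave.local"; // from this scenario
            String hdfsUri = "hdfs://" + namenodeHost + ":8020"; // 8020 is an assumed NameNode port
            String user = "ychen";

            // 1. The client machine must resolve the NameNode host name (hosts file entry).
            System.out.println(InetAddress.getByName(namenodeHost));

            // 2. The user must be able to create files under the target directory.
            FileSystem fs = FileSystem.get(URI.create(hdfsUri), new Configuration(), user);
            Path probe = new Path("/user/" + user + "/output_data/.write_probe");
            fs.create(probe, true).close(); // throws AccessControlException if permissions are missing
            fs.delete(probe, false);
            fs.close();
            System.out.println("Host resolution and HDFS write access look fine.");
        }
    }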

Procedure

  1. Double-click the tHDFSOutput component, which receives the out1 link.

    Its Basic settings view is opened in the lower part of the Studio.

  2. In the Folder field, enter or browse to the directory you need to write the result in. In this scenario, it is /user/ychen/output_data/mapreduce/out, which receives the records that contain the names of the movie directors.
  3. Select Overwrite from the Action drop-down list. This way, the target directory is overwritten if it exists.
  4. Select the Merge result to single file check box to merge the part-* files typically generated by MapReduce into a single file. The Merge file path field is displayed. (A sketch of what this merge does is given after this procedure.)
  5. In the Merge file path field, enter or browse to the file into which you want the part-* files to be merged.

    In this scenario, this file is /user/ychen/output_data/mapreduce/out/merged.

  6. Repeat the same operations to configure the tFileOutputDelimited component, which receives the reject link, but set the Folder field to /user/ychen/output_data/mapreduce/reject and leave the Merge result to single file check box cleared.
  7. In the Run view, click the Hadoop Configuration tab to verify that the Hadoop connection metadata has been properly imported from the Repository.

    You always need to use this Hadoop Configuration tab to define the connection to a given Hadoop distribution for the whole MapReduce Job; this connection is effective on a per-Job basis.

  8. Press F6 to run the Job.
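
For reference, the Merge result to single file option behaves like a concatenation of the part-* files in the output directory. Up to Hadoop 2.x, a comparable effect can be obtained with the FileUtil.copyMerge utility, as in the sketch below. This is an illustration of the merge, not the code the Studio generates; the NameNode URI is an assumption and the paths echo this scenario.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergePartFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode URI; adjust to your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://talend-cdh550.weave.local:8020"), conf);

            Path partDir = new Path("/user/ychen/output_data/mapreduce/out");
            // The destination is placed next to, not inside, the source directory so
            // that the merge does not read its own output; tHDFSOutput handles this itself.
            Path merged = new Path("/user/ychen/output_data/mapreduce/merged");

            // Concatenate all files in partDir into a single file, keeping the
            // original part-* files (deleteSource = false).
            FileUtil.copyMerge(fs, partDir, fs, merged, false, conf, null);
            fs.close();
        }
    }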

Results

The Run view is automatically opened in the lower part of the Studio and shows the execution progress of this Job.

The Job itself also shows the progress graphically.

Once done, you can check, for example in the web console of your HDFS system, that the output has been written to HDFS.

A merged file has also been created.
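
If you prefer to check from code instead of the web console, the following sketch uses the Hadoop FileSystem API to list the output directory and print the first lines of the merged file. The NameNode URI is an assumption; the paths are the ones used in this scenario.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckJobOutput {
        public static void main(String[] args) throws Exception {
            // Assumed NameNode URI; adjust to your cluster.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://talend-cdh550.weave.local:8020"), new Configuration());

            // List what the Job wrote to the output directory.
            for (FileStatus status : fs.listStatus(new Path("/user/ychen/output_data/mapreduce/out"))) {
                System.out.println(status.getPath() + "  (" + status.getLen() + " bytes)");
            }

            // Print the first lines of the merged file.
            Path merged = new Path("/user/ychen/output_data/mapreduce/out/merged");
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(merged)))) {
                String line;
                for (int i = 0; i < 10 && (line = reader.readLine()) != null; i++) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }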