Uploading files to HDFS - 6.4

Talend Big Data Platform Getting Started Guide

author: Talend Documentation Team
EnrichVersion: 6.4
EnrichProdName: Talend Big Data Platform
task:
  Data Quality and Preparation > Cleansing data
  Data Quality and Preparation > Profiling data
  Design and Development
  Installation and Upgrade

Uploading a file to HDFS allows your Big Data Jobs to read and process it.

In this procedure, you will create a Job that writes data to the HDFS system of the Cloudera Hadoop cluster for which a connection has been set up in the Repository, as explained in Setting up Hadoop connection manually. This data is needed for the use case described in Performing data integration tasks for Big Data. For the files needed for the use case, download tpbd_gettingstarted_source_files.zip from the Downloads tab in the left panel of this page.

Before you begin

  • The connection to the Hadoop cluster to be used and the connection to the HDFS system of this cluster have been set up from the Hadoop cluster node in the Repository.

    If you have not done so, see Setting up Hadoop connection manually and then Setting up connection to HDFS to create these connections.

  • The Hadoop cluster to be used has been properly configured and is running, and you have the proper access permissions to that distribution and to the HDFS folder to be used.

  • Ensure that the client machine on which the Talend Jobs are executed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster to the hosts file of the client machine.

    For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.
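
    With these illustrative values, the corresponding entry in the hosts file of the client machine would look as follows (one entry per line, the IP address first, then the host name):

      192.168.x.x    talend-cdh550.weave.local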

Procedure

  1. In the Repository tree view, expand the Job Designs node, right-click the Standard node, and select Create folder from the contextual menu.
  2. In the New Folder wizard, name your Job folder getting_started and click Finish to create your folder.
  3. Right-click the getting_started folder and select Create Standard Job from the contextual menu.
  4. In the New Job wizard, give a name to the Job you are going to create and provide other useful information if needed.

    For example, enter write_to_hdfs in the Name field.

    In this step of the wizard, Name is the only mandatory field. The information you provide in the Description field will appear as hover text when you move your mouse pointer over the Job in the Repository tree view.

  5. Click Finish to create your Job.

    An empty Job is opened in the Studio.

  6. Expand the Hadoop cluster node under Metadata in the Repository tree view.
  7. Expand the Hadoop connection you have created and then the HDFS folder under it. In this example, it is the my_cdh Hadoop connection.
  8. Drop the HDFS connection from the HDFS folder into the workspace of the Job you are creating. This connection is cdh_hdfs in this example.

    The Components window is displayed to show all the components that can directly reuse this HDFS connection in a Job.

  9. Select tHDFSPut and click OK to validate your choice.

    The Components window closes and a tHDFSPut component is automatically placed in the workspace of the current Job, labelled with the name of the HDFS connection mentioned in the previous step.

  10. Double-click tHDFSPut to open its Component view.

    The connection to the HDFS system to be used has been automatically configured using the HDFS connection you have set up and stored in the Repository. The related parameters in this tab therefore become read-only. These parameters are: Distribution, Version, NameNode URI, Use Datanode Hostname, Use kerberos authentication and Username. A sketch after this procedure shows how these parameters map onto the Hadoop FileSystem API.

  11. In the Local directory field, enter the path to, or browse to, the folder in which the files to be copied to HDFS are stored.

    The files about movies and their directors are stored in this directory.

  12. In the HDFS directory field, enter the path to, or browse to, the target directory in HDFS in which to store the files.

    This directory is created on the fly if it does not exist.

  13. From the Overwrite file drop-down list, select always to overwrite the files if they already exist in the target directory in HDFS.
  14. In the Files table, add one row by clicking the [+] button in order to define the criteria to select the files to be copied.
  15. In the Filemask column, enter an asterisk (*) within the double quotation marks to make tHDFSPut select all the files stored in the folder you specified in the Local directory field.
  16. Leave the New name column empty, that is, keep the default double quotation marks as they are, so that the files keep their original names after being uploaded.
  17. Press F6 to run the Job.

    The Run view is opened automatically. It shows the progress of this Job.
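
To picture what this Job does when it runs, the sketch below performs the same upload directly with the Hadoop FileSystem API, which is what the read-only connection parameters of tHDFSPut map onto. It is illustrative only: the NameNode port (8020), user name (hdfs_user), local source folder (/tmp/getting_started_source) and HDFS target directory (/user/hdfs_user/getting_started) are assumed values, not taken from the tutorial. Replace them with the values of your own my_cdh and cdh_hdfs connections and of steps 11 and 12.

    import java.io.File;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Counterpart of the Use Datanode Hostname check box of the connection.
            conf.set("dfs.client.use.datanode.hostname", "true");

            // NameNode URI and Username of the connection; the port and user are assumed values.
            FileSystem fs = FileSystem.get(
                    new URI("hdfs://talend-cdh550.weave.local:8020"), conf, "hdfs_user");

            // HDFS directory (step 12); created on the fly if it does not exist.
            Path target = new Path("/user/hdfs_user/getting_started");
            fs.mkdirs(target);

            // Local directory (step 11) with the "*" filemask (step 15): copy every file,
            // keeping its name (step 16) and overwriting any existing copy (step 13).
            for (File local : new File("/tmp/getting_started_source").listFiles()) {
                fs.copyFromLocalFile(false, true,
                        new Path(local.getAbsolutePath()), new Path(target, local.getName()));
            }
            fs.close();
        }
    }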

Results

When the Job is done, the files you uploaded can be found in HDFS in the directory you have specified.
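
To double-check the result outside the Studio, you can list that directory, for example with the hdfs dfs -ls command on the cluster, or with a small program along the lines of the previous sketch (using the same assumed NameNode URI, user name and target directory):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsDirectory {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    new URI("hdfs://talend-cdh550.weave.local:8020"), new Configuration(), "hdfs_user");
            // Print the name and size of each file uploaded by the write_to_hdfs Job.
            for (FileStatus status : fs.listStatus(new Path("/user/hdfs_user/getting_started"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }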