Uploading files to HDFS

Talend Real-time Big Data Platform Getting Started Guide

Uploading a file to HDFS allows your Big Data Jobs to read and process it.

Prerequisites:

  • The connection to the Hadoop cluster to be used and the connection to the HDFS system of this cluster have been set up from the Hadoop cluster node in the Repository.

    If you have not done so, see Setting up Hadoop connection manually and then Setting up connection to HDFS to create these connections.

  • The Hadoop cluster to be used has been properly configured and is running, and you have the proper access permissions to that distribution and to the HDFS folder to be used.

  • You have ensured that the client machine on which the Talend Jobs are executed can recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.

    For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local, as shown in the example entry after this list.

  • You have launched your Talend Studio and opened the Integration perspective.
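
For illustration, with the example values above, the hosts file of the client machine would contain an entry such as the one below. The IP address and host name are placeholders; replace them with the actual values of your cluster.

    192.168.x.x    talend-cdh550.weave.local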

In this procedure, you will create a Job that writes data to the HDFS system of the Cloudera Hadoop cluster to which the connection has been set up in the Repository, as explained in Setting up Hadoop connection manually. This data is needed for the use cases described in Talend Real-time Big Data Platform in use. The files needed for the use cases can be downloaded here.

  1. In the Repository tree view, expand the Job Designs node, right-click the Standard node and select Create folder from the contextual menu.

  2. In the [New Folder] wizard, name your Job folder getting_started and click Finish to create your folder.

  3. Right-click the getting_started folder and select Create Standard Job from the contextual menu.

  4. In the [New Job] wizard, enter a name for the Job to be created and other useful information.

    For example, enter write_to_hdfs in the Name field.

    In this step of the wizard, Name is the only mandatory field. The information you provide in the Description field will appear as a tooltip when you move your mouse pointer over the Job in the Repository tree view.

  5. Click Finish to create your Job.

    An empty Job is opened in the Studio.

  6. Expand the Hadoop cluster node under Metadata in the Repository tree view.

  7. Expand the Hadoop connection you have created and then the HDFS folder under it. In this example, it is the my_cdh Hadoop connection.

  8. Drop the HDFS connection from the HDFS folder into the workspace of the Job you are creating. This connection is cdh_hdfs in this example.

    The [Components] window is displayed to show all the components that can directly reuse this HDFS connection in a Job.

  9. Select tHDFSPut and click OK to validate your choice.

    The [Components] window closes and a tHDFSPut component is automatically placed in the workspace of the current Job, labelled with the name of the HDFS connection mentioned in the previous step.

  10. Double-click tHDFSPut to open its Component view.

    The connection to the HDFS system to be used has been automatically configured using the HDFS connection you have set up and stored in the Repository. The related parameters in this tab therefore become read-only. These parameters are: Distribution, Version, NameNode URI, Use Datanode Hostname, Use kerberos authentication and Username. For a rough idea of what the component does with these parameters at the HDFS API level, see the sketch after this procedure.

  11. In the Local directory field, enter the path, or browse to the folder in which the files to be copied to HDFS are stored.

    The files about movies and their directors are stored in this directory.

  12. In the HDFS directory field, enter the path, or browse to the target directory in HDFS to store the files.

    This directory is created on the fly if it does not exist.

  13. From the Overwrite file drop-down list, select always to overwrite the files if they already exist in the target directory in HDFS.

  14. In the Files table, add one row by clicking the [+] button in order to define the criteria to select the files to be copied.

  15. In the Filemask column, enter an asterisk (*) within the double quotation marks to make tHDFSPut select all the files stored in the folder you specified in the Local directory field.

  16. Leave the New name column empty, that is, keep the default double quotation marks as they are, so that the file names remain unchanged after the upload.

  17. Press F6 to run the Job.

    The Run view is opened automatically. It shows the progress of this Job.
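
The Job built in this procedure is, at the HDFS API level, roughly equivalent to the following minimal Java sketch: it connects to the NameNode as a given user, creates the target directory if it does not exist, and copies every file from a local folder to it, overwriting files that already exist. The NameNode URI, user name and paths below are placeholders for illustration, not the values stored in your Repository connection.

    import java.io.File;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteToHdfs {
        public static void main(String[] args) throws Exception {
            // Placeholder values; in the Job they come from the cdh_hdfs connection in the Repository.
            URI nameNodeUri = new URI("hdfs://talend-cdh550.weave.local:8020");
            String userName = "hdfs_user";
            File localDir = new File("/tmp/getting_started/input");     // Local directory field
            Path hdfsDir = new Path("/user/hdfs_user/getting_started"); // HDFS directory field

            // Connect to HDFS as the given user.
            FileSystem fs = FileSystem.get(nameNodeUri, new Configuration(), userName);
            try {
                // The target directory is created on the fly if it does not exist.
                fs.mkdirs(hdfsDir);

                // Filemask "*": take every file found in the local directory.
                File[] localFiles = localDir.listFiles();
                if (localFiles == null) {
                    throw new IllegalStateException("Local directory not found: " + localDir);
                }
                for (File localFile : localFiles) {
                    if (localFile.isFile()) {
                        // Overwrite file = always; New name left empty, so the file name is kept as is.
                        fs.copyFromLocalFile(false, true,
                                new Path(localFile.getAbsolutePath()),
                                new Path(hdfsDir, localFile.getName()));
                    }
                }
            } finally {
                fs.close();
            }
        }
    }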

When the Job is done, the files you uploaded can be found in HDFS in the directory you have specified.
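
To double-check the result outside the Studio, you can also list the target directory with the HDFS command-line client from a machine that has access to the cluster, for example with hdfs dfs -ls followed by the HDFS directory you entered in step 12.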