Uploading a file to HDFS allows your Big Data Jobs to read and process
it.
In this procedure, you will create a Job that writes data to the HDFS system
of the Cloudera Hadoop cluster for which a connection has been set up in the Repository, as explained in Setting up Hadoop connection manually. This data is needed for the
use case described in Performing data integration tasks for Big Data.
For the files needed for the use case, download tdf_gettingstarted_source_files.zip from the
Downloads tab in the left panel
of this page.
Before you begin
-
The connection to the Hadoop cluster to be used and the
connection to the HDFS system of this cluster have been set up from the
Hadoop cluster node in the Repository.
If you have not done so, see Setting up Hadoop connection manually and then Setting up connection to HDFS to create these
connections.
-
The Hadoop cluster to be used has been properly configured and is
running, and you have the proper access permissions to that distribution and
to the HDFS folder to be used.
-
Ensure that the client machine on which the Talend Jobs are executed can
resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add
the IP address/hostname mapping entries for the services of that Hadoop cluster to the
hosts file of the client machine (see the example entry after this list).
For example, if the host name of the Hadoop NameNode server is
talend-cdh550.weave.local and its IP address is 192.168.x.x, the mapping entry reads
192.168.x.x talend-cdh550.weave.local.
-
You have launched your Talend Studio and opened
the Integration perspective.
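Using the example values above, the entry to add to the hosts file (/etc/hosts on Linux, C:\Windows\System32\drivers\etc\hosts on Windows) reads:

    192.168.x.x talend-cdh550.weave.local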
Procedure
-
In the Repository tree view, expand the Job
Designs node, right-click the Standard
node, and select Create folder from the contextual menu.
-
In the New Folder wizard, name your Job folder
getting_started and click
Finish to create your folder.
-
Right-click the getting_started folder and select
Create Standard Job from the contextual menu.
-
In the New Job wizard, give a
name to the Job you are going to create and provide other useful information if
needed.
For example, enter write_to_hdfs in the
Name field.
In this step of the wizard,
Name is the only mandatory field. The information
you provide in the Description field will appear as
hover text when you move your mouse pointer over the Job in the
Repository tree view.
-
Click Finish to create your Job.
An empty Job is opened in the Studio.
-
Expand the Hadoop cluster node
under Metadata in the Repository tree view.
-
Expand the Hadoop connection you have created and then the
HDFS folder under it. In this example,
it is the my_cdh Hadoop connection.
-
Drop the HDFS connection from the HDFS folder into the workspace of the Job you are creating.
This connection is cdh_hdfs in this example.
The Components window is
displayed, showing all the components that can directly reuse this HDFS
connection in a Job.
-
Select tHDFSPut and click OK to validate your choice.
The Components window
closes and a tHDFSPut component is automatically placed
in the workspace of the current Job, labelled with the name of the HDFS
connection mentioned in the previous step.
-
Double-click tHDFSPut to open its Component view.
The connection to the HDFS system to be used has been
automatically configured using the configuration of the HDFS connection
you have set up and stored in the Repository. The related parameters in this tab therefore
become read-only. These parameters are: Distribution, Version,
NameNode URI, Use Datanode Hostname, Use kerberos authentication and Username.
For an idea of what the configuration built in the following steps amounts to
in plain Hadoop API code, see the sketch after this procedure.
-
In the Local directory field,
enter the path, or browse to the folder in which the files to be copied to HDFS
are stored.
The files about movies and their directors are stored in this
directory.
-
In the HDFS directory field,
enter the path, or browse to the target directory in HDFS to store the
files.
This directory is created on the fly if it does not exist.
-
From the Overwrite file
drop-down list, select always to overwrite
the files if they already exist in the target directory in HDFS.
-
In the Files table, add one
row by clicking the [+] button to
define the criteria used to select the files to be copied.
-
In the Filemask column, enter
an asterisk (*) within the double quotation marks to make
tHDFSPut select all the files stored in the folder you
specified in the Local directory
field.
-
Leave the New name column
empty, that is, keep the default double quotation marks as they are, so that
the names of the files remain unchanged after they are uploaded.
-
Press F6 to run the Job.
The Run view is opened
automatically. It shows the progress of this Job.
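For reference, the following is a minimal plain-Java sketch of what this tHDFSPut configuration does, written against the Hadoop FileSystem API. The NameNode URI, user name and directory paths are hypothetical placeholders; substitute the values from your own HDFS connection and component settings.

    import java.io.File;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteToHdfs {
        public static void main(String[] args) throws Exception {
            // Connect to HDFS; the URI and user are placeholders for the
            // values stored in your repository HDFS connection.
            FileSystem fs = FileSystem.get(
                    new URI("hdfs://talend-cdh550.weave.local:8020"),
                    new Configuration(), "hdfs");

            // Target directory in HDFS, created on the fly if it does not exist.
            Path hdfsDir = new Path("/user/hdfs/getting_started");
            fs.mkdirs(hdfsDir);

            // Filemask "*": select every file in the local directory, overwrite
            // files that already exist in HDFS, and keep the original names.
            File localDir = new File("/tmp/tdf_gettingstarted_source_files");
            for (File file : localDir.listFiles()) {
                if (file.isFile()) {
                    fs.copyFromLocalFile(false, true, // delSrc=false, overwrite=true
                            new Path(file.getAbsolutePath()),
                            new Path(hdfsDir, file.getName()));
                }
            }
            fs.close();
        }
    }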
Results
When the Job is done, the files you uploaded can be found in HDFS in the
directory you have specified.
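If you have command-line access to the cluster, you can also verify the upload from a terminal with the HDFS shell; the path below is a placeholder for the HDFS directory you specified in tHDFSPut:

    hdfs dfs -ls /user/hdfs/getting_started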