Before you begin
The connection to the Hadoop cluster hosting the HDFS system to be used has been set up from the Hadoop cluster node in the Repository.
For further information about how to create this connection, see Setting up Hadoop connection manually.
The Hadoop cluster to be used has been properly configured and is running, and you have the proper access permissions to that distribution and its HDFS.
Ensure that the client machine on which the Talend Studio is installed can recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.
For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.
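To double-check the mapping before opening the Studio, the hosts file content can be parsed with a few lines of Python. This is a minimal sketch and not part of Talend: the function names `parse_hosts` and `missing_hosts` are hypothetical, and the sample entry uses the placeholder address 192.0.2.10 in place of the masked 192.168.x.x.

```python
from typing import Dict, Iterable, List


def parse_hosts(text: str) -> Dict[str, str]:
    """Map each host name found in hosts-file text to its IP address."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        ip, *names = entry.split()
        for name in names:
            # Host names are case-insensitive; keep the first mapping seen.
            mapping.setdefault(name.lower(), ip)
    return mapping


def missing_hosts(text: str, required: Iterable[str]) -> List[str]:
    """Return the required host names that have no entry in the hosts text."""
    known = parse_hosts(text)
    return [name for name in required if name.lower() not in known]


# Placeholder content for illustration; on a real client machine, read the
# actual hosts file instead (for example /etc/hosts on Linux or macOS).
sample = """
127.0.0.1   localhost
192.0.2.10  talend-cdh550.weave.local   # Hadoop NameNode (placeholder IP)
"""

# An empty list means every required host name has a mapping entry.
print(missing_hosts(sample, ["talend-cdh550.weave.local"]))
```

Pass in the host names of all the cluster services the Studio needs to reach; any name returned by `missing_hosts` still needs a mapping entry.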
- Expand the Hadoop cluster node under Metadata in the Repository tree view, right-click the Hadoop connection to be used and select Create HDFS from the contextual menu.
In the connection wizard that opens, fill in the generic properties of the connection you need to create, such as Name, Purpose and Description.
Click Next when completed. The second step requires you to fill in the HDFS connection data.
The User name property is automatically pre-filled with the value inherited from the Hadoop connection you selected in the previous steps.
The Row separator and the Field separator properties use the default values.
Select the Set heading row as column names check box to use the data in the heading rows of the HDFS file as the column names of this file.
The Header check box is then automatically selected and the Header field is filled with 1. This means that the first row of the file is not treated as data but is used to define the column names of the file.
Click Check to verify your connection.
A message pops up to indicate whether the connection is successful.
- Click Finish to validate these changes.
The new HDFS connection is now available under the Hadoop cluster node in the Repository tree view. You can then use it to define and centralize the schemas of the files stored in the connected HDFS system in order to reuse these schemas in a Talend Job.
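The effect of the Set heading row as column names option with a Header value of 1 can be illustrated outside the Studio. The following Python sketch is only an approximation of that behavior, not Talend's implementation: the function name `read_with_header` and the sample data are hypothetical, and the separators (";" for fields, a newline for rows) are assumptions made for this example.

```python
import csv
import io

# Assumed separators for this sketch: ";" between fields, "\n" between rows.
FIELD_SEPARATOR = ";"
ROW_SEPARATOR = "\n"


def read_with_header(text, header_rows=1):
    """Split delimited text into (column_names, data_rows).

    With header_rows=1, the first row supplies the column names and is
    excluded from the data rows. With header_rows=0, every row is data.
    """
    reader = csv.reader(io.StringIO(text), delimiter=FIELD_SEPARATOR)
    rows = [row for row in reader if row]  # skip blank lines
    columns = rows[0] if header_rows and rows else []
    return columns, rows[header_rows:]


# Hypothetical file content standing in for a file stored in HDFS.
sample = ROW_SEPARATOR.join(["id;name;city", "1;Ada;Paris", "2;Lin;Oslo"])

columns, data = read_with_header(sample)
print(columns)  # ['id', 'name', 'city']
print(data)     # [['1', 'Ada', 'Paris'], ['2', 'Lin', 'Oslo']]
```

Calling `read_with_header(sample, header_rows=0)` returns no column names and keeps all three rows as data, which corresponds to leaving the Header check box cleared.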