Setting up connection to HDFS - 7.0

Talend Data Fabric Getting Started Guide

A connection to HDFS in the Repository allows you to reuse this connection in related Jobs. Before creating it, ensure that the following prerequisites are met:
  • The connection to the Hadoop cluster hosting the HDFS system to be used has been set up from the Hadoop cluster node in the Repository.

    For further information about how to create this connection, see Setting up Hadoop connection manually.

  • The Hadoop cluster to be used has been properly configured and is running, and you have the proper access permissions to that distribution and its HDFS.

  • Ensure that the client machine on which Talend Studio is installed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster to the hosts file of the client machine.

    For example, if the host name of the Hadoop NameNode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local. A quick way to verify this mapping from the Studio machine is shown in the sketch after this list.
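
To confirm that the client machine can actually resolve a cluster host name, the following minimal Java sketch uses only the standard library; the host name is taken from the example above and should be replaced with the names of your own cluster's nodes.

    import java.net.InetAddress;

    public class HostnameCheck {
        public static void main(String[] args) throws Exception {
            // Host name from the hosts-file example above; replace it with
            // the names of your own cluster's nodes.
            InetAddress addr = InetAddress.getByName("talend-cdh550.weave.local");
            System.out.println(addr.getHostName() + " -> " + addr.getHostAddress());
        }
    }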

  1. Expand the Hadoop cluster node under Metadata in the Repository tree view, right-click the Hadoop connection to be used and select Create HDFS from the contextual menu.
  2. In the connection wizard that opens up, fill in the generic properties of the connection you need to create, such as Name, Purpose and Description.
  3. Click Next when you have completed these fields. The second step requires you to fill in the HDFS connection data.

    The User name property is automatically pre-filled with the value inherited from the Hadoop connection you selected in the previous steps.

    The Row separator and the Field separator properties use the default values.

  4. Select the Set heading row as column names check box to use the data in the heading row of the HDFS file to define the column names of this file.

    The Header check box is then automatically selected and the Header field is filled with 1. This means that the first row of the file is skipped as data and used instead to provide the column names of the file. This behavior is illustrated in the first sketch after these steps.

  5. Click Check to verify your connection.

    A message pops up to indicate whether the connection is successful. The second sketch after these steps shows an equivalent programmatic check.

  6. Click Finish to validate these changes.
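
The Set heading row as column names behavior of step 4 can be pictured in code: the first row of a delimited file supplies the column names and is skipped as data. The following Java sketch uses the Hadoop client API to illustrate this; the NameNode URI, user name, file path, and comma separator are assumed values for illustration only, not values the wizard generates.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HeaderRowExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode URI, user name, and file path.
            URI nameNode = URI.create("hdfs://talend-cdh550.weave.local:8020");
            try (FileSystem fs = FileSystem.get(nameNode, new Configuration(), "hdfs");
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(fs.open(new Path("/user/hdfs/customers.csv"))))) {
                String headerLine = reader.readLine();
                if (headerLine == null) {
                    System.out.println("Empty file");
                    return;
                }
                // Header = 1: the first row provides the column names...
                String[] columnNames = headerLine.split(",");
                System.out.println("Columns: " + String.join(", ", columnNames));
                // ...and every following row is read as data.
                String row;
                while ((row = reader.readLine()) != null) {
                    System.out.println(row.split(",").length + " fields");
                }
            }
        }
    }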
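
Similarly, the Check action in step 5 amounts to an HDFS client call made with the connection's user name. A minimal sketch of an equivalent standalone check follows; again, the NameNode URI and user name are assumed values.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectionCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode URI and user name for illustration.
            URI nameNode = URI.create("hdfs://talend-cdh550.weave.local:8020");
            String user = "hdfs";

            Configuration conf = new Configuration();
            // Connect as the given user, mirroring the User name field of the wizard.
            try (FileSystem fs = FileSystem.get(nameNode, conf, user)) {
                // Listing the root path succeeds only if the cluster is reachable
                // and the user has access permissions.
                boolean ok = fs.exists(new Path("/"));
                System.out.println(ok ? "Connection successful" : "Root path not found");
            }
        }
    }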

The new HDFS connection is now available under the Hadoop cluster node in the Repository tree view. You can then use it to define and centralize the schemas of the files stored in the connected HDFS system in order to reuse these schemas in a Talend Job.