Setting up Hadoop connection manually

Talend Big Data Getting Started Guide

Author: Talend Documentation Team
Version: 6.4
Product: Talend Big Data
Tasks: Design and Development; Installation and Upgrade
Setting up the connection to a given Hadoop distribution in the Repository allows you to avoid configuring that connection each time you need to use the same Hadoop distribution.

Before you begin

  • Ensure that the client machine on which the Talend Studio is installed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster to the hosts file of the client machine.

    For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.

  • The Hadoop cluster to be used has been properly configured and is running.
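
As a sketch of the first prerequisite above, the hosts file of the client machine (for example, /etc/hosts on Linux) would contain a mapping entry such as the following, reusing the example host name; the IP address is left as the placeholder from the example:

```text
192.168.x.x    talend-cdh550.weave.local
```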

The Cloudera Hadoop cluster used in this example runs CDH V5.5 in YARN mode and applies the default configuration of the distribution, without Kerberos security enabled. For further information about the default configuration of the CDH V5.5 distribution, see Deploy CDH 5 on a cluster and Default ports used in CDH5.

Procedure

  1. In the Repository tree view of your studio, expand Metadata and then right-click Hadoop cluster.
  2. Select Create Hadoop cluster from the contextual menu to open the Hadoop cluster connection wizard.
  3. Fill in generic information about this connection, such as Name and Description, and click Next to open the Hadoop configuration import wizard, which helps you import a ready-to-use configuration, if any.
  4. Select the Enter manually Hadoop services check box to manually enter the configuration information for the Hadoop connection being created.
  5. Click Finish to close this import wizard.
  6. From the Distribution list, select Cloudera and then from the Version list, select Cloudera CDH5.5 (YARN mode).
  7. In the Namenode URI field, enter the URI pointing to the machine used as the NameNode service of the Cloudera Hadoop cluster to be used.

    The NameNode is the master node of a Hadoop system. For example, assume that you have chosen a machine called machine1 as the NameNode; then the location to be entered is hdfs://machine1:portnumber.

    On the cluster side, the related property is specified in the configuration file called core-site.xml. If you do not know what URI is to be entered, check the fs.defaultFS property in the core-site.xml file of your cluster.

  8. In the Resource manager field and the Resource manager scheduler field, enter the URIs pointing to these two services, respectively.

    On the cluster side, these two services share the same host machine but use different default port numbers. For example, if the machine hosting them is resourcemanager.company.com, the location of the Resource manager is resourcemanager.company.com:8032 and the location of the Resource manager scheduler is resourcemanager.company.com:8030.

    If you do not know the name of the hosting machine of these services, check the yarn.resourcemanager.hostname property in the configuration file called yarn-site.xml of your cluster.

  9. In the Job history field, enter the location of the JobHistory service. This service stores the metrics information of the current Job in the JobHistory server.

    The related property is specified in the configuration file called mapred-site.xml of your cluster. For the value you need to put in this field, check the mapreduce.jobhistory.address property in this mapred-site.xml file.

  10. In the Staging directory field, enter the directory defined in your Hadoop cluster for temporary files created by running programs.

    The related property is specified in the mapred-site.xml file of your cluster. For further information, check the yarn.app.mapreduce.am.staging-dir property in this mapred-site.xml file.

  11. Select the Use datanode hostname check box to allow the Studio to access each Datanode of your cluster via its host name.

    This actually sets the dfs.client.use.datanode.hostname property of your cluster to true.

  12. In the User name field, enter the user authentication name you want the Studio to use to connect to your Hadoop cluster.
  13. Since the Hadoop cluster to be connected to uses the default configuration, leave the other fields and check boxes in this wizard as they are; they are used to define a custom Hadoop configuration.
  14. Click the Check services button to verify that the Studio can connect to the NameNode and the ResourceManager services you have specified.

    A dialog box pops up to indicate the checking process and the connection status.

    If the connection fails, you can click Error log at the end of each progress bar to diagnose the connection issues.

  15. Once this check indicates that the connection is successful, click Finish to validate your changes and close the wizard.
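
To illustrate the cluster-side property behind step 7 of the procedure, a core-site.xml fragment might look like the following sketch; the host name machine1 is the example from that step, and the port 8020 (the CDH default NameNode port) is an assumption for illustration:

```xml
<configuration>
  <!-- URI of the NameNode; this is the value to enter in the Namenode URI field -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://machine1:8020</value>
  </property>
</configuration>
```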
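Similarly, for step 8 of the procedure, the cluster-side yarn-site.xml defines the host of the ResourceManager; the host name resourcemanager.company.com is the example from that step, and the locations to enter in the wizard combine this host name with the YARN default ports 8032 and 8030:

```xml
<configuration>
  <!-- Host running the ResourceManager; with the default ports, the Resource manager
       is resourcemanager.company.com:8032 and the scheduler is resourcemanager.company.com:8030 -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.company.com</value>
  </property>
</configuration>
```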
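For steps 9 and 10 of the procedure, the relevant properties live in the mapred-site.xml file of your cluster. In the sketch below, the host name jobhistory.company.com and the staging directory /user are illustrative assumptions; 10020 is the default JobHistory port:

```xml
<configuration>
  <!-- Location of the JobHistory service; enter this value in the Job history field -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>jobhistory.company.com:10020</value>
  </property>
  <!-- Directory for temporary files created by running programs; enter it in the Staging directory field -->
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>
</configuration>
```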

Results

The new connection, called my_cdh in this example, is displayed under the Hadoop cluster folder in the Repository tree view.

You can then continue to create the child connections to different Hadoop elements such as HDFS or Hive based on this connection.