Before you begin
Ensure that the client machine on which the Talend Studio is installed can recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.
For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.
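The mapping check described above can be sketched in a few lines of Python. This is only an illustration, not part of the Studio: the helper names `parse_hosts` and `missing_hosts` and the host name `datanode1.weave.local` are hypothetical, and the sample text stands in for the real hosts file of the client machine.

```python
def parse_hosts(text):
    """Map each host name found in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        ip, *names = entry.split()
        for name in names:
            mapping[name.lower()] = ip
    return mapping


def missing_hosts(text, required):
    """Return the required host names that have no entry in the hosts text."""
    known = parse_hosts(text)
    return [name for name in required if name.lower() not in known]


# Sample content; datanode1.weave.local is a made-up node name.
sample = """
127.0.0.1    localhost
# Hadoop cluster nodes
192.168.x.x  talend-cdh550.weave.local
"""
print(missing_hosts(sample, ["talend-cdh550.weave.local", "datanode1.weave.local"]))
# ['datanode1.weave.local']
```

On a real client machine, the text would be read from /etc/hosts on Linux or macOS, or from C:\Windows\System32\drivers\etc\hosts on Windows.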
The Hadoop cluster to be used has been properly configured and is running.
The Cloudera Hadoop cluster used in this example is CDH V5.5 in YARN mode. It applies the default configuration of the distribution, and Kerberos security is not enabled. For further information about the default configuration of the CDH V5.5 distribution, see Deploy CDH 5 on a cluster and Default ports used in CDH5.
- In the Repository tree view of your studio, expand Metadata and then right-click Hadoop cluster.
- Select Create Hadoop cluster from the contextual menu to open the Hadoop cluster connection wizard.
- Fill in the generic information about this connection, such as Name and Description, and click Next to open the Hadoop configuration import wizard, which helps you import a ready-for-use configuration, if any.
- Select the Enter manually Hadoop services check box to manually enter the configuration information for the Hadoop connection being created.
- Click Finish to close this import wizard.
- From the Distribution list, select Cloudera and then from the Version list, select Cloudera CDH5.5 (YARN mode).
- In the Namenode URI field, enter the URI pointing to the machine used as the NameNode service of the Cloudera Hadoop cluster to be used.
The NameNode is the master node of a Hadoop system. For example, if you have chosen a machine called machine1 as the NameNode, the location to enter is hdfs://machine1:portnumber.
On the cluster side, the related property is specified in the configuration file called core-site.xml. If you do not know what URI is to be entered, check the fs.defaultFS property in the core-site.xml file of your cluster.
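If you have a copy of the core-site.xml file at hand, the lookup can be sketched as follows. This is a minimal example that assumes the usual Hadoop layout of `<property>` elements with `<name>` and `<value>` children. The helper name `read_hadoop_property` is hypothetical, and the sample content (including port 8020) is a placeholder, not a value taken from your cluster.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name):
    """Return the value of a property in Hadoop *-site.xml text, or None."""
    root = ET.fromstring(xml_text.strip())
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Illustrative content only; machine1 and 8020 are placeholders.
core_site = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://machine1:8020</value>
  </property>
</configuration>
"""
print(read_hadoop_property(core_site, "fs.defaultFS"))  # hdfs://machine1:8020
```

The function returns None when the property is absent, which usually means the cluster relies on the default value for that property.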
- In the Resource manager field and the Resource manager scheduler field, enter the URIs pointing to these two services, respectively.
On the cluster side, these two services share the same host machine but use different default port numbers. For example, if the machine hosting them is resourcemanager.company.com, the location of the Resource manager is resourcemanager.company.com:8032 and the location of the Resource manager scheduler is resourcemanager.company.com:8030.
If you do not know the name of the hosting machine of these services, check the yarn.resourcemanager.hostname property in the configuration file called yarn-site.xml of your cluster.
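As a small sketch of how the two values relate to the host name, the following assumes that the cluster keeps the default ports mentioned above (8032 and 8030). The function name `resource_manager_fields` is hypothetical.

```python
# Default ports named in the text above; adjust them if your cluster differs.
RESOURCE_MANAGER_PORT = 8032
SCHEDULER_PORT = 8030


def resource_manager_fields(hostname):
    """Build the two wizard values from the ResourceManager host name."""
    host = hostname.strip()
    if not host:
        raise ValueError("the ResourceManager host name is empty")
    return {
        "Resource manager": f"{host}:{RESOURCE_MANAGER_PORT}",
        "Resource manager scheduler": f"{host}:{SCHEDULER_PORT}",
    }


fields = resource_manager_fields("resourcemanager.company.com")
print(fields["Resource manager"])            # resourcemanager.company.com:8032
print(fields["Resource manager scheduler"])  # resourcemanager.company.com:8030
```

If the yarn-site.xml file of your cluster sets yarn.resourcemanager.address or yarn.resourcemanager.scheduler.address explicitly, use those values instead of the derived ones.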
- In the Job history field, enter the location of the JobHistory service. This service allows the metrics information of the current Job to be stored in the JobHistory server.
The related property is specified in the configuration file called mapred-site.xml of your cluster. For the value you need to put in this field, check the mapreduce.jobhistory.address property in this mapred-site.xml file.
- In the Staging directory field, enter the directory defined in your Hadoop cluster for temporary files created by running programs.
The related property is specified in the mapred-site.xml file of your cluster. For further information, check the yarn.app.mapreduce.am.staging-dir property in this mapred-site.xml file.
- Select the Use datanode hostname check box to allow the Studio to access each Datanode of your cluster via its host name.
This actually sets the dfs.client.use.datanode.hostname property of your cluster to true.
- In the User name field, enter the user authentication name you want the Studio to use to connect to your Hadoop cluster.
- Since the Hadoop cluster to be connected to uses the default configuration, leave the other fields and check boxes in this wizard as they are. They are used only to define a custom Hadoop configuration.
- Click the Check services button to verify that the Studio can connect to the NameNode and the ResourceManager services you have specified.
A dialog box pops up to indicate the checking process and the connection status.
If the connection fails, you can click Error log at the end of each progress bar to diagnose the connection issues.
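When the check fails, it can help to test from the client machine whether the service ports are reachable at all. The sketch below is a rough TCP-level approximation, not a description of what Check services does internally. The helper names `split_host_port` and `is_reachable` are hypothetical, and the `connect` parameter exists only so that the function can be exercised without a live cluster.

```python
import socket


def split_host_port(address):
    """Split 'host:port' or 'hdfs://host:port' into (host, port)."""
    location = address.split("://", 1)[-1].rstrip("/")
    host, _, port = location.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def is_reachable(address, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to the address can be opened."""
    host, port = split_host_port(address)
    try:
        connection = connect((host, port), timeout)
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False
    connection.close()
    return True


# Parsing only; no network access happens here.
print(split_host_port("hdfs://machine1:8020"))  # ('machine1', 8020)
```

Calling `is_reachable` with the values entered in the Namenode URI and Resource manager fields tells you whether a failure is a network or name-resolution problem. A result of True only shows that the port accepts connections, not that the Hadoop service behind it is healthy.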
- Once this check indicates that the connection is successful, click Finish to validate your changes and close the wizard.
The new connection, called my_cdh in this example, is displayed under the Hadoop cluster folder in the Repository tree view.
You can then continue to create the child connections to different Hadoop elements such as HDFS or Hive based on this connection.