Setting up the connection to a given Hadoop distribution in the
Repository saves you from configuring that connection again
each time you need to use the same Hadoop distribution.
Before you begin
-
Ensure that the client machine on which the Talend Studio is installed can recognize the
host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP
address/hostname mapping entries for the services of that Hadoop cluster in the
hosts file of the client machine.
For example, if the host name of the Hadoop Namenode server is
talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads
192.168.x.x talend-cdh550.weave.local.
-
The Hadoop cluster to be used has been properly configured and is
running.
The Cloudera Hadoop cluster used in this example runs CDH V5.5
in YARN mode and applies the default configuration of the distribution, without
Kerberos security enabled. For further information about the default
configuration of the CDH V5.5 distribution, see Deploy CDH 5 on a cluster and
Default ports used in CDH5.
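The host name prerequisite above can be checked before you open the wizard. The following is a minimal sketch, not a Talend utility: it parses the text of a hosts file and reports which required host names have no mapping entry. The function names and the datanode1.weave.local host are hypothetical, and 192.168.x.x is kept as the placeholder used in the example above.

```python
def parse_hosts(text):
    """Return a dict mapping each host name (lower-cased) to its address."""
    mapping = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an address with no host name after it
        address, names = fields[0], fields[1:]
        for name in names:
            # Keep the first entry found for a given name.
            mapping.setdefault(name.lower(), address)
    return mapping


def missing_hosts(hosts_text, required):
    """Return the required host names that have no entry in hosts_text."""
    known = parse_hosts(hosts_text)
    return [name for name in required if name.lower() not in known]


# Sample hosts content; 192.168.x.x stands for the real NameNode address.
sample = """
127.0.0.1    localhost
192.168.x.x  talend-cdh550.weave.local   # Hadoop NameNode
"""

# datanode1.weave.local is a hypothetical node with no mapping entry.
print(missing_hosts(sample, ["talend-cdh550.weave.local", "datanode1.weave.local"]))
```

To run the check against the real file, pass its content instead of the sample, for example the text read from /etc/hosts on Linux. Any host name the sketch reports still needs a mapping entry on the client machine.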
Procedure
-
In the Repository tree view of your studio, expand
Metadata and then right-click Hadoop
cluster.
-
Select Create Hadoop cluster from the contextual menu to
open the Hadoop cluster connection wizard.
-
Fill in the generic information about this connection, such as
Name and Description, and click
Next to open the Hadoop configuration
import wizard, which helps you import a ready-for-use
configuration, if any.
-
Select the Enter manually Hadoop
services check box to manually enter the configuration
information for the Hadoop connection being created.
-
Click Finish to close this
import wizard.
-
From the Distribution list,
select Cloudera and then from the Version list, select Cloudera CDH5.5 (YARN mode).
-
In the Namenode URI field, enter the URI pointing to the
machine used as the NameNode service of the Cloudera Hadoop cluster to be used.
The NameNode is the master node of a Hadoop system. For example, if
you have chosen a machine called machine1 as the NameNode, the location
to enter is hdfs://machine1:portnumber.
On the cluster side, the related property is specified in the configuration
file called core-site.xml. If you do not know which URI
to enter, check the fs.defaultFS
property in the
core-site.xml file of your cluster.
-
In the Resource manager field and the
Resource manager scheduler field, enter the URIs
pointing to these two services, respectively.
On the cluster side, these two services share the same host machine but use
different default port numbers. For example, if the machine hosting them is
resourcemanager.company.com, the location of the Resource manager
is resourcemanager.company.com:8032 and the
location of the Resource manager scheduler is
resourcemanager.company.com:8030.
If you do not know the name of the hosting machine of these services, check
the yarn.resourcemanager.hostname
property in the
configuration file called yarn-site.xml of your
cluster.
-
In the Job history field, enter the location of the
JobHistory service. This service allows the metrics information of the current
Job to be stored in the JobHistory server.
The related property is specified in the configuration file called
mapred-site.xml of your cluster. For the value you
need to put in this field, check the
mapreduce.jobhistory.address
property in this
mapred-site.xml file.
-
In the Staging directory field, enter the directory
defined in your Hadoop cluster for temporary files created by running
programs.
The related property is specified in the mapred-site.xml
file of your cluster. For further information, check the
yarn.app.mapreduce.am.staging-dir
property in this
mapred-site.xml file.
-
Select the Use datanode hostname check box to allow the
Studio to access each Datanode of your cluster via its host name.
This sets the dfs.client.use.datanode.hostname
property to true for this connection.
-
In the User name field, enter the user authentication
name you want the Studio to use to connect to your Hadoop cluster.
-
Because the Hadoop cluster to be connected to uses the default
configuration, leave the other fields and check boxes in this wizard as they are;
they are used only to define a custom Hadoop configuration.
-
Click the Check services
button to verify that the Studio can connect to the NameNode and the
ResourceManager services you have specified.
A dialog box opens to show the progress of the check and the
connection status.
If the connection fails, click Error log at the end of the corresponding progress bar to diagnose the
connection issue.
-
Once this check indicates that the connection is successful, click
Finish to validate your changes and
close the wizard.
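Several steps above point to properties in the cluster configuration files. As a rough sketch, assuming you have copies of core-site.xml and yarn-site.xml from your cluster, the values for the Namenode URI and Resource manager fields could be looked up as follows. The function names and the inline sample files are illustrative only. The ports 8032 and 8030 are the defaults quoted in the procedure, and hdfs://machine1:portnumber is kept as the placeholder used there.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return the name/value pairs of a Hadoop *-site.xml document as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def wizard_fields(core_site, yarn_site):
    """Derive wizard field values from core-site.xml and yarn-site.xml text."""
    core = read_properties(core_site)
    yarn = read_properties(yarn_site)
    host = yarn.get("yarn.resourcemanager.hostname")
    return {
        "Namenode URI": core.get("fs.defaultFS"),
        # Default ports as quoted in the procedure; a cluster may override them.
        "Resource manager": f"{host}:8032" if host else None,
        "Resource manager scheduler": f"{host}:8030" if host else None,
    }


# Inline samples standing in for the real files (illustrative values only).
CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://machine1:portnumber</value>
  </property>
</configuration>"""

YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.company.com</value>
  </property>
</configuration>"""

fields = wizard_fields(CORE_SITE, YARN_SITE)
for label, value in fields.items():
    print(f"{label}: {value}")
```

The same read_properties helper can be applied to mapred-site.xml to look up mapreduce.jobhistory.address and yarn.app.mapreduce.am.staging-dir. If your cluster sets yarn.resourcemanager.address or yarn.resourcemanager.scheduler.address explicitly, use those values rather than the default ports assumed in this sketch.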
Results
The new connection, called my_cdh in this example, is displayed under
the Hadoop cluster folder in the Repository tree view.
You can then continue to create the child connections to different
Hadoop elements such as HDFS or Hive based on this connection.