Setting up context-smart Hadoop connections
Setting up a connection to a given Hadoop distribution in the Repository allows you to avoid configuring that connection each time you need to use it in your Jobs.
When defining this connection, you can contextualize the connection parameters with values from different Hadoop environments, such as a test environment and a production environment, so that both the connection and the Jobs using it can be switched to the proper environment at runtime in one click.
The security configuration, such as the Kerberos parameters, cannot be contextualized. Therefore, ensure that the security values you use work in all the environments among which the Hadoop connection switches.
If available in your Studio, the advanced Spark properties and the advanced Hadoop properties you define cannot be contextualized either. For this reason, ensure that these properties are valid for all the environments among which the Hadoop connection switches.
Setting up the Hadoop connection
You first need to set up the connection to a given Hadoop environment.
In this article, a Cloudera distribution is used for demonstration purposes.
Before you begin
- Ensure that the client machine on which Talend Studio is installed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster to the hosts file of the client machine. For example, if the host name of the Hadoop NameNode server is talend-cdh550.weave.local and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local (a sample hosts entry is shown after this list).
- The Hadoop cluster to be used has been properly configured and is running.
- The Integration perspective is active.
- Cloudera is the example distribution of this article. If you are using a different distribution, bear in mind the following distribution-specific prerequisites:
- If you need to connect to MapR from the Studio, ensure that you have installed the MapR client on the machine where the Studio is installed, and added the MapR client library to the PATH variable of that machine. According to the MapR documentation, the library or libraries of a MapR client corresponding to each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native. For example, the library for Windows is \lib\native\MapRClient.dll in the MapR client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.
- If you need to connect to a Google Dataproc cluster, set the path to the Google credentials file associated with the service account to be used in the environment variables of your local machine, so that the Check service feature of the metadata wizard can properly verify your configuration (a sample setting is shown after this list). For further information about how to set this environment variable, see Getting Started with Authentication in the Google documentation.
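As an illustration of the prerequisites above, here is a minimal sketch of the corresponding client-side settings. The IP address, host name, and file path are examples only; adapt them to your own machine and operating system:

# hosts file of the client machine
# (/etc/hosts on Linux, C:\Windows\System32\drivers\etc\hosts on Windows)
192.168.x.x   talend-cdh550.weave.local

# Google Dataproc only (Linux example): GOOGLE_APPLICATION_CREDENTIALS is the
# standard environment variable read by the Google client libraries
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json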
Procedure
Contextualizing the Hadoop connection parameters
Before you begin
- Ensure that the client machine on which Talend Studio is installed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster to the hosts file of the client machine. For example, if the host name of the Hadoop NameNode server is talend-cdh550.weave.local and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.
- The Hadoop cluster to be used has been properly configured and is running.
- A Hadoop connection has been properly set up following the explanations in Setting up the Hadoop connection.
- The Integration perspective is active.
- Cloudera is the example distribution of this article. If you are using a different distribution, bear in mind the following distribution-specific prerequisites:
- If you need to connect to MapR from the Studio, ensure that you have installed the MapR client on the machine where the Studio is installed, and added the MapR client library to the PATH variable of that machine. According to the MapR documentation, the library or libraries of a MapR client corresponding to each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native. For example, the library for Windows is \lib\native\MapRClient.dll in the MapR client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.
- If you need to connect to a Google Dataproc cluster, set the path to the Google credentials file associated with the service account to be used in the environment variables of your local machine, so that the Check service feature of the metadata wizard can properly verify your configuration. For further information about how to set this environment variable, see Getting Started with Authentication in the Google documentation.
Procedure
Results
When you reuse one of these connections via the Property type list in a given component in your Jobs, the contexts defined for the connection are available in the Run view of the Job.
Reusing a contextualized Hadoop connection in a Job
Before you begin
- An empty Job has been created and opened in the workspace of the Studio.
- A Hadoop connection and its child connections have been properly set up following the explanations in Setting up the Hadoop connection.
Procedure
Results
The contexts available for use can then be selected in the Run view of the Job.
Creating a new Hadoop configuration context outside the Studio (optional)
You can contextualize the Hadoop connection for a Job without using the Studio.
When you do not have the Studio at hand but need to deploy a Job in a Hadoop environment different from those already defined for this Job, you can manually add a new Hadoop connection context.
If a Job uses a contextualized Hadoop connection that has two contexts, for example Default and Dev, then once the Job is built from the Studio, the lib folder of the built artifact (the Job zip) contains two dedicated jars, one per Hadoop environment. The names of these jars follow the pattern hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar.
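For example, for a Repository connection named cdh5100 with the two contexts Default and Dev, the lib folder contains hadoop-conf-cdh5100_Default.jar and hadoop-conf-cdh5100_Dev.jar.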
The jar to be used at runtime is determined by the context passed in the command that you can read from the .bat or .sh file of the Job.
The following line is an example of this command, which calls the Default context:
java -Xms256M -Xmx1024M -cp .;../lib/routines.jar;../lib/antlr-runtime-3.5.2.jar;../lib/avro-1.7.6-cdh5.10.1.jar;../lib/commons-cli-1.2.jar;../lib/commons-codec-1.9.jar;../lib/commons-collections-3.2.2.jar;../lib/commons-configuration-1.6.jar;../lib/commons-lang-2.6.jar;../lib/commons-logging-1.2.jar;../lib/dom4j-1.6.1.jar;../lib/guava-12.0.1.jar;../lib/hadoop-auth-2.6.0-cdh5.10.1.jar;../lib/hadoop-common-2.6.0-cdh5.10.1.jar;../lib/hadoop-hdfs-2.6.0-cdh5.10.1.jar;../lib/htrace-core4-4.0.1-incubating.jar;../lib/httpclient-4.3.3.jar;../lib/httpcore-4.3.3.jar;../lib/jackson-core-asl-1.8.8.jar;../lib/jackson-mapper-asl-1.8.8.jar;../lib/jersey-core-1.9.jar;../lib/log4j-1.2.16.jar;../lib/log4j-1.2.17.jar;../lib/org.talend.dataquality.parser.jar;../lib/protobuf-java-2.5.0.jar;../lib/servlet-api-2.5.jar;../lib/slf4j-api-1.7.5.jar;../lib/slf4j-log4j12-1.7.5.jar;../lib/talend_file_enhanced_20070724.jar;mytestjob_0_1.jar; local_project.mytestjob_0_1.myTestJob --context=Default %*
In this example, switching from Default to Dev changes the Hadoop configuration that is loaded into the Job at runtime.
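For example, to make the built Job load the Dev configuration instead, edit only the context argument at the end of this command in the .bat or .sh file. The generated classpath, shortened to <generated classpath> here for readability, stays exactly as generated:

java -Xms256M -Xmx1024M -cp <generated classpath> local_project.mytestjob_0_1.myTestJob --context=Dev %*

The Job then picks hadoop-conf-cdh5100_Dev.jar at runtime instead of the Default jar.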
Adding a new Hadoop connection context manually to the built Job
You can manually add a Hadoop environment to the Job, without the help of the Studio.
Following the example described in Creating a new Hadoop configuration context outside the Studio (optional), add a Prod Hadoop environment.
Before you begin
- This Job must use contextualized Hadoop connections; that is, the Job uses the Repository property type to reuse a Hadoop connection for which contexts have been defined.
You can search for further information about how to use metadata in a Job on Talend Help Center (https://help.talend.com).
For further information about how to define contexts for a Hadoop connection in the Studio, see Contextualizing the Hadoop connection parameters.
- The Job you need to deploy must have been properly built from the Studio and unzipped.
You can search for further information about how to build Jobs to deploy and execute them on any server, independent of Talend Studio, on Talend Help Center (https://help.talend.com).
Procedure
- In the contexts folder, duplicate Dev.properties and rename the copy Prod.properties.
- In the lib folder, duplicate hadoop-conf-cdh5100_Dev.jar and rename the copy hadoop-conf-cdh5100_Prod.jar.
- Open hadoop-conf-cdh5100_Prod.jar and replace the configuration files it contains with the configuration files of the production cluster.
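A minimal shell sketch of this procedure, assuming the built Job has been unzipped into the current folder so that the contexts and lib folders are at hand, and that the production configuration files (for example core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) have been exported from the production cluster to /tmp/prod-conf (both locations are examples):

# duplicate the Dev context files under the Prod name
cp contexts/Dev.properties contexts/Prod.properties
cp lib/hadoop-conf-cdh5100_Dev.jar lib/hadoop-conf-cdh5100_Prod.jar

# replace the configuration files inside the Prod jar with the production ones;
# jar uf updates entries in place (this assumes the configuration files sit at
# the root of the jar; check the layout of the Dev jar first)
cd /tmp/prod-conf
jar uf <path_to_the_unzipped_Job>/lib/hadoop-conf-cdh5100_Prod.jar core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml

If the Prod environment requires context values different from Dev, also edit Prod.properties accordingly.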
Results
You can then use the Prod context in the command to load the Prod configuration into the Job.
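For example, in the .bat or .sh file of the Job, replacing --context=Default (or --context=Dev) with --context=Prod makes the Job start with the configuration packaged in hadoop-conf-cdh5100_Prod.jar:

java -Xms256M -Xmx1024M -cp <generated classpath> local_project.mytestjob_0_1.myTestJob --context=Prod %*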