Contextualizing the Hadoop connection parameters

Setting up context-smart Hadoop connections

EnrichVersion
6.5
EnrichProdName
Talend Open Studio for Big Data
Talend Data Fabric
Talend Big Data
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Hadoop distributions
EnrichPlatform
Talend Studio
Contextualize the Hadoop connection parameters to make this connection portable across different Hadoop environments, such as a test environment and a production environment.

Before you begin

  • Ensure that the client machine on which the Talend Studio is installed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster to the hosts file of the client machine (see the example after this list).

    For example, if the host name of the Hadoop NameNode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.

  • The Hadoop cluster to be used has been properly configured and is running.

  • A Hadoop connection has been properly set up following the explanations in Setting up the Hadoop connection.

  • The Integration perspective is active.

  • Cloudera is the distribution used as the example in this article. If you are using a different distribution, bear in mind the following distribution-specific prerequisites:
    • If you need to connect to MapR from the Studio, ensure that you have installed the MapR client on the machine where the Studio is installed, and added the MapR client library to the PATH variable of that machine (a PATH example follows this list). According to the MapR documentation, the library or libraries of a MapR client for each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native. For example, the library for Windows is lib\native\MapRClient.dll in the MapR client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.

    • If you need to connect to a Google Dataproc cluster, set the path to the Google credentials file associated with the service account to be used in the environment variables of your local machine, so that the Check service feature of the metadata wizard can properly verify your configuration (an example follows this list).

      For further information about how to set this environment variable, see Getting Started with Authentication in the Google documentation.
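
For reference, here is what the hosts file of the client machine could look like for the example above. On Linux the file is /etc/hosts; on Windows it is C:\Windows\System32\drivers\etc\hosts. The addresses and the second host name are placeholders to adapt to your own cluster:

    192.168.x.x    talend-cdh550.weave.local
    192.168.x.y    talend-cdh550-dn1.weave.local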
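
As an illustration of the MapR prerequisite, on a Windows machine you could append the directory containing the MapR client native library to the PATH variable. The installation path and version below are assumptions; adapt them to your actual MapR client installation:

    set PATH=%PATH%;C:\opt\mapr\hadoop\hadoop-2.7.0\lib\native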
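
As for the Google Dataproc prerequisite, the environment variable described in Getting Started with Authentication is GOOGLE_APPLICATION_CREDENTIALS. For example, on a Linux machine (the file path is a placeholder for your own credentials file):

    export GOOGLE_APPLICATION_CREDENTIALS=/home/user/keys/service-account.json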

Procedure

  1. In the Repository tree view of your Studio, expand Metadata and then Hadoop cluster, and double-click the Hadoop connection you created following Setting up the Hadoop connection.
  2. Click Next to go to the step 2 window of this wizard and click the Export as context button.
  3. In the [Create/Reuse a context group] wizard, select Create a new repository context and click Next.
  4. In the step 1 window of the [Create/Reuse a context group] wizard, enter at least the name you want to use for the context group being created, for example, smart_connection, and click Next.

    A read-only view of this context group is created and automatically filled with the parameters of the given Hadoop connection you defined in Setting up the Hadoop connection.

    You may also notice that not all of the connection parameters are added to the context group, that is, not all of them are contextualized. This is the expected behavior (see the illustration after this procedure).

  5. Click Finish to validate the creation and switch back to the step 2 window of the Hadoop connection wizard.

    The connection parameters are now set to use the context variables and have become read-only.

  6. Click Finish to validate these changes.

    The new context group, named smart_connection, has been created under the Contexts node.

  7. In the Repository tree view, double-click this new context group to open the [Create/Edit a context group] wizard.
  8. Click Next to go to the step 2 window of this wizard, in which you edit the context variables.
  9. Click the [+] button to open the [Configure contexts] wizard, from which you add a new context.
  10. Click New to open the [New context] wizard and enter the name of this new context, for example, prod.
  11. Click OK to validate the changes and close the [New context] wizard. The new context is added to the context list.
  12. Click OK to validate the addition and close the [Configure contexts] wizard to go back to the [Create/Edit a context group] wizard.
  13. Define the new context to contain the connection parameter values for a different Hadoop cluster, for example, your production one.
  14. Click Finish to validate the changes and accept the propagation.
  15. Back in the Repository tree view, under the Hadoop cluster node, double-click the Hadoop connection you are contextualizing to open its wizard.
  16. In the step 2 window of this wizard, ensure that the Use custom Hadoop configuration check box is selected and click the [...] button next to it to open the [Hadoop configuration] wizard.

    The prod context is displayed in the wizard and the message "Please import the jar." next to it prompts you to import the Hadoop configuration file specific to the Hadoop cluster this prod context is created for.

    You can also notice that the Default context, the first context generated for this Hadoop connection, smart_connection, already possesses a Hadoop configuration jar file. This jar file was automatically generated at the end of the process of defining this Hadoop connection and creating the Default context for it.

  17. Click the field showing this "Please import the jar." message to display the [...] button, and click this button to open the [Hadoop configuration import] wizard.

    This step starts the same process as explained in Setting up the Hadoop connection, allowing you to set up the Hadoop configuration either automatically or manually. This time, however, the process generates only the appropriate Hadoop configuration jar file for the prod context; it does not create a new Hadoop connection item under the Hadoop cluster node.

  18. Click OK to validate the changes and then click Finish to validate the contextualization and close the Hadoop connection wizard.

    If prompted, click Yes to accept the propagation.

  19. The Hadoop connection is now contextualized, and you can create child connections to its elements, such as HBase, HDFS, and Hive, based on this connection. Each of these connection wizards contains the Export as context button; use it to contextualize each child connection in the same way.
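
To illustrate what the contextualization changes in practice, here is a sketch of a contextualized parameter. The variable name and values below are hypothetical; the wizard derives the actual variable names from the connection parameters. After the export, a field such as the NameNode URI no longer holds a literal value but a reference of the form context.variable_name, and each context of the group supplies its own value for that variable:

    NameNode URI field:  context.smart_connection_namenode_uri

    Context      Value of smart_connection_namenode_uri
    Default      hdfs://talend-cdh550.weave.local:8020
    prod         hdfs://namenode.prod.local:8020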

Results

When you reuse these connections via the Property type list of a given component in your Jobs, the contexts defined for them are listed in the Run view of the Job, where you can select the environment to run the Job against, as shown in the example below.
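
For example, before running a Job in the Studio, you can select prod in the Context list of the Run view to execute the Job against the production cluster instead of the default one. A Job built and exported from the Studio typically accepts the target context as a launch parameter as well; the Job name in the command below is a placeholder:

    ./my_hadoop_job_run.sh --context=prod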