Centralizing a Hadoop connection - 6.2

Talend Data Fabric Studio User Guide

Setting up a connection to a given Hadoop distribution in the Repository allows you to avoid configuring that connection each time you need to use that distribution.

You need to define the Hadoop connection before you can create, from the Hadoop cluster node, the connections to individual Hadoop elements such as HDFS, Hive or Oozie.

Prerequisites:

  • You have ensured that the client machine on which the Talend Studio is installed can resolve the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.

    For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.

  • The Hadoop cluster to be used has been properly configured and is running.

  • The Integration perspective is active.

  • If you need to connect to MapR from the Studio, ensure that you have installed the MapR client on the machine where the Studio is installed, and added the MapR client library to the PATH variable of that machine. According to MapR's documentation, the library or libraries of a MapR client corresponding to each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native. For example, the library for Windows is \lib\native\MapRClient.dll in the MapR client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.

To create a Hadoop connection in the Repository, do the following:

  1. In the Repository tree view of your Studio, expand Metadata and then right-click Hadoop cluster.

  2. Select Create Hadoop cluster from the contextual menu to open the [Hadoop cluster connection] wizard.

  3. Fill in the generic information about this connection, such as its Name and Description, and click Next to open the [Hadoop Configuration Import Wizard] window, which allows you to select the manual or the automatic mode to configure the connection.

Configuring the Hadoop connection automatically

This automatic mode can be applied only to the Hadoop distributions officially supported by the Studio, that is, the distributions listed in this [Hadoop Configuration Import Wizard] window.

  1. In the Distribution area, select the Hadoop distribution to be used and its version.

  2. Select how you want to set up the configuration from this import wizard.

    • Retrieve configuration from Ambari or Cloudera: if you are using a Hortonworks Data Platform or a Cloudera CDH cluster and that cluster runs its specific management platform (Hortonworks Ambari for Hortonworks Data Platform, Cloudera Manager for Cloudera CDH), select this check box to import the configuration directly.

      For further information, see Retrieving configuration from Ambari or Cloudera.

    • Import configuration from local files: if you have obtained, or can obtain, the configuration files (mainly the *-site.xml files), for example from the administrator of the Hadoop cluster or by downloading them directly from the Web-based cluster management service, use this option to import the properties directly from those files.

      For further information, see Importing configuration from local files.

Retrieving configuration from Ambari or Cloudera

If you are able to access the Web-based management service of your cluster, that is, Ambari for Hortonworks or Cloudera Manager for Cloudera, select the Retrieve configuration from Ambari or Cloudera option to import the configuration information directly from that management service.

From this wizard, do the following:

  1. In the area for the credentials, enter the authentication information used to log in to the Web-based management service of the cluster to be used. In this example, the connection is made to Cloudera Manager.

  2. If the certificate system has been set up for the management service you need to connect to, select the Use authentication check box to activate the related fields, and then complete them using your TrustStore file.

    If you do not have this TrustStore file at hand, contact the administrator of the cluster.

    Both Hortonworks and Cloudera provide security-related information about their Web-based management services in their documentation; see their documentation websites for more details.

  3. Click the Connect button to create the connection from the Studio to Ambari or Cloudera manager.

    The name of the cluster managed by this management service is then displayed in the Discovered clusters list.

  4. Click the Fetch button to retrieve and list the configurations of the services of this cluster in this wizard.

  5. Select the services for which you want to import the configuration information.

  6. Click Finish.

    Then the relevant configuration information is automatically filled in the next step of the [Hadoop cluster connection] wizard.

  7. In this [Hadoop cluster connection] wizard, verify that the Use custom Hadoop configurations check box is selected, to ensure that the entire configuration you have imported is taken into account. If you clear this check box, the Studio instead uses its default Hadoop configuration (in the form of a jar file) to complement the parameters you have explicitly set in this wizard.

    For this reason, it is important to select this check box so that your custom configuration overrides the default one.

  8. Click the Check services button to verify that the Studio can connect to the NameNode and the ResourceManager services you have specified in this wizard.

    A dialog box pops up to show the progress of the check and the connection status. If it shows that the connection fails, review and update the connection information you have defined in the connection wizard.

  9. Click Finish to validate the changes.

If you need more details about the auto-completed fields in this [Hadoop cluster connection] wizard, see Configuring the connection manually.

Importing configuration from local files

Once you have selected Import configuration from local files in the import wizard, the following wizard opens to help you select, from the local machine, the Hadoop configuration files (mainly the *-site.xml files) to be used.
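
For reference, these *-site.xml files are plain-text Hadoop configuration files made of property name/value pairs. As an illustrative sketch (the host name reuses the example from the prerequisites; the port number is an assumption to be checked against your cluster), a minimal core-site.xml looks like this:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://talend-cdh550.weave.local:8020</value>
      </property>
    </configuration>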

From this wizard, proceed as follows:

  1. Click Browse... to access the folder in which the local configuration files to be used are stored, and click OK to list the configurations in this wizard.

    It is recommended to store these configuration files under a short access path on the local machine.

    For example, for a Cloudera cluster, this folder could contain the files used for the configuration of HDFS, MapReduce and YARN, automatically generated by and downloaded from Cloudera Manager.

  2. From the configuration list, select the configurations to be imported, for example, those for HDFS and MAPREDUCE2, and click Finish.

    Then the relevant configuration information is automatically filled in the next step of the [Hadoop cluster connection] wizard.

  3. In this [Hadoop cluster connection] wizard, verify that the Use custom Hadoop configurations check box is selected, to ensure that the entire configuration you have imported is taken into account. If you clear this check box, the Studio instead uses its default Hadoop configuration (in the form of a jar file) to complement the parameters you have explicitly set in this wizard.

    For this reason, it is important to select this check box so that your custom configuration overrides the default one.

  4. Click the Check services button to verify that the Studio can connect to the NameNode and the ResourceManager services you have specified in this wizard.

    A dialog box pops up to show the progress of the check and the connection status. If it shows that the connection fails, review and update the connection information you have defined in the connection wizard.

  5. Click Finish to validate the changes.

If you need more details about the auto-completed fields in this [Hadoop cluster connection] wizard, see Configuring the connection manually.

Configuring the connection manually

Although importing an existing Hadoop configuration is usually the most efficient approach, you may still have to set up the connection manually.

  1. In this [Hadoop Configuration Import Wizard] window, select Enter manually Hadoop services and click Finish to go back to the [Hadoop cluster connection] wizard.

    This mode allows you to connect to a custom Hadoop distribution. For further information, see Connecting to custom Hadoop distribution.

  2. Fill in the fields that are activated depending on the distribution and version you have selected.

    Note that among these fields, the NameNode URI field and the Resource Manager field are automatically filled with the default syntax and port number corresponding to the selected distribution. You need to update only the parts that differ in the configuration of the Hadoop cluster to be used.

    The fields to be filled may include:

    • NameNode URI:

      Enter the URI pointing to the machine used as the NameNode of the Hadoop distribution to be used.

      The NameNode is the master node of a Hadoop system. For example, assuming that you have chosen a machine called machine1 as the NameNode of an Apache Hadoop distribution, the location to be entered is hdfs://machine1:portnumber.
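
      If you have the cluster's core-site.xml file at hand, this URI is typically the value of the fs.defaultFS property (fs.default.name in older Hadoop versions). An illustrative entry, assuming the commonly used default port 8020 (verify the port for your distribution):

        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://machine1:8020</value>
        </property>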

      If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; the MapR client then takes care of the rest on the fly when creating the connection. The MapR client must be properly installed. For further information about how to set up a MapR client, see the following link in MapR's documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client.

    • Resource Manager:

      Enter the URI pointing to the machine used as the Resource Manager service of the Hadoop distribution to be used.

      Note that in some older Hadoop distribution versions, you need to set the location of the JobTracker service instead of the Resource Manager service.

      You may also need to set the addresses of the related services, such as the address of the ResourceManager scheduler. When you use this connection in a Big Data relevant component such as tHiveConnection, you can allocate memory to the Map and Reduce computations and to the ApplicationMaster of YARN in the Advanced settings view. For further information about the Resource Manager, its scheduler and the ApplicationMaster, see the documentation about YARN for your distribution, such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.
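
      These values typically come from the yarn-site.xml file of the cluster. An illustrative sketch, using machine2 as a placeholder host name and assuming the standard default ports 8032 and 8030 (they may differ in your distribution):

        <property>
          <name>yarn.resourcemanager.address</name>
          <value>machine2:8032</value>
        </property>
        <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>machine2:8030</value>
        </property>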

    • Job history:

      Enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.
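
      This location is typically defined by the mapreduce.jobhistory.address property in the mapred-site.xml file of the cluster. An illustrative entry, using a placeholder host name and assuming the common default port 10020:

        <property>
          <name>mapreduce.jobhistory.address</name>
          <value>machine1:10020</value>
        </property>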

    • Staging directory:

      Enter the directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in configuration files such as yarn-site.xml or mapred-site.xml of your distribution.
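
      For example, such an entry often looks like the following (the value shown is the common Hadoop default; your cluster may define another path):

        <property>
          <name>yarn.app.mapreduce.am.staging-dir</name>
          <value>/tmp/hadoop-yarn/staging</value>
        </property>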

    • Use datanode hostname:

      Select this check box to allow the Job to access datanodes via their hostnames. This actually sets the dfs.client.use.datanode.hostname property to true. If this connection is going to be used by a Job connecting to an S3N filesystem, you must select this check box.
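
      Selecting this check box is equivalent to setting the following property, shown here as it would appear in hdfs-site.xml:

        <property>
          <name>dfs.client.use.datanode.hostname</name>
          <value>true</value>
        </property>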

    • Enable Kerberos security:

      If you are accessing a Hadoop distribution running with Kerberos security, select this check box, then enter the Kerberos principal name for the NameNode in the field that is activated.

      These principals can be found in the configuration files of your distribution. For example, in a CDH4 distribution, the Resource Manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.
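
      For example, such principals are typically declared as follows (the property names are the standard Hadoop ones; the EXAMPLE.COM realm and the _HOST placeholder are illustrative):

        <!-- in yarn-site.xml -->
        <property>
          <name>yarn.resourcemanager.principal</name>
          <value>yarn/_HOST@EXAMPLE.COM</value>
        </property>

        <!-- in mapred-site.xml -->
        <property>
          <name>mapreduce.jobhistory.principal</name>
          <value>mapred/_HOST@EXAMPLE.COM</value>
        </property>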

      If you need to use a keytab file to log in, select the Use a keytab to authenticate check box. A keytab file contains pairs of Kerberos principals and encrypted keys. Enter the principal to be used in the Principal field and, in the Keytab field, browse to the keytab file to be used.

      Note that the user that executes a keytab-enabled Job is not necessarily the one designated by the principal, but must have the right to read the keytab file being used. For example, if the user name you are using to execute a Job is user1 and the principal to be used is guest, ensure that user1 has the right to read the keytab file to be used.

      For further information about setting Kerberos in the Studio with examples, see How to use Kerberos in the Studio.

    • If you are connecting to a MapR cluster V4.0.1 onwards and the MapR ticket security system of the cluster has been enabled, you need to select the Force MapR Ticket Authentication check box and define the following parameters:

      1. In the Password field, specify the password used by the user for authentication.

        A MapR security ticket is generated for this user by MapR and stored in the machine where the Job you are configuring is executed.

      2. In the Cluster name field, enter the name of the MapR cluster you want to connect to with this user name.

        This cluster name can be found in the mapr-clusters.conf file located in /opt/mapr/conf of the cluster.

      3. In the Ticket duration field, enter the length of time (in seconds) during which the ticket is valid.

      4. Keep the Launch authentication mechanism when the Job starts check box selected in order to ensure that the Job using this connection takes into account the current security configuration when it starts to run.

      If the default security configuration of your MapR cluster has been changed, you need to configure the connection to take this custom security configuration into account.

      MapR specifies its security configuration in the mapr.login.conf file located in /opt/mapr/conf of the cluster. For further information about this configuration file and the Java service it uses behind, see mapr.login.conf and JAAS.

      To configure this, proceed as follows:

      1. Verify what has been changed about this mapr.login.conf file.

        You should be able to obtain the related information from the administrator or the developer of your MapR cluster.

      2. If the location of the MapR configuration files has been changed in the cluster, that is, if the MapR Home directory has been changed, select the Set the MapR Home directory check box and enter the new Home directory. Otherwise, leave this check box clear to use the default Home directory.

      3. If the login module to be used in the mapr.login.conf file has been changed, select the Specify the Hadoop login configuration check box and enter the module to be called from the mapr.login.conf file. Otherwise, leave this check box clear to use the default login module.

        For example, enter kerberos to call the hadoop_kerberos module or hybrid to call the hadoop_hybrid module.

    • User name:

      Enter the user authentication name of the Hadoop distribution to be used.

      If you leave this field empty, the Studio will use the login name of the client machine you are working on to access that Hadoop distribution. For example, if you are using the Studio on a Windows machine and your login name is Company, the authentication name used at runtime will be Company.

    • Group:

      Enter the group name to which the authenticated user belongs.

      Note that this field becomes activated depending on the distribution you are using.

    • Hadoop properties:

      If you need to use custom configuration for the Hadoop distribution to be used, click the [...] button to open the properties table and add the property or properties to be customized. Then at runtime, these changes will override the corresponding default properties used by the Studio for its Hadoop engine.

      Note that the properties set in this table are inherited and reused by the child connections you will be able to create based on this current Hadoop connection.

      For further information about the properties of Hadoop, see Apache's Hadoop documentation on http://hadoop.apache.org/docs/current/, or the documentation of the Hadoop distribution you need to use. For example, the following page lists some of the default Hadoop properties: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml.

      For further information about how to leverage this properties table, see Setting reusable Hadoop properties.
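
      For example, to override the HDFS replication factor applied to the files written by your Jobs, you could add the dfs.replication property to this table with the value of your choice. In a Hadoop configuration file, the equivalent entry would read as follows (an illustrative override, not a recommendation):

        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>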

    • When the distribution to be used is Microsoft HD Insight, you need to set the WebHCat configuration, the HDInsight configuration and the Windows Azure Storage configuration instead of the parameters mentioned above. Apart from the authentication information you need to provide in these configuration areas, you also need to set the following parameters:

      • In the Job result folder field, enter the location in which you want to store the execution result of a Talend Job in the Azure Storage to be used.

      • In the Deployment Blob field, enter the location in which you want to store a Talend Job and its dependent libraries in this Azure Storage account.

      A demonstration video about how to configure this connection is available at the following link: https://www.youtube.com/watch?v=A3QTT6VsNoM.

    • If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator of your distribution to trace your Job lineage to the component level, including the schema changes between components.

      You then need to click the [...] button to open the [Cloudera Navigator Wizard] window and define the following parameters:

      1. Username and Password: the credentials you use to connect to your Cloudera Navigator.

      2. URL: enter the location of the Cloudera Navigator to be connected to.

      3. Metadata URL: enter the location of the Navigator Metadata.

      4. Client URL: leave the default value as is.

      5. Autocommit: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.

        Since this option actually forces Cloudera Navigator to generate lineages of all its available entities, such as HDFS files and directories, Hive queries or Pig scripts, it is not recommended for the production environment because it slows down the Job.

      6. Die on error: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

        Otherwise, leave it clear to allow your Job to continue to run.

      7. Disable SSL: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process.

        This feature is meant to facilitate testing of your Job, but is not recommended for use in a production cluster.

      Once the configuration is done, click Finish to validate the settings.

  3. For each distribution officially supported by Talend, a default Hadoop configuration (in the form of a jar file) is automatically loaded by the Studio to complement the parameters you have explicitly set in this wizard.

    If you need to use your custom configuration to replace the default one, select the Use custom Hadoop confs check box and then click the [...] button to open the import wizard to import the configuration from Ambari, Cloudera Manager or local files.

    Note that this import overwrites only the default Hadoop configuration used by the Studio but does not overwrite the parameters you have defined in this [Hadoop cluster connection] wizard.

    For further information about this import, see Retrieving configuration from Ambari or Cloudera and Importing configuration from local files.

  4. Click the Check services button to verify that the Studio can connect to the NameNode and the JobTracker or ResourceManager services you have specified in this wizard.

    A dialog box pops up to show the progress of the check and the connection status. If it shows that the connection fails, review and update the connection information you have defined in the connection wizard.

  5. Click Finish to validate your changes and close the wizard.

    The newly set-up Hadoop connection is displayed under the Hadoop cluster folder in the Repository tree view. This connection has no sub-folders until you create connections to elements under that Hadoop distribution.