Centralizing a Hadoop connection - 6.1

Talend Big Data Studio User Guide


Setting up a connection to a given Hadoop distribution in the Repository allows you to avoid configuring that connection each time you need to use the same Hadoop distribution.

You need to define a Hadoop connection before you can create, from the Hadoop cluster node, the connections to each individual Hadoop element such as HDFS, Hive or Oozie.

Prerequisites:

Before carrying out the following procedure to configure your Hadoop connection, make sure that you have access to the Hadoop distribution to be connected to.

If you need to connect to MapR from the Studio, ensure that you have installed the MapR client on the machine where the Studio is installed, and added the MapR client library to the PATH variable of that machine. According to MapR's documentation, the library or libraries of a MapR client corresponding to each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native. For example, the library for Windows is \lib\native\MapRClient.dll in the MapR client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.

To create a Hadoop connection in the Repository, do the following:

  1. In the Repository tree view of your Studio, expand Metadata and then right-click Hadoop cluster.

  2. Select Create Hadoop cluster from the contextual menu to open the [Hadoop cluster connection] wizard.

  3. Fill in the generic information about this connection, such as its Name and Description, and click Next to open the wizard that helps you import a ready-for-use configuration, if any.

  4. In the Distribution area, select the Hadoop distribution to be used and its version.

    From the Distribution list, you can select Custom to connect to a Hadoop distribution not officially supported by the Studio. For an example illustrating how to use the Custom option, see Connecting to a custom Hadoop distribution.

    Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy, due to the wide range of different Hadoop distributions and versions that are available. As such, you should only attempt to set up such a connection if you have sufficient Hadoop experience to handle any issues on your own.

    With the Custom option, the Authentication list appears. Select the appropriate authentication mode required by the Hadoop distribution to be connected to.

  5. Select how you want to set up the configuration from this import wizard.

    • Retrieve configuration from Ambari or Cloudera: depending on the distribution version you have selected, this option opens the corresponding wizard to set up the connection to Hortonworks Ambari or Cloudera Manager and import the configuration information into the Studio.

      For further information, see Retrieving configuration from Ambari or Cloudera.

    • Import configuration from local files: when you have obtained the configuration files (mainly the *-site.xml files), for example from the administrator of the Hadoop cluster or downloaded directly from the Web-based cluster management service, use this option to import the properties directly from those files. A hypothetical sample of such a file is shown after this list.

      For further information, see Importing configuration from local files.

    • Enter manually Hadoop services: with this option, you manually enter the configuration information in the corresponding wizard to create the connection to the Hadoop cluster to be used.

      For further information, see Manually entering the Hadoop configuration.
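
For reference, the *-site.xml files mentioned above are plain Hadoop XML configuration files that list property name/value pairs. The following is a minimal, hypothetical sketch of such a file; the host name and port are placeholders, and the actual values must come from your own cluster:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical core-site.xml: fs.defaultFS points to the NameNode. -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://machine1:8020</value>
      </property>
    </configuration>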

Retrieving configuration from Ambari or Cloudera

If you are able to access the Web-based management service of your cluster, that is to say, Ambari for Hortonworks or Cloudera Manager for Cloudera, select the Retrieve configuration from Ambari or Cloudera option to import the configuration information directly from that management service.

This image shows an example of this wizard for configuration retrieval.

From this wizard, do the following:

  1. In the area for the credentials, enter the authentication information used to log in to the Web-based management service of the cluster to be used. In this example, the connection is to Cloudera Manager.

  2. If the certificate system has been set up for the management service you need to connect to, select the Use authentication check box to activate the related fields, and then complete them using your TrustStore file.

    If you do not have this TrustStore file at hand, contact the administrator of the cluster.

    Both Hortonworks and Cloudera provide security-related information about their Web-based management services in their documentation. You can find more details on their respective documentation websites.

  3. Click the Connect button to create the connection from the Studio to Ambari or Cloudera Manager.

    The name of the cluster managed by this management service is then displayed in the Discovered clusters list.

  4. Click the Fetch button to retrieve and list the configurations of the services of this cluster in this wizard.

  5. Select the services for which you want to import the configuration information.

  6. Click Finish.

    Then the relevant configuration information is automatically filled in the next step of the [Hadoop cluster connection] wizard.

  7. In this [Hadoop cluster connection] wizard, verify that the Use custom Hadoop configurations check box is selected to ensure that the entire configuration you have imported is taken into account. If you clear this check box, the Studio instead uses its default Hadoop configuration (in the form of a jar file) to complement the parameters you have explicitly set in this wizard.

    For this reason, it is important to select this check box to make your custom configuration override the default one.

  8. Click Finish to validate the changes.

If you need more details about the auto-completed fields in this [Hadoop cluster connection] wizard, see Manually entering the Hadoop configuration.

Importing configuration from local files

Once you have selected Import configuration from local files in the import wizard, the following wizard is opened to help you select the Hadoop configuration files (mainly the *-site.xml files) to be used from the local machine.

From this wizard, proceed as follows:

  1. Click Browse... to select the folder in which the local configuration files to be used are stored, and click OK to list the configurations in this wizard.

    It is recommended to store these configuration files under a short access path on the local machine.

    The following image shows some files used for the configuration of HDFS, MapReduce and YARN in Cloudera. These files are automatically generated by, and downloaded from, Cloudera Manager; a typical file set is listed after these steps.

  2. From the configuration list, select the configurations to be imported, for example, those for HDFS and MAPREDUCE2, and click Finish.

    Then the relevant configuration information is automatically filled in the next step of the [Hadoop cluster connection] wizard.

  3. In this [Hadoop cluster connection] wizard, verify that the Use custom Hadoop configurations check box is selected to ensure that the entire configuration you have imported is taken into account. If you clear this check box, the Studio instead uses its default Hadoop configuration (in the form of a jar file) to complement the parameters you have explicitly set in this wizard.

    For this reason, it is important to select this check box to make your custom configuration override the default one.

  4. Click Finish to validate the changes.
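
As an indication, the set of client configuration files generated by Cloudera Manager typically includes files such as the following; the exact set depends on the services running on your cluster:

    core-site.xml
    hdfs-site.xml
    mapred-site.xml
    yarn-site.xml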

If you need more details about the auto-completed fields in this [Hadoop cluster connection] wizard, see Manually entering the Hadoop configuration.

Manually entering the Hadoop configuration

Even though importing a given Hadoop configuration is usually the most efficient approach, you can, if need be, select Enter manually Hadoop services to enter the parameters directly in the [Hadoop cluster connection] wizard.

From this wizard, do the following:

  1. Fill in the fields that become activated depending on the version information you have selected. Among these fields, the NameNode URI field and the JobTracker URI field (or the Resource Manager field) are automatically filled with the default syntax and port number corresponding to the selected distribution; you only need to update the parts that differ in the configuration of the Hadoop cluster to be used. For further information about these fields, see the following list. Hypothetical configuration-file extracts relating several of these fields to the underlying Hadoop properties are shown after the list.

    Those fields may be:

    • Namenode URI:

      Enter the URI pointing to the machine used as the NameNode of the Hadoop distribution to be used.

      The NameNode is the master node of a Hadoop system. For example, assuming that you have chosen a machine called machine1 as the NameNode of an Apache Hadoop distribution, the location to be entered is hdfs://machine1:portnumber.

      If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; the MapR client will then take care of the rest on the fly when creating the connection. The MapR client must be properly installed. For further information about how to set up a MapR client, see the following link in MapR's documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client.

    • Resource Manager:

      Enter the URI pointing to the machine used as the Resource Manager service of the Hadoop distribution to be used.

      Note that in some older Hadoop distribution versions, you need to set the location of the JobTracker service instead of the Resource Manager service.

      You then also need to set the addresses of the related services, such as the address of the ResourceManager scheduler. When you use this connection in a Big Data relevant component such as tHiveConnection, you will be able to allocate memory to the Map and Reduce computations and to the ApplicationMaster of YARN in the Advanced settings view. For further information about the Resource Manager, its scheduler and the ApplicationMaster, see the documentation about YARN for your distribution, such as

      http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.

      Note

      In order to make the host name of the Hadoop server recognizable by the client and the host computers, you have to establish an IP address/hostname mapping entry for that host name in the hosts files of both the client and the host computers. For example, if the host name of the Hadoop server is talend-all-hdp and its IP address is 192.168.x.x, then the mapping entry reads 192.168.x.x talend-all-hdp. On a Windows system, you need to add the entry to the file C:\WINDOWS\system32\drivers\etc\hosts (assuming Windows is installed on drive C). On a Linux system, you need to add the entry to the file /etc/hosts.

    • Job history:

      Enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

    • Staging directory:

      Enter the directory defined in your Hadoop cluster for the temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • Use datanode hostname:

      Select this check box to allow the Job to access datanodes via their hostnames. This actually sets the dfs.client.use.datanode.hostname property to true. If this connection is going to be used by a Job connecting to an S3N filesystem, you must select this check box.

    • Enable Kerberos security:

      If you are accessing a Hadoop distribution running with Kerberos security, select this check box, and then enter the Kerberos principal name for the NameNode in the field that is activated. This enables you to use your user name to authenticate against the credentials stored in Kerberos.

      In addition, since this connection performs MapReduce computations, you also need to authenticate to the related services, such as the Job history server and the Resource manager or JobTracker depending on your distribution, in the corresponding fields. These principals can be found in the configuration files of your distribution; hypothetical examples are shown after this list. For example, in a CDH4 distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

      If you need to use a keytab file to log in, select the Use a keytab to authenticate check box. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and in the Keytab field, browse to the keytab file to be used.

      Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

    • User name:

      Enter the user authentication name of the Hadoop distribution to be used.

      If you leave this field empty, the Studio will use the login name of the client machine you are working on to access that Hadoop distribution. For example, if you are using the Studio on a Windows machine and your login name is Company, then the authentication name used at runtime will be Company.

    • Group:

      Enter the group name to which the authenticated user belongs.

      Note that whether this field is activated depends on the distribution you are using.

    • Hadoop properties:

      If you need to use a custom configuration for the Hadoop distribution to be used, click the [...] button to open the properties table and add the property or properties to be customized. At runtime, these changes will override the corresponding default properties used by the Studio for its Hadoop engine. An example of such a property is included in the hypothetical extracts after this list.

      Note that the properties set in this table are inherited and reused by the child connections you will be able to create based on this current Hadoop connection.

      For further information about the properties of Hadoop, see Apache's Hadoop documentation on http://hadoop.apache.org/docs/current/, or the documentation of the Hadoop distribution you need to use. For example, the following page lists some of the default Hadoop properties: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml.

      For further information about how to leverage this properties table, see Setting reusable Hadoop properties.

    • When the distribution to be used is Microsoft HD Insight, you need to set the WebHCat configuration, the HDInsight configuration and the Windows Azure Storage configuration instead of the parameters mentioned above. Apart from the authentication information you need to provide in these configuration areas, you also need to set the following parameters:

      • In the Job result folder field, enter the location in which you want to store the execution result of a Talend Job in the Azure Storage to be used.

      • In the Deployment Blob field, enter the location in which you want to store a Talend Job and its dependent libraries in this Azure Storage account.

      A demonstration video about how to configure this connection is available at the following link: https://www.youtube.com/watch?v=A3QTT6VsNoM.
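
    To relate these wizard fields to the underlying Hadoop properties, the following extracts show where such values typically live in a cluster's configuration files. These extracts are hypothetical sketches: all host names, ports and paths are placeholders, and the actual values must be taken from your own cluster's files.

      <!-- Hypothetical extract of core-site.xml: the NameNode URI. -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://machine1:8020</value>
      </property>

      <!-- Hypothetical extract of yarn-site.xml: the Resource Manager and its scheduler. -->
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>machine1:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>machine1:8030</value>
      </property>

      <!-- Hypothetical extract of mapred-site.xml: the JobHistory server and the staging directory. -->
      <property>
        <name>mapreduce.jobhistory.address</name>
        <value>machine1:10020</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/tmp/hadoop-yarn/staging</value>
      </property>

      <!-- Hypothetical extract of hdfs-site.xml: access datanodes via their hostnames. -->
      <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
      </property>

      <!-- Hypothetical example of a property you might override through the Hadoop properties table. -->
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>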
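
    Similarly, when Kerberos security is enabled, the service principals discussed above can typically be found in the same configuration files. The property names below are standard Hadoop properties, but the principal values are placeholders; replace the instance and realm parts with those of your own cluster.

      <!-- Hypothetical extract of hdfs-site.xml: the NameNode principal. -->
      <property>
        <name>dfs.namenode.kerberos.principal</name>
        <value>nn/_HOST@EXAMPLE.COM</value>
      </property>

      <!-- Hypothetical extract of yarn-site.xml: the Resource Manager principal. -->
      <property>
        <name>yarn.resourcemanager.principal</name>
        <value>rm/_HOST@EXAMPLE.COM</value>
      </property>

      <!-- Hypothetical extract of mapred-site.xml: the Job history principal. -->
      <property>
        <name>mapreduce.jobhistory.principal</name>
        <value>jhs/_HOST@EXAMPLE.COM</value>
      </property>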

  2. For each distribution officially supported by Talend, a default Hadoop configuration (in the form of a jar file) is automatically loaded by the Studio to complement the parameters you have explicitly set in this wizard.

    If you need to use your custom configuration to replace the default one, select the Use custom Hadoop configurations check box and then click the [...] button to open the import wizard and import the configuration from Ambari or Cloudera Manager or from local files.

    Note that this import overwrites only the default Hadoop configuration used by the Studio but does not overwrite the parameters you have defined in this [Hadoop cluster connection] wizard.

    For further information about this import, see Retrieving configuration from Ambari or Cloudera and Importing configuration from local files.

  3. Click the Check services button to verify that the Studio can connect to the NameNode and the JobTracker or ResourceManager services you have specified in this wizard.

    A dialog box pops up to indicate the checking process and the connection status. If it shows that the connection fails, you need to review and update the connection information you have defined in the connection wizard.

  4. Click Finish to validate your changes and close the wizard.

    The newly set-up Hadoop connection is displayed under the Hadoop cluster folder in the Repository tree view. This connection has no sub-folders until you create connections to an element under that Hadoop distribution.

Connecting to a custom Hadoop distribution

When you select the Custom option from the Distribution drop-down list mentioned above, you are connecting to a Hadoop distribution different from any of the Hadoop distributions provided on that Distribution list in the Studio.

After selecting this Custom option, click the button to display the [Import custom definition] dialog box and proceed as follows:

Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy, due to the wide range of different Hadoop distributions and versions that are available. As such, you should only attempt to set up such a connection if you have sufficient Hadoop experience to handle any issues on your own.

  1. Depending on your situation, select Import from existing version or Import from zip to configure the custom Hadoop distribution to be connected to.

    • If you have the configuration zip file of the custom Hadoop distribution you need to connect to, select Import from zip. In Talend Exchange, members of the Talend community have shared ready-for-use configuration zip files, which you can download from the Hadoop configuration list there and use directly in your connection. However, because of the ongoing evolution of the different Hadoop-related projects, you might not be able to find the configuration zip corresponding to your distribution in this list; in that case, it is recommended to use the Import from existing version option to take an existing distribution as a base and add the jars required by your distribution.

      Note that the zip files are only configuration files and cannot be installed directly from Talend Exchange.

    • Otherwise, select Import from existing version to import an officially supported Hadoop distribution as a base, and then customize it by following the wizard. This approach requires knowledge of the configuration of the Hadoop distribution to be used.

    Note that the check boxes in the wizard allow you to select the Hadoop element(s) you need to import. Not all of the check boxes are always displayed; which ones appear depends on the context in which you are creating the connection. For example, if you are creating this connection for Oozie, only the Oozie check box appears.

  2. Whether you have selected Import from existing version or Import from zip, verify that the check box next to each Hadoop element you need to import is selected.

  3. Click OK and then, in the pop-up warning, click Yes to accept overwriting any custom setup of jar files previously implemented.

    Once done, the [Custom Hadoop version definition] dialog box becomes active.

    This dialog box lists the Hadoop elements and their jar files you are importing.

  4. If you have selected Import from zip, click OK to validate the imported configuration.

    If you have selected Import from existing version as a base, you may still need to add more jar files to customize that version. In that case, from the tab of the Hadoop element you need to customize, for example the HDFS/HCatalog/Oozie tab, click the [+] button to open the [Select libraries] dialog box.

  5. Select the External libraries option to open its view.

  6. Browse to and select any jar file you need to import.

  7. Click OK to validate the changes and to close the [Select libraries] dialog box.

    Once done, the selected jar file appears on the list in the tab of the Hadoop element being configured.

    Note that if you need to share the custom Hadoop setup with another Studio, you can export this custom connection from the [Custom Hadoop version definition] window using the button.

  8. In the [Custom Hadoop version definition] dialog box, click OK to validate the customized configuration. This brings you back to the configuration view in which you have selected the Custom option.

Now that the configuration of the custom Hadoop version has been set up and you are back to the Hadoop connection configuration view, you can continue to enter the other parameters required by the connection.

If the custom Hadoop version you need to connect to contains YARN and you want to use it, select the Use YARN check box next to the Distribution list.