Centralizing HBase metadata - 6.4

Talend Big Data Platform Studio User Guide


If you often need to use a database table from HBase, then you may want to centralize the connection information to the HBase database and the table schema details in the Metadata folder in the Repository tree view.

Although you can also do this from the Db connections node, going through the Hadoop cluster node makes better use of the connection properties already centralized for a given Hadoop distribution.

Prerequisites:

  • Launch the Hadoop distribution you need to use and ensure that you have the proper access permission to that distribution and its HBase.

  • Create the connection to that Hadoop distribution from the Hadoop cluster node. For further information, see Centralizing a Hadoop connection.

Creating a connection to HBase

  1. Expand the Hadoop cluster node under the Metadata node of the Repository tree, right-click the Hadoop connection to be used and select Create HBase from the contextual menu.

  2. In the connection wizard that opens, fill in the generic properties of the connection you need to create, such as Name, Purpose and Description. The Status field is a customized field that you can define in File > Edit project properties.

  3. Click Next to proceed to the next step, which requires you to fill in the HBase connection details. Among them, DB Type, Hadoop cluster, Distribution, HBase version and Server are automatically pre-filled with the properties inherited from the Hadoop connection you selected in the previous steps.

    Note that if you select None from the Hadoop cluster list, you switch to a manual mode in which the inherited properties are discarded and you have to configure every property yourself. As a result, the created connection appears under the Db connection node only.

  4. In the Port field, fill in the port number of the HBase database to be connected to.

    Note

    To make the host name of the Hadoop server recognizable by the client and the host computers, you have to add an IP address/hostname mapping entry for that host name to the hosts files of the client and the host computers. For example, if the host name of the Hadoop server is talend-all-hdp and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-all-hdp. On Windows, add the entry to the file C:\WINDOWS\system32\drivers\etc\hosts (assuming Windows is installed on drive C). On Linux, add the entry to the file /etc/hosts.

  5. In the Column family field, enter the column family if you want to filter columns, and click Check to verify your connection.

  6. If you are accessing a Hadoop distribution running with Kerberos security, select the corresponding check box, then enter the Kerberos principal name for the NameNode in the field that is activated. This enables you to use your user name to authenticate against the credentials stored in Kerberos.

    If you need to use a keytab file to log in, select the Use a keytab to authenticate check box. A keytab file contains pairs of Kerberos principals and encrypted keys. Enter the principal to be used in the Principal field and, in the Keytab field, browse to the keytab file to be used.

    Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

  7. If you need to use custom configuration for the Hadoop or HBase distribution to be used, click the [...] button next to Hadoop properties to open the properties table and add the property or properties to be customized. Then at runtime, these changes will override the corresponding default properties used by the Studio for its Hadoop engine.

    Note that a Parent Hadoop properties table is displayed above the properties table you are editing. This parent table is read-only and lists the Hadoop properties that have been defined in the wizard of the parent Hadoop connection on which the current HBase connection is based.

    For further information about the properties of Hadoop, see Apache's Hadoop documentation on http://hadoop.apache.org/docs/current/, or the documentation of the Hadoop distribution you need to use. For example, the following page lists some of the default Hadoop properties: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml.

    For further information about the properties of HBase, see Apache's documentation for HBase. For example, the following page describes some of the HBase configuration properties: http://hbase.apache.org/book.html#_configuration_files.

    For further information about how to leverage this properties table, see Setting reusable Hadoop properties.

  8. Click Finish to validate the changes.

    The newly created HBase connection appears under the Hadoop cluster node of the Repository tree. In addition, as an HBase connection is a database connection, this new connection appears under the Db connections node, too.

    Note

    This Repository view may vary depending on the edition of the Studio you are using.
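
The hosts-file mapping described in the note of step 4 can be checked before you click Check in the wizard. The following is a minimal Python sketch, not part of the Studio: the function name find_host_mapping is hypothetical, and the sample text reuses the placeholder entry 192.168.x.x talend-all-hdp from the note rather than a real address. It only looks for a line that maps the given host name and does not validate the address itself.

```python
def find_host_mapping(hosts_text, hostname):
    """Return the address mapped to hostname in hosts-file text, or None.

    Hypothetical helper for illustration only; host names are compared
    case-insensitively and the address is not validated.
    """
    wanted = hostname.lower()
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if wanted in (name.lower() for name in names):
            return address
    return None


# Sample content that reuses the placeholder entry from the note above.
SAMPLE_HOSTS = """\
127.0.0.1 localhost
# Hadoop server used by the HBase connection
192.168.x.x talend-all-hdp
"""

mapped_address = find_host_mapping(SAMPLE_HOSTS, "talend-all-hdp")
missing_address = find_host_mapping(SAMPLE_HOSTS, "unknown-host")
print(mapped_address, missing_address)
```

To check a real machine, read the content of /etc/hosts (Linux) or C:\WINDOWS\system32\drivers\etc\hosts (Windows) and pass it as hosts_text. A result of None suggests that the mapping entry still has to be added.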

If you need to use an environmental context to define the parameters of this connection, click the Export as context button to open the corresponding wizard and make the choice from the following options:

  • Create a new repository context: create this environmental context out of the current connection, that is to say, the parameters set in the wizard become context variables that hold the values you have given to these parameters.

  • Reuse an existing repository context: use the variables of a given environmental context to configure the current connection.

If you need to cancel the implementation of the context, click Revert context. Then the values of the context variables being used are directly put in this wizard.

For a step-by-step example about how to use this Export as context feature, see Exporting metadata as context and reusing context parameters to set up a connection.
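
As a rough illustration of what Export as context and Revert context do conceptually, the Python sketch below turns connection parameters into context variables and back. It is not the Studio's implementation: the function names, the connection name hbase_conn, the variable naming scheme and the sample values (including port 2181) are assumptions made for this example. Only the context.<variable> reference form reflects how context variables are addressed in Jobs.

```python
def export_as_context(connection_name, params):
    """Turn connection parameters into context variables plus references.

    Returns (variables, references): variables holds the current values,
    references holds what each parameter field would contain afterwards.
    """
    variables = {}
    references = {}
    for key, value in params.items():
        var_name = f"{connection_name}_{key}"  # hypothetical naming scheme
        variables[var_name] = value
        references[key] = f"context.{var_name}"
    return variables, references


def revert_context(references, variables):
    """Put the context variable values back directly into the parameters."""
    prefix = "context."
    return {
        key: variables[ref[len(prefix):]]
        for key, ref in references.items()
    }


# Hypothetical sample values for an HBase connection.
params = {"Server": "talend-all-hdp", "Port": "2181"}

variables, references = export_as_context("hbase_conn", params)
restored = revert_context(references, variables)
print(references)
print(restored)
```

The point of the indirection is that the same connection definition can be reused across environments by switching the values of the context variables instead of editing the connection itself.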

Retrieving a table schema

Warning

If you are working on an SVN or Git managed project while the Manual lock option is selected in Talend Administration Center, be sure to manually lock your connection in the Repository tree view before retrieving or updating table schemas for it. Otherwise, the connection is read-only and the Finish button of the wizard is disabled.

For information on locking and unlocking a project item and on different lock types, see Working collaboratively on project items.

In this step, we will retrieve the table schema of interest from the connected HBase database.

  1. In the Repository view, right-click the newly created connection and select Retrieve schema from the contextual menu. Then click Next in the wizard that opens to view and filter the tables in the HBase database.

    You can define the number of columns to be displayed for each column family in the Limit field.

    If you want to set this limit for all the HBase/MapR-DB connection metadata to be defined in the Repository, set the limit in the HBase/MapR-DB scan limit field in Preferences > Talend > Performance.

  2. Expand the relevant database table and column family nodes and select the columns of interest, and click Next to open a new view on the wizard that lists the selected table schema(s). You can select any of them to display its details in the Schema area on the right side of the wizard.

    Warning

    If your source database table contains any default value that is a function or an expression rather than a string, be sure to remove the single quotation marks, if any, enclosing the default value in the end schema to avoid unexpected results when creating database tables using this schema.

  3. Modify the selected schema if needed. You can rename the schema, and customize the schema structure according to your needs in the Schema area.

    The tool bar allows you to add, remove or move columns in your schema.

    To discard the modifications you made to the selected schema and restore its default schema, click Retrieve schema. Note that all your changes to the schema are lost when you click this button.

  4. Click Finish to complete the HBase table schema creation. All the retrieved schemas are displayed under the related HBase connection in the Repository view.

    If you need to further edit a schema, right-click the schema and select Edit Schema from the contextual menu to open this wizard again and make your modifications.

    Warning

    If you modify the schemas, ensure that the data type in the Type column is correctly defined.
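
To make the effect of the Limit field in step 1 more concrete, the following Python sketch groups column names by column family and keeps at most a given number of columns per family. It is only an illustration of the idea: the function name and the sample column names are hypothetical, and how the Studio applies the limit internally is not described here. The only HBase fact it relies on is that a column is addressed as family:qualifier.

```python
def limit_columns_per_family(columns, limit):
    """Group 'family:qualifier' names by family, keeping at most `limit` each.

    Columns are kept in the order in which they are listed; any column
    beyond the limit for its family is dropped from the preview.
    """
    grouped = {}
    for name in columns:
        family, _, qualifier = name.partition(":")
        kept = grouped.setdefault(family, [])
        if len(kept) < limit:
            kept.append(qualifier)
    return grouped


# Hypothetical columns of an HBase table with two column families.
sample_columns = [
    "info:name",
    "info:email",
    "info:phone",
    "stats:visits",
    "stats:last_login",
]

preview = limit_columns_per_family(sample_columns, limit=2)
print(preview)
```

With a limit of 2, the third column of the info family is left out of the preview, while both columns of the stats family are kept.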

As explained earlier, apart from using the Hadoop cluster node, you can also create an HBase connection and retrieve schemas from the Db connection node. Either way, you always need to define the specific HBase connection properties. At that step:

  • if you select from the Hadoop cluster list the Repository option to reuse details of an established Hadoop connection, the created HBase connection will eventually be classified under both the Hadoop cluster node and the Db connection node;

  • otherwise, if you select from the Hadoop cluster list the None option in order to enter the Hadoop connection properties yourself, the created HBase connection will appear under the Db connection node only.