Centralizing HCatalog metadata - 6.4

Talend Open Studio for Big Data User Guide


If you often need to use a table from HCatalog, a table and storage management layer for Hadoop, then you may want to centralize the connection information to a given HCatalog and the table schema details in the Metadata folder in the Repository tree view.

Prerequisites:

  • Launch the Hortonworks Hadoop distribution you need to use and ensure that you have the proper access permissions for that distribution and its HCatalog.

  • Create the connection to that Hadoop distribution from the Hadoop cluster node. For further information, see Centralizing a Hadoop connection.

Creating a connection to HCatalog

  1. Expand the Hadoop cluster node under the Metadata node in the Repository tree view, right-click the Hadoop connection to be used, and select Create HCatalog from the contextual menu.

  2. In the connection wizard that opens, fill in the generic properties of the connection you need to create, such as Name, Purpose and Description. The Status field is a customized field you can define in File > Edit project properties.

  3. Click Next when completed. The second step requires you to fill in the HCatalog connection data. Among the properties, Host name is automatically pre-filled with the value inherited from the Hadoop connection you selected in the first step. The Templeton Port and the Database fields use their default values.

    This database is actually a Hive database, and Templeton (WebHCat) is the REST-like web API that HCatalog uses to issue commands. For further information about Templeton (WebHCat), see Apache's documentation at https://cwiki.apache.org/confluence/display/Hive/WebHCat+UsingWebHCat.

    The Principal and the Realm fields are displayed only when the Hadoop connection you are using has Kerberos security enabled. They are the properties Kerberos requires to authenticate the HCatalog client and the HCatalog server to each other.

    Note

    To make the host name of the Hadoop server recognizable by the client and the host computers, you have to add an IP address/hostname mapping entry for that host name to the hosts files of the client and the host computers. For example, if the host name of the Hadoop server is talend-all-hdp and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-all-hdp. On Windows, add the entry to the file C:\WINDOWS\system32\drivers\etc\hosts (assuming Windows is installed on drive C). On Linux, add the entry to the file /etc/hosts.

  4. If necessary, change these default values to those of the port and the database used by the HCatalog you connect to.

  5. If required, enter the Principal and the Realm properties.

  6. If you need to use a custom configuration for the Hadoop or HCatalog distribution to be used, click the [...] button next to Hadoop properties to open the properties table and add the property or properties to be customized. At runtime, these changes override the corresponding default properties used by the Studio for its Hadoop engine.

    Note that a Parent Hadoop properties table is displayed above the properties table you are editing. This parent table is read-only and lists the Hadoop properties that have been defined in the wizard of the parent Hadoop connection on which the current HCatalog connection is based.

    For further information about the properties of Hadoop, see Apache's Hadoop documentation on http://hadoop.apache.org/docs/current/, or the documentation of the Hadoop distribution you need to use. For example, the following page lists some of the default Hadoop properties: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml.

    For further information about the properties of HCatalog, see Apache's documentation for HCatalog. For example, the following page describes some of the HCatalog configuration properties: https://cwiki.apache.org/confluence/display/Hive/HCatalog+Configuration+Properties.

    For further information about how to leverage this properties table, see Setting reusable Hadoop properties.

  7. Click Check to test the connection you have just defined. A message pops up to indicate whether the connection is successful.

  8. Click Finish to validate these changes.

    The created HCatalog connection is available under the Hadoop cluster node in the Repository tree view.

    Note

    This Repository view may vary depending on the edition of the Studio you are using.

    If you need to use an environmental context to define the parameters of this connection, click the Export as context button to open the corresponding wizard and make the choice from the following options:

    • Create a new repository context: create this environmental context out of the current Hadoop connection, that is to say, the parameters to be set in the wizard are taken as context variables with the values you have given to these parameters.

    • Reuse an existing repository context: use the variables of a given environmental context to configure the current connection.

    If you need to cancel the implementation of the context, click Revert context. Then the values of the context variables being used are directly put in this wizard.

    For a step-by-step example about how to use this Export as context feature, see Exporting metadata as context and reusing context parameters to set up a connection.

  9. Right-click the newly created connection and select Retrieve schema from the contextual menu to load the desired table schema from the established connection.
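
The connection wizard above collects a host name, a Templeton port, and a database. As a rough illustration of how those three values map onto WebHCat's REST interface, the following Python sketch builds the URLs of WebHCat's status and table-listing resources. The host name talend-all-hdp is the example from the note in step 3; the port 50111 (WebHCat's usual default), the database name default, the user name hdp_user, and the helper names are assumptions made for illustration. This is not necessarily how the Studio performs its own connection check.

```python
from urllib.parse import quote, urlencode


def webhcat_url(host, port, resource, user=None):
    """Build a WebHCat (Templeton) v1 URL for the given resource path."""
    url = f"http://{host}:{port}/templeton/v1/{resource}"
    if user:
        # WebHCat identifies the caller through the user.name query parameter.
        url += "?" + urlencode({"user.name": user})
    return url


def tables_resource(database):
    """Resource path that lists the tables of a Hive database."""
    return f"ddl/database/{quote(database, safe='')}/table"


# Assumed values, mirroring the fields of the connection wizard.
HOST = "talend-all-hdp"   # Host name (example from the note in step 3)
PORT = 50111              # Templeton Port (assumed default)
DATABASE = "default"      # Database (assumed default)

status_url = webhcat_url(HOST, PORT, "status")
tables_url = webhcat_url(HOST, PORT, tables_resource(DATABASE), user="hdp_user")

print(status_url)
print(tables_url)
```

Requesting the first URL with any HTTP client from the machine running the Studio is one way to confirm that the host name resolves and that the port is reachable before you click Check.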

Retrieving a table schema

  1. When you click Retrieve Schema, a new wizard opens up where you can filter and display different tables in the HCatalog.

  2. In the Name filter field, enter the name of the table or tables you are looking for to filter the list.

    Otherwise, you can directly find and select the table(s) of which you need to retrieve the schema(s).

    Each time the schema retrieval completes for a selected table, the Creation status of that table becomes Success.

  3. Click Next to open a new view on the wizard that lists the selected table schema(s). You can select any of them to display its details in the Schema area.

  4. Modify the selected schema if needed. You can change the name of the schema and, according to your needs, customize the schema structure in the Schema area.

    The toolbar allows you to add, remove or move columns in your schema.

    To discard the modifications you made to the selected schema and restore its default definition, click Retrieve schema. Note that this overwriting does not retain any custom edits.

  5. Click Finish to complete the HCatalog table schema creation. All the retrieved schemas are displayed under the relevant HCatalog connection node in the Repository view.

    If you later need to edit a schema, right-click this schema under the related HCatalog connection node in the Repository view and select Edit Schema from the contextual menu to open this wizard again and make the modifications.

    Note

    If you modify the schemas, ensure that the data type in the Type column is correctly defined.
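
One way to sanity-check the Type column after editing a schema is to compare each column against the Hive type of the underlying table. The Python sketch below assumes a hypothetical mapping from a few primitive Hive types to the Java types that would typically be chosen for them. The mapping, the function name, and the sample columns are all illustrative assumptions and do not represent the Studio's actual conversion rules.

```python
# Assumed mapping from primitive Hive types to Java types (illustrative only).
EXPECTED_TYPE = {
    "string": "String",
    "int": "Integer",
    "bigint": "Long",
    "double": "Double",
    "boolean": "Boolean",
}


def find_type_mismatches(hive_columns, edited_schema):
    """Return (column, hive_type, edited_type, expected_type) per mismatch.

    Columns whose Hive type is not in the assumed mapping, or that are
    missing from the edited schema, are skipped rather than reported.
    """
    mismatches = []
    for name, hive_type in hive_columns.items():
        expected = EXPECTED_TYPE.get(hive_type.lower())
        edited = edited_schema.get(name)
        if expected is not None and edited is not None and edited != expected:
            mismatches.append((name, hive_type, edited, expected))
    return mismatches


# Hypothetical table: column name -> Hive type.
hive_columns = {"id": "bigint", "name": "string", "active": "boolean"}
# Hypothetical edited schema: column name -> type chosen in the Type column.
edited_schema = {"id": "Integer", "name": "String", "active": "Boolean"}

mismatches = find_type_mismatches(hive_columns, edited_schema)
for column, hive_type, edited, expected in mismatches:
    print(f"{column}: Hive type {hive_type} is set to {edited}, expected {expected}")
```

In this made-up example only the id column is flagged, because a bigint value may not fit in an Integer.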