Centralizing Hive metadata

Talend Open Studio for Big Data User Guide


If you often need to use a database table from Hive, then you may want to centralize the connection information to the Hive database and the table schema details in the Metadata folder in the Repository tree view.

Although you can still do this from the Db connection node, creating the connection from the Hadoop cluster node is the recommended alternative, because it makes better use of the centralized connection properties defined for a given Hadoop distribution.

Prerequisites:

  • Launch the Hadoop distribution you need to use and ensure that you have the proper access permissions to that distribution and its Hive database.

  • Create the connection to that Hadoop distribution from the Hadoop cluster node. For further information, see Centralizing a Hadoop connection.

If you want to use MapR as the distribution and MapR 2.0.0 or MapR 2.1.2 as the Hive version, do the following before setting up the Hive connection:

  1. Add the MapR client path, which varies depending on your operating system (for Windows, it is -Djava.library.path=maprclientpath\lib\native\Windows_7-amd64-64), to the corresponding .ini file of Talend Studio, for example Talend-Studio-win-x86_64.ini, as shown in the sample after this list.

  2. For MapR 2.0.0, install the module maprfs-0.1.jar.

    For MapR 2.1.2, install the modules maprfs-0.20.2-2.1.2.jar and maprfs-jni-0.20.2-2.1.2.jar.

  3. Restart the Studio to validate your changes.

For more information about how to install modules, see the description about identifying and installing external modules in the Talend Installation and Upgrade Guide.
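
As an illustration, the end of Talend-Studio-win-x86_64.ini might look like the following on a 64-bit Windows machine once the path has been added. The C:\opt\mapr location is a hypothetical example to be replaced with your actual MapR client path, and the existing -vmargs entries vary with your installation:

    -vmargs
    -Xms512m
    -Xmx1536m
    -Djava.library.path=C:\opt\mapr\lib\native\Windows_7-amd64-64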

Creating a connection to Hive

  1. Expand the Hadoop cluster node under the Metadata node of the Repository tree, right-click the Hadoop connection to be used and select Create Hive from the contextual menu.

  2. In the connection wizard that opens up, fill in the generic properties of the connection you need to create, such as Name, Purpose and Description. The Status field is a customized field that you can define in File > Edit project properties.

  3. Click Next to proceed to the next step, which requires you to fill in the Hive connection details. Among them, DB Type, Hadoop cluster, Distribution, Version, Server, NameNode URL and JobTracker URL are automatically pre-filled with the properties inherited from the Hadoop connection you selected in the previous steps.

    Note that if you choose None from the Hadoop cluster list, you switch to a manual mode: the inherited properties are discarded and you have to configure every property yourself. As a result, the created Hive connection appears under the Db connection node only.

    The properties to be set vary depending on the Hadoop distribution you connect to.
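
    For reference, the inherited values typically take the following form. The host names below are hypothetical examples and the ports are common defaults; both depend entirely on your cluster:

      NameNode URL:   hdfs://namenode.example.com:8020
      JobTracker URL: jobtracker.example.com:8021 (on YARN, the Resource Manager address instead, for example resourcemanager.example.com:8032)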

  4. In the Version info area, select the model of the Hive database you want to connect to. Some Hadoop distributions allow the choice between the Embedded model and the Standalone model, while others provide only one of them.

    Depending on the distribution you select, you can also select Hive Server2 from the Hive Server version list; it supports concurrent connections from multiple clients better than Hive Server1 does. For further information about Hive Server2, see https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2.
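
    Under the hood, the two server versions are reached through different JDBC URLs and driver classes, which the Studio builds for you. The following Java sketch only illustrates that difference; the host, port and database names are hypothetical examples:

      import java.sql.Connection;
      import java.sql.DriverManager;

      public class HiveServerUrlSketch {
          public static void main(String[] args) throws Exception {
              // Hive Server1 uses the "hive" JDBC scheme and the older driver class.
              Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
              Connection hs1 = DriverManager.getConnection(
                      "jdbc:hive://hive-host.example.com:10000/default");

              // Hive Server2 uses the "hive2" scheme and handles concurrent clients better.
              Class.forName("org.apache.hive.jdbc.HiveDriver");
              Connection hs2 = DriverManager.getConnection(
                      "jdbc:hive2://hive-host.example.com:10000/default");
          }
      }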

  5. Fill in the fields that appear depending on the Hive model you have selected.

    If you leave the Database field empty, selecting the Embedded model allows the Studio to automatically connect to all of the existing databases in Hive, while selecting the Standalone model enables the connection to the default Hive database only.

  6. If you are accessing a Hadoop distribution running with Kerberos security, select the Use Kerberos authentication check box. Then fill in the following fields according to the configuration on the Hive server end:

    • Enter the Kerberos principal name in the Hive principal field that is then activated,

    • Enter the Metastore database URL in the Metastore URL field,

    • Click the [...] button next to the Driver jar field and browse to the Metastore database driver JAR file,

    • Click the [...] button next to the Driver class field, and select the right class, and

    • Enter the user name and password in the Username and Password fields.

    If you need to use a keytab file to log in, select the Use a keytab to authenticate check box, enter the principal to be used in the Principal field and in the Keytab field, browse to the keytab file to be used.

    A keytab file contains pairs of Kerberos principals and encrypted keys. Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.
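
    For background, the keytab login that the Studio performs on your behalf is roughly equivalent to the following call to the Hadoop security API. The principal and the keytab path are hypothetical examples:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.security.UserGroupInformation;

      public class KeytabLoginSketch {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              conf.set("hadoop.security.authentication", "kerberos"); // switch to Kerberos
              UserGroupInformation.setConfiguration(conf);
              // The OS user running the Job does not have to match the principal,
              // but it must be able to read the keytab file.
              UserGroupInformation.loginUserFromKeytab(
                      "guest@EXAMPLE.COM",       // hypothetical principal
                      "/path/to/guest.keytab");  // hypothetical keytab location
          }
      }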

  7. If you are using one of the following distributions, you can select Tez as the framework of your Jobs from the Execution engine list.

    • Hortonworks: V2.1 and V2.2.

    • MapR: V4.0.1.

    • Custom: this option allows you to connect to a distribution supporting Tez but not officially supported by Talend.

    Then when you reuse this connection in a Hive component, you need to configure the access to the Tez libraries via that component's Advanced settings view. For further information, see the Hive related documentation in Talend Open Studio for Big Data Components Reference Guide, such as tHiveConnection.
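
    For reference, this engine switch maps to Hive's hive.execution.engine property, which accepts mr (MapReduce, the default) or tez; a connection that uses Tez effectively runs Hive with:

      hive.execution.engine=tez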

  8. If you need to use a custom configuration for the Hadoop or Hive distribution to be used, click the [...] button next to Hadoop properties or Hive properties accordingly to open the corresponding properties table and add the property or properties to be customized. Then at runtime, these changes will override the corresponding default properties used by the Studio for its Hadoop engine.

    For further information about the properties of Hadoop, see Apache's Hadoop documentation on http://hadoop.apache.org/docs/current/, or the documentation of the Hadoop distribution you need to use. For example, the following page lists some of the default Hadoop properties: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml.

    For further information about the properties of Hive, see Apache's documentation for Hive. For example, the following page describes some of the Hive configuration properties: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties.

    For further information about how to leverage these properties tables, see Setting reusable Hadoop properties.
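
    For example, entries such as the following could be added to the respective tables. These particular properties exist in stock Hadoop and Hive, but the values shown are illustrative only and must match your cluster:

      Hadoop properties:
        dfs.replication = 1
        mapreduce.map.memory.mb = 2048

      Hive properties:
        hive.exec.dynamic.partition = true
        hive.exec.parallel = true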

  9. Click the Check button to verify that your connection is successful.

  10. If needed, define the properties of the database in the corresponding fields in the Database Properties area.

  11. Click Finish to validate your changes and close the wizard.

    The created connection to the specified Hive database displays under the DB Connections folder in the Repository tree view. This connection has four sub-folders; among them, Table schema groups all schemas relative to this connection.

    If you need to use an environmental context to define the parameters of this connection, click the Export as context button to open the corresponding wizard and make the choice from the following options:

    • Create a new repository context: create this environmental context out of the current connection, that is to say, the parameters set in the wizard are taken as context variables with the values you have given to these parameters.

    • Reuse an existing repository context: use the variables of a given environmental context to configure the current connection.

    If you need to cancel the use of the context, click Revert context. The values of the context variables being used are then written directly into the wizard fields.

    For a step-by-step example about how to use this Export as context feature, see Exporting metadata as context and reusing context parameters to set up a connection.
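
    For illustration, once a parameter such as the Hive server host is exported as a context variable, the Java code generated for your Jobs references it through Talend's context object rather than as a hard-coded value. The variable name below is a hypothetical example that depends on what you define in the wizard:

      // Hard-coded value before export:  "hive-host.example.com"
      // Context variable after export:   context.HiveConnection_Server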

  12. Right-click the Hive connection you created and then select Retrieve Schema to extract all schemas in the defined Hive database.

Retrieving a Hive table schema

In this step, you will retrieve the table schema of interest from the connected Hive database.

  1. In the Repository view, right-click the Hive connection of interest and select Retrieve schema from the contextual menu, and click Next on the wizard that opens to view and filter different tables in that Hive database.

  2. Expand the nodes of the database tables you need to use and select the columns to be retrieved, and click Next to open a new view on the wizard that lists the selected table schema(s). You can select any of them to display its details in the Schema area on the right side of the wizard.

    Warning

    If your source database table contains any default value that is a function or an expression rather than a string, be sure to remove the single quotation marks, if any, enclosing the default value in the end schema to avoid unexpected results when creating database tables using this schema.

    For more information, see https://help.talend.com/display/KB/Verifying+default+values+in+a+retrieved+schema.
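
    For example, if a retrieved column carries the default value 'CURRENT_TIMESTAMP', the quotation marks make it a literal string in a generated CREATE TABLE statement, whereas the unquoted form is evaluated as a function:

      Default as retrieved:  'CURRENT_TIMESTAMP'   (inserted as the literal string)
      Default corrected:     CURRENT_TIMESTAMP     (evaluated at insert time)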

  3. Modify the selected schema if needed. You can rename the schema, and customize the schema structure according to your needs in the Schema area.

    The toolbar allows you to add, remove or move columns in your schema.

    To discard your modifications and restore the default schema of the selected table, click Retrieve schema. Note that all your changes to the schema will be lost if you click this button.

  4. Click Finish to complete the Hive table schema retrieval. All the retrieved schemas are displayed under the related Hive connection in the Repository view.

    If you need to further edit a schema, right-click the schema and select Edit Schema from the contextual menu to open this wizard again and make your modifications.

    Warning

    If you modify the schemas, ensure that the data type in the Type column is correctly defined.

As explained earlier, in addition to using the Hadoop cluster node, you can also start from the Db connection node to create a Hive connection and retrieve schemas. Either way, you always need to define the specific Hive connection properties. At that step:

  • if you select from the Hadoop cluster list the Repository option to reuse details of an established Hadoop connection, the created Hive connection will eventually be classified under both the Hadoop cluster node and the Db connection node;

  • otherwise, if you select from the Hadoop cluster list the None option in order to enter the Hadoop connection properties yourself, the created Hive connection will appear under the Db connection node only.