Scenario: creating a partitioned Hive table - 6.1

Talend Components Reference Guide


This scenario illustrates how to use tHiveConnection, tHiveCreateTable and tHiveLoad to create a partitioned Hive table and write data into it.

Note that tHiveCreateTable and tHiveLoad are available only when you are using one of the Talend solutions with Big Data.

The sample data to be used in this scenario is employee information from a company and reads as follows:

1;Lyndon;Fillmore;21-05-2008
2;Ronald;McKinley;15-08-2008
3;Ulysses;Roosevelt;05-10-2008
4;Harry;Harrison;23-11-2007
5;Lyndon;Garfield;19-07-2007
6;James;Quincy;15-07-2008
7;Chester;Jackson;26-02-2008
8;Dwight;McKinley;16-07-2008
9;Jimmy;Johnson;23-12-2007
10;Herbert;Fillmore;03-04-2008
				

The information contains employees' names and the dates on which they were registered in an HR system. Since these employees work for the US subsidiary of the company, you will create a US partition for this sample data.

Before starting to replicate this scenario, ensure that you have appropriate rights and permissions to access the Hive database to be used.

Note that if you are using the Windows operating system, you have to create a tmp folder at the root of the disk where the Studio is installed.

Then proceed as follows:

Linking the components

  1. In the Integration perspective of the Studio, create an empty Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Job, see the chapter describing how to design a Job in the Talend Studio User Guide.

  2. Drop tHiveConnection, tHiveCreateTable and tHiveLoad onto the workspace.

  3. Connect them using the Trigger > On Subjob OK link.

Configuring the connection to Hive

Configuring tHiveConnection

  1. Double-click tHiveConnection to open its Component view.

  2. From the Property type list, select Built-in. If you have created the connection to be used in the Repository, then select Repository, click the [...] button to open the [Repository content] dialog box and select that connection. This way, the Studio will reuse that set of connection information for this Job.

    For further information about how to create a Hadoop connection in the Repository, see the chapter describing the Hadoop cluster node in the Talend Big Data Getting Started Guide.

  3. In the Version area, select the Hadoop distribution to be used and its version. If you cannot find the distribution corresponding to yours in the list, select Custom to connect to a Hadoop distribution that is not officially supported in the Studio.

    For a step-by-step example about how to use this Custom option, see Connecting to a custom Hadoop distribution.

  4. In the Connection area, enter the connection parameters to the Hive database to be used.

  5. In the Name node field, enter the location of the master node, the NameNode, of the distribution to be used. For example, hdfs://talend-hdp-all:8020.

  6. In the Job tracker field, enter the location of the JobTracker of your distribution. For example, talend-hdp-all:50300.

    Note that the word Job in the term JobTracker refers to the MR or MapReduce jobs described in Apache's documentation at http://hadoop.apache.org/.

Creating the Hive table

Defining the schema

  1. Double-click tHiveCreateTable to open its Component view.

  2. Select the Use an existing connection check box and, from the Component list, select the connection configured in the tHiveConnection component you are using for this Job.

  3. Click the [...] button next to Edit schema to open the schema editor.

  4. Click the [+] button four times to add four rows and, in the Column column, rename them to Id, FirstName, LastName and Reg_date, respectively.

    Note that you cannot use Hive reserved keywords, such as location or date, to name the columns.

  5. In the Type column, select the type of the data in each column. In this scenario, Id is of the Integer type, Reg_date is of the Date type and the others are of the String type.

  6. In the DB type column, select the Hive type of each column corresponding to the data type you have defined. For example, Id is of the INT type and Reg_date is of the TIMESTAMP type.

  7. In the Data pattern column, define the pattern corresponding to that of the raw data. In this example, use the default one.

  8. Click OK to validate these changes.

Defining the table settings

  1. In the Table name field, enter the name of the Hive table to be created. In this scenario, it is employees.

  2. From the Action on table list, select Create table if not exists.

  3. From the Format list, select the data format for which this Hive table is created. In this scenario, it is TEXTFILE.

  4. Select the Set partitions check box to add the US partition as explained at the beginning of this scenario. To define this partition, click the [...] button next to Edit schema that appears and define a country column of the String type.

  5. Leave the Set file location check box clear to use the default path for the Hive table.

  6. Select the Set Delimited row format check box to display the available options of row format.

  7. Select the Field check box and enter a semicolon (;) as field separator in the field that appears.

  8. Select the Line check box and leave the default value as line separator.
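
Taken together, the schema and the table settings defined above correspond roughly to the HiveQL statement below. This is only an illustrative sketch; the exact statement that the Studio generates from this configuration may differ slightly.

CREATE TABLE IF NOT EXISTS employees (
  Id INT,
  FirstName STRING,
  LastName STRING,
  Reg_date TIMESTAMP
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ';'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Note that the partition column country is not part of the four-column schema itself; Hive stores each partition as a separate directory under the table's location.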

Writing data to the table

Configuring tHiveLoad

  1. Double-click tHiveLoad to open its Component view.

  2. Select the Use an existing connection check box and, from the Component list, select the connection configured in the tHiveConnection component you are using for this Job.

  3. From the Load action list, select LOAD to load the data from the file holding the sample data presented at the beginning of this scenario.

  4. In the File path field, enter the directory where the sample data is stored. In this example, the data is stored in the HDFS system to be used. In real-world practice, you can use tHDFSOutput to write data into the HDFS system, and you need to ensure that the Hive application has the appropriate rights and permissions to read or even move the data.

    For further information about tHDFSOutput, see tHDFSOutput; for further information about the related rights and permissions, see the documentation or contact the administrator of the Hadoop cluster to be used.

    Note that if you need to read data from a local file system rather than from HDFS, ensure that the data to be read is stored in the local file system of the machine on which the Job is run, and then select the Local check box in this Basic settings view. For example, when the connection mode to Hive is Standalone, the Job is run on the machine where the Hive application is installed, and the data should therefore be stored on that machine.

  5. In the Table name field, enter the name of the target table you need to load data in. In this scenario, it is employees.

  6. From the Action on file list, select APPEND.

  7. Select the Set partitions check box and in the field that appears, enter the partition you need to add data to. In this scenario, this partition is country='US'.
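
The tHiveLoad configuration above is roughly equivalent to the HiveQL statement below. The HDFS path shown is hypothetical and only stands for wherever the sample file is actually stored.

-- /user/talend/employees_us.txt is a hypothetical path used for illustration;
-- replace it with the actual location of the sample file in HDFS.
LOAD DATA INPATH '/user/talend/employees_us.txt'
INTO TABLE employees
PARTITION (country='US');

If the Local check box is selected instead, the equivalent statement uses LOAD DATA LOCAL INPATH, which copies the file from the local file system rather than moving it within HDFS. This move semantics is also why, in the non-local case, the Hive application needs the permissions to move the data, as mentioned above.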

Executing the Job

Then you can press F6 to run this Job.

Once done, the Run view is opened automatically, where you can check the execution process.

You can also verify the results in the web console of the Hadoop distribution being used.
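
If you have access to the Hive shell or Beeline, you can also check the new partition and its content directly with standard HiveQL statements, for example:

SHOW PARTITIONS employees;
SELECT * FROM employees WHERE country='US' LIMIT 10;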

If you need to obtain more details about the Job, it is recommended to use the web console of the JobTracker provided by the Hadoop distribution you are using.