Scenario: Joining table columns and writing them into Hive - 6.1

This scenario uses a four-component Job to join the columns selected from two Hive tables and write them into another Hive table.

Preparing the Hive tables

  1. Create the Hive table you want to write data in. In this scenario, this table is named agg_result, and you can create it using the following statement in tHiveRow:

    create table agg_result (id int, name string, address string, sum1 string, postal string, state string, capital string, mostpopulouscity string) partitioned by (type string) row format delimited fields terminated by ';' location '/user/ychen/hive/table/agg_result'

    In this statement, '/user/ychen/hive/table/agg_result' is the directory used in this scenario to store this created table in HDFS. You need to replace it with the directory you want to use in your environment.

    For further information about tHiveRow, see tHiveRow.

  2. Create two input Hive tables containing the columns you want to join and aggregate into the output Hive table, agg_result. The statements to be used are:

    create table customer (id int, name string, address string, idState int, id2 int, regTime string, registerTime string, sum1 string, sum2 string) row format delimited fields terminated by ';' location '/user/ychen/hive/table/customer'

    and

    create table state_city (id int, postal string, state string, capital int, mostpopulouscity string) row format delimited fields terminated by ';' location '/user/ychen/hive/table/state_city'
  3. Use tHiveRow to load data into the two input tables, customer and state_city. The statements to be used are:

    "LOAD DATA LOCAL INPATH 'C:/tmp/customer.csv' OVERWRITE INTO TABLE customer"

    and

    "LOAD DATA LOCAL INPATH 'C:/tmp/State_City.csv' OVERWRITE INTO TABLE state_city"

    The two files, customer.csv and State_City.csv, are two local files we created for this scenario. You need to create your own files to provide data to the input Hive tables. The data schema of each file should be identical to that of its corresponding table. Once the data is loaded, you can verify it with the queries sketched after this list.

    You can use tRowGenerator and tFileOutputDelimited to create these two files easily. For further information about these two components, see tRowGenerator and tFileOutputDelimited.

    For further information about the Hive query language, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
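
Once the LOAD DATA statements have run, you can quickly confirm that the two input tables are populated. The following HiveQL statements, run for example through tHiveRow, are only a suggested check and not part of the original scenario:

-- Verify that the loaded rows are visible in Hive
SELECT * FROM customer LIMIT 5;
SELECT COUNT(*) FROM state_city;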

Linking the components

  1. In the Integration perspective of Talend Studio, create an empty Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. Drop two tELTHiveInput components, a tELTHiveMap component, and a tELTHiveOutput component onto the workspace.

  3. Connect them using the Row > Main link.

    Each time you connect two components, a wizard prompts you to name the link you are creating. This name must be the same as that of the Hive table you want the active component to process. In this scenario, the input tables handled by the two tELTHiveInput components are customer and state_city, and the output table handled by tELTHiveOutput is agg_result.

Configuring the input schemas

  1. Double-click the tELTHiveInput component using the customer link to open its Component view.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. Click the [+] button as many times as required to add columns and rename them to replicate the schema of the customer table we created earlier in Hive. If needed, you can read these column names and types back from Hive with the queries sketched after this list.

    To set up this schema, you can also use the customer schema that you have retrieved and stored in the Repository. For further information about how to set up a connection to Hive and retrieve and store the schema in the Repository, see Talend Studio User Guide.

  4. In the Default table name field, enter the name of the input table, customer, to be processed by this component.

  5. Double-click the other tELTHiveInput component using the state_city link to open its Component view.

  6. Click the [...] button next to Edit schema to open the schema editor.

  7. Click the [+] button as many times as required to add columns and rename them to replicate the schema of the state_city table we created earlier in Hive.

  8. In the Default table name field, enter the name of the input table, state_city, to be processed by this component.
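
If you are unsure of the exact column names and types to replicate in the schema editor, you can read them back from Hive. This is only a suggested HiveQL check, not a step in the original scenario:

-- List the columns and types of the two input tables
DESCRIBE customer;
DESCRIBE state_city;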

Mapping the input and the output schemas

Configuring the connection to Hive

  1. Click tELTHiveMap, then click the Component tab to open its Component view.

  2. In the Version area, select the Hadoop distribution you are using and the Hive version.

  3. In the Connection mode list, select the connection mode you want to use. If your distribution is Hortonworks, this mode is Embedded only.

  4. In the Host field and the Port field, enter the connection information the component needs to connect to Hive. For example, the host is talend-hdp-all and the port is 9083. A simple way to check these values is sketched after this list.

  5. Select the Set Jobtracker URI check box and enter the location of the Jobtracker. For example, talend-hdp-all:50300.

  6. Select the Set NameNode URI check box and enter the location of the NameNode. For example, hdfs://talend-hdp-all:8020.
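
If you want to make sure these connection parameters point at the intended Hive instance before going further, a lightweight query run through a component such as tHiveRow can serve as a smoke test. This is only a suggestion, not part of the original scenario:

-- List what the configured connection can see
SHOW DATABASES;
SHOW TABLES;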

Mapping the schemas

  1. Click ELT Hive Map Editor to map the schemas.

  2. On the input side (the left side of the map editor), click the Add alias button to add the table to be used.

  3. In the pop-up window, select the customer table, then click OK.

  4. Repeat the operations to select the state_city table.

  5. Drag and drop the idState column from the customer table onto the id column of the state_city table. An inner join is then created automatically; a rough HiveQL equivalent of this join is sketched after this list.

  6. On the output side (the right side of the map editor), the agg_result table is empty at first. Click the [+] button at the bottom of this side to add as many columns as required and rename them to replicate the schema of the agg_result table you created earlier in Hive.

    Note

    The type column is the partition column of the agg_result table and should not be replicated in this schema. For further information about the partition column of the Hive table, see the Hive manual.

  7. From the customer table, drop id, name, address, and sum1 to the corresponding columns in the agg_result table.

  8. From the state_city table, drop postal, state, capital and mostpopulouscity to the corresponding columns in the agg_result table.

  9. Click OK to validate these changes.
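
The mapping defined above is roughly equivalent to the following HiveQL query. This is an illustrative sketch only; the actual code generated by tELTHiveMap may differ in naming and formatting:

-- Inner join of customer and state_city on idState = id,
-- keeping the columns mapped to agg_result
SELECT c.id, c.name, c.address, c.sum1,
       s.postal, s.state, s.capital, s.mostpopulouscity
FROM customer c
JOIN state_city s ON (c.idState = s.id);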

Configuring the output schema

  1. Double-click tELTHiveOutput to open its Component view.

  2. If this component does not have the same schema as the preceding component, a warning icon appears. In this case, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears.

  3. In the Default table name field, enter the output table you want to write data in. In this example, it is agg_result.

  4. In the Field partition table, click the [+] button to add one row. This allows you to write data in the partition column of the agg_result table.

    This partition column was defined the moment we created the agg_result table using partitioned by (type string) in the Create statement presented earlier. This partition column is type, which describes the type of a customer.

  5. Under Partition column, enter type without any quotation marks, and under Partition value, enter prospective in single quotation marks. A rough HiveQL equivalent of the resulting write is sketched after this list.
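
With these partition settings, the write performed by the Job corresponds roughly to an insert into the static partition type='prospective'. Again, this is only an illustrative HiveQL sketch, not the exact code the components generate:

-- Write the joined columns into the 'prospective' partition of agg_result
INSERT INTO TABLE agg_result PARTITION (type = 'prospective')
SELECT c.id, c.name, c.address, c.sum1,
       s.postal, s.state, s.capital, s.mostpopulouscity
FROM customer c
JOIN state_city s ON (c.idState = s.id);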

Executing the Job

  • Press F6 to run this Job.

Once done, verify agg_result in Hive using, for example,

select * from agg_result;

You can find that the selected input columns are aggregated and written into the agg_result table and that the partition column is filled with the value prospective.
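
If you want to confirm that the data landed in the expected partition, the following HiveQL statements are a simple additional check beyond the original scenario:

-- Confirm the partition exists and filter on it
SHOW PARTITIONS agg_result;
SELECT * FROM agg_result WHERE type = 'prospective';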