Scenario: HCatalog table management on Hortonworks Data Platform - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This scenario describes a six-component Job that covers the common operations of HCatalog table management on Hortonworks Data Platform. The sub-sections of this scenario cover the following database operations:

  • Creating a table in the database in HDFS;

  • Writing data to the HCatalog managed table;

  • Writing data to the partitioned table using tHCatalogLoad;

  • Reading data from the HCatalog managed table;

  • Outputting the data read from the table in HDFS to the console.

Note

Knowledge of Hive Data Definition Language and HCatalog Data Definition Language is required. For further information about Hive Data Definition Language, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL. For further information about HCatalog Data Definition Language, see https://cwiki.apache.org/confluence/display/HCATALOG/Design+Document+-+Java+APIs+for+HCatalog+DDL+Commands.
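As a point of reference, the table built later in this scenario (columns name, country, and age, partitioned by match_age) roughly corresponds to the following Hive DDL statement. This is an illustrative assumption, not output captured from the Job:

```python
# Approximate Hive DDL for the table this scenario creates.
# Column types are assumptions; the scenario only names the columns.
ddl = (
    "CREATE TABLE talend.Customer ("
    "name STRING, country STRING, age INT) "
    "PARTITIONED BY (match_age INT)"
)
print(ddl)
```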

Setting up the Job

  1. Drop the following components from the Palette to the design workspace: tHCatalogOperation, tHCatalogLoad, tHCatalogInput, tHCatalogOutput, tFixedFlowInput, and tLogRow.

  2. Right-click tHCatalogOperation to connect it to tFixedFlowInput using a Trigger > OnSubjobOk connection.

  3. Right-click tFixedFlowInput to connect it to tHCatalogOutput using a Row > Main connection.

  4. Right-click tFixedFlowInput to connect it to tHCatalogLoad using a Trigger > OnSubjobOk connection.

  5. Right-click tHCatalogLoad to connect it to the tHCatalogInput component using a Trigger > OnSubjobOk connection.

  6. Right-click tHCatalogInput to connect it to tLogRow using a Row > Main connection.

Creating a table in HDFS

  1. Double-click tHCatalogOperation to open its Basic settings view.

  2. Click Edit schema to define the schema for the table to be created.

  3. Click [+] to add at least one column to the schema and click OK when you finish setting the schema. In this scenario, the columns added to the schema are: name, country and age.

  4. Fill the Templeton hostname field with the URL of the Templeton web service you are using. In this scenario, fill this field with "192.168.0.131".

  5. Fill the Templeton port field with the port of the Templeton web service. By default, the value for this field is "50111".

  6. Select Table from the Operation on list and Drop if exist and create from the Operation list to create a table in HDFS.

  7. Fill the Database field with an existing database name in HDFS. In this scenario, the database name is "talend".

  8. Fill the Table field with the name of the table to be created. In this scenario, the table name is "Customer".

  9. Fill the Username field with the username for the DB authentication.

  10. Select the Set the user group to use check box to specify the user group. The default user group is "root"; adjust this value to match your environment.

  11. Select the Set the permissions to use check box to specify the user permission. The default value for this field is "rwxrwxr-x".

  12. Select the Set partitions check box to enable the partition schema.

  13. Click the Edit schema button next to the Set partitions check box to define the partition schema.

  14. Click [+] to add one column to the schema and click OK when you finish setting the schema. In this scenario, the column added to the partition schema is: match_age.

Writing data to the existing table

  1. Double-click tFixedFlowInput to open its Basic settings view.

  2. Click Edit schema to define the same schema as the one you defined in tHCatalogOperation.

  3. Fill the Number of rows field with the integer 8.

  4. Select Use Inline Table in the Mode area.

  5. Click [+] to add new lines in the inline table.

  6. Double-click tHCatalogOutput to open its Basic settings view.

  7. Click Sync columns to retrieve the schema defined in the preceding component.

  8. Fill the NameNode URI field with the URI of the NameNode. In this scenario, this URI is "192.168.0.131".

  9. Fill the File name field with the HDFS location of the file you write data to. In this scenario, the file location is "/user/hdp/Customer/Customer.csv".

  10. Select Overwrite from the Action list.

  11. Fill the Templeton hostname field with the URL of the Templeton web service you are using. In this scenario, fill this field with "192.168.0.131".

  12. Fill the Templeton port field with the port of the Templeton web service. By default, the value for this field is "50111".

  13. Fill the Database, Table, and Username fields with the same values you specified in tHCatalogOperation.

  14. Fill the Partition field with "match_age=27".

  15. Fill the File location field with the HDFS location to which the table will be saved. In this example, use "hdfs://192.168.0.131:8020/user/hdp/Customer".
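In this step, the eight inline rows generated by tFixedFlowInput end up as a delimited file in the HDFS location given above, inside the match_age=27 partition. The sketch below illustrates the shape of that file; the sample values and the ";" delimiter are invented for illustration, since the scenario does not list the inline table contents:

```python
# Hypothetical rows matching the name/country/age schema; the partition
# column match_age=27 is carried by the partition, not stored in the file.
rows = [
    ("Pierre", "France", 27),
    ("Anna", "Germany", 27),
] + [(f"Customer{i}", "USA", 27) for i in range(6)]  # 8 rows in total

# Target location taken from the steps above.
hdfs_path = "/user/hdp/Customer/Customer.csv"

# Serialize the rows as a delimited file body (';' is an assumed delimiter).
file_body = "\n".join(";".join(str(field) for field in row) for row in rows)
```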

Writing data to the partitioned table using tHCatalogLoad

  1. Double-click tHCatalogLoad to open its Basic settings view.

  2. Fill the Partition field with "match_age=26".

  3. Do the rest of the settings in the same way as configuring tHCatalogOperation.
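Conceptually, registering the match_age=26 partition maps to WebHCat's add-partition resource, where the partition specification is percent-encoded into the request URL. The URL construction below is an assumed sketch of that call, not Talend's code:

```python
from urllib.parse import quote

def add_partition_url(host, port, database, table, partition_spec):
    """Build the WebHCat URL for
    PUT ddl/database/{db}/table/{table}/partition/{spec}.
    The partition spec (e.g. "match_age=26") must be URL-encoded."""
    spec = quote(partition_spec, safe="")  # '=' becomes %3D
    return (f"http://{host}:{port}/templeton/v1/ddl/"
            f"database/{database}/table/{table}/partition/{spec}")

url = add_partition_url("192.168.0.131", 50111, "talend", "Customer",
                        "match_age=26")
```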

Reading data from the table in HDFS

  1. Double-click tHCatalogInput to open its Basic settings view.

  2. Click Edit schema to define the schema of the table to be read from the database.

  3. Click [+] to add at least one column to the schema. In this scenario, the columns added to the schema are age and name.

  4. Fill the Partition field with "match_age=26".

  5. Do the rest of the settings in the same way as configuring tHCatalogOperation.
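The effect of this configuration is that tHCatalogInput keeps only the rows in the match_age=26 partition and projects the age and name columns declared in the schema. A pure-Python picture of that partition pruning and projection (the records are invented sample data, not Talend code):

```python
# Each record carries its partition value alongside the data columns.
records = [
    {"name": "Anna", "country": "Germany", "age": 26, "match_age": 26},
    {"name": "Pierre", "country": "France", "age": 27, "match_age": 27},
]

# Partition pruning: keep rows whose partition matches "match_age=26",
# then project only the columns in the input schema (age, name).
selected = [
    {"age": r["age"], "name": r["name"]}
    for r in records
    if r["match_age"] == 26
]
```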

Outputting the data read from the table in HDFS to the console

  1. Double-click tLogRow to open its Basic settings view.

  2. Click Sync columns to retrieve the schema defined in the preceding component.

  3. Select Table from the Mode area.

Job execution

Press Ctrl+S to save your Job and press F6 to execute it.

The data read from the partitioned table in HDFS is displayed in the console.

Type http://talend-hdp:50075/browseDirectory.jsp?dir=/user/hdp/Customer&namenodeInfoPort=50070 into the address bar of your browser to view the table you created:

Click the Customer.csv link to view the content of the table you created.