Scenario: Merging two datasets in HDFS - 6.1

Talend Components Reference Guide

Version: 6.1
Applies to: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Open Studio for Big Data, Talend Open Studio for Data Integration, Talend Open Studio for Data Quality, Talend Open Studio for ESB, Talend Open Studio for MDM, Talend Real-Time Big Data Platform
Task: Data Governance, Data Quality and Preparation, Design and Development
Platform: Talend Studio

This scenario illustrates how to use tSqoopMerge to merge two datasets that are sequentially imported to HDFS from the same MySQL table, with one record modified between the two imports.

The first dataset (the old one before the modifications) to be used in this scenario reads as follows:

id,wage,mod_date
0,2000,2008-06-26 04:25:59
1,2300,2011-06-12 05:29:45
2,2500,2007-01-15 11:59:13
3,3000,2010-05-02 15:34:05

The path to it in HDFS is /user/ychen/target_old.

The second dataset (the new one after the modifications) to be used reads as follows:

id,wage,mod_date
0,2000,2008-06-26 04:25:59
1,2300,2011-06-12 05:29:45
2,2500,2007-01-15 11:59:13
3,4000,2013-10-14 18:00:00

The path to it in HDFS is /user/ychen/target_new.

These datasets were both imported by tSqoopImport. For a scenario about how to use tSqoopImport, see Scenario: Importing a MySQL table to HDFS.

The Job in this scenario merges these two datasets with the newer record overwriting the older one.
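
Given these inputs, the merge should keep the records with id 0, 1 and 2 unchanged and replace the record with id 3 by its newer version. Shown in the same format as the datasets above (the actual output files do not carry the header line and may order the rows differently), the expected result is:

id,wage,mod_date
0,2000,2008-06-26 04:25:59
1,2300,2011-06-12 05:29:45
2,2500,2007-01-15 11:59:13
3,4000,2013-10-14 18:00:00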

Before starting to replicate this scenario, ensure that you have appropriate rights and permissions to access the Hadoop distribution to be used. Then proceed as follows:

Dropping the component

  1. In the Integration perspective of the Studio, create an empty Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. Drop tSqoopMerge onto the workspace.

    In this scenario, the jar file required for the merge is not available, so you need to use tSqoopMerge to generate it at runtime from the source MySQL table, as sketched below.
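
For reference, the jar file and record class that the merge requires are the same kind of artifacts that Sqoop's codegen tool produces. The following sketch shows roughly what such a generation step looks like when driven through the Sqoop Java API (org.apache.sqoop.Sqoop.runTool); it assumes the Sqoop libraries and the MySQL JDBC driver are on the classpath, and the credentials and output directories are placeholders. The connection URI and table name are the ones configured later in this scenario.

import org.apache.sqoop.Sqoop;

public class GenerateMergeJar {
    public static void main(String[] args) {
        // Sketch only: generate the record class and jar for the sqoopmerge table
        // with Sqoop's codegen tool, driven through the Java API.
        String[] codegenArgs = {
            "codegen",
            "--connect", "jdbc:mysql://10.42.10.13/mysql", // source database used in this scenario
            "--table", "sqoopmerge",                       // source table used in this scenario
            "--username", "mysqlUser",                     // placeholder credentials
            "--password", "mysqlPassword",
            "--outdir", "/tmp/sqoop_codegen",              // placeholder: where the generated .java file goes
            "--bindir", "/tmp/sqoop_codegen"               // placeholder: where the compiled class and jar go
        };
        int exitCode = Sqoop.runTool(codegenArgs);         // 0 means the generation succeeded
        System.out.println("codegen exit code: " + exitCode);
    }
}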

Configuring tSqoopMerge

  1. Double-click tSqoopMerge to open its Component view.

  2. In the Mode area, select Use Java API.

  3. In the Version area, select the Hadoop distribution to be used and its version. If you cannot find the distribution corresponding to yours in the list, select Custom to connect to a Hadoop distribution not officially supported in the Studio.

    For a step-by-step example about how to use this Custom option, see Connecting to a custom Hadoop distribution.

  4. In the NameNode URI field, enter the location of the master node, the NameNode, of the distribution to be used. For example, hdfs://talend-cdh4-namenode:8020.

  5. In the JobTracker Host field, enter the location of the JobTracker of your distribution. For example, talend-cdh4-namenode:8021.

    Note that the term Job in JobTracker refers to the MapReduce (MR) jobs described in Apache's documentation at http://hadoop.apache.org/.

  6. If the distribution to be used requires Kerberos authentication, select the Use Kerberos authentication check box and complete the authentication details. Otherwise, leave this check box clear.

    If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field.

    Note that the user that executes a keytab-enabled Job is not necessarily the one designated by the principal, but that user must have the right to read the keytab file being used. For example, if the user name you are using to execute a Job is user1 and the principal to be used is guest, ensure that user1 has the right to read the keytab file to be used.

  7. In the Old data directory and the New data directory fields, enter the paths to, or browse to, the HDFS directories where the older and the newer datasets are stored, respectively.

  8. In the Target directory field, enter the path to, or browse to, the folder in which you need to store the merge result.

  9. In the Merge key field, enter the column to be used as the key for the merge. In this scenario, the column is id.

  10. Select Need to generate the JAR file to display the connection parameters to the source database table.

  11. In the Connection field, enter the URI of the MySQL database where the source table is stored. For example, jdbc:mysql://10.42.10.13/mysql.

  12. In the Table Name field, enter the name of the source table. In this scenario, it is sqoopmerge.

  13. In Username and Password, enter the authentication information.

  14. Under the Driver JAR table, click the [+] button to add one row, then in this row, click the [...] button to display the drop-down list and select the jar file to be used from that list. In this scenario, it is mysql-connector-java-5.1.30-bin.jar.

    If the [...] button does not appear, click anywhere in the row to make it appear.

  15. If the field delimiter of the source table is not a comma (,), you also need to specify that delimiter in the Additional Arguments table in the Advanced settings tab. The argument to be used is codegen.output.delimiters.field for the Use Java API mode or --fields-terminated-by for the Use Commandline mode. The sketch after this list shows how the settings above map onto a plain Sqoop merge call.
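
Assuming that in Use Java API mode the component ultimately drives Sqoop's merge tool, the settings entered above correspond roughly to the following sketch. The target directory, the jar path and the record class name are hypothetical, since tSqoopMerge generates the jar and class at runtime; the other values are the ones used in this scenario.

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class RunSqoopMerge {
    public static void main(String[] args) {
        // Sketch only: the Hadoop connection details entered in the component.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://talend-cdh4-namenode:8020"); // NameNode URI (step 4)
        conf.set("mapred.job.tracker", "talend-cdh4-namenode:8021");  // JobTracker, MR1-style property (step 5)

        String[] mergeArgs = {
            "merge",
            "--onto", "/user/ychen/target_old",                // Old data directory (step 7)
            "--new-data", "/user/ychen/target_new",            // New data directory (step 7)
            "--target-dir", "/user/ychen/target_merged",       // Target directory (step 8, placeholder path)
            "--merge-key", "id",                               // Merge key (step 9)
            "--jar-file", "/tmp/sqoop_codegen/sqoopmerge.jar", // hypothetical path to the generated jar
            "--class-name", "sqoopmerge"                       // hypothetical name of the generated record class
        };
        int exitCode = Sqoop.runTool(mergeArgs, conf);         // 0 means the merge job succeeded
        System.out.println("merge exit code: " + exitCode);
    }
}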

Executing the Job

Then you can press F6 to run this Job.

During the execution, the jar file and the class for the merge are generated on the local machine.

Once the Job is done, you can verify the results in the target directory you have specified, using the web console of the Hadoop distribution being used.
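
If you prefer to check the merged records without the web console, a minimal sketch such as the one below reads the files in the target directory through the Hadoop FileSystem API. The NameNode URI is the one used in this scenario; the target directory shown is a placeholder, so replace it with the one you set in tSqoopMerge.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckMergeResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://talend-cdh4-namenode:8020"); // NameNode URI from the scenario

        FileSystem fs = FileSystem.get(conf);
        // /user/ychen/target_merged is a placeholder; use the Target directory you set in tSqoopMerge.
        for (FileStatus status : fs.listStatus(new Path("/user/ychen/target_merged"))) {
            if (!status.isFile() || status.getPath().getName().startsWith("_")) {
                continue; // skip sub-directories and markers such as _SUCCESS
            }
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // each line should hold one merged record
                }
            }
        }
        fs.close();
    }
}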

If you need to obtain more details about the Job, it is recommended to use the web console of the JobTracker provided by the Hadoop distribution you are using.

If you continue to import updated datasets to HDFS from the same source table, you can reuse the generated merge class to merge the datasets.