Scenario: Comparing four columns using different matching methods and collecting encountered duplicates - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This scenario describes a four-component job aiming at collecting in two separate files all unique entries and all duplicate entries from few defined processed columns based on the Levenshtein and Double Metaphone matching types.

The input file in this example looks like the following:

ID;Status;FirstName;Email;City;Initial;ZipCode
1;married;Paul;pnewman@comp.com;New York;P.N.;55677
2;single;Raul;rnewman@comp.com;New Ork;R.N.;55677
3;single;Mary;mnewman@comp.com;Chicago;M.N;66898

Setting up the Job

  1. Drop tFileInputDelimited, tFuzzyUniqRow, and two tFileOutputExcel components from the Palette onto the design workspace.

  2. Connect tFileInputDelimited to tFuzzyUniqRow using the Main link, tFuzzyUniqRow to the tFileOutputExcel components using the Uniques link with one and the Duplicates link with the other.

Configuring the components

  1. Double-click tFileInputDelimited to open its Basic settings view and define its properties.

  2. Click the three-dot button next to the File Name field to browse to the file holding the input data.

  3. If needed, set Header, Footer, and Limit.

    For this scenario, set Header to 1. Footer and limit for the number of processed rows are not set.

  4. Click Edit schema to open a dialog box where you can describe the data structure of the source delimited file.

    In this scenario, the source schema is made of the following columns: ID, Status, FirstName, Email, City, Initial, and ZipCode.

  5. Double click tFuzzyUniqRow to display its Basic settings view and define its properties.

  6. In the Key Attribute column, select the check boxes next to the columns you want to check using the defined matching method, Firstname, Email, City, and ZipCode in this example.

  7. In the Matching Type column, set the matching methods you want to use on each of the selected columns.

    In this example, Leveshtein is to be used as the matching method for the FirstName, Email, and ZipCode columns, Double Metaphone is to be used as the matching method for the City column.

    Then set the minimum and maximum distances for the Levenshtein method. In this method, the distance is the number of character changes (insertion, deletion or substitution) that needs to be carried out in order for the entry to fully match the reference. In this example, we want the min. distance to be 0 and the max. distance to be 2. This will output all entries in the FirstName, Email, and ZipCode columns that exactly match or that have maximum two character changes. There is no minimum nor maximum distance to set for Double Metaphone because this matching method is based on phonetic discrepancies in the input data.

  8. Double click the first tFileOutputExcel to display its Basic settings view and define its properties.

  9. Set the destination file name as well as the Sheet name and select the Include header check box.

  10. Do the same for the second tFileOutputExcel.

Executing the Job

  • Save your Job and click F6 to execute it.

tFuzzyUniqRow uses the Levenshtein method to compare each of the three defined columns separately, it uses the Double Metaphone method to compare data in the City column, and finally passes the unique and duplicate rows to the defined output files. In our example, the first two rows match, hence the second row will go in the "duplicates" output.

The generated FID column gives a reference identifier of the original record which the current record refers to.

The third row is unique and will go in the "uniques" output.

The generated UID column is an identifier generated for the main record.