Scenario 1: Levenshtein distance of 0 in first names - 6.1

Talend Components Reference Guide

This scenario describes a four-component Job that checks the edit distance between the First Name column of an input file and the data of a reference input file. The output of this Levenshtein check is displayed in a table along with the content of the main flow.

Setting up the Job

  1. Drag and drop the following components from the Palette to the design workspace: tFileInputDelimited (x2), tFuzzyMatch, tLogRow.

  2. Link the first tFileInputDelimited component to the tFuzzyMatch component using a Row > Main connection.

  3. Link the second tFileInputDelimited component to the tFuzzyMatch using a Row > Main connection (which appears as a Lookup row on the design workspace).

  4. Link the tFuzzyMatch component to the standard output tLogRow using a Row > Main connection.

Configuring the components

  1. Define the first tFileInputDelimited in its Basic settings view. Browse to the input file to be analyzed.

  2. Define the schema of the component. In this example, the input schema has two columns, firstname and gender.

  3. Define the second tFileInputDelimited component the same way.

    Warning

    Make sure the reference column is set as key column in the schema of the lookup flow.

  4. Double-click the tFuzzyMatch component to open its Basic settings view, and check its schema.

    The Schema should match the Main input flow schema in order for the main flow to be checked against the reference.

    Note that two columns, Value and Matching, are added to the output schema. These are standard matching information and are read-only.

  5. Select the method to be used to check the incoming data. In this scenario, Levenshtein is the Matching type to be used.

  6. Then set the distance. In this method, the distance is the number of character changes (insertions, deletions or substitutions) that need to be carried out for the entry to fully match the reference.

    In this use case, we set both the minimum distance and the maximum distance to 0. This means only the exact matches will be output.

  7. Also, clear the Case sensitive check box.

  8. Check that the matching column and lookup column are correctly selected.

  9. Leave the other parameters as default.
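The distance settings above can be illustrated outside the Studio with a minimal Python sketch. It is not the tFuzzyMatch implementation: the function names `levenshtein` and `fuzzy_match` and their parameters are hypothetical, chosen only to mirror the minimum distance, maximum distance, and Case sensitive options described in the steps.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] = distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (free if chars match)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(value, references, min_distance=0, max_distance=0,
                case_sensitive=False):
    """Hypothetical helper: return (distance, reference) for the closest
    reference whose distance lies in [min_distance, max_distance], else None."""
    best = None
    for ref in references:
        if case_sensitive:
            left, right = value, ref
        else:
            left, right = value.lower(), ref.lower()
        d = levenshtein(left, right)
        if min_distance <= d <= max_distance and (best is None or d < best[0]):
            best = (d, ref)
    return best
```

With the defaults used here (minimum and maximum distance of 0, case-insensitive), `fuzzy_match("JOHN", ["john", "joan"])` keeps only the exact match, while `fuzzy_match("jon", ["john", "joan"])` finds nothing because every candidate is at least one edit away.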

Executing the Job

  • Save the Job and press F6 to execute the Job.

Because both the minimum and the maximum edit distance are set to 0, the output is equivalent to a regular join between the main flow and the lookup (reference) flow: only exact matches, with a Value of 0, are displayed.
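The equivalence with a regular join can be sketched as a simple keyed lookup. This is an illustration only, assuming rows are represented as dictionaries with the firstname and gender columns from the example schema; the function name `exact_join` and the sample values in the usage note are hypothetical and do not come from the scenario's input files.

```python
def exact_join(main_rows, reference_names):
    """Join main rows to reference names on a case-insensitive exact match,
    which is what a Levenshtein range of 0..0 reduces to."""
    # Index the reference by lower-cased name, since Case sensitive is cleared.
    lookup = {name.lower(): name for name in reference_names}
    joined = []
    for row in main_rows:
        match = lookup.get(row["firstname"].lower())
        if match is not None:
            # Value and Matching mirror the two columns added to the output schema.
            joined.append({**row, "Value": 0, "Matching": match})
    return joined
```

For instance, joining the rows `{"firstname": "Anna", "gender": "F"}` and `{"firstname": "Jon", "gender": "M"}` against the reference names `["anna", "John"]` keeps only the Anna row: "Jon" is one edit away from "John" and therefore falls outside the 0..0 range.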

A more telling example uses a minimum distance of 1 and a maximum distance of 2. See Scenario 2: Levenshtein distance of 1 or 2 in first names.