Scenario 1: Matching entries using the Q-grams and Levenshtein algorithms - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This scenario describes a six-component Job that aims at:

  • matching entries in the name column against the entries in the reference input file by dividing strings into letter blocks of length q, where q is 3, in order to create a number of q length grams. The matching result is given as the number of q-gram matches over possible q-grams,

  • checking the edit distance between the entries in the email column of an input file against those of the reference input file.

The outputs of these two matching types are written in three output files: the first for match values, the second for possible match values and the third for the values for which there are no matches in the lookup file.

In this scenario, we have already stored the main and reference input schemas in the Repository. For more information about storing schema metadata in the Repository, see Talend Studio User Guide.

The main input table contains seven columns: code, name, address, zipcode, city, email and col7. We want to carry the fuzzy match on two columns: name and email.

Setting up the Job

  1. In the Repository tree view, expand Metadata - DB Connections where you have stored the main input schemas and drop the relevant file onto the design workspace.

    The [Components] dialog box appears.

  2. Select tMysqlInput and click OK to drop the tMysqlInput component onto the workspace.

    The input table used in this scenario is called person. It holds several columns including the two columns name and email we want to do the fuzzy match on. The following capture shows the basic properties of the main input component:

  3. Do the same for the second input table you want to use as a reference, customer in this scenario.

    The following capture shows the basic properties of the reference input component:

  4. Drop the following components from the Palette onto the design workspace: tRecordMatching and three tLogRow.

  5. Connect the main and reference input components to tRecordMatching using Main links. The link between the reference input table and tRecordMatching displays as a Lookup link on the design workspace.

  6. Connect tRecordMatching to the three tLogRow components using the Matches, Possible Matches and Non Matches links.

Configuring the components

  1. Double-click tRecordMatching to display its Basic settings view and define its properties.

  2. Click the Edit schema button to open a dialog box. Here you can define the data you want to pass to the output components.

    In this example we want to pass to the tRecordMatching component the name and email columns from the first tMysqlInput component, and the ref_name and ref_email columns from the second tMysqlInput component.

    The MATCHING_DISTANCE and the MATCHING_WEIGHT columns in the output schema are defined by default.

    The MATCHING_WEIGHT column is always between 0 and 1. It is a global distance between sets of columns (defined by the columns to be matched).

    The MATCHING_DISTANCE column will print a distance for each of the columns on which we use an algorithm. The results will be separated by a vertical bar (pipe).

    Click OK to close the dialog box and proceed to the next step.

  3. In the Key Definition area of the Basic settings view of tRecordMatching, click the plus button to add two columns to the list.

  4. Select the input columns and the output columns you want to do the fuzzy matching on from the Input key attribute and Lookup key attribute lists respectively.

    In this example, select name and email as input attributes and ref-name and ref_email as lookup attributes.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

    For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.

  5. Click in the Matching type column and select from the list q-gram, the method to be used on the first column to check the incoming data against the reference data.

  6. Set the matching type for the second column, Levenshtein in this example.

    The minimum and maximum possible match values are defined in the Advanced settings view. You can change the by-default values.

  7. In the Confidence Weight column, set a numerical weight for each of the columns used as key attributes.

  8. Click in the cell of the Handle Null column and select the null operator you want to use to handle null attributes in the columns.

  9. If required, click the plus button below the Blocking Selection table to add one or more lines in the table and then click in the line and select from the list the column you want to use as a blocking value.

    Using a blocking value reduces the number of pairs of records that needs to be examined. The input data is partitioned into exhaustive blocks based on the blocking value. This will decrease the number of pairs to compare as comparison is restricted to record pairs within each block. Check Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key for a use case of the blocking value.

  10. Double-click the first tLogRow component to display its Basic settings view, and select Table in the Mode area to display the source file and the tRecordMatching results together to be able to compare them.

  11. Do the same for the other two tLogRow components.

Executing the Job

  • Save your Job and press F6 to execute it.

    Three output tables are written on the console. The first shows the match entries, the second show the possible match entries and the third shows the non match entries according to the used matching method in the defined columns.

    The figure below illustrates extractions of the three output tables.

    The first table lists all the names and emails that could be matched with identical entries in the reference table. Thus the matching distance and the matching weight are equal to "1.0".

    The second table lists all the names and emails that have a possible match in the reference table. The matching distance column prints the distances for the name and email columns and separate them by a vertical bar.

    The third table lists all the names and emails that do not have match in the reference table.