Configuring the tBlockedFuzzyJoin component - 7.0

Fuzzy matching

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

Procedure

  1. Double-click tBlockedFuzzyJoin to display its Basic settings view and define its properties.
  2. Click the Edit schema button to open a dialog box. Here you can define the data you want to pass to the output components.

    In this example, the four input columns are passed to the output components, along with the new column ref_firstname.

  3. Click OK to close the dialog box and proceed to the next step.
  4. In the Key definition area of the Basic settings view of tBlockedFuzzyJoin, click the plus button to add two columns to the list.
  5. From the Input key attribute and Lookup key attribute lists respectively, select the input columns and the lookup columns on which to perform the fuzzy matching: grp and firstname in this example.
  6. Click in the first cell of the Matching type column and select from the list the method used to check the incoming data against the reference data, Exact match in this example. No minimum or maximum distance needs to be set for this method.
  7. Set the matching type for the second column, Levenshtein in this example.
  8. Set the minimum and maximum distances. With the Levenshtein method, the distance is the number of character changes (insertions, deletions or substitutions) needed for the entry to fully match the reference. In this example, set the min. distance to 0 and the max. distance to 2. This outputs all entries in the firstname column that match the reference exactly or differ from it by at most two character changes.
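
The matching logic configured in the steps above can be sketched in Python to show what the two key definitions mean in practice. This is an illustration only, not the code Talend Studio generates: the function names levenshtein and blocked_fuzzy_join, the sample rows, and the case-sensitive comparison are all assumptions made for the example, and only the grp and firstname columns are shown for brevity.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def blocked_fuzzy_join(main_rows, lookup_rows, min_dist=0, max_dist=2):
    """Hypothetical helper: pair each main row with the lookup rows that
    share its grp value (Exact match) and whose firstname lies within
    [min_dist, max_dist] edits (Levenshtein)."""
    matches = []
    for row in main_rows:
        for ref in lookup_rows:
            if row["grp"] != ref["grp"]:  # first key: exact match on grp
                continue
            distance = levenshtein(row["firstname"], ref["firstname"])
            if min_dist <= distance <= max_dist:  # second key: fuzzy match
                matches.append({**row, "ref_firstname": ref["firstname"]})
    return matches


# Made-up sample data, for illustration only.
main_rows = [
    {"grp": "A", "firstname": "Jon"},
    {"grp": "A", "firstname": "Elisabeth"},
    {"grp": "B", "firstname": "John"},
]
lookup_rows = [
    {"grp": "A", "firstname": "John"},
    {"grp": "A", "firstname": "Elizabeth"},
]

result = blocked_fuzzy_join(main_rows, lookup_rows)
for match in result:
    print(match)
```

With this sample data, Jon and Elisabeth are each one character change away from a reference name in the same group, so both are matched. The row in group B has no reference row with the same grp value, so it is never compared on firstname at all.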