Choosing metrics and defining matching rules - Cloud - 8.0

Data matching with Talend tools

Version: Cloud 8.0
Language: English
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-02-06

After blocking data into groups of similar size, you can create match rules and test them before using them in the tMatchGroup component.

For more information about creating a match analysis, see Creating a match analysis.

Matching functions in the tMatchGroup component

tMatchGroup helps you create groups of similar data records in any source of data, including large volumes of data, by using one or several match rules.

Each created group is made up of a master record and the records similar to this master record. The matching functions used to compute the similarity between each record and the master record include the following:
  • Phonetic algorithms, such as Soundex or Metaphone, are used to match names.
  • The Levenshtein distance calculates the minimum number of single-character edits (insertions, deletions or substitutions) required to transform one string into another.
  • The Jaro distance matches processed entries according to spelling deviations.
  • The Jaro-Winkler distance is a variant of Jaro giving more importance to the beginning of the string.
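
To make the edit-based measures above more concrete, here is a minimal Python sketch of the textbook definitions of Levenshtein, Jaro and Jaro-Winkler. It is an illustration only, not the code tMatchGroup runs: the function names, the prefix scale of 0.1 and the 4-character prefix limit are the commonly cited defaults and are assumptions here. Note that the two Jaro functions return a similarity between 0 and 1, whereas Levenshtein returns an edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only if they are close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_flags = [False] * len(a)
    b_flags = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_flags[j] and b[j] == ca:
                a_flags[i] = b_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_matched = [c for c, flag in zip(a, a_flags) if flag]
    b_matched = [c for c, flag in zip(b, b_flags) if flag]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(levenshtein("kitten", "sitting"))
print(round(jaro("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because "MARTHA" and "MARHTA" share the prefix "MAR", the Jaro-Winkler value is higher than the plain Jaro value, which is the behavior described in the last bullet.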

For more information on how to use the tMatchGroup component in standard and Map/Reduce Jobs, see Classical matching.

The Simple VSR Matcher and the T-Swoosh algorithms

You can choose between two algorithms when using the tMatchGroup component:
  • Simple VSR Matcher
  • T-Swoosh

For more information about match analyses, see "Create a match rule" on Talend Help Center.

When do records match?

Two records match when the following conditions are met:
  • When using the T-Swoosh algorithm, the score returned for each matching function must be higher than the threshold you set.
  • The global score, computed as a weighted score of the different matching functions, must be higher than the match threshold.
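
The two conditions above can be sketched in Python as follows. This is a simplified illustration, not Talend's implementation: the rule layout, the `records_match` function, the `check_each` flag (standing in for the T-Swoosh per-function check) and the normalization of the global score by the sum of the weights are all assumptions. Whether a score exactly equal to a threshold counts as a match is also an assumption here.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; any matching function could be used here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact(a: str, b: str) -> float:
    """1.0 when the values are equal ignoring case and outer spaces, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def records_match(rec_a, rec_b, rules, match_threshold, check_each=True):
    """rules maps a column name to (similarity function, weight, threshold)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for column, (similarity, weight, threshold) in rules.items():
        score = similarity(rec_a[column], rec_b[column])
        # Condition 1 (hypothetical T-Swoosh-style check): every matching
        # function must reach its own threshold.
        if check_each and score < threshold:
            return False
        weighted_sum += weight * score
        total_weight += weight
    # Condition 2: the weighted global score must reach the match threshold.
    global_score = weighted_sum / total_weight
    return global_score >= match_threshold


# Hypothetical rule set: the last name counts twice as much as the city.
rules = {
    "last_name": (text_similarity, 2.0, 0.7),
    "city": (exact, 1.0, 0.0),
}
a = {"last_name": "Smith", "city": "Paris"}
b = {"last_name": "Smyth", "city": "Paris"}
c = {"last_name": "Smyth", "city": "Lyon"}
print(records_match(a, b, rules, match_threshold=0.85))
print(records_match(a, c, rules, match_threshold=0.85))
```

In this sketch, a and b pass both conditions, while a and c pass the per-function check but fall below the global threshold because the city does not agree.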

Multiple passes

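In general, a single partitioning scheme is not enough, because records that represent the same entity can end up in different blocks. Several partitioning schemes are then necessary, which requires using tMatchGroup components sequentially to match data against different blocking keys.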
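
The idea behind multiple passes can be sketched in Python as follows. This is a conceptual illustration only and does not reproduce how chained tMatchGroup components exchange master records: the `multi_pass_groups` function, the union-find merge, the sample records and the `same_person` rule are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def multi_pass_groups(records, blocking_keys, is_match):
    """Group record indexes, running one matching pass per blocking key."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the representative of the group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for key_fn in blocking_keys:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            blocks[key_fn(rec)].append(idx)
        # Records are compared only with records of the same block.
        for members in blocks.values():
            for i, j in combinations(members, 2):
                if is_match(records[i], records[j]):
                    parent[find(j)] = find(i)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


def same_person(a, b):
    """Toy match rule: same last name and same first initial."""
    na, nb = a["name"].split(), b["name"].split()
    return na[-1] == nb[-1] and na[0][0] == nb[0][0]


records = [
    {"name": "Jon Smith", "zip": "75001", "phone": "555-0100"},
    {"name": "John Smith", "zip": "75001", "phone": "555-0199"},
    {"name": "Jon Smith", "zip": "69002", "phone": "555-0100"},
    {"name": "Ann Lee", "zip": "69002", "phone": "555-0123"},
]
by_zip = lambda r: r["zip"]
by_phone = lambda r: r["phone"]

print(multi_pass_groups(records, [by_zip], same_person))
print(multi_pass_groups(records, [by_zip, by_phone], same_person))
```

With the zip code as the only blocking key, the third record is never compared with the first two. Adding a second pass on the phone number brings it into the same group.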

For an example of how to match data through multiple passes, see Classical matching.

Working with the tRecordMatching component

tRecordMatching joins compared columns from the main flow with reference columns from the lookup flow. According to the matching strategy you define, tRecordMatching outputs the match data, the possible match data and the rejected data. When you define your matching strategy, the matching scores you set are critical in determining the match level of the data of interest.
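
The three outputs can be pictured with the following Python sketch, which compares each main-flow row with every lookup row and routes it according to its best score. It illustrates the general idea only: the `classify` function, the two threshold parameters and their default values, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions, not the component's actual settings or algorithm.

```python
from difflib import SequenceMatcher


def classify(main_rows, lookup_rows, column,
             possible_threshold=0.6, match_threshold=0.85):
    """Route each main row to matches, possible matches or rejects."""
    outputs = {"matches": [], "possible_matches": [], "rejects": []}
    for row in main_rows:
        best_score, best_ref = 0.0, None
        for ref in lookup_rows:
            score = SequenceMatcher(
                None, row[column].lower(), ref[column].lower()
            ).ratio()
            if score > best_score:
                best_score, best_ref = score, ref
        # Hypothetical user-defined scores decide the match level.
        if best_score >= match_threshold:
            bucket = "matches"
        elif best_score >= possible_threshold:
            bucket = "possible_matches"
        else:
            bucket = "rejects"
        outputs[bucket].append((row, best_ref, round(best_score, 2)))
    return outputs


lookup = [{"name": "Smith"}, {"name": "Acme Corporation"}]
main = [{"name": "smith"}, {"name": "Smyth"}, {"name": "Qqq"}]
result = classify(main, lookup, "name")
for bucket, rows in result.items():
    print(bucket, rows)
```

Raising or lowering the two thresholds moves rows between the three outputs, which is why the user-defined scores matter so much in the matching strategy.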

For more information about the tRecordMatching component, see Classical matching.