Choosing metrics and defining matching rules

After blocking data into similar-sized groups, you can create match rules and test them before using them in the tMatchGroup component.

For more information about creating a match analysis, see Creating a match analysis.

Matching functions in the tMatchGroup component

tMatchGroup helps you create groups of similar data records in any data source, including large volumes of data, by using one or several match rules.

Each group consists of a master record and the records similar to it. The matching functions used to compute similarity between each record and the master record include:
  • Phonetic algorithms, such as Soundex or Metaphone, are used to match names.
  • The Levenshtein distance calculates the minimum number of edits required to transform one string into another.
  • The Jaro distance matches processed entries according to spelling deviations.
  • The Jaro-Winkler distance is a variant of Jaro that gives more weight to the beginning of the string.
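
To make the distance-based functions above more concrete, here is a minimal Python sketch of the Levenshtein, Jaro and Jaro-Winkler measures. It follows the textbook definitions and is not the code tMatchGroup runs internally. The function names, the 0.1 prefix scale and the 4-character prefix cap are conventional choices assumed for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        # Characters match only if equal and close enough in position.
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(levenshtein("kitten", "sitting"))             # 3
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))   # 0.961
```

Note that the Jaro functions in this sketch return a similarity (higher means closer), whereas Levenshtein returns an edit count (lower means closer).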

For more information on how to use the tMatchGroup component in standard and Map/Reduce Jobs, see tMatchGroup.

The Simple VSR Matcher and the T-Swoosh algorithms

You can choose between two algorithms when using the tMatchGroup component:
  • Simple VSR Matcher
  • T-Swoosh

For more information about match analyses, see "Create a match rule" on Talend Help Center.

When do records match?

Two records match when the following conditions are met:
  • When using the T-Swoosh algorithm, the score returned for each matching function must be higher than the threshold you set.
  • The global score, computed as a weighted score of the different matching functions, must be higher than the match threshold.
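
The two conditions above can be sketched in Python as follows. This is an illustration only: the names `global_score`, `records_match` and `function_thresholds` are hypothetical, and computing the global score as a weighted average (the weighted sum divided by the total weight) is an assumption about how the weighting is normalized.

```python
from typing import Dict, Optional


def global_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the per-function similarity scores (assumed formula)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("at least one matching function needs a non-zero weight")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def records_match(scores: Dict[str, float],
                  weights: Dict[str, float],
                  match_threshold: float,
                  function_thresholds: Optional[Dict[str, float]] = None) -> bool:
    """Apply the per-function check (T-Swoosh only), then the global check."""
    if function_thresholds is not None:
        # Every matching function must score strictly above its own threshold.
        for name, score in scores.items():
            if score <= function_thresholds[name]:
                return False
    # The weighted global score must be strictly above the match threshold.
    return global_score(scores, weights) > match_threshold


scores = {"name": 0.9, "city": 0.6}
weights = {"name": 2.0, "city": 1.0}

# Global score is (0.9 * 2 + 0.6 * 1) / 3, about 0.8, which exceeds 0.75.
print(records_match(scores, weights, match_threshold=0.75))

# With per-function thresholds, "city" (0.6) fails its 0.7 threshold.
print(records_match(scores, weights, 0.75, {"name": 0.8, "city": 0.7}))
```

In this example the first call accepts the pair and the second rejects it, even though the global score is identical, because one individual function falls below its own threshold.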

Multiple passes

In general, several partitioning schemes are necessary, because records are only compared with other records in the same block. This requires using tMatchGroup components sequentially to match data against different blocking keys.
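
The following Python sketch illustrates why one blocking key is often not enough. It is a simplified model, not a Talend Job: the sample records, the `match_pass` helper and the surname-based `same_surname` rule are all hypothetical, and the two passes simply pool their matched pairs instead of feeding the groups of one pass into the next.

```python
from collections import defaultdict
from itertools import combinations


def match_pass(records, blocking_key, is_match):
    """Compare records only within blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    pairs = set()
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if is_match(left, right):
                pairs.add((left["id"], right["id"]))
    return pairs


def same_surname(left, right):
    """Toy match rule (assumption): the last word of the name is identical."""
    return left["name"].split()[-1].lower() == right["name"].split()[-1].lower()


records = [
    {"id": 1, "name": "Jon Smith", "zip": "75001", "email": "jon@example.com"},
    {"id": 2, "name": "John Smith", "zip": "75001", "email": "john.smith@example.com"},
    {"id": 3, "name": "Jon Smith", "zip": "69002", "email": "jon@example.com"},
]

# Pass 1 blocks on the postal code: record 3 is never compared with 1 or 2.
first_pass = match_pass(records, lambda r: r["zip"], same_surname)

# Pass 2 blocks on the email address and recovers the pair missed by pass 1.
second_pass = match_pass(records, lambda r: r["email"], same_surname)

all_pairs = first_pass | second_pass
print(sorted(all_pairs))  # [(1, 2), (1, 3)]
```

Records 1 and 3 describe the same person but sit in different postal-code blocks, so only the second pass, which uses a different blocking key, brings them together.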

For an example of how to match data through multiple passes, see Matching customer data through multiple passes.

Working with the tRecordMatching component

tRecordMatching joins compared columns from the main flow with reference columns from the lookup flow. According to the matching strategy you define, tRecordMatching outputs the match data, the possible match data and the rejected data. When you define your matching strategy, the user-defined matching scores are critical for determining the match level of the data of interest.
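
As a rough illustration of how matching scores can split the main flow into three outputs, the sketch below compares each main row with the lookup rows and classifies the best score. It is an assumption-laden model and not the component's implementation: the two thresholds, their default values, the function names, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real job would use its configured function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(score: float, possible_threshold: float, match_threshold: float) -> str:
    """Map a score to one of the three outputs (thresholds are hypothetical)."""
    if score >= match_threshold:
        return "match"
    if score >= possible_threshold:
        return "possible match"
    return "reject"


def match_against_lookup(main_rows, lookup_rows, column,
                         possible_threshold=0.6, match_threshold=0.85):
    """For each main row, keep the closest lookup row and classify the pair."""
    results = []
    for row in main_rows:
        best = max(lookup_rows, key=lambda ref: similarity(row[column], ref[column]))
        score = similarity(row[column], best[column])
        label = classify(score, possible_threshold, match_threshold)
        results.append((row[column], best[column], label))
    return results


main_rows = [{"name": "Acme Corp"}, {"name": "zzz"}]
lookup_rows = [{"name": "acme corp"}, {"name": "Globex"}]

for main_value, reference_value, label in match_against_lookup(
        main_rows, lookup_rows, "name"):
    print(main_value, "->", reference_value, ":", label)
```

Raising or lowering the two thresholds moves rows between the three outputs, which is why the scores you define have such a direct effect on the result.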

For more information about the tRecordMatching component, see tRecordMatching.
