The T-Swoosh algorithm - Cloud - 8.0

Data matching with Talend tools

Version
Cloud
8.0
Language
English
Product
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-06

The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm, but it creates a new master record for each group instead of treating an existing record as the master record.

The order of the input records does not impact the matching process.

To create master records, you can design survivorship rules that decide which attribute values survive in the master record.

There are two types of survivorship rules:

  • The rules related to matching keys: each attribute used as a matching key can have a specific survivorship rule.
  • The default rules: they are applied to all the attributes of the same data type (Boolean, String, Date, Number).

If a column is a matching key, the rule related to matching keys specific to this column is applied.

If the column is not a matching key, the default survivorship rule for this data type is applied. If the default survivorship rule is not defined for the data type, the Most common survivorship function is used.
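The rule lookup described above can be sketched as follows. This is a minimal illustration, not Talend's implementation: `resolve_rule`, `key_rules`, `default_rules` and the column names are hypothetical, and `concatenate`, `most_common` and Python's built-in `max` are simple stand-ins for survivorship functions.

```python
from collections import Counter


def most_common(values):
    """Stand-in for the Most common function: the most frequent value wins."""
    return Counter(values).most_common(1)[0][0]


def concatenate(values, separator=","):
    """Stand-in for the Concatenate function with a separator parameter."""
    return separator.join(str(v) for v in values)


def resolve_rule(column, data_type, key_rules, default_rules):
    """Pick the survivorship function for a column.

    1. A matching key uses the rule defined for that column.
    2. Any other column uses the default rule for its data type.
    3. If no default rule exists for the data type, fall back to most_common.
    """
    if column in key_rules:
        return key_rules[column]
    return default_rules.get(data_type, most_common)


# Hypothetical configuration: one matching key and one default rule.
key_rules = {"fullName": concatenate}
default_rules = {"Number": max}

name_rule = resolve_rule("fullName", "String", key_rules, default_rules)
age_rule = resolve_rule("age", "Number", key_rules, default_rules)
city_rule = resolve_rule("city", "String", key_rules, default_rules)

merged_name = name_rule(["John Doe", "John B. Doe"])
merged_city = city_rule(["Paris", "Paris", "Lyon"])
```

Here `fullName` is a matching key, so it gets its own rule. `age` falls back to the default rule for its data type, and `city` falls back to the most common value because no default rule is defined for strings.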

Each time two records are merged to create a new master record, this new master record is added to the queue of records to be examined. The two records that are merged are removed from the lookup table.

For example, take the following set of records as input:

id  fullName
1   John Doe
2   Donna Lewis
3   John B. Doe
4   Johnnie B. Doe

The survivorship rule uses the Concatenate function with "," as a parameter to separate values.

At the beginning of the process, the queue contains all the input records and the lookup is empty. To process the input records, the algorithm iterates until the queue is empty:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since record 1 does not match any record, it is added to the set of master records. The queue now contains record 2, record 3 and record 4. The lookup contains record 1.
  2. The algorithm takes record 2 and compares it with record 1. Since record 2 does not match record 1, it is added to the set of master records. The queue now contains record 3 and record 4. The lookup contains record 1 and record 2.
  3. The algorithm takes record 3 and compares it with record 1. Record 3 matches record 1, so record 1 and record 3 are merged to create a new master record called record 1,3. Record 1 is removed from the lookup. The queue now contains record 4 and record 1,3. The lookup contains record 2.
  4. The algorithm takes record 4 and compares it with record 2. Since it is not a match, record 4 is added to the set of master records. The queue now contains record 1,3. The lookup contains record 2 and record 4.
  5. The algorithm takes record 1,3 and compares it with record 2 and record 4. Record 1,3 matches record 4, so record 1,3 and record 4 are merged to create a new master record called record 1,3,4. Record 4 is removed from the lookup. Since record 1,3 was the result of a previous merge, it is removed from the table. The queue now contains record 1,3,4. The lookup contains record 2.
  6. The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a match, record 1,3,4 is added to the set of master records. The queue is now empty. The lookup contains record 1,3,4 and record 2.
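The queue and lookup loop walked through above can be sketched in a few lines. This is only an illustration under stated assumptions, not Talend's code: `same_person` is a toy matcher invented for this example (same last word, and one first word is a prefix of the other), a record is modeled as a list of `(id, fullName)` source pairs, and `merge` keeps the source pairs sorted by id so that the concatenated values read in id order.

```python
from collections import deque


def same_person(name_a, name_b):
    """Toy matcher (an assumption, not Talend's matching functions)."""
    a, b = name_a.split(), name_b.split()
    return a[-1] == b[-1] and (a[0].startswith(b[0]) or b[0].startswith(a[0]))


def is_match(rec_a, rec_b):
    """Two records match if any of their source names match."""
    return any(same_person(x, y) for _, x in rec_a for _, y in rec_b)


def merge(rec_a, rec_b):
    """Build a new master record that keeps all source pairs, sorted by id."""
    return sorted(rec_a + rec_b)


def t_swoosh(records):
    queue = deque(records)  # records still to be examined
    lookup = []             # current set of master records
    while queue:
        current = queue.popleft()
        partner = next((r for r in lookup if is_match(current, r)), None)
        if partner is None:
            lookup.append(current)                 # no match: becomes a master
        else:
            lookup.remove(partner)                 # merged record leaves the lookup
            queue.append(merge(partner, current))  # new master is re-examined
    return lookup


def describe(record):
    """Concatenate ids and names with "," as the separator."""
    return (",".join(str(i) for i, _ in record),
            ",".join(name for _, name in record))


inputs = [[(1, "John Doe")], [(2, "Donna Lewis")],
          [(3, "John B. Doe")], [(4, "Johnnie B. Doe")]]
masters = [describe(r) for r in t_swoosh(inputs)]
```

With these inputs, `masters` ends up holding record 2 and record 1,3,4, in that order, which mirrors the six steps above. Re-queuing each merged record is what lets record 1,3 later absorb record 4.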

The output will look like this:

id     fullName                               GRP_ID  GRP_SIZE  MASTER  SCORE  GRP_QUALITY
1,3,4  John Doe, John B. Doe, Johnnie B. Doe  0       3         true    1.0    0.449
1      John Doe                               0       0         false   0.72   0
3      John B. Doe                            0       0         false   0.72   0
4      Johnnie B. Doe                         0       0         false   0.78   0
2      Donna Lewis                            1       1         true    1.0    1.0

As this example shows, the value in the GRP_QUALITY column can be lower than the Match Threshold parameter. A group is built from record pairs whose matching score is greater than or equal to the Match Threshold, but the records in a group are not all compared with each other, whereas GRP_QUALITY takes into account all record pairs in the group.
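The effect can be illustrated with a small sketch. Everything in it is hypothetical: the records A, B and C, their pairwise scores, the threshold value, and above all the assumption that the group quality is the lowest pairwise score in the group. The source only states that GRP_QUALITY takes all record pairs into account, so treat this as one plausible reading rather than the actual formula.

```python
from itertools import combinations

MATCH_THRESHOLD = 0.7  # hypothetical threshold

# Hypothetical pairwise matching scores for three records.
pair_scores = {
    frozenset({"A", "B"}): 0.80,  # compared during matching, above threshold
    frozenset({"B", "C"}): 0.75,  # compared during matching, above threshold
    frozenset({"A", "C"}): 0.50,  # never compared directly, below threshold
}


def group_quality(members, scores):
    """Assumed definition: the lowest score over all pairs in the group."""
    return min(scores[frozenset(pair)] for pair in combinations(members, 2))


quality = group_quality(["A", "B", "C"], pair_scores)
below_threshold = quality < MATCH_THRESHOLD
```

A and C end up in the same group through B, even though their own score is below the threshold, which is why the quality of the whole group can drop under it.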