The T-Swoosh algorithm - 7.0

Data matching

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio
The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm, but it builds a new master record instead of designating one of the existing records as the master.

To create master records, you can design survivorship rules that decide which attribute values survive.

There are two types of survivorship rules:

  • The rules related to matching keys: each attribute used as a matching key can have a specific survivorship rule.
  • The default rules: they are applied to all the attributes of the same data type (Boolean, String, Date, Number).

If a column is a matching key, the survivorship rule defined for that matching key is applied.

If the column is not a matching key, the default survivorship rule for its data type is applied. If no default survivorship rule is defined for that data type, the Most common survivorship function is used.
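The order in which a survivorship rule is chosen can be sketched as follows. This is only an illustration of the precedence described above, not Talend's implementation: the function `pick_rule`, the dictionaries `key_rules` and `default_rules`, and the helper implementations of Concatenate and Most common are all hypothetical names introduced for this example.

```python
from collections import Counter


def most_common(values):
    # Fallback function: keep the value that occurs most often.
    return Counter(values).most_common(1)[0][0]


def concatenate(values, sep=","):
    # Join all values with the given separator.
    return sep.join(str(v) for v in values)


def pick_rule(column, data_type, key_rules, default_rules):
    # 1. A rule attached to a matching key takes precedence.
    if column in key_rules:
        return key_rules[column]
    # 2. Otherwise use the default rule for the data type,
    # 3. and fall back to Most common when no default is defined.
    return default_rules.get(data_type, most_common)


# Hypothetical configuration: fullName is a matching key, and only the
# Number data type has a default rule (here, keep the largest value).
key_rules = {"fullName": concatenate}
default_rules = {"Number": max}

rule_for_name = pick_rule("fullName", "String", key_rules, default_rules)
rule_for_age = pick_rule("age", "Number", key_rules, default_rules)
rule_for_city = pick_rule("city", "String", key_rules, default_rules)
```

With this assumed configuration, `fullName` gets the Concatenate rule, `age` gets the Number default, and `city` falls back to Most common because no default rule exists for String.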

Each time two records are merged to create a new master record, this new master record is added to the queue of records to be examined. The two records that are merged are removed from the lookup table.

For example, take the following set of records as input:

id  fullName
1   John Doe
2   Donna Lewis
3   John B. Doe
4   Johnnie B. Doe

The survivorship rule uses the Concatenate function with "," as a parameter to separate values.

At the beginning of the process, the queue contains all the input records and the lookup is empty. To process the input records, the algorithm iterates until the queue is empty:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since record 1 does not match any record, it is added to the set of master records. The queue now contains record 2, record 3 and record 4. The lookup contains record 1.
  2. The algorithm takes record 2 and compares it with record 1. Since record 2 does not match any record, it is added to the set of master records. The queue now contains record 3 and record 4. The lookup contains record 1 and record 2.
  3. The algorithm takes record 3 and compares it with record 1. Record 3 matches record 1, so the two records are merged to create a new master record called record 1,3. Record 1 is removed from the lookup. The queue now contains record 4 and record 1,3. The lookup contains record 2.
  4. The algorithm takes record 4 and compares it with record 2. Since it is not a match, record 4 is added to the set of master records. The queue now contains record 1,3. The lookup contains record 2 and record 4.
  5. The algorithm takes record 1,3 and compares it with record 2 and record 4. Record 1,3 matches record 4, so the two records are merged to create a new master record called record 1,3,4. Record 4 is removed from the lookup. Record 1,3, itself the result of a previous merge, is removed as well. The queue now contains record 1,3,4. The lookup contains record 2.
  6. The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a match, record 1,3,4 is added to the set of master records. The queue is now empty. The lookup contains record 1,3,4 and record 2.
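The six steps above can be replayed with a short sketch of the queue and lookup loop. It is a simplified illustration, not the code used by Talend Studio. In particular, `matches` is a hypothetical stand-in matcher (two records match when any of their names end with the same token), chosen only because it reproduces the match decisions of this example. A real job would use the configured matching functions and thresholds.

```python
from collections import deque

SEP = ","  # separator passed to the Concatenate survivorship function


def last_token(full_name):
    return full_name.split()[-1].lower()


def matches(a, b):
    # Hypothetical stand-in matcher (assumption): records match when any of
    # their member names end with the same token.
    return bool({last_token(n) for _, n in a} & {last_token(n) for _, n in b})


def merge(a, b):
    # New master record holding the members of both records, ordered by id.
    return sorted(a + b)


def t_swoosh(rows):
    # A record is a list of (id, fullName) members; input rows start as
    # single-member records waiting in the queue.
    queue = deque([row] for row in rows)
    lookup = []
    while queue:
        current = queue.popleft()
        partner = next((rec for rec in lookup if matches(current, rec)), None)
        if partner is None:
            lookup.append(current)  # no match: keep as a master record
        else:
            lookup.remove(partner)  # matched record leaves the lookup
            queue.append(merge(current, partner))  # merged record is re-examined
    return lookup


def describe(record):
    # Apply Concatenate to the ids and names of a master record.
    ids = SEP.join(str(i) for i, _ in record)
    names = SEP.join(n for _, n in record)
    return ids, names


rows = [(1, "John Doe"), (2, "Donna Lewis"), (3, "John B. Doe"), (4, "Johnnie B. Doe")]
masters = [describe(rec) for rec in t_swoosh(rows)]
```

Under these assumptions, `masters` ends up with two entries: record 2 on its own, and the merged record 1,3,4 whose name is the concatenation of the three source names. Putting each merged record back in the queue is what lets record 1,3 later absorb record 4.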

The output will look like this:

id     fullName                                GRP_ID  GRP_SIZE  MASTER  SCORE  GRP_QUALITY
1,3,4  John Doe, John B. Doe, Johnnie B. Doe   0       3         true    1.0    0.72
1      John Doe                                0       0         false   0.72   0
3      John B. Doe                             0       0         false   0.72   0
4      Johnnie B. Doe                          0       0         false   0.78   0
2      Donna Lewis                             1       1         true    1.0    1.0