The T-Swoosh algorithm - 6.5

Using tMatchGroup with the Simple VSR Matcher and T-Swoosh algorithms

Author: Talend Documentation Team
Version: 6.5
Platform: Talend Studio

The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm, but it creates a new master record instead of treating existing records as master records. To create master records, users can design survivorship rules that decide which attribute values survive.

There are two types of rules:

  • The rules related to matching keys: each attribute used as a matching key can have a specific survivorship rule.

  • The default rules: these rules are applied to all the attributes of the same data type (Boolean, String, Date, Number).

If a column is a matching key, the rule related to matching keys specific to this column is applied.

If the column is not a matching key, the default survivorship rule for its data type is applied. If no default survivorship rule is defined for that data type, the most common survivorship function is used.
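The rule selection described above can be sketched in a few lines of Python. This is only an illustration, not the component's implementation: `pick_rule`, `key_rules`, `default_rules`, `longest` and `concatenate` are hypothetical names, and the fallback is read here as a function that keeps the most frequent value, which is an assumption.

```python
from collections import Counter


def most_common(values):
    """Assumed fallback: keep the most frequent value (first seen wins ties)."""
    return Counter(values).most_common(1)[0][0]


def longest(values):
    """Hypothetical rule: keep the longest value."""
    return max(values, key=len)


def concatenate(values, separator=", "):
    """Hypothetical rule: join all values with a separator."""
    return separator.join(values)


def pick_rule(column, data_type, key_rules, default_rules):
    """Return the survivorship function that applies to a column."""
    # 1. A matching key uses the rule defined for that specific column.
    if column in key_rules:
        return key_rules[column]
    # 2. Otherwise the default rule for the data type applies;
    # 3. if none is defined, fall back to the most common value.
    return default_rules.get(data_type, most_common)


# Illustrative configuration (not taken from the documentation).
key_rules = {"fullName": longest}
default_rules = {"String": concatenate}

name_rule = pick_rule("fullName", "String", key_rules, default_rules)
city_rule = pick_rule("city", "String", key_rules, default_rules)
age_rule = pick_rule("age", "Number", key_rules, default_rules)
```

With this configuration, `fullName` resolves to its own key rule, `city` to the default String rule, and `age` to the fallback because no default is defined for Number.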

Each time two records are merged to create a new master record, this new master record is added to the queue of records to be examined. The two records that are merged are removed from the lookup table.

For example, take the following set of records as input:

id  fullName
1   John Doe
2   Donna Lewis
3   John B. Doe
4   Johnnie B. Doe

The survivorship rule uses the concatenate function, with a comma (,) as the parameter that separates the values.

At the beginning of the process, the queue contains all the input records and the lookup is empty. To process the input records, the algorithm iterates until the queue is empty:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since it does not match any record, record 1 is added to the set of master records. The queue now contains record 2, record 3 and record 4. The lookup table contains record 1.

  2. The algorithm takes record 2 and compares it with record 1. Since it does not match any record, record 2 is added to the set of master records. The queue now contains record 3 and record 4. The lookup table contains record 1 and record 2.

  3. The algorithm takes record 3 and compares it with record 1. Record 3 matches record 1, so record 1 and record 3 are merged to create a new master record called record 1,3. Record 1 is removed from the lookup table. The queue now contains record 4 and record 1,3. The lookup table contains record 2.

  4. The algorithm takes record 4 and compares it with record 2. It is not a match, so record 4 is also a master record. The queue now contains record 1,3. The lookup table contains record 2 and record 4.

  5. The algorithm takes record 1,3 and compares it with record 2 and record 4. Record 1,3 matches record 4, so record 1,3 and record 4 are merged to create a new master record called record 1,3,4. Record 4 is removed from the lookup table. Because record 1,3 was itself the result of a previous merge, it is removed from the table as well. The queue now contains record 1,3,4. The lookup table contains record 2.

  6. The algorithm takes record 1,3,4 and compares it with record 2. It is not a match, so record 1,3,4 is added to the lookup table. The queue is now empty. The lookup table contains record 1,3,4 and record 2.
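The queue and lookup mechanics of the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tMatchGroup implementation: `Record`, `surname_match`, `concatenate_merge` and `t_swoosh` are hypothetical names, the matcher is a deliberately crude stand-in (same last word in fullName) for whatever matching function is configured, record 4 is taken to be "Johnnie B. Doe" as in the output, and the separator is a comma followed by a space to mirror the displayed values. The sketch does not compute GRP_ID, SCORE or GRP_QUALITY.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    ids: tuple       # ids of the input records merged into this record
    full_name: str   # surviving fullName value


def surname_match(a, b):
    """Hypothetical matcher: two records match if their last words are equal."""
    return a.full_name.split()[-1].lower() == b.full_name.split()[-1].lower()


def concatenate_merge(a, b, separator=", "):
    """Concatenate survivorship: join both values, lower ids first."""
    first, second = sorted((a, b), key=lambda r: r.ids)
    return Record(first.ids + second.ids,
                  separator.join((first.full_name, second.full_name)))


def t_swoosh(records, matches, merge):
    queue = deque(records)   # records still to be examined
    lookup = []              # current master records
    while queue:
        current = queue.popleft()
        # Look for the first master record that matches the current record.
        partner = next((r for r in lookup if matches(current, r)), None)
        if partner is None:
            # No match: the current record becomes a master record.
            lookup.append(current)
        else:
            # Match: drop the matched master and queue the merged record
            # so that it is compared again with the remaining masters.
            lookup.remove(partner)
            queue.append(merge(partner, current))
    return lookup


records = [
    Record((1,), "John Doe"),
    Record((2,), "Donna Lewis"),
    Record((3,), "John B. Doe"),
    Record((4,), "Johnnie B. Doe"),
]
masters = t_swoosh(records, surname_match, concatenate_merge)
for master in masters:
    print(",".join(map(str, master.ids)), "|", master.full_name)
```

Run on the four sample records, the loop goes through the same six iterations as the walk-through and ends with two master records: record 2 and the merged record 1,3,4. Re-queuing the merged record, rather than leaving it in the lookup table, is what allows a master built from records 1 and 3 to match record 4 later.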

The output will look like this:

id     fullName                               GRP_ID  GRP_SIZE  MASTER  SCORE  GRP_QUALITY
1,3,4  John Doe, John B. Doe, Johnnie B. Doe  0       3         true    1.0    0.72
1      John Doe                               0       0         false   0.72   0
3      John B. Doe                            0       0         false   0.72   0
4      Johnnie B. Doe                         0       0         true    0.78   0
2      Donna Lewis                            1       1         true    1.0    1.0