The Simple VSR Matcher algorithm - 6.5

Using tMatchGroup with the Simple VSR Matcher and T-Swoosh algorithms

author
Talend Documentation Team
EnrichVersion
6.5
EnrichPlatform
Talend Studio
The Simple VSR Matcher algorithm compares each record within the same block with the previous master records, which play the role of a lookup table.

If a record does not match any of the previous master records, it is considered a new master record and is added to the lookup table. This means that the first record is necessarily a master record.

When a record matches a master record, the algorithm does not look for further matches among the other master records, because no two master records in the lookup table are similar to each other. So, once a record matches a master record, the chance of it matching another master record is low.

This means a record can only exist in one group of records and be linked to one master record.

For example, take the following set of records as input:

| id | fullName        |
|----|-----------------|
| 1  | John Doe        |
| 2  | Donna Lewis     |
| 3  | John B. Doe     |
| 4  | Louis Armstrong |

The algorithm processes the input records as follows:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since record 1 does not match any record, it is added to the lookup table.
  2. The algorithm takes record 2 and compares it with record 1. It is not a match. So, record 2 is also added to the lookup table.
  3. The algorithm takes record 3 and compares it with records 1 and 2. Record 3 matches record 1. So, record 3 is added to the group of record 1.
  4. The algorithm takes record 4 and compares it with records 1 and 2 but not with record 3, which is not a master record. It is not a match. So, record 4 is also added to the lookup table.
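
The steps above can be sketched in a few lines of Python. This is only an illustration of the first-match logic, not the tMatchGroup implementation: the `similarity` function (based on `difflib.SequenceMatcher`), the `0.8` threshold and the `simple_vsr_match` name are assumptions made for this example, so the scores it computes differ from the ones the component reports.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity measure (assumption): ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def simple_vsr_match(records, threshold=0.8):
    """Group records by comparing each one with the previous masters only."""
    masters = []  # lookup table of (id, fullName) master records
    groups = {}   # master id -> ids of the records in its group
    for rec_id, name in records:
        for master_id, master_name in masters:
            if similarity(name, master_name) >= threshold:
                # First match wins: stop looking at other masters.
                groups[master_id].append(rec_id)
                break
        else:
            # No master matched: the record becomes a new master.
            masters.append((rec_id, name))
            groups[rec_id] = [rec_id]
    return groups


records = [
    (1, "John Doe"),
    (2, "Donna Lewis"),
    (3, "John B. Doe"),
    (4, "Louis Armstrong"),
]
groups = simple_vsr_match(records)
print(groups)
```

With these assumptions, records 1, 2 and 4 become master records and record 3 joins the group of record 1, as in the walkthrough. Record 3 is never added to `masters`, which is why record 4 is not compared with it.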

The output will look like this:

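| id | fullName        | Grp_ID | Grp_Size | Master | Score | GRP_QUALITY |
|----|-----------------|--------|----------|--------|-------|-------------|
| 1  | John Doe        | 0      | 2        | true   | 1.0   | 0.72        |
| 3  | John B. Doe     | 0      | 0        | false  | 0.72  | 0           |
| 2  | Donna Lewis     | 1      | 1        | true   | 1.0   | 1.0         |
| 4  | Louis Armstrong | 2      | 1        | true   | 1.0   | 1.0         |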
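
The following sketch shows one way the extra output columns could be derived from the grouping result. The rules it encodes are inferred from the sample output above and are assumptions, not a description of the component's internals: the master row carries the group size and, as GRP_QUALITY, the lowest score in the group, while non-master rows carry 0 in both columns. The `build_output` function and the `assignments` structure are hypothetical names used for this example, and the 0.72 score is taken from the sample output rather than computed.

```python
def build_output(records, assignments):
    """records: id -> fullName; assignments: id -> (master id, score)."""
    # Collect the members of each group, keyed by master id,
    # in order of first appearance.
    members = {}
    for rec_id, (master_id, score) in assignments.items():
        members.setdefault(master_id, []).append((rec_id, score))

    rows = []
    for grp_id, (master_id, group) in enumerate(members.items()):
        quality = min(score for _, score in group)
        # List the master row first, then the other members.
        for rec_id, score in sorted(group, key=lambda m: m[0] != master_id):
            is_master = rec_id == master_id
            rows.append({
                "id": rec_id,
                "fullName": records[rec_id],
                "Grp_ID": grp_id,
                "Grp_Size": len(group) if is_master else 0,
                "Master": is_master,
                "Score": score,
                "GRP_QUALITY": quality if is_master else 0.0,
            })
    return rows


records = {1: "John Doe", 2: "Donna Lewis", 3: "John B. Doe", 4: "Louis Armstrong"}
# Master id and matching score per record, taken from the sample output.
assignments = {1: (1, 1.0), 2: (2, 1.0), 3: (1, 0.72), 4: (4, 1.0)}
rows = build_output(records, assignments)
for row in rows:
    print(row)
```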