Multi-pass matching - 6.5

Using tMatchGroup with the Simple VSR Matcher and T-Swoosh algorithms

Author: Talend Documentation Team
Version: 6.5
Platform: Talend Studio

You can design a Job with consecutive tMatchGroup components to create data partitions based on different blocking keys.

For example, suppose you want to find duplicates that have either the same city or the same zip code in a customer database. In this case, you can use two consecutive tMatchGroup components to process the data partitions:

  • One tMatchGroup in which the column "city" is defined as a blocking key.

  • One tMatchGroup in which the column "ZipCode" is defined as a blocking key.

The idea behind multi-pass matching is to reuse the master records defined in the previous pass as the input of the current tMatchGroup component. Multi-pass matching is more effective when the blocking keys are largely uncorrelated. For example, it is not useful to define the column "country" as one blocking key and the column "city" as another, because all the comparisons done with the blocking key "city" are also done with the blocking key "country".
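The redundancy of correlated blocking keys can be illustrated with a minimal Python sketch. This is not Talend code: the `records` list, its `country` values, and the `candidate_pairs` helper are hypothetical. The sketch only models blocking as "two records are compared only when they share the same key value", and it assumes that each city belongs to a single country.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "city": "Nantes", "country": "France"},
    {"id": 2, "city": "Nantes", "country": "France"},
    {"id": 3, "city": "Paris", "country": "France"},
    {"id": 4, "city": "Berlin", "country": "Germany"},
    {"id": 5, "city": "Berlin", "country": "Germany"},
]


def candidate_pairs(rows, blocking_key):
    """Return the set of id pairs that share the same blocking key value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_key]].append(row["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


city_pairs = candidate_pairs(records, "city")
country_pairs = candidate_pairs(records, "country")

# Every pair compared within a "city" block is also compared within a
# "country" block, so a pass on "city" adds no new comparisons.
print(city_pairs <= country_pairs)
# Pairs that only the "country" pass compares.
print(sorted(country_pairs - city_pairs))
```

With two uncorrelated keys, such as "city" and "ZipCode" in a dataset where zip codes are often missing, neither set of pairs would contain the other, so each pass would contribute comparisons that the other one misses.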

When using multi-pass matching with the VSR algorithm, only master records of size 1 (records that did not match any other record) are compared with master records of any size. Two master records that are each derived from at least two children are never compared with each other.
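The comparison rule for the VSR algorithm can be sketched as a simple predicate. The following Python snippet is an illustrative model, not the component's implementation: the `masters` mapping and the `is_compared` function are hypothetical, and all masters are assumed to fall into the same block in the current pass.

```python
from itertools import combinations

# Hypothetical master records from a previous pass: id -> group size.
masters = {"A": 1, "B": 3, "C": 2, "D": 1}


def is_compared(size_a, size_b):
    """Model of the rule: a pair is compared only if at least one
    of the two master records has a group size of 1."""
    return size_a == 1 or size_b == 1


all_pairs = list(combinations(sorted(masters), 2))
compared = [(a, b) for a, b in all_pairs if is_compared(masters[a], masters[b])]
skipped = [(a, b) for a, b in all_pairs if not is_compared(masters[a], masters[b])]

print(compared)
print(skipped)
```

In this sketch, the only skipped pair is B and C, because both masters already represent groups of two or more records.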

In the following example, the dataset contains 4 records. It is assumed that the first tMatchGroup component has a blocking key on the zip code column, and the second tMatchGroup component has a blocking key on the city column. The attribute "name" is used as a matching key.

id | name        | city   | ZipCode
---|-------------|--------|--------
1  | John Doe    | Nantes | 44000
2  | John B. Doe | Nantes |
3  | Jon Doe     | Nantes | 44000
4  | John Doe    | Nantes |

After the first pass, records 1 and 3 are grouped (record 1 is the master), and records 2 and 4 are grouped (record 2 is the master). In the second tMatchGroup, only the master records from the first pass, record 1 and record 2, are candidates for comparison. However, because each of them has a group size strictly greater than 1, they are not compared with each other. The results are therefore the following:

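id | name        | city   | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY
---|-------------|--------|---------|-----|----------|--------|-------|------------
1  | John Doe    | Nantes | 44000   | 0   | 2        | true   | 1.0   | 0.875
3  | Jon Doe     | Nantes | 44000   | 0   | 0        | false  | 0.85  | 0
2  | John B. Doe | Nantes |         | 1   | 2        | true   | 1.0   | 0.72
4  | John Doe    | Nantes |         | 1   | 0        | false  | 0.72  | 0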

By contrast, running the T-Swoosh algorithm with the same parameters and the Most common survivorship function returns the following results:

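id | name        | city   | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY
---|-------------|--------|---------|-----|----------|--------|-------|------------
1  | John Doe    | Nantes | 44000   | 0   | 4        | true   | 1.0   | 0.72
1  | John Doe    | Nantes | 44000   | 0   | 0        | true   | 0.875 | 0
3  | Jon Doe     | Nantes | 44000   | 0   | 0        | false  | 0.875 | 0
2  | John B. Doe | Nantes |         | 0   | 0        | true   | 0.72  | 0
4  | John Doe    | Nantes |         | 1   | 0        | false  | 0.72  | 0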