Multi-pass matching - 6.4

Matching data

author: Talend Documentation Team
EnrichVersion: 6.4
EnrichProdName: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
task: Data Quality and Preparation > Matching data
EnrichPlatform: Talend Studio

You can design a Job with consecutive tMatchGroup components to create data partitions based on different blocking keys.

For example, suppose you want to find duplicates that have either the same city or the same zip code in a customer database. In this case, you can use two consecutive tMatchGroup components to process the data partitions:

  • One tMatchGroup in which the column "city" is defined as a blocking key.

  • One tMatchGroup in which the column "ZipCode" is defined as a blocking key.

What is multi-pass matching?

The idea behind multi-pass matching is to reuse the master records defined in the previous pass as the input of the current tMatchGroup component. Multi-pass matching is most effective when the blocking keys are weakly correlated. For example, it is not useful to define the column "country" as one blocking key and the column "city" as another, because every comparison made with the blocking key "city" is also made with the blocking key "country".
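The "country" and "city" case can be illustrated with a minimal Python sketch. The records and the `candidate_pairs` helper are hypothetical and are not part of Talend Studio; the sketch only assumes that records are compared when they share the same blocking key value, and that each city belongs to a single country.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key):
    """Return the set of id pairs that fall in the same block for `key`."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical records, not taken from the documentation.
records = [
    {"id": 1, "country": "France", "city": "Nantes"},
    {"id": 2, "country": "France", "city": "Nantes"},
    {"id": 3, "country": "France", "city": "Paris"},
    {"id": 4, "country": "Germany", "city": "Berlin"},
]

by_country = candidate_pairs(records, "country")
by_city = candidate_pairs(records, "city")

# Every pair compared under "city" is already compared under "country",
# so a pass on "city" adds no new comparisons.
print(by_city <= by_country)  # True
print(by_city - by_country)   # set()
```

With weakly correlated keys, such as "city" and "ZipCode" when some zip codes are missing, the second set of pairs would typically contain pairs that the first pass never compared.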

When using multi-pass matching with the VSR algorithm, only master records of size 1 (records that did not match any other record) are compared with master records of any size. Two master records that are each derived from at least two children are never compared with each other.
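The rule above can be expressed as a small predicate. This is only a sketch of the rule as stated here, not Talend's implementation; the function names and the sample master records are hypothetical, and all masters are assumed to fall in the same block.

```python
from itertools import combinations

def vsr_should_compare(size_a, size_b):
    """A pair of master records is compared only if at least one
    of them has a group size of 1."""
    return size_a == 1 or size_b == 1

def comparable_pairs(masters):
    """masters maps a master record id to its group size from the previous pass."""
    return [
        (a, b)
        for a, b in combinations(sorted(masters), 2)
        if vsr_should_compare(masters[a], masters[b])
    ]

# Hypothetical masters: A matched nothing, B and C already have children.
masters = {"A": 1, "B": 3, "C": 2}
print(comparable_pairs(masters))  # [('A', 'B'), ('A', 'C')]
```

The pair (B, C) is skipped because both masters already represent groups of at least two records.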

An example of multi-pass matching

In the following example, the dataset contains four records. It is assumed that the first tMatchGroup component has a blocking key on the column "ZipCode", and the second tMatchGroup component has a blocking key on the column "city". The attribute "name" is used as a matching key.

| id | name        | city   | ZipCode |
|----|-------------|--------|---------|
| 1  | John Doe    | Nantes | 44000   |
| 2  | John B. Doe | Nantes |         |
| 3  | Jon Doe     | Nantes | 44000   |
| 4  | John Doe    | Nantes |         |

After the first pass, records 1 and 3 are grouped, and records 2 and 4 are grouped. In these groups, record 1 and record 2 are master records.
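The blocks formed by each blocking key on this dataset can be sketched in Python as follows. The `blocks` helper is hypothetical, and the sketch assumes that an empty ZipCode is treated as a blocking value of its own, which is consistent with records 2 and 4 ending up in the same group.

```python
from collections import defaultdict

# The four records of the example; an empty string stands for a missing ZipCode.
records = [
    {"id": 1, "name": "John Doe",    "city": "Nantes", "ZipCode": "44000"},
    {"id": 2, "name": "John B. Doe", "city": "Nantes", "ZipCode": ""},
    {"id": 3, "name": "Jon Doe",     "city": "Nantes", "ZipCode": "44000"},
    {"id": 4, "name": "John Doe",    "city": "Nantes", "ZipCode": ""},
]

def blocks(records, key):
    """Group record ids by the value of the blocking key."""
    out = defaultdict(list)
    for rec in records:
        out[rec[key]].append(rec["id"])
    return dict(out)

print(blocks(records, "ZipCode"))  # {'44000': [1, 3], '': [2, 4]}
print(blocks(records, "city"))     # {'Nantes': [1, 2, 3, 4]}
```

The first pass only compares records inside each "ZipCode" block, while the second pass places all records in a single "city" block.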

In the second tMatchGroup, only the master records from the first pass, record 1 and record 2, are candidates for comparison. However, since each of them has a group size strictly greater than 1, the VSR algorithm does not compare them, and the two groups remain separate.

The following results are returned:

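| id | name        | city   | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
|----|-------------|--------|---------|-----|----------|--------|-------|-------------|
| 1  | John Doe    | Nantes | 44000   | 0   | 2        | true   | 1.0   | 0.875       |
| 3  | Jon Doe     | Nantes | 44000   | 0   | 0        | false  | 0.85  | 0           |
| 2  | John B. Doe | Nantes |         | 1   | 2        | true   | 1.0   | 0.72        |
| 4  | John Doe    | Nantes |         | 1   | 0        | false  | 0.72  | 0           |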

When running the T-Swoosh algorithm with the same parameters and the Most common survivorship function, the following results are returned:

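| id | name        | city   | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
|----|-------------|--------|---------|-----|----------|--------|-------|-------------|
| 1  | John Doe    | Nantes | 44000   | 0   | 4        | true   | 1.0   | 0.72        |
| 1  | John Doe    | Nantes | 44000   | 0   | 0        | true   | 0.875 | 0           |
| 3  | Jon Doe     | Nantes | 44000   | 0   | 0        | false  | 0.875 | 0           |
| 2  | John B. Doe | Nantes |         | 0   | 0        | true   | 0.72  | 0           |
| 4  | John Doe    | Nantes |         | 1   | 0        | false  | 0.72  | 0           |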
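The values of the merged master record (John Doe, Nantes, 44000) are consistent with keeping the most frequent value of each column across the four records. The sketch below illustrates that idea only; `most_common_value` is a hypothetical helper, and ignoring empty values is an assumption made for this example, not a documented behavior of the Most common survivorship function.

```python
from collections import Counter

def most_common_value(values):
    """Return the most frequent non-empty value, or "" if there is none.
    Skipping empty values is an assumption of this sketch."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return ""
    return Counter(non_empty).most_common(1)[0][0]

# The four records of the example; an empty string stands for a missing ZipCode.
group = [
    {"name": "John Doe",    "city": "Nantes", "ZipCode": "44000"},
    {"name": "John B. Doe", "city": "Nantes", "ZipCode": ""},
    {"name": "Jon Doe",     "city": "Nantes", "ZipCode": "44000"},
    {"name": "John Doe",    "city": "Nantes", "ZipCode": ""},
]

master = {
    col: most_common_value([rec[col] for rec in group])
    for col in ("name", "city", "ZipCode")
}
print(master)  # {'name': 'John Doe', 'city': 'Nantes', 'ZipCode': '44000'}
```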