Multi-pass matching - Cloud

Data matching with Talend tools

Version

Cloud

8.0

Language

English

Product

Talend Big Data Platform

Talend Data Fabric

Talend Data Management Platform

Talend Data Services Platform

Talend MDM Platform

Talend Real-Time Big Data Platform

Module

Talend Studio

Last publication date

2024-02-06

You can design a Job with consecutive tMatchGroup components to create data partitions based on different blocking keys.

For example, suppose you want to find duplicates that share either the same city or the same zip code in a customer database. In this case, you can use two consecutive tMatchGroup components to process the data partitions:

One tMatchGroup component in which the column "city" is defined as a blocking key.
One tMatchGroup component in which the column "ZipCode" is defined as a blocking key.
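
The effect of the two blocking keys can be pictured as two different partitions of the same records, where only records that fall into the same block are candidates for comparison. The following Python sketch is an illustration only, not how tMatchGroup is implemented: the candidate_pairs helper and the sample customers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Return the id pairs of records that share the same blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[blocking_key]].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample data, not taken from the documentation.
customers = [
    {"id": 1, "city": "Nantes", "ZipCode": "44000"},
    {"id": 2, "city": "Nantes", "ZipCode": "44100"},
    {"id": 3, "city": "Paris", "ZipCode": "75001"},
    {"id": 4, "city": "Paris 1er", "ZipCode": "75001"},
]

city_pairs = candidate_pairs(customers, "city")    # pass blocked on "city"
zip_pairs = candidate_pairs(customers, "ZipCode")  # pass blocked on "ZipCode"
print(city_pairs, zip_pairs)
```

In this sample, each blocking key proposes a pair that the other one misses, which is the situation where a second pass adds value.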

What is multi-pass matching?

The idea behind multi-pass matching is to reuse the master records defined in the previous pass as the input of the current tMatchGroup component. Multi-pass matching is most effective when the blocking keys are only weakly correlated. For example, it is not useful to define the column "country" as one blocking key and the column "city" as another, because two records that share the same city also share the same country: every comparison made with the blocking key "city" is also made with the blocking key "country".
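
To see why correlated blocking keys add nothing, the following Python sketch lists the record pairs that each key would bring together. It is a simplified illustration under stated assumptions: the pairs_within_blocks helper and the sample records are hypothetical, and every city is assumed to belong to exactly one country.

```python
from itertools import combinations


def pairs_within_blocks(records, key):
    """Set of id pairs whose records share the same value for `key`."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if a[key] == b[key]
    }


# Hypothetical records: each city lies in a single country.
records = [
    {"id": 1, "country": "France", "city": "Nantes"},
    {"id": 2, "country": "France", "city": "Nantes"},
    {"id": 3, "country": "France", "city": "Paris"},
    {"id": 4, "country": "Germany", "city": "Berlin"},
]

by_city = pairs_within_blocks(records, "city")
by_country = pairs_within_blocks(records, "country")

# Pairs proposed by "city" that "country" had not already proposed.
new_pairs = by_city - by_country
print(new_pairs)  # set()
```

The pairs found with "city" are a subset of those found with "country", so a pass blocked on "city" would not produce any comparison that the "country" pass had not already covered.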

When using multi-pass matching with the Simple VSR matcher algorithm, only master records of size 1 (records that did not match any other record) are compared with master records of any size. Two master records that are each derived from at least two children are never compared with each other.
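
The comparison rule stated above can be written as a small predicate on the group sizes of two master records. This is a minimal sketch of the rule as described in this section, not Talend code; the function name compared_in_next_pass is hypothetical.

```python
def compared_in_next_pass(size_a, size_b):
    """Return True if two master records are compared in the next pass.

    Per the rule above, a comparison happens only when at least one of the
    two masters has a group size of 1, that is, it matched no other record
    in the previous pass.
    """
    return size_a == 1 or size_b == 1


print(compared_in_next_pass(1, 1))  # True: two unmatched records
print(compared_in_next_pass(1, 3))  # True: unmatched record vs. existing group
print(compared_in_next_pass(2, 2))  # False: both masters already have children
```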

An example of multi-pass matching

In the following example, the dataset contains four records. It is assumed that the first tMatchGroup component has a blocking key on the column "ZipCode", and the second tMatchGroup component has a blocking key on the column "city". The attribute "name" is used as a matching key.

id	name	city	ZipCode
1	John Doe	Nantes	44000
2	John B. Doe	Nantes
3	Jon Doe	Nantes	44000
4	John Doe	Nantes

After the first pass, records 1 and 3 are grouped, and records 2 and 4 are grouped. In these groups, record 1 and record 2 are master records.

In the second tMatchGroup, only the master records from the first pass, record 1 and record 2, are candidates for comparison. Because the group size of each of them is strictly greater than 1, they are not compared with each other, and the two groups are not merged. The order in which the input records are sorted is therefore very important.

The following results are returned:

id	name	city	ZipCode	GID	GRP_SIZE	MASTER	SCORE	GRP_QUALITY
1	John Doe	Nantes	44000	0	2	true	1.0	0.875
3	Jon Doe	Nantes	44000	0	0	false	0.875	0
2	John B. Doe	Nantes		1	2	true	1.0	0.727
4	John Doe	Nantes		1	0	false	0.72	0
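
The GRP_QUALITY values in this table (0.875 and 0.727) can be reproduced with a normalized Levenshtein similarity on the "name" column. This section does not say which matching function is configured, so the sketch below is an assumption made for illustration only; the levenshtein and similarity helpers are hypothetical and are not Talend APIs.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(round(similarity("John Doe", "Jon Doe"), 3))      # 0.875
print(round(similarity("John B. Doe", "John Doe"), 3))  # 0.727
```

Under this assumption, "Jon Doe" is one edit away from "John Doe" (8 characters), and "John B. Doe" (11 characters) is three edits away from "John Doe".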

When running the T-Swoosh algorithm with the same parameters and the Most common survivorship function, the following results are returned:

id	name	city	ZipCode	GID	GRP_SIZE	MASTER	SCORE	GRP_QUALITY
1	John Doe	Nantes	44000	0	4	true	1.0	0.727
1	John Doe	Nantes	44000	0	0	true	0.875	0
3	Jon Doe	Nantes	44000	0	0	false	0.875	0
2	John B. Doe	Nantes		0	0	true	0.72	0
4	John Doe	Nantes		0	0	false	0.72	0