How does tMatchPairing compute the sample of suspect duplicate pairs?

Matching with machine learning from an algorithmic standpoint

author
Talend Documentation Team
EnrichVersion
6.4
EnrichProdName
Talend Data Fabric
Talend Big Data Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

The list of suspect duplicate pairs can be very large. You label only a subset of this list to identify the potential groups of duplicates.

You can then use machine learning to predict labels for the whole list. It is also possible to output a sample of this list: you set the sample size manually, and the pairs in the sample are chosen at random.
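The random selection can be illustrated with a minimal Python sketch. This is not the tMatchPairing implementation, only an illustration of drawing a fixed-size random sample from a list of pairs. The function name sample_suspect_pairs, its sample_size and seed parameters, and the record identifiers are hypothetical.

```python
import random


def sample_suspect_pairs(pairs, sample_size, seed=None):
    """Return a random sample of suspect pairs, drawn without replacement.

    Hypothetical helper for illustration only; it does not reproduce
    how tMatchPairing selects its sample internally.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    # Never ask for more pairs than the list contains.
    k = min(sample_size, len(pairs))
    return rng.sample(pairs, k)


# Made-up suspect pairs: each tuple holds two record identifiers.
suspect_pairs = [(f"rec{i}", f"rec{i + 1}") for i in range(0, 2000, 2)]

# Draw a sample whose size is fixed by hand, as described above.
sample = sample_suspect_pairs(suspect_pairs, sample_size=10, seed=42)
```

Because the sample is drawn at random, each suspect pair has the same chance of being selected, regardless of its position in the list.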

For an example of how to label suspect pairs in a Grouping campaign created in Talend Data Stewardship, see Handling grouping tasks to decide on relationship among pairs of records.