Computing suspect pairs and writing a sample in Talend Data Stewardship - 7.0

Matching with machine learning

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Data Stewardship
Talend Studio

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

For more technologies supported by Talend, see Talend components.

Finding duplicate records is hard and time-consuming, especially when you are dealing with huge volumes of data. In this example, tMatchPairing uses a blocking key to compute the pairs of suspect duplicates in a long list of early childhood education centers in Chicago that comes from ten different sources.
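
To give an idea of why a blocking key reduces the work, the following Python sketch groups records by a key and only pairs records that share the same key. It is a generic illustration of blocking, not the algorithm tMatchPairing implements internally. The record values, the `blocking_key` function and the choice of key (first three characters of the name plus the ZIP code) are hypothetical and only serve the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three alphanumeric characters of the
    # upper-cased name, followed by the ZIP code.
    name = "".join(ch for ch in record["name"].upper() if ch.isalnum())
    return name[:3] + "|" + record["zip"]


def suspect_pairs(records):
    # Group record indexes by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    # Only records inside the same block are compared with each other.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return sorted(pairs)


# Made-up records, for illustration only.
records = [
    {"name": "Little Stars Academy", "zip": "60614"},
    {"name": "Little Stars Acad.", "zip": "60614"},
    {"name": "Bright Horizons", "zip": "60614"},
    {"name": "LITTLE STARS ACADEMY", "zip": "60601"},
]

pairs = suspect_pairs(records)
print(pairs)
```

With four records there are six possible pairs, but only the records that fall into the same block are paired. The trade-off is that a key that is too strict can miss true duplicates: here the fourth record lands in a different block because its ZIP code differs.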

It also computes a sample of the suspect duplicates and writes it in the form of tasks into a Grouping campaign on the Talend Data Stewardship server. Authorized data stewards can then intervene on the data sample and decide if suspect pairs are duplicates.

You can then use the labeled sample to compute a matching model and apply it to all suspect duplicates in the context of machine learning on Spark.

Before setting up the Job, make sure:
  • You have been assigned the Campaign Owner role in Talend Administration Center, which grants you access to the campaigns on the server.

  • You have created the Grouping campaign in Talend Data Stewardship and defined a schema that corresponds to the structure of the education centers file.

    For further information, see Adding a Grouping campaign to identify duplicate pairs.