
Computing suspect pairs and writing a sample in Talend Data Stewardship

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

Finding duplicate records is difficult and time-consuming, especially when you are dealing with a huge volume of data. In this example, tMatchPairing uses a blocking key to compute the pairs of suspect duplicates in a long list of early childhood education centers in Chicago, compiled from ten different sources.
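The idea behind a blocking key can be illustrated with a minimal Python sketch. This is not how tMatchPairing is implemented; it only shows the general technique of grouping records by a key and comparing records within each group instead of across the whole list. The field names (`name`, `zip`), the key definition and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the name plus the ZIP code.
    return record["name"].strip().lower()[:3] + "|" + record["zip"]


def suspect_pairs(records):
    # Group records into blocks that share the same blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # Only records inside the same block are paired, which avoids
    # comparing every record with every other record.
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


# Made-up records, for illustration only.
centers = [
    {"id": 1, "name": "Little Stars Academy", "zip": "60614"},
    {"id": 2, "name": "Little Stars Acad.", "zip": "60614"},
    {"id": 3, "name": "Bright Horizons", "zip": "60614"},
    {"id": 4, "name": "Little Learners", "zip": "60622"},
]

pairs = suspect_pairs(centers)
pair_ids = [(a["id"], b["id"]) for a, b in pairs]
print(pair_ids)
```

Here only records 1 and 2 share a block, so they form the single suspect pair. A looser key produces more candidate pairs and a stricter key produces fewer, which is the trade-off to keep in mind when choosing the blocking key.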

It also computes a sample of the suspect duplicates and writes it as tasks into a Grouping campaign in Talend Data Stewardship. Authorized data stewards can then review the data sample and decide whether the suspect pairs are duplicates.
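As a rough illustration of drawing a sample from the suspect pairs, the following Python sketch uses uniform random sampling. This is an assumption made for the example: the sampling strategy that tMatchPairing actually applies is not described here, and the function name, pair identifiers and seed are hypothetical.

```python
import random


def sample_pairs(pairs, sample_size, seed=42):
    # Uniform random sample without replacement (an assumption, not
    # necessarily the strategy used by tMatchPairing).
    rng = random.Random(seed)
    k = min(sample_size, len(pairs))
    return rng.sample(pairs, k)


# Made-up suspect pairs, identified by record IDs.
suspect = [(1, 2), (3, 7), (4, 9), (5, 6), (8, 10)]

sample = sample_pairs(suspect, 3)
print(sample)
```

Only the sampled pairs would need to be labeled by data stewards; the remaining suspect pairs are left for the matching model to classify later.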

You can then use the labeled sample to compute a matching model and apply it to all suspect duplicates in the context of machine learning on Spark.

To replicate the example described below, download the tmatchpairing_load_suspect_pairs_in_tds.zip file.

Before setting up the Job, make sure:
  • You have been assigned the Campaign Owner role in Talend Administration Center, which grants you access to the campaigns on the server.

  • You have created the Grouping campaign in Talend Data Stewardship and defined a schema that corresponds to the structure of the education centers file.
