For more technologies supported by Talend, see Talend components.
Finding duplicate records is hard and time-consuming, especially when you are dealing with huge volumes of data. In this example, tMatchPairing uses a blocking key to compute the pairs of suspect duplicates in a long list of early childhood education centers in Chicago, compiled from ten different sources.
It also computes a sample of the suspect duplicates and writes it as tasks into a Grouping campaign in Talend Data Stewardship. Authorized data stewards can then review the data sample and decide whether suspect pairs are duplicates.
You can then use the labeled sample to compute a matching model and apply it to all suspect duplicates in the context of machine learning on Spark.
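To illustrate the general idea of a blocking key, here is a minimal Python sketch. It is not how tMatchPairing is implemented: the key definition (ZIP code plus the first letter of the name), the field names, and the sample records are all hypothetical and chosen only for illustration. Records that share a key fall into the same block, and only records within a block are compared, which avoids comparing every record with every other record.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: ZIP code plus the lowercased first letter of the name.
    return (record["zip"], record["name"].strip().lower()[:1])


def suspect_pairs(records):
    # Group records into blocks that share the same blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # Only records inside the same block are paired as suspect duplicates.
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


# Made-up sample records, not taken from the Chicago data set.
centers = [
    {"id": 1, "name": "Little Stars Learning Center", "zip": "60608"},
    {"id": 2, "name": "Little Stars Learning Ctr", "zip": "60608"},
    {"id": 3, "name": "Bright Futures Academy", "zip": "60608"},
    {"id": 4, "name": "Little Stars Learning Center", "zip": "60614"},
]

pair_ids = [(a["id"], b["id"]) for a, b in suspect_pairs(centers)]
print(pair_ids)
```

With these four records, only records 1 and 2 share a key, so a single suspect pair is produced instead of the six pairs a full comparison would require. A stricter key yields fewer pairs but risks missing true duplicates, such as record 4 here if the ZIP code was entered incorrectly.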
To replicate the example described below, retrieve the tmatchpairing_load_suspect_pairs_in_tds.zip file from the Downloads tab in the left pane of this help page.
- You have been assigned, in Talend Administration Center, the Campaign Owner role, which grants you access to the campaigns on the server.
- You have created the Grouping campaign in Talend Data Stewardship and defined a schema that corresponds to the structure of the education centers file.