For more technologies supported by Talend, see Talend components.
Finding duplicate records is hard and time-consuming, especially when you are dealing with large volumes of data. In this example, tMatchPairing uses a blocking key to compute the pairs of suspect duplicates in a long list of early childhood education centers in Chicago that comes from ten different sources.
It also computes a sample of the suspect duplicates and writes it, in the form of tasks, into a Grouping campaign on the Talend Data Stewardship server. Authorized data stewards can then review the data sample and decide whether the suspect pairs are duplicates.
You can then use the labeled sample to compute a matching model and apply it to all suspect duplicates in the context of machine learning on Spark.
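The idea of using a blocking key to compute suspect pairs can be illustrated with a minimal Python sketch. This is not the implementation used by tMatchPairing, whose internals are not described here. The record fields, the sample data, and the choice of key (the first three letters of the center name) are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the normalized center name.
    return record["name"].strip().lower()[:3]


def suspect_pairs(records):
    # Group records into blocks that share the same blocking key.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    # Only records inside the same block are compared, which avoids
    # pairing every record with every other record in the list.
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


# Made-up sample data, not taken from the Chicago education centers file.
centers = [
    {"id": 1, "name": "Little Stars Academy", "source": "A"},
    {"id": 2, "name": "Little Stars Acad.", "source": "B"},
    {"id": 3, "name": "Bright Horizons", "source": "A"},
    {"id": 4, "name": "Bright Horizon Center", "source": "C"},
    {"id": 5, "name": "Sunshine Daycare", "source": "B"},
]

pairs = suspect_pairs(centers)
pair_ids = [(a["id"], b["id"]) for a, b in pairs]
print(pair_ids)
```

With five records, comparing every record against every other would produce ten pairs. Blocking on the key reduces this to the pairs that share a block, which is why the approach scales to large lists.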
- You have been assigned the Campaign Owner role in Talend Administration Center, which grants you access to the campaigns on the server.
- You have created the Grouping campaign in Talend Data Stewardship and defined
the schema which corresponds to the structure of the education centers file.
For further information, see Adding a Grouping campaign to identify duplicate pairs.