Computing suspect pairs and suspect sample from source data

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

In this example, tMatchPairing uses a blocking key to compute the pairs of suspect duplicates in a list of early childhood education centers in Chicago.

The use case described here uses:

  • a tFileInputDelimited component to read the source file, which contains a list of early childhood education centers in Chicago coming from ten different sources;

  • a tMatchPairing component to pre-analyze the data, compute pairs of suspect duplicates, and generate a pairing model that the tMatchPredict component can then use;

  • three tFileOutputDelimited components to output the suspect duplicates, a sample of suspect pairs and the unique records; and

  • a tLogRow component to output the exact duplicates.
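
To make the idea of pairing with a blocking key concrete, here is a minimal Python sketch of the general technique. It is not Talend's implementation and does not reproduce how tMatchPairing works internally. The field names (`name`, `address`), the sample records, and the blocking key (the first three characters of the name) are all assumptions made for illustration. The sketch separates exact duplicates first, groups the remaining records by blocking key, and treats records that share a block as suspect pairs and records that are alone in their block as unique.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the name, lower-cased.
    return record["name"].strip().lower()[:3]


def pair_suspects(records):
    # Exact duplicates: records identical to an earlier one on every field.
    seen = set()
    distinct, exact_duplicates = [], []
    for rec in records:
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint in seen:
            exact_duplicates.append(rec)
        else:
            seen.add(fingerprint)
            distinct.append(rec)

    # Group the remaining records by blocking key.
    blocks = defaultdict(list)
    for rec in distinct:
        blocks[blocking_key(rec)].append(rec)

    # Records sharing a block form suspect pairs; lone records are unique.
    suspect_pairs, unique_records = [], []
    for block in blocks.values():
        if len(block) == 1:
            unique_records.append(block[0])
        else:
            suspect_pairs.extend(combinations(block, 2))
    return suspect_pairs, unique_records, exact_duplicates


# Made-up sample data, not taken from the Chicago source file.
centers = [
    {"name": "ABC Learning Center", "address": "100 W Main St"},
    {"name": "ABC Learning Ctr", "address": "100 West Main Street"},
    {"name": "ABC Learning Center", "address": "100 W Main St"},
    {"name": "Little Stars Academy", "address": "5 N Lake Ave"},
]

suspect_pairs, unique_records, exact_duplicates = pair_suspects(centers)
```

Comparing records only within a block avoids comparing every record with every other record, which is the usual reason for choosing a blocking key on large inputs. The three returned lists correspond loosely to the suspect pairs, unique records, and exact duplicates that the Job writes out.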
