tMatchPairing

Enables you to compute pairs of suspect duplicates from any source data, including large volumes, in the context of machine learning on Spark.

This component reads a data set row by row, excludes unique rows and exact duplicates and writes them to separate files, computes pairs of suspect records based on a blocking key definition, and creates a sample of suspect records that is representative of the data set.
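The general idea of pairing records through a blocking key can be illustrated with a minimal Python sketch. This is not the component's implementation and does not use Spark. The function names `blocking_key` and `pair_suspects`, the key definition (first three letters of the last name plus the postal code) and the sample rows are all hypothetical. The sketch also assumes that a row occurring more than once is an exact duplicate and that a row alone in its block is unique. The sampling step is omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def blocking_key(row):
    """Hypothetical blocking key: first 3 letters of the last name + postal code."""
    _first, last, postal_code = row
    return (last[:3].lower(), postal_code)


def pair_suspects(rows, key_fn=blocking_key):
    """Split rows into unique rows, exact duplicates and suspect pairs.

    Assumptions made for this illustration only:
    - a row that occurs more than once is an exact duplicate and is set aside;
    - a remaining row that is alone in its block is unique;
    - all remaining rows sharing a block are paired with each other.
    """
    counts = Counter(rows)
    exact_duplicates = [row for row, n in counts.items() if n > 1]
    candidates = [row for row, n in counts.items() if n == 1]

    # Group the remaining rows by blocking key.
    blocks = defaultdict(list)
    for row in candidates:
        blocks[key_fn(row)].append(row)

    unique_rows = []
    suspect_pairs = []
    for members in blocks.values():
        if len(members) == 1:
            unique_rows.append(members[0])
        else:
            # Every combination of two rows in the same block is a suspect pair.
            suspect_pairs.extend(combinations(members, 2))
    return unique_rows, exact_duplicates, suspect_pairs


rows = [
    ("John", "Smith", "75001"),
    ("Jon", "Smith", "75001"),
    ("Mary", "Jones", "10001"),
    ("Ann", "Lee", "94105"),
    ("Ann", "Lee", "94105"),
]
unique_rows, exact_duplicates, suspect_pairs = pair_suspects(rows)
```

With these sample rows, the two "Smith" records share a blocking key and form one suspect pair, the "Jones" record is alone in its block, and the repeated "Lee" record is set aside as an exact duplicate. Blocking keeps the number of comparisons manageable on large data sets because records are only paired within the same block.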

You can label suspect pairs manually or load them into a Grouping campaign that is already defined in Talend Data Stewardship.

This component runs with Apache Spark 1.6.0 and later versions.
