tMatchPairing

Enables you to compute pairs of suspect duplicates from any source data, including large volumes, in the context of machine learning on Spark.

This component reads a data set row by row, writes unique rows and exact duplicates to separate files, computes pairs of suspect records based on a blocking key definition, and creates a sample of suspect records that is representative of the data set.
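The general idea of pairing records by blocking key can be illustrated with a minimal plain-Python sketch. This is not the component's implementation and does not use Spark. The function names (`blocking_key`, `pair_suspects`), the record fields and the key definition are all hypothetical, and the sketch assumes that exact duplicates are set aside before pairing. The sampling step is not shown.

```python
from collections import Counter, defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first letter of the last name plus the zip code.
    return (record["last_name"][:1].upper(), record["zip"])


def pair_suspects(records):
    # Count identical rows; a row seen more than once is an exact duplicate.
    counts = Counter(tuple(sorted(r.items())) for r in records)
    exact_duplicates = [dict(row) for row, n in counts.items() if n > 1]
    candidates = [dict(row) for row, n in counts.items() if n == 1]

    # Group the remaining rows by blocking key (assumption: exact
    # duplicates are excluded from pairing).
    blocks = defaultdict(list)
    for record in candidates:
        blocks[blocking_key(record)].append(record)

    unique_rows = []
    suspect_pairs = []
    for rows in blocks.values():
        if len(rows) == 1:
            # Nothing else shares this key, so the row is treated as unique.
            unique_rows.append(rows[0])
        else:
            # Every pair inside a block is a suspect pair.
            suspect_pairs.extend(combinations(rows, 2))
    return unique_rows, exact_duplicates, suspect_pairs


records = [
    {"first_name": "John", "last_name": "Smith", "zip": "10001"},
    {"first_name": "Jon", "last_name": "Smyth", "zip": "10001"},
    {"first_name": "Ann", "last_name": "Lee", "zip": "94105"},
    {"first_name": "Bob", "last_name": "Ray", "zip": "60601"},
    {"first_name": "Bob", "last_name": "Ray", "zip": "60601"},
]
unique_rows, exact_duplicates, suspect_pairs = pair_suspects(records)
```

With this sample data, the two "Smith"/"Smyth" rows share a blocking key and form one suspect pair, the repeated "Ray" row is reported as an exact duplicate, and the "Lee" row is unique. Comparing records only within a block avoids comparing every row against every other row, which is what makes the approach practical on large volumes.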

You can label suspect pairs manually or load them into a Grouping campaign that is already defined in Talend Data Stewardship.

In local mode, Apache Spark 2.4.0 and later versions are supported.

This component is not shipped with your Talend Studio by default. You need to install it using the Feature Manager. For more information, see Installing features using the Feature Manager.
