Computes pairs of suspect duplicates from any source data, including large volumes, in the context of machine learning on Spark.
This component reads a data set row by row, writes unique rows and exact duplicates to separate output files, computes pairs of suspect records based on a blocking key definition, and creates a sample of suspect records that is representative of the data set.
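The blocking-key step described above can be sketched in plain Python (not the component's actual Spark implementation; the record fields and the key function are illustrative assumptions). Records that share a blocking key are grouped into a block, and suspect pairs are generated only within each block, which avoids comparing every record against every other:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical records; field names are assumptions for illustration.
records = [
    {"id": 1, "name": "John Smith", "city": "Paris"},
    {"id": 2, "name": "Jon Smith", "city": "Paris"},
    {"id": 3, "name": "Jane Doe", "city": "Lyon"},
    {"id": 4, "name": "John Smith", "city": "Paris"},
]

def blocking_key(rec):
    # Example blocking key: first letter of the name plus the city.
    return (rec["name"][0].upper(), rec["city"])

# Group records into blocks keyed by the blocking key.
blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

# Suspect pairs: all pairs of records within the same block.
suspect_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(suspect_pairs)  # → [(1, 2), (1, 4), (2, 4)]
```

Record 3 shares a block with no other record, so it produces no suspect pairs; in the component, such rows would land in the unique-rows output.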
You can label suspect pairs manually, or load them into a Grouping campaign already defined in Talend Data Stewardship.
This component runs with Apache Spark 1.6.0 and later versions.
For more technologies supported by Talend, see Talend components.