Enables you to compute pairs of suspect duplicate records from any source data, including large volumes, in the context of machine learning on Spark.
This component reads a data set row by row, filters out unique rows and exact duplicates into separate output files, computes pairs of suspect records based on a blocking key definition, and creates a sample of suspect pairs that is representative of the data set.
You can label the suspect pairs manually or load them into a Grouping campaign already defined on the Talend Data Stewardship server.
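For illustration only, the following Scala sketch shows one way a blocking-key pairing step could look on Spark: records sharing a blocking key are grouped into blocks, singleton blocks are treated as unique rows, and every unordered pair inside a larger block becomes a candidate suspect pair. The field names and the blocking rule (first letters of the last name plus the city) are assumptions made for the example, not the component's actual configuration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of blocking-key pairing; not the component's implementation.
object BlockingPairingSketch {
  case class Record(id: Long, lastName: String, city: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pairing-sketch").getOrCreate()
    import spark.implicits._

    // Toy input data (assumed schema).
    val records = Seq(
      Record(1L, "Smith", "Paris"),
      Record(2L, "Smyth", "Paris"),
      Record(3L, "Smith", "Paris"),
      Record(4L, "Jones", "Lyon")
    ).toDS()

    // Assumed blocking key: first three letters of the last name plus the city.
    val keyed = records.rdd.keyBy(r => (r.lastName.take(3).toUpperCase, r.city))

    // Group records that share a blocking key; blocks of size 1 are unique rows.
    val blocks = keyed.groupByKey()
    val uniqueRows = blocks.filter(_._2.size == 1).flatMap(_._2)

    // Within each block, emit every unordered pair as a candidate suspect pair.
    val suspectPairs = blocks
      .filter(_._2.size > 1)
      .flatMap { case (_, rs) =>
        rs.toSeq.combinations(2).map { case Seq(a, b) => (a, b) }
      }

    suspectPairs.collect().foreach(println)
    uniqueRows.collect().foreach(println)

    spark.stop()
  }
}
```

In the actual component, the blocking key definition, the handling of exact duplicates, and the sampling of suspect pairs are configured through the component properties rather than written by hand.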
Supported Spark versions and distributions:
- Spark 1.6: CDH5.7, CDH5.8, HDP2.4.0, HDP2.5.0, MapR5.2.0, EMR4.5.0, EMR4.6.0.
- Spark 2.0: EMR5.0.0.
For more technologies supported by Talend, see Talend components.