tMatchPairing - 7.0

Matching with machine learning

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Data Quality components > Matching components > Matching with machine learning components
Data Quality and Preparation > Third-party systems > Data Quality components > Matching components > Matching with machine learning components
Design and Development > Third-party systems > Data Quality components > Matching components > Matching with machine learning components
EnrichPlatform
Talend Data Stewardship
Talend Studio

Enables you to compute pairs of suspect duplicates from any source data including large volumes in the context of machine learning on Spark.

This component reads a data set row by row, excludes unique rows and exact duplicates in separate files, computes pairs of suspect records based on a blocking key definition and creates a sample of suspect records representative of the data set.

You can label suspect pairs manually or load them into a Grouping campaign which is already defined on the Talend Data Stewardship server.

This component can run only with the following Hadoop distributions with Spark 1.6+ and Spark 2.0:
  • Spark 1.6: CDH5.7, CDH5.8, HDP2.4.0, HDP2.5.0, MapR5.2.0, EMR4.5.0, EMR4.6.0.

  • Spark 2.0: EMR5.0.0.

For more technologies supported by Talend, see Talend components.