Matching on Spark - 7.0

Matching with machine learning

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Data Stewardship
Talend Studio

Matching on Spark applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.

Using Talend Studio, you can match very high volumes of data using machine learning on Spark. This feature helps you match a very large number of records with minimal human intervention.

Machine learning with Spark usually involves two phases: the first phase computes a model (i.e. teaches the machine) based on historical data and mathematical heuristics, and the second phase applies the model to new data. In the Studio, the first phase is implemented by two Jobs, one with the tMatchPairing component and the other with the tMatchModel component. The second phase is implemented by a third Job with the tMatchPredict component.

Two workflows are possible when matching on Spark with the Studio.

In the first workflow, tMatchPairing:
  • computes pairs of suspect records based on a blocking key definition,

  • creates a sample of suspect records representative of the data set,

  • can optionally write this sample of suspect records into a Grouping campaign defined on the Talend Data Stewardship server,

  • separates unique records from exact match records,

  • generates a pairing model to be used with tMatchPredict.
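The pairing steps listed above can be pictured with a small plain-Python sketch. This is only a conceptual illustration, not the algorithm tMatchPairing runs on Spark: the record fields (`id`, `first_name`, `last_name`, `zip`), the `blocking_key` rule and the `pair_suspects` helper are all hypothetical names chosen for the example. Sampling, the Talend Data Stewardship campaign and the pairing model are not represented.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the last name plus ZIP code.
    return (record["last_name"][:3].lower(), record["zip"])


def pair_suspects(records):
    """Split records into exact duplicates, suspect pairs and unique records."""
    # 1. Set aside exact duplicates (identical on every compared field).
    first_seen = {}
    exact_duplicates = []
    for rec in records:
        fingerprint = (rec["first_name"], rec["last_name"], rec["zip"])
        if fingerprint in first_seen:
            exact_duplicates.append((first_seen[fingerprint]["id"], rec["id"]))
        else:
            first_seen[fingerprint] = rec

    # 2. Group the remaining records by blocking key.
    blocks = defaultdict(list)
    for rec in first_seen.values():
        blocks[blocking_key(rec)].append(rec)

    # 3. Records alone in their block are unique; the others form suspect pairs.
    suspect_pairs, unique = [], []
    for block in blocks.values():
        if len(block) == 1:
            unique.append(block[0]["id"])
        else:
            suspect_pairs.extend((a["id"], b["id"]) for a, b in combinations(block, 2))
    return exact_duplicates, suspect_pairs, unique
```

Only records that share a blocking key are compared with each other, which is what keeps the number of candidate pairs manageable on large data sets.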

You can then label the sample suspect records manually, either by resolving tasks in a Grouping campaign defined on the Talend Data Stewardship server, which is the recommended method, or by editing the files directly.

Next, in the second Job, you can pass the sample suspect records you labeled to tMatchModel, which:
  • computes similarities between the records in each suspect pair,

  • trains a classification model based on the Random Forest algorithm.
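As a rough illustration of the first step above, the sketch below turns one suspect pair into a vector of per-field similarity scores. It assumes string fields and uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the measures actually used by tMatchModel are not stated here, and the `similarity` and `pair_features` helpers are hypothetical. Training the classifier is not shown: such feature vectors, together with the labels assigned to each pair, would be the input to a Random Forest trainer.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Case-insensitive similarity score between 0.0 (different) and 1.0 (identical).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(left, right, fields):
    # One similarity score per compared field, in the order given by `fields`.
    return [similarity(left[f], right[f]) for f in fields]


# Example: features for one suspect pair (field names are assumptions).
left = {"first_name": "John", "last_name": "Smith"}
right = {"first_name": "Jon", "last_name": "smith"}
features = pair_features(left, right, ["first_name", "last_name"])
```

Each labeled suspect pair then becomes one training row: its feature vector plus the label set during the labeling step.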

tMatchPredict labels suspect records automatically and groups suspect records which match the label(s) set in the component properties.

In the second workflow, tMatchPredict applies the pairing model generated by tMatchPairing and the matching model generated by tMatchModel directly to the new data set, and:
  • labels suspect records automatically,

  • groups suspect records which match the label(s) set in the component properties,

  • separates the exact duplicates from unique records.
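The grouping step above can be sketched as follows, under two assumptions that this page does not confirm: that pairs carrying a selected label are merged transitively (if A matches B and B matches C, all three end up in one group), and that labels are plain strings such as the hypothetical "YES" and "NO" used here. The `group_matches` helper is illustrative only and uses a simple union-find structure.

```python
def group_matches(labeled_pairs, keep_labels=("YES",)):
    """Group record ids linked by pairs whose label is in keep_labels."""
    parent = {}

    def find(x):
        # Return the representative of x's group, compressing the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, label in labeled_pairs:
        if label in keep_labels:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(group) for group in groups.values())
```

For instance, with the labeled pairs `(1, 2, "YES")`, `(2, 3, "YES")` and `(4, 5, "NO")`, records 1, 2 and 3 fall into one group, while 4 and 5 are left ungrouped.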