Matching on Spark - Cloud - 8.0

Data matching with Talend tools

Version: Cloud 8.0
Language: English
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-02-06

Matching on Spark is available only with a subscription-based Talend Platform solution with Big Data or with Talend Data Fabric.

Using Talend Studio, you can match very high volumes of data using machine learning on Spark. This feature helps you match a very large number of records with minimal human intervention.

Machine learning with Spark usually involves two phases: the first phase computes a model (that is, teaches the machine) based on historical data and mathematical heuristics, and the second phase applies the model to new data. In Talend Studio, the first phase is implemented by two Jobs, one with the tMatchPairing component and the other with the tMatchModel component. The second phase is implemented by a third Job with the tMatchPredict component.

Two workflows are possible when matching on Spark with Talend Studio.

In the first workflow, tMatchPairing:
  • Computes pairs of suspect records based on a blocking key definition.

  • Creates a sample of suspect records representative of the dataset.

  • Can optionally write this sample of suspect records into a Grouping campaign defined on the Talend Data Stewardship server.

  • Separates unique records from exact match records.

  • Generates a pairing model to be used with tMatchPredict.
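The pairing steps listed above can be illustrated with a minimal Python sketch. This is not the tMatchPairing implementation: the sample records, the field names and the `blocking_key` function are hypothetical, and the sketch assumes that exact matches are records whose compared fields are identical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "name": "John Smith", "city": "Paris"},
    {"id": 2, "name": "Jon Smith", "city": "Paris"},
    {"id": 3, "name": "John Smith", "city": "Paris"},
    {"id": 4, "name": "Maria Lopez", "city": "Madrid"},
]


def blocking_key(record):
    """Hypothetical blocking key: first letter of the name plus the city."""
    return (record["name"][0].upper(), record["city"].upper())


def pair_records(rows):
    """Split rows into exact duplicates, suspect pairs and unique records."""
    # Step 1: records whose compared fields are identical are exact duplicates.
    seen = {}
    exact_duplicates = []
    distinct = []
    for row in rows:
        signature = (row["name"], row["city"])
        if signature in seen:
            exact_duplicates.append((seen[signature], row["id"]))
        else:
            seen[signature] = row["id"]
            distinct.append(row)

    # Step 2: group the remaining records by blocking key.
    blocks = defaultdict(list)
    for row in distinct:
        blocks[blocking_key(row)].append(row["id"])

    # Step 3: every pair inside a block is a suspect pair;
    # a record that is alone in its block is treated as unique.
    suspect_pairs = []
    unique = []
    for ids in blocks.values():
        if len(ids) == 1:
            unique.extend(ids)
        else:
            suspect_pairs.extend(combinations(ids, 2))
    return exact_duplicates, suspect_pairs, unique


exact_duplicates, suspect_pairs, unique = pair_records(records)
```

Comparing records only within a block keeps the number of candidate pairs manageable on large datasets, which is the purpose of a blocking key.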

You can then manually label the sample suspect records by resolving tasks in a Grouping campaign defined on the Talend Data Stewardship server, which is the recommended method, or by editing the files manually.

Next, in the second Job, you can pass the sample suspect records you labeled to tMatchModel, which:
  • Computes similarities between the records in each suspect pair.

  • Trains a classification model based on the Random Forest algorithm.
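As a rough illustration of the similarity step, the following Python sketch turns labeled suspect pairs into feature vectors. It is not how tMatchModel computes similarities: the string measure (`difflib.SequenceMatcher`), the field names and the "YES"/"NO" labels are assumptions chosen for the example. The classifier training itself is not shown.

```python
from difflib import SequenceMatcher

# Hypothetical labeled suspect pairs; field names and labels are assumptions.
labeled_pairs = [
    ({"name": "John Smith", "city": "Paris"},
     {"name": "Jon Smith", "city": "Paris"}, "YES"),
    ({"name": "John Smith", "city": "Paris"},
     {"name": "Jane Smyth", "city": "Lyon"}, "NO"),
]


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def to_features(left, right, fields=("name", "city")):
    """Turn one suspect pair into a vector of per-field similarities."""
    return [similarity(left[f], right[f]) for f in fields]


# One feature vector and one label per suspect pair: this is the shape of
# training data that a classifier such as a Random Forest would consume.
features = [to_features(left, right) for left, right, _ in labeled_pairs]
labels = [label for _, _, label in labeled_pairs]
```

Each suspect pair becomes one row of numeric similarities plus its label, so the classifier learns which combinations of field similarities correspond to a match.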

Finally, tMatchPredict labels suspect records automatically and groups the suspect records that match the labels set in the component properties.

In the second workflow, tMatchPredict applies the pairing model generated by tMatchPairing and the matching model generated by tMatchModel directly to the new dataset, and:
  • Labels suspect records automatically.

  • Groups suspect records which match the labels set in the component properties.

  • Separates the exact duplicates from unique records.
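The grouping step can be pictured with a small Python sketch. It assumes that records linked by pairs carrying the chosen label end up in the same group, including through chains of pairs; whether tMatchPredict groups records exactly this way is an assumption. The `group_matches` function, the record identifiers and the "YES"/"NO" labels are hypothetical.

```python
def group_matches(record_ids, predicted_pairs, match_label="YES"):
    """Group records linked by pairs that carry the requested label."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links up to the group representative (path halving).
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    # Merge the groups of both records for every pair with the match label.
    for left, right, label in predicted_pairs:
        if label == match_label:
            parent[find(right)] = find(left)

    # Collect the members of each group under its representative.
    groups = {}
    for rid in record_ids:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(groups.values())


# Hypothetical predictions: (left id, right id, predicted label).
predictions = [(1, 2, "YES"), (2, 3, "YES"), (4, 5, "NO")]
groups = group_matches([1, 2, 3, 4, 5], predictions)
```

In this sketch, records 1, 2 and 3 form one group because they are connected by pairs labeled "YES", while records 4 and 5 each remain in a group of their own.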