Matching on Spark

Matching on Spark is available only in subscription-based Talend Platform products with Big Data and in Talend Data Fabric.

Using Talend Studio, you can match very high volumes of data using machine learning on Spark. This feature helps you match a very large number of records with minimal human intervention.

Machine learning with Spark usually has two phases: the first phase computes a model (that is, it teaches the machine) based on historical data and mathematical heuristics, and the second phase applies the model to new data. In Talend Studio, the first phase is implemented by two Jobs, one with the tMatchPairing component and the other with the tMatchModel component. The second phase is implemented by a third Job with the tMatchPredict component.

Two workflows are possible when matching on Spark with Talend Studio.

In the first workflow, tMatchPairing:

Computes pairs of suspect records based on a blocking key definition.
Creates a sample of suspect records representative of the dataset.
Can optionally write this sample of suspect records into a Grouping campaign defined on the Talend Data Stewardship server.
Separates unique records from exact match records.
Generates a pairing model to be used with tMatchPredict.
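
To make the pairing steps above more concrete, here is a minimal conceptual sketch in plain Python. It is not Talend code and does not reflect how tMatchPairing is implemented: the sample records, the field names, and the helper functions blocking_key, content and pair_records are all hypothetical, and the blocking key (first letter of the name plus the city) is only an assumption chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "John Smith", "city": "Paris"},
    {"id": 2, "name": "Jon Smith", "city": "Paris"},
    {"id": 3, "name": "John Smith", "city": "Paris"},  # same content as record 1
    {"id": 4, "name": "Mary Jones", "city": "London"},
]


def blocking_key(record):
    # Assumed blocking key: first letter of the name plus the city, lowercased.
    return (record["name"][:1] + "|" + record["city"]).lower()


def content(record):
    # Everything except the identifier, used to detect exact matches.
    return tuple(sorted((k, v) for k, v in record.items() if k != "id"))


def pair_records(records):
    # Group records with identical content: these are the exact matches.
    by_content = defaultdict(list)
    for r in records:
        by_content[content(r)].append(r["id"])
    exact_matches = [ids for ids in by_content.values() if len(ids) > 1]

    # Keep one representative per distinct content before blocking.
    representatives = {}
    for r in records:
        representatives.setdefault(content(r), r)

    # Records that share a blocking key become suspect pairs.
    blocks = defaultdict(list)
    for r in representatives.values():
        blocks[blocking_key(r)].append(r["id"])
    suspect_pairs = []
    for ids in blocks.values():
        suspect_pairs.extend(combinations(sorted(ids), 2))

    # Records that are neither paired nor exact matches are unique.
    paired = {i for pair in suspect_pairs for i in pair}
    duplicated = {i for ids in exact_matches for i in ids}
    unique = sorted(
        r["id"] for r in records
        if r["id"] not in paired and r["id"] not in duplicated
    )
    return suspect_pairs, exact_matches, unique


suspect_pairs, exact_matches, unique = pair_records(records)
print(suspect_pairs, exact_matches, unique)
```

Only records that share a blocking key are compared, which is what keeps the number of candidate pairs manageable on large datasets. Sampling the suspect pairs and generating the pairing model are not shown here.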

You can then manually label the sample suspect records by resolving tasks in a Grouping campaign defined on the Talend Data Stewardship server, which is the recommended method, or by editing the files manually.

Next, in the second Job, you can feed the sample suspect records you labeled to tMatchModel, which:

Computes similarities between the records in each suspect pair.
Trains a classification model based on the Random Forest algorithm.
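
As a rough illustration of the first of these two steps, the sketch below turns labeled suspect pairs into similarity feature vectors using Python's standard difflib module. It is an assumption-laden example, not the tMatchModel implementation: the similarity measure, the field names, the "YES"/"NO" labels, and the helpers similarity and pair_features are all hypothetical. Training the Random Forest classifier on the resulting rows is deliberately left out.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Case-insensitive string similarity in [0, 1]; 1.0 means identical.
    # SequenceMatcher is a stand-in for whatever measures the component uses.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(left, right, fields):
    # One similarity score per compared field.
    return [similarity(left[f], right[f]) for f in fields]


# Hypothetical labeled suspect pairs: (record A, record B, label).
labeled_pairs = [
    ({"name": "John Smith", "city": "Paris"},
     {"name": "Jon Smith", "city": "Paris"}, "YES"),
    ({"name": "John Smith", "city": "Paris"},
     {"name": "Jane Smyth", "city": "Parma"}, "NO"),
]
fields = ["name", "city"]

# Each row is (feature vector, label), the shape a classifier trains on.
training_rows = [
    (pair_features(left, right, fields), label)
    for left, right, label in labeled_pairs
]
for features, label in training_rows:
    print([round(score, 2) for score in features], label)
```

A classifier trained on rows like these learns which combinations of field similarities tend to correspond to a match label.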

tMatchPredict labels suspect records automatically and groups suspect records which match the labels set in the component properties.

In the second workflow, tMatchPredict directly applies the pairing model generated by tMatchPairing and the matching model generated by tMatchModel to the new dataset, and:

Labels suspect records automatically.
Groups suspect records which match the labels set in the component properties.
Separates the exact duplicates from unique records.
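
The grouping step can be pictured with the following conceptual sketch in plain Python. It assumes that pairs predicted with a matching label are merged transitively into groups, which is a simplification for illustration and not a description of how tMatchPredict works internally. The function group_matches, the "YES" label, and the sample predictions are hypothetical.

```python
def group_matches(predictions, match_label="YES"):
    # predictions: iterable of (record_id_a, record_id_b, predicted_label).
    parent = {}

    def find(x):
        # Return the group representative of x, compressing paths on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the groups of a and b, keeping the smaller id as representative.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for a, b, label in predictions:
        # Register both records even when the pair is not a match.
        find(a)
        find(b)
        if label == match_label:
            union(a, b)

    groups = {}
    for record_id in parent:
        groups.setdefault(find(record_id), []).append(record_id)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical predictions for three suspect pairs.
predictions = [(1, 2, "YES"), (2, 5, "YES"), (6, 7, "NO")]
groups = group_matches(predictions)
print(groups)
```

In this example, records 1, 2 and 5 end up in the same group because 1 matches 2 and 2 matches 5, while records 6 and 7 stay in separate groups.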
