The machine learning approach is useful when you want to match a very high volume of data.
The data matching process can be automated by making a model learn and predict matches.
The advantages of the machine learning approach over the classical approach are the following:
- The different blocking mechanism permits faster and more scalable computation. In the machine learning approach, blocking is not partitioning: a record can belong to different blocks and the size of the block is clearly delimited, which may not be the case with the tGenKey component.
- The rules learnt and stored by the machine learning model can be much more complex and less arbitrary than human-designed matching rules.
- Configuring components is simpler. The machine learning model automatically learns matching distances and similarity thresholds, among other things.
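The difference between blocking and partitioning can be illustrated with a minimal Python sketch. It assumes a simple suffix-based key scheme chosen only for illustration; the function names and the `min_len` and `max_block_size` parameters are hypothetical and do not describe how tMatchPairing or tGenKey work internally.

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(name, min_len=3):
    """Return every suffix of the normalized name with at least min_len characters."""
    s = "".join(ch for ch in name.lower() if ch.isalnum())
    return {s[i:] for i in range(len(s) - min_len + 1)}

def build_blocks(records, max_block_size=3):
    """Group record ids by shared key and keep only blocks of a bounded size."""
    blocks = defaultdict(set)
    for rec_id, name in records.items():
        for key in blocking_keys(name):
            blocks[key].add(rec_id)  # one record can land in several blocks
    return {k: ids for k, ids in blocks.items() if 2 <= len(ids) <= max_block_size}

def candidate_pairs(blocks):
    """Collect every pair of records that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Illustrative data only
records = {1: "Jon Smith", 2: "John Smith", 3: "Joan Smyth", 4: "Ann Lee"}
blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
```

In this sketch, records 1 and 2 share several blocks, so they are compared, while records 3 and 4 are never compared with anything. With partitioning, each record would instead be assigned to exactly one group.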
The first step consists of pre-analyzing a data set using the tMatchPairing component. tMatchPairing outputs unique records, exact match records, suspect match pairs and a sample of the suspect match pairs.
For examples of how to compute suspect pairs, see Scenario 1: Computing suspect pairs and writing a sample in Talend Data Stewardship and Scenario 2: Computing suspect pairs and suspect sample from source data.
The second step consists of labeling the suspect match pairs from the sample as "match" or "no-match" manually. You can leverage Talend Data Stewardship to make the labeling task easier.
For further information about how to add a Grouping campaign to identify duplicates in a data sample in Talend Data Stewardship, see Adding a Grouping campaign to identify duplicate pairs.
In Talend Data Stewardship, grouping tasks allow authorized data stewards to validate a relationship between pairs or groups of records. The outcome of a grouping task is the list of records associated with each other.
You can use more than two classes, for example “match”, “potential match” and “different”.
For further information on grouping tasks in Talend Data Stewardship, see Handling grouping tasks to decide on relationship among pairs of records.
The third step consists of submitting the suspect match pairs you labeled to the tMatchModel component for learning and outputting a classifier model.
For examples of how to generate a matching model, see Scenario 1: Generating a matching model from a Grouping campaign and Scenario 2: Generating a matching model.
What is a good sample?
The sample should be well-balanced: the number of records in each class, "match" and "no-match", should be approximately the same. An imbalanced data sample yields an unsatisfactory model.
The sample should be diverse: the more diverse the examples in the sample are, the more effective the rules learnt by the model will be.
The sample should be the right size: if you have a large data set with millions of records, then a few hundred or a few thousand examples may be enough. If your data set contains fewer than 10,000 records, then the sample size should be between 1% and 10% of the full data set.
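The balance and size criteria can be checked before training with a short Python sketch. The helper names and the `tolerance` value are assumptions made for illustration; the text above only says that the classes should be "approximately the same" size and gives no rule for data sets between 10,000 records and millions of records.

```python
from collections import Counter

def class_balance(labels):
    """Return the share of each label in the sample."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def is_balanced(labels, tolerance=0.1):
    """True if at least two classes exist and their shares differ by at most tolerance.

    The tolerance is an assumed threshold, not a documented value.
    """
    shares = list(class_balance(labels).values())
    if len(shares) < 2:
        return False
    return max(shares) - min(shares) <= tolerance

def suggested_sample_range(dataset_size):
    """Return (min, max) sample size for data sets with fewer than 10,000 records.

    Returns None for larger data sets, where no percentage rule is stated.
    """
    if dataset_size >= 10_000:
        return None
    return (max(1, dataset_size // 100), max(1, dataset_size // 10))

labels = ["match"] * 48 + ["no-match"] * 52
balanced = is_balanced(labels)
size_range = suggested_sample_range(5000)
```

For a data set of 5,000 records, this gives a sample of 50 to 500 labeled pairs.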
How does tMatchModel generate a model?
The machine learning algorithm computes different measures, which are called features, to get as much information as possible on the defined columns.
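To give an idea of what such features can look like, here is a minimal Python sketch that turns a pair of records into numeric measures, column by column. The three measures are assumptions chosen for illustration; the features that tMatchModel actually computes are not listed here, and `pair_features` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def pair_features(rec_a, rec_b, columns):
    """Compute simple comparison features for each selected column of a record pair."""
    features = {}
    for col in columns:
        a = rec_a[col].strip().lower()
        b = rec_b[col].strip().lower()
        # 1.0 when the normalized values are identical, 0.0 otherwise
        features[f"{col}_exact"] = 1.0 if a == b else 0.0
        # Similarity ratio between 0.0 and 1.0 from the standard library
        features[f"{col}_similarity"] = SequenceMatcher(None, a, b).ratio()
        # Absolute difference in length between the two values
        features[f"{col}_len_diff"] = float(abs(len(a) - len(b)))
    return features

# Illustrative records only
left = {"name": "Jon Smith", "city": "Paris"}
right = {"name": "John Smith", "city": "paris"}
features = pair_features(left, right, ["name", "city"])
```

Each suspect pair becomes one row of numbers, which is the form a classification algorithm expects as input.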
To generate the model, tMatchModel analyzes the data using the Random Forest algorithm. A random forest is a collection of decision trees used to solve a classification problem. In a decision tree, each node corresponds to a question about the features associated with the input data. A random forest grows many decision trees to improve the accuracy of the classification and to generate a model.
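The voting idea behind a random forest can be sketched in plain Python. This is a deliberately simplified stand-in, not the implementation used by tMatchModel: each "tree" here asks a single question (one threshold on one randomly chosen feature), whereas a real random forest grows deeper trees. All names, parameters and data are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] < threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] >= threshold]
        right_label = Counter(right).most_common(1)[0][0]
        left_label = Counter(left).most_common(1)[0][0] if left else right_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each one-question tree on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Let every tree vote and return the majority label."""
    votes = Counter(left if row[f] < t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Illustrative feature rows: [name_similarity, city_similarity]
rows = [[0.95, 0.90], [0.88, 1.00], [0.92, 0.85], [0.97, 0.95],
        [0.30, 0.20], [0.45, 0.10], [0.25, 0.40], [0.50, 0.30]]
labels = ["match"] * 4 + ["no-match"] * 4
forest = train_forest(rows, labels)
high = predict(forest, [0.99, 1.00])
low = predict(forest, [0.10, 0.05])
```

Because each tree sees a different bootstrap sample and a different feature, individual trees can disagree; the majority vote is what makes the ensemble more robust than any single tree.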
The fourth step consists of labeling suspect pairs for large data sets automatically using the model computed by tMatchModel with the tMatchPredict component.
For an example of how to predict labels on suspect pairs, see Scenario: Labeling suspect pairs with assigned labels.
For further information on the machine learning approach, see Matching on Spark.