Improving a matching model - Cloud - 8.0

Data matching with Talend tools

Version
Cloud
8.0
Language
English
Product
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-06

You can improve a matching model by changing the settings of the tMatchModel component.

Because the results depend on your data, there are no ideal settings. The purpose of the following tests is to show that setting the parameters differently can improve the model quality.

Important: Changing the settings can also lower the model quality.
In the following examples, we use a database of childcare centers that contains the following input data:
  • The site name
  • The address
  • The source of the previous data

The reference settings are listed in the Reference setting column of the table below.

To perform these tests, the parameters were changed one at a time. If the model quality increased, the new setting was kept before the next parameter was changed. This method shows how each parameter affects the model.
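The one-at-a-time method described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `one_at_a_time_search` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a made-up scoring function that stands in for running a Job with tMatchModel and reading the resulting model quality. Unlike the manual procedure, the sketch tries every candidate value for a parameter and keeps any value that improves the score.

```python
from typing import Any, Callable, Dict, List, Tuple

Settings = Dict[str, Any]


def one_at_a_time_search(
    reference: Settings,
    candidates: Dict[str, List[Any]],
    evaluate: Callable[[Settings], float],
) -> Tuple[Settings, float]:
    """Change one parameter at a time and keep a change only if quality improves."""
    best = dict(reference)
    best_quality = evaluate(best)
    for name, values in candidates.items():
        for value in values:
            if value == best[name]:
                continue  # nothing to test, this is the current setting
            trial = dict(best)
            trial[name] = value
            quality = evaluate(trial)
            if quality > best_quality:
                # The new setting is kept before moving on.
                best, best_quality = trial, quality
    return best, best_quality


def toy_evaluate(settings: Settings) -> float:
    """Hypothetical stand-in for a real Job run; the scores are invented."""
    score = 0.0
    if settings["impurity"] == "entropy":
        score += 1.0
    if settings["max_bins"] == 79:
        score += 1.0
    if settings["min_instances_per_node"] == 1:
        score += 1.0
    return score


# Parameter names are illustrative keys, not tMatchModel identifiers.
reference = {"impurity": "gini", "max_bins": 32, "min_instances_per_node": 1}
candidates = {
    "impurity": ["entropy"],
    "max_bins": [15, 79],
    "min_instances_per_node": [3, 10],
}

best_settings, best_quality = one_at_a_time_search(reference, candidates, toy_evaluate)
```

In practice, `evaluate` would be replaced by whatever procedure you use to run the Job and record the model quality. Because each parameter is tuned in sequence, the result can depend on the order in which the parameters are tested.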

Only the component settings were changed between tests. The matching keys stayed the same (Address and Site name) because, as tested in Analyzing the heat map, changing the matching key also impacts the model quality.

For more information on the parameters, see their description in the tMatchModel properties.

After running multiple Jobs, the highest model quality obtained was 0.942.

The following table shows the settings that have been tested:
| Parameters | Reference setting | Tested settings | The model quality is better when set to |
| --- | --- | --- | --- |
| Number of trees range (1) | 5 to 15 | 5 to 20, 5 to 30, 5 to 50, 5 to 100 | 5 to 30, 5 to 50 or 5 to 100 |
| Subsampling Rate | 1.0 | 0.5 | 1.0 |
| Impurity | Gini | Entropy | Entropy |
| Max Bins | 32 | 15 and 79 | 79 |
| Subset strategy | auto | All (auto, all, sqrt and log2) | auto |
| Min Instances per Node | 1 | 3 and 10 | 1 |

(1) The larger the range of the hyperparameters (number of trees and tree depth), the longer the Job takes to run.
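To make the Impurity row more concrete, the following Python sketch computes the two impurity measures as they are commonly defined in decision-tree learning: Gini impurity, 1 - Σ p², and entropy, -Σ p·log2(p), where p is the share of each class among the records in a tree node. The function names are hypothetical, and the sketch illustrates the general definitions only, not the component's internal code.

```python
import math
from typing import Sequence


def gini_impurity(class_counts: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def entropy_impurity(class_counts: Sequence[int]) -> float:
    """Entropy of a node in bits; classes with no records contribute nothing."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0
    )


# A node with 5 matching and 5 non-matching pairs is maximally mixed.
mixed_gini = gini_impurity([5, 5])        # 0.5
mixed_entropy = entropy_impurity([5, 5])  # 1.0

# A node that contains a single class is pure under both measures.
pure_gini = gini_impurity([10, 0])
pure_entropy = entropy_impurity([10, 0])
```

Both measures are zero for a pure node and largest for an evenly mixed one. They rank candidate splits slightly differently, which is one reason why switching between them can change the model quality.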

Note that the Evaluation metric type parameter was not changed. It remained set to F1. Because each evaluation metric type is calculated differently, model quality values obtained with different metric types cannot be compared, so changing this setting is not relevant in these examples.
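As an illustration of why scores from different metric types cannot be compared, the sketch below computes the standard binary F1 score next to plain accuracy for the same confusion counts. The counts are invented, the function names are hypothetical, and accuracy appears only as a generic contrast. How the component computes F1 internally (for example, whether it averages over labels) is not stated here, so treat this as the textbook definition rather than the component's exact calculation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard binary F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all pairs that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Invented counts: 8 true matches found, 2 false matches, 2 missed matches,
# and 88 pairs correctly labeled as non-matches.
example_f1 = f1_score(tp=8, fp=2, fn=2)
example_accuracy = accuracy(tp=8, tn=88, fp=2, fn=2)
```

With these counts, F1 is about 0.8 while accuracy is 0.96 for the same predictions, so a quality value is only meaningful next to other values computed with the same metric type.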

During the tests, no single setting raised the model quality from 0.917 to 0.942, but the combination of the different settings did.

The preceding results apply to a specific database. Depending on your database, changing the settings as above may not have the same impact. The purpose is to show that, even if the model quality is already satisfactory, you can try other settings to improve the matching model.