Tuning hyper-parameters and using K-fold cross-validation to improve the matching model - Cloud - 8.0

Data matching with Talend tools

Version
Cloud
8.0
Language
English
Product
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-06

Testing the model using the K-fold cross-validation technique

The K-fold cross-validation technique estimates how well the model will perform on an independent dataset.

To test the model, the dataset is split into k subsets and the Random forest algorithm is run k times:

  • At each iteration, one of the k subsets is retained as the validation set and the remaining k-1 subsets are the training set.
  • A score for each of the k runs is computed and then the scores obtained are averaged to calculate a global score.
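
The two steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code Talend Studio runs: the function names are hypothetical, the rows are split in their original order without shuffling, and train_and_score stands for any callable that trains a model on the training rows and returns a score for the validation rows.

```python
from statistics import mean


def k_fold_splits(n_rows, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Spread the rows as evenly as possible: the first n_rows % k folds
    # receive one extra row.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validation_score(n_rows, k, train_and_score):
    """Run the k iterations and average the k scores into a global score.

    train_and_score is a hypothetical callable: it receives the training
    and validation row indices and returns the score of that run.
    """
    scores = [
        train_and_score(train, validation)
        for train, validation in k_fold_splits(n_rows, k)
    ]
    return mean(scores)
```

Each row is used for validation exactly once and for training k-1 times, which is why the averaged score is less dependent on one particular split than a single train/validation cut.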

Tuning the Random forest algorithm hyper-parameters using grid search

You can specify values for the two Random forest algorithm hyper-parameters:

  • The number of decision trees
  • The maximum depth of a decision tree

To tune the hyper-parameters and improve the quality of the model, grid search builds one model for each combination of values of the two Random forest algorithm hyper-parameters, within the limits you specify.

For example:

  • The number of trees ranges from 5 to 50 with a step of 5.
  • The tree depth ranges from 5 to 10 with a step of 1.

In this example, there will be 60 different combinations (10 × 6).
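
The count can be checked by enumerating the grid. The short Python sketch below assumes that both bounds of each range are inclusive, as the example implies; the variable names are illustrative only.

```python
from itertools import product

# Both bounds are inclusive, so the stop value of range() is one step past the maximum.
tree_counts = range(5, 51, 5)   # 5, 10, ..., 50: 10 values
max_depths = range(5, 11, 1)    # 5, 6, ..., 10: 6 values

# One (number of trees, maximum depth) pair per model to build.
grid = list(product(tree_counts, max_depths))

print(len(tree_counts), len(max_depths), len(grid))  # 10 6 60
```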

Only the combination of the two hyper-parameter values that produces the best model is retained. The score used to compare the models is the one reported by the K-fold cross-validation.
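
The selection step can be sketched as follows. This is a minimal illustration, not the Talend implementation: grid_search and cv_score are hypothetical names, cv_score stands for whatever returns the averaged K-fold score of a model trained with the given hyper-parameters, and the sketch assumes that a higher score means a better model.

```python
from itertools import product


def grid_search(tree_counts, max_depths, cv_score):
    """Return the best (number of trees, maximum depth) pair and its score.

    cv_score(n_trees, max_depth) is a hypothetical callable returning the
    averaged K-fold cross-validation score for that combination.
    Assumption: a higher score means a better model.
    """
    best_params, best_score = None, float("-inf")
    for n_trees, max_depth in product(tree_counts, max_depths):
        score = cv_score(n_trees, max_depth)
        # Keep only the best combination seen so far; on a tie, the
        # earlier combination is kept.
        if score > best_score:
            best_params, best_score = (n_trees, max_depth), score
    return best_params, best_score
```

In practice, cv_score would train a Random forest model k times per combination, so the 60 combinations of the example with k folds correspond to 60 × k training runs.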