Tuning hyper-parameters and using K-fold cross-validation to improve the matching model

Matching with machine learning from an algorithmic standpoint


Testing the model using the K-fold cross-validation technique

The K-fold cross-validation technique assesses how well the model will perform on an independent dataset.

To test the model, the dataset is split into k subsets and the Random forest algorithm is run k times (see the sketch after this list):

  • At each iteration, one of the k subsets is retained as the validation set and the remaining k-1 subsets are used as the training set.

  • A score is computed for each of the k runs, and the k scores are then averaged to produce a global score.
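
As a minimal sketch of the technique (not Talend Studio's internal implementation), the following Python example uses scikit-learn to run a 5-fold cross-validation of a Random forest classifier; the synthetic dataset and parameter values are illustrative placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Placeholder dataset standing in for the labeled matching pairs.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

model = RandomForestClassifier(n_estimators=20, max_depth=8, random_state=42)

# k = 5: each subset serves once as the validation set while the
# remaining k-1 subsets form the training set.
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=k_fold)

# The k scores are averaged to produce the global score.
print(scores.mean())
```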

Tuning the Random forest algorithm hyper-parameters using grid search

You can specify values for the two Random forest algorithm hyper-parameters:

  • The number of decision trees

  • The maximum depth of a decision tree

To improve the quality of the model, grid search tunes the hyper-parameters by building one model for each combination of the two hyper-parameter values within the limits you specify.

For example:

  • The number of trees ranges from 5 to 50 with a step of 5; and

  • the tree depth goes from 5 to 10 with a step of 1.

In this example, grid search evaluates 60 different combinations (10 tree counts × 6 depth values).
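
You can verify this count with a few lines of Python; the ranges below mirror the example above.

```python
from itertools import product

num_trees = range(5, 55, 5)   # 5, 10, ..., 50 -> 10 values
tree_depth = range(5, 11)     # 5, 6, ..., 10  -> 6 values

# One model is built per (number of trees, depth) pair.
combinations = list(product(num_trees, tree_depth))
print(len(combinations))  # 60
```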

Only the combination of the two hyper-parameter values that trains the best model is retained, as measured by the K-fold cross-validation score.
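
As an illustrative sketch (again using scikit-learn rather than Talend Studio's implementation), grid search combined with K-fold cross-validation can be expressed as follows; only the best-scoring combination is kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 10 tree counts x 6 depths = 60 candidate models.
param_grid = {
    "n_estimators": list(range(5, 55, 5)),
    "max_depth": list(range(5, 11)),
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

# Each candidate is scored by K-fold cross-validation; only the
# best combination of hyper-parameter values is retained.
print(search.best_params_, search.best_score_)
```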