Extracting matching features using tMatchModel

Matching with machine learning from an algorithmic standpoint

author
Talend Documentation Team
EnrichVersion
6.4
EnrichProdName
Talend Data Fabric
Talend Big Data Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio
Use the labeled sample of suspect duplicate pairs as the input to the tMatchModel component.

You must specify the set of columns on which the model is built and the column that holds the label. The algorithm computes several measures, called features, to capture as much information as possible from this set of columns.
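To make the idea of features more concrete, the following minimal Python sketch turns one suspect pair into a set of numeric measures. It is only an illustration: the function name pair_features, the sample records, and the three measures chosen here (exact match, string similarity, length difference) are assumptions and not the features tMatchModel actually computes.

```python
from difflib import SequenceMatcher


def pair_features(record_a, record_b, columns):
    """Compute illustrative comparison measures for one pair of records.

    The measures are hypothetical examples, not the component's real features.
    """
    features = {}
    for col in columns:
        # Normalize the two values so that case and padding do not matter.
        a = (record_a.get(col) or "").strip().lower()
        b = (record_b.get(col) or "").strip().lower()
        # 1.0 when the normalized values are identical, 0.0 otherwise.
        features[f"{col}_exact"] = 1.0 if a == b else 0.0
        # Similarity ratio between 0.0 and 1.0 from the standard library.
        features[f"{col}_similarity"] = SequenceMatcher(None, a, b).ratio()
        # Absolute difference in the number of characters.
        features[f"{col}_length_diff"] = float(abs(len(a) - len(b)))
    return features


# Hypothetical suspect pair taken from a labeled sample.
pair = (
    {"name": "John Smith", "city": "Paris"},
    {"name": "Jon Smith", "city": "Paris"},
)
features = pair_features(pair[0], pair[1], ["name", "city"])
```

Each labeled pair then becomes one row of numbers plus its label, which is the shape of input a classification model needs.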

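The tMatchModel component uses the Random forest algorithm to build the model. A random forest is an ensemble of decision trees: each tree is trained on a random sample of the data, and the predictions of all trees are combined by majority vote.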
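The sketch below shows, in plain Python, how a random forest can classify suspect pairs from their features. It is a toy version written for illustration under stated assumptions, not the implementation used by tMatchModel: the function names, the YES/NO labels, the two-feature training rows, and the parameter values (25 trees, depth 3) are all hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, rng, max_depth=3):
    """Grow one decision tree; leaves are labels, inner nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    n_features = len(rows[0])
    # Each split only looks at a random subset of the features.
    candidates = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    best = None
    for f in candidates:
        for threshold in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            if not left or not right:
                continue
            # Weighted impurity of the two children; lower is better.
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:
        return majority
    _, f, threshold, left, right = best
    return (
        f,
        threshold,
        build_tree([rows[i] for i in left], [labels[i] for i in left], rng, max_depth - 1),
        build_tree([rows[i] for i in right], [labels[i] for i in right], rng, max_depth - 1),
    )


def predict_tree(node, row):
    """Walk down one tree until a leaf (a plain label) is reached."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Combine the trees by majority vote."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical training rows: [name_similarity, city_exact] with a label.
rows = [[0.95, 1.0], [0.90, 1.0], [0.92, 1.0], [0.88, 1.0],
        [0.30, 0.0], [0.20, 0.0], [0.35, 0.0], [0.10, 0.0]]
labels = ["YES", "YES", "YES", "YES", "NO", "NO", "NO", "NO"]

forest = train_forest(rows, labels)
match_prediction = predict_forest(forest, [0.97, 1.0])
no_match_prediction = predict_forest(forest, [0.05, 0.0])
```

Training every tree on a different random sample and limiting each split to a random subset of features makes the trees differ from one another, so that the vote is less sensitive to the quirks of any single tree.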