Generating the matching model - 7.0

Matching with machine learning

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Data Stewardship
Talend Studio

Procedure

  1. Double-click tMatchModel to display the Basic settings view and define the component properties.
  2. In the Matching Key table, click the [+] button to add rows to the table, then select the columns on which you want to base the match computation.
    The Original_Id column is ignored in the computation of the matching model.
  3. Select the Save the model on file system check box and, in the Folder field, set the path to the local folder where you want to generate the matching model file.
  4. Select the Integration with Data Stewardship check box and set the connection parameters to the Talend Data Stewardship server.
    1. In the URL field, enter the address of the application suffixed with /data-stewardship/, for example http://localhost:19999/data-stewardship/.

      If you are working with Talend Cloud Data Stewardship, use one of the following addresses to access the application:

      • https://tds.us.cloud.talend.com/data-stewardship for the US data center.
      • https://tds.eu.cloud.talend.com/data-stewardship for the EU data center.
    2. Enter your login credentials for the server in the Username and Password fields.
      To enter your password, click the [...] button next to the Password field, enter your password between double quotes in the dialog box that opens, and click OK.
    3. Click Find a campaign to open a dialog box that lists the campaigns defined in Talend Data Stewardship that you own or have access rights to.
    4. Select the campaign from which to read the grouping tasks, Sites deduplication in this example, and click OK.
  5. Click Advanced settings and set the following parameters:
    1. In the corresponding field, set the maximum number of tokens to be used in the phonetic comparison.
    2. In the Random Forest hyper-parameters tuning area, enter the ranges for the number of decision trees you want to build and for their depth.
      These parameters have a direct impact on the accuracy of the model.
    3. Leave the other parameters at their default values.
  6. In the Batch Size field, set the number of tasks you want to have in each commit.
    There is no limit on the batch size in Talend Data Stewardship (on premises). However, do not exceed 200 tasks per commit in Talend Cloud Data Stewardship; otherwise, the Job fails.
  7. Press F6 to execute the Job and generate the matching model in the output folder.
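
The effect of the Batch Size setting in step 6 can be illustrated with a short sketch. This is not the component's internal code: the function names check_batch_size and split_into_commits are hypothetical, and the only values taken from the procedure are the 200-task limit for Talend Cloud Data Stewardship and the absence of a limit on premises. The sketch only shows how a list of tasks would be divided into commits of a given size.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Limit quoted in the procedure for Talend Cloud Data Stewardship.
CLOUD_MAX_BATCH_SIZE = 200


def check_batch_size(batch_size: int, cloud: bool) -> int:
    """Return batch_size if it is acceptable, otherwise raise ValueError.

    Hypothetical helper: on premises any positive size is accepted,
    in the cloud the size must not exceed CLOUD_MAX_BATCH_SIZE.
    """
    if batch_size < 1:
        raise ValueError("batch size must be a positive integer")
    if cloud and batch_size > CLOUD_MAX_BATCH_SIZE:
        raise ValueError(
            f"batch size {batch_size} exceeds the cloud limit "
            f"of {CLOUD_MAX_BATCH_SIZE} tasks per commit"
        )
    return batch_size


def split_into_commits(tasks: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size tasks."""
    for start in range(0, len(tasks), batch_size):
        yield list(tasks[start:start + batch_size])


if __name__ == "__main__":
    # 450 placeholder tasks, committed 200 at a time against a cloud instance.
    tasks = [f"task-{i}" for i in range(450)]
    size = check_batch_size(200, cloud=True)
    commits = list(split_into_commits(tasks, size))
    print([len(c) for c in commits])  # prints [200, 200, 50]
```

With a batch size of 200, the 450 tasks are sent in three commits; asking for 201 with cloud=True raises an error in this sketch, which mirrors the Job failure described in step 6.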

Results

You can now use this model with the tMatchPredict component to label all the duplicates computed by tMatchPairing.

For further information, see Labeling suspect pairs with assigned labels.