Defining the match rule - 7.0

Data matching

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Data Quality components > Matching components > Data matching components
Data Quality and Preparation > Third-party systems > Data Quality components > Matching components > Data matching components
Design and Development > Third-party systems > Data Quality components > Matching components > Data matching components
EnrichPlatform
Talend Studio

Procedure

  1. In the tMatchGroup basic settings, click Preview to open the configuration wizard and define the matching key and the survivorship function.
    You can use the configuration wizard to import match rules created and tested in the studio and stored in the repository, and use them in your match Jobs. For further information, see Importing match rules from the studio repository.
    It is important to have the same type of the matching algorithm selected in the basic settings of the component and defined in the configuration wizard. Otherwise the Job runs with default values for the parameters which are not compatible between the two algorithms.
  2. Define the match rule as the following:
    • In the Key definition table, click the [+] button to add a line in the table. Click in the Input Key Attribute column and select the column on which you want to do the matching operation, first_name in this scenario.

    • Click in the Matching Function column and select Soundex from the list. This method matches processed entries according to a standard English phonetic algorithm which indexes strings by sound, as pronounced in English.

    • From the Tokenized measure list, select not to use a tokenized distance for the selected algorithm.

    • Set the Threshold to 0.8 and the Confidence Weight to 1.

    • Select Null Match None in the Handle Null column in order to have matching results where null values have minimal effect.

    • Select Most common in the Survivorship Function column. This method validates the most frequent name value in each group of duplicates.

  3. Define the default survivorship rule as the following:
    • In the Default Survivorship Rules table, click the [+] button to add a line in the table. Click in the Data Type column and select Number.

    • Click in the Survivorship Function column and select Largest (for numbers) from the list. This method validates the largest numerical value in each group.

  4. Set the Hide groups of less than parameter in order to decide what groups to show in the result chart and matching table. This parameter enables you to hide groups of small group size.
  5. Click the Chart button in the wizard to execute the Job in the defined configuration and have the results directly in the wizard.
    The matching chart gives a global picture about the duplicates in the analyzed data. The matching table indicates the details of the items in each group, colors the groups in accordance with their color in the matching chart and indicates with true the records which are master records. The master record in each group is the result of merging two similar records according to the phonetic algorithm and survivorship rule. The master record is a new record that does not exist in the input data.
  6. Click OK to close the wizard.