Defining a Match Rule - 6.2

Talend MDM Platform Studio User Guide

In the Match Rule Editor, you define the different characteristics of your Match Rule.

In the Match Definition Metadata section, most of the fields are automatically populated when you create the Match Rule. You can edit any of the metadata if required, and then set the Status by selecting development, testing or production from the drop-down list.

  1. In the Record linkage algorithm section, select T-Swoosh. The Simple VSR Matcher is for use with Talend Data Quality only.

  2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a new rule, and then set the following criteria.

    • Match Key Name: Enter the name of your choice for the match key.

    • Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you want to use an external user-defined matching algorithm.

    • Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-defined algorithm.

    • Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.

    • Confidence Weight: Assign a numerical weight (between 1 and 10) to the column you want to use as a match key. This value is used to give greater or lesser importance to certain columns when performing the match.

    • Handle Null: Specify how to deal with data records which contain null values.

      • nullMatchNull: If both records contain null values, consider this a match.

      • nullMatchNone: If one record contains a null, do not consider this a match.

      • nullMatchAll: If one record contains a null, consider this a match.

    • Survivorship Function: Select how two similar records will be merged from the drop-down list.

      • Concatenate: It adds the content of the first record and the content of the second record together - for example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to be used to separate values.

      • Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the source records are False.

      • Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the source records are True.

      • Most common: It validates the most frequently-occurring field value in each duplicates group.

      • Most recent or Most ancient: The former validates the latest date value and the latter the earliest date value in each duplicates group. The relevant reference column must be of the Date type.

      • Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each duplicates group.

      • Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical value in a duplicates group.

        Warning

        Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.

      • Most trusted source: It takes the data coming from the source which has been defined as being most trustworthy. The most trusted data source is set in the Parameter field.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

  3. In the Match threshold field, enter the match probability threshold.

    Two data records match when the probability is above this value.

    In the Confident match threshold field, set a numerical value between the current Match threshold and 1.

  4. In the Default Survivorship Rules section, you define how to survive matches for certain data types: Boolean, Date, Number and String. If you do not specify the behavior for any or all data types, the default behavior is applied.

    • Click the [+] button to add a new row for each data type.

    • In the Data Type column, select the relevant data type from the drop-down list.

    • In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Note that, depending on the data type, only certain choices may be relevant.

      Warning

      Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

  5. Save your changes.
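
The settings described in this procedure can be pictured with a small Python sketch. It is not Talend's implementation: the function names (similarity, classify, survive), the use of difflib as a stand-in matching function, the confidence-weighted average, the rule that every match key must pass its own threshold, and the reading of the Confident match threshold as a "high-confidence" cut-off are all assumptions made for illustration. Only a few of the survivorship functions are covered.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b, handle_null):
    """Score two field values in [0, 1]; difflib is a stand-in matching function."""
    a_null, b_null = a is None, b is None
    if a_null or b_null:
        if handle_null == "nullMatchAll":
            return 1.0
        if handle_null == "nullMatchNull":
            # Assumption: only a null/null pair counts as a match.
            return 1.0 if a_null and b_null else 0.0
        return 0.0  # nullMatchNone
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def classify(rec1, rec2, match_keys, match_threshold, confident_threshold):
    """Combine per-key scores into one record score and label the pair."""
    if not match_keys:
        raise ValueError("at least one match key is required")
    weighted, weight_sum = 0.0, 0.0
    for key in match_keys:
        score = similarity(
            rec1.get(key["column"]), rec2.get(key["column"]), key["handle_null"]
        )
        if score < key["threshold"]:
            # Assumption: every match key must pass its own Threshold.
            return "no match"
        weighted += key["weight"] * score
        weight_sum += key["weight"]
    record_score = weighted / weight_sum  # assumption: weighted average
    if record_score >= confident_threshold:
        return "confident match"
    return "match" if record_score > match_threshold else "no match"


def survive(values, function, parameter=""):
    """Merge the values of one field across a group of duplicates."""
    if function == "Concatenate":
        return parameter.join(str(v) for v in values)  # parameter = separator
    if function == "Most common":
        return Counter(values).most_common(1)[0][0]
    if function == "Longest":
        return max(values, key=len)
    if function == "Shortest":
        return min(values, key=len)
    if function == "Largest":
        return max(values)
    if function == "Smallest":
        return min(values)
    if function == "Prefer True":
        return any(values)  # False only if every source value is False
    if function == "Prefer False":
        return all(values)  # True only if every source value is True
    raise ValueError(f"unsupported survivorship function: {function}")


if __name__ == "__main__":
    # Hypothetical match keys mirroring the fields of the Match and Survivor section.
    keys = [
        {"column": "name", "threshold": 0.7, "weight": 2, "handle_null": "nullMatchNone"},
        {"column": "city", "threshold": 0.5, "weight": 1, "handle_null": "nullMatchNull"},
    ]
    first = {"name": "Bill", "city": None}
    second = {"name": "bill", "city": None}
    print(classify(first, second, keys, match_threshold=0.8, confident_threshold=0.9))
    print(survive(["Bill", "William"], "Concatenate", parameter=" "))
```

In this sketch, changing the city key to nullMatchNone makes the same pair fail, because the null/null comparison then scores 0 and falls below that key's threshold.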