Creating a match rule - 6.3

Talend Data Fabric Studio User Guide

EnrichVersion
6.3
EnrichProdName
Talend Data Fabric
task
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

From the Profiling perspective of the studio, you can create match rules with the VSR or the T-Swoosh algorithm and save them in the studio repository. Once centralized in the repository, you can import them in the match analysis editor and test them on your data to group duplicate records. For further information about the match analysis, see Creating a match analysis.

You can also import rules defined with the VSR algorithm in the tMatchGroup configuration wizard and in other match components, including the Hadoop components, and use the rules in match Jobs. For further information, see the tMatchGroup documentation in the Talend Components Reference Guide.

The two algorithms produce different match results for two reasons:

  • First, with the VSR algorithm, the master record is simply the first input record of each group. Therefore, the list of match groups may depend on the order of the input records.

  • Second, the VSR algorithm does not change the output records, whereas the T-Swoosh algorithm creates new records.

How to define the rule

Prerequisite(s): You have selected the Profiling perspective of the studio.

  1. In the DQ Repository tree view, expand Libraries > Rules.

  2. Right-click Match and select New Match Rule.

  3. In the [New Match Rule] wizard, enter a name and set other metadata, if required.

    Note

Avoid using special characters in the item names, including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", "'", """, """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

As an example, suppose you want to create a rule to match customer full names.

  4. Click Finish.

A match rule editor opens in the studio and the new match rule is listed under Libraries > Rules > Match in the DQ Repository tree view.

    In the Record Linkage algorithm view, the Simple VSR Matcher algorithm is selected by default.

  5. Start defining the match rule items as described in Rules with the VSR algorithm and Rules with the T-Swoosh algorithm.

How to duplicate a rule

To avoid creating a match rule from scratch, you can duplicate an existing one and then modify its metadata and definition to create a new rule.

Prerequisite(s): You have selected the Profiling perspective of the studio.

To duplicate a rule, do the following:

  1. In the DQ Repository tree view, expand Libraries > Rules > Match.

  2. Browse through the match rule list to reach the rule you want to duplicate.

  3. Right-click its name and select Duplicate.

    The duplicated rule is created under the Match folder in the DQ Repository tree view.

  4. Double-click the duplicated rule to open it and modify its metadata and/or definition as needed.

Rules with the VSR algorithm

The VSR algorithm takes a set of records as input and groups duplicate records together according to the defined match rules. It compares pairs of records and assigns them to groups. The first record processed in each group is the master record of the group. The VSR algorithm compares each incoming record with the master record of each group and uses the computed distances to decide which group the record should join.

In the match analysis and in the matching components, the match results of the VSR algorithm may vary depending on the order of the input records. If possible, put the records in which you have the most confidence first in the input flow to improve the accuracy of the algorithm.

Note that matching components, including the Hadoop matching components, run only with rules configured with the VSR algorithm.
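The grouping logic described above can be sketched as follows. This is a minimal illustration, not Talend code: the similarity function and the threshold are placeholders for the match methods and Match threshold configured in the rule.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Placeholder string similarity in [0, 1]; a real rule would use
    # a method such as Levenshtein or Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()

def vsr_group(records, match_threshold=0.8):
    """Group records VSR-style: each record is compared only to the
    master (first record) of each existing group."""
    groups = []  # each group is a list; groups[i][0] is its master
    for record in records:
        for group in groups:
            if similarity(record, group[0]) >= match_threshold:
                group.append(record)
                break
        else:
            # No master matched: the record becomes master of a new group.
            groups.append([record])
    return groups

print(vsr_group(["John Smith", "Jon Smith", "Mary Jones"]))
```

Because a record is only ever compared to the masters, reordering the input can change which record becomes a master, and therefore change the resulting groups.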

Defining a blocking key

Defining a blocking key is not mandatory but advisable. A blocking key partitions the data into blocks and thus reduces the number of records to be examined, as comparisons are restricted to record pairs within each block. Using blocking keys is very useful when you are processing big data sets.

  1. In the rule editor and in the Generation of Blocking Key section, click the [+] button to add a row to the table.

  2. Set the parameters of the blocking key as follows:

    • Blocking Key Name: Enter a name for the column you want to use to reduce the number of record pairs that need to be compared.

    • Pre-algorithm: Select from the drop-down list an algorithm and set its value where necessary.

      Defining a pre-algorithm is not mandatory. This algorithm is used to clean or standardize data before processing it with the match algorithm and thus improve the results of data matching.

    • Algorithm: Select from the drop-down list the match algorithm you want to use and set its value where necessary.

    • Post-algorithm: Select from the drop-down list an algorithm and set its value where necessary.

      Defining a post-algorithm is not mandatory. This algorithm is used to clean or standardize data after processing it with the match algorithm and thus improve the outcome of data matching.

  3. Follow the same steps to add as many blocking keys as needed.

    When you import a rule with many blocking keys into the match analysis editor, only one blocking key will be generated and listed in the BLOCK_KEY column in the Data table.

    For further information about the blocking key parameters, see the tGenKey documentation in the Talend Components Reference Guide.
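The effect of a blocking key can be sketched as follows. The key function here (lowercase the name, keep its first three characters) is a hypothetical stand-in for the pre-algorithm and algorithm you would select in the wizard.

```python
from itertools import combinations

def blocking_key(record):
    # Hypothetical key: standardize (lowercase) and keep the
    # first three characters of the name.
    return record["name"].lower()[:3]

def candidate_pairs(records):
    """Group records by blocking key and yield only the pairs that
    share a block, instead of all n*(n-1)/2 possible pairs."""
    blocks = {}
    for rec in records:
        blocks.setdefault(blocking_key(rec), []).append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [{"name": "Johnson"}, {"name": "JOHNSTON"}, {"name": "Miller"}]
print(list(candidate_pairs(records)))
```

Here only the two "joh" records form a candidate pair; "Miller" sits alone in its own block and is never compared, which is what makes blocking effective on large data sets.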

Defining a matching key

  1. In the rule editor and in the Matching Key table, click the [+] button to add a row to the table.

  2. Set the parameters of the matching key as follows:

    • Match Key Name: Enter the name of your choice for the match key.

    • Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you want to use an external user-defined matching algorithm.

      In this example, two match keys are defined: the Levenshtein method is used to match first names and the Jaro-Winkler method to match last names, in order to find the duplicate records.

    • Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-defined algorithm.

    • Confidence Weight: Assign a numerical weight (between 1 and 10) to the column you want to use as a match key. This value gives greater or lesser importance to certain columns when performing the match.

    • Handle Null: Specify how to deal with data records which contain null values.

    For further information about the match rule parameters, see the tMatchGroup documentation in the Talend Components Reference Guide.

  3. In the Match threshold field, enter the match probability threshold. Two data records match when the probability is above this value.

    In the Confident match threshold field, set a numerical value between the current Match threshold and 1. Above this threshold, you can be confident about the quality of the group.

  4. To define a second match rule, place your cursor on the top right corner of the Matching Key table and then click the [+] button.

    Follow the same steps as above to define the second match rule.

    When you define multiple conditions in the match rule editor, an OR match operation is conducted on the analyzed data. Records are evaluated against the first rule and the records that match are not evaluated against the second rule.

  5. If required, put your cursor on the top right corner of the table, click the button and replace the default names of the rules with names of your choice.

    You can also use the up and down arrows in the dialog box to change the rule order and thus decide what rule to execute first.

  6. Click OK.

    The rules are named and ordered accordingly in the Matching Key table.

  7. Save the match rule settings.

    The match rule is saved and centralized under Libraries > Rules > Match in the DQ Repository tree view.

You can import and test the rule on your data in the match analysis editor. For further information, see How to import or export match rules.

You can also import the rule in the tMatchGroup configuration wizard and in other match components, including the Hadoop components, and use the rule in match Jobs. For further information, see the tMatchGroup documentation in the Talend Components Reference Guide.
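As an illustration of how weighted match keys combine into a single score, the following sketch computes a weighted average of per-column similarities and compares it to a match threshold. This is not Talend's implementation: difflib's ratio is used as a rough stand-in for Jaro-Winkler (which the standard library does not provide), and the column names, weights and threshold are invented for the example.

```python
from difflib import SequenceMatcher

def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b))

def ratio_similarity(a, b):
    # Rough stand-in for Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()

def match_score(rec1, rec2, keys):
    """Weighted average of per-key similarities, in the spirit of the
    Confidence Weight parameter."""
    total_weight = sum(w for _, w, _ in keys)
    score = sum(w * fn(rec1[k], rec2[k]) for k, w, fn in keys)
    return score / total_weight

# (column, confidence weight, matching function) - all hypothetical
keys = [("first_name", 5, levenshtein_similarity),
        ("last_name", 8, ratio_similarity)]
a = {"first_name": "Catherine", "last_name": "Miller"}
b = {"first_name": "Katherine", "last_name": "Millar"}
score = match_score(a, b, keys)
print(score, score >= 0.85)  # compare against the Match threshold
```

Raising a column's weight pulls the overall score toward that column's similarity, which is how you make a trusted column dominate the match decision.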

Rules with the T-Swoosh algorithm

You can use the T-Swoosh algorithm to find duplicates and to define how two similar records are merged into a master record, using a survivorship function. These merged records are then used to find new duplicates. The difference with the VSR algorithm is that the master record is, in general, a new record that does not exist in the list of input records.

  1. In the Record linkage algorithm section, select T-Swoosh. The Simple VSR Matcher is for use with Talend Data Quality only.

  2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a new rule, and then set the following criteria.

    • Match Key Name: Enter the name of your choice for the match key.

    • Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you want to use an external user-defined matching algorithm.

    • Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-defined algorithm.

    • Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.

    • Confidence Weight: Assign a numerical weight (between 1 and 10) to the column you want to use as a match key. This value gives greater or lesser importance to certain columns when performing the match.

    • Handle Null: Specify how to deal with data records which contain null values.

      • nullMatchNull: If both records contain null values, consider this a match.

      • nullMatchNone: If one record contains a null, do not consider this a match.

      • nullMatchAll: If one record contains a null, consider this a match.

    • Survivorship Function: Select how two similar records will be merged from the drop-down list.

      • Concatenate: It adds the content of the first record and the content of the second record together - for example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to be used to separate values.

      • Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the source records are False.

      • Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the source records are True.

      • Most common: It validates the most frequently occurring field value in each duplicates group.

      • Most recent or Most ancient: The former validates the latest date value and the latter the earliest date value in each duplicates group. The relevant reference column must be of the Date type.

      • Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each duplicates group.

      • Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical value in a duplicates group.

        Warning

        Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.

      • Most trusted source: It takes the data coming from the source which has been defined as being most trustworthy. The most trusted data source is set in the Parameter field.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

  3. In the Match threshold field, enter the match probability threshold.

    Two data records match when the probability is above this value.

    In the Confident match threshold field, set a numerical value between the current Match threshold and 1.

  4. In the Default Survivorship Rules section, you define how to survive matches for certain data types: Boolean, Date, Number and String. If you do not specify the behavior for any or all data types, the default behavior is applied.

    • Click the [+] button to add a new row for each data type.

    • In the Data Type column, select the relevant data type from the drop-down list.

    • In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Note that, depending on the data type, only certain choices may be relevant.

      Warning

      Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

  5. Save your changes.
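To make the survivorship idea concrete, here is a minimal sketch of how a T-Swoosh-style master record could be assembled from two matched records. The field names and the mapping of fields to survivorship functions are invented for illustration; Talend applies the functions you configure in the Match and Survivor table.

```python
from datetime import date

# Toy survivorship functions mirroring some of the choices above.
def longest(a, b):
    return a if len(a) >= len(b) else b

def most_recent(a, b):
    return max(a, b)

def concatenate(sep):
    return lambda a, b: f"{a}{sep}{b}"

# Hypothetical per-field configuration.
SURVIVORSHIP = {
    "name": longest,
    "last_order": most_recent,
    "nickname": concatenate(""),
}

def merge(rec1, rec2):
    """Build a T-Swoosh-style master record by applying each field's
    survivorship function to the two matched records. The result is
    generally a new record that exists in neither input."""
    return {f: SURVIVORSHIP[f](rec1[f], rec2[f]) for f in rec1}

a = {"name": "Bill Smith", "last_order": date(2016, 1, 5), "nickname": "Bill"}
b = {"name": "William Smith", "last_order": date(2016, 3, 2), "nickname": "William"}
print(merge(a, b))
```

Note how the nickname field reproduces the Concatenate example from the list above (Bill and William merge into BillWilliam), while each other field survives according to its own function.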