Creating a match key - 7.1

Talend Real-time Big Data Platform Studio User Guide

author: Talend Documentation Team
EnrichVersion: 7.1
EnrichProdName: Talend Real-Time Big Data Platform
task: Design and Development
EnrichPlatform: Talend Studio

Procedure

  1. In the Record linkage algorithm section, select T-Swoosh.
  2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a new rule, and then set the following criteria.
    • Match Key Name: Enter the name of your choice for the match key.

    • Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you want to use an external user-defined matching algorithm.

    • Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-defined algorithm.

    • Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.

    • Confidence Weight: Assign a numerical weight (between 1 and 10) to the column you want to use as a match key. This value gives greater or lesser importance to certain columns when performing the match.

    • Handle Null: Specify how to deal with data records which contain null values.
      • nullMatchNull: If both records contain null values, consider this a match.

      • nullMatchNone: If one record contains a null, do not consider this a match.

      • nullMatchAll: If one record contains a null, consider this a match.

    • Survivorship Function: Select how two similar records will be merged from the drop-down list.
      • Concatenate: It appends the content of the second record to the content of the first record. For example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to insert between the values.

      • Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the source records are False.

      • Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the source records are True.

      • Most common: It validates the most frequently-occurring field value in each duplicates group.

      • Most recent or Most ancient: The former validates the latest date value and the latter the earliest date value in each duplicates group. The relevant reference column must be of the Date type.

      • Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each duplicates group.

      • Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical value in a duplicates group.

        Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.
    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.
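The survivorship functions listed above can be read as small reduction functions over the values of one column in a duplicates group. The following Python sketch is only an illustration of those descriptions, not Talend's implementation: the function names are hypothetical, and tie-breaking behavior is an assumption because this page does not describe it.

```python
from collections import Counter
from datetime import date


def concatenate(values, separator=""):
    # Joins values in order; the separator plays the role of the Parameter field.
    return separator.join(values)


def prefer_true(values):
    # True unless every source boolean is False.
    return any(values)


def prefer_false(values):
    # False unless every source boolean is True.
    return all(values)


def most_common(values):
    # Most frequently occurring value (tie-breaking here is an assumption).
    return Counter(values).most_common(1)[0][0]


def most_recent(dates):
    return max(dates)  # latest date


def most_ancient(dates):
    return min(dates)  # earliest date


def longest(values):
    return max(values, key=len)


def shortest(values):
    return min(values, key=len)


def largest(numbers):
    return max(numbers)


def smallest(numbers):
    return min(numbers)


print(concatenate(["Bill", "William"]))        # BillWilliam
print(concatenate(["Bill", "William"], ", "))  # Bill, William
print(most_recent([date(2019, 1, 1), date(2020, 5, 1)]))
```

Most recent and Most ancient only make sense on Date values, and Largest and Smallest on numeric values, which mirrors the type requirements stated in the list.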

  3. In the Match threshold field, enter the match probability threshold.

    Two data records match when the probability is above this value.

    In the Confident match threshold field, set a numerical value between the current Match threshold and 1.
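To make the interaction between Handle Null, Confidence Weight and the two thresholds more concrete, here is a minimal Python sketch. It rests on several assumptions that this page does not confirm: the overall probability is taken to be the weighted average of the per-key scores, a generic string similarity from the standard library stands in for the selected Matching Function, the per-key Threshold is left out, and a score at or above the Confident match threshold is labeled a confident match. All function names and the dictionary layout are hypothetical.

```python
from difflib import SequenceMatcher


def key_score(a, b, handle_null="nullMatchNull"):
    """Score one match key in [0, 1], applying the Handle Null option."""
    a_null, b_null = a is None, b is None
    if a_null or b_null:
        if handle_null == "nullMatchAll":
            return 1.0
        if handle_null == "nullMatchNull":
            return 1.0 if (a_null and b_null) else 0.0
        return 0.0  # nullMatchNone
    # Stand-in for the configured Matching Function (assumption).
    return SequenceMatcher(None, a, b).ratio()


def record_score(rec_a, rec_b, match_keys):
    """Weighted average of per-key scores (assumed combination rule)."""
    total_weight = sum(k["weight"] for k in match_keys)
    weighted = sum(
        k["weight"]
        * key_score(rec_a.get(k["column"]), rec_b.get(k["column"]), k["handle_null"])
        for k in match_keys
    )
    return weighted / total_weight


def classify(score, match_threshold, confident_threshold):
    """Records match only when the score is above the match threshold."""
    if score <= match_threshold:
        return "no match"
    if score >= confident_threshold:
        return "confident match"
    return "match"


match_keys = [
    {"column": "name", "weight": 3, "handle_null": "nullMatchNone"},
    {"column": "city", "weight": 1, "handle_null": "nullMatchNone"},
]
a = {"name": "Bill", "city": None}
b = {"name": "Bill", "city": "Paris"}
score = record_score(a, b, match_keys)
print(score, classify(score, 0.7, 0.9))
```

In this sketch the heavier weight on name lets the pair pass a match threshold of 0.7 even though the null city contributes nothing under nullMatchNone.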

  4. In the Survivorship Rules For Columns section, define how data records survive for certain columns. Click the [+] button to add a new rule, and then set the following criteria:
    • Input Column: Enter the column to which you want to apply the survivorship rule.

    • Survivorship Function: Select how two similar records will be merged from the drop-down list.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator to use for concatenating data.

    If you specify the survivorship function for a match key in the Match And Survivor section and also specify the survivorship function for the match key as an input column in the Survivorship Rules For Columns section, the survivorship function selected in the Match And Survivor section is applied to the column.

  5. In the Default Survivorship Rules section, define how matching records are merged for certain data types: Boolean, Date, Number and String.
    1. Click the [+] button to add a new row for each data type.
    2. In the Data Type column, select the relevant data type from the drop-down list.
    3. In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Note that, depending on the data type, only certain choices may be relevant.
      Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.
    4. Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

    If you specify the survivorship function for a column in the Survivorship Rules For Columns section and also specify the survivorship function for the data type of the column in the Default Survivorship Rules section, the survivorship function selected in the Survivorship Rules For Columns section is applied to the column.

    If you do not specify the behavior for any or all data types, the default behavior (the Most common survivorship function) will be applied, that is, the most frequently-occurring field value in each duplicates group will be validated.
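The precedence between the three places where a survivorship function can be set, as described in the steps above, can be summarized in a short Python sketch. The function name and the dictionary-based configuration are hypothetical and serve only to illustrate the lookup order: match key rule first, then column rule, then data-type default, then Most common.

```python
def resolve_survivorship(column, data_type, match_key_rules, column_rules, default_rules):
    """Return the survivorship function that applies to a column."""
    # 1. A function set on the match key in the Match And Survivor section wins.
    if column in match_key_rules:
        return match_key_rules[column]
    # 2. Otherwise, a rule from Survivorship Rules For Columns applies.
    if column in column_rules:
        return column_rules[column]
    # 3. Otherwise, fall back to the data-type default, then to Most common.
    return default_rules.get(data_type, "Most common")


match_key_rules = {"name": "Longest"}
column_rules = {"name": "Concatenate", "updated_at": "Most recent"}
default_rules = {"Number": "Largest"}

print(resolve_survivorship("name", "String", match_key_rules, column_rules, default_rules))
print(resolve_survivorship("updated_at", "Date", match_key_rules, column_rules, default_rules))
print(resolve_survivorship("age", "Number", match_key_rules, column_rules, default_rules))
print(resolve_survivorship("active", "Boolean", match_key_rules, column_rules, default_rules))
```

With these sample rules, name resolves to Longest, updated_at to Most recent, age to Largest, and active to Most common.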

  6. Save your changes.