Defining and creating a matching key with the T-Swoosh algorithm

Procedure

First, select the columns on which to apply the match algorithm, either from the Data section by using the Select Matching Key tab or directly in the Matching Key table.

Creating a match key

Procedure

  1. In the Record linkage algorithm section, select T-Swoosh.
  2. In the Match and Survivor section, define the criteria to use when matching data records. Click the [+] button to add a new rule, and then set the following criteria:
    • Match Key Name: Enter the name of your choice for the match key.

    • Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you want to use an external user-defined matching algorithm.

    • Custom Matcher: This item is only used with the Custom matching function. Browse to and select the JAR file of the user-defined algorithm.

    • Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.

    • Confidence Weight: Assign a numerical weight (between 1 and 10) to the column you want to use as a match key. This value gives greater or lesser importance to certain columns when performing the match.

    • Handle Null: Specify how to deal with data records which contain null values.
      • nullMatchNull: If both records contain null values, consider this a match.

      • nullMatchNone: If one record contains a null, do not consider this a match.

      • nullMatchAll: If one record contains a null, consider this a match.

    • Survivorship Function: Select how two similar records will be merged from the drop-down list.
      • Concatenate: It joins the content of the first record and the content of the second record.

        For example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to place between the values.

      • Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the source records are False.

      • Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the source records are True.

      • Most common: It validates the most frequently-occurring field value in each duplicates group.

      • Most recent or Most ancient: The former validates the latest date value and the latter the earliest date value in each duplicates group. The relevant Reference column must be of the Date type.

      • Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each duplicates group.

      • Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical value in a duplicates group.

        Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.
      • Most trusted source: It takes the data coming from the source which has been defined as being most trustworthy. The most trusted data source is set in the Parameter field.

    • Reference column: If you set Survivorship Function to Most recent or Most ancient, this item is used to select the reference column.
    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

  3. In the Match threshold field, enter the match probability threshold.

    Two data records match when the probability is above this value.

    In the Confident match threshold field, set a numerical value between the current Match threshold and 1.

  4. In the Survivorship Rules For Columns section, define how data records survive for certain columns. Click the [+] button to add a new rule, and then set the following criteria:
    • Input Column: Enter the column to which you want to apply the survivorship rule.

    • Survivorship Function: Select how two similar records will be merged from the drop-down list.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator to use for concatenating data.

    If you specify the survivorship function for a match key in the Match And Survivor section and also specify the survivorship function for the match key as an input column in the Survivorship Rules For Columns section, the survivorship function selected in the Match And Survivor section is applied to the column.

  5. In the Default Survivorship Rules section, define how matched records survive for each data type: Boolean, Date, Number and String.
    1. Click the [+] button to add a new row for each data type.
    2. In the Data Type column, select the relevant data type from the drop-down list.
    3. In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Depending on the data type, only certain choices may be relevant.
      Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.
    4. In the Parameter column, for the Most trusted source survivorship function, set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, specify the separator you want to use for concatenating data.

    If you specify the survivorship function for a column in the Survivorship Rules For Columns section and also specify the survivorship function for the data type of the column in the Default Survivorship Rules section, the survivorship function selected in the Survivorship Rules For Columns is applied to the column.

    If you do not specify the behavior for any or all data types, the default behavior (the Most common survivorship function) will be applied, that is, the most frequently-occurring field value in each duplicates group will be validated.

  6. Save your changes.
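
To make the roles of Threshold, Confidence Weight, Handle Null, Match threshold and Confident match threshold more concrete, here is a minimal Python sketch. It is not the product's implementation. The similarity measure (Python's difflib), the weighted-average combination, the rule that a key scoring below its own Threshold rejects the pair, the boundary handling (>=) and the "confident match" label are all assumptions made for illustration. The names similarity, classify and the dictionary layout are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b, handle_null="nullMatchNull"):
    """Return a score in [0, 1] for two values, applying a Handle Null option."""
    if a is None or b is None:
        if handle_null == "nullMatchAll":
            return 1.0  # any null counts as a match
        if handle_null == "nullMatchNull":
            return 1.0 if a is None and b is None else 0.0
        return 0.0  # nullMatchNone: a null never matches
    # Stand-in comparison; the real Matching Function is selected in the editor.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def classify(rec_a, rec_b, keys, match_threshold=0.8, confident_threshold=0.9):
    """Score a pair of records and label it (hypothetical combination rule)."""
    total_weight = sum(key["weight"] for key in keys)
    weighted_sum = 0.0
    for key in keys:
        score = similarity(
            rec_a.get(key["column"]),
            rec_b.get(key["column"]),
            key.get("handle_null", "nullMatchNull"),
        )
        # Assumption: a key scoring below its own Threshold rejects the pair.
        if score < key["threshold"]:
            return 0.0, "no match"
        weighted_sum += key["weight"] * score
    # Assumption: the match probability is the weight-averaged key score.
    probability = weighted_sum / total_weight
    if probability >= confident_threshold:
        return probability, "confident match"
    if probability >= match_threshold:
        return probability, "match"
    return probability, "no match"


keys = [
    {"column": "name", "threshold": 0.8, "weight": 2},
    {"column": "city", "threshold": 0.8, "weight": 1, "handle_null": "nullMatchAll"},
]
a = {"name": "William", "city": "Paris"}
b = {"name": "William", "city": None}
print(classify(a, b, keys))  # (1.0, 'confident match')
```

In this sketch, switching the city key to nullMatchNone makes the same pair fail, because the null city then scores 0 and falls below the key's Threshold.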

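The precedence between the three survivorship sections can also be sketched in code. The following Python example is an illustration only: it models the rule order described in the procedure (Match And Survivor first, then Survivorship Rules For Columns, then Default Survivorship Rules, then the Most common fallback) with simplified versions of a few survivorship functions. The names FUNCTIONS, pick_rule and survive, and the way rules are stored as (function, parameter) pairs, are hypothetical and do not reflect the product's internals.

```python
from collections import Counter

# Simplified versions of a few survivorship functions. Each takes the values
# of one duplicates group plus the optional Parameter field.
FUNCTIONS = {
    "Concatenate": lambda values, sep: (sep or "").join(values),
    "Most common": lambda values, _: Counter(values).most_common(1)[0][0],
    "Longest": lambda values, _: max(values, key=len),
    "Shortest": lambda values, _: min(values, key=len),
    "Largest": lambda values, _: max(values),
    "Smallest": lambda values, _: min(values),
    "Prefer True": lambda values, _: any(values),   # True unless all are False
    "Prefer False": lambda values, _: all(values),  # False unless all are True
}


def pick_rule(column, data_type, match_key_rules, column_rules, default_rules):
    """Return (function name, parameter) following the described precedence."""
    if column in match_key_rules:   # Match And Survivor section wins
        return match_key_rules[column]
    if column in column_rules:      # then Survivorship Rules For Columns
        return column_rules[column]
    if data_type in default_rules:  # then Default Survivorship Rules
        return default_rules[data_type]
    return ("Most common", None)    # fallback when nothing is specified


def survive(column, data_type, values, match_key_rules, column_rules, default_rules):
    """Merge the values of one column within a duplicates group."""
    name, parameter = pick_rule(
        column, data_type, match_key_rules, column_rules, default_rules
    )
    return FUNCTIONS[name](values, parameter)


match_key_rules = {"name": ("Concatenate", "")}
column_rules = {"name": ("Longest", None), "age": ("Largest", None)}
default_rules = {"Boolean": ("Prefer True", None)}
rules = (match_key_rules, column_rules, default_rules)

print(survive("name", "String", ["Bill", "William"], *rules))       # BillWilliam
print(survive("age", "Number", [31, 32], *rules))                   # 32
print(survive("active", "Boolean", [False, True], *rules))          # True
print(survive("city", "String", ["Paris", "Paris", "Lyon"], *rules))  # Paris
```

The name column has a rule in both of the first two sections, so the Match And Survivor rule (Concatenate) is applied and the Longest rule is ignored.
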