
Rules with the VSR algorithm

The VSR algorithm takes a set of records as input and groups the duplicate records it encounters according to the defined match rules.

This algorithm compares pairs of records and assigns them to groups. The first record processed in each group becomes the master record of that group. The order of the records therefore matters and affects which records become masters.

The VSR algorithm compares each record with the master of each group and uses the distances computed from the master records to decide which group the record should join.

In the match analysis and matching components, the matching results of the VSR algorithm may vary depending on the order of the input records. If possible, place the records you trust most first in the input flow to improve the accuracy of the algorithm.
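As a rough illustration of this grouping logic, here is a minimal Java sketch, assuming a hypothetical similarity function that returns a score between 0 and 1. It is not the Talend implementation; it only shows how each record is compared with the existing group masters in input order, joining the closest group above a threshold or starting a new group otherwise.

import java.util.ArrayList;
import java.util.List;

public class VsrGroupingSketch {

    // Hypothetical similarity function returning a score between 0 and 1.
    static double similarity(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // Groups records by comparing each one with the master (first record) of
    // every existing group, in input order.
    static List<List<String>> group(List<String> records, double matchThreshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String record : records) {
            List<String> bestGroup = null;
            double bestScore = 0.0;
            for (List<String> candidate : groups) {
                String master = candidate.get(0); // the first record of a group is its master
                double score = similarity(record, master);
                if (score >= matchThreshold && score > bestScore) {
                    bestScore = score;
                    bestGroup = candidate;
                }
            }
            if (bestGroup != null) {
                bestGroup.add(record);            // join the closest matching group
            } else {
                List<String> newGroup = new ArrayList<>();
                newGroup.add(record);             // this record becomes the master of a new group
                groups.add(newGroup);
            }
        }
        return groups;
    }
}

Because records are only ever compared with group masters, the records placed first in the flow are the ones most likely to become masters.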

Note that matching components, including the Hadoop matching components, run only with rules configured with the VSR algorithm.

You can import and test the rule on your data in the match analysis editor. For further information, see Importing match rules from the repository.

You can also import the rule in the tMatchGroup configuration wizard and in other match components, including the Hadoop components, and use the rule in match Jobs. For further information, see the tMatchGroup documentation.

Defining a blocking key from the match analysis

About this task

Defining a blocking key is not mandatory but advisable. A blocking key partitions the data into blocks and thus reduces the number of records to be examined, as comparisons are restricted to record pairs within each block. Using blocking keys is particularly useful when you process large data sets.
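The following Java sketch illustrates why blocking reduces the number of comparisons; it uses a hypothetical blocking key (the first three characters of the value, upper-cased) chosen purely for illustration, and only records sharing the same key are later compared pairwise.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {

    // Hypothetical blocking key: the first three characters of the value, upper-cased.
    static String blockKey(String value) {
        String cleaned = value.trim().toUpperCase();
        return cleaned.substring(0, Math.min(3, cleaned.length()));
    }

    // Partitions the records into blocks; match rules are then applied only to
    // pairs of records inside the same block.
    static Map<String, List<String>> partition(List<String> records) {
        Map<String, List<String>> blocks = new HashMap<>();
        for (String record : records) {
            blocks.computeIfAbsent(blockKey(record), k -> new ArrayList<>()).add(record);
        }
        return blocks;
    }
}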

Procedure

  1. In the rule editor, in the Generation of Blocking Key section, click the [+] button to add a row to the table.
  2. Set the parameters of the blocking key as follows:
    • Blocking Key Name: Enter a name for the column you want to use to reduce the number of record pairs that need to be compared.

    • Pre-algorithm: Select from the drop-down list an algorithm and set its value where necessary.

      Defining a pre-algorithm is not mandatory. This algorithm is used to clean or standardize data before processing it with the match algorithm and thus improve the results of data matching.

    • Algorithm: Select from the drop-down list the match algorithm you want to use and set its value where necessary.

    • Post-algorithm: Select from the drop-down list an algorithm and set its value where necessary.

      Defining a post-algorithm is not mandatory. This algorithm is used to clean or standardize data after processing it with the match algorithm and thus improve the outcome of data matching.

  3. Follow the same steps to add as many blocking keys as needed.
    When you import a rule with many blocking keys into the match analysis editor, only one blocking key will be generated and listed in the BLOCK_KEY column in the Data table.
    For further information about the blocking key parameters, see the tGenKey documentation.
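The sketch below gives a rough idea of how such a three-stage chain could look, assuming, purely for illustration, that the pre-algorithm removes diacritics, the algorithm keeps the first N characters, and the post-algorithm upper-cases the result. These assumed functions do not correspond to specific entries in the tGenKey algorithm lists.

import java.text.Normalizer;

public class BlockingKeySketch {

    // Pre-algorithm (assumed): remove diacritics to standardize the value.
    static String preAlgorithm(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    // Algorithm (assumed): keep the first N characters of the value.
    static String algorithm(String value, int n) {
        return value.substring(0, Math.min(n, value.length()));
    }

    // Post-algorithm (assumed): upper-case the generated key.
    static String postAlgorithm(String value) {
        return value.toUpperCase();
    }

    public static String blockingKey(String value, int n) {
        return postAlgorithm(algorithm(preAlgorithm(value), n));
    }

    public static void main(String[] args) {
        System.out.println(blockingKey("Müller", 3)); // prints MUL
    }
}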

Defining a matching key

Procedure

  1. In the rule editor, in the Matching Key table, click the [+] button to add a row to the table.
  2. Set the parameters of the matching key as follows:
    • Match Key Name: Enter the name of your choice for the match key.

    • Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you want to use an external user-defined matching algorithm.

      In this example, two match keys are defined: the Levenshtein algorithm is applied to first names and the Jaro-Winkler algorithm to last names to retrieve the duplicate records. A sketch combining these two methods with confidence weights and the match thresholds appears after this procedure.

    • Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-defined algorithm.

    • Confidence Weight: Assign a numerical weight (between 1 and 10) to the column you want to use as a match key. This value gives greater or lesser importance to certain columns when performing the match.

    • Handle Null: Specify how to deal with data records which contain null values.

    For further information about the match rule parameters, see the tMatchGroup documentation.
  3. In the Match threshold field, enter the match probability threshold. Two data records match when the probability is above this value.
    In the Confident match threshold field, set a numerical value between the current Match threshold and 1. Above this threshold, you can be confident about the quality of the group.
  4. To define a second match rule, place your cursor on the top right corner of the Matching Key table and then click the [+] button.
    Follow the previous steps to define the new match rule.
    When you define several match rules in the match rule editor, an OR operation is performed on the analyzed data: records are evaluated against the first rule, and the records that match it are not evaluated against the subsequent rules.
  5. Optional: To replace the default names of the rules, click Edit/Sort Rule Names on the top right corner of the table.
    You can also use the up and down arrows in the dialog box to change the rule order and thus decide what rule to execute first.
  6. Click OK.
    The rules are named and ordered accordingly in the Matching Key table.
  7. Save the match rule settings.
    The match rule is saved and centralized under Libraries > Rule > Match in the DQ Repository tree view.
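The following Java sketch illustrates how a rule such as the one in this example could score a pair of records. It assumes the Apache Commons Text implementations of Levenshtein and Jaro-Winkler, hypothetical confidence weights of 1 and 2 on the first-name and last-name keys, and a weighted average as the combination function; the way the VSR algorithm actually combines key scores is not reproduced here, so treat the numbers purely as an illustration of how the match and confident match thresholds are applied.

import org.apache.commons.text.similarity.JaroWinklerSimilarity;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class MatchScoreSketch {

    static final LevenshteinDistance LEVENSHTEIN = LevenshteinDistance.getDefaultInstance();
    static final JaroWinklerSimilarity JARO_WINKLER = new JaroWinklerSimilarity();

    // Normalize the Levenshtein edit distance to a similarity in [0, 1].
    static double levenshteinSimilarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) LEVENSHTEIN.apply(a, b) / maxLen;
    }

    // Weighted combination of the two match keys (weights are assumptions).
    static double matchScore(String firstA, String lastA, String firstB, String lastB) {
        double firstNameWeight = 1.0; // Confidence Weight on the first-name key
        double lastNameWeight = 2.0;  // Confidence Weight on the last-name key
        double score = firstNameWeight * levenshteinSimilarity(firstA, firstB)
                     + lastNameWeight * JARO_WINKLER.apply(lastA, lastB);
        return score / (firstNameWeight + lastNameWeight);
    }

    public static void main(String[] args) {
        double matchThreshold = 0.85;      // Match threshold from the rule
        double confidentThreshold = 0.95;  // Confident match threshold from the rule

        double score = matchScore("Jon", "Smith", "John", "Smyth");
        if (score >= confidentThreshold) {
            System.out.println("Confident match: " + score);
        } else if (score >= matchThreshold) {
            System.out.println("Match: " + score);
        } else {
            System.out.println("No match: " + score);
        }
    }
}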
