The main steps of the process are as follows:
For each row, the values of the columns used as a blocking key are concatenated. Then, all the suffixes of length equal to or greater than the value set for the Min suffix length parameter are generated. By default, this value is set to 3 in Talend Studio.
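For example, the first_name and last_name columns are used as a blocking key. The first_name column contains the value John and the last_name column contains the value Doe. Then, with the default minimum suffix length of 3, the suffixes generated from the concatenated value JohnDoe are Doe, nDoe, hnDoe, ohnDoe and JohnDoe.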
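The suffix generation described above can be sketched in Python as follows. This is only an illustration, not the component's actual code: the function name generate_suffixes is hypothetical, and the sketch assumes the column values are concatenated directly, without a separator or any case normalization.

```python
def generate_suffixes(values, min_suffix_length=3):
    """Concatenate the blocking-key values and return every suffix whose
    length is at least min_suffix_length, shortest suffix first.

    Assumption: plain concatenation, no separator, case preserved.
    """
    key = "".join(values)
    # Start positions run from the shortest allowed suffix back to the full key.
    return [key[i:] for i in range(len(key) - min_suffix_length, -1, -1)]


print(generate_suffixes(["John", "Doe"]))
# ['Doe', 'nDoe', 'hnDoe', 'ohnDoe', 'JohnDoe']
```

A key shorter than the minimum suffix length produces no suffix at all in this sketch, so such a row would not take part in any block.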
The remaining suffixes are sorted alphabetically. If two consecutive suffixes are too similar, they are merged.
If the number of rows having a given suffix exceeds the value set for the Max block size parameter, this suffix is considered to be too frequent. It is then removed.
For each suffix, all the possible pairs of rows having this suffix are generated. The value set for the Max block size parameter should not be too large, because the number of pairs grows quadratically with the block size: a block of n rows yields n(n-1)/2 pairs. By default, the value for the Max block size parameter is set to 10.
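As a rough illustration of the two steps above (dropping suffixes that are too frequent, then generating pairs), here is a minimal Python sketch. The function candidate_pairs and the row identifiers are hypothetical, and the merging of similar consecutive suffixes is deliberately left out because its similarity criterion is not specified here.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def candidate_pairs(suffixes_by_row, max_block_size=10):
    """Group row ids by suffix, skip suffixes shared by more than
    max_block_size rows, and return the set of row-id pairs that
    share at least one remaining suffix."""
    blocks = defaultdict(set)
    for row_id, suffixes in suffixes_by_row.items():
        for suffix in suffixes:
            blocks[suffix].add(row_id)

    pairs = set()
    for rows in blocks.values():
        if len(rows) > max_block_size:
            continue  # suffix considered too frequent: removed
        pairs.update(combinations(sorted(rows), 2))
    return pairs


rows = {
    1: ["Doe", "nDoe", "hnDoe", "ohnDoe", "JohnDoe"],
    2: ["Doe", "nDoe", "onDoe", "JonDoe"],
    3: ["ith", "mith", "Smith"],
}
print(candidate_pairs(rows))  # {(1, 2)}

# A block at the default maximum size of 10 rows yields at most 45 pairs.
print(comb(10, 2))  # 45
```

Collecting the pairs in a set means that two rows sharing several suffixes are still compared only once.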
The last step is the filtering step. It consists of removing the pairs that are less likely to match. A score is computed for each pair and added to the output schema. To compute a score for each pair of suspect duplicates, a sample of fixed size is generated. The default size is set to 10000. The following two steps are applied to the sample:
Computing different measures - Levenshtein, Jaro-Winkler and Exact without case - for each pair and each column.
Computing the percentile for each pair of suspect duplicates and each column.
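To make these two sample steps more concrete, here is a minimal Python sketch for a single column. It rests on several assumptions that are not stated in this documentation: the Levenshtein distance is normalized into a similarity between 0 and 1, the percentile of a value is taken as the fraction of sample values lower than or equal to it, and Jaro-Winkler is omitted for brevity. All function names are hypothetical.

```python
from bisect import bisect_right


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_without_case(a, b):
    """1.0 if the two values are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def percentile_of(value, sorted_sample):
    """Assumed definition: share of sample values lower than or equal to value."""
    return bisect_right(sorted_sample, value) / len(sorted_sample)


# Similarities measured on a (tiny, illustrative) sample of pairs for one column.
sample = sorted([0.0, 0.25, 0.5, 0.75, 1.0])
similarity = levenshtein_similarity("John", "Jon")  # 0.75
print(percentile_of(similarity, sample))            # 0.8
```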
The sample makes it possible to closely approximate the percentile that corresponds to any given value of a measure. On the global dataset, the different measures are computed for each pair and the percentile is retrieved for each measure. Then the score is computed in two steps:
Computing the maximum percentile over measures for each column.
Computing the minimum percentile over columns.
If the score is lower than the threshold, the pair is filtered out. This guarantees that, for every pair that is kept, each column has at least one measure whose percentile reaches the threshold, meaning that all the columns match with respect to at least one measure.
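The scoring rule above (maximum over measures, then minimum over columns) can be sketched in Python as follows. The function names, the dictionary layout and the percentile values are hypothetical and only serve to illustrate the rule.

```python
def pair_score(percentiles):
    """percentiles maps each column to a {measure: percentile} dictionary.
    The score is the minimum over columns of the maximum over measures."""
    return min(max(measures.values()) for measures in percentiles.values())


def keep_pair(percentiles, threshold):
    """A pair is filtered out when its score is lower than the threshold."""
    return pair_score(percentiles) >= threshold


pair = {
    "first_name": {"levenshtein": 0.90, "jaro_winkler": 0.95, "exact": 0.40},
    "last_name": {"levenshtein": 0.70, "jaro_winkler": 0.80, "exact": 0.99},
}
print(pair_score(pair))       # 0.95
print(keep_pair(pair, 0.9))   # True
```

Because the minimum is taken last, a single column whose measures all fall below the threshold is enough to filter the pair out, however well the other columns match.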