Blocking - 6.5

Using tMatchGroup with the Simple VSR Matcher and T-Swoosh algorithms

author
Talend Documentation Team
EnrichVersion
6.5
EnrichPlatform
Talend Studio
To avoid comparing every input record with every other record, you can define one or more blocking keys. The dataset is then split into smaller datasets called blocks.

All records within a block have the same blocking key values. Each block is then processed independently, so records are only compared with other records from the same block.
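The splitting step can be pictured with a minimal Python sketch. It is only an illustration of the idea, not how tMatchGroup or tGenKey is implemented; the function name split_into_blocks, the sample records, and the choice of the zip column as blocking key are all hypothetical.

```python
from collections import defaultdict

def split_into_blocks(records, blocking_columns):
    """Group records that share the same values in every blocking column."""
    blocks = defaultdict(list)
    for record in records:
        # The blocking key of a record is the tuple of its blocking column values.
        key = tuple(record[column] for column in blocking_columns)
        blocks[key].append(record)
    return dict(blocks)

# Hypothetical sample data: "zip" is used as the only blocking column.
records = [
    {"name": "Ann Smith", "zip": "75001"},
    {"name": "Anne Smith", "zip": "75001"},
    {"name": "Bob Jones", "zip": "10001"},
]

blocks = split_into_blocks(records, ["zip"])
for key, block in blocks.items():
    print(key, len(block))
```

Here the two records with zip 75001 fall into the same block and can be compared with each other, while the third record is never compared with them.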

Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh algorithms to process data. For example, if 100,000 records are split into 100 blocks of 1,000 records each, the number of comparisons is reduced by a factor of 100, because the number of comparisons within a block grows with the square of the block size. This means the algorithm will run around 100 times faster.
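The factor of 100 can be checked with a short calculation. The sketch below assumes that a block of n records costs about n x n comparisons; the helper name comparisons is hypothetical and only serves this illustration.

```python
def comparisons(block_sizes):
    """Approximate number of record comparisons, assuming size * size per block."""
    return sum(size * size for size in block_sizes)

# One single block containing all 100,000 records (no blocking).
without_blocking = comparisons([100_000])

# 100 blocks of 1,000 records each.
with_blocking = comparisons([1_000] * 100)

reduction_factor = without_blocking // with_blocking
print(without_blocking, with_blocking, reduction_factor)
```

In practice, blocks rarely have equal sizes, so the actual gain depends on how evenly the blocking keys distribute the records.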

It is recommended to use the tGenKey component to generate blocking keys and to inspect the profile of the resulting blocks. In a Job, right-click the tGenKey component and select View Key Profile in the contextual menu to display the distribution of blocks by size, that is, how many blocks there are of each size.

In this example, the average block size is around 40.

For the 13 blocks with 38 rows, there will be 18,772 record comparisons (13 x 38²). If records are compared on four columns, this means there will be 75,088 string comparisons in these 13 blocks (18,772 x 4).
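The same arithmetic can be applied to a whole key profile. The following sketch assumes the profile is available as a mapping from block size to number of blocks; the function name profile_comparisons and this data layout are hypothetical and are not an export format of the View Key Profile editor.

```python
def profile_comparisons(key_profile, compared_columns):
    """Estimate comparisons from a profile given as {block_size: number_of_blocks}."""
    # Each block of a given size is counted as size squared record comparisons.
    record_comparisons = sum(
        count * size ** 2 for size, count in key_profile.items()
    )
    # Each record comparison involves one string comparison per compared column.
    string_comparisons = record_comparisons * compared_columns
    return record_comparisons, string_comparisons

# The 13 blocks of 38 rows from the example, compared on four columns.
record_cmp, string_cmp = profile_comparisons({38: 13}, 4)
print(record_cmp, string_cmp)
```

Adding the other block sizes shown in the profile to the mapping gives an estimate for the whole dataset.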