Blocking - Cloud

Data matching with Talend tools

Version

Cloud

8.0

Language

English

Product

Talend Big Data Platform

Talend Data Fabric

Talend Data Management Platform

Talend Data Services Platform

Talend MDM Platform

Talend Real-Time Big Data Platform

Module

Talend Studio

Last publication date

2024-02-06

To avoid comparing every input record with every other record, you can define one or more blocking keys that split the input dataset into smaller datasets called blocks.

Within a block, all records have the same blocking key value. Each block is then processed independently, so records are only compared with other records in the same block.
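The idea can be illustrated with a minimal Python sketch. It is not how tGenKey or the Talend matching algorithms are implemented; the sample records, the `blocking_key` function (first letter of the last name plus the first two digits of the postal code) and the helper names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_func):
    """Group records into blocks that share the same blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield each unordered pair of records once, only within a block."""
    for block in blocks.values():
        yield from combinations(block, 2)


# Hypothetical sample data, not taken from any Talend dataset.
records = [
    {"id": 1, "last_name": "Smith", "zip": "75001"},
    {"id": 2, "last_name": "Smyth", "zip": "75002"},
    {"id": 3, "last_name": "Jones", "zip": "69001"},
    {"id": 4, "last_name": "Johns", "zip": "69003"},
    {"id": 5, "last_name": "Smith", "zip": "69001"},
]


def blocking_key(record):
    # Hypothetical key: first letter of last name + first two zip digits.
    return record["last_name"][0].upper() + record["zip"][:2]


blocks = block_records(records, blocking_key)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
print(sorted(blocks))  # blocking key values found in the data
print(pairs)           # record pairs that would actually be compared
```

Records 1 and 5 both carry the last name Smith but land in different blocks, so they are never compared. This is the trade-off of blocking: a key that is too selective can hide true matches.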

Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh algorithms to process data. For example, if 100,000 records are split into 100 blocks of 1,000 records each, the number of comparisons is reduced by a factor of 100, because the comparison count grows with the square of the block size. This means the algorithm runs around 100 times faster.
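The factor of 100 can be checked with a short calculation. The sketch below assumes, as a rough estimate, that a block of n records costs about n × n comparisons; the function name `estimated_comparisons` is hypothetical.

```python
def estimated_comparisons(block_sizes):
    """Rough estimate: a block of n records costs about n * n comparisons."""
    return sum(n * n for n in block_sizes)


# No blocking: all 100,000 records form a single block.
without_blocking = estimated_comparisons([100_000])

# Blocking: 100 blocks of 1,000 records each.
with_blocking = estimated_comparisons([1_000] * 100)

speedup = without_blocking / with_blocking
print(without_blocking, with_blocking, speedup)
```

The estimate assumes evenly sized blocks. With skewed block sizes, a few large blocks dominate the cost and the gain is smaller.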

It is recommended to use the tGenKey component to generate blocking keys and to view statistics about the resulting blocks. In a Job, right-click the tGenKey component and select View Key Profile in the contextual menu to display the distribution of blocks by size, that is, how many blocks contain a given number of rows.

In this example key profile, the average block size is around 40 rows.

The 13 blocks that contain 38 rows each account for 18,772 record comparisons (13 × 38²). If records are compared on four columns, this means 75,088 string comparisons in these 13 blocks (18,772 × 4).
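The same arithmetic can be applied to a whole key profile. The following sketch assumes the profile is available as a mapping from block size to number of blocks; the `profile_cost` function and that mapping format are hypothetical and are not an export format of tGenKey.

```python
def profile_cost(profile, compared_columns):
    """Estimate comparisons from a {block_size: number_of_blocks} mapping.

    Each block of n rows is counted as n * n record comparisons, and each
    record comparison as one string comparison per compared column.
    """
    record_comparisons = sum(
        count * size ** 2 for size, count in profile.items()
    )
    return record_comparisons, record_comparisons * compared_columns


# 13 blocks of 38 rows, records compared on four columns.
record_cmp, string_cmp = profile_cost({38: 13}, compared_columns=4)
print(record_cmp, string_cmp)  # 18772 75088
```

Adding the other block sizes shown in the key profile to the mapping gives an estimate for the whole dataset.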