Defining a blocking key - Cloud - 7.3

Talend Studio User Guide

Last publication date
2024-02-13
Available in...

Big Data Platform

Cloud API Services Platform

Cloud Big Data Platform

Cloud Data Fabric

Cloud Data Management Platform

Data Fabric

Data Management Platform

Data Services Platform

MDM Platform

Real-Time Big Data Platform

About this task

Defining a blocking key is not mandatory, but it is strongly recommended. A blocking key partitions the data into blocks, which reduces the number of record pairs that need to be compared because comparisons are restricted to pairs within each block. Blocking columns are especially useful when you process a large dataset.
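To make the reduction concrete, the following is a minimal Python sketch that counts record pairs with and without blocking. It is not Talend code: the function names and the sample keys are hypothetical, and it only assumes that every record within a block is compared with every other record in that block.

```python
from collections import Counter


def pair_count(n):
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2


def comparisons_with_blocking(block_keys):
    """Sum of within-block pair counts, given one blocking key per record."""
    sizes = Counter(block_keys)
    return sum(pair_count(size) for size in sizes.values())


# Hypothetical sample: 10 records, each already assigned a blocking key.
keys = ["F", "F", "F", "F", "G", "G", "G", "U", "U", "U"]

total_pairs = pair_count(len(keys))              # every record against every other
blocked_pairs = comparisons_with_blocking(keys)  # only pairs that share a key

print(total_pairs, blocked_pairs)  # prints: 45 12
```

In this made-up sample, blocking cuts the work from 45 comparisons to 12. The saving grows with the dataset size, provided that the blocks stay reasonably small and balanced.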

Procedure

  1. In the Data section, click the Select Blocking Key tab.
  2. Click the name of the column(s) you want to use to partition the processed data in blocks.
    Blocking keys that have the exact name of the selected columns are listed in the Blocking Key table.
    You can define more than one column in the table, but only one blocking key will be generated and listed in the BLOCK_KEY column in the Data table.
    For example, if you apply an algorithm that keeps the first character to the country and lname columns, records that share the same first letter of the country and the same first letter of the last name are grouped together in the same block. Comparison is restricted to records within each block.
    To remove a column from the Blocking Key table, right-click it and select Delete, or click its name in the Data table.
  3. Select an algorithm for the blocking key, and set the other parameters in the Blocking Key table as needed.
    In this example, only one blocking key is used. The first character of each word in the country column is retrieved and listed in the BLOCK_KEY column.
  4. Click Chart to compute the generated key, group the sample records in the Data table, and display the results in a chart.
    The chart shows statistics about the number of blocks, so that you can adjust the blocking parameters until you get the results you want.
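The procedure above can be approximated outside the Studio with a short Python sketch. It is an illustration under stated assumptions, not the Studio's implementation: the helper names, the sample records, and the way per-column keys are concatenated into a single BLOCK_KEY value are all hypothetical. It builds a key from the first character of each word in the selected columns, groups the records by key, and then counts how many blocks exist for each block size, which is roughly the kind of statistic the chart displays.

```python
from collections import Counter, defaultdict


def first_char_of_each_word(value):
    """Keep the first character of every whitespace-separated word, uppercased."""
    return "".join(word[0] for word in value.split()).upper()


def block_key(record, columns):
    """Build one key from several columns (concatenation is an assumption)."""
    return "".join(first_char_of_each_word(record[col]) for col in columns)


# Hypothetical sample records with the country and lname columns.
records = [
    {"country": "France", "lname": "Fabre"},
    {"country": "France", "lname": "Faure"},
    {"country": "United States", "lname": "Smith"},
    {"country": "United States", "lname": "Stone"},
    {"country": "Germany", "lname": "Schmidt"},
]

# Group the records into blocks by their generated key.
blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec, ["country", "lname"])].append(rec)

# Map each block size to the number of blocks that have that size.
size_distribution = Counter(len(members) for members in blocks.values())

for key, members in sorted(blocks.items()):
    print(key, len(members))
```

With these sample records, the keys are FF, GS, and USS, and only records that share a key would be compared with each other. A distribution dominated by a few very large blocks suggests that the key is too coarse, while many blocks of size one suggest that it is too strict for matching to find duplicates.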