Blocking by partitions

Record linkage is a computationally demanding task because each record must be compared with every other record in the data set: n records produce n(n-1)/2 pairs to compare. To make this task more efficient, a blocking step is required most of the time.

Blocking consists of sorting data into similar-sized partitions in which all records share the same value for a given attribute. The objective is to restrict comparisons to the records grouped within the same partition.
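The idea can be illustrated outside of any particular tool with a minimal Python sketch. It is not how a Talend component works internally; the sample records, the helper names block_by_key and candidate_pairs, and the choice of the lowercased last name as the partitioning attribute are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_func):
    """Group records into partitions (blocks) that share the same key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield only the pairs of records that sit in the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "first_name": "Anna", "last_name": "Smith"},
    {"id": 2, "first_name": "Ana", "last_name": "Smith"},
    {"id": 3, "first_name": "John", "last_name": "Doe"},
    {"id": 4, "first_name": "Jon", "last_name": "Doe"},
    {"id": 5, "first_name": "Mary", "last_name": "Jones"},
]

# Assumed partitioning attribute: the lowercased last name.
blocks = block_by_key(records, lambda r: r["last_name"].lower())
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]

# Number of comparisons without blocking: n(n-1)/2.
all_pairs = len(records) * (len(records) - 1) // 2

print(pairs)                        # [(1, 2), (3, 4)]
print(len(pairs), "of", all_pairs)  # 2 of 10
```

In this small example, blocking reduces the work from 10 comparisons to 2, because records are only compared with others that carry the same last name.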

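To create efficient partitions, you need to find attributes that are unlikely to change, such as a person's first name or last name. Two records that describe the same entity but have different values for the blocking attribute end up in different partitions and are never compared. Choosing stable attributes therefore improves both the reliability of the blocking step and the computation speed of the task.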

It is recommended to use the tGenKey component to generate blocking keys and to view the distribution of the blocks.
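To show what a block distribution conveys, here is a minimal Python sketch that works on a list of already generated blocking keys. It does not reproduce tGenKey or its output; the key values and the function names block_size_distribution and comparisons_within_blocks are hypothetical.

```python
from collections import Counter


def block_size_distribution(keys):
    """Return {block_size: number_of_blocks_with_that_size}."""
    records_per_key = Counter(keys)  # blocking key -> number of records
    return dict(Counter(records_per_key.values()))


def comparisons_within_blocks(keys):
    """Count the pairwise comparisons needed when comparing only inside blocks."""
    records_per_key = Counter(keys)
    return sum(n * (n - 1) // 2 for n in records_per_key.values())


# Hypothetical blocking keys, one per record.
keys = ["SMI", "SMI", "SMI", "DOE", "DOE", "JON", "LEE"]

distribution = block_size_distribution(keys)
total = comparisons_within_blocks(keys)

print(distribution)  # {3: 1, 2: 1, 1: 2}
print(total)         # 4
```

A distribution dominated by a few very large blocks suggests that the key is too coarse and that many comparisons remain. A distribution made almost entirely of single-record blocks suggests that the key is too strict and that matching records may be separated.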

For more information on generating blocking keys, see Identification.
