Run the analysis with different probability distributions - 7.3

Data privacy

Version: 7.3
Language: English
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-04-03

Procedure

  1. Switch back to the Integration perspective, select Poisson distribution in the basic settings of tDuplicateRow and run the Job.
  2. In the Profiling perspective, click Chart below the Matching Key table to show the duplicates generated according to the Poisson distribution.
  3. Run the Job with the Geometric distribution, then click Chart in the Profiling perspective to show the duplicates generated according to the Geometric distribution.
    The table below shows how the generated duplicates differ according to the probability distribution you select in the tDuplicateRow component.

    | Probability distribution | Duplicate results | Description |
    | --- | --- | --- |
    | Bernoulli distribution | (chart) | The curve is symmetrical. The groups of duplicates are distributed evenly on each side of an average value, 4 in this example. This value is the average number of duplicates in a group, and it is the number you set in the Average group size field in the basic settings of the tDuplicateRow component. |
    | Poisson distribution | (chart) | The curve is not symmetrical. The groups of duplicates are distributed unevenly. |
    | Geometric distribution | (chart) | The shape of the curve is determined by the percentage you set for the duplicated records in the tDuplicateRow basic settings. The higher the percentage, the fewer groups with many records you will have. In this example the percentage for the duplicate records is set to 80%. This is why many groups with two-record duplicates are generated (148 groups), while the groups with 14, 15 and 16 duplicates occur only once each. |
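
The difference between the three curve shapes can be illustrated outside Talend Studio with a small simulation. The sketch below is only an illustration under stated assumptions: it does not reproduce how tDuplicateRow generates duplicates. All function names and the parameters `mean`, `trials` and `p` are hypothetical, `p` is not claimed to correspond to the percentage field of the component, and the symmetric case is approximated by a sum of fair Bernoulli trials. The helper `groups_per_size` counts how many groups exist for each group size, which is the kind of information the chart displays.

```python
import math
import random
from collections import Counter


def sample_binomial(rng, trials, p):
    """Sum of independent Bernoulli trials (assumed stand-in for the symmetric case)."""
    return sum(rng.random() < p for _ in range(trials))


def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_geometric(rng, p):
    """Number of trials up to and including the first success (always >= 1)."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - p)) + 1


def groups_per_size(sizes):
    """Map each group size to the number of groups that have that size."""
    return dict(sorted(Counter(sizes).items()))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same counts
    n_groups = 1000          # hypothetical number of duplicate groups

    samples = {
        "binomial (symmetric around 4)": [sample_binomial(rng, 8, 0.5) for _ in range(n_groups)],
        "poisson (mean 4)": [sample_poisson(rng, 4.0) for _ in range(n_groups)],
        "geometric (hypothetical p = 0.5)": [sample_geometric(rng, 0.5) for _ in range(n_groups)],
    }
    for name, sizes in samples.items():
        print(name)
        for size, count in groups_per_size(sizes).items():
            print(f"  size {size:2d}: {count} groups")
```

In this sketch, the binomial counts are roughly mirrored around 4, the Poisson counts have a longer tail toward large sizes, and the geometric counts fall steadily as the size grows, so the smallest groups are by far the most frequent.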