Removing duplicate values - 7.3

Data Quality Job and Analysis Examples

Version: 7.3
Language: English (United States)
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Open Studio for Data Quality, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Data Quality and Preparation

After you analyze the email and postal columns with simple statistics indicators, the analysis results show the number of duplicate records in each column. From these results, you can generate a ready-to-use Job that removes duplicate values from the selected column.
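To make the duplicate figure concrete, here is a minimal Python sketch of how such counts could be computed for one column. It is an illustration only, not Talend's implementation. It assumes the duplicate count is the number of distinct values that occur more than once; the function name `simple_statistics`, the dictionary keys, and the sample addresses are hypothetical.

```python
from collections import Counter

def simple_statistics(values):
    """Hypothetical helper: summarize how often each value occurs in a column."""
    counts = Counter(values)
    return {
        "row_count": len(values),
        # Number of different values in the column.
        "distinct_count": len(counts),
        # Values that occur exactly once.
        "unique_count": sum(1 for n in counts.values() if n == 1),
        # Assumption: values that occur more than once.
        "duplicate_count": sum(1 for n in counts.values() if n > 1),
    }

# Made-up sample data standing in for an email column.
emails = [
    "ann@example.com",
    "bob@example.com",
    "ann@example.com",
    "cy@example.com",
]
stats = simple_statistics(emails)
```

Under this assumption, the unique count and the duplicate count add up to the distinct count, which gives a quick sanity check on the numbers.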

The procedure below applies equally to the Email and Phone columns.

Procedure

  1. In the Profiling perspective, click Analysis Results at the bottom of the editor.
  2. In the Simple Statistics results of the Email or Phone column, right-click the duplicate count bar in the chart and select Remove duplicates.

    This example uses the simple statistics results for the Email column.

    The Integration perspective opens, showing the generated Job.

    The database input component and the tUniqueRow component are already configured according to your connection and the columns you are analyzing.

  3. Save the Job and press F6 to execute it.

Results

Duplicate values are written to the specified output database and file.
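The split that a deduplication step performs can be sketched in a few lines of Python. This is a conceptual illustration only, not the code Talend generates. It assumes that the first row seen for each key value is treated as unique and every later row with the same key as a duplicate; the function name `split_unique_rows`, the `case_sensitive` parameter, and the sample rows are hypothetical.

```python
def split_unique_rows(rows, key, case_sensitive=True):
    """Hypothetical helper: return (uniques, duplicates) for the given key column.

    The first row seen for each key value goes to uniques;
    any later row with the same key value goes to duplicates.
    """
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        value = row[key]
        # Optionally compare string keys without regard to case.
        if not case_sensitive and isinstance(value, str):
            value = value.lower()
        if value in seen:
            duplicates.append(row)
        else:
            seen.add(value)
            uniques.append(row)
    return uniques, duplicates

# Made-up sample rows standing in for the analyzed table.
rows = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "ann@example.com"},
]
uniques, duplicates = split_unique_rows(rows, "email")
```

In this sketch the two lists stand in for the two outputs of the Job; where each output is actually written depends on how the generated Job is configured.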

What to do next

You can follow the same procedure to remove duplicates from the postal column.

For further information on using the Profiling perspective to identify and remove corrupt, incomplete or inaccurate data, see the Data Cleansing chapter in the Talend Studio User Guide at https://help.talend.com.