Removing duplicate values - 7.3

Talend Data Management Platform Getting Started Guide

Version: 7.3
Language: English
Operating system: Data Management Platform
Product: Talend Data Management Platform
Module: Talend Administration Center, Talend DQ Portal, Talend Installer, Talend Runtime, Talend Studio
Content: Data Quality and Preparation > Cleansing data; Data Quality and Preparation > Profiling data; Design and Development; Installation and Upgrade
Last publication date: 2023-07-24

The results of the column analysis show duplicate records in the Email and Phone columns. For details, see Showing analysis results.

From the analysis results, you can generate out-of-the-box Jobs that separate unique records from duplicate records in the selected columns. By default, such a Job writes all the duplicates to a delimited reject file and writes the unique values to the database used in the analysis.

You can follow the same procedure to remove duplicates from either the Email or the Phone column.
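
To make the unique/duplicate split concrete, here is a minimal Python sketch of the idea, not the Java code that the Studio generates. It assumes that the first record seen for a given value counts as unique and that every later record with the same value counts as a duplicate, using exact (case-sensitive) comparison. The function name split_unique_duplicates and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def split_unique_duplicates(rows: Iterable[Row], key: str) -> Tuple[List[Row], List[Row]]:
    """Send the first row seen for each key value to uniques, later ones to duplicates."""
    seen = set()
    uniques: List[Row] = []
    duplicates: List[Row] = []
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.append(row)  # value already encountered: reject
        else:
            seen.add(value)
            uniques.append(row)  # first occurrence: keep
    return uniques, duplicates


# Hypothetical sample data, not taken from the gettingstarted database.
customers = [
    {"id": "1", "email": "ann@example.com"},
    {"id": "2", "email": "bob@example.com"},
    {"id": "3", "email": "ann@example.com"},
]
uniques, duplicates = split_unique_duplicates(customers, key="email")
print([r["id"] for r in uniques])     # ['1', '2']
print([r["id"] for r in duplicates])  # ['3']
```

Passing key="phone" to the same function would apply the split to a phone column instead.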

Before you begin

  • You have opened the Profiling perspective in the Studio.

  • You have created and executed the column analysis. For further information, see Identifying anomalies in data.

Procedure

  1. Open the column analysis in the Profiling perspective and click Analysis Results at the bottom of the editor.
  2. In the Simple Statistics results of the Email or Phone column, right-click Duplicate Count and select Identify duplicates.

    This example uses the Simple Statistics results of the Email column.

    The Integration perspective opens showing the generated Job, and the Job is listed in the Repository tree view.

    The tMysqlInput, tUniqueRow and tMysqlOutputBulkExec components are automatically configured according to your connection and the columns you are analyzing. tMysqlOutputBulkExec writes unique records to a new table in MySQL, and tFileOutputDelimited writes duplicate records to a delimited output file.

  3. Press F6 to execute the Job.
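
The data flow of the generated Job can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the code that the Studio produces: an in-memory SQLite database stands in for MySQL, an in-memory buffer stands in for the delimited file, and the table names customer and customer_unique, the sample rows, and the ";" separator are all hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in source table with hypothetical rows; the real Job reads from MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "ann@example.com"), (2, "bob@example.com"), (3, "ann@example.com")],
)
# Hypothetical name for the new table that receives the unique records.
conn.execute("CREATE TABLE customer_unique (id INTEGER, email TEXT)")

# In-memory buffer standing in for the delimited reject file.
rejects = io.StringIO()
writer = csv.writer(rejects, delimiter=";", lineterminator="\n")

seen = set()
rows = conn.execute("SELECT id, email FROM customer ORDER BY id").fetchall()
for row_id, email in rows:
    if email in seen:
        writer.writerow([row_id, email])  # duplicate goes to the reject output
    else:
        seen.add(email)
        conn.execute("INSERT INTO customer_unique VALUES (?, ?)", (row_id, email))

unique_rows = conn.execute(
    "SELECT id, email FROM customer_unique ORDER BY id"
).fetchall()
print(unique_rows)         # [(1, 'ann@example.com'), (2, 'bob@example.com')]
print(rejects.getvalue())  # 3;ann@example.com
```

The read, split, and write stages in this sketch correspond loosely to the input, tUniqueRow, and output components of the Job.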

Results

Duplicate values are written to the output file and unique records are written to a new table in the gettingstarted database in MySQL.
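
One way to confirm the result is to check that no value appears more than once in the new table. The sketch below shows such a check in Python, again with an in-memory SQLite database as a stand-in for MySQL. The helper find_remaining_duplicates, the table name customer_unique, and the sample rows are hypothetical; against a real MySQL database, the same GROUP BY ... HAVING query could be run from any SQL client.

```python
import sqlite3


def find_remaining_duplicates(conn, table, column):
    """Return (value, count) pairs for values that occur more than once."""
    # table and column are interpolated into the SQL text,
    # so pass trusted identifiers only.
    query = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1 ORDER BY {column}"
    )
    return conn.execute(query).fetchall()


# Hypothetical stand-in for the table of unique records written by the Job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_unique (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customer_unique VALUES (?, ?)",
    [(1, "ann@example.com"), (2, "bob@example.com")],
)
print(find_remaining_duplicates(conn, "customer_unique", "email"))  # []
```

An empty list means that every value in the checked column is unique.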