Removing duplicate values

Talend Data Fabric Getting Started Guide

author: Talend Documentation Team
EnrichVersion: 6.4
EnrichProdName: Talend Data Fabric
task: Data Quality and Preparation > Cleansing data; Data Quality and Preparation > Profiling data; Design and Development; Installation and Upgrade

The profiling results of the column analysis show that there are some duplicate records in the Email and Phone columns. See Showing analysis results for details.

From the analysis results, you can generate ready-to-use Jobs that separate unique records from duplicate records in the selected columns. By default, such a Job writes all the duplicates to a delimited reject file and writes the unique values to the database used in the analysis.

You can follow the same procedure to remove duplicates from either the Email column or the Phone column.
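The split that such a Job performs can be illustrated outside the Studio. The following is a minimal Python sketch, not the code that the Studio generates. It assumes that the first occurrence of each value is kept as unique and that every later occurrence is routed to the rejects, using an exact, case-sensitive comparison. The function name, the row layout and the sample addresses are hypothetical.

```python
def split_unique_and_duplicates(rows, key):
    """Return (uniques, duplicates) for the given key column.

    Assumption: the first row seen for a value is kept as unique,
    and every later row with the same value is a duplicate.
    """
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.append(row)   # would go to the reject file
        else:
            seen.add(value)
            uniques.append(row)      # would go to the new table
    return uniques, duplicates


# Hypothetical sample rows standing in for the analyzed table.
rows = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "ann@example.com"},
]
uniques, duplicates = split_unique_and_duplicates(rows, "email")
print("unique ids:", [r["id"] for r in uniques])
print("duplicate ids:", [r["id"] for r in duplicates])
```

In this sketch only the selected column decides whether a row is a duplicate, so two rows with the same email but different values elsewhere still count as one unique row and one reject.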

Before you begin

  • You have opened the Profiling perspective in the Studio.

  • You have created and executed the column analysis. For further information, see Identifying anomalies in data.

Procedure

  1. Open the column analysis in the Profiling perspective and click Analysis Results at the bottom of the editor.
  2. In the Simple Statistics results of the Email or Phone column, right-click Duplicate Count and select Identify duplicates.

    This example uses the simple statistics results computed on the Email column.

    The Integration perspective opens showing the generated Job, and the Job is listed in the Repository tree view.

    The tMysqlInput, tUniqueRow and tMysqlOutputBulkExec components are automatically configured according to your connection and the columns you are analyzing. tMysqlOutputBulkExec writes the unique records to a new table in MySQL, and tFileOutputDelimited writes the duplicate records to a delimited output file.

  3. Press F6 to execute the Job.

Results

Duplicate values are written to the delimited output file and unique records are written to a new table in the gettingstarted database in MySQL.
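To get a quick view of what ended up in the reject file, you can count how often each value appears in it. The following is a minimal Python sketch under stated assumptions: the file has a header row, fields are separated by semicolons, and the column is named email. None of these details is confirmed by this guide, so adjust them to match your generated Job. The function name and the sample content are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_rejects(handle, column, delimiter=";"):
    """Count the occurrences of each value of `column` in a delimited file.

    Assumptions: the file starts with a header row and uses `delimiter`
    as its field separator.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return Counter(row[column] for row in reader)


# Hypothetical content standing in for the reject file.
sample = io.StringIO(
    "id;email\n"
    "3;ann@example.com\n"
    "5;bob@example.com\n"
    "6;ann@example.com\n"
)
summary = summarize_rejects(sample, "email")
for value, count in summary.most_common():
    print(value, count)
```

To run the same check on a real file, pass an opened file object instead of the in-memory sample, for example `open(path, newline="")`, where `path` is the location configured in the output component of your Job.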