Deduplicating rows - 8.0

Talend Data Preparation User Guide

Version: 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Preparation
Content: Data Quality and Preparation > Cleansing data
Last publication date: 2024-03-26

Use the Remove duplicate rows function to delete rows that are exact duplicates of one another, keeping a single occurrence of each in your dataset.

Note: This function is not compatible with Spark Jobs or S3 exports.

Duplicated information can be introduced into spreadsheets through human error, such as a bad copy and paste, or through automated operations. In this example, you have received a dataset containing customer information in which every row is duplicated.

You will use the Remove duplicate rows function to clean your dataset.

Procedure

  1. Click the header of any column in your dataset.
  2. Click the Table tab of the functions panel to display the list of functions that can be applied to the whole table.
  3. Hover over the Remove duplicate rows function to preview its result, then click the function to apply it.

Results

All the duplicated rows have been removed in a single action, leaving exactly one occurrence of each row in your dataset.
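
To make the notion of an "exact duplicate" concrete, here is a minimal Python sketch of the same idea outside Talend Data Preparation. It does not show how the product implements the function. The name `remove_duplicate_rows`, the sample customer rows, and the choice to keep the first occurrence of each row are assumptions made for illustration only.

```python
def remove_duplicate_rows(rows):
    """Return rows with exact duplicates dropped.

    Assumption for this sketch: the first occurrence of each row is kept
    and the original row order is preserved.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # two rows are duplicates only if every column matches
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical customer data in which rows are duplicated.
customers = [
    ("1", "Ada Lovelace", "ada@example.com"),
    ("1", "Ada Lovelace", "ada@example.com"),
    ("2", "Alan Turing", "alan@example.com"),
    ("2", "Alan Turing", "alan@example.com"),
    ("2", "alan turing", "alan@example.com"),  # differs in case, so not an exact duplicate
]

deduplicated = remove_duplicate_rows(customers)
for row in deduplicated:
    print(row)
```

In this sketch, a row that differs in any column, even only by letter case, is treated as a distinct row and is kept.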