Deduplicating values in columns

Deduplicating values in columns - 7.3

Talend Data Preparation User Guide

Version

7.3

Language

English

Product

Talend Big Data

Talend Big Data Platform

Talend Data Fabric

Talend Data Integration

Talend Data Management Platform

Talend Data Services Platform

Talend ESB

Talend MDM Platform

Talend Real-Time Big Data Platform

Module

Talend Data Preparation

Content

Data Quality and Preparation > Cleansing data

Last publication date

2023-11-28

You can use the Deduplicate rows with identical values function to easily delete rows that are partly or entirely duplicates with other ones.

Note: This function is not compatible with Spark Jobs, and HDFS or S3 exports.

Duplicated information can be introduced in spreadsheets because of human error, like a bad copy and paste for example, as well as automated operations. In the following dataset, that contains basic customers information, you will notice that the firstname and lastname columns both contain values that are present more than once.

Jake and Peralta are indeed entries that make it look like the firstname and lastname columns contain duplicates when taken separately. But looking more closely shows that the information from rows 1, 2 and 4 belong to separate customers that share either their first or last names. Row 3 on the other hand is a legitimate duplicate of row 2, and is even missing some information.

Because performing a deduplication operation on the two columns separately would make you lose valuable information on customers that happen to have the same first name or last name, you will use the Deduplicate rows with identical values function on the two columns at once. This way, the function will only remove rows were both first and last names are duplicates, like rows 2 and 3, but also other potential duplicates further down in the dataset.

Procedure

While pressing the Ctrl key, click the headers of the firstname and lastname columns to select their content.
In the functions panel, type Deduplicate rows with identical values and click the result to display the options of the associated function.
From the Matching criterion drop down list, select the restriction rule that you want to apply, Exact value for example.
- Simplified text: Punctuation, white spaces, case and accents are ignored. For example, if Pâté-en-croûte is your reference value, rows with pate-eN-cRoute will be deleted but not rows with Pâté n croûte.
- Ignore case and accents: Case and accents are not taken into account. For example, if Pâté-en-croûte is your reference value, rows with pate-en-croute will be deleted but not rows with pate en croute.
- Exact value: The most restrictive validation rule. The rows will be deleted only if there is an exact match with the reference value.
Click Submit.

Results

The row that was a duplicate of row 2 has been deleted, while other rows that contained identical values were kept because they did not match the two-column criteria.