Finding and grouping similar content - 7.3

Talend Data Preparation Getting Started Guide

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Preparation
Content: Data Quality and Preparation > Cleansing data
Last publication date: 2023-01-05

Finding and grouping similar text lets you harmonize values that differ from each other only by small variations.

Note: The Find and group similar text function does not support Asian characters.

The customers.xlsx file contains information about the occupation of your clients. Some of the values are very similar to each other, for example College/Grad Student and College Student. Grouping such values together improves the readability, and thus the quality, of your data.
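To illustrate what "very similar" can mean in practice, here is a minimal sketch that scores two strings with Python's standard `difflib` module. It is an assumption made for illustration only: this guide does not describe the algorithm that Talend Data Preparation uses, and the `similarity` helper and the sample pairs are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means identical (case ignored)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical sample pairs; only the first pair comes from this guide.
pairs = [
    ("College/Grad Student", "College Student"),
    ("College Student", "Engineer"),
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

A high score suggests that two values are candidates for grouping; a low score suggests that they are unrelated.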

To find and group similar content, proceed as follows:

Procedure

  1. Click the header of the Occupation column to select its content.

    The statistics box confirms that some job titles differ only slightly from each other.

  2. In the functions list, select Find and Group Similar Text....

    The Find and group similar text menu opens.

    All similar occupations are grouped together in the second column, in this case College/Grad Student and College Student. The third column suggests an occupation title that could replace the values in the second column. You can choose another value from the drop-down list, or type an entirely new one. Clear the check boxes in front of the values or groups of values that you want to leave unchanged.

  3. In the drop-down list of the third column, select College Student.
  4. Click Submit.

Results

All the occurrences of College/Grad Student and College Student are now grouped under College Student, the new harmonized value.
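The same group-then-replace workflow can be sketched outside the application. The following is a rough Python illustration under stated assumptions, not the method that Talend Data Preparation implements: the `group_similar` and `harmonize` helpers, the 0.8 threshold, and the sample `occupations` list are all hypothetical.

```python
from difflib import SequenceMatcher


def group_similar(values, threshold=0.8):
    """Greedily cluster distinct values that resemble a group's first member."""
    groups = []
    for value in sorted(set(values)):
        for group in groups:
            ratio = SequenceMatcher(None, value.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(value)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([value])
    return groups


def harmonize(values, replacements):
    """Replace each value that has a mapping; leave the others unchanged."""
    return [replacements.get(value, value) for value in values]


# Hypothetical sample data standing in for the Occupation column.
occupations = ["College Student", "College/Grad Student", "Engineer", "College Student"]

groups = group_similar(occupations)

# The harmonized value is chosen by hand, as in step 3 of the procedure.
target = "College Student"
replacements = {value: target for group in groups if target in group for value in group}

cleaned = harmonize(occupations, replacements)
print(groups)
print(cleaned)
```

Values that fall into a group of their own, such as Engineer in this sample, have no entry in `replacements` and are left as they are, which mirrors clearing a check box to keep a value unchanged.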