Finding similar values - 7.3

Talend Data Preparation User Guide

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Preparation
Content: Data Quality and Preparation > Cleansing data
Last publication date: 2023-11-28

To find and filter values that closely resemble a given text, for example to spot and fix typos, use the Match Similar Text function.

This function creates a new column that contains true for each value that matches the reference text within the allowed fuzziness, and false for each value that does not.

Procedure

  1. Select the text column where you want to find similar text.
  2. In the Functions panel, type Match Similar Text and click the result to open the options for the associated function.
  3. Fill in the options according to your needs.

    The Reference field contains the text you want to compare against. The Fuzziness field sets the maximum number of single-character changes (characters added, removed or replaced) allowed between a value and the Reference. This number is called the Levenshtein distance.

    Note that the Reference field is case sensitive.

    In this example, the Reference text is new and the Fuzziness (Levenshtein distance) is 1. The function matches words such as "few", "now", "net" or "news", which are each one change away from "new", but not "bow", "nap" or "led", which are each two changes away.

  4. Click the Submit button to apply the function with the selected options.

Results

A new column is added to the dataset, containing true for the rows where the value matches and false for the rows where it does not.

For more information on the Levenshtein distance, see https://en.wikipedia.org/wiki/Levenshtein_distance
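
To make the role of the Fuzziness value more concrete, the following Python sketch computes the Levenshtein distance and applies it as a threshold to the sample words from the procedure. It is an illustration only, not the implementation used by Talend Data Preparation: the names levenshtein and match_similar_text are hypothetical, and the sketch assumes that a difference in letter case counts as one replaced character.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions or substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # remove char_a
                current[j - 1] + 1,      # add char_b
                previous[j - 1] + cost,  # replace char_a, or keep it if equal
            ))
        previous = current
    return previous[-1]


def match_similar_text(value: str, reference: str, fuzziness: int) -> bool:
    """Hypothetical helper: True if value is within `fuzziness` edits of
    `reference`. The comparison is case sensitive (assumption: a case
    difference counts as one replaced character)."""
    return levenshtein(value, reference) <= fuzziness


reference, fuzziness = "new", 1
values = ["few", "now", "net", "news", "bow", "nap", "led", "NEW"]

# Mimics the new true/false column produced for each row.
matches = {value: match_similar_text(value, reference, fuzziness) for value in values}

for value, is_match in matches.items():
    print(f"{value}: {str(is_match).lower()}")
```

Under these assumptions, "few", "now", "net" and "news" yield true, while "bow", "nap" and "led" yield false. "NEW" also yields false, because all three characters differ in case from the Reference.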