Finding similar values

Talend Data Preparation User Guide

author: Talend Documentation Team
EnrichVersion: 6.4, 2.1
EnrichProdName: Talend MDM Platform, Talend Real-Time Big Data Platform, Talend Data Services Platform, Talend Big Data, Talend Data Management Platform, Talend Data Fabric, Talend ESB, Talend Data Integration, Talend Big Data Platform
task: Data Quality and Preparation > Cleansing data
EnrichPlatform: Talend Data Preparation

To find and filter text values that look alike, for example to fix typos, use the Match Similar Text function.

This function creates a new column that contains true when the value in a row is similar enough to the reference text you provide, and false when it is not.

Procedure

  1. Select the text column where you want to find similar text.
  2. In the Functions panel, type Match Similar Text and click the result to open the options for the associated function.
  3. Fill in the options according to your needs.

    The Reference field contains the text you want to compare against. The Fuzziness field sets the maximum number of characters that can be added, removed or replaced for a value to still be considered similar to the Reference. This number is called the Levenshtein distance.

    Note that the Reference field is case sensitive. In this example, the Reference text is "new" and the Fuzziness (Levenshtein distance) is 1.

    With these settings, the function matches words such as "few", "now", "net" or "news", each of which is one edit away from "new", but not "bow", "nap" or "led", which each differ from "new" by two characters.

  4. Click the Submit button to apply the function with the selected options.

Results

The function adds a new column that contains true for each row whose value is within the specified Fuzziness of the Reference text, and false for every other row.
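
To make the matching rule concrete, the following Python sketch computes the Levenshtein distance with the standard dynamic-programming method and derives one true/false value per row. It only illustrates the concept: the names levenshtein and match_similar_text are hypothetical, and it is an assumption that Talend Data Preparation behaves the same way internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # At the start of each outer iteration, previous[j] is the distance
    # between the first i-1 characters of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # comparison is case sensitive
            current.append(min(
                previous[j] + 1,         # remove ca
                current[j - 1] + 1,      # add cb
                previous[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def match_similar_text(values, reference, fuzziness):
    """Hypothetical helper: one boolean per value, like the new column."""
    return [levenshtein(value, reference) <= fuzziness for value in values]


# Sample column using the words from the example in the procedure.
column = ["few", "now", "net", "news", "bow", "nap", "led"]
print(match_similar_text(column, "new", 1))
# [True, True, True, True, False, False, False]
```

The first four words are one edit away from "new", so they fall within a Fuzziness of 1. The last three need two edits each and are rejected.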

For more information on the Levenshtein distance, see https://en.wikipedia.org/wiki/Levenshtein_distance