Standardizing values using dictionaries - 2.3

Talend Data Preparation User Guide

author
Talend Documentation Team
EnrichVersion
6.5
2.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation

You can use the Standardize value (fuzzy matching) to find the closest valid value for invalid values within a column.

The function checks the invalid data contained in a column against the current semantic type, and retrieves the correct values, if the selected matching threshold is achieved. This function is only available if the semantic type is based on a dictionary of values present by default in Talend Data Preparation, or that you have created using Talend Dictionary Service. For more information on how to create custom semantic types, or edit the existing ones, see Enriching the semantic types libraries through the user interface.

Let's say that you have to work on a dataset containing various information on customers based in the United-States, such as their names, email addresses and the State they live in.

As you can see in the header of the State column, the data has been recognized as US States, but as shown in the quality bar, some of the entries contain invalid names.

In a single action, you will fix those invalid values, and replace them with the correct value from the US State dictionary, or semantic type, that contains an exhaustive list of all the US States.

Procedure

  1. Click the header of the State column in order to select its content.
  2. In the functions panel, type Standardize values and click the result to open the options for the associated function.
  3. In the Match threshold drop-down list, select the matching percentage that must be achieved between the incorrect value and correct value for the substitution to happen.

    The three following percentage values are available:

    • High: Only values that have a 90% match or more with the correct value are replaced.
    • Default: Only values that have a 80% match or more with the correct value are replaced.
    • None: Replaces the invalid value with the closest valid value.

    The Levenshtein algorithm is used to match the data. In the case of a composed string, the matching process is actually divided in four parts:

    • A search occurs on the full string and on each tokens.
    • Dictionary values that have less than a 3-character difference to the full string or one token are returned.
    • A distance on the possible pairs is computed to return the best one.
    • The user threshold filters the results according to the distance.
    • Clermont Talend matches with Clermont thanks to the first token.
    • Clermont-Ferra matches with Clermont-Ferrand because there are less than three different characters compared to the full string.
    • Clermon-Ferant matches with Clermont because there are more than three different characters compared to the full string but only one different character with the token Clermont.
  4. Point your mouse over the Sumbit button to preview result of the function, and click to apply it.

Results

The incorrect values have been standardized, using the dictionary of US States.