Skip to main content Skip to complementary content

Soundex frequency statistics

Indicators in this group use the Soundex algorithm built in the DBMS.

Those indicators index records by sounds. This way, records with the same English pronunciation are encoded to the same representation so that they can be matched despite minor differences in spelling.

  • Soundex Frequency: computes the number of most frequent distinct records relative to the total number of records having the same pronunciation.
  • Soundex Low Frequency: computes the number of less frequent distinct records relative to the total number of records having the same pronunciation.

To be able to use Soundex frequency statistics indicators on PostgreSQL, Amazon for PostgreSQL and Amazon Redshift, install an extension into the PostgreSQL database using the CREATE EXTENSION fuzzystrmatch; query.

For more information, see PostgreSQL documentation.

To be able to use Soundex frequency statistics indicators on Amazon Redshift, you can also create a custom user-defined function.

For more information, see Creating user-defined functions.

You can only use Soundex frequency statistics indicators on Snowflake with the Java engine.

Chinese characters are only supported by the SQL engine.

Due to some limitation in Teradata soundex implementation, you may not be able to drill down the results of profiling Teradata with this indicator.

The following table shows the indicators that you can select in any database:

Indicator Supported data types with the Java analysis engine Supported data types with the SQL analysis engine
Soundex Frequency Table Text Text
Soundex Low Frequency Table Text Text

Teradata error: "Invalid Input: only Latin letters allowed"

You will get an error message when using Talend Studio to profile the Teradata database using the Soundex Frequency Table indicator because your input is invalid.

From the Profiling perspective, try to profile a column in Teradata, first_name for example, using the Soundex Frequency Table indicator. Run the column analysis with the SQL engine. The analysis runs successfully.

Try to drill down data on the result page: in the Frequency Statistics table in the Analysis Results view, right-click a row and select View Rows. You will get an error in the SQL Editor about the generated SQL query.

[Error 5583] Invalid Input: only Latin letters allowed.

This limitation is due to Teradata soundex implementation. The Teradata database requires that a character string or expression that contains a surname is evaluated in simple Latin characters.

A simple Latin character is one that does not have diacritical marks such as tilde (~) or acute accent (´). There are 26 uppercase simple Latin characters and 26 lowercase simple Latin characters. Even a simple call to SOUNDEX ('Sébastien') cannot be executed on Teradata. Therefore, it is not possible to drill down into all rows that sounds like 'Sébastien'.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!