Pattern frequency statistics

Indicators in this group determine the most and least frequent patterns.
Remember:

When running an analysis with the SQL engine, percentage values do not appear in the analysis results if you did not select the Row Count indicator.

Date Pattern Frequency supports 30 types of date patterns. If the user-defined pattern is not included, results will be empty. To add a user-defined pattern, create a user-defined indicator.

Pattern frequency indicators

Pattern frequency indicators include Pattern Frequency and Pattern Low Frequency.
There are two types of pattern frequency indicators:
  • The Pattern Frequency indicator counts the records for each distinct pattern and returns the most frequent patterns.
  • The Pattern Low Frequency indicator counts the records for each distinct pattern and returns the least frequent patterns.

These two indicators build patterns by replacing alphabetic characters with a and digits with 9.
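
As an illustration only, the following Python sketch mimics this kind of character replacement and then counts records per pattern. It is not Talend's implementation. The function names to_pattern and pattern_counts are hypothetical, and mapping uppercase letters to A (rather than a) is an assumption borrowed from the East Asia rules on this page.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Replace ASCII letters and digits with placeholders; keep other characters."""
    out = []
    for ch in value:
        if ch.isascii() and ch.isdigit():
            out.append("9")
        elif ch.isascii() and ch.isalpha():
            # Assumption: uppercase letters become "A", lowercase letters "a".
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def pattern_counts(values):
    """Count how many records share each distinct pattern."""
    return Counter(to_pattern(v) for v in values)


values = ["AB-1234", "CD-5678", "ef-9012", "XY-12"]
counts = pattern_counts(values)

# Most frequent patterns first (what Pattern Frequency reports).
most_frequent = counts.most_common()
# Least frequent patterns first (what Pattern Low Frequency reports).
least_frequent = sorted(counts.items(), key=lambda item: item[1])
```

Sorting the same counts in the opposite direction is all that separates the two indicators in this sketch.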

Pattern frequency indicators and database compatibility

The following table shows the indicators that you can select in any database:

Indicator | Supported data types with the Java analysis engine | Supported data types with the SQL analysis engine
Pattern Frequency | Number, Text, Date | Number, Text, Date
Pattern Low Frequency | Number, Text, Date | Number, Text, Date

East Asia pattern frequency indicators

East Asia pattern frequency indicators include East Asia Pattern Frequency and East Asia Pattern Low Frequency.
There are two types of East Asia pattern frequency indicators:
  • The East Asia Pattern Frequency indicator counts the records for each distinct pattern and returns the most frequent patterns.
  • The East Asia Pattern Low Frequency indicator counts the records for each distinct pattern and returns the least frequent patterns.

These two indicators work only with Latin characters and are available only with the Java engine. They are useful when you want to identify patterns in Asian data.

These two indicators build patterns by replacing Asian characters with letters such as H, K, C, and G, following the rules described in the table below:

Character type Usage
Latin numbers 9 replaces all ASCII digits
Latin lowercase letters a replaces all ASCII Latin characters
Latin uppercase letters A replaces all uppercase Latin characters
Full-width Latin numbers 9 replaces all ASCII digits
Full-width Latin lowercase letters a replaces all ASCII Latin characters
Full-width Latin uppercase letters A replaces all uppercase Latin characters
Hiragana H replaces all Hiragana characters
Half-width Katakana k replaces all half-width Katakana characters
Full-width Katakana K replaces all full-width Katakana characters
Katakana K replaces all Katakana characters
Kanji C replaces Chinese characters
Hangul G replaces Hangul characters
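
The replacement rules in this table can be sketched in Python as follows. This is a simplified illustration, not the engine's code: the function name east_asia_pattern is hypothetical, and the Unicode ranges chosen for each character type are assumptions that only approximate the categories listed above.

```python
# (first code point, last code point, replacement symbol)
# The ranges are assumptions chosen to approximate each row of the table.
RULES = [
    (0x0030, 0x0039, "9"),  # Latin (ASCII) digits
    (0x0061, 0x007A, "a"),  # Latin lowercase letters
    (0x0041, 0x005A, "A"),  # Latin uppercase letters
    (0xFF10, 0xFF19, "9"),  # full-width Latin digits
    (0xFF41, 0xFF5A, "a"),  # full-width Latin lowercase letters
    (0xFF21, 0xFF3A, "A"),  # full-width Latin uppercase letters
    (0x3041, 0x309F, "H"),  # Hiragana
    (0xFF66, 0xFF9F, "k"),  # half-width Katakana
    (0x30A0, 0x30FF, "K"),  # full-width Katakana
    (0x4E00, 0x9FFF, "C"),  # Kanji / Chinese characters (CJK Unified Ideographs)
    (0xAC00, 0xD7A3, "G"),  # Hangul syllables
]


def east_asia_pattern(value: str) -> str:
    """Replace each character with its symbol; keep characters with no rule."""
    out = []
    for ch in value:
        code = ord(ch)
        for first, last, symbol in RULES:
            if first <= code <= last:
                out.append(symbol)
                break
        else:
            # No rule matched: keep the character unchanged.
            out.append(ch)
    return "".join(out)


print(east_asia_pattern("東京タワー123"))
```

Under these assumed ranges, two Kanji followed by three Katakana characters and three digits produce the pattern CCKKK999.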

Below is an example of a column analysis using the East Asia Pattern Frequency and East Asia Pattern Low Frequency indicators on an address column.

Configuration to apply the East Asia Pattern Frequency and East Asia Pattern Low Frequency indicators.

The analysis results of the East Asia Pattern Low Frequency indicator will look like the following:

Table and graphical results of the East Asia Pattern Low Frequency Statistics indicator.

These results show the number of records for each of the least frequent patterns. Some patterns contain both characters and numbers, while others contain only characters. The patterns also differ in length, which shows that the address data is inconsistent and may need to be corrected and cleaned.

East Asia pattern frequency indicators and database compatibility

The following table shows the indicators that you can select in any database:

Indicator | Supported data types with the Java analysis engine | Supported data types with the SQL analysis engine
East Asia Pattern Frequency | Number, Text, Date | None
East Asia Pattern Low Frequency | Number, Text, Date | None

Date pattern frequency indicator

This indicator evaluates the most frequent date patterns by counting the number of records for each distinct date pattern.
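
The idea of counting records per date pattern can be sketched in Python as shown below. This is a rough illustration only. The three formats in DATE_FORMATS are a hypothetical subset picked for the example and are not the list of patterns the indicator supports; the function names are also hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical subset of date formats: (pattern label, strptime format).
# The indicator's real list of supported patterns is not reproduced here.
DATE_FORMATS = [
    ("yyyy-MM-dd", "%Y-%m-%d"),
    ("dd/MM/yyyy", "%d/%m/%Y"),
    ("yyyy/MM/dd", "%Y/%m/%d"),
]


def date_pattern(value: str):
    """Return the label of the first format that parses the value, else None."""
    for label, fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return label
        except ValueError:
            continue
    return None


def date_pattern_frequency(values):
    """Count records per recognized date pattern; unrecognized values are skipped."""
    patterns = (date_pattern(v) for v in values)
    return Counter(p for p in patterns if p is not None)


sample = ["2024-01-15", "2023-12-31", "15/01/2024", "not a date"]
frequency = date_pattern_frequency(sample)
```

Values that match none of the known formats contribute nothing to the result, which mirrors the note that unsupported patterns produce empty results.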

Date pattern frequency indicator and database compatibility

The following table shows the indicators that you can select in any database:

Indicator | Supported data types with the Java analysis engine | Supported data types with the SQL analysis engine
Date Pattern Frequency | Text, Date | None

Word-based pattern indicators

Word-based pattern indicators include case-sensitive and case-insensitive indicators.

Word-based pattern indicators count the number of records for each distinct pattern and are available only with the Java engine.

You can use these indicators with the String data type only.

Case-sensitive indicators

There are two types of case-sensitive indicators:
  • The CS Word Pattern Frequency indicator evaluates the most frequent word patterns.
  • The CS Word Pattern Low Frequency indicator evaluates the least frequent word patterns.

Patterns focus on words and are case sensitive:

Pattern Description
[Word] Word starting with an uppercase character followed by lowercase characters
[WORD] Word with uppercase characters
[word] Word with lowercase characters
[Char] Single uppercase character
[char] Single lowercase character
[Ideogram] One of the CJK Unified Ideographs
[IdeogramSeq] Sequence of ideograms
[hiraSeq] Sequence of Japanese Hiragana characters
[kataSeq] Sequence of Japanese Katakana characters
[hangulSeq] Sequence of Korean Hangul characters
[digit] One of the Arabic numerals: 0,1,2,3,4,5,6,7,8,9
[number] Sequence of digits

When using the CS Word Pattern Frequency and CS Word Pattern Low Frequency indicators, the following strings are replaced with the following patterns:

String Pattern
A character is NOT a Word [Char] [word] [word] [WORD] [char] [Word]
someWordsINwORDS [word][Word][WORD][char][WORD]
Example123@domain.com [Word][number]@[word].[word]
anotherExample8@domain.com [word][Word][digit]@[word].[word]
袁 花木蘭88 [Ideogram] [IdeogramSeq][number]
Latin2中文 [Word][digit][IdeogramSeq]
Latin3フランス [Word][digit][kataSeq]
Latin4とうきょう [Word][digit][hiraSeq]
Latin5나는 한국 사람입니다 [Word][digit][hangulSeq] [hangulSeq] [hangulSeq]
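
The case-sensitive replacement can be approximated with an ordered list of regular expressions, as in the Python sketch below. This is a simplified illustration under stated assumptions, not the engine's tokenizer: the name cs_word_pattern is hypothetical, the Unicode ranges are approximations, and the output may differ from the engine on inputs not listed in the table.

```python
import re

# Ordered rules: earlier rules win when several could match at one position.
# Unicode ranges are assumptions that approximate each script.
CS_RULES = [
    ("[Word]", r"[A-Z][a-z]+"),
    ("[WORD]", r"[A-Z]{2,}"),
    ("[word]", r"[a-z]{2,}"),
    ("[Char]", r"[A-Z]"),
    ("[char]", r"[a-z]"),
    ("[number]", r"[0-9]{2,}"),
    ("[digit]", r"[0-9]"),
    ("[IdeogramSeq]", r"[\u4e00-\u9fff]{2,}"),
    ("[Ideogram]", r"[\u4e00-\u9fff]"),
    ("[hiraSeq]", r"[\u3041-\u309f]+"),
    ("[kataSeq]", r"[\u30a0-\u30ff]+"),
    ("[hangulSeq]", r"[\uac00-\ud7a3]+"),
]

# One capturing group per rule, so the group index identifies the rule.
CS_REGEX = re.compile("|".join("(" + pattern + ")" for _, pattern in CS_RULES))


def cs_word_pattern(value: str) -> str:
    """Replace each token with its case-sensitive label; keep other characters."""
    return CS_REGEX.sub(lambda m: CS_RULES[m.lastindex - 1][0], value)


for text in ["A character is NOT a Word", "someWordsINwORDS", "Example123@domain.com"]:
    print(text, "->", cs_word_pattern(text))
```

Characters that match no rule, such as spaces, @ and the period, pass through unchanged, which is why they appear literally in the resulting patterns.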

Case-insensitive indicators

There are two types of case-insensitive indicators:
  • The CI Word Pattern Frequency indicator evaluates the most frequent word patterns.
  • The CI Word Pattern Low Frequency indicator evaluates the least frequent word patterns.

Patterns focus on words and are case insensitive:

Pattern Description
[word] Word with lowercase characters
[char] Single lowercase character
[Ideogram] One of the CJK Unified Ideographs
[IdeogramSeq] Sequence of ideograms
[hiraSeq] Sequence of Japanese Hiragana characters
[kataSeq] Sequence of Japanese Katakana characters
[hangulSeq] Sequence of Korean Hangul characters
[digit] One of the Arabic numerals: 0,1,2,3,4,5,6,7,8,9
[number] Sequence of digits
[alnum] Alphanumeric value consisting of characters and Arabic numerals

When using the CI Word Pattern Frequency and CI Word Pattern Low Frequency indicators, the following strings are replaced with the following patterns:

String Pattern
A character is NOT a Word [char] [word] [word] [word] [char] [word]
someWordsINwORDS [word]
Example123@domain.com [alnum]@[word].[word]
anotherExample8@domain.com [alnum]@[word].[word]
袁 花木蘭88 [Ideogram] [IdeogramSeq][number]
Latin2中文 [alnum][IdeogramSeq]
Latin3フランス [alnum][kataSeq]
Latin4とうきょう [alnum][hiraSeq]
Latin5나는 한국 사람입니다 [alnum][hangulSeq] [hangulSeq] [hangulSeq]
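
A case-insensitive variant can be sketched in Python in the same spirit. Again, this is an illustration under assumptions rather than the engine's logic: the name ci_word_pattern is hypothetical, the Unicode ranges are approximations, and a run of Latin letters and digits is treated as [alnum] only when it contains at least one letter and at least one digit.

```python
import re

# Ordered rules: earlier rules win when several could match at one position.
CI_RULES = [
    # Mixed run of letters and digits: the lookaheads require at least
    # one digit and at least one letter somewhere in the run.
    ("[alnum]", r"(?=[A-Za-z]*[0-9])(?=[0-9]*[A-Za-z])[A-Za-z0-9]+"),
    ("[word]", r"[A-Za-z]{2,}"),
    ("[char]", r"[A-Za-z]"),
    ("[number]", r"[0-9]{2,}"),
    ("[digit]", r"[0-9]"),
    ("[IdeogramSeq]", r"[\u4e00-\u9fff]{2,}"),
    ("[Ideogram]", r"[\u4e00-\u9fff]"),
    ("[hiraSeq]", r"[\u3041-\u309f]+"),
    ("[kataSeq]", r"[\u30a0-\u30ff]+"),
    ("[hangulSeq]", r"[\uac00-\ud7a3]+"),
]

# One capturing group per rule (lookaheads do not capture),
# so the group index identifies the rule that matched.
CI_REGEX = re.compile("|".join("(" + pattern + ")" for _, pattern in CI_RULES))


def ci_word_pattern(value: str) -> str:
    """Replace each token with its case-insensitive label; keep other characters."""
    return CI_REGEX.sub(lambda m: CI_RULES[m.lastindex - 1][0], value)


for text in ["A character is NOT a Word", "someWordsINwORDS", "Example123@domain.com"]:
    print(text, "->", ci_word_pattern(text))
```

Because case is ignored, a mixed-case string such as someWordsINwORDS collapses into a single [word] token in this sketch.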

Word-based pattern indicators and database compatibility

The following table shows the indicators that you can select in any database:

Indicator | Supported data types with the Java analysis engine | Supported data types with the SQL analysis engine
CS Word Pattern Frequency | Number, Text, Date | None
CS Word Pattern Low Frequency | Number, Text, Date | None
CI Word Pattern Frequency | Number, Text, Date | None
CI Word Pattern Low Frequency | Number, Text, Date | None

List of engines used and database types supported when using Pattern Frequency Statistics indicators

When creating a column analysis from the Profiling perspective of Talend Studio, you can profile a database using the Pattern Frequency Statistics indicators. To execute the analysis, you can use the Java or the SQL engine depending on the type of the database you want to profile.
Engine compatibility depending on database type when using Pattern Frequency Statistics indicators
Database type Java engine SQL engine
Exasol Yes Yes
Hive Yes Yes
MySQL Yes Yes
Netezza Yes Yes
Oracle Yes Yes
PostgreSQL Yes Yes
Sybase Yes No
Teradata Yes No
Vertica Yes Yes