Word-based pattern indicators - Cloud - 8.0

Talend Studio User Guide

Last publication date: 2024-02-29
Available in:
  • Big Data Platform
  • Cloud API Services Platform
  • Cloud Big Data Platform
  • Cloud Data Fabric
  • Cloud Data Management Platform
  • Data Fabric
  • Data Management Platform
  • Data Services Platform
  • MDM Platform
  • Real-Time Big Data Platform

Word-based pattern indicators include case-sensitive and case-insensitive indicators.

Word-based pattern indicators count the number of records for each distinct pattern and are available only with the Java engine.

You can use these indicators only with the String data type.

Case-sensitive indicators

There are two types of case-sensitive indicators:
  • The CS Word Pattern Frequency indicator evaluates the most frequent word patterns.
  • The CS Word Pattern Low Frequency indicator evaluates the least frequent word patterns.
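
The difference between the two indicators can be illustrated with a minimal Python sketch. It assumes that each record has already been reduced to a pattern string of the kind described in the table that follows. The function names, the `limit` parameter and the sample values are hypothetical and are not part of Talend Studio.

```python
from collections import Counter


def most_frequent(patterns, limit=10):
    """Return up to `limit` (pattern, count) pairs, most frequent first."""
    counts = Counter(patterns)
    # Sort by descending count, then by pattern text so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


def least_frequent(patterns, limit=10):
    """Return up to `limit` (pattern, count) pairs, least frequent first."""
    counts = Counter(patterns)
    # Sort by ascending count, then by pattern text so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return ranked[:limit]


# Hypothetical column, one pattern string per record.
sample = [
    "[Word] [Word]",
    "[Word] [Word]",
    "[word]@[word].[word]",
    "[Word] [Word]",
    "[number]",
    "[word]@[word].[word]",
]

print(most_frequent(sample, limit=2))
print(least_frequent(sample, limit=1))
```

Both functions count the number of records for each distinct pattern and differ only in which end of the ranking they report.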

Patterns focus on words and are case sensitive:

Pattern          Description
[Word]           Word starting with an uppercase character followed by lowercase characters
[WORD]           Word with uppercase characters
[word]           Word with lowercase characters
[Char]           Single uppercase character
[char]           Single lowercase character
[Ideogram]       One of the CJK Unified Ideographs
[IdeogramSeq]    Sequence of ideograms
[hiraSeq]        Sequence of Japanese Hiragana characters
[kataSeq]        Sequence of Japanese Katakana characters
[hangulSeq]      Sequence of Korean Hangul characters
[digit]          One of the Arabic numerals: 0,1,2,3,4,5,6,7,8,9
[number]         Sequence of digits

When you use the CS Word Pattern Frequency and CS Word Pattern Low Frequency indicators, the strings below are replaced with the corresponding patterns:

String                          Pattern
A character is NOT a Word       [Char] [word] [word] [WORD] [char] [Word]
someWordsINwORDS                [word][Word][WORD][char][WORD]
Example123@domain.com           [Word][number]@[word].[word]
anotherExample8@domain.com      [word][Word][digit]@[word].[word]
袁 花木蘭88                      [Ideogram] [IdeogramSeq][number]
Latin2中文                       [Word][digit][IdeogramSeq]
Latin3フランス                    [Word][digit][kataSeq]
Latin4とうきょう                  [Word][digit][hiraSeq]
Latin5나는 한국 사람입니다         [Word][digit][hangulSeq]
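
The case-sensitive replacements above can be approximated with a short Python sketch. This is not the implementation used by the Java engine. The function name `cs_word_pattern`, the rule order and the Unicode ranges are assumptions made for illustration, and the handling of whitespace inside Hangul sequences is left out.

```python
import re

# Ordered rules: multi-character classes are tried before single-character ones.
# The Unicode ranges are assumptions chosen for illustration only.
_TOKEN_RULES = [
    ("[Word]", r"[A-Z][a-z]+"),          # uppercase letter, then lowercase letters
    ("[WORD]", r"[A-Z]{2,}"),            # two or more uppercase letters
    ("[word]", r"[a-z]{2,}"),            # two or more lowercase letters
    ("[Char]", r"[A-Z]"),                # single uppercase letter
    ("[char]", r"[a-z]"),                # single lowercase letter
    ("[number]", r"[0-9]{2,}"),          # two or more digits
    ("[digit]", r"[0-9]"),               # single digit
    ("[IdeogramSeq]", r"[\u4e00-\u9fff]{2,}"),
    ("[Ideogram]", r"[\u4e00-\u9fff]"),
    ("[hiraSeq]", r"[\u3040-\u309f]+"),
    ("[kataSeq]", r"[\u30a0-\u30ff]+"),
    ("[hangulSeq]", r"[\uac00-\ud7a3]+"),
]

# One capturing group per rule; Python tries the alternatives left to right.
_COMBINED = re.compile("|".join(f"({rx})" for _, rx in _TOKEN_RULES))


def cs_word_pattern(value: str) -> str:
    """Replace each recognized run of characters with its pattern label."""

    def replace(match: re.Match) -> str:
        # lastindex is the 1-based number of the group that matched.
        return _TOKEN_RULES[match.lastindex - 1][0]

    # Characters that match no rule (spaces, '@', '.') are kept as they are.
    return _COMBINED.sub(replace, value)


for text in ["A character is NOT a Word", "someWordsINwORDS", "Latin2中文"]:
    print(text, "->", cs_word_pattern(text))
```

Because the alternatives are tried in order, a run such as `IN` falls through the `[Word]` rule and is labeled `[WORD]`, while a lone `w` followed by an uppercase letter is labeled `[char]`.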

Case-insensitive indicators

There are two types of case-insensitive indicators:
  • The CI Word Pattern Frequency indicator evaluates the most frequent word patterns.
  • The CI Word Pattern Low Frequency indicator evaluates the least frequent word patterns.

Patterns focus on words and are case insensitive:

Pattern          Description
[word]           Word with lowercase characters
[char]           Single lowercase character
[Ideogram]       One of the CJK Unified Ideographs
[IdeogramSeq]    Sequence of ideograms
[hiraSeq]        Sequence of Japanese Hiragana characters
[kataSeq]        Sequence of Japanese Katakana characters
[hangulSeq]      Sequence of Korean Hangul characters
[digit]          One of the Arabic numerals: 0,1,2,3,4,5,6,7,8,9
[number]         Sequence of digits
[alnum]          Alphanumeric value consisting of characters and Arabic numerals

When you use the CI Word Pattern Frequency and CI Word Pattern Low Frequency indicators, the strings below are replaced with the corresponding patterns:

String                          Pattern
A character is NOT a Word       [char] [word] [word] [word] [char] [word]
someWordsINwORDS                [word]
Example123@domain.com           [alnum]@[word].[word]
anotherExample8@domain.com      [alnum]@[word].[word]
袁 花木蘭88                      [Ideogram] [IdeogramSeq][number]
Latin2中文                       [word][digit][IdeogramSeq]
Latin3フランス                    [word][digit][kataSeq]
Latin4とうきょう                  [word][digit][hiraSeq]
Latin5나는 한국 사람입니다         [word][digit][hangulSeq]

Word-based pattern indicators and database compatibility

The following table lists, for each indicator, the data types supported by each analysis engine. It applies to any database:

Indicator                        Supported data types with the Java analysis engine    Supported data types with the SQL analysis engine
CS Word Pattern Frequency        Number, Text, Date                                    None
CS Word Pattern Low Frequency    Number, Text, Date                                    None
CI Word Pattern Frequency        Number, Text, Date                                    None
CI Word Pattern Low Frequency    Number, Text, Date                                    None