East Asia pattern frequency indicators - Cloud - 8.0

Talend Studio User Guide

Version: Cloud 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Cloud, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development
Last publication date: 2024-02-29
Available in:
  • Big Data Platform
  • Cloud API Services Platform
  • Cloud Big Data Platform
  • Cloud Data Fabric
  • Cloud Data Management Platform
  • Data Fabric
  • Data Management Platform
  • Data Services Platform
  • MDM Platform
  • Real-Time Big Data Platform

There are two East Asia pattern frequency indicators:
  • The East Asia Pattern Frequency indicator counts the records that match each distinct pattern and lists the most frequent patterns.
  • The East Asia Pattern Low Frequency indicator counts the records that match each distinct pattern and lists the least frequent patterns.

These two indicators are available only with the Java engine. They are useful when you want to identify patterns in Asian data, because they recognize East Asian scripts in addition to Latin characters.

Both indicators build a pattern by replacing each character with a symbol that stands for its character type, such as H, K, C, and G for Asian characters, following the rules in the table below:

Character type                        Usage
Latin numbers                         9 replaces all ASCII digits
Latin lowercase letters               a replaces all lowercase ASCII Latin characters
Latin uppercase letters               A replaces all uppercase ASCII Latin characters
Full-width Latin numbers              9 replaces all full-width digits
Full-width Latin lowercase letters    a replaces all full-width lowercase Latin characters
Full-width Latin uppercase letters    A replaces all full-width uppercase Latin characters
Hiragana                              H replaces all Hiragana characters
Half-width Katakana                   k replaces all half-width Katakana characters
Full-width Katakana                   K replaces all full-width Katakana characters
Katakana                              K replaces all Katakana characters
Kanji                                 C replaces Chinese characters
Hangul                                G replaces Hangul characters
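The replacement rules in the table can be illustrated with a short Python sketch. This is not how Talend Studio implements the indicators. The function name east_asia_pattern and the Unicode ranges chosen for each character type are assumptions made for illustration, and characters that match no rule are simply left unchanged.

```python
# Illustrative sketch only: the Unicode ranges below are assumptions,
# not Talend Studio's actual implementation.
_RULES = [
    (0x0030, 0x0039, "9"),  # ASCII digits
    (0x0061, 0x007A, "a"),  # ASCII lowercase letters
    (0x0041, 0x005A, "A"),  # ASCII uppercase letters
    (0xFF10, 0xFF19, "9"),  # full-width digits
    (0xFF41, 0xFF5A, "a"),  # full-width lowercase letters
    (0xFF21, 0xFF3A, "A"),  # full-width uppercase letters
    (0x3041, 0x309F, "H"),  # Hiragana
    (0xFF66, 0xFF9F, "k"),  # half-width Katakana
    (0x30A0, 0x30FF, "K"),  # full-width Katakana
    (0x4E00, 0x9FFF, "C"),  # CJK unified ideographs (Kanji / Chinese characters)
    (0xAC00, 0xD7A3, "G"),  # Hangul syllables
]


def east_asia_pattern(value: str) -> str:
    """Replace each character with the symbol of its character type."""
    out = []
    for ch in value:
        code = ord(ch)
        for low, high, symbol in _RULES:
            if low <= code <= high:
                out.append(symbol)
                break
        else:
            # No rule matched (punctuation, spaces, ...): keep the character.
            out.append(ch)
    return "".join(out)
```

With these assumed ranges, east_asia_pattern("東京都1-2-3") returns CCC9-9-9, so every address made of three Kanji followed by three hyphen-separated digits falls under the same pattern.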

Below is an example of a column analysis using the East Asia Pattern Frequency and East Asia Pattern Low Frequency indicators on an address column.

Configuration to apply the East Asia Pattern Frequency and East Asia Pattern Low Frequency indicators.

The analysis results of the East Asia Pattern Low Frequency indicator will look like the following:

Table and graphical results of the East Asia Pattern Low Frequency Statistics indicator.

These results give the number of records for each of the least frequent patterns. Some patterns mix characters and numbers, while others contain only characters, and the patterns also differ in length. This shows that the address values are not consistent, and you may need to correct and clean them.
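The counting step behind both indicators can be sketched in a few lines of Python. This is a minimal illustration and not Talend Studio's code: the function names, the limit parameter, and the sample patterns are hypothetical, and the input is assumed to be a list of patterns that have already been computed for each record.

```python
from collections import Counter


def pattern_frequencies(patterns, limit=10):
    """Return the `limit` most frequent patterns as (pattern, count) pairs."""
    counts = Counter(patterns)
    # Sort by descending count, then by pattern, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


def pattern_low_frequencies(patterns, limit=10):
    """Return the `limit` least frequent patterns as (pattern, count) pairs."""
    counts = Counter(patterns)
    # Sort by ascending count, then by pattern, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return ranked[:limit]


# Hypothetical patterns, one per record of an address column.
sample = ["CCC9-9-9", "CCC9-9-9", "CCCKKK9", "CCC9-9-9", "CCCKKK9", "CCHH"]
most_frequent = pattern_frequencies(sample, limit=2)
least_frequent = pattern_low_frequencies(sample, limit=2)
```

In this sample, the rare CCHH pattern appears at the top of the low-frequency list, which is the kind of outlier worth reviewing when cleaning the column.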

East Asia pattern frequency indicators and database compatibility

The following table shows the data types that each indicator supports with each analysis engine. You can select these indicators with any database:

Indicator                          Supported data types with the Java analysis engine    Supported data types with the SQL analysis engine
East Asia Pattern Frequency        Number, Text, Date                                    None
East Asia Pattern Low Frequency    Number, Text, Date                                    None