Supported character types in column analyses and data masking operations - Cloud

Talend Cloud Data Management Platform Studio User Guide

When you mask data using Talend Data Preparation or the tDataMasking component, each character in the input data is replaced with a character of the same character type, within the supported Unicode ranges.
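
The following Python sketch illustrates the idea of type-preserving masking. It is not the tDataMasking implementation: the names `RANGES`, `char_type`, `mask_char` and `mask` are hypothetical, only a subset of the supported ranges is included, and leaving unsupported characters unchanged is an assumption made for this example.

```python
import random

# Subset of the supported ranges listed in this topic (inclusive code points).
# The dictionary keys are labels chosen for this sketch only.
RANGES = {
    "latin_digit": [(0x0030, 0x0039)],
    "latin_lower": [(0x0061, 0x007A), (0x00DF, 0x00F6), (0x00F8, 0x00FF)],
    "latin_upper": [(0x0041, 0x005A), (0x00C0, 0x00D6), (0x00D8, 0x00DE)],
    "hiragana": [(0x3041, 0x3096)],
    "fullwidth_katakana": [(0x30A1, 0x30FA)],
}


def char_type(ch):
    """Return the label of the range group containing ch, or None."""
    cp = ord(ch)
    for name, spans in RANGES.items():
        if any(lo <= cp <= hi for lo, hi in spans):
            return name
    return None


def mask_char(ch, rng):
    """Replace ch with a random character of the same character type."""
    name = char_type(ch)
    if name is None:
        # Assumption: characters outside the listed ranges are kept as-is.
        return ch
    lo, hi = rng.choice(RANGES[name])
    return chr(rng.randint(lo, hi))


def mask(text, seed=None):
    """Mask every character of text; a fixed seed gives repeatable output."""
    rng = random.Random(seed)
    return "".join(mask_char(ch, rng) for ch in text)
```

With this sketch, `mask("Ab3 ぁ")` returns a string of the same length in which the upper-case letter, lower-case letter, digit and Hiragana character are each replaced by another character of the same type, and the space is left untouched.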

When you create column analyses in Talend Studio, you can use the East Asia Pattern Frequency or East Asia Pattern Low Frequency indicators on data that contains Asian characters to assess the content, structure, and quality of your data.
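
As a rough illustration of what a pattern frequency indicator computes, the Python sketch below replaces each character with a symbol for its character type and counts how often each resulting pattern occurs. The symbols (`C`, `H`, `K`, `G`, `9`, `a`, `A`) and the function names are assumptions made for this example, not the symbols or API that Talend Studio uses.

```python
from collections import Counter

# (symbol, inclusive code point ranges). Symbols are hypothetical.
SYMBOLS = [
    ("9", [(0x0030, 0x0039)]),                      # Latin numbers
    ("a", [(0x0061, 0x007A)]),                      # Latin lower-cased letters
    ("A", [(0x0041, 0x005A)]),                      # Latin upper-cased letters
    ("H", [(0x3041, 0x3096)]),                      # Hiragana
    ("K", [(0x30A1, 0x30FA)]),                      # Full-width Katakana
    ("C", [(0x4E00, 0x9FEF), (0x3400, 0x4DB5)]),    # Kanji
    ("G", [(0xAC00, 0xD7AF)]),                      # Hangul
]


def to_pattern(value):
    """Replace each supported character with the symbol of its type."""
    out = []
    for ch in value:
        cp = ord(ch)
        for symbol, spans in SYMBOLS:
            if any(lo <= cp <= hi for lo, hi in spans):
                out.append(symbol)
                break
        else:
            out.append(ch)  # unsupported characters are kept unchanged
    return "".join(out)


def pattern_frequencies(values):
    """Count how many values share each pattern."""
    return Counter(to_pattern(v) for v in values)


counts = pattern_frequencies(["東京123", "大阪456", "さくら"])
most_frequent = counts.most_common()          # highest counts first
least_frequent = most_frequent[::-1]          # lowest counts first
```

In this example, the two values made of two Kanji followed by three digits share the pattern `CC999`, and the Hiragana value gets the pattern `HHH`. Sorting the counts in descending or ascending order corresponds to the frequency and low frequency views respectively.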

The following table describes the supported character types and the related Unicode ranges (version 11.0).

For more information, see the documentation for the Unicode Standard (http://unicode.org/standard/standard.html) and the character code charts (http://www.unicode.org/charts/).

| Character type | Unicode range (version 11.0) | Corresponding characters |
|---|---|---|
| Latin numbers | [0030-0039] | [0-9] |
| Latin lower-cased letters | [0061-007A] [00DF-00F6] [00F8-00FF] | [a-z] [ß-ö] [ø-ÿ] |
| Latin upper-cased letters | [0041-005A] [00C0-00D6] [00D8-00DE] | [A-Z] [À-Ö] [Ø-Þ] |
| Full-width Latin numbers | [FF10-FF19] | [０-９] |
| Full-width Latin lower-cased letters | [FF41-FF5A] | [ａ-ｚ] |
| Full-width Latin upper-cased letters | [FF21-FF3A] | [Ａ-Ｚ] |
| Hiragana | [3041-3096] 30FC 309D 309E | [ぁ-ゖ] ー ゝ ゞ |
| Half-width Katakana | [FF66-FF9D] | [ｦ-ﾝ] |
| Full-width Katakana | [30A1-30FA] 30FC 30FD 30FE | [ァ-ヺ] ー ヽ ヾ |
| | Phonetic extension: [31F0-31FF] | [ㇰ-ㇿ] |
| Kanji | [4E00-9FEF] | [一-] |
| | CJK Extension A: [3400-4DB5] | [㐀-䶵] |
| | CJK Extension B: [20000-2A6D6] | [𠀀-𪛖] |
| | CJK Extension C: [2A700-2B734] | [𪜀-𫜴] |
| | CJK Extension D: [2B740-2B81D] | [𫝀-𫠝] |
| | CJK Extension E: [2B820-2CEA1] | [-] |
| | CJK Extension F: [2CEB0-2EBE0] | [-] |
| | CJK Compatibility Ideographs: [F900-FA6D] [FA70-FAD9] | [豈-舘] [-] |
| | CJK Compatibility Ideographs Supplement: [2F800-2FA1D] | [-] |
| | KangXi Radicals: [2F00-2FD5] | [⼀-⿕] |
| | CJK Radicals Supplement: [2E80-2E99] [2E9B-2EF3] | [⺀-⺙] [⺛-⻳] |
| | CJK Symbols and Punctuation: [3005-3005] [3007-3007] [3021-3029] [3038-303B] | [々-々] [〇-〇] [〡-〩] [〸-〻] |
| Hangul | [AC00-D7AF] | [가-힯] |
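
The bracketed ranges in the table are inclusive pairs of hexadecimal code points. As a minimal sketch (the helper names `parse_ranges`, `endpoint_chars` and `range_size` are hypothetical), the following Python code parses that notation and derives the characters at each end of a range. Standalone code points that are listed without brackets, such as 30FC, are deliberately ignored by this sketch.

```python
import re

# Matches one bracketed range such as [0061-007A] or [20000-2A6D6].
RANGE_RE = re.compile(r"\[([0-9A-Fa-f]{4,6})-([0-9A-Fa-f]{4,6})\]")


def parse_ranges(cell):
    """Return (low, high) integer code point pairs found in a table cell."""
    return [(int(lo, 16), int(hi, 16)) for lo, hi in RANGE_RE.findall(cell)]


def endpoint_chars(cell):
    """Return the first and last character of each range in the cell."""
    return [(chr(lo), chr(hi)) for lo, hi in parse_ranges(cell)]


def range_size(cell):
    """Return the total number of code points covered by the cell's ranges."""
    return sum(hi - lo + 1 for lo, hi in parse_ranges(cell))


latin_lower = "[0061-007A] [00DF-00F6] [00F8-00FF]"
print(parse_ranges(latin_lower))
print(endpoint_chars("[3041-3096] 30FC 309D 309E"))
print(range_size("[0030-0039]"))
```

Ranges above FFFF, such as the CJK extensions, lie outside the Basic Multilingual Plane. In Python 3 each of these is still a single character, but in languages that use UTF-16 strings they occupy two code units.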