Data masking functions in the masking components - 7.3

Data privacy

Version
7.3
Language
English
Product
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-03-28

The masking components provide several functions. Which functions are available depends on the data type of the column.

It is advisable to use the functions predefined in the component with columns that contain personally identifiable information, such as first and last names, email addresses, addresses, SSNs, credit card numbers, bank account numbers, genders, dates of birth and salaries.

Format-preserving encryption in the masking components

The component uses Format-Preserving Encryption (FPE) methods to generate masked output values in the same format as the input values.

Note: Java 8u161 is the minimum version required to use the FF1 with AES method. To use this FPE method with Java versions earlier than 8u161, download the Java Cryptography Extension (JCE) unlimited strength jurisdiction policy files from the Oracle website.

The FPE methods are based on a National Institute of Standards and Technology (NIST) standard:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

Important: The FPE methods encrypt data to perform pseudonymization. These methods are weaker than classical encryption algorithms. If you need to preserve the data format, use the masking components. Otherwise, use the tDataEncrypt component, which provides stronger encryption.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate encrypted and repeatable masked values. These FPE methods do not use a seed.

You can specify this password in the password for FF1 method field, from the Advanced Settings of the component.

You can use tweaks so that the masking is no longer bijective, which makes the encryption stronger. A unique tweak is generated for each record and applies to all data in that record. The tweaks change at each Job execution. To unmask the data, use the tDataUnmasking component with the corresponding tweaks.
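The roles of the password and the tweak can be illustrated with a toy cipher. The sketch below is not FF1 and is not secure. It only shifts each digit by an offset derived from the password and the tweak. The names `toy_mask` and `toy_unmask` are hypothetical and are not part of any Talend API.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def _offsets(password: bytes, tweak: bytes, length: int) -> list:
    """Derive one pseudo-random offset per position from password and tweak."""
    offsets = []
    for i in range(length):
        message = tweak + i.to_bytes(4, "big")
        digest = hmac.new(password, message, hashlib.sha256).digest()
        offsets.append(int.from_bytes(digest[:4], "big") % len(DIGITS))
    return offsets


def toy_mask(value: str, password: bytes, tweak: bytes = b"") -> str:
    """Toy format-preserving mask for digit-only strings (NOT FF1)."""
    offsets = _offsets(password, tweak, len(value))
    # DIGITS.index raises ValueError if the value contains a non-digit.
    return "".join(
        DIGITS[(DIGITS.index(c) + o) % len(DIGITS)]
        for c, o in zip(value, offsets)
    )


def toy_unmask(masked: str, password: bytes, tweak: bytes = b"") -> str:
    """Reverse toy_mask; requires the same password and the same tweak."""
    offsets = _offsets(password, tweak, len(masked))
    return "".join(
        DIGITS[(DIGITS.index(c) - o) % len(DIGITS)]
        for c, o in zip(masked, offsets)
    )


password = b"example-password"   # assumption: any byte string
value = "4556737586899855"       # made-up 16-digit value

# Same password, no tweak: the masked value is repeatable.
assert toy_mask(value, password) == toy_mask(value, password)

# With a per-record tweak, unmasking needs that record's tweak.
masked = toy_mask(value, password, b"record-1")
assert toy_unmask(masked, password, b"record-1") == value
```

With a fixed password and tweak, the mapping is one-to-one on strings of the same length. When each record gets its own tweak, the same input value can produce different masked values in different records, which is why the corresponding tweaks are needed to unmask.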

Format-preserving encryption in the tDataMasking component

When using the FF1 with AES and FF1 with SHA-2 methods, input values must contain a minimum number of characters to be masked. Otherwise, the function returns null.

For example, suppose you want to mask S426A789QQ using the Keep the first n digits and replace following ones function with the following parameters:
  • FF1 with AES or FF1 with SHA-2
  • The Digits alphabet
  • "2" as the extra parameter
The input contains six digits and you keep the first two, so only 4 digits remain to be masked. This is below the minimum of 6 required for the Digits alphabet, so the function returns null.
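The arithmetic in this example can be sketched as a small check. This is an illustration of the rule described here, not the component's actual code. The function names are hypothetical, and the minimums are copied from the table in this section.

```python
# Minimum number of characters to mask, per alphabet (from the table in this section).
MIN_CHARS = {
    "Alphanumeric": 4,
    "Digits": 6,
    "Latin extended": 3,
    "Hiragana": 4,
    "Katakana": 3,
    "Kanji": 2,
    "Hangul": 2,
}


def digits_left_to_mask(value: str, keep: int) -> int:
    """Count ASCII digits that remain after keeping the first `keep` digits."""
    total = sum(1 for c in value if c in "0123456789")
    return max(total - keep, 0)


def can_mask_digits(value: str, keep: int) -> bool:
    """True if enough digits remain for the Digits alphabet."""
    return digits_left_to_mask(value, keep) >= MIN_CHARS["Digits"]


# S426A789QQ has six digits; keeping two leaves four, below the minimum of six.
assert digits_left_to_mask("S426A789QQ", 2) == 4
assert can_mask_digits("S426A789QQ", 2) is False
```

Under the same rule, a value with at least eight digits could be masked while keeping the first two.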

The minimum number of characters required in the input values varies depending on the selected Alphabet.

When selecting Best guess, the number varies depending on the alphabets represented in the input values.

Alphabet | Minimum number of characters to mask
Alphanumeric | 4
Digits | 6
Latin extended | 3
Hiragana | 4
Katakana | 3
Kanji | 2
Hangul | 2
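The source does not say how these minimums are derived. They appear consistent with requiring at least 1,000,000 possible values, that is, alphabet size raised to the number of characters must reach one million. This matches the domain-size bound in the draft revision of NIST SP 800-38G. The sketch below checks that reading against the table. The alphabet sizes are computed from the Unicode ranges listed in the Alphabets section, and the threshold is an assumption.

```python
def span(lo: int, hi: int) -> int:
    """Number of code points in the inclusive range [lo, hi]."""
    return hi - lo + 1


# Alphabet sizes derived from the ranges in the Alphabets section.
ALPHABET_SIZES = {
    "Alphanumeric": 10 + 26 + 26,
    "Digits": 10,
    "Latin extended": (10 + 26 + 26
                       + span(0xDF, 0xF6) + span(0xF8, 0xFF)
                       + span(0xC0, 0xD6) + span(0xD8, 0xDE)),
    "Hiragana": span(0x3041, 0x3096) + 3,
    "Katakana": (span(0xFF66, 0xFF9D) + span(0x30A1, 0x30FA) + 3
                 + span(0x31F0, 0x31FF)),
    # Only the two largest Kanji ranges are counted; the other ranges add
    # more characters but do not change the result.
    "Kanji": span(0x4E00, 0x9FEF) + span(0x3400, 0x4DB5),
    "Hangul": span(0xAC00, 0xD7AF),
}


def min_length(size: int, threshold: int = 1_000_000) -> int:
    """Smallest n such that size ** n >= threshold (assumed threshold)."""
    n, domain = 1, size
    while domain < threshold:
        domain *= size
        n += 1
    return n


for name, size in ALPHABET_SIZES.items():
    print(name, size, min_length(size))
```

For instance, the Digits alphabet has 10 characters and 10^6 = 1,000,000, which gives the minimum of 6 shown in the table.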

Alphabets

When using the Character handling functions (such as Replace all, Replace characters between two positions, or Replace all digits) with FPE methods, you must select an alphabet.

Characters that belong to the selected alphabet are masked with characters from the same alphabet.

When selecting the Best guess alphabet, masked values contain characters from all character types represented in the input values. Best guess is the default alphabet.

Any unrecognized character is copied to the output as is.
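The classification step behind Best guess and the pass-through of unrecognized characters can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation. Only a subset of the character types from the table of supported alphabets is included, and the function names are hypothetical.

```python
# Subset of the character types in the table of supported alphabets.
CHARACTER_TYPES = {
    "Latin numbers": [(0x0030, 0x0039)],
    "Latin lower-cased letters": [(0x0061, 0x007A)],
    "Latin upper-cased letters": [(0x0041, 0x005A)],
    "Hiragana": [(0x3041, 0x3096), (0x30FC, 0x30FC), (0x309D, 0x309E)],
    "Hangul": [(0xAC00, 0xD7AF)],
}


def character_type(ch: str):
    """Return the character type of `ch`, or None if it is not recognized."""
    cp = ord(ch)
    for name, ranges in CHARACTER_TYPES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def analyze(value: str):
    """Return the character types represented and the positions copied as is."""
    types = {character_type(c) for c in value} - {None}
    copied = [i for i, c in enumerate(value) if character_type(c) is None]
    return types, copied


types, copied = analyze("ab-12 ひら")
# The hyphen (index 2) and the space (index 5) are not recognized,
# so a masking function would copy them to the output unchanged.
assert copied == [2, 5]
assert types == {"Latin lower-cased letters", "Latin numbers", "Hiragana"}
```

In this reading, a Best guess masked value for the input above would draw on lower-cased Latin letters, Latin numbers and Hiragana, with the hyphen and space left in place.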

The following alphabets are supported:

Alphabet | Character type | Unicode range (version 11.0) | Corresponding characters
Alphanumeric | Latin numbers | [0030-0039] | [0-9]
Alphanumeric | Latin lower-cased letters | [0061-007A] | [a-z]
Alphanumeric | Latin upper-cased letters | [0041-005A] | [A-Z]
Digits | Latin numbers | [0030-0039] | [0-9]
Latin extended | Latin numbers | [0030-0039] | [0-9]
Latin extended | Latin lower-cased letters | [0061-007A] | [a-z]
Latin extended | Latin extended lower-cased letters | [00DF-00F6] [00F8-00FF] | [ß-ö] [ø-ÿ]
Latin extended | Latin upper-cased letters | [0041-005A] | [A-Z]
Latin extended | Latin extended upper-cased letters | [00C0-00D6] [00D8-00DE] | [À-Ö] [Ø-Þ]
Hiragana | Hiragana | [3041-3096] 30FC 309D 309E | [ぁ-ゖ] ー ゝ ゞ
Katakana | Half-width Katakana | [FF66-FF9D] (https://www.unicode.org/charts/PDF/UFF00.pdf) | [ヲ-ン]
Katakana | Full-width Katakana | [30A1-30FA] 30FC 30FD 30FE | [ァ-ヺ] ー ヽ ヾ
Katakana | Phonetic extension | [31F0-31FF] | [ㇰ-ㇿ]
Kanji | Kanji | CJK Unified Ideographs and Extension A: [4E00-9FEF] [3400-4DB5] | [一-U+9FEF] [㐀-䶵]
Kanji | Kanji | CJK Extension B: [20000-2A6D6] | [𠀀-𪛖]
Kanji | Kanji | CJK Extension C: [2A700-2B734] | [𪜀-𫜴]
Kanji | Kanji | CJK Extension D: [2B740-2B81D] | [𫝀-𫠝]
Kanji | Kanji | CJK Extension E: [2B820-2CEA1] | [U+2B820-U+2CEA1]
Kanji | Kanji | CJK Extension F: [2CEB0-2EBE0] | [U+2CEB0-U+2EBE0]
Kanji | Kanji | CJK Compatibility Ideographs: [F900-FA6D] [FA70-FAD9] | [豈-舘] [U+FA70-U+FAD9]
Kanji | Kanji | CJK Compatibility Ideographs Supplement: [2F800-2FA1D] | [U+2F800-U+2FA1D]
Kanji | Kanji | KangXi Radicals: [2F00-2FD5] | [⼀-⿕]
Kanji | Kanji | CJK Radicals Supplement: [2E80-2E99] [2E9B-2EF3] | [⺀-⺙] [⺛-⻳]
Kanji | Kanji | CJK Symbols and Punctuation: [3005-3005] [3007-3007] [3021-3029] [3038-303B] | [々-々] [〇-〇] [〡-〩] [〸-〻]
Hangul | Hangul | [AC00-D7AF] | [가-힯]