tPatternMasking Standard properties - 7.0

Data privacy

Author: Talend Documentation Team
Version: 7.0
Products: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Platform: Talend Studio

These properties are used to configure tPatternMasking running in the Standard Job framework.

The Standard tPatternMasking component belongs to the Data Quality family.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid using the reserved word "line" as a field name.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of this component contains one read-only column, ORIGINAL_MARK. This column indicates whether the record is an original record (true) or a substitute record (false).

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Modifications

Define in the table what fields to change and how to change them:

Column to mask: Select the column from the input flow for which you want to generate similar data by modifying its values.

You can mask data from different columns but you need to follow the order of the fields you want to mask.

Each column is processed sequentially, meaning that data masking operations will be performed on the data from the first column, the second column, and so on.

Field type: Select the field type the data belongs to.
  • Interval: When selected, set a range of numeric values used for masking purposes in the Range field, using the following syntax: "<min>,<max>".

    The number of masked characters from the input data corresponds to the number of characters of the maximum value.

    For example, "1,999" will be interpreted as "001,999", which means that three characters from the input data will be masked by a value randomly selected from the defined range of values.

  • Enumeration: When selected, enter a comma-separated list of values to be used for masking data in the Values field, using the following syntax: "value1,value2,value3".

    Each value must contain the same number of characters. For example: "30001,30002,30003" or "FR,EN".

  • Enumeration from file: When selected, set the path to the file containing a list of values to be used for masking data in the Path field. The file must contain one value per row and each value must have the same number of characters.

  • Date pattern (YYYYMMDD): When selected, set a range of years in the Date Range field, using the following syntax: "<min_year>,<max_year>".

    Years can only have four digits, for example: "1900,2100".

    The input dates to be masked must follow the YYYYMMDD pattern, for example: 20180101.

    For example, if the input date is 20180101 and the value in the Date Range is "1900,2100", 19221221 could be the output date.
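The Date pattern behavior described above can be illustrated with a minimal Python sketch. This is not the component's implementation: the function name mask_date and the use of Python's random module are assumptions made for illustration only. It validates that the input follows YYYYMMDD and falls within the year range, then draws a random valid date from that range.

```python
import calendar
import random
from datetime import datetime

def mask_date(value, date_range, rng):
    """Hypothetical helper: replace a YYYYMMDD string with a random date
    whose year lies in date_range, written as "<min_year>,<max_year>"."""
    min_year, max_year = (int(part) for part in date_range.split(","))
    try:
        if len(value) != 8:
            raise ValueError("expected exactly 8 characters")
        parsed = datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return None  # input does not follow the YYYYMMDD pattern
    if not min_year <= parsed.year <= max_year:
        return None  # input year is outside the configured date range
    year = rng.randint(min_year, max_year)
    month = rng.randint(1, 12)
    # monthrange returns (weekday of first day, number of days in month)
    day = rng.randint(1, calendar.monthrange(year, month)[1])
    return f"{year:04d}{month:02d}{day:02d}"

rng = random.Random(12345678)
masked = mask_date("20180101", "1900,2100", rng)
print(masked)  # an 8-digit date with a year between 1900 and 2100
```

The output keeps the same YYYYMMDD shape as the input, so downstream components that parse the column should not need to change.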

In the Values, Path, Range and Date Range fields, values must be enclosed in double quotes.
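As a rough illustration of the Range and Values syntaxes, the following Python sketch parses both kinds of settings. The function names parse_interval and parse_enumeration are hypothetical, and treating the surrounding double quotes as literal characters to strip is an assumption of this sketch, not a description of how Talend Studio evaluates the fields.

```python
def parse_interval(setting):
    """Hypothetical parser for a Range setting such as '"1,999"'."""
    low, high = (int(part) for part in setting.strip('"').split(","))
    # The number of masked characters is the length of the maximum value.
    width = len(str(high))
    return low, high, width

def parse_enumeration(setting):
    """Hypothetical parser for a Values setting such as '"FR,EN"'."""
    values = setting.strip('"').split(",")
    if len({len(value) for value in values}) != 1:
        raise ValueError("each value must contain the same number of characters")
    return values

low, high, width = parse_interval('"1,999"')
print(f"{low:0{width}d},{high}")      # 001,999
print(parse_enumeration('"FR,EN"'))   # ['FR', 'EN']
```

Padding the minimum to the width of the maximum reproduces the "1,999" to "001,999" interpretation given for the Interval field type.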

When the input data is invalid, meaning that a value is not included in the defined range, date range, or enumeration, the generated value is null.
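The invalid-input rule can be sketched in Python for the Interval and Enumeration field types. The helpers mask_interval and mask_enumeration are hypothetical names, and None stands in for null; the sketch only illustrates the rule and makes no claim about the component's internals.

```python
import random

def mask_interval(chunk, low, high, rng):
    """Return a zero-padded random value from [low, high],
    or None when the input is not a number inside that range."""
    if not chunk.isdigit() or not low <= int(chunk) <= high:
        return None
    return str(rng.randint(low, high)).zfill(len(str(high)))

def mask_enumeration(chunk, values, rng):
    """Return a random value from the list,
    or None when the input is not one of the listed values."""
    if chunk not in values:
        return None
    return rng.choice(values)

rng = random.Random(12345678)
print(mask_interval("042", 1, 999, rng))             # a 3-character value in 001..999
print(mask_interval("000", 1, 999, rng))             # None: 0 is below the minimum
print(mask_enumeration("DE", ["FR", "EN"], rng))     # None: "DE" is not in the list
```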

Advanced settings

Seed for random generator

Set a number to use as the seed if you want to generate the same sample of substitute data in each execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different sample being generated. Keep this field empty if you want to generate a different sample each time you execute the Job.
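The effect of the seed can be demonstrated with Python's own random generator. This is an analogy only: the component does not use Python, and mask_sample is a hypothetical function that replaces each value with a random three-character number.

```python
import random

def mask_sample(values, seed=None):
    """Hypothetical run: replace each value with a random number in 001..999."""
    # With seed=None, Python seeds the generator from a system entropy
    # source, so consecutive runs produce different samples.
    rng = random.Random(seed)
    return [str(rng.randint(1, 999)).zfill(3) for _ in values]

rows = ["123", "456", "789"]
first = mask_sample(rows, seed=12345678)
second = mask_sample(rows, seed=12345678)
print(first == second)  # True: the same seed reproduces the same sample
```

Fixing the seed is convenient for tests that compare outputs across runs. Leaving it empty is the better fit when each execution should produce an unrelated sample.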

Output the original row?

Select this check box to output original data rows in addition to the substitute data. Having both data rows can be useful in debug or test processes.
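A minimal Python sketch of how original and substitute rows might be told apart through the ORIGINAL_MARK column follows. The function emit_rows and its parameters are hypothetical, and emitting the original row before the substitute row is an assumption of this sketch.

```python
def emit_rows(row, mask_row, output_original=False):
    """Yield output rows tagged with ORIGINAL_MARK:
    True for the original row, False for the substitute row."""
    if output_original:
        yield {**row, "ORIGINAL_MARK": True}
    yield {**mask_row(row), "ORIGINAL_MARK": False}

# Stand-in masking function used only for this illustration.
def fake_mask(row):
    return {"zip": "30002"}

for out in emit_rows({"zip": "30001"}, fake_mask, output_original=True):
    print(out)
```

Filtering on ORIGINAL_MARK downstream separates the two kinds of rows again, which is what makes the option practical for debugging.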

Should Null input return NULL?

This check box is selected by default. When selected, the component outputs null when input values are null. Otherwise, it returns the default value when the input is null, that is, an empty string for string values, 0 for numeric values, and the current date for date values.

This parameter does not have an effect on the Generate Sequence function. If the input is null, this function will not return null, even if the box is checked.
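The null-handling option can be summarized by the following Python sketch. The function handle_null, the type names used as dictionary keys, and the use of None for null are illustrative assumptions; the exception for the Generate Sequence function is not modeled.

```python
from datetime import date

# Default values returned for null input when the check box is cleared,
# keyed by a hypothetical type name.
DEFAULTS = {
    "string": lambda: "",
    "numeric": lambda: 0,
    "date": date.today,
}

def handle_null(value, kind, null_returns_null=True):
    """Apply the null rule before any masking takes place."""
    if value is not None:
        return value  # non-null input goes on to masking (not shown here)
    if null_returns_null:
        return None
    return DEFAULTS[kind]()

print(handle_null(None, "string"))                            # None
print(repr(handle_null(None, "string", null_returns_null=False)))   # ''
print(handle_null(None, "numeric", null_returns_null=False))  # 0
```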

Should EMPTY input return EMPTY?

When this check box is selected, the component returns the input values if they are empty. Otherwise, the selected functions are applied to the input data.
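The empty-input option amounts to a short guard placed in front of the masking step, sketched here in Python. The names mask_unless_empty and fake_mask are hypothetical, and what a masking function produces for an empty string is not specified by this page, so the stand-in simply returns a fixed value.

```python
def mask_unless_empty(value, mask, empty_returns_empty=True):
    """Return empty input unchanged when the option is selected;
    otherwise apply the masking function to the value."""
    if value == "" and empty_returns_empty:
        return value
    return mask(value)

# Stand-in masking function used only for this illustration.
def fake_mask(value):
    return "XXX"

print(repr(mask_unless_empty("", fake_mask)))                             # ''
print(repr(mask_unless_empty("", fake_mask, empty_returns_empty=False)))  # 'XXX'
print(repr(mask_unless_empty("abc", fake_mask)))                          # 'XXX'
```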

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Usage

Usage rule

This component is an intermediary step. It requires an input flow and an output flow.