These properties are used to configure tPatternMasking running in the Spark Batch Job framework.
The Spark Batch tPatternMasking component belongs to the Data Quality family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word "line" when naming the fields.
Click Sync columns to retrieve the schema from the previous component connected in the Job.
Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
The output schema of this component contains one read-only column, ORIGINAL_MARK. This column identifies by |
Built-In: You create and store the schema locally for this component only. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Modifications |
Define in the table what fields to change and how to change them:
Column to mask: Select the column from the input flow for which you want to generate similar data by modifying its values. You can mask data from different columns, but you need to follow the order of the fields you want to mask. Each column is processed sequentially: data masking operations are performed on the data from the first column, then the second column, and so on.
Field type: Select the field type the data belongs to.
In the Values, Path, Range and Date Range, values must be enclosed in double quotes. When the input data is invalid, meaning that a value is not included in the defined range, date range or in the enumeration, the generated value is |
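The validity check described under Modifications can be pictured with a short Python sketch. It is not the component's implementation: the function names, the comma-separated layout of the quoted settings and the date format are all assumptions made for illustration. The sketch only shows what "included in the enumeration, range or date range" means, and it deliberately leaves aside what the component generates for invalid input.

```python
from datetime import datetime


def unquote(setting: str) -> str:
    """Strip the enclosing double quotes that the settings are entered with."""
    if len(setting) < 2 or not (setting.startswith('"') and setting.endswith('"')):
        raise ValueError(f"setting must be enclosed in double quotes: {setting!r}")
    return setting[1:-1]


def in_enumeration(value: str, values_setting: str) -> bool:
    """True if value is one of the comma-separated entries (assumed layout)."""
    allowed = [item.strip() for item in unquote(values_setting).split(",")]
    return value in allowed


def in_range(value: str, range_setting: str) -> bool:
    """True if value is an integer inside the inclusive 'min,max' range (assumed layout)."""
    low, high = (int(part) for part in unquote(range_setting).split(","))
    try:
        number = int(value)
    except ValueError:
        return False  # not a number, so it cannot be inside the range
    return low <= number <= high


def in_date_range(value: str, date_range_setting: str, fmt: str = "%Y-%m-%d") -> bool:
    """True if value is a date inside the inclusive 'start,end' range (assumed layout and format)."""
    low, high = (
        datetime.strptime(part.strip(), fmt).date()
        for part in unquote(date_range_setting).split(",")
    )
    try:
        day = datetime.strptime(value, fmt).date()
    except ValueError:
        return False  # not a date in the expected format
    return low <= day <= high


# Hypothetical settings, written with the enclosing double quotes.
print(in_enumeration("FR", '"FR,US,DE"'))                         # True
print(in_range("1000", '"1,999"'))                                # False
print(in_date_range("2001-05-17", '"2000-01-01,2010-12-31"'))     # True
```

A value for which one of these checks returns False corresponds to the "invalid input" case mentioned above.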
Advanced settings
Seed for random generator |
Set a number to use as the seed if you want to generate the same sample of substitute data in each execution of the Job. This field is set to 12345678 by default. Repeating the execution with a different value for this field results in a different sample being generated. Keep this field empty if you want to generate a different sample each time you execute the Job. |
Output the original row? |
Select this check box to output original data rows in addition to the substitute data. Having both data rows can be useful in debug or test processes. |
Should Null input return NULL? |
This check box is selected by default. When selected, the component outputs null when input values are null. Otherwise, it returns the default value when the input is null, that is, an empty string for string values, 0 for numeric values, and the current date for date values. This parameter has no effect on the Generate Sequence function: if the input is null, this function does not return null, even if the check box is selected. |
Should EMPTY input return EMPTY? |
When this check box is selected, the component returns the input values if they are empty. Otherwise, the selected functions are applied to the input data. |
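The interplay of the seed and of the null and empty options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the component's code: `make_generator`, `default_for`, `mask_value` and the digit-replacing `replace_digits` function are hypothetical names, and the Generate Sequence exception is not modeled.

```python
import random
from datetime import date

DEFAULT_SEED = 12345678  # default value of the seed field, as stated above


def make_generator(seed=DEFAULT_SEED):
    """Seeded generator; passing None (empty field) gives a different sample per run."""
    return random.Random(seed)


def default_for(kind):
    """Defaults listed above for null input when null is not returned."""
    if kind == "string":
        return ""
    if kind == "numeric":
        return 0
    if kind == "date":
        return date.today()
    raise ValueError(f"unknown kind: {kind!r}")


def replace_digits(value, rng):
    """Hypothetical masking function: replace every digit with a random digit."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)


def mask_value(value, kind, rng, mask_fn, *, empty_returns_empty, null_returns_null=True):
    """Apply the null and empty options before calling the masking function."""
    if value is None:
        return None if null_returns_null else default_for(kind)
    if value == "" and empty_returns_empty:
        return value  # empty input is passed through unchanged
    return mask_fn(value, rng)


# Two generators built from the same seed produce the same substitute data.
first = replace_digits("AB-1234", make_generator())
second = replace_digits("AB-1234", make_generator())
print(first == second)  # True
```

Keeping the options as explicit keyword arguments makes each call site show which check boxes it assumes; only the null option is given a default here, because the text states a default for that check box only.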
Usage
Usage rule |
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent JAR files for execution, you must specify the directory in the file system to which these JAR files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |