tDataMasking properties for Apache Spark Streaming - 7.1

Data privacy

Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Studio
Data Governance > Third-party systems > Data Quality components > Data privacy components
Data Quality and Preparation > Third-party systems > Data Quality components > Data privacy components
Design and Development > Third-party systems > Data Quality components > Data privacy components

These properties are used to configure tDataMasking running in the Spark Streaming Job framework.

The Spark Streaming tDataMasking component belongs to the Data Quality family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

The output schema of this component contains one read-only column, ORIGINAL_MARK. This column indicates whether a record is an original record (true) or a substitute record (false).
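The role of the ORIGINAL_MARK flag can be pictured with a minimal Python sketch. Everything here is illustrative: the `mask_value` function and the row layout are hypothetical stand-ins, not Talend's implementation.

```python
# Illustrative sketch only: shows how an ORIGINAL_MARK column can
# distinguish original records (True) from substitute records (False).

def mask_value(value: str) -> str:
    """Hypothetical mask: replace letters with 'X' and digits with '9'."""
    return "".join(
        "9" if c.isdigit() else "X" if c.isalpha() else c for c in value
    )

def emit_rows(record: dict) -> list:
    """Emit the original row and its masked substitute, each flagged."""
    masked = {k: mask_value(v) for k, v in record.items()}
    return [
        {**record, "ORIGINAL_MARK": True},   # original record
        {**masked, "ORIGINAL_MARK": False},  # substitute record
    ]

rows = emit_rows({"name": "Alice", "ssn": "123-45-6789"})
```

With the sample record above, the substitute row carries masked values and ORIGINAL_MARK set to false, while the original row keeps its values and carries true.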


Built-In: You create and store the schema locally for this component only.


Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.


Define in the table what fields to change and how to change them:

Input Column: Select the column from the input flow for which you want to generate similar data by modifying its values.

These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.

Function: Select the function that determines which modification is applied to generate similar substitute data. For example, you can produce similar values by replacing or adding letters or numbers, replace values with synonyms from an index file, or delete values by setting the function to null.

Before the full path to the file, you need to enter the protocol: file:///, even if you run the Job in local mode, or hdfs:// if the file is on a cluster.

The Function list will vary according to the column type. For further information about function behavior, see Function behavior in common PII.

For example, a column of the Long type has a Numeric variance option in the list, while a column of the String type does not. Also, the Function list for a Date column is date-specific; it allows you to choose the type of modification to apply to date values.

Extra Parameter: This field is used by some of the functions and is disabled when not applicable. When applicable, enter a number or a letter to determine the behavior of the selected function.

Keep format: This option applies only to String columns. Select this check box to keep the input format when using the Generate unique SSN number, Generate account number and keep original country and Generate credit card number and keep original bank functions. That is, if there are spaces, dots ('.'), hyphens ('-') or slashes ('/') in the input, the output will have the same characters.
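The idea behind format preservation can be sketched as follows. This is a conceptual illustration, assuming a simple character-class replacement; the `mask_keep_format` function and its separator set are hypothetical, not Talend's algorithm.

```python
import random

# Illustrative sketch of format-preserving masking: separators such as
# spaces, dots, hyphens and slashes stay in place while the remaining
# characters are regenerated. This mirrors the idea behind "Keep format";
# it is not the Talend implementation.

SEPARATORS = set(" .-/")

def mask_keep_format(value: str, rng: random.Random) -> str:
    out = []
    for c in value:
        if c in SEPARATORS:
            out.append(c)  # preserve the input format characters
        elif c.isdigit():
            out.append(str(rng.randint(0, 9)))  # random replacement digit
        else:
            out.append(chr(rng.randint(ord("A"), ord("Z"))))  # random letter
    return "".join(out)

masked = mask_keep_format("4532-9876-1234-5678", random.Random(42))
```

The masked card number keeps the same length and the same hyphen positions as the input, with the digits replaced.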

Advanced settings

Seed for random generator

Set a random number if you want to generate the same sample of substitute data in each execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different sample being generated. Keep this field empty if you want to generate a different sample each time you execute the Job.
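The effect of the seed can be demonstrated with any deterministic random generator. The sketch below uses Python's `random.Random` purely to illustrate the principle described above; Talend's actual generator is not specified here.

```python
import random

# Illustrative sketch: two generators seeded with the same value
# (12345678, the documented default) produce identical samples, so a
# fixed seed reproduces the same substitute data on every execution.

def sample(seed):
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(5)]

same_a = sample(12345678)
same_b = sample(12345678)
# A different seed generally yields a different sample:
other = sample(87654321)
```

Leaving the seed empty corresponds to seeding from a varying source (such as the current time), which is why each execution then produces a different sample.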

Output the original row

Select this check box to output original data rows in addition to the substitute data. Having both data rows can be useful in debug or test processes.

Should null input return null

This check box is selected by default. When it is selected, the component outputs null when input values are null. Otherwise, it returns the default value for the column type when the input is null: an empty string for string values, 0 for numeric values and the current date for date values.

This parameter has no effect on the Generate Sequence function. If the input is null, this function does not return null, even if the check box is selected.

Should empty input return empty

When this check box is selected, the component returns the input values if they are empty. Otherwise, the selected functions are applied to the input data.
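The two options above, Should null input return null and Should empty input return empty, can be pictured together in a short sketch. The `apply_masking` wrapper and its parameters are hypothetical; only the described behavior (null passthrough, per-type defaults, empty passthrough) comes from the documentation.

```python
# Illustrative sketch of the null and empty handling options described
# above. apply_masking is a hypothetical wrapper around a masking
# function; the defaults ("" for strings, 0 for numbers, the current
# date for dates) follow the documentation. A String column is assumed.

def apply_masking(value, func, null_returns_null=True, empty_returns_empty=False):
    if value is None:
        if null_returns_null:
            return None          # null input returns null
        return ""                # otherwise, the type default (String here)
    if value == "" and empty_returns_empty:
        return value             # empty input returned unchanged
    return func(value)           # otherwise, apply the selected function

masked = apply_masking("secret", lambda v: "*" * len(v))
passed_null = apply_masking(None, lambda v: "*" * len(v))
```

For example, with both options enabled, null stays null and an empty string stays empty; with them disabled, null becomes the type default and empty strings are masked like any other value.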

tStat Catcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.


Usage rule

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

This component is used as an intermediate step.

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Spark Streaming Job, see the sections describing how to create, convert and configure a Talend Spark Streaming Job of the Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent JAR files for execution, you must specify the directory in the file system to which these JAR files are transferred so that Spark can access these files:
  • Yarn mode (YARN client or YARN cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using Qubole, add a tS3Configuration to your Job to write your actual business data in the S3 system with Qubole. Without tS3Configuration, this business data is written in the Qubole HDFS system and destroyed once you shut down your cluster.
    • When using on-premise distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.