Further restricting the use of sensitive data - Cloud - 8.0

Data privacy

Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Studio

Shuffling changes which row a value appears on, but the values themselves remain real, so it is still advisable to mask sensitive data. Also consider the relationships between columns when shuffling data, and make sure the original data set cannot be reconstructed.

In this scenario, last names and first names are grouped together, but the email addresses are not in the same group. Consequently, after shuffling, the values in the email column no longer correspond to the values in the lname and fname columns. Since email addresses usually contain first names and last names, the email column may help attackers reconstruct the original data.

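Additionally, the address1, city and email columns are not in any group, so they are not shuffled and their values stay on their original rows. A name read from an email address can therefore be linked to the address on the same row. This means it is possible to infer, for example, that Robert Damstra lives at 1619 Stillman Court, Lynnwood.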
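
The risk described above can be illustrated with a minimal Python sketch. It is not the tDataShuffling implementation: the shuffle_groups and names_from_emails helpers, the sample rows and the first.last email format are all assumptions made for illustration. It shuffles the fname and lname columns as one group, leaves email and city in no group, and then shows that the names can be re-linked to the cities through the email addresses.

```python
import random


def shuffle_groups(rows, groups, seed=None):
    """Shuffle each group of columns independently across rows.

    Columns in the same group move together; columns in no group stay put.
    (Hypothetical helper, not the tDataShuffling algorithm.)
    """
    rng = random.Random(seed)
    shuffled = [dict(row) for row in rows]  # copy so the input stays intact
    for group in groups:
        order = list(range(len(rows)))
        rng.shuffle(order)  # one permutation per group
        for target, source in enumerate(order):
            for column in group:
                shuffled[target][column] = rows[source][column]
    return shuffled


def names_from_emails(rows):
    """Guess (fname, lname) from an assumed 'first.last@domain' email format."""
    guesses = []
    for row in rows:
        first, last = row["email"].split("@")[0].split(".")
        guesses.append((first.capitalize(), last.capitalize()))
    return guesses


# Hypothetical sample rows, not taken from the scenario's data set.
original = [
    {"fname": "Alice", "lname": "Martin",
     "email": "alice.martin@example.com", "city": "Springfield"},
    {"fname": "Brian", "lname": "Okafor",
     "email": "brian.okafor@example.com", "city": "Riverton"},
    {"fname": "Chloe", "lname": "Nguyen",
     "email": "chloe.nguyen@example.com", "city": "Lakewood"},
]

# Only the name columns are grouped, so only they are shuffled.
shuffled = shuffle_groups(original, groups=[["fname", "lname"]], seed=7)

# email and city were in no group, so each row still pairs a real email
# with a real city, and the email gives the name back.
recovered = [
    (first, last, row["city"])
    for (first, last), row in zip(names_from_emails(shuffled), shuffled)
]
print(recovered)
```

Whatever permutation the name columns receive, the recovered triples match the original rows, which is why the ungrouped columns need to be masked or placed in groups of their own.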

Using this scenario, you can restrict the use of actual sensitive data even more:
  • To avoid the use of real credit card numbers, you can mask credit card numbers using the tDataMasking component.

  • To avoid the identification of customers with their email addresses, you can mask email addresses using the tDataMasking component.

  • To make it more difficult to read real addresses, you can add the address1 and city columns to other groups so that they are shuffled as well.
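
To give a rough idea of what masking the credit card and email columns achieves, here is a minimal Python sketch. It does not reproduce the functions offered by the tDataMasking component: the mask_credit_card and mask_email helpers, the choice to keep the last four digits, and the salt value are all assumptions made for illustration.

```python
import hashlib


def mask_credit_card(number):
    """Replace every digit except the last four with 'X', keeping separators.

    Keeping the last four digits is an assumption of this sketch.
    """
    total_digits = sum(ch.isdigit() for ch in number)
    seen = 0
    masked = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen > total_digits - 4 else "X")
        else:
            masked.append(ch)  # keep spaces and dashes as they are
    return "".join(masked)


def mask_email(email, salt="demo-salt"):
    """Replace the local part with a salted hash so it no longer shows a name.

    The domain is kept; the same input always gives the same masked value.
    """
    local, _, domain = email.partition("@")
    digest = hashlib.sha256((salt + local).encode("utf-8")).hexdigest()[:10]
    return f"user_{digest}@{domain}"


# Hypothetical values, not taken from the scenario's data set.
print(mask_credit_card("4111 1111 1111 1111"))
print(mask_email("alice.martin@example.com"))
```

Once the local part of the email address no longer contains the first name and last name, it can no longer be used to re-link shuffled names to the unshuffled columns.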


As tDataShuffling is supported on the Spark framework, you can convert this standard Job to a Spark Batch Job by editing the Job properties. This way you do not need to redefine the settings of the components in the Job.