Adding a new regular expression-based semantic type - 7.1

Talend Data Stewardship User Guide

You can create a semantic type based on a regular expression in Talend Dictionary Service and add it to the list of recognized data types in Talend Data Stewardship.

In Talend Dictionary Service, not every type of data can currently be matched with and validated against one of the predefined semantic types. For example, Italian social security numbers, also known as codice fiscale, are not currently recognized.

About this task

Suppose that you work for an Italian company that deals only with Italian customers. In this example, you need to manage customer data such as names, email addresses, and social security numbers. When defining the data model in Talend Data Stewardship, you have to set the semantic type of the column containing the social security number to text, because there is no predefined semantic type for Italian social security numbers. To match and validate this type of data more precisely, you want to create a more specific category: a codice_fiscale semantic type in this case.

You can create this new semantic type in Talend Dictionary Service, and it will be automatically available in Talend Data Stewardship so that your data can be matched with and validated against a proper type.

Procedure

  1. Create a .txt file containing a regular expression for the codice fiscale and save it as REGEX_codice_fiscale.txt.

    This regular expression is designed to match the Italian codice fiscale, which is an alphanumeric code of 16 characters.

  2. Add this file to the <Dictionary_Service_Path>/command-line/samples/source folder.
    This folder is used for this example, but you can save the file to any location you prefer.
  3. Open a command prompt window and use the cd command to go to the <Dictionary_Service_Path>/command-line folder.
  4. To create the new codice_fiscale semantic type in Talend Dictionary Service and configure its parameters, enter the command for your operating system on a single line and run it:
    • On Windows: category_manager.bat -c -name codice_fiscale -type REGEX -desc "Italian social security number" -src samples\source\REGEX_codice_fiscale.txt
    • On Linux: ./category_manager.sh -c -name codice_fiscale -type REGEX -desc "Italian social security number" -src samples/source/REGEX_codice_fiscale.txt
    You are prompted for your Talend Administration Center credentials. The command is executed after you enter a valid login and password.

    The codice_fiscale semantic type is now added to the list of categories in Talend Dictionary Service.

  5. Go back to Talend Data Stewardship and create the data model for the Italian customer data.
    The new codice_fiscale semantic category is now available in the list of semantic types, and you can set it for the column containing the social security number.
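
Step 1 relies on a regular expression for the codice fiscale. As a minimal sketch, assuming the standard 16-character layout (six letters, two digits, one letter, two digits, one letter, three digits, one final letter), the following Python snippet writes a candidate pattern to REGEX_codice_fiscale.txt and tries it on a few sample values before the file is registered. The pattern, the function names, and the sample values are illustrative assumptions, not necessarily the expression intended by this guide. The sketch checks only the layout: it does not verify the final check character, and it ignores the variant codes in which some digits are replaced by letters.

```python
import re
from pathlib import Path

# Assumed pattern for the standard 16-character layout (illustrative only):
# 6 letters, 2 digits, 1 letter, 2 digits, 1 letter, 3 digits, 1 letter.
CODICE_FISCALE_PATTERN = r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$"


def write_pattern_file(path):
    """Write the pattern, and nothing else, to the given text file."""
    Path(path).write_text(CODICE_FISCALE_PATTERN, encoding="utf-8")


def matches_pattern(value):
    """Return True if the value has the assumed codice fiscale layout."""
    return re.fullmatch(CODICE_FISCALE_PATTERN, value) is not None


if __name__ == "__main__":
    write_pattern_file("REGEX_codice_fiscale.txt")
    # Made-up sample values, used only to exercise the pattern.
    for sample in ["RSSMRA85T10A562S", "rssmra85t10a562s", "RSSMRA85T10A562"]:
        print(sample, matches_pattern(sample))
```

Trying the pattern locally on known good and bad values is a cheap way to catch a typo in the expression before the semantic type is created.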

Results

When you load the customer data to Talend Data Stewardship, data is now matched with and validated against the codice_fiscale semantic type that you created in Talend Dictionary Service.

To display a list of all the available commands in Talend Dictionary Service, go to <Dictionary_Service_Path>/command-line and enter the command for your operating system:
  • On Windows: category_manager.bat -h
  • On Linux: ./category_manager.sh -h
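
The creation command and the help command differ between operating systems only in the launcher script and the path separator. As a rough sketch, the following Python helpers assemble the argument list for either platform from the options shown in this procedure. The function names and the system parameter are hypothetical; the helpers only build the command and do not run it. The resulting command is meant to be run from the <Dictionary_Service_Path>/command-line folder.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath


def launcher(system=None):
    """Return the launcher script for the given (or current) operating system."""
    system = system or platform.system()
    return "category_manager.bat" if system == "Windows" else "./category_manager.sh"


def create_command(name, description, file_name, system=None):
    """Build the argument list that creates a REGEX semantic type.

    The source file is assumed to sit in samples/source, as in this example.
    """
    system = system or platform.system()
    path_type = PureWindowsPath if system == "Windows" else PurePosixPath
    source = path_type("samples", "source", file_name)
    return [launcher(system), "-c", "-name", name, "-type", "REGEX",
            "-desc", description, "-src", str(source)]


def help_command(system=None):
    """Build the argument list that lists the available commands."""
    return [launcher(system), "-h"]


if __name__ == "__main__":
    print(create_command("codice_fiscale", "Italian social security number",
                         "REGEX_codice_fiscale.txt"))
    print(help_command())
```

Keeping the arguments in a list rather than a single string avoids quoting problems with the description, which contains spaces.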