You can create a semantic type based on a regular expression in Talend Dictionary Service and add it to the list of recognized data types in Talend Data Preparation.
In Talend Data Preparation, not every type of data can currently be matched with one of the predefined semantic types. For example, Italian social security numbers, also known as codice fiscale, are not currently recognized.
Let's say that you work for an Italian company that only deals with Italian customers. In this example, you need to clean some customer data, such as names, email addresses, and social security numbers. By default, the semantic type for the column containing the social security numbers is set to text. This is not specific enough, so you would like to create a dedicated category to match this type of data: a codice_fiscale semantic type in this case.
You will create this new semantic type in Talend Dictionary Service, and it will be automatically available in Talend Data Preparation so that your data can be matched with a proper type.
Create a .txt file containing the following regular expression and save it as REGEX_CODICE_FISCALE.txt.
This regular expression is designed to match the Italian codice fiscale, which is an alphanumeric code of 16 characters. Data that matches that pattern in Talend Data Preparation will be identified as codice fiscale.
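Before registering a pattern, it can help to try it on a few sample values. The sketch below assumes a simplified pattern for the 16-character layout (six letters, two digits, one letter, two digits, one letter, three digits, one letter). This pattern is an illustrative assumption and not necessarily the expression used in this example; it also ignores special cases such as replacement characters in duplicate codes. The is_codice_fiscale helper and the sample values are hypothetical.

```shell
#!/bin/sh
# Hypothetical simplified pattern for the 16-character codice fiscale layout.
# Assumption: this is NOT necessarily the expression used in this example.
PATTERN='^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$'

# Return success (0) when the whole value matches the pattern.
# LC_ALL=C keeps [A-Z] limited to uppercase ASCII letters in every locale.
is_codice_fiscale() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "$PATTERN"
}

# Try the pattern on a few made-up sample values.
for value in RSSMRA85T10A562S rssmra85t10a562s RSSMRA85T10A562; do
    if is_codice_fiscale "$value"; then
        echo "$value: match"
    else
        echo "$value: no match"
    fi
done
```

Only the first sample has the full uppercase 16-character layout; the second is lowercase and the third is one character short, so both should be rejected by this simplified pattern.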
Add this file to the
This folder is only used for the sake of this example; you can save the file to your preferred location.
- Open a command prompt window.
- Using the cd command, go to the <Dictionary_Service_Path>/command-line folder.
To create the new codice_fiscale semantic type in Talend Dictionary Service and configure its different parameters, execute the following command according to your operating system:
category_manager.bat -c -name codice_fiscale -type REGEX -desc "Italian social security number" -src samples\source\REGEX_codice_fiscale.txt for Windows.
./category_manager.sh -c -name codice_fiscale -type REGEX -desc "Italian social security number" -src samples/source/REGEX_codice_fiscale.txt for Linux.
Note that the command must be entered on a single line. The file name passed to -src must also match the name of the saved file exactly, including letter case on Linux.
You are prompted for your Talend Administration Center credentials. The command is executed after you enter a valid login and password.
The codice_fiscale semantic type is now added to the list of categories in Talend Dictionary Service.
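A mistyped path or file name is an easy way for the creation command to fail, so a small pre-flight check can be useful on Linux. The following is a minimal sketch, assuming it is run from the <Dictionary_Service_Path>/command-line folder; the build_create_command helper is hypothetical and only prints the command shown above, without running it.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: print the category_manager.sh command from
# this example, but only if the regex source file exists at the given path.
# Assumption: run from <Dictionary_Service_Path>/command-line, and the path
# contains no spaces.
build_create_command() {
    src="$1"
    if [ ! -f "$src" ]; then
        echo "source file not found: $src" >&2
        return 1
    fi
    # Print the whole command on one single line.
    printf '%s\n' "./category_manager.sh -c -name codice_fiscale -type REGEX -desc \"Italian social security number\" -src $src"
}

build_create_command "samples/source/REGEX_codice_fiscale.txt" \
    || echo "Fix the path or file name, then retry."
```

If the file is found, the printed line can be copied and executed as is; otherwise the helper reports the missing path so that you can correct it first.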
Go back to Talend Data Preparation and open your dataset with the column containing the social security numbers.
The change in semantic types is instantly effective in Talend Data Preparation for every new dataset that you import. For existing datasets, you need to manually change the column type.
To apply the new codice_fiscale semantic type to your column, click the white arrow next to the column name.
The column type now matches the newly created category.
Your data is now matched with the codice_fiscale semantic type that you created manually in Talend Dictionary Service. From now on, when you import new datasets containing Italian social security numbers, those values are automatically matched with the proper type.
To display a list of all the available commands in Talend Dictionary Service, go to <Dictionary_Service_Path>/command-line and enter the following command according to your operating system:
category_manager.bat -h for Windows.
./category_manager.sh -h for Linux.