You can create a semantic type based on a closed dictionary in Talend Dictionary Service and add it to the list of recognized data types in Talend Data Preparation.
In Talend Data Preparation, not every type of data can currently be matched with one of the predefined semantic types. The counties of the United Kingdom, for example, are currently not recognized as such.
Let's say that you work for a British company whose customers all reside in the United Kingdom. In this example, you need to clean some customer data, such as their names, email addresses, or the county they live in. The semantic type for the column containing the counties data will be set by default to city. Some of the data may actually match names of cities, but you want to add a semantic type that is more specific to your data: the UK_counties semantic type in this case.
You will create this new semantic type in Talend Dictionary Service, and it will be automatically available in Talend Data Preparation so that your data can be matched with a proper type.
- Create a .txt file containing the exhaustive list of the
British counties and save it as DICT_UK_COUNTIES.txt.
You must enter only one entry per line.
Unlike an open dictionary, whose purpose is to identify data, this exhaustive list will act as a closed dictionary of values to identify and validate data in Talend Data Preparation. Data that exactly matches one of the listed values will be categorized as a British county.
If the source file is very large, you must split it into smaller files. For further information, see Creating a dictionary-based semantic type using a large source file.
- Add this file to the
This folder is used for the sake of this example, but you can save the file to your preferred location.
- Open a command prompt window.
- Using the cd command, go to the <Dictionary_Service_Path>/command-line folder.
- To create the new UK_counties semantic type in Talend Dictionary Service and configure its different parameters, execute the following command according to your operating system:
category_manager.bat -c -name UK_counties -type DICT -cmpl true -desc "Counties of the United Kingdom" -src samples\source\DICT_UK_COUNTIES.txt for Windows.
./category_manager.sh -c -name UK_counties -type DICT -cmpl true -desc "Counties of the United Kingdom" -src samples/source/DICT_UK_COUNTIES.txt for Linux.
Note that the command must be entered on a single line.
You are prompted for your Talend Administration Center credentials. The command is executed after you enter a valid login and password.
The -cmpl attribute stands for completeness and determines whether the dictionary you are adding is an open or a closed dictionary. It is set to false by default, but in this case it must be set to true.
The UK_counties semantic type is now added to the list of categories in Talend Dictionary Service.
- Go back to Talend Data Preparation and open the dataset with the column containing the county names.
The change in semantic types is instantly available in Talend Data Preparation, but you need to manually refresh the column to make it visible in your existing datasets and preparations.
- To make the changes in semantic types active, you can either:
- import your dataset again.
- make a copy of the column whose semantic type you want to update, COUNTY in this example.
The column type now matches the newly created category.
Your data is now matched with the UK_counties semantic type that you manually created in Talend Dictionary Service. From now on, when you import new datasets containing names of British counties, they will automatically be matched with the proper type.
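The following shell sketch illustrates two ideas from the procedure above: keeping the source file to one clean entry per line, and what exact matching against a closed dictionary means. It is only an illustration under stated assumptions. The three county names are a tiny sample rather than the exhaustive list, raw_counties.txt and the is_listed helper are hypothetical names, and the check uses plain grep, not the matching logic of Talend Dictionary Service, which may treat case or whitespace differently.

```shell
#!/bin/sh
# Sketch only: sample values and helper names are illustrative assumptions.

# A deliberately tiny sample, not the exhaustive list the procedure requires.
# It contains a blank line, stray spaces, and a duplicate on purpose.
printf 'Kent\n\n  Cornwall  \nDevon\nKent\n' > raw_counties.txt

# Keep one clean entry per line: trim whitespace, drop blank lines,
# and remove duplicates.
sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' raw_counties.txt \
  | grep -v '^$' \
  | sort -u > DICT_UK_COUNTIES.txt

# Hypothetical helper: succeeds only when the whole value equals one entry.
# -F = fixed string, -x = match the entire line, -q = quiet.
is_listed() {
  grep -Fxq -- "$1" DICT_UK_COUNTIES.txt
}

# Exact matches are reported as valid; anything else is invalid.
for value in "Kent" "Kent County" "London"; do
  if is_listed "$value"; then
    echo "$value: valid"
  else
    echo "$value: invalid"
  fi
done
```

With this sample, only Kent is reported as valid: "Kent County" contains a listed value but does not match it exactly, and London is not in the list at all.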
To display a list of all the available commands in Talend Dictionary Service, go to <Dictionary_Service_Path>/command-line and enter the following command according to your operating system:
category_manager.bat -h for Windows.
./category_manager.sh -h for Linux.