Adding a new dictionary-based semantic type

Enriching semantic types


You can create a semantic type based on a closed dictionary in Talend Dictionary Service and add it to the list of recognized data types in Talend Data Preparation.

In Talend Data Preparation, not every type of data can currently be matched with one of the predefined semantic types. The counties of the United Kingdom, for example, are not currently recognized as such.

Suppose that you work for a British company whose customers all reside in the United Kingdom. In this example, you need to clean some customer data, such as names, email addresses, and counties of residence. By default, the semantic type for the column containing the county data is set to city. Some of the data may actually match names of cities, but you want a semantic type that is more specific to your data: the UK_counties semantic type, in this case.

You will create this new semantic type in Talend Dictionary Service, and it will be automatically available in Talend Data Preparation so that your data can be matched with a proper type.

Procedure

  1. Create a .txt file containing the exhaustive list of the British counties and save it as DICT_UK_COUNTIES.txt.

    You must enter only one entry per line.

    Unlike an open dictionary, whose purpose is only to identify data, this exhaustive list acts as a closed dictionary of values used to both identify and validate data in Talend Data Preparation. Data that exactly matches one of the listed values is categorized as a British county.

  2. Add this file to the <Dictionary_Service_Path>/command-line/samples/source folder.

    This folder is used for the sake of this example, but you can save the file in your preferred location.

  3. Open a command prompt window.
  4. Using the cd command, go to the <Dictionary_Service_Path>/command-line folder.
  5. To create the new UK_counties semantic type in Talend Dictionary Service and configure its different parameters, execute the following command according to your operating system:
    • Windows: category_manager.bat -c -name UK_counties -type DICT -cmpl true -desc "Counties of the United Kingdom" -src samples\source\DICT_UK_COUNTIES.txt
    • Linux: ./category_manager.sh -c -name UK_counties -type DICT -cmpl true -desc "Counties of the United Kingdom" -src samples/source/DICT_UK_COUNTIES.txt

    Note that the command must be entered on a single line.

    You are prompted for your Talend Administration Center credentials. The command is executed after you enter a valid login and password.

    The -cmpl attribute stands for completeness and determines whether the dictionary you are adding is open or closed. It is set to false by default, but in this case it must be set to true because the dictionary is closed.

    The UK_counties semantic type is now added to the list of categories in Talend Dictionary Service.

  6. Go back to Talend Data Preparation and open the dataset with the column containing the county names.

    The change in semantic types is instantly available in Talend Data Preparation, but you need to manually refresh the column to make it visible in your existing datasets and preparations.

  7. To make the changes in semantic types active, do one of the following:
    • Import your dataset again.
    • Make a copy of the column whose semantic type you want to update, COUNTY in this example.

    The column type now matches the newly created category.
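
Because a closed dictionary validates data by exact match, stray whitespace, blank lines, or duplicate entries in the source file are worth removing before running the creation command. The following is a minimal sketch for Linux, not part of Talend Dictionary Service: raw_counties.txt and its three sample values are hypothetical, and the real file must contain the exhaustive list of counties.

```shell
#!/bin/sh
# Hypothetical raw input: a trailing space, an empty line, and a duplicate
# are included on purpose to show the cleanup.
printf '%s\n' 'Kent ' '' 'Essex' 'Kent' 'Cornwall' > raw_counties.txt

# Remove carriage returns and trailing whitespace, drop empty lines,
# then sort and deduplicate so that each value appears once, one per line.
tr -d '\r' < raw_counties.txt \
  | sed 's/[[:space:]]*$//' \
  | grep -v '^$' \
  | LC_ALL=C sort -u > DICT_UK_COUNTIES.txt

# Show the cleaned dictionary file.
cat DICT_UK_COUNTIES.txt
```

The resulting DICT_UK_COUNTIES.txt can then be copied to the folder used in step 2.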

Results

Your data is now matched with the UK_counties semantic type that you manually created in Talend Dictionary Service. From now on, when you import new datasets containing names of British counties, those names are automatically matched with the proper type.
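
To get a rough idea of which values in a column would fail validation against the closed dictionary, you can compare the column with the dictionary file outside of Talend. The sketch below only illustrates exact matching with standard grep options. It assumes a case-sensitive comparison, which may differ from how Talend Data Preparation actually matches values, and both files contain hypothetical sample data.

```shell
#!/bin/sh
# Hypothetical, shortened closed dictionary and hypothetical column values.
printf '%s\n' 'Cornwall' 'Essex' 'Kent' > DICT_UK_COUNTIES.txt
printf '%s\n' 'Kent' 'kent' 'Essex' 'Londn' > county_column.txt

# -F: treat dictionary entries as fixed strings, not regular expressions
# -x: match whole lines only (exact match)
# -v: keep the lines that do NOT match any dictionary entry
# -f: read the entries from the dictionary file
grep -Fxv -f DICT_UK_COUNTIES.txt county_column.txt > unmatched.txt

# Values that are not in the dictionary under these assumptions.
cat unmatched.txt
```

With this sample data, unmatched.txt contains kent and Londn, which are the values that do not appear verbatim in the dictionary.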

To display a list of all the available commands in Talend Dictionary Service, go to <Dictionary_Service_Path>/command-line and enter the following command according to your operating system:

  • category_manager.bat -h for Windows.
  • ./category_manager.sh -h for Linux.
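
The creation command used in the procedure is long and must stay on a single line. One way to keep it readable is to assemble it from variables and print it before running it. This is a minimal sketch for Linux that uses only the options shown in the procedure. The variable names are hypothetical, and the script prints the command instead of executing it.

```shell
#!/bin/sh
# Values taken from the example in the procedure.
NAME='UK_counties'
DESC='Counties of the United Kingdom'
SRC='samples/source/DICT_UK_COUNTIES.txt'

# Build the whole invocation as one line. The escaped quotes keep the
# description together as a single argument when the line is pasted.
CMD="./category_manager.sh -c -name $NAME -type DICT -cmpl true -desc \"$DESC\" -src $SRC"

# Dry run: print the command so that it can be checked and copied.
echo "$CMD"
```

The printed line can be pasted at the prompt from the <Dictionary_Service_Path>/command-line folder.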