Adding a new dictionary-based semantic type - 7.1

Talend Data Stewardship User Guide


You can create a semantic type based on a dictionary in Talend Dictionary Service and add it to the list of recognized data types in Talend Data Stewardship. Duplicate values are not allowed in a dictionary-based semantic type: they serve no purpose and can slow down processing.

In Talend Data Stewardship, not every type of data can currently be matched with one of the predefined semantic types. The counties of the United Kingdom, for example, are not currently recognized as such.

About this task

Let's say that you work for a British company whose customers all reside in the United Kingdom. In this example, you need to intervene and manage some customer data, such as names, email addresses, and the counties customers live in. When you define the data model in Talend Data Stewardship, none of the predefined semantic types fits the column containing the counties. You therefore want to add a semantic type specific to your data: the UK_counties semantic type in this case.

You can create this new semantic type in Talend Dictionary Service, and it will be automatically available in Talend Data Stewardship so that your data can be matched with and validated against a proper type.

Procedure

  1. Create a .txt file containing the exhaustive list of the British counties and save it as DICT_UK_COUNTIES.txt.
    Make sure to enter one item per line.

    Unlike an open dictionary, whose purpose is to identify data, this exhaustive list acts as a closed dictionary of values used to validate data in Talend Data Stewardship. Data that exactly matches one of the listed values is categorized as a British county.

  2. Add this file to the <Dictionary_Service_Path>/command-line/samples/source folder.
    This folder is used for the sake of this example, but you can save the file to your preferred location.
  3. Open a command prompt window and use the cd command to go to the <Dictionary_Service_Path>/command-line folder.
  4. To create the new UK_counties semantic type in Talend Dictionary Service and configure its parameters, enter the command for your operating system on a single line and run it:
    • Windows: category_manager.bat -c -name UK_counties -type DICT -cmpl true -desc "Counties of the United Kingdom" -src samples\source\DICT_UK_COUNTIES.txt
    • Linux: ./category_manager.sh -c -name UK_counties -type DICT -cmpl true -desc "Counties of the United Kingdom" -src samples/source/DICT_UK_COUNTIES.txt
    You are prompted for your Talend Administration Center credentials. The command is executed after you enter a valid login and password.

    The -cmpl attribute stands for completeness and indicates whether the dictionary you are adding is a closed dictionary. It is set to false by default, but in this case it must be set to true. Open dictionaries are not supported by Talend Data Stewardship.

    The UK_counties semantic type is now added to the list of categories in Talend Dictionary Service.

  5. Go back to Talend Data Stewardship and create a data model for the United Kingdom customer data.

    UK_counties is now available in the list of the semantic types and you can set it for the County column.
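
Step 1 asks for one item per line, and duplicate values are not allowed in a dictionary-based semantic type. The following Python sketch shows one way the source list could be cleaned before it is saved as DICT_UK_COUNTIES.txt. It is an illustration only and not part of Talend Dictionary Service: the function name dictionary_text is hypothetical, the three sample counties are placeholders rather than the exhaustive list, and only exact duplicates are removed, because this guide does not say how values that differ in case are treated.

```python
def dictionary_text(raw_entries):
    """Return dictionary file content: one entry per line, no blanks, no exact duplicates.

    Hypothetical helper for illustration; first-seen order is preserved.
    """
    seen = set()
    cleaned = []
    for entry in raw_entries:
        value = entry.strip()  # drop surrounding whitespace and line endings
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return "".join(value + "\n" for value in cleaned)


if __name__ == "__main__":
    # Placeholder sample, NOT the exhaustive list of counties.
    sample = ["Kent", "Essex", "Kent", "  Cornwall ", ""]
    print(dictionary_text(sample), end="")
```

The returned text can then be written to DICT_UK_COUNTIES.txt with any editor or file API before you run the command in step 4.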

Results

When you load data containing United Kingdom counties into Talend Data Stewardship, the values are matched with and validated against the proper semantic type: UK_counties, which you manually created in Talend Dictionary Service.
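
To make the idea of a closed dictionary concrete, the sketch below checks values against a fixed list by exact match. It only illustrates the concept described in this procedure and does not reproduce how Talend Data Stewardship performs validation internally. The function name and the sample values are hypothetical, and the case-sensitive comparison is an assumption of the sketch.

```python
def validate_against_closed_dictionary(values, dictionary_entries):
    """Pair each value with True if it exactly matches a dictionary entry, else False.

    Illustrative only: exact, case-sensitive matching is an assumption of this sketch.
    """
    allowed = set(dictionary_entries)
    return [(value, value in allowed) for value in values]


if __name__ == "__main__":
    # Placeholder entries, NOT the exhaustive list of counties.
    counties = ["Kent", "Essex", "Cornwall"]
    for value, is_valid in validate_against_closed_dictionary(["Kent", "kent", "Paris"], counties):
        print(f"{value!r}: {'valid' if is_valid else 'invalid'}")
```

With a closed dictionary, any value that is not in the list is treated as invalid, which is why the list in step 1 needs to be exhaustive.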

To display a list of all the available commands in Talend Dictionary Service, go to <Dictionary_Service_Path>/command-line and enter the following command according to your operating system:
  • Windows: category_manager.bat -h
  • Linux: ./category_manager.sh -h
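
The commands in this section differ between operating systems only in the launcher script and the path separator. The Python sketch below assembles the argument list for either system, using only the options that appear in this guide (-h, -c, -name, -type, -cmpl, -desc, -src). The function names are hypothetical, and treating every non-Windows system like Linux is an assumption. The sketch builds the command without running it, since the tool must be started from the <Dictionary_Service_Path>/command-line folder and prompts for credentials.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath


def category_manager_command(arguments, system=None):
    """Prefix `arguments` with the launcher script this guide names for the given OS."""
    system = system or platform.system()
    # Assumption: any non-Windows system uses the .sh launcher, as shown for Linux.
    launcher = "category_manager.bat" if system == "Windows" else "./category_manager.sh"
    return [launcher, *arguments]


def create_dictionary_arguments(name, description, source_parts, system=None):
    """Build the arguments from step 4 for a closed (-cmpl true) DICT semantic type."""
    system = system or platform.system()
    path_type = PureWindowsPath if system == "Windows" else PurePosixPath
    source = str(path_type(*source_parts))  # backslashes on Windows, slashes elsewhere
    return ["-c", "-name", name, "-type", "DICT", "-cmpl", "true",
            "-desc", description, "-src", source]


if __name__ == "__main__":
    for target in ("Windows", "Linux"):
        arguments = create_dictionary_arguments(
            "UK_counties",
            "Counties of the United Kingdom",
            ["samples", "source", "DICT_UK_COUNTIES.txt"],
            system=target,
        )
        print(category_manager_command(arguments, system=target))
        print(category_manager_command(["-h"], system=target))
```

Keeping the arguments as a list avoids quoting problems with the description, which contains spaces.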