Removing a semantic type through command line interface - 2.3

Talend Data Preparation User Guide

author
Talend Documentation Team
EnrichVersion
6.5
2.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation

You can delete a semantic type in Talend Dictionary Service to remove it from the list of recognized data types in Talend Data Preparation.

This applies to both predefined and custom semantic types.

The variety of semantic types that are present by default in Talend Data Preparation may not apply to your business context. For example, a five-digit number can be interpreted as an American ZIP code, but also as a French or German one since they share the same format.

Talend Data Preparation tends to automatically match five-digit numbers with French ZIP codes. Let's say that you are working in an American company, and you only have to deal with data coming from American clients, including ZIP codes. Always getting the wrong semantic type in your columns containing ZIP codes can quickly become annoying.

In this example, the ZIP column of the dataset you are preparing can be matched with at least four types.

Using Talend Dictionary Service, you will remove the other semantic types that match the five-digit format and leave only US_POSTAL_CODE. The change is then applied instantly in Talend Data Preparation, and five-digit numbers will automatically be identified as US ZIP codes from now on.

Procedure

  1. Open a command prompt window.
  2. Using the cd command, go to the <Dictionary_Service_Path>/command-line folder.
  3. To display the names of the existing semantic types and see which ones to remove, execute the following command according to your operating system:
    • category_manager.bat -l -type REGEX for Windows.
    • ./category_manager.sh -l -type REGEX for Linux.

    You are prompted for your Talend Administration Center credentials. The command is executed after you enter a valid login and password.

    The list of semantic types based on regular expressions is displayed. Identify the names of the ones you want to remove, such as FR_POSTAL_CODE or DE_POSTAL_CODE.

  4. To remove the French postal codes semantic type, execute the following command according to your operating system:
    • category_manager.bat -d -name FR_POSTAL_CODE for Windows.
    • ./category_manager.sh -d -name FR_POSTAL_CODE for Linux.

    The FR_POSTAL_CODE semantic type has been removed from the list of recognized semantic types, and five-digit numbers will no longer be associated with French ZIP codes.

  5. Repeat this operation to remove the other semantic types that match five-digit numbers:
    • DE_POSTAL_CODE
    • FR_INSEE_CODE
  6. Go back to your preparation with the column containing ZIP codes in Talend Data Preparation.

    The change in semantic types is instantly available. Because you deleted the semantic type that was used until now, the ZIP column is automatically defined as text.

  7. To set the proper semantic type to the column, click the white arrow in the column header.
  8. Point your mouse over This column is a text and select US Postal Code.

    This time, the data from the ZIP column can only be matched with the US_POSTAL_CODE semantic type.
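
Steps 4 and 5 of the procedure can also be scripted on Linux. The following is a minimal sketch, not part of the product: the CATEGORY_MANAGER and TYPES_TO_REMOVE variables and the remove_semantic_types function are hypothetical names introduced for illustration. It only reuses the -d -name options shown above, and it assumes that the script is run from the <Dictionary_Service_Path>/command-line folder and that you can answer the credentials prompt for each invocation.

```shell
#!/bin/sh
# Illustrative wrapper around the documented command (names are hypothetical).
# CATEGORY_MANAGER defaults to the Linux script used in the procedure.
CATEGORY_MANAGER="${CATEGORY_MANAGER:-./category_manager.sh}"

# Semantic types that compete with US_POSTAL_CODE in this example.
TYPES_TO_REMOVE="FR_POSTAL_CODE DE_POSTAL_CODE FR_INSEE_CODE"

# Delete each semantic type passed as an argument, stopping at the first failure.
remove_semantic_types() {
    for type_name in "$@"; do
        echo "Removing $type_name"
        # Same options as in step 4; the tool may prompt for credentials each time.
        "$CATEGORY_MANAGER" -d -name "$type_name" || return 1
    done
}
```

To delete the three semantic types in one pass, call remove_semantic_types $TYPES_TO_REMOVE after the function definition. The function stops at the first command that returns an error, so a mistyped login does not silently skip a deletion.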

Results

You have deleted all the semantic types compatible with five-digit numbers except one. From now on, when adding new datasets, this type of data will be identified as US postal codes.
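
To double-check the result from the command line, you can list the semantic types again and look for the names you removed. The sketch below is an illustration only: find_remaining_types is a hypothetical helper, and it assumes that the listing printed by the -l -type REGEX command contains the semantic type names as plain text, which this guide does not specify in detail.

```shell
#!/bin/sh
# Hypothetical helper: reads a semantic type listing on standard input and
# prints each of the given names that still appears in it.
# Returns 1 if at least one name was found, 0 otherwise.
find_remaining_types() {
    listing=$(cat)
    status=0
    for type_name in "$@"; do
        # -w matches the name as a whole word; -F treats it as a literal string.
        if printf '%s\n' "$listing" | grep -qwF -- "$type_name"; then
            echo "$type_name"
            status=1
        fi
    done
    return "$status"
}
```

For example, ./category_manager.sh -l -type REGEX | find_remaining_types FR_POSTAL_CODE DE_POSTAL_CODE FR_INSEE_CODE prints nothing when all three types are gone. Because the command prompts for credentials, piping its output may hide the prompt; if that happens, run the listing command on its own and check the names by eye.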

To display a list of all the available commands in Talend Dictionary Service, enter the category_manager.bat -h command for Windows or the ./category_manager.sh -h command for Linux.