Adding a new compound semantic type - 7.3

Talend Data Preparation User Guide

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Preparation
Content: Data Quality and Preparation > Cleansing data
Last publication date: 2023-11-28

You can create a compound semantic type to group other semantic types that are published on the Talend Dictionary Service server, and add it to the list of recognized data types in Talend Data Preparation.

You can mix any semantic types when creating a compound type. A compound semantic type can also reference other compound types, provided that all of its child types are already published.

In this example, you need to prepare a file containing information about customers from the United States, the United Kingdom, Germany and France. One of the columns in this dataset contains postal codes from these different countries, and therefore in different formats. In this case, Talend Data Preparation applies the single semantic type that best matches the values in the column, US Postal code for example. As a result, the rest of the data, namely the German, French and British postal codes, is considered invalid.

To adapt Talend Data Preparation to this situation, you will create a compound type that groups the semantic types used to validate postal codes.
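The scenario above can be illustrated with a minimal Python sketch. It is not how Talend Data Preparation is implemented: the regular expressions are simplified, hypothetical stand-ins for the published semantic types, and the type names and function names are assumptions made for illustration. The sketch only shows the idea that a value is accepted by a compound type when at least one of its child types accepts it.

```python
import re

# Simplified, hypothetical patterns. The semantic types actually published
# on the Talend Dictionary Service server may use different or stricter rules.
CHILD_PATTERNS = {
    "US Postal code": re.compile(r"\d{5}(-\d{4})?"),
    "UK Postal code": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
    "DE Postal code": re.compile(r"\d{5}"),
    "FR Postal code": re.compile(r"\d{5}"),
}


def matches_type(value, type_name):
    """Return True if the whole value fits the pattern of one child type."""
    return CHILD_PATTERNS[type_name].fullmatch(value.strip()) is not None


def matches_compound(value, child_names):
    """Return True if at least one child type accepts the value."""
    return any(matches_type(value, name) for name in child_names)


column = ["90210", "SW1A 1AA", "12345-6789", "M1 1AE", "10115", "not a code"]

# Validating against a single type rejects every value in another format.
invalid_single = [v for v in column if not matches_type(v, "US Postal code")]

# Validating against the compound type only rejects values no child accepts.
invalid_compound = [v for v in column if not matches_compound(v, CHILD_PATTERNS)]

print(invalid_single)
print(invalid_compound)
```

In this format-only sketch, a five-digit German or French code such as 10115 also happens to fit the simplified US pattern, so the British values are the ones that show the difference most clearly. Real semantic types can be stricter than a plain regular expression.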

Before you begin

All the semantic types that you want to group under the compound type have been published.
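The prerequisite above can be pictured as a simple check over the child types. The following Python sketch is a hypothetical model only: the `SemanticType` class, its fields, and the `unpublished_children` function are assumptions made for illustration and do not correspond to a Talend API.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticType:
    """Hypothetical model of a semantic type, not a Talend class."""
    name: str
    published: bool
    children: list = field(default_factory=list)  # empty for non-compound types


def unpublished_children(compound):
    """List the names of child types that still need to be published."""
    return [child.name for child in compound.children if not child.published]


us = SemanticType("US Postal code", published=True)
fr_draft = SemanticType("FR Postal code", published=False)

postal = SemanticType("Postal code", published=False, children=[us, fr_draft])
print(unpublished_children(postal))

# A compound type can itself be a child, as long as it is published too.
address = SemanticType("Address codes", published=False, children=[postal])
print(unpublished_children(address))
```

Here the draft French type blocks the Postal code compound type, and the unpublished Postal code compound type in turn blocks any compound type that references it.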

Procedure

  1. Open the Semantic types view from the left panel of the Talend Data Preparation homepage and click Add semantic type.
  2. In the Name field, enter Postal code.
  3. In the Description field, enter American, British, German and French postal codes.
  4. In the Type drop-down list, select Compound type.
  5. Keep the Use for validation switch activated.

    This compound type will be used to determine which values are considered valid or invalid when it is applied to a given column. The result of this validation process can be seen in the quality bar of each column in your datasets.

    In this example, if you were to deactivate the switch, the compound type would only be used for data discovery, and no value would be considered invalid.

  6. From the Children types drop-down list, select the semantic types you want to group under this Postal code compound type.
  7. Click Save and publish to send the new compound type to the Talend Dictionary Service server and make it available to the Talend Data Preparation users.

    Clicking Save as draft means that the semantic type will be stored in Talend Dictionary Service, but will not be broadcast to the Talend Web applications. This allows you to choose the moment when you want to make your semantic types public.

    The Postal code type is now available in the list of semantic types with the status set as Published.

    The change in semantic types takes effect immediately in Talend Data Preparation for every new dataset that you import. For existing datasets, you need to manually change the column type or reimport your dataset.

  8. Go back to your dataset containing the postal codes from several countries.
  9. Click the menu icon in the header of the column containing the postal codes and select This column is a... > Postal code.

Results

Your data is now matched with the Postal code compound type that you created in Talend Dictionary Service. From now on, when you import new datasets containing postal codes, the corresponding columns will automatically be matched with the proper type.
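The effect of the Use for validation switch described in step 5 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the actual quality bar computation: `quality_counts` and `looks_like_five_digit_code` are hypothetical helpers, and the five-digit check is a deliberately crude stand-in for a real semantic type.

```python
def quality_counts(values, is_match, use_for_validation=True):
    """Count valid and invalid values the way a quality bar might report them.

    When use_for_validation is False, the type is used for discovery only,
    so no value is counted as invalid.
    """
    if use_for_validation:
        invalid = sum(1 for v in values if not is_match(v))
    else:
        invalid = 0
    return {"valid": len(values) - invalid, "invalid": invalid}


def looks_like_five_digit_code(value):
    """Crude stand-in for a semantic type check (assumption)."""
    return value.isdigit() and len(value) == 5


column = ["90210", "75008", "ABCDE", "1234"]

print(quality_counts(column, looks_like_five_digit_code))
print(quality_counts(column, looks_like_five_digit_code, use_for_validation=False))
```

With the switch activated, the two malformed values are reported as invalid. With the switch deactivated, the same column reports no invalid values, which mirrors the discovery-only behavior described in the procedure.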