You can create a semantic type based on a dictionary in Talend Dictionary Service and add it to the list of recognized data types in Talend Data Stewardship. However, duplicate values are not allowed in a dictionary-based semantic type as they are useless and can slow down the process.
In Talend Data Stewardship, not every type of data can currently be matched with one of the predefined semantic types. The counties of the United Kingdom, for example, are not currently recognized as such.
Let's say that you work for a British company whose customers all reside in the United Kingdom. In this example, you need to intervene and manage some customer data, such as their names, email addresses, or the counties they live in. You may wonder which semantic type to use for the column containing the counties when you define the data model in Talend Data Stewardship. Here, you want to add a semantic type specific to your data: UK_counties.
You will create this new semantic type in Talend Dictionary Service, and it will be automatically available in Talend Data Stewardship so that your data can be matched with and validated against a proper type.
Create a text file where you list the counties of the United Kingdom.
The file can have one or multiple values per line. Maximum length for a value is 255 characters.
When you use multiple values on the same line, separate them with commas. In that case, all values on the line are considered synonyms. Enclose non-alphabetical values in quotes; otherwise, the file will be rejected.
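The file rules above can be checked before import with a small script. The following is a minimal sketch, not part of Talend Dictionary Service: the function names are hypothetical, and it assumes that "non-alphabetical" means any value containing a character other than letters and spaces, that quotes inside a value are doubled CSV-style, and that duplicates are compared after trimming whitespace.

```python
MAX_VALUE_LENGTH = 255  # maximum length for a value, as stated above


def format_value(value: str) -> str:
    """Return one value ready to be written to the dictionary file."""
    value = value.strip()
    if not value:
        raise ValueError("empty value")
    if len(value) > MAX_VALUE_LENGTH:
        raise ValueError(f"value longer than {MAX_VALUE_LENGTH} characters")
    # Assumption: letters and spaces only means the value needs no quotes.
    if all(ch.isalpha() or ch == " " for ch in value):
        return value
    # Assumption: embedded double quotes are escaped by doubling them.
    return '"' + value.replace('"', '""') + '"'


def build_dictionary_lines(groups: list[list[str]]) -> list[str]:
    """Turn groups of synonyms into file lines, rejecting duplicate values."""
    seen: set[str] = set()
    lines: list[str] = []
    for group in groups:
        formatted = []
        for value in group:
            key = value.strip()
            if key in seen:
                raise ValueError(f"duplicate value: {value!r}")
            seen.add(key)
            formatted.append(format_value(value))
        # Values on the same line are treated as synonyms.
        lines.append(",".join(formatted))
    return lines


if __name__ == "__main__":
    counties = [["Kent"], ["County Durham", "Durham", "Co. Durham"]]
    for line in build_dictionary_lines(counties):
        print(line)
```

Each inner list becomes one line of the file, so a county and its alternative spellings end up together as synonyms.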
In the homepage, click
- Enter a name and a description for the new semantic type.
- Select the semantic type from the Type list.
Keep the Use for validation switch activated.
Using a regular expression, a dictionary or a compound type for validation means that it will be used to define which values are considered right or wrong in a given column. The result of this validation process can be seen in the quality bar of each column in your datasets.
In any case, regular expressions or dictionaries of values are used for data discovery, which calculates the matching percentage between the reference values and your data to define the semantic type of each column.
In this example, if you were to deactivate the switch, the dictionary would only be used for data discovery, and no value would be considered invalid.
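The difference between data discovery and validation can be illustrated with a short sketch. This is not how the product is implemented: the function names and the `use_for_validation` parameter are hypothetical, and values are compared by exact equality for simplicity, whereas the service applies the validation criterion selected for the semantic type.

```python
def match_percentage(column: list[str], dictionary: set[str]) -> float:
    """Discovery: share of column values found in the dictionary, in percent."""
    if not column:
        return 0.0
    matched = sum(1 for value in column if value in dictionary)
    return 100.0 * matched / len(column)


def invalid_values(
    column: list[str], dictionary: set[str], use_for_validation: bool = True
) -> list[str]:
    """Validation: values flagged as invalid, or none if the switch is off."""
    if not use_for_validation:
        return []
    return [value for value in column if value not in dictionary]


if __name__ == "__main__":
    uk_counties = {"Kent", "Devon", "Cornwall"}
    column = ["Kent", "Devon", "Kentt", "Cornwall"]
    print(match_percentage(column, uk_counties))
    print(invalid_values(column, uk_counties))
    print(invalid_values(column, uk_counties, use_for_validation=False))
```

With the switch off, the matching percentage is still computed, but no value is reported as invalid.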
From the Validation criterion list, select the rule to
use while matching data against the values in the dictionary:
- Simplified text: Punctuation, white spaces, case and accent are ignored during validation and data is considered as valid. For instance, if Pâté-en-croûte is the reference value in the dictionary, then pate-en-croute and PATE--EN CROUTE will both be considered valid but Pâté n croûte will not be considered valid.
- Ignore case and accents: Case and accents are ignored during validation and data is considered as valid. For instance, if Pâté-en-croûte is the reference value in the dictionary, then pate-en-croute is considered valid (despite the case and accent differences), but pate en croute is not because the dashes have been replaced with spaces.
- Exact value: Very restrictive. Data is considered as valid only if it is an exact match with the value.
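The three criteria can be approximated in a few lines to see why each example above is accepted or rejected. This is a sketch under assumptions, not the algorithm used by Talend Dictionary Service: the criterion names (`simplified_text`, `ignore_case_and_accents`, `exact`) and the functions are hypothetical, accents are removed through Unicode decomposition, and "simplified text" is modeled as keeping only letters and digits.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str, criterion: str) -> str:
    """Reduce a value to the form compared under the given criterion."""
    if criterion == "exact":
        return text
    folded = strip_accents(text).casefold()
    if criterion == "ignore_case_and_accents":
        return folded
    if criterion == "simplified_text":
        # Assumption: punctuation and white space are dropped entirely.
        return "".join(ch for ch in folded if ch.isalnum())
    raise ValueError(f"unknown criterion: {criterion!r}")


def is_valid(value: str, reference: str, criterion: str) -> bool:
    """Compare a value with a reference value under the given criterion."""
    return normalize(value, criterion) == normalize(reference, criterion)


if __name__ == "__main__":
    reference = "Pâté-en-croûte"
    for candidate in ("pate-en-croute", "PATE--EN CROUTE", "Pâté n croûte"):
        print(candidate, is_valid(candidate, reference, "simplified_text"))
```

In this model, Pâté n croûte fails even under Simplified text because a letter is missing, not because of punctuation or spacing.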
Click the icon to the right of Values and import the text file of the counties of the United Kingdom.
You can use the icon to add values manually and the search icon to search values in the list.
Click SAVE AND PUBLISH to send the semantic type to
the Talend Dictionary Service
server and make it available to be used by the system.
Clicking SAVE AS DRAFT stores the new type on the server without propagating it to the system. The new type is not usable until it is published. As a use case for this option, let's say that you have new semantic types to deploy as part of a new project. You can prepare the work by creating the semantic types and saving them as drafts before the project goes live, then publish them on the day of go-live.
Go back to Talend Data Stewardship and create a data model for the United Kingdom customer data.
When you load data containing the United Kingdom counties to Talend Data Stewardship, they are matched with and validated against the proper semantic type, UK_counties, that you manually created in Talend Dictionary Service.