You can create a semantic type based on a dictionary in
Talend Dictionary Service and add it to the list of recognized data types in
Talend Cloud Data Stewardship. However, duplicate values are not allowed in a dictionary-based semantic
type, because they add nothing and can slow down processing.
In Talend Cloud Data Stewardship, not every type of data can currently be matched with one of the predefined
semantic types. The counties of the United Kingdom, for example, are not currently
recognized as such.
About this task
Suppose you work for a British company whose customers all
reside in the United Kingdom. In this example, you need to
manage customer data such as names, email addresses,
and counties of residence. When you define the data model in Data Stewardship, no predefined
semantic type fits the column containing the counties, so you
add a semantic type specific to your data: UK_counties in this case.
You can create this new semantic type in
Talend Dictionary Service, and it will be automatically available in Data Stewardship so that
your data can be matched with and validated against a proper type.
Procedure
-
Create a text file that lists the counties of the United Kingdom.
The file can have one or multiple values per line. The maximum length of a value is 255
characters.
When you put multiple values on the same line, separate them with commas. In that case,
all values on the line are considered synonyms. Enclose non-alphabetical values in quotes;
otherwise, the file is rejected.
-
Select .
-
Enter a name and a description for the new semantic type.
-
Select the semantic type from the
Type list.
-
Keep the Use for validation switch activated.
Using a regular expression, a dictionary or a compound type for validation
means that it is used to decide which values are considered valid or
invalid in a given column. The result of this validation process can be seen
in the quality bar of each column in your datasets.
In any case, regular expressions and dictionaries of values are used for data
discovery, which calculates the matching percentage between the reference
values and your data to determine the semantic type of each column.
In this example, if you were to deactivate the switch, the dictionary would
only be used for data discovery, and no value would be considered
invalid.
-
From the Validation criterion list, select the rule to
use while matching data against the values in the dictionary:
Option | Description
Simplified text | Differences in punctuation, white space, case and accents are ignored during validation. For instance, if Pâté-en-croûte is the reference value in the dictionary, then pate-en-croute and PATE--EN CROUTE are both considered valid, but Pâté n croûte is not, because a letter is missing.
Ignore case and accents | Differences in case and accents are ignored during validation. For instance, if Pâté-en-croûte is the reference value in the dictionary, then pate-en-croute is considered valid (despite the case and accent differences), but pate en croute is not, because the dashes have been replaced with spaces.
Exact value | Very restrictive. Data is considered valid only if it exactly matches a value in the dictionary.
-
Click the icon to the right of Values and
import the text file of the counties of the United Kingdom.
You can use the icon to add values manually and the search icon
to search for values in the list.
Note: You can upload up to 10 MB of content to
Talend Dictionary Service per
tenant.
-
Click Save and
publish to send the semantic type to the Talend Dictionary Service server and make it
available to be used by Data Stewardship.
Clicking Save as draft
stores the new type on the server without propagating it to the system. The new type
is not usable until it is published. This option is useful, for example, when
you have new semantic types to deploy as part of a new project: you can create
the semantic types and save them as drafts before the go-live of the
project, then publish them on the day of go-live.
-
From the Data models page, create a data model for the
United Kingdom customer data.
UK_counties is now available in the list of semantic types
and you can set it for the County column.
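The file rules from the first step of the procedure (at most 255 characters per value, comma-separated synonyms on one line, quoted non-alphabetical values, no duplicates) can be applied with a small script before you import the file. The following is a minimal sketch, not part of Talend Dictionary Service. The function names are hypothetical, and it assumes that "alphabetical" means letters and spaces only, that quoting means wrapping a value in double quotes, and that duplicates are exact string matches. The county names are illustrative sample data.

```python
MAX_VALUE_LENGTH = 255  # per-value limit stated in the procedure


def format_value(value: str) -> str:
    """Return one value as it would be written to the dictionary file."""
    value = value.strip()
    if not value:
        raise ValueError("empty value")
    if len(value) > MAX_VALUE_LENGTH:
        raise ValueError(f"value longer than {MAX_VALUE_LENGTH} characters")
    if '"' in value:
        # The procedure does not say how to escape quotes, so this sketch refuses them.
        raise ValueError(f"value contains a double quote: {value}")
    # Assumption: "alphabetical" means letters and spaces only.
    if all(ch.isalpha() or ch == " " for ch in value):
        return value
    # Assumption: quoting means wrapping the value in double quotes.
    return f'"{value}"'


def build_dictionary_lines(rows):
    """Turn groups of synonyms into file lines, dropping exact duplicates."""
    seen = set()
    lines = []
    for row in rows:
        kept = []
        for raw in row:
            value = raw.strip()
            if value in seen:
                continue  # duplicates are not allowed in the dictionary
            seen.add(value)
            kept.append(format_value(value))
        if kept:
            # Values on the same line are treated as synonyms.
            lines.append(",".join(kept))
    return lines


# Illustrative input: each inner list is one line of synonyms.
rows = [
    ["Kent"],
    ["Devon", "Devonshire"],
    ["Kent"],  # duplicate, dropped
    ["Tyne and Wear"],
    ["Co. Durham", "County Durham"],  # "Co. Durham" contains a period, so it is quoted
]

for line in build_dictionary_lines(rows):
    print(line)
```

Writing the returned lines to a text file, one per line, gives a file in the shape that the first step describes. Whether the service applies exactly these quoting and duplicate rules is an assumption to verify against your own import results.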
Results
When you load data containing the United Kingdom counties into
Talend Cloud Data Stewardship, they are matched with and validated against the proper semantic type,
UK_counties, which you manually created in
Talend Dictionary Service.
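To make the three validation criteria from the procedure more concrete, the sketch below reproduces the documented Pâté-en-croûte examples with plain Python. It is one possible reading of the criteria, not Talend's implementation. The function names are hypothetical, and so is the assumption that "Simplified text" keeps only letters and digits after removing case and accents. The `matching_percentage` helper illustrates the idea of data discovery as the share of column values found in the dictionary; the criterion that discovery really applies is not stated in this procedure, so it is passed as a parameter.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ignore_case_and_accents(text: str) -> str:
    return strip_accents(text).casefold()


def simplified_text(text: str) -> str:
    # Assumption: punctuation and white space are dropped entirely.
    return "".join(ch for ch in ignore_case_and_accents(text) if ch.isalnum())


# Criterion names as they appear in the Validation criterion list.
NORMALIZERS = {
    "Simplified text": simplified_text,
    "Ignore case and accents": ignore_case_and_accents,
    "Exact value": lambda text: text,
}


def is_valid(value: str, dictionary, criterion: str) -> bool:
    """Check a value against the dictionary under the chosen criterion."""
    normalize = NORMALIZERS[criterion]
    return normalize(value) in {normalize(entry) for entry in dictionary}


def matching_percentage(column, dictionary, criterion: str) -> float:
    """Share of column values, in percent, that match a dictionary entry."""
    if not column:
        return 0.0
    hits = sum(is_valid(value, dictionary, criterion) for value in column)
    return 100.0 * hits / len(column)


reference = ["Pâté-en-croûte"]
print(is_valid("PATE--EN CROUTE", reference, "Simplified text"))         # True
print(is_valid("Pâté n croûte", reference, "Simplified text"))           # False
print(is_valid("pate-en-croute", reference, "Ignore case and accents"))  # True
print(is_valid("pate en croute", reference, "Ignore case and accents"))  # False
print(is_valid("pate-en-croute", reference, "Exact value"))              # False
```

With the Use for validation switch activated, values that fail this kind of check would be reported as invalid in the quality bar. With the switch deactivated, only the matching percentage would be used and no value would be marked invalid.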