You can create a semantic type based on a closed dictionary in the
Semantic types menu, so that it is added to the list of
recognized data types.
In Talend Cloud Data Preparation, not every type of data can currently be matched with one of the predefined semantic
types. The counties of the United Kingdom, for example, are currently not recognized as
such.
Let's say that you work for a British company whose customers all reside in the United
Kingdom. In this example, you need to clean some customer data, such as their names,
email addresses, or the county they live in. The semantic type for the column containing
the counties data will be set by default to city. Some of the data may
actually match names of cities, but you might want to add a semantic type that is more
specific to your data: the UK Counties semantic type in this case.
You will create this new semantic type in the dedicated menu, and it will be instantly
available in your preparation, so that your data can be matched with a proper type.
Procedure
-
Click the Semantic types tab of the left menu.
The list of all the semantic types present by default in Talend Cloud Data Preparation opens. For the complete list, see Predefined Semantic Types.
-
Click the Add semantic type button.
The semantic type creation form opens.
-
In the Name field, enter the name you want to give your
semantic type, UK Counties in this example.
-
In the Description field, enter List of
counties in the United Kingdom.
-
In the Type drop-down list, select
Dictionary.
This option fits because the semantic type is based on an exhaustive list of
values.
-
Keep the Use for validation switch activated.
Using a regular expression, a dictionary or a compound type for validation
means that it will be used to define which values are considered right or
wrong in a given column. The result of this validation process can be seen
in the quality bar of each column in your datasets.
In any case, regular expressions and dictionaries of values are used for data
discovery, which calculates the matching percentage between the reference
values and your data to define the semantic type of each column.
In this example, if you were to deactivate the switch, the dictionary would
only be used for data discovery, and no value would be considered
invalid.
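The difference between data discovery and validation can be illustrated with a minimal Python sketch. This is not how Talend Cloud Data Preparation is implemented: the function names, the sample values, and the way empty cells are skipped are all assumptions made for illustration.

```python
def matching_percentage(column, reference):
    """Data discovery: share of non-empty cells found in the reference values."""
    cells = [cell for cell in column if cell]
    if not cells:
        return 0.0
    hits = sum(1 for cell in cells if cell in reference)
    return 100.0 * hits / len(cells)


def invalid_cells(column, reference, use_for_validation):
    """Validation: cells flagged as invalid, only when the switch is on."""
    if not use_for_validation:
        # Dictionary used for discovery only: nothing is considered invalid.
        return []
    return [cell for cell in column if cell and cell not in reference]


# Hypothetical sample data, not taken from a real dataset.
uk_counties = {"Kent", "Cornwall", "Devon"}
county_column = ["Kent", "Devon", "Kentt", "Cornwall"]

print(matching_percentage(county_column, uk_counties))
print(invalid_cells(county_column, uk_counties, use_for_validation=True))
print(invalid_cells(county_column, uk_counties, use_for_validation=False))
```

With the switch off, the matching percentage is still computed, but the misspelled value is no longer reported as invalid.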
-
In the Validation criterion drop-down list, select the
restriction rule that you want to apply, Exact value for
example.
- Simplified text: Punctuation, white spaces, case and accents are ignored
during validation. For example, if Pâté-en-croûte is your reference value,
pate-eN-cRoute will be considered valid but not Pâté n croûte.
- Ignore case and accents: Case and accents are not taken into account
during the validation. For example, if Pâté-en-croûte is your reference
value, pate-en-croute will be considered valid but not pate en croute.
- Exact value: The most restrictive validation rule. Data is considered
valid only if it is an exact match with the reference value.
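The three validation criteria can be approximated with the following Python sketch. It is only an illustration of the rules as described above, assuming that accents are removed through Unicode decomposition and that anything other than letters and digits counts as punctuation or white space. The function names and criterion identifiers are hypothetical and are not part of any Talend API.

```python
import unicodedata


def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def ignore_case_and_accents(text):
    """Normalize case and accents, keep punctuation and white spaces."""
    return strip_accents(text).casefold()


def simplified_text(text):
    """Normalize case and accents, then drop punctuation and white spaces."""
    return "".join(c for c in ignore_case_and_accents(text) if c.isalnum())


def is_valid(value, reference, criterion):
    """Compare a value with a reference value under the chosen criterion."""
    if criterion == "simplified_text":
        return simplified_text(value) == simplified_text(reference)
    if criterion == "ignore_case_and_accents":
        return ignore_case_and_accents(value) == ignore_case_and_accents(reference)
    if criterion == "exact_value":
        return value == reference
    raise ValueError(f"Unknown criterion: {criterion}")


reference = "Pâté-en-croûte"
print(is_valid("pate-eN-cRoute", reference, "simplified_text"))
print(is_valid("Pâté n croûte", reference, "simplified_text"))
print(is_valid("pate en croute", reference, "ignore_case_and_accents"))
```

In this sketch, Pâté n croûte fails the Simplified text check because a letter is missing, not because of its spacing.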
-
To add the list of counties that will make up the UK Counties
semantic type in the Values field, you can:
- Manually add each value. Click the plus icon to enter a value, and click the
check icon to validate your
change. Repeat for each county to add to the list.
- Import a file containing a plain text list of UK counties. Click the
import button to select the file to upload. The
file format is not important, as long as the content is plain text.
Note: You can upload up to 10 MB of
content to Talend Dictionary Service per tenant.
Retrieve the dict_uk_counties.txt file from the
Downloads tab of the documentation page.
Enter each different value on a separate line. Values that are on the
same line and separated by a comma will be considered as synonyms.
When importing a list from a file, non-alphabetical values must be
protected by quotes, otherwise the file will be rejected.
Duplicate values are not allowed. When you add values manually, each new value is
checked against the existing ones. When you import a file, a deduplication step is
automatically performed.
The full list of counties has been added.
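To check a file before importing it, the layout rules for values can be sketched in Python as follows. This is a rough approximation, not the parser used by Talend Dictionary Service: it assumes that the standard csv module reads quoted values the same way, and that deduplication keeps the first occurrence of each value. The sample content is made up for illustration and is not the content of dict_uk_counties.txt.

```python
import csv
import io


def parse_dictionary(text):
    """Return one list of synonyms per non-empty line, without duplicates."""
    seen = set()
    groups = []
    for row in csv.reader(io.StringIO(text)):
        group = []
        for raw in row:
            value = raw.strip()
            # Assumption: a value already seen anywhere in the file is dropped.
            if value and value not in seen:
                seen.add(value)
                group.append(value)
        if group:
            groups.append(group)
    return groups


# Hypothetical file content: one value per line, synonyms separated by a comma,
# one quoted value, and one duplicate line.
sample = 'Kent\nCornwall\nDevon,Devonshire\n"Tyne and Wear"\nKent\n'

for group in parse_dictionary(sample):
    print(group)
```

Each printed list corresponds to one dictionary entry, with any synonyms grouped on the same line.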
-
Click Save and publish to send the new semantic type to
the Talend Dictionary Service server and make it available to the Talend Cloud Data Preparation users.
Clicking Save as draft means that the semantic type
will be stored in Talend Dictionary Service, but will not be broadcast to the Talend Web applications. This allows you to choose the moment when you want to
make your semantic types public.
The UK Counties type is now available in the list of
semantic types with the status set as Published.
The change in semantic types is instantly effective in Talend Cloud Data Preparation for every new dataset that you import. For existing datasets, you need to
manually change the column type or reimport your dataset.
-
Go back to your dataset containing the county names.
-
Click the menu icon in the County column header and
select .
The column type now matches the newly created category.
Results
Your data is now matched with the UK Counties semantic type, which
you manually created in Talend Dictionary Service. From now on, when
you import new datasets containing names of British counties, those names will automatically
be matched with the proper type.