Adding a new dictionary-based semantic type - Cloud

Talend Cloud Data Inventory User Guide

Version
Cloud
Language
English
Product
Talend Cloud
Module
Talend Data Inventory
Last publication date
2024-02-28

You can create a semantic type based on a closed dictionary in the Semantic types menu, so that it is added to the list of recognized data types.

In the application, not every type of data can currently be matched with one of the predefined semantic types. The counties of the United Kingdom, for example, are not currently recognized as such.

Let's say that you work for a British company whose customers all reside in the United Kingdom. In this example, you have created a dataset containing some customer data, such as their names, email addresses, and the county they live in. The semantic type for the column containing the county data is set to city by default. Some of the data may actually match names of cities, but you might want to add a semantic type that is more specific to your data: a UK Counties semantic type in this case.

You will create this new semantic type in the dedicated menu. Once published, it is available to your datasets, so that your data can be matched with a proper type.

Procedure

  1. From the left panel of the homepage, open the Semantic type view.
    The list of all the semantic types present by default in Talend Dictionary Service opens.
  2. Click the Add semantic type button.
    The semantic type creation form opens.
  3. In the Name field, enter the name you want to give your semantic type, UK Counties in this example.
  4. In the Description field, enter List of counties in the United Kingdom.
  5. In the Type drop-down list, select Dictionary.
    This is because the semantic type will be based on an exhaustive list of values.
  6. Keep the Use for validation switch activated.

    Using a regular expression, a dictionary, or a compound type for validation means that it will be used to define which values are considered valid or invalid in a given column. The result of this validation process can be seen in the quality bar of each column in your dataset samples.

    In any case, regular expressions and dictionaries of values are used for data discovery, which calculates the matching percentage between the reference values and your data to define the semantic type of each column.

    In this example, if you were to deactivate the switch, the dictionary would only be used for data discovery, and no value would be considered invalid.

  7. In the Validation criterion drop-down list, select the restriction rule that you want to apply, Exact value for example.
    • Simplified text: Punctuation, white spaces, case, and accents are ignored during validation. For example, if Pâté-en-croûte is your reference value, pate-eN-cRoute will be considered valid but not Pâté n croûte.
    • Ignore case and accents: Case and accents are not taken into account during the validation. For example, if Pâté-en-croûte is your reference value, pate-en-croute will be considered valid but not pate en croute.
    • Exact value: The most restrictive validation rule. Data is considered valid only if it exactly matches the reference value.
  8. To add the list of counties that will make up the UK Counties semantic type in the Values field, you can:
    • Manually add each value. Click the plus icon to enter a value, and click the check icon to validate your change. Repeat for each county to add to the list.
    • Import a file containing a plain text list of UK counties. Click the import button to select the file to upload. The file format is not important, as long as the content is plain text.
      Note: You can upload up to 10 MB of content to Talend Dictionary Service per tenant.

      Download and extract the file: dict_uk_counties.zip.

      Sample of the dict_uk_counties.txt file.

      Enter each distinct value on a separate line. Values that are on the same line and separated by a comma are treated as synonyms.

      When importing a list from a file, non-alphabetical values must be protected by quotes, otherwise the file will be rejected.

    Duplicate values are not allowed. When you add values manually, each new value is checked for duplicates. When you import a file, a deduplication step is performed automatically.

    The full list of counties has been added.

  9. Click Save and publish to send the new semantic type to the Talend Dictionary Service server and make it available to the Talend Cloud Data Inventory users.

    Clicking Save as draft means that the semantic type will be stored in Talend Dictionary Service, but will not be broadcast to the Talend Cloud applications. This allows you to choose the moment when you want to make your semantic types public.

    The UK Counties type is now available in the list of semantic types with the status set as Published.

    The change in semantic types is instantly effective in Talend Cloud Data Inventory for every new dataset that you create. For existing datasets, you will need to refresh the sample in order to recalculate the quality with the new category.

  10. Go back to your dataset containing the county names.
  11. Click the Refresh button.
    Location of the Refresh button from the dataset overview.
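
The three validation criteria described in step 7 can be illustrated with a short Python sketch. This is not the Talend Dictionary Service implementation: the function names and the criterion keys ("exact", "ignore_case_and_accents", "simplified_text") are hypothetical, and the normalization rules are only inferred from the descriptions and the Pâté-en-croûte examples above.

```python
import string
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks, so that 'â' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(value: str, criterion: str) -> str:
    """Reduce a value to the form compared under a given criterion (assumed rules)."""
    if criterion == "exact":
        return value
    folded = strip_accents(value).casefold()
    if criterion == "ignore_case_and_accents":
        return folded
    if criterion == "simplified_text":
        # Also drop ASCII punctuation and white space.
        return "".join(
            ch for ch in folded
            if ch not in string.punctuation and not ch.isspace()
        )
    raise ValueError(f"unknown criterion: {criterion}")


def is_valid(value: str, reference: str, criterion: str) -> bool:
    """Compare a value with a reference value after normalizing both."""
    return normalize(value, criterion) == normalize(reference, criterion)


reference = "Pâté-en-croûte"
print(is_valid("pate-eN-cRoute", reference, "simplified_text"))          # True
print(is_valid("Pâté n croûte", reference, "simplified_text"))           # False
print(is_valid("pate-en-croute", reference, "ignore_case_and_accents"))  # True
print(is_valid("pate en croute", reference, "ignore_case_and_accents"))  # False
```

Under these assumed rules, Exact value compares the strings unchanged, which is why it is the most restrictive of the three.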

Results

Your data is now matched with the UK Counties semantic type that you manually created in Talend Dictionary Service.
New dictionary-based semantic type has been added.

From now on, when you import new datasets containing names of British counties, this data will automatically be matched with the proper type.
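
To check a values file before importing it in step 8, the file rules described there (one value per line, comma-separated synonyms on the same line, quoted non-alphabetical values, no duplicates) can be approximated in Python. The sketch below is an assumption-laden illustration, not the parser used by Talend Dictionary Service: it assumes double quotes as the quoting character, exact string comparison for both deduplication and matching, and a made-up sample that is not taken from dict_uk_counties.txt. The function names are hypothetical.

```python
import csv
import io


def parse_dictionary(text: str) -> list[list[str]]:
    """Return one list of synonyms per non-empty line, skipping values already seen."""
    seen: set[str] = set()
    entries: list[list[str]] = []
    # csv.reader handles the comma separators and the quoted values.
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        synonyms = []
        for raw in row:
            value = raw.strip()
            if value and value not in seen:
                seen.add(value)
                synonyms.append(value)
        if synonyms:
            entries.append(synonyms)
    return entries


def matching_percentage(column: list[str], entries: list[list[str]]) -> float:
    """Share of column cells found among the reference values, as a percentage."""
    if not column:
        return 0.0
    reference = {value for synonyms in entries for value in synonyms}
    matches = sum(1 for cell in column if cell in reference)
    return 100.0 * matches / len(column)


# Made-up sample: a duplicate line, a synonym pair, and a quoted value.
sample = 'Kent\nCounty Durham, Durham\n"Tyne & Wear"\nKent\n'
entries = parse_dictionary(sample)
print(entries)  # [['Kent'], ['County Durham', 'Durham'], ['Tyne & Wear']]
print(matching_percentage(["Kent", "Durham", "London", "Kent"], entries))  # 75.0
```

The second function mirrors the data discovery idea from step 6: the higher the share of column values found in the dictionary, the more likely the column is to be assigned that semantic type. How the application actually computes the percentage and which threshold it applies are not stated here.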