Adding a new regular expression-based semantic type - Cloud

Talend Cloud Data Inventory User Guide

Version: Cloud
Language: English
Product: Talend Cloud
Module: Talend Data Inventory
Content: Administration and Monitoring > Managing connections; Data Governance; Data Quality and Preparation > Enriching data; Data Quality and Preparation > Identifying data; Data Quality and Preparation > Managing datasets
Last publication date: 2024-02-28

You can create a semantic type based on a regular expression in Talend Dictionary Service and add it to the list of recognized data types.

In the application, not every type of data can currently be matched with one of the predefined semantic types. For example, Italian social security numbers, known as codice fiscale, are not currently recognized.

Suppose that you work for an Italian company that deals only with Italian customers. In this example, you have created a dataset containing customer data such as names, email addresses, and social security numbers. By default, the semantic type of the column containing the social security numbers is set to text. This is not specific enough, so you want to create a new category that matches this type of data: a codice fiscale semantic type in this case.

You will create this new semantic type in Talend Dictionary Service, and it will be automatically available in your dataset so that your data can be matched with a proper type.

Important: For security reasons, some regular expression features cannot be used, in particular backreferences. For more information, see the RE2/J documentation.
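
To illustrate this restriction, the following minimal Python sketch scans a pattern for backreferences before it is entered as a validation pattern. The function name find_backreferences is hypothetical and the check is a simple heuristic written for this example. It is not part of Talend Dictionary Service or RE2/J, and it does not cover every construct that RE2/J may reject.

```python
def find_backreferences(pattern: str) -> list[str]:
    """Heuristically list backreferences found in a regular expression string.

    Assumption: this is an illustrative pre-check only, not the validation
    that the product itself applies.
    """
    found = []
    in_class = False  # True while scanning inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if not in_class and nxt in "123456789":
                # Numbered backreference such as \1
                found.append("\\" + nxt)
            elif not in_class and nxt == "k" and pattern.startswith("<", i + 2):
                # Named backreference written as \k<name>
                found.append("\\k<...>")
            i += 2  # skip the escaped character
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        elif not in_class and pattern.startswith("(?P=", i):
            # Named backreference written as (?P=name)
            found.append("(?P=...)")
        i += 1
    return found


# The codice fiscale pattern used later in this procedure has no backreference.
print(find_backreferences(r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$"))
# A pattern that repeats a captured letter relies on a backreference.
print(find_backreferences(r"^([A-Z])\1$"))
```

An empty list means the heuristic found no backreference. A non-empty list names the constructs to rewrite before saving the semantic type.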

Procedure

  1. From the left panel of the homepage, open the Semantic type view.
  2. Click the Add semantic type button.
  3. In the Name field, enter codice fiscale.
  4. In the Description field, enter Italian social security number.
  5. In the Type drop-down, select Regular expression.
  6. Keep the Use for validation switch activated.

    Using a regular expression, a dictionary, or a compound type for validation means that it is used to define which values are considered valid or invalid in a given column. The result of this validation process is shown in the quality bar of each column in your datasets.

    In all cases, regular expressions and dictionaries of values are used for data discovery, which calculates the matching percentage between the reference values and your data to determine the semantic type of each column.

    In this example, if you were to deactivate the switch, the regular expression would only be used for data discovery, and no value would be considered invalid.

  7. In the Content drop-down list, select the type of content that you want to validate, Any character in this case.
    This option helps optimize performance. Only the data that matches the selected type is validated. You can choose to validate only Alphabetical or Numerical values against a regular expression, but because Italian social security numbers contain both letters and digits, you must select Any character.
  8. In the Validation pattern field, enter ^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$.
    This regular expression is designed to match the Italian codice fiscale, which is an alphanumeric code of 16 characters. Data that matches this pattern will be identified as codice fiscale.
  9. Click Save and publish to send the new semantic type to the Talend Dictionary Service server and make it available to the Talend Cloud Data Inventory users.

    Clicking Save as draft means that the semantic type is stored in Talend Dictionary Service, but is not broadcast to the Talend Cloud applications. This allows you to choose the moment when you want to make your semantic types public.

    The codice fiscale type is now available in the list of semantic types with the status set as Published.

    The change in semantic types takes effect immediately in Talend Cloud Data Inventory for every new dataset that you create. For existing datasets, you need to refresh the sample to recalculate the quality with the new category.

  10. Go back to your dataset containing the Italian social security numbers.
  11. Click the Refresh sample button.
    Location of the Refresh button from the dataset overview.
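
Before publishing, it can help to try the validation pattern from step 8 against a few sample values outside the application. The sketch below does this with Python's standard re module, which is an assumption made for convenience: the product relies on RE2/J, so behavior may differ for more advanced patterns. The sample values and the is_codice_fiscale helper are made up for illustration.

```python
import re

# Pattern copied from the Validation pattern field in step 8.
CODICE_FISCALE = re.compile(r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$")


def is_codice_fiscale(value: str) -> bool:
    """Return True when the whole value matches the 16-character pattern."""
    return CODICE_FISCALE.fullmatch(value) is not None


# Illustrative values only, not real customer data.
samples = [
    "RSSMRA85T10A562S",  # 16 characters in the expected layout
    "rssmra85t10a562s",  # lowercase letters are not covered by [A-Z]
    "RSSMRA85T10A562",   # only 15 characters
    "12345678901",       # digits only
]

for value in samples:
    status = "valid" if is_codice_fiscale(value) else "invalid"
    print(f"{value}: {status}")
```

With the Use for validation switch activated, values like the last three are the ones you would expect to see counted as invalid in the quality bar.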

Results

Your data is now matched with the codice_fiscale semantic type that you created manually in Talend Dictionary Service.
New regular expression-based semantic type has been added.

From now on, when you import new datasets containing Italian social security numbers, these values are automatically matched with the proper type.
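
The matching percentage that data discovery relies on can be pictured with a short Python sketch. This is only an approximation under stated assumptions: the matching_percentage function is hypothetical, empty values are ignored by choice, and the way Talend Cloud Data Inventory actually computes the percentage or picks a semantic type from it is not described here.

```python
import re

# Same pattern as the one entered in the Validation pattern field.
CODICE_FISCALE = re.compile(r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$")


def matching_percentage(values: list[str], pattern: re.Pattern) -> float:
    """Return the share of non-empty values that fully match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return 100.0 * hits / len(non_empty)


# Illustrative column sample: three well-formed values and one malformed value.
column = [
    "RSSMRA85T10A562S",
    "BNCLGU70A01F205X",
    "VRDGPP90C15H501Z",
    "not available",
]

print(f"{matching_percentage(column, CODICE_FISCALE):.1f}% of values match")
```

A column in which most values match the pattern is a plausible candidate for the codice fiscale type, while the remaining values are the ones that validation would flag.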