Discovering semantic and data types

Talend Data Catalog User Guide

Version: 8.0
Language: English
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Catalog
Content: Data Governance
Last publication date: 2023-09-26

The data discovery calculates, for each data class, the percentage of values in a column that match it. If the result is greater than the matching threshold (50% by default), it suggests the data class. The data discovery also assigns data types.

From the Overview tab, you can see the percentages in the Inferred Datatypes and Data Classifications areas.

Inferred data classes whose percentage is greater than the value defined in the Matching Threshold field are suggested in the Data Classifications area. By default, the Matching Threshold field is set to 50%, so only data classes that score above 50% are suggested.
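The threshold rule can be sketched in a few lines of Python. This is an illustration only, not Talend code: the function name `suggest_data_classes`, the `scores` dictionary and the sample percentages are all hypothetical.

```python
def suggest_data_classes(scores, matching_threshold=50.0):
    """Return the data classes whose percentage is strictly above the threshold.

    `scores` maps a data class name to its inferred percentage (hypothetical input).
    Results are ordered from the highest to the lowest percentage.
    """
    return sorted(
        (name for name, pct in scores.items() if pct > matching_threshold),
        key=lambda name: -scores[name],
    )


scores = {"US Phone": 92.0, "Email": 50.0, "First Name": 12.5}
print(suggest_data_classes(scores))  # ['US Phone']
```

In this sketch a data class at exactly 50% is not suggested, because the text says the percentage must be greater than the threshold.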

To assign or reject a proposed data class, click the tick or cross button. You can assign more than one data class. When a data class is rejected, it is removed.

How is the percentage calculated?

This percentage is the sum of two percentages:

- One percentage represents the proportion of values matching the data class; up to 100% is allocated.
  To determine whether a value matches a data class, the data discovery applies a test that depends on the type of the data class:
  - Enumeration: Does the value match a value from the dictionary? Punctuation, case, spaces and accents are ignored.
  - Regular expression: Does the value match the regular expression?
  - Compound: Is the value matched by at least one child?
    A compound type is a group of existing data classes, called children.
  If the answer is positive, the value is considered valid.
- The other percentage represents the similarity between the column name and the name of the data class; up to 10% is allocated.
  To compare the names:
  - The Levenshtein algorithm is used. It calculates the minimum number of edits (insertion, deletion or substitution) required to transform one string into another.
  - Case and accents are ignored.
  - If the strings contain spaces, the word order is ignored. For example, US Phone and Phone US are considered identical.
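The name comparison can be approximated with the following Python sketch. The Levenshtein distance is the standard dynamic-programming version. How the product converts an edit distance into a share of the 10% is not documented here, so the scaling in `name_bonus` (similarity = 1 - distance / longest length) is an assumption, and all function names are hypothetical.

```python
import unicodedata


def normalize(name):
    """Lowercase, strip accents, and sort the words so that word order is ignored."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(sorted(no_accents.lower().split()))


def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def name_bonus(column_name, class_name, max_bonus=10.0):
    """Assumed scaling: identical names earn max_bonus, unrelated names earn 0."""
    a, b = normalize(column_name), normalize(class_name)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty names earn no bonus
    return max_bonus * (1.0 - levenshtein(a, b) / longest)


print(name_bonus("Phone US", "US Phone"))  # 10.0
```

Sorting the words before comparing is one simple way to make US Phone and Phone US identical; the actual implementation may handle word order differently.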
The total is capped at 100%. If all values match a data class and the column name is identical to the name of the data class, the result is still 100%, not 110%.
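As a rough illustration of the calculation described above, the Python sketch below computes the share of matching values for the three kinds of data class, adds a name bonus, and caps the sum at 100. It is not the product's implementation. The helper names are hypothetical, the name bonus is passed in as a number between 0 and 10, and two details are assumptions: regular expressions must match the whole value, and every value counts in the denominator.

```python
import re
import unicodedata


def simplify(value):
    """Drop accents, case, spaces and punctuation before a dictionary lookup."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed.lower() if c.isalnum())


def enumeration_matcher(dictionary):
    simplified = {simplify(entry) for entry in dictionary}
    return lambda value: simplify(value) in simplified


def regex_matcher(pattern):
    compiled = re.compile(pattern)
    # Assumption: the whole value must match the pattern.
    return lambda value: compiled.fullmatch(value) is not None


def compound_matcher(children):
    # A compound class matches when at least one child matches.
    return lambda value: any(child(value) for child in children)


def match_percentage(values, matches):
    """Percentage of values accepted by the predicate `matches`."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if matches(v)) / len(values)


def total_percentage(values, matches, name_bonus):
    """Sum of the value-match share (up to 100) and the name bonus (up to 10), capped at 100."""
    return min(100.0, match_percentage(values, matches) + name_bonus)


countries = enumeration_matcher(["France", "Côte d'Ivoire"])
values = ["france", "COTE D IVOIRE", "Atlantis", "Cote-d'Ivoire"]
print(total_percentage(values, countries, name_bonus=10.0))      # 85.0 (75 + 10)
print(total_percentage(["France"], countries, name_bonus=10.0))  # 100.0 (capped)
```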

Discovering data types

Data types are automatically assigned. You do not need to accept them.

To determine the type of a value, the data discovery runs the following checks in order:

Is the value empty?
Is the value of type boolean? true and false are the only values considered of type boolean.
Is the value of type integer?
Is the value of type decimal?
Is the value of type date?
If the value is not of one of the above types, it is considered a text value.

Because the checks run in sequence and stop at the first match, each value has exactly one type. For example, the value 5 is of type integer. It is not also counted as a text value.
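The ordered checks can be sketched as a chain of tests that returns at the first match. This Python example is only an approximation under stated assumptions: the accepted number and date formats are not specified in this section, so the integer and decimal patterns and the `%Y-%m-%d` date format are placeholders, boolean matching is assumed to be case-sensitive, and `infer_type` is a hypothetical name.

```python
import re
from datetime import datetime

INTEGER = re.compile(r"[+-]?\d+")                  # assumed integer syntax
DECIMAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")     # assumed decimal syntax


def is_date(value, fmt="%Y-%m-%d"):
    """Assumption: only one date format is recognized in this sketch."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_type(value):
    """Apply the checks in order and stop at the first one that succeeds."""
    if value == "":
        return "empty"
    if value in ("true", "false"):
        return "boolean"
    if INTEGER.fullmatch(value):
        return "integer"
    if DECIMAL.fullmatch(value):
        return "decimal"
    if is_date(value):
        return "date"
    return "text"


for sample in ["", "true", "5", "5.0", "2023-09-26", "hello"]:
    print(repr(sample), "->", infer_type(sample))
```

Because each check returns immediately, a value such as 5 never reaches the text fallback, which mirrors the one-type-per-value rule described above.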