Data profiling details - 7.3

Talend Data Catalog User Guide

Version: 7.3
Language: English
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Catalog
Content: Data Governance
Last publication date: 2023-08-09

Talend Data Catalog can store and display the following data profile information for table/view and column objects:
Type Description
Inferred Datatypes [type, rows]: List of matching data types and their frequency as a percentage, sorted from highest to lowest.

The profiler detects the column data type. When a column contains values of several data types, the profiler chooses the most frequent one. You can override the value manually. The detected type can contradict the data type declared by the database. For example, when a VARCHAR database column contains only date values, the profiler sets the DATE data type.

The supported types are Text, Date, Time, DateTime, Geographical, and Number. Percentiles, means, median, variance, and standard deviation are not provided.
Frequency [value, rows]: Distribution of values and their frequency as a percentage.
Patterns [pattern, rows]: List of the different data presentation patterns discovered in the source and their frequency as a percentage.
Inferred Semantic Types: List of inferred semantic types.
Data Profiling Statistics:
  • Profiling Date: Date of data profiling execution.
  • Count: Number of rows actually profiled, which is either the total number in the source or the limit set when defining the harvesting options.
  • Distinct: Number of distinct values. The non-distinct count is derived as non-distinct = total - distinct - empty. For example, when there are one million rows and the column has far fewer distinct values, such as 10, the data is considered to be distinct.
  • Duplicate: Duplicate rows in database or in files.
  • Valid: Valid rows in database or in files.
  • Empty: Null rows in database or empty rows in files.
  • Invalid: Invalid rows in database or in files.
  • Avg length: Average length of values.
  • Min length: Minimum length of values.
  • Max length: Maximum length of values.
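
The following Python sketch illustrates how some of these values relate to each other. It is not the Talend Data Catalog implementation. The function names `infer_type` and `profile_column`, the three-type classification (Number, Date, Text), and the `YYYY-MM-DD` date format are assumptions made for illustration only. The sketch computes inferred data types as percentages of the non-empty values, sorted from highest to lowest, picks the most frequent one, and applies the formula non-distinct = total - distinct - empty.

```python
from collections import Counter
from datetime import datetime


def infer_type(value):
    """Classify one raw value as Number, Date, or Text (simplified assumption)."""
    try:
        float(value)
        return "Number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")  # assumed date format
        return "Date"
    except ValueError:
        return "Text"


def profile_column(values):
    """Return illustrative profile statistics for a list of raw string values."""
    total = len(values)
    # Treat None and the empty string as empty values (assumption).
    non_empty = [v for v in values if v is not None and v != ""]
    empty = total - len(non_empty)
    distinct = len(set(non_empty))
    lengths = [len(v) for v in non_empty]

    # Data type matches as percentages, sorted from highest to lowest.
    type_counts = Counter(infer_type(v) for v in non_empty)
    inferred = [
        (name, round(100 * n / len(non_empty), 1))
        for name, n in type_counts.most_common()
    ]

    return {
        "count": total,
        "empty": empty,
        "distinct": distinct,
        # Formula from the statistics list: total - distinct - empty.
        "non_distinct": total - distinct - empty,
        "inferred_datatypes": inferred,
        # The most frequent type wins when a column mixes several types.
        "datatype": inferred[0][0] if inferred else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else 0,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
    }


# A text column that mostly holds dates, as in the VARCHAR example.
sample = ["2023-01-05", "2023-01-05", "2023-02-10", "n/a", "", None]
profile = profile_column(sample)
print(profile["datatype"], profile["inferred_datatypes"])
print(profile["distinct"], profile["non_distinct"], profile["empty"])
```

In this sample, three of the four non-empty values parse as dates, so the sketch reports Date as the column data type even though the values are stored as text.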