Data profiling details - Cloud

Talend Cloud Data Catalog User Guide

Version: Cloud
Language: English
Product: Talend Cloud
Module: Talend Data Catalog
Content: Data Governance
Last publication date: 2023-11-13

Talend Cloud Data Catalog can store and display the following data profile information for table/view and column objects:
Type Description
Inferred Datatypes [type, rows] List of matching data types and their frequency as a percentage, sorted from the highest frequency to the lowest.

The profiler detects the column data type. When a column contains data of different types, the profiler chooses the most frequent one. You can overwrite the value manually. The detected value can contradict the data type declared by the database. For example, when a VARCHAR database column contains only date values, the profiler sets the DATE data type.

The supported types are Text, Date, Time, DateTime, Geographical, and Number. Percentiles, means, median, variance, and standard deviation are not provided.
Frequency [value, rows] Distribution of values and their frequency as a percentage.
Patterns [pattern, rows] List of different patterns of data presentation discovered in the source and their frequency as a percentage.
Data Profiling Statistics
  • Profiling Date: Date of data profiling execution.
  • Count: Number of rows actually profiled, which is either the total number in the source or the limit set when defining the harvesting options.
  • Distinct: Number of distinct values. The non-distinct count is total - distinct - empty. For example, when there are one million rows and the column has far fewer distinct values, such as 10, the data is considered to be distinct.
  • Duplicate: Duplicate rows in database or in files.
  • Valid: Valid rows in database or in files.
  • Empty: Null rows in database or empty rows in files.
  • Invalid: Invalid rows in database or in files.

    The valid and invalid counts depend on the data type that was autodetected for the column. For example, if the first column was identified as INTEGER but the last record contains the value "a", which is not a valid INTEGER, that record is added to the invalid count.

  • Avg length: Average length of values.
  • Min length: Minimum length of values.
  • Max length: Maximum length of values.
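
The profile information described above can be illustrated with a minimal Python sketch. It is not the profiler used by Talend Cloud Data Catalog: the function names (infer_type, to_pattern, percentages, profile_column), the pattern notation (letters as "A", digits as "9"), the single assumed date format, and the choice to compute percentages over non-empty values are all hypothetical. Only three of the supported types are detected, and "distinct" is assumed to mean the number of unique non-empty values.

```python
from collections import Counter
from datetime import datetime
import re


def infer_type(value):
    """Classify one raw string (simplified: Number, Date, or Text only)."""
    try:
        float(value)
        return "Number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")  # assumed date format
        return "Date"
    except ValueError:
        return "Text"


def to_pattern(value):
    """Hypothetical pattern notation: letters become 'A', digits become '9'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))


def percentages(counter, total):
    """Return (key, percent) pairs sorted from highest to lowest frequency."""
    return [(k, round(100 * n / total, 1)) for k, n in counter.most_common()]


def profile_column(values):
    """Profile a list of raw strings; None or '' is counted as empty."""
    non_empty = [v for v in values if v not in (None, "")]
    types = Counter(infer_type(v) for v in non_empty)
    # The most frequent type becomes the column data type.
    dominant = types.most_common(1)[0][0] if types else None
    valid = types[dominant] if dominant else 0
    lengths = [len(v) for v in non_empty]
    count = len(values)
    distinct = len(set(non_empty))
    empty = count - len(non_empty)
    return {
        "inferred_types": percentages(types, len(non_empty)),
        "data_type": dominant,
        "frequency": percentages(Counter(non_empty), len(non_empty)),
        "patterns": percentages(Counter(map(to_pattern, non_empty)), len(non_empty)),
        "count": count,
        "distinct": distinct,
        "non_distinct": count - distinct - empty,  # formula from the Distinct entry
        "empty": empty,
        "valid": valid,
        # Values that do not match the detected type count as invalid.
        "invalid": len(non_empty) - valid,
        "avg_length": sum(lengths) / len(lengths) if lengths else 0,
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }


stats = profile_column(["1", "2", "2", "a", ""])
print(stats["data_type"], stats["valid"], stats["invalid"])  # Number 3 1
```

In this example, three of the four non-empty values parse as numbers, so the column is typed as Number and the value "a" is added to the invalid count, which mirrors the INTEGER example given for the Invalid statistic.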