Data profiling details - 8.0

Talend Data Catalog User Guide

Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Data Catalog
Data Governance
Talend Data Catalog can store and display the following data profile information for table/view and column objects:
Type Description
Inferred Datatypes [type, rows] List of matching data types and their frequency as a percentage, sorted from the highest frequency to the lowest.

The profiler detects the data type of each column. When a column contains data of several types, the profiler chooses the most frequent one. You can override the value manually. The detected type may contradict the data type declared by the database. For example, when a VARCHAR database column contains only date values, the profiler sets the DATE data type.

The supported types are Text, Date, Time, DateTime, Geographical, and Number. Percentiles, mean, median, variance, and standard deviation are not provided.
Frequency [value, rows] Distribution of values and their frequency as a percentage.
Patterns [pattern, rows] List of different patterns of data presentation discovered in the source and their frequency as a percentage.
Data Profiling Statistics
  • Profiling Date: Date of data profiling execution.
  • Count: Number of rows actually profiled, which is either the total number in the source or the limit set when defining the harvesting options.
  • Distinct: The non-distinct count is computed as non-distinct = total - distinct - empty. For example, when there are one million rows and the column has far fewer distinct values, such as 10, almost all of the data is considered non-distinct.
  • Duplicate: Duplicate rows in database or in files.
  • Valid: Valid rows in database or in files.
  • Empty: Null rows in database or empty rows in files.
  • Invalid: Invalid rows in database or in files.

    The valid and invalid counts depend on the data type that was automatically detected for the column. For example, if the first column was identified as INTEGER but its value in the last record is "a", which is not a valid INTEGER, that record is added to the invalid count.

  • Avg length: Average length of values.
  • Min length: Minimum length of values.
  • Max length: Maximum length of values.