Sampling and profiling data - 7.3

Talend Data Catalog User Guide

Version: 7.3
Language: English
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Catalog
Content: Data Governance
Last publication date: 2023-08-09

Technical and descriptive metadata can hold a wealth of information about metadata elements, but only if that information has been documented on those elements. In many cases the metadata is incomplete, and the best way to determine what it should be (for example, the semantic data type or the valid values) is to look at the data itself.

In addition to the metadata captured from a source format or tool, Talend Data Catalog can profile the actual data contained in files and tables as part of the harvesting process. At harvesting time, you can specify the number of records to profile and how many to keep as a sample for later visualization.

That information is then available when you navigate to the file or table’s page or look at its individual fields or columns.

Talend Data Catalog protects this information and shows it to authorized users only: you need the Data Viewer role to look at it. Generic profiling statistics, such as “% of distinct values”, are available to all users who can view the content.

The application can store and display the following data profiling details for table/view and column objects:
  • Counts (standard and custom counts, like empty and valid rows)
  • Values (distinct values and their counts)
  • Patterns (patterns and their counts)
  • Data types (inferred data types and their counts)
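
To make the four kinds of profiling detail concrete, the following is a minimal Python sketch of how such statistics could be computed for a single column. It is an illustration only and does not show how Talend Data Catalog computes them. The function names (profile_column, value_pattern, infer_type), the parameters profile_limit and sample_size, and the pattern notation (A, a, and 9 for uppercase letters, lowercase letters, and digits) are all assumptions made for this example.

```python
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Reduce a value to a character-class pattern (hypothetical notation)."""
    pattern = re.sub(r"[A-Z]", "A", value)    # uppercase letters -> A
    pattern = re.sub(r"[a-z]", "a", pattern)  # lowercase letters -> a
    return re.sub(r"[0-9]", "9", pattern)     # digits -> 9


def infer_type(value: str) -> str:
    """Guess a simple data type for one non-empty string value."""
    for type_name, cast in (("integer", int), ("decimal", float)):
        try:
            cast(value)
            return type_name
        except ValueError:
            pass
    return "string"


def profile_column(values, profile_limit=1000, sample_size=5):
    """Profile up to profile_limit values and keep sample_size of them as a sample."""
    rows = list(values)[:profile_limit]
    non_empty = [v for v in rows if v is not None and v.strip() != ""]
    distinct = Counter(non_empty)
    return {
        # Counts: total, empty, and distinct values among the profiled rows
        "counts": {
            "rows": len(rows),
            "empty": len(rows) - len(non_empty),
            "distinct": len(distinct),
        },
        # Values: each distinct value with its number of occurrences
        "values": distinct.most_common(),
        # Patterns: each character-class pattern with its number of occurrences
        "patterns": Counter(value_pattern(v) for v in non_empty).most_common(),
        # Data types: each inferred type with its number of occurrences
        "data_types": Counter(infer_type(v) for v in non_empty).most_common(),
        # Sample: the first few raw values, kept for later display
        "sample": rows[:sample_size],
    }


profile = profile_column(["AB-123", "CD-456", "", "42", "AB-123", None])
print(profile["counts"])    # {'rows': 6, 'empty': 2, 'distinct': 3}
print(profile["patterns"])  # [('AA-999', 3), ('99', 1)]
```

In this sketch the "sample" entry corresponds to the raw data that would require a role such as Data Viewer, while aggregate figures like the distinct count are the kind of generic statistic that can be shown more widely.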