Data Sampling and Profiling Technical Details

Talend Data Catalog reuses the model bridge infrastructure and metamodel for data profiling. Database and file system bridges provide “concealed” support for data profiling. They run in metadata import mode by default. To run them in profiling mode, specify the dedicated Miscellaneous options.

When the bridges run in metadata mode, they import not only basic structural details, such as tables and columns, but also advanced details, such as keys and indexes. When they run in profiling mode, they import the same basic structural details, which carry the profiling statistics (e.g. UDPs on MIR Attribute). This allows MM to integrate the profiling statistics into already loaded metadata using the basic structure.

In profiling mode, the bridges use the data profiling library. The library is derived from and depends on the open source Talend data quality library. When a bridge runs in metadata mode, it does not depend on the data profiling library.

The bridge uses one of two queries for data sampling/profiling, depending on the number of rows in the table.

The first query is used if the table has fewer than 100,000 rows:

SELECT * FROM TableName DISTRIBUTE BY rand() SORT BY rand() limit 100

The second query is used if the table has 100,000 rows or more:

SELECT * FROM TableName TABLESAMPLE(n PERCENT)
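
The choice between the two queries can be illustrated with a minimal Python sketch. This is not the bridge's actual implementation. The function name build_sampling_query, the constant ROW_COUNT_THRESHOLD, and the sample_percent parameter (standing in for n, whose value this page does not state) are hypothetical names introduced only for illustration.

```python
# Illustrative sketch only: mirrors the row-count rule described above.
# All names here are hypothetical and not part of Talend Data Catalog.

ROW_COUNT_THRESHOLD = 100_000  # threshold stated in this page


def build_sampling_query(table_name: str, row_count: int,
                         sample_percent: float = 1.0) -> str:
    """Return the sampling query text for a table of the given size.

    sample_percent stands in for "n"; the default of 1.0 is an assumption,
    since the page does not say how n is chosen.
    """
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in the range (0, 100]")

    # NOTE: table_name is interpolated as-is for readability; real code
    # should validate or quote identifiers before building SQL text.
    if row_count < ROW_COUNT_THRESHOLD:
        # Fewer than 100,000 rows: random shuffle, then take 100 rows.
        return (f"SELECT * FROM {table_name} "
                "DISTRIBUTE BY rand() SORT BY rand() LIMIT 100")

    # 100,000 rows or more: percentage-based table sampling.
    return f"SELECT * FROM {table_name} TABLESAMPLE({sample_percent:g} PERCENT)"


if __name__ == "__main__":
    print(build_sampling_query("TableName", 5_000))
    print(build_sampling_query("TableName", 2_000_000, sample_percent=0.5))
```

Under these assumptions, a table with exactly 100,000 rows falls into the second branch, matching the "100,000 rows or more" rule.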
