Skip to main content Skip to complementary content

Database Specific Sampling and Profiling Considerations

Information note

Data sampling and data profiling is available on all dedicated database bridges (e.g., Oracle, SQL Server, Hive, Teradata, etc.). It is implemented by automatically including the following options in the Miscellaneous bridge parameter:

Miscellaneous = -dp 1000 -dps 10

Which would be data profiling based upon 1000 rows and sampling of 10 rows.

Information note

However, the generic database JDBC import (e.g., HP Vertica) cannot predict the proper query to access data, and therefore does not support data profiling and sampling. However, the user may try by specifying the special option in the Miscellaneous bridge parameter. This is as follows:

Miscellaneous = -dp.query-template <SQL Query>

Examples for specific databases (to replace <SQL Query>):

HP Vertica:

SELECT @column FROM @schema.@table LIMIT @recordsLimit

IBM DB2 ISeries (AS/4oo):

SELECT @column FROM @schema.@table FETCH FIRST @recordsLimit ROWS ONLY

Microsoft SQL Server:

SELECT TOP @recordsLimit [@column] FROM [@schema].[@table]

Oracle database:

SELECT "@column" FROM "@schema"."@table" WHERE ROWNUM <= @recordsLimit

Oracle MySQL:

SELECT `@column` FROM `@schema`.`@table` LIMIT @recordsLimit

Information note

Data profiling and sampling requires access to the data in the data store. Thus, you must specify a user name / password with the appropriate access rights. For example:

- SAP HANA database:

GRANT the SELECT privileges to a role the user has.

https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.03/en-US/20f674e1751910148a8b990d33efbdc5.html

Steps

  1. Specify profiling information when harvesting the metadata source
  2. Execute the harvest.

Example

Table or file sample data

Log in as a user with the Data Viewing capability object role assignment for the Data Lake .

Viewing the sample data records for the Employee.csv file on its object page in the Sample Data tab.

Sign in as Dan , who is a Data Lake Viewer :

Information note

The SSN field is obfuscated. This is due to the fact that the US Social Security Number data class has a Default Sensitivity of Classified and the Classified sensitivity label has the Hide Data flag set.

Field or column profiling data

Now, log in as a user with the Data Management capability object role assignment for the Data Lake or one that is a member of the Administrators or Data Administrators group.

Then, click the Fields tab and click Gender

Go to the Employee.csv file object page and then the Fields tab.

Click on the Gender field to go to its object page .

Data classes

A field or column may be assigned one or more data classes. E.g., the field in the example above is assigned the data class Gender. Once that data class is assigned, it is a property of the element and may be searched on, filtered by and further edited (remove the assignment).

Some actions can apply to all objects of a certain data class. In particular the Hide/Show property referenced above.

You may manage the list of data classes with proper admin privileges.

See data classification for more details.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – let us know how we can improve!