Data Quality CSV Format

  • Rules (folder with Data Quality information)
    • Folder.csv (lets you recreate a hierarchy of folders to organize DQ rules; see the sample below)
      • Id (folder unique identifier)
      • Name (folder name)
      • Parent Folder Id (identifier of the parent folder, optional)
      • Description (folder description, optional)
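    A hypothetical Folder.csv defining two folders, one nested under the other (header names follow the field list above; all identifiers and values are illustrative):

      Id,Name,Parent Folder Id,Description
      F1,Finance,,Rules for the finance domain
      F2,Customers,F1,Customer-related rules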
    • Rule.csv (provides DQ rule declarations; see the sample below)
      • Id (rule unique identifier)
      • Name (rule name; must be unique under the parent folder)
      • Folder Id (identifier of the parent folder)
      • Description (rule description, optional)
      • Operation (rule operation, optional)
      • Dimension (rule dimension name; see the list of possible values below)
      • Url (hyperlink to the rule in the original DQ tool, optional)
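    A hypothetical Rule.csv row; the Operation value shown is illustrative, and Dimension must come from the list of supported values documented below:

      Id,Name,Folder Id,Description,Operation,Dimension,Url
      R1,Customer SSN format,F2,Validates the SSN pattern,Regex match,Validity,https://dq.example.com/rules/R1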
    • Rule Application.csv (declares DQ rule instances and associates them with individual columns; see the sample below)
      • Id (rule application unique identifier)
      • Rule Id (identifier of the associated DQ rule)
      • Type (column path, connection type; required to interpret and then correctly stitch the referenced column in MM; see the list of possible values below)
      • Server (column path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)
      • Catalog (column path, catalog name; can be a Snowflake or SQL Server database; empty for Oracle and file systems)
      • Schema (column path, schema name; can be a database schema or a file folder path, such as /dir1/dir2)
      • Table (column path, table name; can also be a file or partitioned directory name)
      • Column (column path, column name)
      • Description (rule application description, optional)
      • Threshold (rule application threshold: a floating-point value used to compute the DQ status in MM (Good, Acceptable, or Bad); use a comma as a delimiter if more than one value is required)
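    A hypothetical Rule Application.csv row applying rule R1 to a SQL Server column. A multi-value Threshold contains commas, so under standard CSV quoting that field must be enclosed in double quotes; the two threshold values and their meaning are illustrative:

      Id,Rule Id,Type,Server,Catalog,Schema,Table,Column,Description,Threshold
      RA1,R1,JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,SSN,SSN format check,"95.0,80.0"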
    • Rule Application Column.csv (provides additional associations between DQ rule applications and the corresponding columns when a rule application refers to two or more columns; see the sample below)
      • Rule Application Id (rule application identifier)
      • Type (column path, connection type; required to interpret and then correctly stitch the referenced column in MM; see the list of possible values below)
      • Server (column path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)
      • Catalog (column path, catalog name; can be a Snowflake or SQL Server database; empty for Oracle and file systems)
      • Schema (column path, schema name; can be a database schema or a file folder path, such as /dir1/dir2)
      • Table (column path, table name; can also be a file or partitioned directory name)
      • Column (column path, column name)
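    A hypothetical Rule Application Column.csv row attaching a second column to rule application RA1 (all values are illustrative):

      Rule Application Id,Type,Server,Catalog,Schema,Table,Column
      RA1,JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,SSN_Status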
    • Rule Measure.csv (provides rule execution results; see the sample below)
      • Id (rule measure unique identifier)
      • Rule Application Id (rule application identifier)
      • Date (the execution date when the values were collected; see the supported date/time format below)
      • Rows Total (the overall number of processed rows)
      • Rows Failed (the number of rows that failed the DQ rule)
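    A hypothetical Rule Measure.csv row; the ISO 8601 timestamp is only an assumption, so use the supported date/time format documented below:

      Id,Rule Application Id,Date,Rows Total,Rows Failed
      RM1,RA1,2023-06-15T08:30:00,20000,28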
  • Profiling (folder with Data Profiling information, optional)
    • Table.csv (table-level profiling information; see the sample below)
      • Type (table path, connection type; required to interpret and then correctly stitch the referenced table in MM; see the list of possible values below)
      • Server (table path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)
      • Catalog (table path, catalog name; can be a Snowflake or SQL Server database; empty for Oracle and file systems)
      • Schema (table path, schema name; can be a database schema or a file folder path, such as /dir1/dir2)
      • Table (table path, table name; can also be a file or partitioned directory name)
      • Creation Date (the date/time when the data profiling information was collected; see the supported date/time format below)
      • Data Update Date (the date/time when data was last updated in the source table; see the supported date/time format below)
      • Records All (the number of records in the source table)
      • Records Processed (the number of records processed by the profiler)
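    A hypothetical Table.csv row (timestamps again shown in ISO 8601 as an assumption; all values are illustrative):

      Type,Server,Catalog,Schema,Table,Creation Date,Data Update Date,Records All,Records Processed
      JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,2023-06-15T08:30:00,2023-06-14T23:10:00,20000,20000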
    • Column.csv (column-level profiling information; see the note on serialized fields below)
      • Type (column path, connection type; required to interpret and then correctly stitch the referenced column in MM; see the list of possible values below)
      • Server (column path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)
      • Catalog (column path, catalog name; can be a Snowflake or SQL Server database; empty for Oracle and file systems)
      • Schema (column path, schema name; can be a database schema or a file folder path, such as /dir1/dir2)
      • Table (column path, table name; can also be a file or partitioned directory name)
      • Column (column path, column name)
      • Creation Date (the date/time when the data profiling information was collected; see the supported date/time format below)
      • Count (KPI with the count of all values in the column; long value)
      • Distinct (KPI with the count of distinct values; long value)
      • Duplicate (KPI with the count of duplicate values; long value)
      • Blank (KPI with the count of blank string values; long value)
      • Zero (KPI with the count of zero numeric values; long value)
      • Null (KPI with the count of NULL values; long value)
      • Min (KPI with the minimum value in the column; usually applicable to numeric values)
      • Max (KPI with the maximum value in the column; usually applicable to numeric values)
      • Mean (KPI with the mean value in the column; usually applicable to numeric values)
      • Variance (KPI with the variance of the values in the column; usually applicable to numeric values)
      • Median (KPI with the median value in the column; usually applicable to numeric values)
      • Lower quantile (KPI with the lower-quantile value in the column; usually applicable to numeric values)
      • Upper quantile (KPI with the upper-quantile value in the column; usually applicable to numeric values)
      • Avg length (KPI with the average value length in the column; usually applicable to string values; double value)
      • Min length (KPI with the minimum value length in the column; usually applicable to string values; long value)
      • Max length (KPI with the maximum value length in the column; usually applicable to string values; long value)
      • Valid (KPI with the count of valid values, usually computed from the column’s data type; long value)
      • Invalid (KPI with the count of invalid values, usually computed from the column’s data type; long value)
      • Data Type Invalid Values (serialized list of the invalid values, usually computed from the column’s data type; use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
      • Data Type Frequency (serialized list of data type frequencies in the format “[value],[count],[ratio]”, e.g. INTEGER,6,60.00;STRING,4,40.00; use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
      • Pattern Frequency (serialized list of pattern frequencies in the format “[value],[count],[ratio]”, e.g. 999-99-9999,19972,1.00000; use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
      • Value Frequency (serialized list of value frequencies in the format “[value],[count],[ratio]”, e.g. 987-66-3365,1,0.00005;987-66-5400,1,0.00005; use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
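    The serialized list fields deserve care when the file is written: each entry already contains commas, so under standard CSV quoting the whole field must be enclosed in double quotes to keep those commas from being read as field separators. A hypothetical Value Frequency cell, matching the format above (illustrative values):

      "987-66-3365,1,0.00005;987-66-5400,1,0.00005"

    A literal semicolon or backslash inside a serialized value would be escaped with the backslash, i.e. \; or \\.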
  • Sampling (folder with Data Sampling information, optional)
    • Data Sampling.csv (mapping file that lets you look up the Data Sampling information for individual tables/files; it is used to resolve special characters that can appear in table names but are not allowed in file-system names, so that the sample files can be named accordingly)
    • *[file path recreated from ‘Data Sampling.csv’] (e.g. "\JDBC\172.0.0.1\FinanceDWDimensional\dbo\Customer.csv")
  • Table Url.csv (specifies a hyperlink to the original DQ source for a given table, optional; see the sample below)
    • Type (table path, connection type; required to interpret and then correctly stitch the referenced table in MM; see the list of possible values below)
    • Server (table path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)
    • Catalog (table path, catalog name; can be a Snowflake or SQL Server database; empty for Oracle and file systems)
    • Schema (table path, schema name; can be a database schema or a file folder path, such as /dir1/dir2)
    • Table (table path, table name; can also be a file or partitioned directory name)
    • Url (a hyperlink to the corresponding table record in the original DQ tool)
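    A hypothetical Table Url.csv row (the URL is illustrative):

      Type,Server,Catalog,Schema,Table,Url
      JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,https://dq.example.com/tables/Customer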
