Checking the quality of your data - Cloud

Talend Cloud Data Inventory with Snowflake Getting Started Guide

The Overview page gives an idea of the overall quality of the dataset, but more precise indicators are available.

While the Data quality tile showed quality at the dataset level, you will now open the dataset sample to look at quality at the record level.

In the application, each value is categorized as valid, invalid, or empty by checking it against the semantic type automatically detected for its column. The categories use the following color code:

  • Green for data that matches the column format
  • Orange for data that does not match the column format
  • Black for empty cells
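
As a rough illustration of this categorization, here is a minimal Python sketch. It is not how the application implements semantic types: the `EMAIL_PATTERN` regular expression and the `classify_cell` function are hypothetical stand-ins, assuming a semantic type can be reduced to a single pattern.

```python
import re

# Hypothetical stand-in for a semantic type: a deliberately simplified email pattern.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Color code from the list above, keyed by category.
COLORS = {"valid": "green", "invalid": "orange", "empty": "black"}


def classify_cell(value, pattern=EMAIL_PATTERN):
    """Return 'empty', 'valid', or 'invalid' for a single cell value."""
    text = "" if value is None else str(value).strip()
    if not text:
        return "empty"
    # fullmatch requires the whole value to match the pattern.
    return "valid" if pattern.fullmatch(text) else "invalid"


for cell in ["jane@example.com", "not-an-email", ""]:
    category = classify_cell(cell)
    print(repr(cell), category, COLORS[category])
```

Real semantic types are typically richer than one regular expression, so treat the pattern here only as a placeholder for whatever validation the detected type applies.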

Procedure

  1. From the left panel menu, click the Sample icon.
    Your dataset opens in a grid, and the first 10,000 rows are displayed in tabular form. A JDBC dataset is shown in the grid view by default. For other dataset types, or depending on your preferences, you can switch the sample to a hierarchical view or a raw view.
  2. In the header above the dataset, look at the pie charts. They are the same as in the overview and show the distribution of invalid, empty, and valid values across the entire dataset.
  3. Take a look at the header of each column.
    In the grid view, every column header includes a quality bar. The statistics displayed in each bar apply only to that column.
  4. Hover over each color in the quality bar of any column to display the detailed statistics for that column.

    You can see in this example that the column contains X cells that do not match the semantic type detected for the column, X empty cells, and X valid cells. In the grid view, cells containing invalid values are displayed with an orange left border.

    The column semantic type can be changed at any time to better match the content of the column and reduce the number of invalid values.
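
The per-column statistics, and the effect of changing a column's semantic type, can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the application's logic: the semantic type names `five_digit_code` and `numeric_code`, their patterns, and the `column_quality` function are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical semantic types, each reduced to a single regular expression.
SEMANTIC_TYPES = {
    "five_digit_code": re.compile(r"\d{5}"),
    "numeric_code": re.compile(r"\d{4,5}"),
}


def column_quality(values, semantic_type):
    """Count valid, invalid, and empty cells in one column."""
    pattern = SEMANTIC_TYPES[semantic_type]
    counts = Counter({"valid": 0, "invalid": 0, "empty": 0})
    for value in values:
        text = "" if value is None else str(value).strip()
        if not text:
            counts["empty"] += 1
        elif pattern.fullmatch(text):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    return dict(counts)


column = ["75001", "69002", "", "33000", "1005", "8001"]

# Same column, two candidate semantic types: the better fit yields fewer invalid cells.
print(column_quality(column, "five_digit_code"))
print(column_quality(column, "numeric_code"))
```

In this made-up column, the four-digit values count as invalid under the stricter type and as valid under the looser one, which mirrors how choosing a better-matching semantic type reduces the number of invalid values.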

Results

You have checked the distribution of empty, invalid, and valid values across the whole dataset, as well as in each column.