Checking the quality indicators of your data - Cloud

Talend Cloud Data Inventory Getting Started Guide

The Overview page gives you an idea of the overall quality of the dataset, but more precise indicators are available.

The Data quality tile showed the quality at the dataset level. You will now open the dataset Sample to look at the quality at the record level.

The application categorizes each cell as valid, invalid, or empty by checking it against the semantic type automatically detected for its column, and uses the following color code:

  • Green for data that matches the column format
  • Orange for data that does not match the column format
  • Black for empty cells
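
The categorization above can be illustrated with a minimal sketch. It is not how Talend Cloud Data Inventory implements semantic types: the `KNOWN_COUNTRIES` set, the `classify_cell` and `column_quality` functions, and the sample values are all hypothetical and only show the idea of counting empty, valid, and invalid cells in a column.

```python
from collections import Counter

# Hypothetical stand-in for a semantic type: a small set of accepted values.
# The real Country semantic type in the application is not reproduced here.
KNOWN_COUNTRIES = {"France", "Germany", "Japan", "United States"}


def classify_cell(value, is_valid):
    """Return 'empty', 'valid' or 'invalid' for a single cell."""
    if value is None or str(value).strip() == "":
        return "empty"
    return "valid" if is_valid(str(value).strip()) else "invalid"


def column_quality(values, is_valid):
    """Count valid, invalid and empty cells in one column."""
    counts = Counter(classify_cell(v, is_valid) for v in values)
    return {key: counts.get(key, 0) for key in ("valid", "invalid", "empty")}


# Made-up sample column: three valid names, one misspelling, two empty cells.
sample_column = ["France", "Japan", "", "Frnce", None, "Germany"]
quality = column_quality(sample_column, lambda v: v in KNOWN_COUNTRIES)
print(quality)
```

In this sketch, empty cells are checked first, so a blank value is never counted as invalid, which mirrors the three separate categories in the color code.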

Procedure

  1. From the left panel menu, click the Sample icon.
    Your dataset opens in a grid, with all 100 rows displayed in tabular form. The maximum sample size in Talend Cloud Data Inventory is 10,000 records. By default, the sample of a .csv file is shown in grid view. For other file types, or depending on your preferences, you can display the sample in a hierarchical view or a raw view.
  2. In the header above the dataset, look at the pie charts. They are the same as in the overview and show the distribution of invalid, empty, and valid values across the entire dataset.
  3. Take a look at the header of each column.
    When using the grid view of your dataset, every column header integrates a quality bar. The statistics displayed here apply to each specific column.
  4. Point your mouse over each color in the quality bar of the production_country column to display the detailed statistics for this specific column.
    You can see that this column contains 8 cells that do not match the Country semantic type, 1 empty cell, and 91 valid cells. In the grid view, cells containing invalid values are displayed with an orange left border.
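
As a quick check of the figures in the last step, the counts read from the quality bar can be turned into percentages. This is only an illustrative sketch: the `quality_shares` function is hypothetical and is not part of the application.

```python
def quality_shares(valid, invalid, empty):
    """Convert cell counts into percentages of the column total."""
    total = valid + invalid + empty
    if total == 0:
        # Avoid dividing by zero for a column with no cells.
        return {"valid": 0.0, "invalid": 0.0, "empty": 0.0}
    counts = {"valid": valid, "invalid": invalid, "empty": empty}
    return {name: round(100 * count / total, 1) for name, count in counts.items()}


# Counts reported for the production_country column in the sample.
shares = quality_shares(valid=91, invalid=8, empty=1)
print(shares)
```

Because the sample contains 100 rows, each count here equals its percentage of the column.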

Results

You have checked the distribution of empty, invalid, and valid records across the whole dataset, as well as in each column. Most columns contain at least a few empty entries, but only popularity, production_country, and original_language also have invalid values. For one of these columns, the quality problem could come from a semantic type issue.