The Overview page gives you an idea of the overall quality of the dataset,
but more precise indicators are available.
The Data quality tile shows quality at the dataset level. You will now open
the dataset Sample to look at quality at the record level.
In the application, each value is categorized as empty, valid, or invalid against the
semantic type automatically detected for its column, with the following color code:
- Green for data that matches the column format
- Red for data that does not match the column format
- Black for empty cells
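The three categories above can be pictured with a short sketch. This is only an illustration, not how Talend Cloud Data Inventory detects or validates semantic types: the `KNOWN_COUNTRIES` set, the `classify` and `quality_counts` functions, and the sample values are all hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for a "Country" semantic type: a small set of accepted values.
KNOWN_COUNTRIES = {"France", "Germany", "Japan", "United States of America"}


def classify(value):
    """Return 'empty', 'valid' or 'invalid' for a single cell."""
    if value is None or value.strip() == "":
        return "empty"
    return "valid" if value.strip() in KNOWN_COUNTRIES else "invalid"


def quality_counts(column):
    """Count the cells of a column per category."""
    counts = Counter(classify(cell) for cell in column)
    return {key: counts.get(key, 0) for key in ("valid", "invalid", "empty")}


# Made-up sample column, not taken from the tutorial dataset.
sample = ["France", "", "Japan", "Fr@nce", None, "Germany"]
print(quality_counts(sample))
```

In this sketch a cell is checked for emptiness first, so a blank value is never reported as invalid.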
Procedure
- From the left panel menu, click the Sample icon.
Your dataset opens in a grid, with all 100 rows displayed in tabular form.
The maximum sample size in Talend Cloud Data Inventory is 10,000 records.
By default, the sample of your .csv file is shown in a grid view, but for
other file types, or depending on your preferences, you can display the
sample in a hierarchical view or a raw view.
- In the header above the dataset, you can see the same pie charts as in the
overview, showing the distribution of invalid, empty, and valid values
across the entire dataset.
- Take a look at the header of each column.
In the grid view of your dataset, every column header includes a quality
bar. The statistics displayed there apply only to that column.
- Point your mouse over each color in the quality bar of the
production_country column to display the detailed statistics for this
specific column.
You can see that this column contains 8 cells that do not match the Country
semantic type, 1 empty cell, and 91 valid cells. In the grid view, cells
containing invalid values are displayed with a red left border.
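As a quick check on the figures from the production_country column, the counts can be turned into shares of the column. This is a minimal sketch that assumes each share is simply the category count divided by the total number of cells. The `quality_shares` function is hypothetical and not part of the product.

```python
def quality_shares(invalid, empty, valid):
    """Return each category's share of the column, as a percentage."""
    total = invalid + empty + valid
    if total == 0:
        # An empty column has no meaningful shares.
        return {"invalid": 0.0, "empty": 0.0, "valid": 0.0}
    counts = {"invalid": invalid, "empty": empty, "valid": valid}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts reported for production_country in the 100-row sample.
print(quality_shares(invalid=8, empty=1, valid=91))
# {'invalid': 8.0, 'empty': 1.0, 'valid': 91.0}
```

The three counts add up to the 100 rows of the sample, so each count is also its percentage.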
Results
You have checked the distribution of empty, invalid, and valid records across the
whole dataset, as well as in each column. Most columns contain at least a few empty
entries, but only popularity, production_country, and original_language also have
invalid values. For one of these columns, the quality problem could come from a
semantic type issue.