You are now a data analyst from the finance department, tasked with investigating the poor quality of the customers_billing_dataset dataset that you have been given access to. You will look at the data itself and create a new preparation.
From the Dataset list, click customers_billing_dataset to open the detailed view of the dataset.
You can already get a sense of the dataset's quality: the Talend Trust Score™ diagram shows a downward trend over the last few days, which means that the latest data added to the database contains errors. This is confirmed by the Data quality tile, which shows a certain percentage of invalid and empty values.
To take a look at the data itself, click the Sample icon from the left menu.
The data is displayed in a grid view. You can quickly see discrepancies between valid and invalid values in certain columns. Most noticeably, the Billing_Country column contains full addresses that should have been split between several columns.
To start a new preparation on this dataset and fix these errors, click the button at the top right of the screen.
Talend Cloud Data Preparation opens and you can now start applying transformation operations on the data sample.
Apply the following functions to correct the billing information:
- Split the text in parts on the Billing_Country column, with 4 as the number of Parts and , (comma) as Separator.
- Remove trailing and leading characters on the Billing_Country_Split_2, Billing_Country_Split_3 and Billing_Country_Split_4 columns, to remove whitespaces.
- Delete the rows that match on the Billing_Country_Split_1 column, and use the (FR)|(US)|(GB) regular expression as Value.
The data from the full addresses has been split into new columns, which you have also cleaned to ensure that they are in the right format. This leaves you with only the rows that initially contained the errors, now with the billing information properly split into dedicated columns for country, state, city, and street.
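To make the effect of these three functions easier to picture, here is a minimal sketch of the same logic in plain Python, outside of Talend Cloud Data Preparation. It is not how the product implements them: the helper names (split_in_parts, trim, delete_matching) and the sample rows are hypothetical, and it assumes that the regular expression must match the whole cell value.

```python
import re

def split_in_parts(rows, column, parts=4, separator=","):
    """Split a column into a fixed number of new columns (hypothetical helper)."""
    result = []
    for row in rows:
        # Split at most parts - 1 times so that no more than `parts` pieces exist.
        pieces = row[column].split(separator, parts - 1)
        # Pad with empty strings when the value has fewer pieces than requested.
        pieces += [""] * (parts - len(pieces))
        new_row = dict(row)
        for index, piece in enumerate(pieces, start=1):
            new_row[f"{column}_Split_{index}"] = piece
        result.append(new_row)
    return result

def trim(rows, columns):
    """Remove leading and trailing whitespace in the given columns."""
    return [
        {key: (value.strip() if key in columns else value) for key, value in row.items()}
        for row in rows
    ]

def delete_matching(rows, column, pattern):
    """Drop rows whose value matches the pattern (assumption: whole-value match)."""
    regex = re.compile(pattern)
    return [row for row in rows if not regex.fullmatch(row[column])]

# Hypothetical sample: two valid country codes and one full address.
rows = [
    {"Billing_Country": "FR"},
    {"Billing_Country": "US"},
    {"Billing_Country": "France, Ile-de-France, Paris, 10 Rue de Rivoli"},
]

rows = split_in_parts(rows, "Billing_Country", parts=4, separator=",")
rows = trim(rows, ["Billing_Country_Split_2", "Billing_Country_Split_3", "Billing_Country_Split_4"])
rows = delete_matching(rows, "Billing_Country_Split_1", r"(FR)|(US)|(GB)")

for row in rows:
    print(row)
```

With this sample, only the row that held a full address remains, with its parts available in the four Billing_Country_Split columns.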
The preparation now displays cleaner data that can be used to update the source dataset.