
Fixing the issues with Talend Cloud Data Preparation

Availability note: Beta
You are now a data analyst from the finance department, tasked with investigating the poor quality of customers_billing_dataset, a dataset that you have been given access to. You will look at the data itself and create a new preparation.

Procedure

  1. From the Dataset list, click customers_billing_dataset to open the detailed view of the dataset.
    The detailed view already gives a sense of the dataset's condition: the Talend Trust Score™ diagram shows a downward trend over the last few days, which suggests that the latest data added to the database contains errors. The Data quality tile confirms this by showing a percentage of invalid and empty values.
    Detailed view of the customers_billing_dataset with charts and quality indicators.
  2. To take a look at the data itself, click the Sample icon from the left menu.
    The data is displayed in a grid view. You can quickly see discrepancies between valid and invalid values in certain columns. Most noticeably, the Billing_Country column contains full addresses that should have been split across several columns.
    Sample view of the dataset, showing errors to be fixed in the data.
  3. To start a new preparation on this dataset and fix these errors, click the Preparations > Add button at the top right of the screen.
    Mouse pointing over the Add preparation button.

    Talend Cloud Data Preparation opens and you can now start applying transformation operations on the data sample.

  4. Apply the following functions to correct the billing information:
    1. Use Split the text in parts on the Billing_Country column, with Parts set to 4 and , as the Separator.
    2. Use Remove trailing and leading characters on the Billing_Country_Split_2, Billing_Country_Split_3 and Billing_Country_Split_4 columns to remove whitespace.
    3. Use Delete the rows that match on the Billing_Country_Split_1 column, with the (FR)|(US)|(GB) regular expression as Value.
    The data from the full addresses has been split into new columns, which you have also cleaned to ensure that the values are in the right format. Deleting the rows that match the regular expression leaves only the rows that initially contained the errors, now with the billing information properly split into dedicated columns for country, state, city, and street.
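
For readers who want to reason about what these three functions do outside the application, here is a minimal Python sketch of the same logic. It is not how Talend Cloud Data Preparation implements the functions. The sample rows and the address layout are hypothetical, and two behaviors are assumptions: that the split puts any remaining text in the last part, and that a row is deleted only when the whole value of Billing_Country_Split_1 matches the regular expression.

```python
import re

# Hypothetical sample rows; the real dataset's values and address layout may differ.
rows = [
    {"Billing_Country": "FR"},
    {"Billing_Country": "US"},
    {"Billing_Country": "12 Main St,  Boston , MA, US"},
    {"Billing_Country": "4 Rue Cler, Paris,Ile-de-France ,FR"},
]

# Same pattern as the Value used in the procedure.
COUNTRY_CODE = re.compile(r"(FR)|(US)|(GB)")


def split_in_parts(row, column="Billing_Country", parts=4, separator=","):
    """Step 1: split the column into a fixed number of new columns."""
    # Assumption: text beyond the last separator budget stays in the last part.
    pieces = row[column].split(separator, parts - 1)
    pieces += [""] * (parts - len(pieces))  # pad short values with empty strings
    new_row = dict(row)
    for index, piece in enumerate(pieces, start=1):
        new_row[f"{column}_Split_{index}"] = piece
    return new_row


def trim_columns(row, columns):
    """Step 2: remove leading and trailing whitespace from the given columns."""
    new_row = dict(row)
    for column in columns:
        new_row[column] = new_row[column].strip()
    return new_row


def matches_country_code(row, column="Billing_Country_Split_1"):
    """Step 3 predicate: True when the whole value is one of the country codes."""
    # Assumption: the match applies to the entire value, not a substring.
    return COUNTRY_CODE.fullmatch(row[column]) is not None


split_rows = [split_in_parts(row) for row in rows]
trimmed_rows = [
    trim_columns(
        row,
        ["Billing_Country_Split_2", "Billing_Country_Split_3", "Billing_Country_Split_4"],
    )
    for row in split_rows
]
result = [row for row in trimmed_rows if not matches_country_code(row)]

for row in result:
    print([row[f"Billing_Country_Split_{i}"] for i in range(1, 5)])
```

With this sample, the two rows that hold only a country code are deleted, and the two rows that held a full address remain with their four parts in separate, trimmed columns. Matching the whole value rather than a substring is a deliberate choice in the sketch: a street name that happens to contain the letters FR, US or GB would otherwise cause its row to be deleted.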

Results

The preparation now displays cleaner data that can be used to update the source dataset.
Sample view of the dataset, with improved data quality and formatting.
