Cleansing your data

Preparing an HDFS-based dataset

Now that your preparation has been saved, you can work on the customer data as you would with any other dataset, using all the usual functions.

The dataset that you have imported contains 20,000 rows, but by default only a sample of the first 10,000 rows is displayed. All the preparation steps that you add can still be applied to the whole dataset.
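
One way to picture the relationship between the displayed sample and the full dataset is as the same ordered list of steps replayed on both. The following Python sketch illustrates that idea only. It is not how Talend Data Preparation is implemented, and the names apply_recipe, recipe, and dataset are hypothetical.

```python
from itertools import islice

SAMPLE_SIZE = 10_000  # size of the sample displayed by default, as stated above


def apply_recipe(rows, recipe):
    """Apply each step of the recipe, in order, to an iterable of rows."""
    for step in recipe:
        rows = step(rows)
    return list(rows)


# Hypothetical one-step recipe: upper-case every value.
recipe = [lambda rows: (value.upper() for value in rows)]

# Stand-in for the 20,000-row customer dataset.
dataset = [f"row-{i}" for i in range(20_000)]

# The preview only processes the first SAMPLE_SIZE rows...
preview = apply_recipe(islice(dataset, SAMPLE_SIZE), recipe)
# ...while the same recipe can later be run over every row.
full_result = apply_recipe(dataset, recipe)

print(len(preview), len(full_result))
```

Because the recipe is kept separate from the data, nothing in it depends on how many rows were visible when the steps were defined.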

You will perform some basic cleansing operations, to ensure that all the data contained in the dataset is valid and free of errors.

For example, some entries of the First_Name and Last_Name columns contain unnecessary whitespace.

The quality bar under each column also indicates that your data contains rows with empty or invalid cells. The Email column, for example, contains both.
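
To make the three categories shown by the quality bar concrete, here is a minimal Python sketch that counts valid, invalid, and empty cells in one column. The email pattern is a deliberately simple assumption chosen for illustration, not the validation rule that Talend Data Preparation applies, and the function names are hypothetical.

```python
import re

# Hypothetical, deliberately simple validity rule for an email column.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_cell(value, is_valid):
    """Return 'empty', 'invalid', or 'valid' for a single cell."""
    if value is None or value.strip() == "":
        return "empty"
    return "valid" if is_valid(value) else "invalid"


def quality_counts(values, is_valid):
    """Count valid, invalid, and empty cells in one column."""
    counts = {"valid": 0, "invalid": 0, "empty": 0}
    for value in values:
        counts[classify_cell(value, is_valid)] += 1
    return counts


emails = ["ann@example.com", "", "not-an-email", "bob@example.org"]
email_quality = quality_counts(emails, lambda v: EMAIL_PATTERN.match(v) is not None)
print(email_quality)  # {'valid': 2, 'invalid': 1, 'empty': 1}
```

A column like this sample, with both an empty cell and an invalid cell, corresponds to the situation described for the Email column.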

You are going to delete all the rows that contain empty or invalid cells from the preparation in a single action, and remove the extra whitespace from the columns containing the customer names.

Procedure

  1. Click the header of the First_Name column.
  2. While holding down the Ctrl key, click the header of the Last_Name column.

    The two columns are now selected, and you can apply a function to both columns in one action.

  3. In the Functions panel, search for the Remove trailing and leading characters function and click it to open the options panel.
  4. In the Padding character drop-down list, select whitespace and click Submit.

    Blank spaces have been removed from the selected columns.

  5. Click the white arrow on the top left of the grid and select Display rows with invalid or empty values.

    A filter has been applied on your data, and only the rows with empty or invalid cells are displayed, making it easier for you to delete them in one go.

  6. In the Functions panel, click Delete these Filtered Rows to apply the corresponding function.

    All the filtered rows have been deleted, and you can now clear the filter by clicking the garbage bin icon in the filter bar.
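
For readers who want to see what the two actions in this procedure amount to, the following Python sketch reproduces them on a tiny in-memory table. It is an illustration under stated assumptions, not Talend's implementation: the function names are hypothetical, the sample rows are made up, and the email pattern is a simplistic stand-in for the tool's own validity rules.

```python
import re

# Hypothetical validity rule for the Email column (assumption for this sketch).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def strip_columns(rows, columns):
    """Steps 1 to 4: remove leading and trailing whitespace in the selected columns."""
    return [
        {key: (value.strip() if key in columns else value) for key, value in row.items()}
        for row in rows
    ]


def has_empty_or_invalid(row):
    """Step 5: filter predicate, true when any cell is empty or the email is invalid."""
    if any(value.strip() == "" for value in row.values()):
        return True
    return EMAIL_PATTERN.match(row["Email"]) is None


def delete_filtered_rows(rows, predicate):
    """Step 6: drop every row that matches the active filter."""
    return [row for row in rows if not predicate(row)]


# Made-up sample rows using the column names from this tutorial.
customers = [
    {"First_Name": "  Ann ", "Last_Name": "Lee  ", "Email": "ann@example.com"},
    {"First_Name": "Bob", "Last_Name": " Ray", "Email": ""},
    {"First_Name": "Cy", "Last_Name": "Poe", "Email": "cy-at-example"},
]

trimmed = strip_columns(customers, {"First_Name", "Last_Name"})
cleaned = delete_filtered_rows(trimmed, has_empty_or_invalid)
print(cleaned)
```

Only the first row survives: the second has an empty Email cell and the third has an invalid one. Trimming is applied before filtering, in the same order as the procedure.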

Results

In two simple actions, you have removed all the errors contained in your dataset and improved the quality of your data.

The quality bar for each column is now completely green, indicating that there are no empty or invalid cells left in your preparation.