Creating preparation versions - 8.0

Talend Data Preparation User Guide

Version: 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Preparation
Content: Data Quality and Preparation > Cleansing data
Last publication date: 2024-03-26

In the following example, you will perform a few preparation steps on your data, create versions at two different moments, and see how you can switch between your versions, as well as switch back to the current state of your preparation.

The dataset used here contains customer data, such as names, occupations, phone numbers and email addresses, and it requires some cleansing. The columns containing the customers' names have formatting inconsistencies, such as leading or trailing whitespace and inconsistent case. In addition, various phone and email entries are invalid.

As you progress in your preparation, you are going to create two versions that reflect the state of your preparation at two different times.

Procedure

  1. Click the header of the FIRST_NAME column, and while pressing the Ctrl key, click the header of the LAST_NAME column.

    The content of the two columns is now selected.

  2. Apply the Remove trailing and leading characters function and the Change to title case function to remove whitespace and harmonize the case.

    Removing those formatting errors marks the first big step in your preparation, and you are going to create a version to track these changes.

  3. Click the Manage versions button located in the header bar.

    The Functions panel is replaced with the Versions panel. This panel is empty since no versions exist for this preparation yet.

    Adding new versions via the Manage versions button is only available to Talend Data Preparation users with administrator rights. Other users can only consult existing versions in read-only mode.

  4. Click the Add version button.
  5. Enter a short description of the version in the corresponding field (Fixing formatting errors in names in this example), and click Add version.

    The version is now listed in the Versions panel with a timestamp and the description you entered.

  6. Click the version to access it in read-only mode.

    You can apply filters and browse the data, but you cannot apply functions to it.

  7. To leave the read-only mode and resume preparing your data, click the Switch to current state button located in the header bar.

    You are now back to the edit mode.

  8. To cleanse the remaining invalid entries from the PHONE and EMAIL columns, click the menu icon in the top-left corner of the grid, and select Display rows with invalid or empty values.
  9. From the Functions panel, select the Delete these filtered rows function.

    All the invalid values have been removed from your dataset, and you are going to create another version to capture this state.

  10. Repeat steps 3 to 5 to create a new version, but this time, enter Removing all invalid values as the description.

    Your two versions are now listed in the Versions panel and can be accessed in read-only mode.
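
For readers who want to see the effect of the two cleansing operations outside the tool, the following is a minimal Python sketch of the same logic. It is not Talend Data Preparation code. The sample rows, the clean_names and is_valid helpers, and the two validity patterns are assumptions made for illustration, and the validity rules the product applies to phone numbers and email addresses may differ.

```python
import re

# Hypothetical sample rows; only the column names come from the example dataset.
rows = [
    {"FIRST_NAME": "  john ", "LAST_NAME": "DOE  ",
     "PHONE": "555-0100", "EMAIL": "john.doe@example.com"},
    {"FIRST_NAME": "MARY", "LAST_NAME": " smith",
     "PHONE": "", "EMAIL": "mary.smith@example.com"},
    {"FIRST_NAME": "ann ", "LAST_NAME": "lee",
     "PHONE": "555-0101", "EMAIL": "not-an-email"},
]

# Simplified validity rules, assumed for this sketch only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"[0-9+() .-]{7,}")


def clean_names(row):
    """Strip leading/trailing whitespace and apply title case (steps 1 and 2)."""
    cleaned = dict(row)
    for col in ("FIRST_NAME", "LAST_NAME"):
        cleaned[col] = row[col].strip().title()
    return cleaned


def is_valid(row):
    """Return True when PHONE and EMAIL are both non-empty and well formed."""
    return bool(PHONE_RE.fullmatch(row["PHONE"])) and bool(
        EMAIL_RE.fullmatch(row["EMAIL"])
    )


# State captured by the first version: names are cleaned, all rows are kept.
named = [clean_names(r) for r in rows]

# State captured by the second version: invalid or empty rows are deleted
# (steps 8 and 9).
kept = [r for r in named if is_valid(r)]
```

In this sketch, named corresponds to the state captured by the first version, and kept to the state captured by the second.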

Results

You have created two versions of your preparation, in order to capture its state at two different steps of the cleansing process. You can choose to export one of these versions, use it in a Talend Job, or continue to edit the current state of your preparation.
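
As a mental model, a version can be thought of as a frozen, timestamped snapshot of the steps applied so far, while the current state remains editable. The following Python sketch illustrates that idea only. The Preparation class and its methods are hypothetical and do not describe how Talend Data Preparation stores versions internally.

```python
from datetime import datetime, timezone


class Preparation:
    """Hypothetical model: an editable list of steps plus frozen versions."""

    def __init__(self):
        self.steps = []      # current, editable state
        self.versions = []   # read-only snapshots

    def apply(self, step):
        """Add a step to the current state."""
        self.steps.append(step)

    def add_version(self, description):
        """Freeze the current steps with a description and a timestamp."""
        self.versions.append({
            "description": description,
            "timestamp": datetime.now(timezone.utc),
            "steps": tuple(self.steps),  # a tuple, so later edits cannot alter it
        })


prep = Preparation()
prep.apply("Remove trailing and leading characters")
prep.apply("Change to title case")
prep.add_version("Fixing formatting errors in names")

prep.apply("Delete these filtered rows")
prep.add_version("Removing all invalid values")
```

The first snapshot keeps only the two formatting steps, even after the third step is added to the current state, which mirrors how an earlier version stays unchanged while you continue editing.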