Dynamically using the data from another dataset - 8.0

Talend Data Preparation User Guide

Version
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Data Preparation
Content
Data Quality and Preparation > Cleansing data
Last publication date
2024-03-26

The lookup feature matches data from the current preparation with its counterpart in a reference dataset, allowing you add this reference data to your preparation.

Tip: Talend recommends to use the lookup feature on datasets of up to 1,000,000 rows and 10 columns.
In this example, you are working on a preparation that contains information about the American States your customers live in, but only in the form of a two-letter code. In parallel, you own a second dataset where the two-letter American State codes are matched with the full name of the State. You will use the lookup feature to retrieve this information and add it to your preparation.

Procedure

  1. Select the column that is common to both your main preparation and your reference dataset, state_code in this example.
    In order to perform a lookup, at least one column with common data must be present in the preparation and dataset that you want to blend.
  2. Click the lookup button in the upper right part of the screen to open the lookup panel.

    The Add data from lookup panel opens at the bottom of the screen.

  3. Click the button.
  4. In the dialog box that opens, select the dataset you want to use to perform the lookup and click Add.
    In this case, you need to add the dataset containing the list of two-letter State codes and their matching names. It must have been imported to Talend Data Preparation beforhand.

    The second dataset opens in the bottom part of the screen.

  5. In both your preparation and reference dataset, click the column containing the State codes so that they are highlighted.
  6. Select the Add to Dataset check box under every column you want to include in your lookup, State in this example.
  7. Point your mouse over the Confirm button to preview the changes, and click it to apply them.
    The State name for each matching State code between main preparation and reference dataset was retrieved.

    The only exception that you can see here is the second row, with the DC State code, that could not be matched with a full State name. The reason is that DC is not actually a proper State code, hence the invalid status in the first place. Logically, DC is not referenced in the second dataset, that only contains official values, and no match was found for this value in your preparation.

    The lookup feature behaves as follows:

    • If a value is common to both your main preparation and the reference dataset, the additional information is retrieved.
    • If a value exists in your preparation but not in the reference dataset, the lookup operation will leave an empty cell in the resulting column.
    • If a value from the reference dataset does not find a match in the main preparation, the additional information will simply not be retrieved.

    The logic behind the lookup operation in Talend Data Preparation corresponds to a left outer join when speaking in terms of Venn diagrams:

    Applied to our example, it means that you have used the common information about the State codes to create a join, and retrieved only the useful information from the second dataset to enrich the first one.

Results

A new column contaning the full State names is created with the result of the lookup. Each customer was matched with its corresponding State name when possible.

Another behavior of the lookup feature, that was not illustrated in the previous example, is how the possible duplicate and empty values from a reference dataset are handled.

Let's take the following example, where one of the State codes is missing from your preparation, and your reference dataset has a duplicate entry for the NY State code. For the sake of the example, NY has two possible values, New York and Nueva York, but it could very well be the same for both entries.

The above result was achieved by applying the following behavior:

  • When a same value exists several times in the reference dataset, only the first occurrence is matched, Nueva York in this case.
  • An empty cell in your preparation will be matched with an empty cell.