The lookup feature matches data from the current preparation with its counterpart
in a reference dataset, allowing you add this reference data to your
preparation.
Tip: Talend recommends to use the lookup feature on
datasets of up to 1,000,000 rows and 10 columns.
In this example, you are working
on a preparation that contains information about the American States your customers live
in, but only in the form of a two-letter code. In parallel, you own a second dataset where
the two-letter American State codes are matched with the full name of the State. You will
use the lookup feature to retrieve this information and add it to your preparation.
Procedure
-
Select the column that is common to both your main preparation and your
reference dataset, state_code in this example.
In order to perform a lookup, at least one column with common data must be
present in the preparation and dataset that you want to blend.
-
Click the lookup button in the upper right part of the screen to open the
lookup panel.
The Add data from lookup panel opens at the bottom of
the screen.
-
Click the
button.
-
In the dialog box that opens, select the dataset you want to use to perform the
lookup and click Add.
In this case, you need to add the dataset containing the list of two-letter
State codes and their matching names. It must have been imported to
Talend Data Preparation beforhand.
The
second dataset opens in the bottom part of the screen.
-
In both your preparation and reference dataset, click the column containing the
State codes so that they are highlighted.
-
Select the Add to Dataset check box under every column
you want to include in your lookup, State in this
example.
-
Point your mouse over the Confirm button to preview the
changes, and click it to apply them.
The State name for each matching State code between main preparation and
reference dataset was retrieved.
The only exception that you can see here is the second row, with the
DC State code, that could not be matched with a
full State name. The reason is that DC is not
actually a proper State code, hence the invalid status in the first place.
Logically, DC is not referenced in the second
dataset, that only contains official values, and no match was found for this
value in your preparation.
The lookup feature behaves as follows:
- If a value is common to both your main preparation and the reference
dataset, the additional information is retrieved.
- If a value exists in your preparation but not in the reference dataset,
the lookup operation will leave an empty cell in the resulting
column.
- If a value from the reference dataset does not find a match in the main
preparation, the additional information will simply not be
retrieved.
The logic behind the lookup operation in Talend Data Preparation corresponds to a
left outer join when speaking in terms of Venn diagrams: