The lookup feature matches data from the current preparation with its counterpart
in a reference dataset, allowing you add this reference data to your
preparation.
Tip: Talend
recommends to use the lookup feature on datasets of up to 1,000,000 rows and 10
columns.
In this example, you are working on a preparation that contains
information about the American States your customers live in, but only in the form of a
two-letter code. In parallel, you own a second dataset where the two-letter American State
codes are matched with the full name of the State. You will use the lookup feature to
retrieve this information and add it to your preparation.
Procedure
-
Select the column that is common to both your main preparation and your
reference dataset, state_code in this example.
In order to perform a lookup, at least one column with common data must be
present in the preparation and dataset that you want to blend.
-
Click the lookup button in the upper right part of the screen to open the
lookup panel.
The Add data from lookup panel opens at the bottom of
the screen.
-
Click the
button.
-
In the dialog box that opens, select the dataset you want to use to perform the
lookup and click Add.
In this case, you need to add the dataset containing the list of two-letter
State codes and their matching names. It must have been imported to
Talend Data Preparation beforhand.
The
second dataset opens in the bottom part of the screen.
-
In both your preparation and reference dataset, click the column containing the
State codes so that they are highlighted.
-
Select the Add to Dataset check box under every column
you want to include in your lookup, State in this
example.
-
Point your mouse over the Confirm button to preview the
changes, and click it to apply them.
The State name for each matching State code between main preparation and
reference dataset was retrieved.
The only exception that you can see here is the second row, with the
DC State code, that could not be matched with a
full State name. The reason is that DC is not
actually a proper State code, hence the invalid status in the first place.
Logically, DC is not referenced in the second
dataset, that only contains official values, and no match was found for this
value in your preparation.
The lookup feature behaves as follows:
- If a value is common to both your main preparation and the reference
dataset, the additional information is retrieved.
- If a value exists in your preparation but not in the reference dataset,
the lookup operation will leave an empty cell in the resulting
column.
- If a value from the reference dataset does not find a match in the main
preparation, the additional information will simply not be
retrieved.
The logic behind the lookup operation in Talend Data Preparation corresponds to a
left outer join when speaking in terms of Venn diagrams:
Applied to our example, it means that you have used the common
information about the State codes to create a join, and retrieved only the
useful information from the second dataset to enrich the first
one.
Results
A new column contaning the full State names is created with the result of the
lookup. Each customer was matched with its corresponding State name when
possible.
Another behavior of the lookup feature, that was not illustrated in the
previous example, is how the possible duplicate and empty values from a reference
dataset are handled.
Let's take the following example, where one of the State
codes is missing from your preparation, and your reference dataset has a duplicate
entry for the NY State code. For the sake of the example,
NY has two possible values, New
York and Nueva York, but it could very well
be the same for both entries.
The above result was achieved by applying the following behavior:
- When a same value exists several times in the reference dataset, only the first
occurrence is matched, Nueva York in this case.
- An empty cell in your preparation will be matched with an empty cell.