The lookup feature matches data from the current preparation with its counterpart in a reference dataset, allowing you add this reference data to your preparation.
Select the column that is common to both your main preparation and your
reference dataset, state_code in this example.
In order to perform a lookup, at least one column with common data must be present in the preparation and dataset that you want to blend.
Click the lookup button in the upper right part of the screen to open the
The Add data from lookup panel opens at the bottom of the screen.
- Click the button.
In the dialog box that opens, select the dataset you want to use to perform the
lookup and click Add.
In this case, you need to add the dataset containing the list of two-letter State codes and their matching names. It must have been imported to Talend Data Preparation beforhand.
The second dataset opens in the bottom part of the screen.
- In both your preparation and reference dataset, click the column containing the State codes so that they are highlighted.
Select the Add to Dataset check box under every column
you want to include in your lookup, State in this
Point your mouse over the Confirm button to preview the
changes, and click it to apply them.
The State name for each matching State code between main preparation and reference dataset was retrieved.
The only exception that you can see here is the second row, with the DC State code, that could not be matched with a full State name. The reason is that DC is not actually a proper State code, hence the invalid status in the first place. Logically, DC is not referenced in the second dataset, that only contains official values, and no match was found for this value in your preparation.
The lookup feature behaves as follows:
- If a value is common to both your main preparation and the reference dataset, the additional information is retrieved.
- If a value exists in your preparation but not in the reference dataset, the lookup operation will leave an empty cell in the resulting column.
- If a value from the reference dataset does not find a match in the main preparation, the additional information will simply not be retrieved.
The logic behind the lookup operation in Talend Data Preparation corresponds to a left outer join when speaking in terms of Venn diagrams:
Applied to our example, it means that you have used the common information about the State codes to create a join, and retrieved only the useful information from the second dataset to enrich the first one.
Another behavior of the lookup feature, that was not illustrated in the previous example, is how the possible duplicate and empty values from a reference dataset are handled.
Let's take the following example, where one of the State codes is missing from your preparation, and your reference dataset has a duplicate entry for the NY State code. For the sake of the example, NY has two possible values, New York and Nueva York, but it could very well be the same for both entries.
The above result was achieved by applying the following behavior:
- When a same value exists several times in the reference dataset, only the first occurrence is matched, Nueva York in this case.
- An empty cell in your preparation will be matched with an empty cell.