The lookup feature matches data from the current preparation with its
counterpart in a reference dataset, allowing you to add this reference data to your
preparation.
When using the lookup feature in
Talend Cloud Data Preparation, column
descriptions previously added in the lookup dataset via
Talend Cloud Data Inventory will be
displayed in the preparation.
In this example, you are working on a preparation that
contains information about the American States your customers live in, but only in the form
of a two-letter code. In parallel, you own a second dataset where the two-letter American
State codes are matched with the full name of the state. You can use the lookup feature to
retrieve this information and add it to your preparation.
Before you begin
A current limitation prevents the datasets based on the following connections to be
used in a lookup:
- Google Cloud Storage (Cloud file systems)
- Azure Synapse (Database)
- Google BigQuery (Database)
- Google BigTable (Database)
- Marketo (Business Apps)
- Google Analytics (Business Apps)
- NetSuite (Business Apps)
- Workday (Business Apps)
- Kafka (Messaging)
- RabbitMQ (Messaging)
Procedure
-
Open your preparation.
-
Click the lookup button to open the lookup panel.
-
Click Select dataset to select an existing dataset.
-
Select the dataset you want to use to perform the lookup.
In this case, you need to add the dataset containing the list of two-letter state
codes and their matching names. It must have been imported to
Talend Cloud Data Preparation
beforehand.
-
Click Select.
-
From the Current preparation and Lookup
dataset drop-down list, select the columns matching in your main
preparation and your reference dataset, the state_code and
States Code columns in this example.
In order to perform a lookup, at least one column with matching data must be
present in the preparation and dataset that you want to blend.
-
From the Columns to add drop-down list, select the column
containing the states to add it to the current dataset, State
in this example.
-
Choose to apply those changes only on the filtered rows or on all of the
rows.
-
Click Submit. The state name for each matching state code
between main preparation and reference dataset is retrieved.
The lookup feature behaves as follows:
- If a value is common to both your main preparation and the reference dataset,
the additional information is retrieved.
- If a value exists in your preparation but not in the reference dataset, the
lookup operation will leave an empty cell in the resulting column.
- If a value from the reference dataset does not find a match in the main
preparation, the additional information will simply not be retrieved.
The logic behind the lookup operation in Talend Cloud Data Preparation
corresponds to a left outer join when speaking in terms of Venn diagrams:
Applied to our example, it means that you have used the common information about
the state codes to create a join, and retrieved only the useful information from
the second dataset to enrich the first one.
Results
A new column contaning the full state names is created with the result of the lookup.
Each customer was matched with its corresponding state name when possible.
Another behavior of the lookup feature, that was not illustrated in the previous
example, is how the possible duplicate and empty values from a reference dataset are
handled.
Let's take the following example, where one of the State codes is missing from your
preparation, and your reference dataset has a duplicate entry for the
NY State code. For the sake of the example,
NY has two possible values, New York
and Nueva York, but it could very well be the same for both
entries.
The above result was achieved by applying the following behavior:
- When a same value exists several times in the reference dataset, only the first
occurrence is matched, Nueva York in this case.
- An empty cell in your preparation will be matched with an empty cell.