Dynamically using the data from another dataset

Dynamically using the data from another dataset - Cloud

Talend Cloud Data Preparation User Guide

Version

Cloud

Language

English

Product

Talend Cloud

Module

Talend Data Preparation

Content

Administration and Monitoring > Managing connections

Data Quality and Preparation > Cleansing data

Data Quality and Preparation > Managing datasets

Last publication date

2024-04-15

The lookup feature matches data from the current preparation with its counterpart in a reference dataset, allowing you to add this reference data to your preparation.

Tip: Talend recommends to use the lookup feature on datasets of up to 100,000 rows and 10 columns.

When using the lookup feature in Talend Cloud Data Preparation, column descriptions previously added in the lookup dataset via Talend Cloud Data Inventory will be displayed in the preparation.

In this example, you are working on a preparation that contains information about the American States your customers live in, but only in the form of a two-letter code. In parallel, you own a second dataset where the two-letter American State codes are matched with the full name of the state. You can use the lookup feature to retrieve this information and add it to your preparation.

Dataset containing American States in the form of a two-letter code.

Before you begin

A current limitation prevents the datasets based on the following connections to be used in a lookup:

Google Cloud Storage (Cloud file systems)
Azure Synapse (Database)
Google BigQuery (Database)
Google BigTable (Database)
Marketo (Business Apps)
Google Analytics (Business Apps)
NetSuite (Business Apps)
Workday (Business Apps)
Kafka (Messaging)
RabbitMQ (Messaging)

Procedure

Open your preparation.
Click the lookup button to open the lookup panel.
Click Select dataset to select an existing dataset.
Select the dataset you want to use to perform the lookup.

In this case, you need to add the dataset containing the list of two-letter state codes and their matching names. It must have been imported to Talend Cloud Data Preparation beforehand.
Click Select.
From the Current preparation and Lookup dataset drop-down list, select the columns matching in your main preparation and your reference dataset, the state_code and States Code columns in this example.

In order to perform a lookup, at least one column with matching data must be present in the preparation and dataset that you want to blend.
From the Columns to add drop-down list, select the column containing the states to add it to the current dataset, State in this example.
Choose to apply those changes only on the filtered rows or on all of the rows.
Click Submit. The state name for each matching state code between main preparation and reference dataset is retrieved.
The lookup feature behaves as follows:
- If a value is common to both your main preparation and the reference dataset, the additional information is retrieved.
- If a value exists in your preparation but not in the reference dataset, the lookup operation will leave an empty cell in the resulting column.
- If a value from the reference dataset does not find a match in the main preparation, the additional information will simply not be retrieved.
The logic behind the lookup operation in Talend Cloud Data Preparation corresponds to a left outer join when speaking in terms of Venn diagrams:

Applied to our example, it means that you have used the common information about the state codes to create a join, and retrieved only the useful information from the second dataset to enrich the first one.

Results

A new column containing the full state names is created with the result of the lookup. Each customer was matched with its corresponding state name when possible.

Dataset containing American States with full names.

Another behavior of the lookup feature, that was not illustrated in the previous example, is how the possible duplicate and empty values from a reference dataset are handled.

Let's take the following example, where one of the State codes is missing from your preparation, and your reference dataset has a duplicate entry for the NY State code. For the sake of the example, NY has two possible values, New York and Nueva York, but it could very well be the same for both entries.

The above result was achieved by applying the following behavior:

When a same value exists several times in the reference dataset, only the first occurrence is matched, Nueva York in this case.
An empty cell in your preparation will be matched with an empty cell.