Handling Lookups - 6.4

Talend Data Management Platform Studio User Guide

Talend Data Management Platform
Data Quality and Preparation
Design and Development
Talend Studio


For Big Data users only:

This feature is not available in the MapReduce version of tMap.

When implementing a join (including Inner Join and Left Outer Join) in a tMap between different data sources, there is always only one main flow and one or more lookup flows connected to the tMap. All the records of the lookup flow need to be loaded before processing each record of the main flow. Three types of lookup loading models are provided suiting various types of business requirement and the performance needs: Load once, Reload at each row, and Reload at each row (cache).

  • Load once: it loads once (and only once) all the records from the lookup flow either in the memory or in a local file before processing each record of the main flow in case the Store temp data option is set to true. This is the default setting and the preferred option if you have a large set of records in the main flow to be processed using a join to the lookup flow.

  • Reload at each row: it loads all the records of the lookup flow for each record of the main flow. Generally, this option increases the Job execution time due to the repeated loading of the lookup flow for each main flow record. However, this option is preferred in the following situations:

    • The lookup data flow is constantly updated and you want to load the latest lookup data for each record of the main flow to get the latest data after the join execution;

    • There are very few data from the main flow while a large amount of data from a database table in the lookup flow. In this case, it might cause an OutOfMemory exception if you use the Load once option. You can use dynamic variable settings such as where clause to update the lookup flow on the fly as it gets loaded, before the main flow join is processed. For an example, refer to Reloading data at each row.

  • Reload at each row (cache): it functions like the Reload at each row model, all the records of the lookup flow are loaded for each record of the main flow. However, this model can't be used with the Store temp data on disk option. The lookup data are cached in memory, and when a new loading occurs, only the records that are not already exist in the cache will be loaded, in order to avoid loading the same records twice. This option optimizes the processing time and helps improve processing performance of the tMap component. Note that you can not use Reload at each row (cache) and Store temp data at the same time.

Note that when your lookup is a database table, the best practise is to open the connection to the database in the beginning of your Job design in order to optimize performance.

Setting the loading mode of a lookup flow

To set the loading mode of a lookup flow:

  1. Click the tMap settings button at the top right of the lookup table to display the table properties.

  2. Click in the Value field corresponding to Lookup Model, and then click the [...] button to display the [Options] dialog box.

  3. In the [Options] dialog box, double-click the wanted loading mode, or select it and then click OK, to validate the setting and close the dialog box.

For use cases using these options, see the related documentation of the tMap component.

Reloading data at each row

The Reload at each row option is used to load all the records of a lookup flow for each record of the main flow.

When the main flow has much less rows than the lookup flow (for example, with a ratio of 1000 or more) and the lookup input is a database component, the advantage of this approach is that it helps deal with the fact that the amount of lookup data increases over time, since you can run queries against the data from the main flow in the database component to select only the lookup data that is relevant for each record in the main flow, such as in the following example which uses lookup data from a MySQL database.

The schemas of the main flow, the lookup flow and the output flow read as follows:

You can select from the MySQL database only the data that matches the values of the id column of the main flow. To do this, proceed as follows:

  1. Double-click tSetGlobalVar to open its Component view.

  2. Click the [+] button to add one row and name the Key to id and the Value to row1.id.

  3. Double-click tMysqlInput to open its Component view.

  4. In the Query field, enter the query to select the data that matches the id column of the main flow. In this example, this query reads:

    Select * from person where id="+(Integer)globalMap.get("id")

Refer to the related documentation of the components used in this example for more information.