In order to adapt to the multiple processing types as well as to address performance issues, the tMap component supports different lookup loading modes.
Load once: Default setting. Select this option to load the entire lookup flow before processing the main flow. This is the preferred option if you have a great number of data from your main flow that needs to be requested in your lookup, or if your reference (or lookup) data comes from a file that can be easily loaded.
Reload at each row: At each row, the lookup gets loaded again. This is mainly interesting in Jobs where the lookup volume is large, while the main flow is pretty small. Note that this option allows you to use dynamic variable settings such as where clause, to change/update the lookup flow on the fly as it gets loaded, before the main flow join is processed. This option could be considered as the counter-part of the Store temp data option that is available for file lookups.
Reload at each row (cache): Expressions (in the Lookup table) are assessed and looked up in the cache first. The results of joins that have already been solved, are stored in the cache, in order to avoid loading the same results twice. This option optimizes the processing time and helps improve processing performance of the tMap component.
Note that for the time being, you cannot use Reload at each row (cache) and Store temp data at the same time.
To set the loading mode of a lookup flow:
Click the tMap settings button at the top of the lookup table to display the table properties.
Click in the Value field corresponding to Lookup Model, and then click the [...] button to display the [Options] dialog box.
In the [Options] dialog box, double-click the wanted loading mode, or select it and then click OK, to validate the setting and close the dialog box.
For use cases using these options, see the tMap section of Talend Open Studio for Big Data Components Reference Guide.
When your lookup is a database table, the best practise is to open the connection to the database in the beginning of your job design in order to optimize performance.
As explained above, the Reload at each row option is used to read all the records of a lookup flow for each record of the main flow. In general, this approach increases the time of Job execution, however, it becomes recommended if the main flow and lookup flow meet the following conditions:
The main flow has much less rows than the lookup flow (for example, with a ratio of 1000 or more).
The lookup input is a database component, such as tMysqlInput.
The advantage of this approach, with both conditions satisfied, is that it helps deal with the fact that the amount of lookup data increases over time, since you can run queries against the data from the main flow in the database component in order to select only that lookup data which is assumed to be relevant for each record in the main flow, such as in the following example which uses lookup data from a MySQL database.
The schemas of the main flow, the lookup flow and the output flow read as follows:
You can select from the MySQL database only the data that matches the values of the id column of the main flow. To do this, proceed as follows:
Double-click tSetGlobalVar to open its Component view.
Click the [+] button to add one row and name the Key to id and the Value to row1.id.
Double-click tMysqlInput to open its Component view.
In the Query field, enter the query to select the data that matches the id column of the main flow. In this example, this query reads:
Select * from person where id="+(Integer)globalMap.get("id")
For further information about the components used in this example, see Talend Open Studio for Big Data Components Reference Guide.