When implementing a join (including Inner Join and Left Outer Join) in a tMap between different data sources, there is always only one main flow and one or more lookup flows connected to the tMap. All the records of the lookup flow need to be loaded before processing each record of the main flow. Three types of lookup loading models are provided suiting various types of business requirement and the performance needs: Load once, Reload at each row, and Reload at each row (cache).
- Load once: it loads once (and only once) all the records from the lookup flow either in the memory or in a local file before processing each record of the main flow in case the Store temp data option is set to true. This is the default setting and the preferred option if you have a large set of records in the main flow to be processed using a join to the lookup flow.
Reload at each row: it loads all the records of the lookup flow
for each record of the main flow. Generally, this option increases the Job execution
time due to the repeated loading of the lookup flow for each main flow record.
However, this option is preferred in the following situations:
- The lookup data flow is constantly updated and you want to load the latest lookup data for each record of the main flow to get the latest data after the join execution;
- There are very few data from the main flow while a large amount of data from a database table in the lookup flow. In this case, it might cause an OutOfMemory exception if you use the Load once option. You can use dynamic variable settings such as where clause to update the lookup flow on the fly as it gets loaded, before the main flow join is processed. For an example, refer to Reloading data at each row.
Reload at each row (cache): it functions like the
Reload at each row model, all the records of the lookup
flow are loaded for each record of the main flow. However, this model can't be used
with the Store temp data on disk option. The lookup data are
cached in memory, and when a new loading occurs, only the records that are not
already exist in the cache will be loaded, in order to avoid loading the same
records twice. This option optimizes the processing time and helps improve
processing performance of the tMap component. Note that you
can not use Reload at each row (cache) and Store
temp data at the same time.
Note that Reload at each row (cache) in a streaming Job is supported by the Lookup Input components only such as tMongoDBLookupInput.
Note that when your lookup is a database table, the best practise is to open the connection to the database in the beginning of your Job design in order to optimize performance.