Best Practices when processing large lookup tables - 7.0

Data Integration Job Examples

EnrichVersion
7.0
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio
task
Design and Development > Designing Jobs
This article describes some best practices to adopt when designing Talend Jobs to process hundreds of millions of records.
  • Run profiling to understand the main and lookup data. Using Talend data profiling, identify the patterns and the keys by running analyses such as Column Analysis and Redundancy Analysis.
  • Use the smaller table as the lookup. Talend reads the lookup table into memory when the Job starts, so the smaller the lookup, the faster the Job can process it.
  • Load only the columns needed from the lookup table to optimize memory usage. Remove all unused columns.
  • Use blocking keys if your lookup table is large but the lookup keys do not change often.

    As an initial step, load the lookup keys identified through data analysis into a table, then use this table as the lookup table in the actual lookup Job. This way, only a subset of the records is used in the lookup. A blocking key can be a concatenation of columns. Analyze the blocking keys to understand their distribution, which determines the number of lookups performed and the number of rows in each lookup. For more information on how blocking keys can be used in match analysis, see Creating a match analysis. In this particular scenario, you are not performing complex matches; however, a blocking key is a good way to reduce the dataset loaded into memory for lookups. You can design your own custom blocking key.
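The idea above can be sketched in a few lines of Python. This is an illustrative sketch only, not Talend-generated code: the column names (last_name, zip), the sample data, and the choice of blocking key (first letter of the last name plus ZIP code) are all hypothetical.

```python
from collections import Counter

def blocking_key(record):
    # Hypothetical blocking key: concatenate the first letter of the
    # last name with the ZIP code.
    return record["last_name"][:1].upper() + "|" + record["zip"]

# Full lookup table (in a real Job this could be hundreds of millions of rows).
lookup_table = [
    {"last_name": "Smith",  "zip": "75001", "customer_id": 1},
    {"last_name": "Smythe", "zip": "75001", "customer_id": 2},
    {"last_name": "Jones",  "zip": "10001", "customer_id": 3},
]

# Blocking keys actually present in the main data, found through prior
# data analysis.
main_blocks = {"S|75001"}

# Load only the lookup rows whose blocking key appears in the main data.
reduced_lookup = [r for r in lookup_table if blocking_key(r) in main_blocks]

# Distribution of rows per block, useful for checking block sizes
# before running the Job.
distribution = Counter(blocking_key(r) for r in lookup_table)
```

Here only two of the three lookup rows are kept in memory; on a realistic dataset, a well-chosen blocking key can shrink the in-memory lookup by orders of magnitude.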

  • If the source tables and lookup tables are in the same database, consider using database joins for the lookup. Add the selected columns with the join condition to the input component, and update the component's schema accordingly.
  • Use ELT (tELT*) components if the lookup and source tables are in the same database. This is a good option if the database has enough resources to handle the SQL that Talend generates automatically, and if all the required transformations can be handled by the database.
  • Do not use the "Reload at each row" option for large lookups. This setting is intended for cases where the lookup table changes while the Job is running; use it only when the Job needs to pick up the latest changes in the lookup table.
  • Understand the infrastructure that the Talend lookup Job will run on. It helps to know how busy the systems will be when the Job is running. As a best practice, know which other processes and Talend Jobs run on the Talend JobServer, and schedule memory-intensive Jobs such as lookups accordingly.
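The database-join approach can be illustrated with a minimal Python sketch. The table and column names (orders, customers, customer_id) are hypothetical, and sqlite3 stands in for whatever shared database the source and lookup tables live in; in Talend Studio, the equivalent is putting the join in the input component's query, or using tELT* components.

```python
# Minimal sketch of pushing the lookup join down to the database instead
# of loading the lookup table into the Job's memory.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders    VALUES (100, 1, 25.0), (101, 2, 40.0);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
""")

# The join condition and the selected columns go into the input
# component's query; the component's schema must list the same columns
# in the same order.
query = """
    SELECT o.order_id, o.amount, c.name
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
"""
rows = conn.execute(query).fetchall()
```

With this pattern, the database performs the join, so only the already-enriched rows flow through the Job and no lookup table is held in memory.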