Processing large lookup tables
This article shows a specific use case and discusses a way to handle the problem in Talend Studio.
In this scenario, you have a source table with hundreds of millions of records, and this input data is used to look up data in a table, also containing hundreds of millions of records, in a different database. The source data combined with the lookup data is then inserted or updated into a target table.
Assumptions
- The source table and the lookup table have a common column that can be used in the join condition.
- The source and lookup tables are in different RDBMS (relational database management systems).
Problem Description
You have a simple job that reads source data from the cust table and looks up the CUST_LOCATIONS table using the location_id column. The lookup is done in tMap.
If you run this job, it runs out of memory while trying to load the whole lookup table (70 million rows) into memory. The job works for small loads, but with hundreds of millions of records in the source and lookup tables it is very inefficient.
Setting Store temp data to true is recommended for large lookups, but it is still not enough in this scenario.
Batch processing large lookup tables
This article explains how to use batch processing to handle large lookup tables.
One way to approach this issue is to use a process called batching. Batching processes a fixed-size batch of records in a single run, and the run is repeated iteratively until all the records have been processed.
Each iteration extracts and processes a fixed number of records from the source and lookup tables, performs the join, and loads the target table.
This way you can control the number of records (the batchrec variable) that the process holds in memory.
To do this, you can use context variables and the tLoop, tContextLoad, tMap, and tJava components.
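The sketch below illustrates the batching idea outside of Talend Studio. The variable and table names (batchrec, cust, CUST_LOCATIONS, location_id) follow this scenario, but the loop structure, the paging syntax (LIMIT/OFFSET), and the column names are assumptions about how you might drive the iteration, for example from a tLoop and tJava pair feeding the query of the input component.

```java
// Minimal sketch of the batching idea: each pass extracts at most batchrec
// rows from the source, restricts the lookup to the matching rows, joins,
// and loads the target, so only one batch is held in memory at a time.
// Table, column, and query details are assumptions for illustration.
public class BatchedLookupSketch {

    public static void main(String[] args) {
        long totalSourceRows = 100_000_000L;         // e.g. from a count on cust
        long batchrec = 10_000_000L;                 // context variable: rows per batch
        long iterations = (totalSourceRows + batchrec - 1) / batchrec;

        for (long i = 0; i < iterations; i++) {
            long offset = i * batchrec;

            // Query the source input component would run for this iteration
            String sourceQuery =
                "SELECT cust_id, location_id, name FROM cust "
              + "ORDER BY cust_id LIMIT " + batchrec + " OFFSET " + offset;

            // Restrict the lookup so only a subset of CUST_LOCATIONS is loaded
            String lookupQuery =
                "SELECT location_id, city, country FROM CUST_LOCATIONS "
              + "WHERE location_id IN (/* location_ids from the current batch */)";

            System.out.println("Batch " + (i + 1) + "/" + iterations);
            System.out.println("  source: " + sourceQuery);
            System.out.println("  lookup: " + lookupQuery);
            // ... perform the join (tMap) and insert/update the target here ...
        }
    }
}
```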
Procedure
Results
Here is the log from a sample run where the source table (cust) has 100 million rows and the lookup table (CUST_LOCATIONS) has 70 million rows, with batchrec set to 10000000.
The job took 66 minutes to run and stayed within the memory available on the execution server; it was not affected by out-of-memory exceptions.
Best Practices when processing large lookup tables
- Run profiling to understand the main and lookup data. Use Talend data profiling to understand the patterns and the keys by running analyses such as column analysis and redundancy analysis.
- Use the smaller table as the lookup. Talend reads the lookup into memory when the job starts, so the smaller the lookup, the faster the job can process the lookup data.
- Load the minimal number of columns needed in the lookup table to optimise memory usage. Remove all unused columns.
- Use blocking keys if your lookup table is big but the lookup keys don’t change much.
As an initial step, load the lookup keys, selected based on data analysis, into a table, and use that table as the lookup in the actual lookup job. This way only a subset of the records is used in the lookup. A blocking key can be a concatenation of columns. Analyzing the blocking keys shows their distribution, which determines the number of lookups that will be done and the number of rows in each lookup. For more information on how blocking keys can be used in match analysis, see Creating a match analysis. In this particular scenario, you are not doing complex matches; however, a blocking key is a good way to reduce the dataset you load into memory for the lookup. You can design your own custom blocking key (see the sketch after this list).
- If the source and lookup tables are in the same database, consider using a database join for the lookup. The selected columns with the join condition can be added to the input component, and the schema needs to be updated accordingly.
- Use ELT (tELT*) components if the lookup and source tables are in the same database. This can be a good option if the database has enough resources to handle the SQL that Talend generates automatically, and if all the required transformations can be handled by the database.
- Do not use the "Reload for every row" option for large lookups. This setting is meant for cases where the lookup table changes while the job is running and the job needs to pick up the latest changes in the lookup table.
- Understand the infrastructure that the Talend lookup job will run on. It is helpful to know how busy the systems will be when the job is running. Knowing which other processes or Talend jobs run on the Talend JobServer, and scheduling memory-intensive jobs such as lookups accordingly, is a best practice.
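As mentioned in the blocking key best practice above, a custom blocking key can be as simple as a small routine or tMap expression that concatenates fragments of columns. The sketch below is one possible shape for such a routine; the column choices (country code and postal code prefix) are hypothetical and should come from your own column and redundancy analyses.

```java
// Sketch of a custom blocking key: concatenate low-cardinality fragments of
// columns so that only lookup rows sharing the same key need to be loaded
// for a given batch. Column choices here are hypothetical examples.
public class BlockingKeyRoutine {

    public static String blockingKey(String countryCode, String postalCode) {
        String country = countryCode == null ? "" : countryCode.trim().toUpperCase();
        String postal  = postalCode  == null ? "" : postalCode.trim();
        // Keep only the first three characters of the postal code so each block stays small
        String postalPrefix = postal.length() > 3 ? postal.substring(0, 3) : postal;
        return country + "_" + postalPrefix;
    }

    public static void main(String[] args) {
        System.out.println(blockingKey("us", "94105-1234")); // prints US_941
        System.out.println(blockingKey("FR", "75008"));      // prints FR_750
    }
}
```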