Processing large lookup tables
This article shows a specific use case and discusses a way to handle the problem in Talend Studio.
In this scenario, you have a source table with hundreds of millions of records in the table, and this input data is used to look up data from a table which also has hundreds of millions of records in a different database. The source data combined with the lookup data will be inserted or updated into a target table.
- The source table and the lookup table have a common column that can be used in the join condition.
- Source and lookup tables are in different RDBM systems.
You have a simple job which reads source data from the cust table and looks up the CUST_LOCATIONS table using the location_id column. This is being done in tMap.
If you run this job, it runs out of memory while trying to load the whole lookup data (70 million rows) into memory. This job works for small loads, but if you have hundreds of millions of records in the source table and the lookup table, it is very inefficient.
It is recommended to set the Store temp data to true if you have large lookups.
But it is still not enough in this scenario.
Batch processing large lookup tables
This article explains how to use batch processing to handle large lookup tables.
One way to approach this issue is to use a process called Batching. Batching allows a batch of the records to be processed in a single run, which is done iteratively to create more batches and process all the records.
Each iteration will process and extract a fixed number of records from the source and the lookup tables, perform a join and load the target table.
This way you can control the number of records (batchrec variable) that the process holds in the memory.
To do this you can use Context variables, tLoop component, tContextLoad component, tMap and tJava component.
Use context variables to make the job dynamic. Different values
can be used for context variables in different environments such as DEV and PROD. This also
makes the job flexible and it does not need any code changes across environments.
Recnum This variable is used as a starting point of the run. It is set to "1" by default for this job. Batchrec This variable is used as the number of the records that you want the job to process in each iteration. By making this variable dynamic, you can control how many records you want to process for each batch. MAX_CUSTID This variable is used to stop the process after the last iteration, which is after reading the last record. This variable is loaded using the tContextLoad component.
And the SQL run is as follows:
"SELECT 'MAX_CUSTID' as Key, max(id) as Value FROM `customer`.`cust` "
Use the sssql and lsql variables to build the SQL that the tMysqlInput and tMSSqlInput executes on the database. The columns chosen in ssqlCOLUMNS and lsqlCOLUMNS should be the same as the schema defined in the input components.
For example, the query in the customer (tMySqlInput) component is defined as below, making the full query dynamic.
" SELECT " + context.ssqlCOLUMNS + context.ssqlFROM + context.ssqlWHERE + ">=" + context.Recnum + context.ssqlAND + "<" + (context.Recnum + context.Batchrec) ;
A similar SQL query is defined on the CUST_LOCATIONS (tMSSqlInput) component.
Configure the tLoop component as follows.
The context variables defined above are used in the tLoop component as shown below. So for each iteration until the maximum customer id, the batch record number defined in the Batchrec variable is used to retrieve the records from the source and lookup tables.
Configure the tJava component as follows.
This component is used to increment the starting cust_id by the number of the records being processed by each batch.
Configure the tMap component as follows.
This component is used for the join condition. The Lookup Model is set as Load Once as your lookup table is static. Match Model is set as Unique match as you do not expect the lookup table to have duplicates. Join Model is set as Left Outer Join as you want source data to be loaded into the target even if a location is not found.
Connect the subjobs using OnSubjobOk and OnComponentOk triggers.
Do not connect tLoop directly to the tMysqlInput component. And do not connect tJava (Update record counter) directly to tMysqlOutput_1. The tLoop and tJava need to be part of their own subjobs.
The complete job looks as shown in the screenshot below.
- Run the Job
Here is the log from a sample run where the source table (cust) has 100 million rows and the lookup table (CUST_LOCATIONS) has 70 million rows. Batchrec="10000000".
The job took 66 minutes to run and ran within the memory available on the execution server. It did not get impacted with out of memory exceptions.
Best Practices when processing large lookup tables
- Run profiling to understand the main and lookup data. Using Talend Data profiling, understand the patterns and the keys by running analysis such as Column analysis, Redundancy analysis etc.
- Use the smaller table as the lookup. Talend reads the lookup into the memory when the job is started, so the smaller the size of lookup the faster the job can process the lookup data.
- Load the minimal number of columns needed in the lookup table to optimise memory usage. Remove all unused columns.
Use blocking keys if your lookup table is big but the lookup keys don’t change much.
Load the lookup keys based on Data Analysis into a table as an initial step and use this table as a lookup table in the actual lookup job. This way only a subset of the records are being used in the lookup. Blocking keys could be a concatenation of columns. Analysis can be done on the blocking keys to understand the distribution which defines the number of lookups that will be done and the number of rows in each lookup. For more information on how blocking keys can be used in match analysis, see Creating a match analysis. In this particular scenario, you are not doing complex matches. However, a blocking key is a good way to reduce the dataset you are loading in memory for looking up. You can design your own custom blocking key.
- If the source tables and lookup tables are in the same database, consider using the database joins for the lookup. The selected columns with the join condition can be added to the input component. The schema would need to be updated accordingly.
- Use ELT (tELT*) components if the lookup and source tables are in the same database. This might be a good option if the database has enough resources to handle the Talend SQL, which is automatically generated. Also, this is a good option if the transformation needed can all be handled by the database.
- Do not use the "Reload for every row" option for large lookups. This setting needs to be used when lookup table is changing, as the job is still running. Use the setting if the job needs to load the latest changes in the lookup table.
- Understand the Infrastructure that the Talend lookup job will run on. It will be helpful to know how busy the systems will be when the job is running. Knowing the different processes or other Talend jobs running on Talend JobServer and scheduling the memory intensive jobs such as lookups is a best practice.