Processing large lookup tables - 6.1

Processing large lookup tables

author
Irshad Burtally
EnrichVersion
6.4
6.3
6.2
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Cloud
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Processing components (Integration)
Data Quality and Preparation > Third-party systems > Processing components (Integration)
Design and Development > Third-party systems > Processing components (Integration)
EnrichPlatform
Talend Studio

This article shows a specific use case and discusses a way to handle the problem in Talend Studio.

In this scenario, you have a source table with hundreds of millions of records in the table, and this input data is used to look up data from a table which also has hundreds of millions of records in a different database. The source data combined with the lookup data will be inserted or updated into a target table.

Assumptions

  1. The source table and the lookup table have a common column that can be used in the join condition.
  2. Source and lookup tables are in different RDBM systems.

Problem Description

You have a simple job which reads source data from the cust table and looks up the CUST_LOCATIONS table using the location_id column. This is being done in tMap.

If you run this job, it runs out of memory while trying to load the whole lookup data (70 million rows) into memory. This job works for small loads, but if you have hundreds of millions of records in the source table and the lookup table, it is very inefficient.

It is recommended to set the Store temp data to true if you have large lookups.

But it is still not enough in this scenario.