CDC with Spark in Big Data

Change Data Capture

author
Talend Documentation Team
EnrichVersion
6.4
EnrichProdName
Talend Data Services Platform
Talend Data Integration
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
Talend Data Management Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
task
Data Quality and Preparation > Third-party systems > Database components > Change Data Capture
Data Governance > Third-party systems > Database components > Change Data Capture
Design and Development > Third-party systems > Database components > Change Data Capture
EnrichPlatform
Talend Studio

This article shows a sample approach to implementing CDC using Talend components.

CDC has the same advantages in the big data world. The challenge with using CDC in Hadoop, however, is that Hadoop is not well suited to data updates. Inserting data is simple in Hive, but updates and deletes are not. Because Hadoop is a distributed system in which data is stored on multiple nodes across the network, the performance overhead of updating a record is huge.

One way to solve this issue is to create Hive base (internal) tables and Hive external tables, and to build views on top of them. The base table holds all the data up to the point at which new records are loaded. The new and changed records are loaded into the external tables. Internal tables are typically used when the data is temporary, and external tables are used when the data in the tables is also used outside Hive.
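The idea of a view built on top of a base table and a table of changed records can be sketched as follows. This is only a minimal illustration, not the Talend Job itself. It uses Python's built-in sqlite3 module as a stand-in for Hive so that it runs without a cluster. The table names (base_table, incremental_table), the view name (reconcile_view), and the columns (id, name, modified_at) are assumptions made for this example. The rule that the row with the latest modified_at wins for each id is also an assumption about how the view resolves changes.

```python
import sqlite3


def reconcile_demo():
    """Return the merged rows exposed by a view over base and changed data."""
    # In-memory SQLite database standing in for Hive (assumption for this sketch).
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical base table: data loaded before the current change set.
        CREATE TABLE base_table (id INTEGER, name TEXT, modified_at TEXT);
        -- Hypothetical table of new and changed records.
        CREATE TABLE incremental_table (id INTEGER, name TEXT, modified_at TEXT);

        -- View that keeps, for each id, the row with the latest modified_at.
        CREATE VIEW reconcile_view AS
        SELECT u.id, u.name, u.modified_at
        FROM (
            SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table
        ) AS u
        JOIN (
            SELECT id, MAX(modified_at) AS max_modified
            FROM (
                SELECT * FROM base_table
                UNION ALL
                SELECT * FROM incremental_table
            )
            GROUP BY id
        ) AS latest
        ON u.id = latest.id AND u.modified_at = latest.max_modified;
        """
    )
    # Existing records in the base table.
    conn.executemany(
        "INSERT INTO base_table VALUES (?, ?, ?)",
        [(1, "Alice", "2017-01-01"), (2, "Bob", "2017-01-01")],
    )
    # One updated record (id 2) and one new record (id 3).
    conn.executemany(
        "INSERT INTO incremental_table VALUES (?, ?, ?)",
        [(2, "Robert", "2017-02-01"), (3, "Carol", "2017-02-01")],
    )
    rows = conn.execute("SELECT * FROM reconcile_view ORDER BY id").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in reconcile_demo():
        print(row)
```

Readers that query the view see the updated version of record 2 and the new record 3, while neither underlying table is updated in place. This is the property that makes the approach attractive on Hadoop.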