CDC with Spark in Big Data - Cloud - 8.0

Change Data Capture

Version: Cloud 8.0
Language: English
Product:
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module: Talend Studio
Content:
Data Governance > Third-party systems > Database components (Integration) > Change Data Capture
Data Quality and Preparation > Third-party systems > Database components (Integration) > Change Data Capture
Design and Development > Third-party systems > Database components (Integration) > Change Data Capture
Last publication date: 2024-02-20

This article shows a sample approach to performing CDC using Talend components.

CDC offers the same advantages in the big data world. The challenge with using CDC in Hadoop is that Hadoop is not designed for data updates. Inserting data into Hive is straightforward, but updates and deletes are not. Because Hadoop is a distributed system in which data is stored on multiple nodes across the network, the performance overhead of updating a single record is significant.

One way to address this is to create Hive base (internal) tables and Hive external tables, and to build views on top of them. The base table holds all of the existing data, while new and changed records are loaded into the external tables. Internal tables are typically used when the data is temporary, whereas external tables are used when the data also needs to be accessed outside Hive. A sketch of this approach is shown below.
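
As an illustration of this pattern, the following minimal sketch uses Spark SQL with Hive support to create a base (managed) table, an external table for incoming changed records, and a reconciliation view that returns the latest version of each record. The database name (cdc_demo), table names (customers_base, customers_delta), column names, and HDFS location are hypothetical and only serve to illustrate the idea; adapt them to your own environment and to the schema produced by your Talend Jobs.

import org.apache.spark.sql.SparkSession

object CdcHiveTablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CDC base/external tables sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical database used for this illustration.
    spark.sql("CREATE DATABASE IF NOT EXISTS cdc_demo")

    // Base (internal/managed) table: holds the full set of existing records.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS cdc_demo.customers_base (
        |  customer_id INT,
        |  name STRING,
        |  city STRING,
        |  last_modified TIMESTAMP
        |) STORED AS ORC""".stripMargin)

    // External table: points at the directory where new and changed records land.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS cdc_demo.customers_delta (
        |  customer_id INT,
        |  name STRING,
        |  city STRING,
        |  last_modified TIMESTAMP
        |) STORED AS ORC
        |LOCATION '/data/cdc_demo/customers_delta'""".stripMargin)

    // Reconciliation view: merge base and delta records and, for each key,
    // keep only the row with the most recent modification timestamp.
    spark.sql(
      """CREATE VIEW IF NOT EXISTS cdc_demo.customers_current AS
        |SELECT customer_id, name, city, last_modified FROM (
        |  SELECT *,
        |         ROW_NUMBER() OVER (PARTITION BY customer_id
        |                            ORDER BY last_modified DESC) AS rn
        |  FROM (
        |    SELECT * FROM cdc_demo.customers_base
        |    UNION ALL
        |    SELECT * FROM cdc_demo.customers_delta
        |  ) merged
        |) ranked
        |WHERE rn = 1""".stripMargin)

    spark.stop()
  }
}

The view does not modify any data in HDFS; it merges the base and delta tables at query time and keeps only the most recent record per key, which gives consumers an up-to-date picture without the cost of updating records in place.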