
CDC with Spark in Big Data

This article shows a sample approach to implementing CDC (Change Data Capture) using Talend components.

CDC offers the same advantages in the big data world too, but the challenge with using CDC in Hadoop is that Hadoop is not ideal for data updates. Inserting data into Hive is simple, but updates and deletes are not. Because Hadoop is a distributed system where data is stored on multiple nodes across the network, the performance overhead of updating a record is huge.

One way to solve this issue is to create Hive base (internal) tables and Hive external tables, and build views on top of them. The base table holds all the data loaded up to the point when new records arrive, while the new and changed records are loaded into the external tables. Internal tables are typically used when the data is managed entirely within Hive, and external tables are used when the data is also used outside Hive.
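The merge semantics of that view can be illustrated in plain Python. This is a minimal sketch, not the article's actual Talend job or Hive DDL: the table contents, key column (`id`), and function name are all hypothetical. In Hive, the same logic would be a view that unions the base (internal) table with the external delta table and keeps the most recent row per key.

```python
# Simulates the "base table + external delta table merged by a view"
# pattern described above. All names and sample rows are illustrative.

def merge_base_with_delta(base_rows, delta_rows):
    """For each key, prefer the changed (delta) record over the base
    record -- the same result a reconciliation view would return."""
    merged = {row["id"]: row for row in base_rows}  # start from the base table
    for row in delta_rows:                          # apply new/changed records
        merged[row["id"]] = row
    return sorted(merged.values(), key=lambda r: r["id"])

# Base (internal) table: all data loaded so far.
base = [
    {"id": 1, "name": "Alice", "city": "Oslo"},
    {"id": 2, "name": "Bob",   "city": "Paris"},
]

# External table: only the records captured by the latest CDC run.
delta = [
    {"id": 2, "name": "Bob",   "city": "Berlin"},  # update to an existing row
    {"id": 3, "name": "Carol", "city": "Lisbon"},  # newly inserted row
]

for row in merge_base_with_delta(base, delta):
    print(row)
```

Because updates never touch the base table in place, this pattern sidesteps Hadoop's expensive record-level updates: a periodic compaction job can later fold the delta into a new base table.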
