
What is change data capture (CDC)?

Understand change data capture for low-latency pipelines. Compare timestamp, trigger, and log-based methods, explore use cases, and see how Aerospike enables reliable streaming.

June 26, 2025 | 17 min read
Alexander Patino
Solutions Content Leader

Change data capture (CDC) is a data integration pattern for tracking and capturing changes made to a database, including inserts, updates, and deletes. Instead of repeatedly bulk-transferring entire tables, CDC continuously records incremental changes, often called a CDC feed, and makes them available for processing. By operating on this stream of changes rather than full data extracts, systems propagate updates faster and keep data stores in sync in near real-time. In essence, with CDC one system, the source, publishes its data modifications so that other systems can subscribe and react to those changes.

This approach has become fundamental in data architectures. CDC is widely used across data integration platforms to support real-time analytics, so multiple systems remain consistent with each other. By continuously capturing and replicating data changes as they happen, CDC supports data migration, feeds event-driven microservices, and bridges on-premises and cloud environments. In short, CDC provides a reliable mechanism to stream live updates from a source database to downstream systems that need those updates.

Why change data capture matters

Traditional batch Extract, Transform, Load (ETL) processes often resulted in delays and data silos where changes in a production database might only reach a data warehouse or secondary system hours later, during a nightly batch window. CDC addresses these limitations by supporting real-time data flow. Some of the benefits of change data capture include:

Eliminating batch delays 

Instead of bulk reloads during off-hours, CDC allows incremental or streaming updates. This avoids downtime or read-locks on the source and removes the need for inconvenient batch windows. Data in the target system is kept up-to-date continuously rather than catching up overnight.

Minimal impact on sources

The most efficient form of CDC is log-based, meaning it reads from the database’s transaction logs. This method reduces overhead on the source database, preserving its performance and resources. The data source doesn’t have to do extra work for CDC beyond what it already logs for recovery.

Zero-downtime migrations 

CDC supports live database migrations and upgrades. Because changes are streamed in real time, a new database or cloud data store can be kept in sync with the old one and switched over with minimal or no downtime. This is important for organizations that can’t afford prolonged maintenance windows.

Multi-system synchronization 

Keeping a continuous feed of changes means multiple downstream systems, such as caches, search indexes, and analytics databases, stay synchronized with the source. This synchronization is important for consistency in high-speed environments; users always see the latest data, whether they’re hitting the primary database or a derived system.

Designed for today’s architectures

Change data capture is well-suited for cloud and streaming use cases. It efficiently moves data across networks and into stream-processing platforms such as Kafka, supporting cloud-based analytics and event-driven applications. CDC enables streaming data pipelines, supporting uses from real-time dashboards to machine learning features.

CDC provides timely, efficient, and consistent data propagation across systems. It helps businesses react to data changes more quickly, supporting real-time user experiences, up-to-date analytics, and consistency in distributed applications.

Webinar: High throughput, real time ACID transactions at scale with Aerospike 8.0

Experience Aerospike 8.0 in action and see how to run real-time, high-throughput ACID transactions at scale. Learn how to ensure data integrity and strong consistency without sacrificing performance. Watch the webinar on demand today and take your applications to the next level.

How change data capture works

At a high level, implementing CDC requires capturing change events from the source database and delivering them to targets. There are two broad strategies to capture and propagate changes: push vs. pull. 

  • In a push model, the source system sends each data change to a downstream message queue or service as soon as it happens. Targets get updates more quickly, but the source does more work and needs buffering so events aren’t lost if a target is down.

  • In a pull model, the source records that a change occurred, perhaps by marking rows with a last-updated timestamp or writing to a journal table, and external processes periodically poll and fetch the new changes. The pull approach puts less load on the source, but target systems see changes with some lag because they’re fetched in batches. 

In practice, CDC solutions lean toward push-based streaming with a reliable intermediary. The source captures each change via its transaction log or triggers and pushes it to a messaging system or change queue. This reduces latency, so changes flow out in near real-time, while the message broker provides durability so no updates are lost if a target is temporarily unavailable. 

Pull-based CDC or polling might be used for simpler setups or when slight delays are acceptable, but it’s generally slower. The choice may depend on needs: if an application requires immediate update propagation, a push model with proper buffering is preferred, while for something such as a large periodic data transfer, or where ultra-low latency isn’t critical, a pull (batch) model could suffice.
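To make the contrast concrete, here is a minimal Python sketch of the two models. It is illustrative only: the record shape, the in-memory queue standing in for a durable broker, and the function names are all hypothetical.

```python
import time
import queue

# --- Push model: the write path publishes each change as it happens ---
change_queue = queue.Queue()  # stand-in for a durable broker such as Kafka

def write_record(db: dict, key: str, value: str) -> None:
    db[key] = value
    # Push: emit the change immediately; a real system would publish to a
    # broker that buffers events if a target is temporarily down.
    change_queue.put({"op": "upsert", "key": key, "value": value, "ts": time.time()})

# --- Pull model: an external process periodically polls for new changes ---
def poll_changes(db_changelog: list, last_seen_ts: float):
    # Pull: fetch only entries recorded since the last poll; targets see
    # changes with up to one polling interval of lag.
    new_events = [e for e in db_changelog if e["ts"] > last_seen_ts]
    latest = max((e["ts"] for e in new_events), default=last_seen_ts)
    return new_events, latest
```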

Common CDC capture methods

Here are some techniques to detect and capture data changes in the source system:

Timestamp columns (incremental pull)

The simplest approach is to add a timestamp field, such as “last_modified,” to each row and update it on every change. A CDC process periodically queries for rows where this timestamp is greater than the last time it ran. 

While simple, this method has notable limitations: it misses hard-deleted rows because a deleted row is gone and cannot be queried by timestamp, and it adds overhead because it requires scanning tables or indices for changes since a given time. It also requires modifying the database schema to add these columns. This approach is often used in a basic data pipeline but is less ideal for high-volume real-time CDC.
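As a rough sketch, a timestamp-based poller might look like the following Python snippet. It assumes a hypothetical orders table with a last_modified column and uses SQLite purely for illustration; note that hard-deleted rows never show up in this query.

```python
import sqlite3

def fetch_changes_since(conn: sqlite3.Connection, last_run_ts: str):
    """Poll for rows whose last_modified is newer than the previous run."""
    rows = conn.execute(
        "SELECT id, status, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_run_ts,),
    ).fetchall()
    # Remember the newest timestamp seen so the next poll starts from here.
    new_watermark = rows[-1][2] if rows else last_run_ts
    return rows, new_watermark
```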

Trigger-based CDC

Most SQL databases support triggers, which are custom procedures that execute when data is inserted, updated, or deleted. In a trigger-based CDC setup, you create triggers on each table, or on each type of operation per table, to capture the change event. The trigger might write an entry to a separate “change log” table or message queue detailing the operation. This method captures all types of changes, including deletes, because the trigger logs a delete event and propagates it without polling.

However, triggers add overhead to the source database because every write now performs multiple writes for each original change, which slows down the primary workload. Triggers are also more complex and require management, especially across many tables. For these reasons, trigger-based CDC is less favored for large-scale systems, but it is useful when log-based CDC is not available.
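The following sketch shows the idea using SQLite triggers driven from Python. The orders table and change_log schema are hypothetical, and a production setup on another database would use that engine’s own trigger syntax; note that, unlike the timestamp approach, the delete is captured too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE change_log (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    op         TEXT,
    id         INTEGER,
    status     TEXT,
    changed_at TEXT DEFAULT (datetime('now'))
);

-- Each trigger appends the change event to the change_log table.
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO change_log (op, id, status) VALUES ('INSERT', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO change_log (op, id, status) VALUES ('UPDATE', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO change_log (op, id, status) VALUES ('DELETE', OLD.id, OLD.status);
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'created')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")
print(conn.execute("SELECT * FROM change_log").fetchall())
```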

Log-based CDC

Transaction log-based capture is the most advanced and low-impact method. In any transactional database, every change is recorded in a sequential transaction log, such as a write-ahead log, redo log, or binlog, for recovery purposes. Log-based CDC taps into these logs to read the changes as they are committed. Because it reads from a log the database already maintains, this approach adds little load to the database itself: it doesn’t lock any tables or execute extra code on data changes. Log-based CDC captures all inserts, updates, and deletes and handles multi-operation transactions consistently, because the log records the order of operations.

The advantages are clear: no schema changes needed, no additional writes on the primary database, and complete change capture. The main downside is that parsing database logs is complex. Every database vendor has its own log format and API, so change data capture tools must be tailored to each data management system. 

Additionally, the CDC system needs to filter out or reconcile changes that were rolled back or never committed so they don’t appear as false events. Despite these challenges, log-based CDC is the de facto choice for high-throughput CDC systems because it provides the best balance of real-time capture with minimal performance impact on the source.
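In practice, downstream applications rarely parse database logs themselves; they consume change events that a log-based connector, such as Debezium, has published to a broker. The sketch below assumes that kind of setup, with a hypothetical topic name and the common before/after/op envelope; exact field layout varies by connector and configuration.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to a topic that a log-based CDC connector (e.g., Debezium reading
# a MySQL binlog or Postgres WAL) publishes change events to.
consumer = KafkaConsumer(
    "inventory.public.orders",                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")) if b is not None else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    if message.value is None:                   # tombstone records carry no payload
        continue
    event = message.value.get("payload", message.value)
    op = event.get("op")                        # 'c' = create, 'u' = update, 'd' = delete
    if op == "d":
        print("delete:", event.get("before"))
    else:
        print("upsert:", event.get("after"))
```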

Timestamp-based and trigger-based CDC were early solutions but come with tradeoffs in completeness and overhead. Today, log-based CDC is preferred for robust, low-latency change capture. Many enterprise change data capture tools and database-built-in CDC features use log-based capture, because it supports high volume changes without throttling the source database.

CDC use cases

CDC has broad applications wherever data needs to be replicated or synchronized in real time. Some of the most important examples include:

Continuous data replication and syncing

One classic use of CDC is to keep a replica or secondary database continuously in sync with a primary database. Rather than copying the entire database at intervals, which requires heavy downtime or locking, CDC feeds a steady stream of changes to a replica system. This is essential for high availability setups, cross-region replicas, or migrating data between storage systems. 

For example, for zero-downtime migration, CDC replicates ongoing changes from an old system into a new system until they are synchronized. Users and applications continue writing to the source during this process, and CDC keeps the target up-to-date. By replicating just the changes, CDC uses less bandwidth and time than full dumps. The result is an up-to-date copy of data on the target side, supporting active-passive failovers, read-only reporting databases, or cloud migrations.

Event-driven microservices integration

As monolithic applications break into microservices, services need to share changed data without tightly coupling them. CDC turns database changes into events that other services consume. 

For instance, if an Orders service updates an order status in its database, a CDC pipeline captures that update and publishes an event via Kafka or another message bus, which a Notifications service or Analytics service then consumes. This pattern decouples services: each service updates its own database, and CDC propagates those changes out to interested consumers.

During a migration from a monolith to microservices, CDC often keeps the old and new systems in sync until the transition is complete. In general, CDC acts as the bridge in event-driven architectures, feeding data to multiple endpoints in near real-time. This means each microservice or downstream system always has the data it needs, without directly querying the source service’s database.
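As an illustration of that decoupling, a Notifications service could subscribe to the order change events on its own, without ever querying the Orders service’s database. Everything below, including the topic name, event shape, and the send_email helper, is hypothetical.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

def send_email(customer_id: str, text: str) -> None:
    """Hypothetical stand-in for the Notifications service's delivery logic."""
    print(f"notify {customer_id}: {text}")

consumer = KafkaConsumer(
    "orders.changes",                      # hypothetical CDC topic for the Orders DB
    bootstrap_servers="localhost:9092",
    group_id="notifications-service",      # each service keeps its own consumer group
    value_deserializer=lambda b: json.loads(b.decode("utf-8")) if b is not None else None,
)

for msg in consumer:
    change = msg.value
    if not change:
        continue
    # React only to the status transitions this service cares about.
    if change.get("op") == "u" and change["after"].get("status") == "shipped":
        send_email(change["after"]["customer_id"], "Your order has shipped")
```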

Redis to Aerospike: Migration guide

Redis works well for lightweight caching and quick prototypes. But when your system grows, with more data, users, and uptime requirements, Redis starts to crack. If you're hitting ceilings with DRAM costs, vertical scaling limits, or fragile clustering, it's time for a change. This migration guide provides a clear, practical path for moving from Redis to Aerospike.

Cloud migration and hybrid data systems

Organizations increasingly operate across on-premises datacenters and multiple clouds. CDC keeps data consistent in these distributed environments. It supports live cloud migration by streaming on-premises database changes to a cloud database or data warehouse, so the cloud copy is continuously updated with on-premises changes. This way, cutovers to cloud systems happen gradually and with current data. 

In a hybrid cloud scenario, some applications may continue on-premises while others run in the cloud; CDC replicates data such as customer records or transaction data securely between the environments in real time. By filtering and routing the change streams, CDC also helps comply with data residency rules, such as sending only allowed data to the cloud while keeping sensitive changes on-premises. CDC's support for one-to-many, many-to-one, and bidirectional replication topologies also makes it useful for distributing data across regions and cloud providers.

Real-time analytics and data pipelines

Another use case for CDC is feeding analytics systems and data lakes/warehouses with real-time data. Traditionally, ETL jobs might refresh a warehouse once a day. With CDC, companies move toward real-time analytics, where dashboards and reports reflect changes within minutes or seconds of the source update. 

For example, an e-commerce company uses CDC to stream transactional data from its operational database into a streaming platform, such as Kafka, and then into an analytics database or data lake. This provides up-to-the-minute visibility into sales, inventory, and user behavior. CDC-powered pipelines also support combining streams from multiple sources to drive complex event processing and machine learning features, such as updating a real-time fraud detection model whenever a user’s account behavior changes. Because CDC delivers a chronological feed of every change, it’s useful for feeding downstream systems that maintain state or histories, such as maintaining a timeline of changes in a data lake. 

Many streaming architectures use CDC with platforms such as Kafka or cloud streaming services: the CDC process captures database commits and publishes them to Kafka topics, to which consumers such as analytics dashboards, search indexes, and caches subscribe. This pattern has become increasingly popular as businesses demand immediate insight from their data.

Challenges and considerations with CDC

While change data capture is powerful, implementing it comes with challenges and tradeoffs architects must consider:

Data consistency and ordering

In a distributed system, keeping data consistent between the source and target is hard. CDC systems are usually asynchronous, meaning there is some lag and the potential for out-of-order delivery of events. Network latency or failures may cause the target to momentarily diverge from the source, or for events to arrive out of sequence. 

For instance, if a user updates a record twice in quick succession, a delay in delivering the first update could result in the second update being applied before the first one at the target. Handling such scenarios requires careful design, such as using sequence numbers or timestamps to reorder events. In practice, most CDC pipelines guarantee eventual consistency, where the target eventually catches up and reflects all changes, but inconsistencies may occur in the short term.

It’s important to determine if the use case tolerates slight delays or requires strict ordering. Techniques such as including a commit timestamp or sequence ID with each change help services maintain ordering. Still, maintaining perfect synchronization under all conditions is challenging, and periodic audits or reconciliation processes may be needed to verify consistency.
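One common safeguard is to carry a commit sequence number (or timestamp) with each change and apply an event only if it is newer than what the target has already seen. The following is a minimal sketch of that idea, using a hypothetical event shape.

```python
# Keep only the newest version of each record, using a monotonically
# increasing sequence number (or commit timestamp) carried with each event.
applied_versions: dict[str, int] = {}

def apply_change(target: dict, event: dict) -> None:
    key, seq = event["key"], event["seq"]
    # Ignore events that arrive out of order and are older than what has
    # already been applied (last-writer-wins on the sequence number).
    if seq <= applied_versions.get(key, -1):
        return
    if event["op"] == "delete":
        target.pop(key, None)
    else:
        target[key] = event["value"]
    applied_versions[key] = seq
```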

Performance overhead on sources

Different CDC methods carry different overhead, but any CDC process uses some resources on the source database. Approaches such as triggers or frequent polling are the most intrusive: triggers add extra writes and CPU load on every transaction, and polling queries contend for I/O and locks. Even log-based CDC, while much lighter, isn’t free; reading and parsing the transaction log uses CPU and possibly disk I/O on the database server, or requires additional infrastructure, such as an agent that reads the logs. 

In high-volume environments, the CDC pipeline must be tuned not to fall behind or overload the source. If changes are faster than the CDC consumption rate, it leads to backlogs that stress the system. There’s also a memory and network overhead to consider: changes must be buffered and transmitted, which could affect network bandwidth, especially across data centers. 

Efficient CDC solutions mitigate these issues by streaming incrementally and checkpointing their progress so they don’t re-scan large swaths of data. Nonetheless, when designing CDC, budget for the extra load. In some cases, upgrading hardware or using a separate read-replica as the CDC source isolates the performance impact. 

The good news is that log-based CDC reduces overhead by design, but it still requires monitoring. High-change-rate tables might generate huge volumes of events, so the CDC pipeline must scale accordingly. While CDC is powerful, it’s important to ensure the source system handles the additional work of change capture without degrading its primary workload.

Complexity and maintenance

Implementing CDC makes your data architecture more complex. Each database platform has its own CDC mechanism or third-party tool, leading to a lack of standardization. For example, Oracle, SQL Server, and MySQL have different log formats and different tooling for CDC. 

This means CDC pipelines often rely on connector frameworks such as Debezium or custom scripts that need to be configured and maintained. Managing these connectors to keep them in sync with database upgrades, handling failures, and scaling becomes an operational task. 

There’s also the challenge of schema changes: if the source schema evolves with new columns or tables, the CDC pipeline needs to adapt so it doesn’t break or start misinterpreting data. Some CDC solutions track schema metadata and handle evolution, but others might require manual intervention. 

Another complexity is coordinating delivery. In distributed systems, it’s easy to accidentally apply changes twice or miss changes during outages. Robust CDC systems typically guarantee at-least-once delivery, which means no lost events, though duplicates might occur in failover scenarios, and rely on downstream idempotency or deduplication to handle repeats. 
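A simple way to make a consumer tolerant of at-least-once delivery is to remember recently processed event IDs and skip repeats. The sketch below keeps a bounded in-memory record of IDs; a production pipeline might instead store them in the target system or rely on naturally idempotent upserts.

```python
from collections import OrderedDict

class Deduplicator:
    """Remember recently seen event IDs so redelivered events are skipped."""

    def __init__(self, capacity: int = 100_000):
        self.seen: "OrderedDict[str, None]" = OrderedDict()
        self.capacity = capacity

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self.seen:
            return True
        self.seen[event_id] = None
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)   # evict the oldest entry
        return False

def handle(event: dict, dedup: Deduplicator) -> None:
    if dedup.is_duplicate(event["id"]):
        return                              # at-least-once delivery: skip the repeat
    # ... apply the change to the target system ...
```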

Setting up the messaging infrastructure, such as Kafka, as part of CDC also adds complexity; you need to configure retention policies, the number of partitions for parallelism, error handling for bad events, and so on. 

Finally, building monitoring and alerting for the CDC pipeline is important because you want to know if replication lag is increasing or if an error has stopped the feed. Without visibility, CDC can fail silently and lead to data divergence. 

All these factors mean that while CDC solves many problems, it also requires an investment in engineering and operations to run reliably.

Ensuring reliability and fault tolerance

By its nature, CDC is an ongoing, streaming process. It doesn’t have a clear end state but rather keeps running indefinitely. This requires considering fault tolerance. What happens if the CDC process or connector fails? 

Ideally, it should restart and continue from the last saved position without losing or duplicating events. Many CDC systems use a checkpoint or log sequence to track where they left off in the source’s change stream. 
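A checkpoint can be as simple as persisting the last fully processed position (an offset, LSN, or timestamp) after each batch and re-reading it at startup. The file name and JSON layout below are only an illustration.

```python
import json
import os

CHECKPOINT_FILE = "cdc_checkpoint.json"   # hypothetical location

def load_checkpoint() -> int:
    """Return the last position (offset / LSN / timestamp) fully processed."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["position"]
    return 0

def save_checkpoint(position: int) -> None:
    """Persist progress atomically so a restart resumes from here."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"position": position}, f)
    os.replace(tmp, CHECKPOINT_FILE)
```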

For instance, Aerospike’s implementation tracks a Last Ship Time or similar marker for each destination to record the timestamp up to which changes have been sent. A challenge arises if the target or network is down for an extended period because the CDC mechanism must buffer or retain a large volume of changes and then catch up. 

Using a durable message queue such as Kafka or cloud pub/sub in the pipeline provides built-in durability; changes sit in the log until the target consumes them, and if the target is down, they just accumulate within retention limits rather than disappearing. This pattern decouples the source from the target’s availability. Without such a buffer, a pure push system might lose changes if the target can’t be reached in time. So, reliability best practices for CDC include using robust messaging middleware, checkpointing progress, and possibly writing a redo mechanism that reloads a full snapshot and reconciles if necessary. 

Testing the CDC setup under failure scenarios, such as simulating a network partition or restoring from a backup, is important to demonstrate that you can recover and resync without data loss. Building a CDC pipeline isn’t just about capturing changes, but also about guaranteeing delivery of those changes despite failures, which often involves additional tooling and configuration to get it right.

Data replication between data centers: Log shipping vs. Cross Datacenter Replication

Discover how Aerospike's Cross Datacenter Replication (XDR) delivers ultra-low latency, precise control, and efficient data transfer to enhance global data performance.

Aerospike and CDC

Change data capture has become an indispensable part of data strategy, supporting everything from real-time analytics to geographically distributed databases. Aerospike, as a high-performance NoSQL database platform, is designed with these needs in mind. 

In fact, Aerospike provides integrated CDC through its Cross Datacenter Replication (XDR) feature. Aerospike’s approach uses an internal change notification stream similar to CDC: as data is written or updated in the database, those changes are published via XDR to other Aerospike clusters or external systems in real time. Each change event, including record updates and deletions, is captured with metadata and is reliably delivered to its destination. 

Aerospike offers an at-least-once delivery guarantee, so no updates are lost in transit, and it allows fine-grained control, such as filtering which records or bins to replicate. In practice, this means an Aerospike cluster continuously pushes out changes to, say, a remote cluster, a Kafka topic, or a downstream application, with little lag and without slowing down the source database.

For organizations looking to build fast and scalable data infrastructures, Aerospike’s built-in CDC (via XDR) simplifies the pipeline. It eliminates the need for third-party CDC tools to get data out of the database, and it uses Aerospike’s proven low-latency engine so the change feed moves as quickly as possible. Because Aerospike was designed for high throughput and strong consistency options, it handles the intensive workload of change capture and replication even on large datasets. 

If your business needs real-time data distribution and always-synchronized systems, see how Aerospike helps. Aerospike’s platform not only serves ultra-fast transactions, but also shares those transactions as events with the rest of your ecosystem. Learn more about Aerospike’s technology and how it provides reliable change data capture so you can start building data pipelines on a solid foundation.

Try Aerospike Cloud

Break through barriers with the lightning-fast, scalable, yet affordable Aerospike distributed NoSQL database. With this fully managed DBaaS, you can go from start to scale in minutes.