Blog

Change data capture (CDC) to Kafka

Learn how change data capture works, why it matters for real-time data systems, and how integrating CDC with Apache Kafka enables fast, reliable data streaming for analytics, microservices, and enterprise data synchronization.

January 20, 2026 | 20 min read
Alexander Patino
Solutions Content Leader

Change Data Capture (CDC) identifies and records every change made to a database, including inserts, updates, and deletions. Instead of periodically exporting entire tables, CDC works continuously by monitoring the source database for modifications and capturing those changes as discrete events. 

This means that as soon as a row is updated or a new record is added, the CDC process logs the change. By capturing only the deltas, or the changes since the last update, rather than the full dataset, CDC reduces both the load on the source system and latency in data propagation. Captured changes get forwarded to downstream systems to remain synchronized with the source in near real time.

Why is this important? Traditionally, organizations used batch extract, transform, load (ETL) jobs to move data between systems, such as exporting a database each night to update a data warehouse. That approach leaves downstream systems hours or even days out of date. CDC, by contrast, syncs changes almost immediately, eliminating batch processing delays and keeping data across systems fresh enough for real-time needs. 

CDC also improves overall data consistency across an enterprise by removing the need for ad hoc dual writes or manual updates. Once a change is captured centrally, all consumers reliably get that same update. In practice, CDC often taps into a database’s internal transaction logs to detect changes or uses triggers in some cases. Reading from transaction logs is efficient because it captures every commit in order without adding workload to the primary database. Through this approach, CDC propagates changes reliably and with less performance impact on the source database. 

In summary, CDC provides a foundation for real-time data replication and synchronization, allowing systems to react to business data as it changes, rather than hours or days later.

Data replication between data centers: Log shipping vs. Cross Datacenter Replication

Discover how Aerospike's Cross Datacenter Replication (XDR) delivers ultra-low latency, precise control, and efficient data transfer to enhance global data performance.

Why integrate CDC with Apache Kafka?

Capturing data changes is only half of the equation. Those changes still need to be delivered to the systems and services that use them. This is where Apache Kafka fits in. Kafka is an open-source distributed event streaming platform designed to publish and subscribe to streams of records in a durable, fault-tolerant way. In the context of CDC, Kafka serves as the high-throughput conduit that carries change events from the source database to downstream consumers that need them. Integrating CDC with Kafka gives organizations a reliable pipeline that handles many database events without losing them. 

Kafka’s architecture makes it suitable for this task. It is essentially a persistent commit log with replication and partitioning built in. This means every change event written to Kafka is stored durably and replicated across multiple brokers, so consumers have access to it even if some nodes fail. 

It also means Kafka handles high event volumes by partitioning topics and spreading the load across a cluster of machines. Enterprises using CDC often deal with fast-moving data, such as a busy orders database or sensor readings flowing nonstop. Kafka’s partitioned, distributed design reads and delivers these streams of changes with consistently low latency and high throughput.

In practice, CDC pipelines use Kafka to decouple the source database from consuming applications. The database pushes each change event to Kafka once, and then many different subscribers, such as analytics systems, search indexes, and microservices, retrieve those events at their own pace without affecting the primary database. Within each topic partition, Kafka delivers events to every subscriber in the same sequence they were recorded, preserving the order of updates for each key as long as a suitable partitioning strategy is used. Kafka also retains events for a configurable period, often hours or days, which is useful if a downstream system goes offline and needs to catch up later.

In short, Kafka complements CDC by acting as the scalable, resilient messaging layer for change events. It means that no matter how fast your data changes, those changes get streamed out to all interested systems reliably. This combination of CDC for change identification and Kafka for change distribution supports real-time data pipelines. Businesses react to fresh data across a distributed environment without worrying about missing events or overwhelming their databases.

Building a CDC pipeline to Kafka

Implementing a CDC-to-Kafka pipeline typically involves connecting a database’s change data feed into Kafka in a streaming, automated fashion. One common approach is to use Kafka Connect with a CDC source connector. Kafka Connect simplifies integration between Kafka and external systems through configuration alone, with no custom code. 

For example, to capture changes from a relational database, an organization might use a Debezium connector via Kafka Connect. Debezium is an open-source CDC platform with connectors for databases such as MySQL, PostgreSQL, and Oracle; it reads their transaction logs and translates every insert, update, and delete into a corresponding event stream. 

Once configured, Debezium monitors the database’s log and pushes change events into Kafka topics in real time. Each event might contain both old and new values of a record for update events, or the details of a deleted row, depending on the database’s capabilities. Because this process reads the database’s own commit log, it captures changes accurately with little effect on the database’s performance.
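To make this concrete, here is a minimal sketch of registering a Debezium MySQL connector with a Kafka Connect cluster’s REST API, written in Java using the JDK’s built-in HTTP client. The hostnames, credentials, database, and table list are placeholders, and the configuration keys follow Debezium 2.x naming, so adjust them for your version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterDebeziumConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector definition: capture changes from two tables of an
        // "inventory" MySQL database and write them to Kafka topics prefixed "inventory".
        String connectorJson = """
            {
              "name": "inventory-cdc",
              "config": {
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname": "mysql.example.internal",
                "database.port": "3306",
                "database.user": "cdc_user",
                "database.password": "cdc_password",
                "database.server.id": "184054",
                "topic.prefix": "inventory",
                "table.include.list": "inventory.orders,inventory.customers",
                "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
                "schema.history.internal.kafka.topic": "schema-changes.inventory",
                "tasks.max": "1"
              }
            }
            """;

        // POST the definition to Kafka Connect's REST API (default port 8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect.example.internal:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

With a setup like this, Debezium by default writes change events to topics named after the prefix, database, and table, for example inventory.inventory.orders.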

For many enterprise use cases, pre-built connectors such as Debezium or Confluent’s CDC connectors are the fastest route to getting CDC data into Kafka. These connectors handle the heavy lifting: they snapshot the current state if needed and then emit a steady stream of events for changes. Operators configure which tables or collections to monitor, and the connector translates each change into a Kafka message, typically writing to a topic named for the table or entity. The Kafka Connect framework manages execution, scaling, and fault tolerance of these connectors, so if a connector task or node fails, it resumes from the last recorded position without losing data. 

In other cases, such as NoSQL and high-performance databases, the database itself provides a direct CDC integration to Kafka. For example, Aerospike Database’s Cross Datacenter Replication (XDR) feature includes change notification. Aerospike’s XDR can be configured to ship every record change or insertion as an event into Kafka or another streaming platform as it happens. 

Because this capability is built in, it matches the database’s own throughput and doesn’t require external polling. It treats Kafka as just another replication target, so it propagates changes from the database to the messaging layer quickly. Similarly, some cloud databases or data warehouse technologies offer CDC streams that plug into Kafka through connectors or custom integration.

Regardless of the specific tool or connector used, a typical CDC→Kafka pipeline works like this: 

  • The CDC component reads the source database’s change log or listens for changes

  • It converts each change into an event, such as a JSON or Avro message with before/after data and metadata

  • It publishes that event to a designated Kafka topic

  • Downstream applications read from those Kafka topics to update their own state or trigger processing
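As an illustrative sketch of that last step, the following Java consumer (using the standard Kafka client) reads change events from a hypothetical inventory.inventory.orders topic. The broker address, group ID, and topic name are assumptions, and parsing of the event payload is left as a placeholder.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderChangeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-sync-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("inventory.inventory.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The value is the change event envelope (e.g., JSON with before/after state).
                    // Parse it and update local state, a search index, a cache, etc.
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```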

The whole pipeline is usually asynchronous. The source database transaction commits independently, and the CDC process fetches and publishes the change shortly after. Well-designed CDC pipelines have low lag, often measured in milliseconds to a few seconds, from the time of a database commit to the time a consumer sees the event. They also provide at-least-once delivery of changes to Kafka; with proper configuration, such as Kafka’s idempotent producers and transactional (exactly-once) semantics, they can achieve effectively exactly-once delivery so consumers don’t see duplicate events.
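For pipelines that write to Kafka through the producer API rather than a prebuilt connector, a hedged sketch of the producer settings involved might look like the following. Idempotence plus acks=all prevents broker-side duplicates from retries, and setting a transactional.id enables Kafka transactions, which consumers only benefit from when they read with isolation.level=read_committed. The broker address and IDs are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CdcProducerSettings {
    static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait for all in-sync replicas and deduplicate retried sends on the broker.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        // Optional: a stable transactional.id turns on Kafka transactions, so a batch of
        // change events is committed atomically and only visible to read_committed consumers
        // once the transaction commits.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "cdc-orders-producer-1");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        return producer;
    }
}
```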

When implementing such a pipeline, it’s important to configure offsets and error handling. For instance, if the CDC connector falls behind, perhaps because the target Kafka broker was down briefly, it should resume from the last processed log position to avoid missing events. This is generally built into popular CDC tools. 

Moreover, using Kafka as the buffer means the system decouples the source and sink: even if a target system is down or slow, Kafka retains the change events until they are retrieved, rather than burdening the source database. This decoupling is one of the great advantages of a CDC+Kafka architecture: It adds resiliency and flexibility in how data flows through an organization.

Considerations for low-latency CDC pipelines

Building a CDC pipeline with Kafka for an enterprise system introduces design considerations. Especially when dealing with high-performance, low-latency requirements, the pipeline needs not only to work but also to be robust and efficient. Here are several considerations and how to address them.

Preserving data consistency and order

A core requirement of any CDC pipeline is preserving the correctness of data as it moves from the source to consumers. This includes delivering changes in the same order they were committed in the source database and avoiding duplicates or data loss. Kafka by design preserves the order of messages within a partition, so if all changes for a particular entity, such as a customer account, go to the same partition, they arrive in sequence. 

The CDC mechanism should therefore use a consistent partition key, often the primary key or record ID, for in-order delivery per record or entity. This way, if a database row is updated twice quickly, consumers see those two events in the correct order.
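A minimal sketch of that idea, assuming a hypothetical customer-changes topic: using the source row’s primary key as the Kafka message key sends all events for the same entity to the same partition, so they are delivered in order.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedChangePublisher {
    private final KafkaProducer<String, String> producer;
    private final String topic;

    public KeyedChangePublisher(KafkaProducer<String, String> producer, String topic) {
        this.producer = producer;
        this.topic = topic;
    }

    /** Publish a change event keyed by the source row's primary key. */
    public void publish(String primaryKey, String changeEventJson) {
        // Same key -> same partition -> per-entity ordering is preserved.
        producer.send(new ProducerRecord<>(topic, primaryKey, changeEventJson));
    }
}
```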

Beyond ordering, consistency also means not losing any events and not processing any event more than once. By default, Kafka delivers messages at least once, which could result in duplicates if a failure occurs during processing. However, there are best practices for achieving effectively exactly-once delivery. Using Kafka’s idempotent producer feature means retries of the same message won’t create duplicate events on the broker. 

Additionally, using Kafka transactions for the CDC producer groups the series of operations, such as a snapshot plus a stream of changes, into an atomic unit, so consumers never see partial results. In practice, many CDC pipelines rely on upstream log sequence numbers or offsets to deduplicate events. For instance, each change event might carry a unique log position or timestamp. Consumers track the last seen position to detect and ignore any duplicates, providing an additional safeguard.
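A hedged sketch of that consumer-side safeguard: track the highest source log position applied so far per key and skip anything at or below it. The class and method names here are hypothetical; the position itself would come from the change event’s metadata, such as the LSN Debezium includes in its source block for PostgreSQL.

```java
import java.util.HashMap;
import java.util.Map;

/** Ignores change events whose source log position has already been processed. */
public class ChangeEventDeduplicator {
    // Highest source log position applied so far, per entity key.
    private final Map<String, Long> lastAppliedPosition = new HashMap<>();

    /** Returns true if the event is new and should be applied, false if it is a duplicate. */
    public boolean shouldApply(String key, long sourceLogPosition) {
        Long last = lastAppliedPosition.get(key);
        if (last != null && sourceLogPosition <= last) {
            return false; // already saw this change (or a newer one) - skip it
        }
        lastAppliedPosition.put(key, sourceLogPosition);
        return true;
    }
}
```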

When multiple systems retrieve the change events, a schema registry becomes valuable as well. All consumers should interpret the data consistently. If the database schema changes, such as a column being added or a data type changing, using a schema registry in the pipeline helps manage versioned schemas so producers and consumers stay in sync. This prevents inconsistent interpretations of the data that could lead to errors.
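As one hedged example, wiring a producer to Confluent’s Schema Registry with Avro looks roughly like this; the registry URL is a placeholder, and the serializer class comes from the separate kafka-avro-serializer dependency rather than the core Kafka client.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class AvroProducerSettings {
    static Properties avroProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // The serializer registers each event's Avro schema here and embeds its ID in the
        // message, so consumers always decode with a compatible schema version.
        props.put("schema.registry.url", "http://schema-registry.example.internal:8081");
        return props;
    }
}
```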

In summary, to maintain consistency: 

  • Use appropriate partitioning for ordered delivery

  • Use idempotent producers or transactional writes to Kafka to eliminate duplicates

  • Enforce schema consistency across producers and consumers

These steps help every interested system see a reliable, correct stream of changes in the proper sequence.

Reducing latency and overhead

Enterprises choose CDC to support real-time data flows, so keeping latency low is important. Every component in the pipeline, from the CDC capture mechanism to the Kafka brokers to the consumers, should be tuned for quick processing of each event. 

One important consideration is how CDC capture is performed. Log-based CDC, which reads the database’s transaction log, tends to be efficient because it avoids extra work in the transaction path and simply tail-reads already-written log records. This reduces overhead on the source database, so changes are extracted without slowing down transactional workloads. 

In contrast, trigger-based CDC, which uses database triggers to record changes, adds overhead to each write on the source, increasing latency for the transaction itself and generating extra load, something to avoid in high-performance systems.

On the Kafka side, improve end-to-end latency by configuring producer and broker settings appropriately. For example, the Kafka producer’s linger time and batch size determine how quickly messages are sent versus buffered; lowering them reduces latency at the expense of throughput. Replication settings and acknowledgement policies also matter: for the lowest latency, some pipelines acknowledge writes before all replicas confirm them (for example, acks=1 instead of acks=all), though this must be balanced against durability needs. 
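A hedged sketch of those producer-side knobs; the values are illustrative rather than recommendations, and acks=1 conflicts with the idempotent producer, so idempotence is disabled explicitly here.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class LowLatencyProducerTuning {
    static Properties lowLatencyProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // Send as soon as a record arrives instead of waiting to fill a batch.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");
        // Modest batch size; with linger.ms=0, batching only occurs when records
        // arrive faster than they can be sent.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");
        // acks=1 acknowledges once the leader has written the record: lower latency,
        // weaker durability than acks=all. Requires idempotence to be turned off.
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "false");
        return props;
    }
}
```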

Network factors also come into play: Placing the CDC capture process, Kafka brokers, and consumers in the same region or data center reduces WAN latency, and high-speed links plus a tuned network stack (for example, appropriate TCP settings) can shave off milliseconds. Moreover, using an efficient serialization format for the change events, such as a compact binary format like Avro or Protocol Buffers instead of verbose JSON, reduces message size, which makes transmission and decoding faster.

Another tactic for reducing lag is to provision the Kafka cluster to handle peak change throughput with headroom. If the volume of change events spikes, such as a bulk update or a promotional event that creates more activity, the brokers should be able to handle that burst without building up long queues. Otherwise, if Kafka becomes a bottleneck, end-to-end latency suffers as changes back up. 

Monitoring is also important. By watching Kafka consumer lag (how far behind consumers are on each topic) and the CDC connector’s own metrics, such as Debezium’s source lag, which shows how far behind the database log the connector is, operators can detect when the pipeline is falling behind and adjust resources accordingly. In well-tuned systems, CDC-to-Kafka end-to-end latencies range from a few milliseconds to a few tens of milliseconds, which is near real time for most practical purposes.
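As one hedged example of watching consumer lag programmatically with Kafka’s AdminClient (the group ID and broker address are placeholders):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the consumer group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("order-sync-service")
                         .partitionsToOffsetAndMetadata().get();

            // Latest offsets actually written to each of those partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            // Lag = log end offset minus committed offset, per partition.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```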

Scalability and fault tolerance

Enterprise data systems must not only be fast, but also scalable and resilient to failures. In a CDC-to-Kafka pipeline, there are multiple components to consider: the source database, the CDC capture service, the Kafka infrastructure, and the consumers. Each of these should be configured for high availability and horizontal scalability.

Kafka itself is designed as a distributed, fault-tolerant system. To take advantage of this, configure the Kafka topics used for CDC with a sufficient replication factor, typically at least 3 in production, so that if one broker goes down, the data is not lost and consumption continues from a replica. Kafka automatically fails partition leadership over to another broker when failures occur. 
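A hedged sketch of creating a CDC topic with that kind of durability configuration up front; the topic name, partition count, and retention are placeholders, and many Connect-based pipelines auto-create topics instead.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, replication factor 3 for durability.
            NewTopic topic = new NewTopic("inventory.inventory.orders", 6, (short) 3)
                    .configs(Map.of(
                            "min.insync.replicas", "2",   // with acks=all, writes need 2 live replicas
                            "retention.ms", "604800000"   // keep change events for 7 days
                    ));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```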

The CDC producer, whether it’s Debezium, a custom app, or a database feature such as XDR, should also be redundant. For example, Debezium connectors typically run on Kafka Connect in distributed mode, where another worker takes over a failed connector task and picks up from the last saved offset. 

It’s important that the CDC process keeps track of its read position in the source’s log, so after a crash or restart, it resumes without missing or duplicating events. This is typically built in, but testing failover scenarios is a good practice.

Scaling the pipeline is usually a matter of scaling out Kafka partitions and consumers. If change volume grows, adding more Kafka brokers and partitions spreads the load. The CDC producer can sometimes be scaled as well. For instance, some databases allow multiple log readers for different tables or shards, and in Kafka Connect, you might increase the number of tasks for a connector if it can read in parallel from different tables or partitions of the source. Consumers scale horizontally by adding more instances to the consumer group, each handling a subset of partitions. The loosely coupled nature of Kafka means each component scales independently: add broker capacity without touching the database, or add consumer instances without altering data capture.
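For example, here is a hedged sketch of growing an existing CDC topic’s partition count with the AdminClient so that more consumer instances can share the load; keep in mind that adding partitions changes which partition new keys hash to, which affects per-key ordering.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class ExpandCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the topic to 12 partitions; existing data stays where it is.
            admin.createPartitions(Map.of(
                    "inventory.inventory.orders", NewPartitions.increaseTo(12)
            )).all().get();
        }
    }
}
```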

Another aspect of fault tolerance is handling network or downstream outages gracefully. If a target system, such as a sink that retrieves data from Kafka, goes down, Kafka retains the events until it comes back, so no data is lost. This buffering is an advantage of using Kafka as an intermediary. 

Similarly, if the network between the CDC process and Kafka experiences a partition or outage, a well-designed pipeline buffers changes on the source side, at least up to a point, and catches up when the connection is restored. Flow control and backpressure handling are important here: the Kafka producer signals backpressure when its internal buffers fill up, and the CDC process should pause or throttle capture so as not to overwhelm any component.
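On the producer side, that backpressure shows up as blocking. A hedged sketch of the settings involved, with illustrative values:

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class ProducerBackpressureSettings {
    static Properties backpressureProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // Total memory for unsent records; once it is full, send() blocks instead of dropping data.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864"); // 64 MB
        // How long send() may block on a full buffer before throwing, which is the signal
        // for the CDC process to pause or throttle capture.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");
        return props;
    }
}
```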

Finally, remember security and governance. In an enterprise setting, the CDC pipeline might carry sensitive data. Kafka supports encryption in transit, access control, and other security features, which should be set up to protect the data stream. While this is tangential to performance, it’s a necessary consideration for a production deployment in industries with compliance requirements.
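A hedged sketch of what the client-side security settings can look like; the mechanism, credentials, and truststore path are placeholders and must match how the brokers are actually configured.

```java
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.config.SslConfigs;

import java.util.Properties;

public class SecureClientSettings {
    static Properties secureProps() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.internal:9093");
        // Encrypt traffic in transit and authenticate the client.
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"cdc-pipeline\" password=\"change-me\";");
        // Trust store so the client can verify the brokers' TLS certificates.
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "change-me");
        return props;
    }
}
```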

By designing with these considerations in mind, a CDC-to-Kafka pipeline meets the core demands of enterprise data architecture: high-throughput, low-latency data flow that stays reliable even in the face of surging load or component failures. This makes the pipeline trustworthy as a backbone for real-time data across the organization.

(Webinar) Architecting for in-memory speed with SSDs -- 80% lower costs, same performance

Discover how Aerospike’s Hybrid Memory Architecture (HMA) uses high-speed SSDs to deliver in-memory performance at a fraction of the cost. Watch the webinar to explore the design behind sub-millisecond reads, massive scale, and unmatched efficiency.

Use cases and benefits of CDC to Kafka

Integrating CDC with Kafka supports a variety of enterprise use cases, especially those requiring quick insights and responsive processes. Here are some examples where CDC-to-Kafka pipelines shine, along with the benefits they provide.

Real-time analytics and dashboards

One of the primary drivers for CDC is the need for up-to-the-second analytics. By streaming database changes into analytics platforms or data warehouses through Kafka, businesses can analyze events almost as soon as they occur. 

For example, an e-commerce company captures every transaction and cart update with CDC and pushes those events into a real-time analytics system. Live dashboards then reflect current sales numbers, inventory levels, or user behavior without waiting for batch updates. 

The benefit is clear: decisions get made on fresh data. If an anomaly or opportunity emerges, such as a surge in demand for a product, it shows up in the metrics right away. CDC to Kafka means the analytical systems, such as a streaming SQL engine, an operational data store, or a tool like Spark or Flink retrieving from Kafka, are never far behind the database. 

This continuous data flow makes BI reports and AI/ML models more accurate as well. Models get trained or updated on the latest information, making them more relevant. In summary, CDC pipelines support real-time analytics, which helps businesses compete in fields such as finance, with fraud detection on live transactions; telecommunications, with monitoring of call and network events; and online retail, with real-time personalization and recommendations.

Data synchronization across systems

Enterprises often have a heterogeneous data ecosystem: operational databases, caches, search indexes, and legacy systems all contain overlapping data. Keeping these systems in sync is a challenge. 

CDC with Kafka offers a solution by acting as a central event hub for all changes. When the main database is updated, the change is broadcast through Kafka to update a caching layer and keep it consistent, to refresh a full-text search index such as Elasticsearch, and to notify any other service that maintains a replica or derivative of that data. This approach supports eventual consistency across disparate systems in a matter of seconds or less. 

For instance, a financial institution might use CDC to propagate account balance updates from a core banking system out to branch office systems and mobile app databases, so no matter which interface a customer uses, they see the same up-to-date balance. Before adopting CDC streams, such synchronization was often done with periodic jobs or ad hoc API calls. 

With Kafka, it becomes a cohesive publish-subscribe system where each system that needs the data subscribes to the relevant topics. Moreover, using Kafka as the transport provides a clear audit trail of all changes, which is useful for debugging and compliance, and decouples the source and targets. Each target system processes incoming changes at its own speed without affecting others. 

The benefit to the enterprise is consistency and integrity of data spread across many systems, with little manual intervention. It also reduces duplication of effort. Rather than each integration writing its own point-to-point sync code, the CDC pipeline acts as a universal feed. This pattern is also seen in multi-region or multi-datacenter deployments, where CDC events in Kafka distribute data changes to other geographic locations to keep regional databases in sync for global applications.

Event-driven microservices

As organizations move toward microservices and event-driven architectures, CDC to Kafka supports inter-service communication. In an event-driven design, services react to events such as “order placed” or “customer updated” rather than relying on direct REST calls or database polling. 

But where do these events come from? Often, they come from database changes: When a transaction is committed in the Orders service’s database, that is effectively an “OrderPlaced” event. CDC captures that and sends it through Kafka, where any number of other services use it to trigger their own logic. 

For example, consider an “inventory” microservice that needs to update stock levels when an order is placed, and a “notification” service that emails the customer with an order confirmation. By streaming order changes through Kafka, the inventory and notification services both subscribe and react in real time, without the Order service having to call each one directly. 

This results in a loosely coupled system. The Order service writes to its database as usual, and events fan out to whichever services are configured to listen. The benefits include scalability, because services can be added or removed as subscribers without changing the publisher; resilience, because if a consumer service is down, Kafka buffers the event until it’s back up, so no data is lost; and clarity, because the event log in Kafka provides a history of what happened in the business. 
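A minimal sketch of that fan-out, assuming a hypothetical orders.changes topic: because each service subscribes under its own group.id, both receive every event independently.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Properties;

public class OrderEventSubscribers {
    static KafkaConsumer<String, String> subscribe(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("orders.changes"));
        return consumer;
    }

    public static void main(String[] args) {
        // Separate group.ids mean each service gets its own copy of every order change event.
        KafkaConsumer<String, String> inventoryService = subscribe("inventory-service");
        KafkaConsumer<String, String> notificationService = subscribe("notification-service");
        // ...each service polls its consumer in its own thread or process and reacts to events.
    }
}
```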

Essentially, CDC turns the database into a source of business events. With CDC, even legacy systems that weren’t built with event generation in mind participate in an event-driven architecture. This pattern is increasingly used in industries such as e-commerce, where one customer action cascades to many microservices. Using CDC streams means each service is working off the same source and reacting quickly to changes. 

Over time, this helps break monoliths into microservices: As the monolithic database’s changes are published to Kafka, new microservices get built to read those events and gradually take over functionality, all while keeping data in sync during the transition.

Beyond these examples, CDC to Kafka has other applications. Use it for zero-downtime data migrations by streaming data from an old database to a new one via Kafka, feeding machine learning feature stores with live data, replicating data to cloud systems for hybrid cloud architectures, and powering operational dashboards that monitor activity in real time. 

In all cases, the overarching benefit is real-time data flow with decoupling: Changes are available to all who need them, quickly and consistently, without adding stress to the source systems or creating tightly bound integrations. For enterprises focused on high performance and low latency, this opens the door to building reactive systems that respond to events as fast as they happen, whether that means stopping fraud, delighting a user with a timely recommendation, or scaling infrastructure in response to load, all based on fresh data streaming through the organization.

Aerospike and CDC

The Aerospike Database is built for low-latency, high-throughput operations and complements CDC pipelines through its native support for streaming data out. In fact, Aerospike’s Cross Datacenter Replication (XDR) feature incorporates a CDC mechanism to feed systems such as Kafka with change events as they happen, turning Aerospike into a fast, reliable source of streaming data. This means enterprises can rely on Aerospike as the system of record for high-speed transactions while funneling those transactional events into Kafka for analytics, indexing, or microservice consumption.

The same XDR framework that streams changes into Kafka also supports Aerospike’s multi-region deployments, including active–passive, active–active, and other advanced topologies. This helps organizations maintain synchronized clusters across regions for disaster recovery, data locality, and continuous availability.

In practice, integrating Aerospike with Kafka helps organizations get the best of both worlds: an ultra-fast operational database handling important workloads, and a distributed Kafka streaming backbone that distributes the data to all the right places in real time.

Aerospike’s design means this integration doesn’t add much latency, preserving the speed applications need. If you’re building or revamping a data architecture for real-time performance, it’s worth exploring how Aerospike could serve as the engine of your data ecosystem.

Try Aerospike Cloud

Break through barriers with the lightning-fast, scalable, yet affordable Aerospike distributed NoSQL database. With this fully managed DBaaS, you can go from start to scale in minutes.