Aerospike vs. Cassandra: Databases don’t need to go down to break your application
Databases don't need to go down to break your application. See how Aerospike and Cassandra compare on latency stability, node failure, and recovery behavior.
A database does not have to go offline to cause damage. In many production systems, the real problem begins more subtly. The database is still up, and requests are still being served, but the application no longer behaves as it did before. Utilization rises and performance shifts as the system moves closer to capacity. A new feature changes the access pattern and makes earlier optimizations less effective. A failure occurs and, although the database remains available, its behavior after the event is no longer the same.
That is how many systems break in practice: not through a clean outage, but through a gradual loss of stability.
A new benchmark from benchANT compares Aerospike Enterprise 8.0.4 and Apache Cassandra 5.0.3 through exactly that lens. Rather than focusing only on peak throughput or ideal steady-state latency, it looks at how each system behaves under two production-relevant conditions: sustained mixed workload as the cluster fills, and node failure during an active workload.
Availability is not the whole story
Distributed databases are designed to remain available under stress. But availability alone does not tell you whether an application will remain responsive, stable, or operationally predictable. This benchmark makes that distinction clear.
Under sustained mixed workload, Aerospike delivered 481,315 ops/s on average, compared with 138,028 ops/s for Cassandra. This was not a fixed steady-state run. Both systems were observed from a lightly populated state through to near capacity. Across that progression, Aerospike’s throughput stayed tightly bounded, declining by less than 2% over the course of the run. Cassandra’s throughput, by contrast, declined by about 25%.
The latency story was even more revealing. Aerospike kept average p99 latency below one millisecond for reads, writes, and deletes, and around 1.3 ms for updates. Cassandra operated at materially higher p99 latency throughout, with average read p99 at 8.41 ms. More importantly, as the cluster filled, Cassandra’s latency degraded while Aerospike’s remained much more tightly bounded. The gap widened further in the extreme tail. Cassandra’s p99.99 latency was not only significantly higher on average; in some 10-second windows, read p99.99 latency reached 188.03 ms, versus a maximum of 4.56 ms at p99.99 for Aerospike. That is not just a performance gap. It is a difference in how the system behaves as conditions change.
Why tightly bounded latency matters to applications
Modern user-facing applications are becoming increasingly data-dependent. A single user interaction often fans out into many underlying database accesses, and as fan-out grows, even rare slow operations become far more likely to affect the user experience.
For example, if one interaction requires 100 database lookups, and each lookup independently has a 0.01% chance of landing in the p99.99 tail, then around 1% of interactions will be exposed to the data layer’s p99.99 latency. In other words, behaviors that appear extremely rare at the database layer stop being rare from the application’s point of view. That is why tightly bounded latency matters so much. In these systems, average performance is not enough. To keep applications responsive, the data layer must remain not only fast, but predictably fast.
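The fan-out arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of the benchmark itself; the function name and the independence assumption are ours:

```python
def tail_exposure(n_lookups: int, quantile: float = 0.9999) -> float:
    """Probability that at least one of n_lookups exceeds the given
    latency quantile, assuming lookups hit the tail independently.

    Each lookup stays under the quantile with probability `quantile`,
    so all n stay under it with probability quantile ** n_lookups.
    """
    return 1.0 - quantile ** n_lookups

# 100 lookups per interaction against the p99.99 tail:
# roughly 1% of interactions see latency beyond the database's p99.99.
print(f"{tail_exposure(100):.4%}")
```

The same function shows how quickly fan-out amplifies the tail: at 1,000 lookups per interaction, close to 10% of interactions would be exposed to p99.99 latency.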
A system staying up is not the same as a system recovering well
The second part of the benchmark makes the point even more clearly.
benchANT introduced a node failure four hours into an active workload and observed how each system behaved during redistribution and recovery at both 50K ops/s and 100K ops/s. Aerospike remained operational at both workload levels, automatically initiated redistribution, and kept latency tightly bounded. At the point of failure, its p99 latency increased by only around 0.1 ms.
Cassandra remained operational at 50K ops/s, but with materially greater degradation and variability than Aerospike. At 100K ops/s, it could not maintain stable operation after the failure. The cluster initially survived the node loss and completed redistribution, but later became unstable, further nodes failed, and the benchmark ended in cascading failure. During redistribution at 100K ops/s, Cassandra’s p99 latency rose by roughly 6.0 ms and became highly jittery, while Aerospike’s remained tightly bounded.
The question is not only whether a database remains available after a node fails, but whether it continues to behave in a way the application can rely on while the system is under stress.
What this benchmark really shows
The conventional way to talk about databases is through speed, scale, and availability. Those things matter, but they do not tell the whole story.
For production systems that sit directly in the path of user interaction, the more important question is often this: Can the system remain predictable when conditions become less favorable? That is the broader significance of these results.
Aerospike did not just deliver higher throughput or lower latency in isolation. It maintained tighter bounds as the system filled, behaved more consistently across operation types, and remained more controlled during disruption and recovery. Cassandra, by contrast, stayed available in some scenarios but became materially less predictable as stress increased.
And that is exactly how applications get broken without a database ever going down.
Download the full Aerospike vs. Cassandra benchmark
The most important takeaway is not that Aerospike was faster in this benchmark; it is that Aerospike remained more predictable as production conditions became more volatile. For the full methodology, charts, caveats, and raw results, download the benchANT paper.
