The three price tags: How Redis unpredictability costs you infrastructure, engineering time, and UX
Redis unpredictability shows up in three places: your infrastructure bill, your engineering capacity, and your users' experience. Here's why it happens and what it costs.
Redis became a standard because it solved problems. Low-latency reads, simple data models, and fast time-to-first-value made it a good starting point for teams that needed speed without complexity. Most companies that adopted it had made a reasonable decision.
But when the conditions under which Redis performs well no longer hold, teams can take a while to recognize that they are managing structural problems.
There are three places where Redis' architectural limits turn into organizational costs:
Infrastructure costs
Engineering time
User experience
These costs reinforce each other and tend to compound until they cause a crisis.
What makes Redis unpredictable at scale
Understanding these costs requires understanding the cause. Redis was designed as a single-instance, in-memory data store. That design is efficient and fast when the working set fits in memory, access patterns are stable, and operations stay simple.
But when those conditions shift, which nearly always happens in production, the same architecture that made Redis fast makes it fragile.
The single-threaded constraint
Redis processes commands on a single thread. This means that no matter how much CPU capacity you provision, only one core does the work at a time. One slow command, a large scan, or a blocking operation introduces latency across every concurrent request in the queue.
This means Redis doesn’t take full advantage of the hardware it runs on. Teams often compensate by running multiple instances per host, but this adds coordination overhead and complexity without fixing the problem.
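The queueing effect described above can be illustrated with a minimal sketch. It is a toy model, not Redis code: the function name and the service times are hypothetical, and it assumes all commands arrive at once and are executed strictly in order by one worker.

```python
from itertools import accumulate

def completion_times_us(service_times_us):
    """Return when each command finishes, in microseconds, if all commands
    arrive at time 0 and one worker executes them strictly in order."""
    return list(accumulate(service_times_us))

# Hypothetical workload: ten commands that each need 100 microseconds.
fast_only = [100] * 10
# Same workload, but the first command is a slow 50 ms operation.
one_slow = [50_000] + [100] * 9

print(completion_times_us(fast_only)[-1])   # last command finishes at 1000 us
print(min(completion_times_us(one_slow)))   # every command waits at least 50000 us
```

In this model, the nine fast commands queued behind the slow one each inherit its full 50 ms, even though they need only 100 microseconds of work themselves.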
Memory as the only first-class store
Redis keeps the entire dataset in memory, so available RAM sets a hard ceiling on dataset size. When a dataset grows beyond that ceiling, Redis has two options: evict keys according to a configured policy, or reject writes.
Neither is a good solution. Eviction loses data, while rejection results in application errors. What’s worse is that in both cases, failure isn’t obvious until it affects production.
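The two outcomes can be sketched with a toy store that has a fixed key budget. This is an illustration only, not how Redis implements eviction: the class is hypothetical, it counts keys rather than bytes, and it uses exact LRU ordering. The policy names mirror the Redis `maxmemory-policy` values `allkeys-lru` and `noeviction`.

```python
from collections import OrderedDict

class BoundedStore:
    """Toy key-value store with a fixed key budget (a stand-in for a memory limit)."""

    def __init__(self, max_keys, policy):
        if max_keys < 1:
            raise ValueError("max_keys must be at least 1")
        if policy not in ("allkeys-lru", "noeviction"):
            raise ValueError(f"unsupported policy: {policy}")
        self.max_keys = max_keys
        self.policy = policy
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data[key] = value
            self.data.move_to_end(key)
            return
        if len(self.data) >= self.max_keys:
            if self.policy == "noeviction":
                # The write fails loudly and the application sees an error.
                raise MemoryError("store is full; write rejected")
            # The write succeeds, but the least recently used key silently disappears.
            self.data.popitem(last=False)
        self.data[key] = value

lru = BoundedStore(2, "allkeys-lru")
for key in ("a", "b", "c"):
    lru.set(key, key.upper())
print(lru.get("a"))  # None: "a" was evicted to make room for "c"
```

With `noeviction`, the third `set` would raise instead, which is the application-error path described above.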
Persistence creates a performance tradeoff
Redis supports persistence through two mechanisms: point-in-time snapshots (RDB files) and append-only log files (AOFs). Both approaches involve tradeoffs. Because RDB snapshots are taken periodically, a crash loses any data written since the last snapshot. AOF logging is more durable but adds latency and throughput overhead when writes are frequent.
In practice, many teams disable or relax persistence settings to preserve the speed they adopted Redis for in the first place, but risk losing data in the process.
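A rough way to compare the two mechanisms is to estimate how many writes sit between the last durable point and a crash. The sketch below is simple arithmetic under stated assumptions: the write rate and snapshot interval are hypothetical, and the one-second AOF window assumes the log is flushed to disk about once per second.

```python
def worst_case_lost_writes(writes_per_second, seconds_since_last_durable_point):
    """Upper bound on writes lost if the process crashes just before
    the next snapshot or log flush."""
    return writes_per_second * seconds_since_last_durable_point

WRITES_PER_SECOND = 2_000    # hypothetical write rate
SNAPSHOT_INTERVAL_S = 300    # hypothetical: one RDB snapshot every 5 minutes
AOF_FLUSH_INTERVAL_S = 1     # assumption: AOF flushed roughly once per second

print(worst_case_lost_writes(WRITES_PER_SECOND, SNAPSHOT_INTERVAL_S))   # 600000
print(worst_case_lost_writes(WRITES_PER_SECOND, AOF_FLUSH_INTERVAL_S))  # 2000
```

Shrinking the window reduces exposure but increases how often Redis does persistence work, which is the tradeoff this section describes.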
Clustering adds overhead without addressing the root causes
Redis Cluster distributes data across multiple nodes and is the standard approach for scaling beyond one instance. But routing each request to the right shard, whether through a proxy layer or client redirects, introduces coordination overhead, and the single-threaded design still applies within each shard. Horizontal scaling gives you more nodes, but performance doesn't improve linearly, and resharding introduces latency spikes.
The first price tag: Infrastructure costs that grow faster than data
In many cases, organizations try to solve these problems by throwing hardware at them: more nodes, more memory, or both. But these costs add up without improving performance much. Here’s why.
RAM pricing compounds
Every byte of data in Redis requires at least a byte of RAM. For small datasets, this is fine. But at terabyte scale, the bill becomes significant: RAM costs more per gigabyte than SSD storage, so the absolute gap widens as datasets grow.
The cost model is also nonlinear. To keep performance steady, teams routinely overprovision memory. The closer Redis runs to its memory ceiling, the more eviction pressure affects behavior, so staying safely below that ceiling requires allocating excess RAM. That buffer is effectively dead capacity that costs money every month.
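The effect of headroom on the memory bill can be estimated with a short calculation. This is a sketch with hypothetical inputs: the function name, the replica count, and the 50% utilization target are assumptions chosen for illustration, not recommendations or measured values.

```python
def provisioned_ram_gb(dataset_gb, copies, target_utilization):
    """RAM to provision so that `copies` full copies of the dataset
    occupy only `target_utilization` (0 to 1) of total memory."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return dataset_gb * copies / target_utilization

# Hypothetical inputs: 1 TB of data, primary plus one replica, 50% utilization.
needed = provisioned_ram_gb(1_000, copies=2, target_utilization=0.5)
idle = needed - 1_000 * 2
print(needed)  # 4000.0 GB provisioned
print(idle)    # 2000.0 GB held as headroom
```

Under these assumptions, a 1 TB dataset is paid for as 4 TB of RAM, half of which exists only to keep the system away from its ceiling.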
An independent benchmark by McKnight Consulting Group comparing infrastructure costs at 1 TB, 5 TB, and 10 TB datasets found Aerospike infrastructure costs running 78% to 87% lower than Redis across all three dataset sizes.
The node multiplication problem
As datasets grow, teams add Redis nodes, either to distribute the dataset or to increase throughput. Each node requires RAM provisioning, monitoring, configuration management, and operational overhead. Every new node adds more engineering work.
When Adjust migrated away from its Redis-based fraud prevention infrastructure, it reduced its server count from 40 nodes to 6, an 85% reduction, while improving latency and failover consistency.
This shows how much infrastructure Redis needs to handle a dataset that another architecture handles with a fraction of the hardware.
The second price tag: Engineering time spent managing Redis' limits
Hardware is costly enough, but in many organizations the highest cost is people. As Redis runs up against its limits, engineers spend more of their time working around its problems.
Workaround accumulation
Every Redis limitation a team encounters tends to get solved with an additional layer. Hot key problems get solved with local caching. Eviction problems get solved with TTL tuning and explicit pre-warming scripts. Persistence tradeoffs get solved with a separate durable store. Clustering overhead gets solved with read replicas and proxy logic.
Each workaround is reasonable in isolation. Together, they form an architecture far more complex than the one the team set out to build, and each component in that stack requires someone to understand, maintain, and debug it.
This engineering time cost isn’t obvious but accumulates steadily. It is the hour spent investigating an eviction spike. It is the on-call page when a Redis node failover causes downstream write loss. It is the architecture review where the team decides whether to add another cache tier or finally reconsider the data layer.
Failover behavior in practice
Redis replication is asynchronous by default. During a primary failure, the replica that gets promoted may not have received the most recent writes before the primary went down. How much data gets lost during a failover depends on replication lag at the time, which is often unknown in advance and varies under load.
For teams running high-write workloads, this means additional safeguards such as backup processes, write-ahead queues, or dual-write patterns that add coordination complexity and engineering overhead.
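The failover behavior described in this section can be sketched with a toy primary and replica. The class below is hypothetical and ignores networking, acknowledgments, and partial resynchronization. It only models the fact that the primary accepts writes before the replica has them.

```python
class AsyncReplicatedPair:
    """Toy model of a primary with one asynchronously updated replica."""

    def __init__(self):
        self.primary_log = []
        self.replica_log = []

    def write(self, value):
        # The primary accepts the write immediately; the replica catches up later.
        self.primary_log.append(value)

    def replicate(self, max_entries):
        # Copy up to max_entries not-yet-replicated writes to the replica.
        start = len(self.replica_log)
        self.replica_log.extend(self.primary_log[start:start + max_entries])

    def fail_over(self):
        # The primary dies and the replica is promoted with whatever it has.
        lost = self.primary_log[len(self.replica_log):]
        self.primary_log = list(self.replica_log)
        return lost

pair = AsyncReplicatedPair()
for i in range(10):
    pair.write(i)
pair.replicate(max_entries=7)   # replica is 3 writes behind when the primary fails
print(pair.fail_over())         # [7, 8, 9]: accepted by the primary, then lost
```

The size of the lost list is exactly the replication lag at the moment of failure, which is why the amount of data at risk is hard to know in advance.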
Capacity planning without predictability
Teams running Redis at scale discover that capacity planning is difficult because its performance characteristics are not linear. Adding memory does not produce proportional latency improvements. Adding nodes does not produce proportional throughput improvements. The single-threaded constraint and clustering overhead create a system where headroom calculations require empirical testing rather than straightforward projections.
The result is that organizations tend to overprovision, not because they expect peak load to require it, but because they cannot confidently predict where performance will degrade. That overprovisioning costs money, and it comes on top of the engineering time spent sizing, testing, and validating clusters so they do not hit unpredictable limits under untested conditions.
The third price tag: Uneven user experience
Redis' headline benchmark numbers are often measured under controlled, stable conditions. But production conditions are not stable. Workloads shift throughout the day. Access patterns change after product launches or content updates. Datasets grow. Cache hit rates vary. The conditions that produce those headline latency numbers are only a subset of what production systems encounter.
What matters for user experience is not average latency but tail latency: the behavior at P99 and P99.9. A system that delivers 1 ms average latency but 50 ms P99 latency may look only occasionally slow. But by definition, about 1 in 100 requests is at or beyond the P99 latency, which in a busy system means a lot of degraded interactions.
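The gap between average and tail latency is easy to see on a small sample. The sketch below uses a nearest-rank percentile and a made-up latency distribution. The helper name and the numbers are hypothetical and are not taken from any benchmark.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: 980 requests at 1 ms and 20 requests at 50 ms.
latencies_ms = [1] * 980 + [50] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                       # about 2 ms on average
print(percentile(latencies_ms, 50))  # 1: the typical request is fast
print(percentile(latencies_ms, 99))  # 50: the tail is 50x slower
```

The mean and median both look healthy here, which is why dashboards that track only averages can miss the requests users actually complain about.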
The McKnight Consulting benchmark found Aerospike P99 latencies running 17% to 48% lower than Redis, with throughput 11% to 24% higher across all tested workloads.
Redis and Aerospike are similar in the average case, but it’s at the tail where users see a difference, and the tail experience is what users remember.
Latency compounding in multi-step pipelines
Many enterprise systems do not serve user requests with one database lookup. A fraud detection check might require querying transaction history, device fingerprints, behavioral patterns, and account metadata. A personalization response might depend on user preferences, session context, content signals, and feature store lookups. A real-time bidding decision might involve dozens of parallel reads against data that needs to be fresh within milliseconds.
In these architectures, user-facing response time is limited by the slowest operation in the chain. When latency for individual responses is variable, which it is for Redis under realistic load, that variability compounds across every step. If each of fifty operations independently has a 1% chance of hitting a slow path, roughly 40% of requests (1 - 0.99^50 ≈ 0.395) encounter at least one slow operation.
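The compounding effect can be estimated with a short calculation. This is a sketch that assumes each operation hits the slow path independently with the same probability; real workloads may be correlated, and the function name is hypothetical.

```python
def p_any_slow(p_slow_per_op, n_ops):
    """Probability that at least one of n_ops independent operations
    hits the slow path, given a per-operation probability."""
    return 1 - (1 - p_slow_per_op) ** n_ops

print(round(p_any_slow(0.01, 1), 3))   # 0.01: a single lookup
print(round(p_any_slow(0.01, 4), 3))   # 0.039: four lookups per request
print(round(p_any_slow(0.01, 50), 3))  # 0.395: fifty lookups per request
```

A 1-in-100 tail event for a single lookup becomes a roughly 2-in-5 event for a request that fans out to fifty lookups, so tail latency at the data layer ends up dominating the user-facing experience.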
This is where Redis' tail latency behavior shows up: not in simple lookup patterns, but in the orchestrated, multi-step data access patterns that power the high-value features that enterprise teams care most about getting right.
When persistence kicks in during peak load
Redis' background persistence processes of RDB forking and AOF rewriting are designed to reduce the effect on read and write performance. In practice, they introduce measurable latency and throughput variability, particularly on write-heavy workloads or when the working set is large. These processes tend to run during high-load periods, because that is when write rates are highest. In other words, just when performance is most important, background processes introduce the most interference.
For systems serving real-time user interactions, this is a meaningful reliability concern. The database is most likely to introduce latency at the moments when the application is under the most pressure to perform well.
What reliable behavior at scale requires
Avoiding these problems means treating production conditions as the design baseline rather than the exception.
Systems designed for predictability under changing conditions handle storage differently. Rather than coupling performance to the amount of memory, Aerospike's patented Hybrid Memory Architecture stores data on SSDs and keeps indexes in RAM. That delivers latency close to in-memory performance without requiring enough RAM to hold the entire database. This decoupling saves money and eliminates the memory ceiling as a scaling constraint.
Threading matters too. A system such as Aerospike that takes advantage of multi-core processors does not face the single-threaded bottleneck that limits how much work Redis can do per node. More work per node means fewer nodes for the same throughput, which means less hardware to manage and simpler capacity planning.
Here are some examples:
Wix replaced its Redis-based personalization stack and cut costs by 45% while reducing latency from 18ms to 2–3ms.
PayPal cut TCO by 80%, reducing server count by 87.5% while handling more than 8 million transactions per second for fraud detection with sub-millisecond latency.
These are the results you get when the underlying architecture stops requiring workarounds to perform at scale.
Consistency under failure also changes what failover events look like in practice. When replication is synchronous and write durability is guaranteed, a node failure creates a recovery event, not a data loss event, which is easier to handle.
Resiliency benchmarks in the McKnight report showed Aerospike throughput dropping 1% to 11% during simulated node failures compared with Redis' 7% to 18% drop, with stronger write consistency.
Moving from managing workarounds to running a system that handles production conditions on its own is how teams recover the engineering time cost. Teams that have made that transition consistently report that the reduced workload is as significant as the infrastructure savings, even though it shows up in sprint capacity rather than as a line item on the cloud bill.
Intrigued? Here’s how to find out more
Of course, every organization is different. But if your Redis installation is feeling more and more sluggish, and if you feel like you have to work harder and harder just to stay in the same place, it’s worth researching what might be causing the problem in your specific situation.
Here are three resources to help you figure it out:
The Aerospike vs. Redis benchmark report puts the infrastructure cost and latency comparisons on the same page by using the same dataset sizes and the same workload types, measured independently. It's the clearest source for quantifying whether what you're managing is a configuration problem or a design one.
If you're past curiosity and into evaluation, the Redis to Aerospike migration guide covers what’s involved in the transition from Redis to Aerospike.
And if you want the direct side-by-side before going deeper, the comparison page is the faster read.
Frequently asked questions about Redis predictability
Find answers to common questions below to help you learn more and get the most out of Aerospike.
Redis performance degrades as datasets grow because it is tied to memory. When a dataset fits comfortably in RAM with room to spare, Redis runs at its best. But as a dataset approaches the memory ceiling, Redis starts applying its eviction policy, cache hit rates drop, and the system must choose among accepting writes, evicting existing data, and rejecting new keys. Additionally, larger datasets increase the cost of persistence operations such as RDB forking, which create brief but measurable performance effects. The degradation is a consequence of an architecture where everything gets stored in memory.
Adding Redis nodes helps only partially. It distributes the dataset across more memory, which raises the ceiling before eviction is required. But it does not address the single-threaded processing constraint within each shard, and the coordination needed to route requests across shards introduces overhead that limits throughput gains as you add more nodes. Resharding events also introduce latency spikes. Horizontal scaling with Redis improves performance, but adds costs as well.
The risk of data loss during a Redis failover depends on the replication lag when the failure occurs. Redis replication is asynchronous by default, which means the replica may not have received every write that the primary processed before it failed. In low-write environments or when replication lag is consistently low, data loss is rare. But in high-write environments or when network conditions create replication lag, it happens more often. Teams that cannot accept data loss often have to implement additional safeguards such as dual writes and external queues, which add complexity.
Whether Redis costs more than the alternatives depends on how much data you have. For small datasets, Redis' memory costs are manageable. But at terabyte scale, storing data in RAM costs a lot more than storing it on SSD. The McKnight Consulting Group benchmark quantified this at 78% to 87% lower infrastructure costs for Aerospike across 1 TB, 5 TB, and 10 TB datasets.
