Memcache users - Shift, don't Shard

Caching with Memcache

Many developers, like those at Snapdeal, start with a relational database and find that performance is insufficient. The pressures of throughput and/or latency force them to use a caching solution like Memcache (aka Memcached) and RAM rather than disk. For small instances where a single cache can hold enough data, this may prove sufficient. In other cases, a single instance of Memcache is not enough to deal with the volume of traffic, so other strategies are required.

Sharding to Scale

Companies may use sharding inside the application or some other similar technique to make use of more than one Memcache instance to create additional capacity. There are problems with these strategies that get progressively bigger as you try to scale this to many Memcache instances. These include one or more of the following:

The cache size itself can be problematic. While it is easy to think that the cache will hold the “most valuable data,” this is often very difficult to predict. Choosing a cache too small renders it useless. Some use cases may require caching all the data and using the relational database as a backup. The hardware costs of storing this much data in memory can be very high.
Sharding becomes increasingly difficult. Changing the sharding algorithm as you add nodes is disruptive: it invalidates the existing cached data and forces a rebuild. During this time performance will at best degrade, and at worst the service may suffer outages. You may also find that some shards hold a grossly disproportionate amount of data or traffic, forcing you to spend a lot of time tuning them manually.
There is no automated failover to another cache if a Memcache instance dies. While Memcache is very stable as a product, it is still subject to hardware faults. If a host's power supply dies, the data held in that instance's cache becomes inaccessible. Developers must code their application to either go back to the relational database or simply produce an error.
There is no persistence, so when a Memcache instance dies, all the data in that node goes with it. As a result, when a host goes down due to a hardware fault, the cache on that host must be rebuilt (if possible). This can be a very time-consuming process.
Developers must think about and manage the data between the relational database and Memcache separately. This often results in making multiple reads or writes – one for the cache and one for the relational database. Although caching is not a difficult concept, it can complicate development work.
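
To make the resharding problem concrete, here is a minimal sketch in Python. It assumes the application picks a Memcache instance by hashing the key and taking the result modulo the number of instances, which is one common scheme but not necessarily the one any given deployment uses. The function names and the `user:N` key format are hypothetical, chosen only for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_remapped(keys, old_shards: int, new_shards: int) -> float:
    """Return the share of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key set; real key names will differ.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_remapped(keys, old_shards=4, new_shards=5)
print(f"{moved:.0%} of keys map to a different instance after growing from 4 to 5")
```

With a uniform hash, a key stays on the same instance only when its hash gives the same remainder modulo 4 and modulo 5, which happens for about one key in five. Roughly 80% of the cached entries are therefore looked up on an instance that does not hold them, so adding a single node behaves much like flushing most of the cache.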

Shift, don’t Shard

Instead of managing a sharding layer (and its many separate shards) with a relational database, customers are now shifting to a single Aerospike cluster and a cache-less architecture. This gives them:

Automatic and immediate failover when a node fails.
Replication of data to 2 or more copies. Even if a node fails, the Aerospike cluster will work to maintain the proper number of copies of the data.
Predictable performance with the same level of latency to any record across the cluster. There is no need to determine the most valuable data, because it is all handled the same way: with very high throughput and low latency.
Lower cost using SSDs rather than memory. The costs calculated here are for an extended period of time. While the way that some databases use SSDs may lead to hotspots that wear out a drive prematurely, Aerospike has its own mechanism for wearing SSDs evenly. Customers have used SSDs on Aerospike in extremely high-intensity environments for years with very few failures. In addition, a single server with SSDs can generally hold much greater volumes of data than one that relies on RAM alone. This leads to fewer servers, which lowers costs even more.

Replace Memcache or replace Memcache and MySQL

One set of customers have been replacing Memcache with Aerospike and keeping their relational database. This gives them:

Immediate failover in the event of a lost node.
Persistence of the data on disk.
Choice of RAM or low-cost SSD to store the data.

Another set of customers replace both Memcache and MySQL. Here the application reads and writes all of its data through a single Aerospike cluster, rather than maintaining separate connections to Memcache and a relational database. This gives them all the benefits listed above, plus:

Simpler application development. Developers need only worry about interfacing with an Aerospike cluster, rather than reading from and writing to both a cache and a relational database.
Simpler operations of a single (self-managing) Aerospike cluster as opposed to a relational database and many independent Memcache instances.
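
The difference in application code can be sketched as follows. This is an illustration only: the cache and the database are in-memory stand-ins, and every name here (`FakeCache`, `read_with_cache`, `read_single_store`, and so on) is hypothetical rather than part of the Memcache, MySQL, or Aerospike client APIs. The first pair of functions shows the cache-plus-database pattern, including the fallback the application has to handle when a cache node is down; the second pair shows the single-store path.

```python
class CacheUnavailable(Exception):
    """Raised by the stand-in cache when its node is down."""


class FakeCache:
    """In-memory stand-in for one cache instance (not a real client)."""

    def __init__(self):
        self.data = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if not self.alive:
            raise CacheUnavailable(key)
        self.data[key] = value


def read_with_cache(cache, database, key):
    """Try the cache first; fall back to the database on a miss or a dead node."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except CacheUnavailable:
        # No automatic failover: the application goes back to the database.
        return database.get(key)
    value = database.get(key)
    if value is not None:
        cache.set(key, value)  # repopulate the cache after a miss
    return value


def write_with_cache(cache, database, key, value):
    """Two writes per update: one to the database, one to the cache."""
    database[key] = value
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # the cache copy is simply lost until the node returns


def read_single_store(store, key):
    """Cache-less architecture: one read path."""
    return store.get(key)


def write_single_store(store, key, value):
    """Cache-less architecture: one write path."""
    store[key] = value


# Usage with plain dicts standing in for the database and the single store.
database, cache, store = {"a": 1}, FakeCache(), {}
first_read = read_with_cache(cache, database, "a")  # miss, then cached
cache.alive = False                                 # simulate a node failure
fallback_read = read_with_cache(cache, database, "a")
write_single_store(store, "a", 1)
```

The cached path needs failure handling and a second write on every update, while the single-store path leaves replication and failover to the cluster instead of the application.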

Aerospike provides the same performance and latency as Memcache, while being much easier to manage.

