Five reasons to replace Redis
“Hey, everybody! This is George Demis from Aerospike. I’m gonna wait, uh, just a minute or two for any folks that are logging in, so stand by. But you’re at the right place. This is ‘Five Reasons to Replace Redis,’ with the venerable Ronen Botser from Aerospike and myself from the product marketing team. We’re gonna start, as I say, in just a minute or so. In the meantime, I suggest maybe humming a little tune to yourself to provide music.
Ronen, how are you feeling today?”
“Good. I feel venerated. It’s pretty nice.”
“Good, that’s only appropriate. Alright, oh, by the way, there is a chat and a questions function in the tool, so if you have any questions or statements or comments, we’re happy to hear from you. Despite the title, this is not trying to be a hit piece on Redis, but we find a lot of Redis installations, when they get to a certain scale, turn to Aerospike. So, we’re going to talk about why that is and try to do it as dispassionately as we can. You know, no real big issues with Redis. It’s a great platform, people love it, and it does extraordinary things. But, as we’ve seen, at a certain scale it starts running into some problems.
There’s been recent excitement with Redis with their licensing announcement, but we’re not going to touch on that today. For those of you who are in situations where you’re using Redis and maybe starting to approach its limits, we hope you’ll remember your friends at Aerospike and the amazing technology that our engineers have created.
I’m two and a half years into the company and still pretty amazed by our capabilities, so I’m happy to talk on the subject. Ronen has long experience with Redis. In fact, if you want a good dose of Ronen in his pure form, go and look up that garbage benchmark video from an Ignite session he did. It is really something. So, thank you for that, Ronen. Whenever I’m feeling down, I go and look at that video. Alright, so three minutes have passed. We’re going to start ‘Five Reasons to Replace Redis.’
One is probably the most common, and that is for improved scalability. Aerospike lives in the rarefied air of a million, 10 million, 100 million transactions per second, while still offering very low TCO. The design of Redis is very heavily dependent, obviously, on in-memory performance. So, to serve large data sets, Redis clusters tend to sprawl. They also, when you get to a certain scale, become a bit less predictable as far as performance, especially if the amount of data surpasses the amount of memory available to a Redis node.
There is then the problem of the carbon footprint from running much larger clusters than you really need, and the associated cost efficiency reasons. So, we’ll talk about each of these in turn. First, we’ll talk about improved scalability. This is probably the number one reason that people look to move on from Redis, especially for mission-critical workloads. You know, that is to say, Redis is running plenty of mission-critical workloads, but up to a certain scale. So, the question then is, Ronen, what is behind the scalability issues?
First, it’s the architecture itself. It was designed as an in-memory data store and cache. It’s a single instance architecture, and they have, in recent years, added clustering and replication and all that, but that’s additional CPU overhead. Can you talk about that for a moment?”
“Yeah, of course. I mean, as opposed to Aerospike, Redis was never designed to be a distributed database. So, a lot of the concepts, in order to make it work as a distributed database, you’ve got to do several things. Either, you know, use the old school approach where you shard it on the application side and then you know, you manage that in your application, and adding nodes, dropping nodes, doing replicas becomes kind of a nightmare. Or you use a proxy, whether it’s the Enterprise version of Redis or something that’s been put out there by the community. Those proxies actually make it into a distributed database because each Redis instance is completely unaware of any other instance around it. It’s not aware of how many cores are on the machine or what the resources are like or anything really. That proxy requires CPU and memory of its own, right? So, you run these things on the same host machine, and you must dedicate a certain amount of resources to the proxy layer. That means fewer cores are available for Redis because the instances are going to want to just go as hard as they can on the cores given to them.
So, in the end, because it’s not designed to be parallelized and it’s not designed to share information across threads or deal with locking between threads or any of those things, you end up having these extra systems like proxying, the persistence layer, you know, if you’re dealing with any type of storage. You know, you’re putting in RocksDB or whatever replacement for Rocks they’re going for now. Those things by themselves are CPU-heavy applications, and you start getting contention between them. The persistence layer also is one of those dirty secrets about Redis. When they publish benchmarks, they’re always really happy to show you the numbers when there’s nothing else turned on, like, for example, not using persistence to an AOF or RDB or whatever persistence layer they’re using. The second you turn those on, the fact that it’s in-memory doesn’t matter anymore. The slowest component of the system is the one that’s going to dictate the speed. The persistence layer typically creates back pressure and that reduces the throughput, right?
So, there are issues with scalability because, in reality, you have to have a proxy layer. Every application in the world is actually a distributed application, and all databases need to be distributed data stores. You must use persistence unless you’re using it only in tiny caching use cases.”
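The application-side sharding described above can be sketched in a few lines. This is a toy illustration, not how any particular Redis client or proxy works: `shard_for` and `fraction_remapped` are hypothetical helpers, and hash-modulo placement is assumed as the simplest scheme an application might hand-roll. It shows why adding a node is painful when the application owns the sharding.

```python
import hashlib


def shard_for(key: str, node_count: int) -> int:
    """Pick a shard by hashing the key and taking it modulo the node count."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def fraction_remapped(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose shard changes when the node count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_nodes) != shard_for(k, new_nodes)
    )
    return moved / len(keys)


# Hypothetical key set; the node counts are arbitrary example values.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_remapped(keys, old_nodes=4, new_nodes=5)
print(f"{moved:.0%} of keys map to a different node after growing from 4 to 5")
```

With modulo placement a key stays put only when its hash leaves the same remainder for both node counts, so under uniform hashing roughly four in five keys move in this example. That reshuffling is the work the application has to manage itself.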
“Yeah, and we have a customer called AppsFlyer. They are a mobile attribution marketing analytics platform, and they migrated from Redis to Aerospike. They deal with billions, with a B, of devices. So, they went from dealing with 1.4 billion to five billion devices in the first year. One of the kind of killer apps for an attribution company is reducing false positives, and they were able to bring that down to zero and enhance their accuracy. So, definitely a mission-critical application for attribution and marketing analytics. So, that is one proof point that Aerospike can scale. Actually, if you go to the Aerospike customer list, it is chock-full of customers that are doing a million transactions per second, 10 million, even 280 million transactions per second on an Aerospike cluster. Now, very few companies need that type of throughput, that type of performance, but in terms of headroom, it’s almost unlimited with Aerospike. Scalability is probably the least surprising reason for people that have tried to scale Redis. So, that is number one.
The second reason is that, as Ronen mentioned, Redis really depends on having enough memory to house your data. Redis really needs more memory than data, and if you have really large data sets, you have to bring additional servers into play to provide a larger memory footprint. That means more servers. There is only so much memory per node, and then you scale out because of a number of things that Ronen just mentioned. The result is more servers, which means more power, more HVAC, more data center real estate. It just uses a lot more resources. It is built into Aerospike’s design that we thrive on fewer, higher-density nodes. We don’t rely as much on memory to get great performance. We have some really good high-performance storage technologies built into Aerospike. So, we think that smaller numbers of higher-density nodes are the way to go.
One of our customers, Adjust, had an issue with server usage and wanted to improve their accuracy for their fraud prevention capabilities. They used Aerospike to reduce a 40-node Redis cluster to a six-node Aerospike cluster. Lower infrastructure costs, improved latency, better failover, better accuracy. So, that is a good example of how efficiency and smaller clusters with fatter nodes give you a lot of benefits. Ronen, anything you’d like to add to that?”
“I mean, there’s a few things there. Number one, these higher sprawl clusters, again related to density, you can only put so much memory on a machine. Redis also takes quite a bit of overhead, not something that they publish. If you go to Aerospike’s capacity planning guide, there’s very detailed explanations of the overheads in terms of memory beyond the data you’re actually trying to store. Redis tends not to tell you about that. What they recommend is when you’re running out of CPU or running out of memory, add more nodes. That’s it. That comes down to capacity planning there.
So, as you keep adding more and more nodes, you may be in a situation where you’re in the cloud and it’s just harder to get the next node. Capacity in the cloud is not infinite. You go to an AZ, and there’s only so many machines that you have. If you need more machines and you need to scale up, that becomes an issue, especially in busy periods like towards the end of the year. That’s number one. Number two, a shorter mean time to failure. More machines just mean more things are going down, more instability in your cluster, and therefore, also higher costs to manage it. When you have larger clusters, you need more systems, more people to manage and make sure that they’re up and running. The stability with Aerospike, the reduction in server cost, the fact that it simply runs and works without needing to constantly manage it, all of those are advantages for our customers who have chosen to move to Aerospike from other systems, including Redis.”
“Yeah, and I just want to include this here, from our website: this is not just a Redis issue. It’s hard to scale distributed systems, and different platforms do better or worse at it. But one of the consistent reasons why people move to Aerospike is this efficiency, this doing more with fewer nodes. So, it’s Redis, it’s Cassandra, it’s Couchbase. Beyond a certain scale, these clusters get very costly, with all those associated costs that Ronen just mentioned. So, check us out.
Alright, the third challenge is predictability, and this has a lot to do with mixed workloads. A lot of databases do great with read-only workloads, but when you start getting mixes of reads and writes, especially at a huge scale, lots of platforms have trouble with it, not just Redis. But Redis does have an issue with this. The persistence layer causes, as Ronen mentioned, back pressure on the database. Balanced or write-heavy workloads are especially susceptible to this. Ronen?”
“Yeah, because in the end, you need to actually log those to a persistence layer, whether you’re using the old school AOF or some other thing. You are then going to be limited by the performance of the persistence. So, the fact that it’s in-memory is all cool if you’re going to be a cache only, no problem. You don’t need to have any of that. It could be volatile. You can go as hard as the system will allow you to go. But the second you actually want to use it as a database, that persistence layer is going to be the thing that causes back pressure. Now, you can improve it by swapping in all kinds of different persistence layers, but only up to an extent. These are things where the performance for any given read and write is not going to be consistent. It really depends if you’re writing to memory, if it’s already in memory, or if it’s on disk. If it needs to reach the disk, you’re going to get latency variations. When you’re doing maintenance operations related to that persistence layer, something that is LSM tree-like that defers doing defragmentation and compaction, all of those will then kick in. You get these other behaviors that then constrain you and give you less predictable latency than you’d want.”
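The back-pressure argument above comes down to the slowest stage bounding the whole write path. The sketch below is a minimal illustration under stated assumptions: `sustained_ops_per_sec` is a hypothetical helper, and the capacities are invented numbers chosen only to show the shape of the effect, not benchmark results for any product.

```python
def sustained_ops_per_sec(stage_capacity: dict[str, int]) -> tuple[str, int]:
    """Return the slowest stage and the throughput it imposes on the path."""
    bottleneck = min(stage_capacity, key=stage_capacity.get)
    return bottleneck, stage_capacity[bottleneck]


# Hypothetical capacities in operations per second, not measurements.
cache_only = {"network": 900_000, "in_memory_write": 1_000_000}
with_persistence = {**cache_only, "persistence_log": 150_000}

print(sustained_ops_per_sec(cache_only))        # ('network', 900000)
print(sustained_ops_per_sec(with_persistence))  # ('persistence_log', 150000)
```

Once a persistence stage is in the path, the in-memory write speed no longer determines the sustained rate in this model.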
“Yeah, and so Redis made a high-profile attempt to solve some of these problems with the acquisition of Speedb, which is a fork of RocksDB. That gives Redis a higher-performance storage platform. It’s a little early; they have to go through the process of then integrating that with Redis and making it a seamless solution. But let’s assume that that is on the way and people have tested it and it works. The challenge with that is that those are two databases to solve one problem, and so they are going to compete for CPU with each other and with all the other things Ronen mentioned, the proxy and other processes that take up CPU and take up time. So, Speedb is certainly an attempt to address that issue. Ronen, anything else we should know about that?”
“Yeah, I mean, in general, anybody that’s been an engineer is aware that handoff between threads is where you get contention, right? So, the fact that you have completely different systems running on independent cores and threads means that you get contention on these handoffs. So, first thing, they need to use different resources, and you cannot use all the cores on a host machine for Redis instances as one. There’s also just straight-up contention in the passing over data that you need between one and the other. In Aerospike, we’ve really reduced those considerably. We have all these lockless data structures. We have our threads that literally do everything end-to-end: pick it up off the network, go and get information off the storage device, do the work, put it back on the network card. There is no thread handing off to another thread. We’ve worked extremely hard to get the most out of parallelizing across all the cores and threads within a node. It’s something that by design Redis cannot do as well. That’s one thing.
The other thing is, I mean, Speedb, I’m sure, is great and better than RocksDB. But in the end, congratulations, Redis now acknowledges that storage is an important thing. It’s only starting about 14 years after Aerospike, where we’ve been putting a lot of work from the very beginning into very low-level optimizations to get maximum performance out of the storage layer. Basically, using NVMe drives the same way we use memory or anything else. It’s all one unified storage format across the different storage engines. It’s a great improvement for Redis users, but it’s going to take some more time.”
“So, another part of that predictability is, actually, I’m going to spend very little time on this, but Ronen, you had mentioned that Redis benchmarks are done without a persistence layer, which is great for in-memory speed, that’s for sure. But the real test, especially for large workloads, is with a persistence layer. We’re eagerly awaiting to see what Redis will publish in that regard. The other thing I’d like you to comment on, Ronen, is the nature of how key lookups are done with document data.”
“Sure. I mean, when you’re doing a test in a key-value system like Redis that has one data type for one key, if you’re looking up one key, cool, it’s very fast. But in reality, people tend to need to model with many data types. You have multiple strings, lists, integers, map data, etc., that you need to represent an object with. In Aerospike, all of those are in one record, under a single key. You can have all the data types side by side, similar to an RDBMS table, but you don’t have a schema. You can add and drop them at will, and the application continues to work. In Redis, if somebody needs to update multiple fields or read multiple fields that are associated logically with one object, that requires lots of key lookups. You’re going to need to access multiple keys to basically assemble one document together. Whereas in Aerospike, that is a single key lookup. That’s where we see, actually in our own customers, significant improvements in latency. Even though Redis is allegedly the fastest database in the world, the fact is, once you’re modeling an object as a set of keys, you’re going to need to update multiple keys or read multiple keys or do a combination of those. In Aerospike, you can do single-record transactions that have multiple operations, all under one key access, one lock. That in itself gives much better performance, even when you’re serving data out of NVMe.”
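The counting argument above can be made concrete with a toy model. This is a sketch only: `CountingStore` is a hypothetical in-memory stand-in that counts key accesses, and it does not use either product’s client API. The key naming (`user:42:<field>`) and the sample profile are assumptions for illustration.

```python
class CountingStore:
    """Toy key-value store that counts key accesses (a stand-in for round trips)."""

    def __init__(self):
        self._data = {}
        self.accesses = 0

    def put(self, key, value):
        self.accesses += 1
        self._data[key] = value

    def get(self, key):
        self.accesses += 1
        return self._data[key]


profile = {"name": "Ada", "visits": 3, "tags": ["a", "b"], "prefs": {"lang": "en"}}

# Model A: one key per field, so the object is assembled from several lookups.
per_field = CountingStore()
for field, value in profile.items():
    per_field.put(f"user:42:{field}", value)
per_field.accesses = 0  # count only the read path
assembled = {field: per_field.get(f"user:42:{field}") for field in profile}
field_reads = per_field.accesses

# Model B: one record holding every field under a single key.
per_record = CountingStore()
per_record.put("user:42", dict(profile))
per_record.accesses = 0
record = per_record.get("user:42")
record_reads = per_record.accesses

print(field_reads, record_reads)  # 4 1
```

Both models return the same object; the difference is how many key accesses it takes to assemble it.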
“Great, and one good example, by the way, I should mention that this webinar really is based upon a blog that I created which really just talked about these customers who have migrated from Redis to Aerospike. Another example, and germane to predictable performance, is IronSource. They’re an ad tech firm. We have a lot of those as customers because they have very tough database requirements. They wanted a more dependable solution and to be more efficient. They were able to enhance their performance with 130K reads per second and 75K writes per second with nearly 100% uptime with sub-millisecond latency. This is another kind of common theme with Aerospike customers. I’ve heard dozens of times customers saying, ‘It just works,’ and that means uptime. A lot of customers really are approaching 100% uptime for years on end, something we’re really quite proud of.
So, they got more efficient with their ad serving, and that efficiency enables new use cases, new business capabilities and functions. What really does that, Ronen, is our hybrid memory architecture. So, let’s take a moment to describe what that is.”
“Sure. At large, before we even get to the storage layer, the Aerospike client is fully aware of the partition map, so it can get to the correct node in one hop. It doesn’t need to communicate with anything else, doesn’t need to have a proxy, doesn’t need to go through some name node. It always knows, and it keeps track of the cluster changes, so it knows how to get to the correct node with a single hop. On that machine, the metadata about that record is looked up in the primary index. That is, you know, a hash table in front of a collection of red-black trees. So, the time to find the metadata is very consistent, no matter how many pieces of data there are, you know, whether you have 10 million or 100 billion objects in your cluster. The amount of time it takes to find the metadata is the same.
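The one-hop routing described above can be sketched as a client-side lookup table. Everything here is a simplified assumption: `ToyClient` and `partition_id` are hypothetical names, SHA-1 stands in for whatever digest the real client computes, the round-robin ownership is invented, and the figure of 4096 partitions comes from Aerospike’s documentation rather than from this talk.

```python
import hashlib

N_PARTITIONS = 4096  # per Aerospike's documentation; not stated in the talk


def partition_id(user_key: str) -> int:
    """Map a key to a partition. SHA-1 is a stand-in for the real digest."""
    digest = hashlib.sha1(user_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


class ToyClient:
    """Keeps a local partition map so each request goes straight to one node."""

    def __init__(self, nodes):
        self.partition_map = {}
        self.refresh(nodes)

    def refresh(self, nodes):
        # Called when the cluster changes. Round-robin ownership is invented
        # here purely for illustration.
        self.partition_map = {p: nodes[p % len(nodes)] for p in range(N_PARTITIONS)}

    def node_for(self, user_key: str) -> str:
        # Pure local lookup: no proxy or name node is consulted.
        return self.partition_map[partition_id(user_key)]


client = ToyClient(["node-a", "node-b", "node-c"])
print(client.node_for("user:42"))
client.refresh(["node-a", "node-b", "node-c", "node-d"])  # a node was added
print(client.node_for("user:42"))
```

The point of the design is that routing is a local dictionary lookup, so no intermediate hop sits between the application and the node that owns the data.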
Once we find the metadata about the record, these 64 bytes, which by the way are cache-line optimized, so that’s exactly what the CPU expects ideally, those 64 bytes tell us exactly which device the data is on. If you’re storing it on SSD, you know, you could also do it in memory or on PM, but it tells us exactly what device it’s on, what the byte offset is to the beginning of the record, and then how far ahead we need to read. One I/O gets you your entire record because it’s stored contiguously. Now, this plays to the strength of SSDs, which allow you to do a very large number of concurrent read and write operations all at once. Aerospike is highly parallelized in order to take advantage of the highly parallelized nature of SSDs, so we get extremely high throughput by using these optimizations. We bypass the operating system, we don’t use a page cache, it’s direct raw device access where Aerospike is managing information about the blocks that it writes. It writes in consistent block sizes, which really maximizes write throughput.
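The lookup path described above (in-memory metadata giving a device, a byte offset, and a length, followed by a single contiguous read) can be sketched as follows. This is a toy model under stated assumptions: `IndexEntry` and its three fields are hypothetical and do not reflect the real 64-byte layout, and an `io.BytesIO` buffer stands in for a raw device.

```python
import io
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical metadata: where a record lives and how much to read."""
    device_id: int
    offset: int
    size: int


devices = {0: io.BytesIO(bytearray(1024))}  # stand-in for a raw device
primary_index = {}  # key -> IndexEntry, held in memory


def write_record(key: str, payload: bytes, device_id: int, offset: int) -> None:
    """Store the payload contiguously and remember its location in the index."""
    device = devices[device_id]
    device.seek(offset)
    device.write(payload)
    primary_index[key] = IndexEntry(device_id, offset, len(payload))


def read_record(key: str) -> bytes:
    """One index lookup in memory, then one contiguous read from the device."""
    entry = primary_index[key]
    device = devices[entry.device_id]
    device.seek(entry.offset)
    return device.read(entry.size)


write_record("user:42", b'{"name": "Ada", "visits": 3}', device_id=0, offset=128)
print(read_record("user:42"))
```

Because the index entry already carries the offset and length, the read never has to scan or chase pointers on the device.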
And the other thing is that we do defragmentation operations continuously in the background. We don’t treat things like an LSM tree, you know, Cassandra, RocksDB, whatever it is, where we just keep taking and taking and taking writes and we just defer the compaction that needs to eventually happen to later. Once that happens, you get spikes in latency because of all the disk I/O. So, defragmentation is this constant background amount of disk I/O that’s being kind of a tax on the system, but it also creates a situation where you don’t get big latency fluctuations. Really, Aerospike from the ground up was designed to be a distributed database and from the ground up designed to use NVMe drives as primary storage.
Now, again, Aerospike allows you to use persistent memory or shared DRAM, meaning actual RAM, as the storage engine, but in the majority of configurations, SSDs are used in the most optimal way possible, and that is what gives us latency and throughput that are competitive with a purely in-memory system like Redis.”
“And we do it actually at different price points and for different use cases for different reasons. So, in-memory, of course, is the fastest and it’s great when you can afford that, and you get the lowest latencies. We have customers that are using up to terabytes of data in these configurations. Then you have the hybrid memory, the use of memory and flash technology that Ronen just described. You can run Aerospike in an all-flash configuration, and then we have just introduced in Aerospike 7 the support for networked NVMe-compatible storage, like network-attached storage from someone like Dell or what have you, as well as NVMe-compatible block storage in the cloud with Amazon’s EBS, and Google and Azure both have block devices you can take advantage of. So, you can obviously deal with tons and tons more data and you’ll have some higher latencies, I mean still less than 10-millisecond latencies with these networked block devices.
Basically, you get to choose your price point, choose the best option for your needs, and that is something that Aerospike has kind of fanatically gone after. Ronen, anything else you’d like to add to that?”
“Yeah, look, something like EBS, there are different variants, right? Something like an io2 version of EBS gives you sub-millisecond latencies and you just pay for what you use. We’ve benchmarked it, and definitely for slightly lower throughput use cases, what’s nice about it is you get much better economics. You get the ability in something like AWS EC2 to just provision compute nodes that don’t have locally attached SSDs and then get as much EBS as you want with it. So, first thing, you control the ratios, as opposed to the storage-optimized instances that have a constant ratio between RAM and disk, which typically works for Aerospike users. If you have your 64 bytes of metadata in memory, if you’re using the primary index in memory, and then you have, let’s say, a kilobyte on disk, that works really well with instances like I4i. But you may want to have a different ratio. You may also have slightly lower throughput. So, there is an inflection point, and that’s something that we can work on with customers if they’re interested, where it’s cheaper if you have higher throughput to actually get these storage-optimized instances, but beneath it, it’s cheaper to get some instance that has CPU and memory and attach as much EBS as you want to it. So, the point is flexibility.
There could be reasons that somebody wants to use an in-memory database. It may be because they just create software that needs to be deployed on all kinds of hardware, and they can’t ensure consistently good NVMe drives. But it also could be because they want to get things in the low hundreds of microseconds rather than sub-millisecond. It could be all kinds of things. The point is we want to allow customers to have the flexibility to decide what is the hardware that fits their use case best and the cost-performance associated with it. Aerospike is agnostic of these storage layers. It doesn’t need to be in-memory, doesn’t need to have any kind of high memory demands, which is why you can get a much smaller node count with Aerospike. These are things where we’ve worked really hard to support all these different storage engines. The other thing to point out is that an Aerospike cluster can have up to 32 namespaces.
Those namespaces can each have different configurations for storage. You can have an in-memory cache that basically has no persistence at all. You can have an in-memory database next to it that has data stored in memory but persisted to EBS. You can also have ones that store just the primary index in memory, 64 bytes per object, and then data on SSD. You can do it all in one single cluster in order to use less hardware to do more stuff, support different applications with different characteristics. That’s what’s nice about Aerospike.”
“And once again, returning to scale, we have numerous customers that are dealing with petabytes in a cluster. But as everyone knows, everyone is seeing more data that looks good and wants to be thrown into the equation. So, the ability to handle petabytes or even tens of petabytes requires some serious thinking on the finances. You want these low-cost options. Actually, Ronen, this feature, the support for EBS in particular, was actually, I believe, a customer request. They didn’t need sub-millisecond, but they needed petabytes.”
“Well, they did want to get as low latency as possible. The point was just that the throughput per device, operations per second per device, was not necessarily one that required them to use local SSDs, like nice NVMe drives. We just want to allow people to create applications on one platform. A developer can use a single container instance on their laptop, a million objects, 10 gigs, whatever for test data, and hand it over to an SRE team that is actually running a 60-node cluster with 200 terabytes on it. The point is that you should be able to decouple development from the concerns of the SRE side of the world. It is kind of bad design to require you to have in your own code access to caching layers, disk-based caching, and then like the dual database, and to do workarounds for the operational system in the code itself, like where to find the data, because you want to make sure that the production system doesn’t fall on its face and catch fire. The fact is that Aerospike is a distributed database that performs as fast as any in-memory cache and that can scale to petabytes. You can test on 10 million objects, but you can actually run the system with 100 billion objects and have petabyte scale.
That is used by our customers for many, many, many years, some of them 10+ years in production. There is a proof point that you don’t have to worry about your own growth crushing you in the future. So, these options are really given to the SRE side of the world. A developer doesn’t need to care about any of these things. Whatever they do on their laptop should then ideally work without any workarounds or spaghetti code to address a big complex tech stack whose only purpose is to make sure that things roughly perform okay in production. We know that Aerospike in production gives that performance regardless of where you store the data. The SRE team may need to have different considerations on what is the ideal kind of cost performance. You may not want to put data in memory because you have a petabyte, and you don’t want to have a thousand nodes to do it. Or you may say, ‘I don’t want to use an NVMe device because it’s mostly sitting idle. I just want to use EBS where I know I can pay for a certain consistent performance, get low latency, pay for the operations, but really get guarantees from the vendor itself, and then not need to have a really, you know, sort of monster NVMe that’s mainly sitting idle.’
So, the developer can develop their applications, the SRE team can choose where they deploy things in a way that makes the best stability and the most cost-efficient decision. So, that’s what’s nice about this.”
“And that actually segues nicely into the fourth reason, and that is sustainability and carbon footprint. With all that is going on with AI and LLMs, as well as blockchain and other things that are using incredible amounts of compute and electricity resources, IT organizations are starting to look at this. It is also a cost center: energy and HVAC, etc. So, having large sprawling clusters doesn’t really help you meet efficiency and sustainability goals. That’s the bad math: CPU and memory-heavy operations consume energy and generate heat, it’s just the laws of physics we’ve been dealing with. But it’s getting more and more crucial that companies start to address this. The number of servers is the data size divided by the available RAM per server in these in-memory systems.
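The sizing relationship mentioned above (servers as data size divided by RAM per server) can be worked through in a short sketch. The inputs are hypothetical: 100 TB of data, a replication factor of 2, and 0.5 TB of usable RAM per node are example values, not customer figures. The 64 bytes per object and the 100 billion object count are the numbers quoted earlier in the talk; applying the replication factor to the index is an assumption of this sketch.

```python
import math

TB = 10**12  # decimal terabyte, to keep the arithmetic easy to follow


def in_memory_nodes(data_bytes: int, replication_factor: int,
                    usable_ram_per_node: int) -> int:
    """Nodes needed when every replica of the data must fit in RAM."""
    return math.ceil(data_bytes * replication_factor / usable_ram_per_node)


def index_memory_bytes(object_count: int, replication_factor: int,
                       index_entry_bytes: int = 64) -> int:
    """RAM needed when only a fixed-size index entry per object is in memory."""
    return object_count * replication_factor * index_entry_bytes


# Hypothetical workload values for illustration only.
nodes = in_memory_nodes(100 * TB, replication_factor=2,
                        usable_ram_per_node=TB // 2)
index_ram = index_memory_bytes(100 * 10**9, replication_factor=2)

print(nodes)           # 400
print(index_ram / TB)  # 12.8
```

In this example the all-in-RAM layout needs 200 TB of memory across the cluster, while keeping only the index in memory needs 12.8 TB, which is where the smaller node counts come from.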
So, once again, the scalability and the server sprawl we talked about all affect the amount of heat generated and the amount of electricity used. Also, technologies that underutilize CPU, that can’t really efficiently take advantage of multi-core processors, actually have an impact on carbon footprint. Save the planet, man. That was an issue at TomTom, the navigation and mapping solution provider. They wanted to slash their operational expenses and shrink their environmental impact. They were able to reduce their carbon footprint by 86% and lower their operational cost. Those are definitely related. In the meantime, with Aerospike, they were able to improve their reliability and uptime.
I’m not going to go through the video here, but before sustainability was cool, Aerospike was fanatical about efficiency, about the use of memory, about how writes operate. Server reduction is a big part of why customers choose Aerospike. But the knock-on effects of way less carbon footprint and improved TCO make it a perfect fit for an Aerospike cluster.”
“Alright, let’s skip the video. The final reason is cost efficiency. That scale-out inefficiency really translates directly to cost inefficiency. It requires much more memory, more nodes, higher costs. The inability to use denser storage requires more machines. Server sprawl becomes a budget sinkhole. Large clusters also require more human resources. The SRE teams have to do more with larger clusters, it’s just the way it works. This is very related to the other reasons, but once again, a big reason why people turn to Aerospike is that we strive for the highest efficiency and the lowest TCO possible.”
“For sure. Look, it’s all about, in any organization, efficiently allocating resources. When you have fewer people that need to deal with keeping your infrastructure, and a database cluster is the infrastructure of any modern application, fewer people doing that can just be doing other things: other automation, other development. Instead of developing your own database, switching to something that’s already out there, that has over a decade of very heavy production use case in all kinds of different configurations, different storage engines for the data storage, different scales, different number of objects. Trillion-object databases, multi-petabyte databases. Aerospike is used for all of those things. Incredibly high uptime, very little resources relative to any other competitor to run these extremely large clusters. That just means you can use your resources elsewhere. People that have been developing the in-house database or the workaround solution around Redis and other layers to make it into a distributed database, or just to run what is out there, you could do other things with those people. So, that’s the whole idea of what we’re proposing on the human side of things.”
“And one company that took this to heart is Wix. Many of you know them, the web development platform. They have a lot of personalization features. They wanted to enhance them, do more while cutting costs and reducing latency. They were able to migrate from Redis to Aerospike, reduce their costs by almost half, and lower their latency from 18 milliseconds to just two to three milliseconds. As Ronen kind of hinted before, database operations are often multiple operations to accomplish one task, so 18 milliseconds to two to three doesn’t seem like much, but if that is 20 puts or 20 gets or whatever, that starts to add up. They were able to improve their personalization and their user experience and reduce their costs. Good on them.
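The “that starts to add up” point above is simple arithmetic, sketched here with the figures quoted in the talk (18 ms before, 2 to 3 ms after, 20 operations per task). The one assumption added is that the operations run one after another rather than in parallel; `sequential_latency_ms` is a hypothetical helper.

```python
def sequential_latency_ms(operations: int, per_operation_ms: float) -> float:
    """Total time for operations issued one after another."""
    return operations * per_operation_ms


OPS_PER_TASK = 20  # the "20 puts or 20 gets" example from the talk

before = sequential_latency_ms(OPS_PER_TASK, 18)
after_low = sequential_latency_ms(OPS_PER_TASK, 2)
after_high = sequential_latency_ms(OPS_PER_TASK, 3)

print(before)                 # 360
print(after_low, after_high)  # 40 60
```

Under that assumption, a task that took about 360 ms would take 40 to 60 ms.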
So, we’re getting near the end. That was our five reasons: improved scalability, reduced server sprawl, more predictability within your performance parameters, reduced carbon footprint, and finally, cost efficiency. Everyone wants to save a buck. That is our presentation today. I want to thank Ronen Botser, venerable product manager from Aerospike. With that, I’m going to say thank you to you all. You can get in touch with us, and as Ronen suggests, try the product. You can do a Jupyter notebook, and there’s a great developer hub on Aerospike.com, or that 60-day free trial that he mentioned, or the community edition. Lots of ways to play around with Aerospike.
We just, you might have read, raised $115 million. We’re going to work on making it even more fun for developers, add new capabilities, and we’re off to the races. So, once again, thank you very much, and I hope you have a great day. That ends our presentation for today. Thanks very much.”
About this webinar
As data volumes soar in high-workload environments, organizations often face challenges with their data management systems, such as scalability, server sprawl, and unpredictable performance. In this webinar, we delve into the benefits and challenges of moving beyond Redis and how to leverage new solutions for better performance. Watch the webinar for:
Real-world case studies on improving efficiency through strategic database transitions
Proven strategies to enhance database performance
Actionable knowledge to implement profitable data management practices