Four essentials of a modern, distributed database
Welcome to today's webinar. My name is Matt Bushell. I head up product marketing for Aerospike. It's my pleasure to have Srini Srinivasan present "The Four Essentials of a Modern Distributed Database" today. Today's topic, as Srini will present it, is a key technology discussion of the underpinnings of Aerospike and some of the foundational principles of how it was created. And so without further ado, Srini, do you want to begin?
Srini Srinivasan:
Hello everybody. I'm Srini Srinivasan, CTO and founder of Aerospike. In this presentation I'll take you through some of the essentials of a modern distributed database. We are going to talk about three major topics: scale out, scale up, and then how we implement distributed transaction management with strong consistency. But first, let me take you through the mission-critical use cases that are the basis of the work we are describing here.
As you can see, across a wide variety of industries and use cases, there is a requirement for very high throughput, low latency transactions that are also highly available. For example, something like fraud detection requires generating a fraud score for a transaction in progress by looking at a couple of thousand items in order to produce a really good score. This has to be recent, behavioral data, and it should all happen within a couple of hundred milliseconds. There are many other use cases I won't go into, but every one of them requires a real-time SLA with the ability to handle recent, real-time data in order to make accurate and better decisions.
So going forward, as I pointed out, we will first spend some time on the basics of scaling out in a distributed database. In Aerospike, we have tackled this problem by building a shared-nothing database cluster. Such a cluster essentially consists of a number of nodes, shown in the bottom part of this picture.
Each of these nodes is identical to every other node, and that's really important. Every node is identical not only in its capabilities, the amount of storage, CPU, networking capacity, and memory available in the node, but also in the code that runs on it. So there is no single master. Essentially it's a distributed system with every node identical and able to perform any of the tasks required to run such a distributed database. Additionally, because it has no single point of failure, the system is also able to run with no hotspots, and there's a lot of engineering involved in ensuring that.
One of the special things about Aerospike is that the clients themselves are intelligent and hold a copy of the mapping of data to nodes, so they are able to reach the target node for a particular transaction in one hop. The load balancing is done on the client in software, and it is able to adapt to changing cluster conditions, which gets into dynamic cluster management. The cluster is dynamic in the sense that nodes can arrive and depart at any point in time without interrupting the flow of transactions; the movement of data needed to rebalance the system after a node is added is all done automatically, and rolling upgrades, of course, can happen with no outage of the service, et cetera.
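To make the one-hop idea concrete, here is a minimal sketch of client-side routing. The names and the toy map are illustrative only, not the real client API; the partition_for_key() function it relies on is sketched with the hashing discussion a little further down.

```python
# Minimal sketch of one-hop client routing (illustrative, not the actual client API).
# partition_map: for each partition ID, an ordered list of node addresses (master first).

def route_request(partition_map, partition_id, replica=0):
    """Return the node address to contact for this partition in a single hop."""
    return partition_map[partition_id][replica]

# Toy 4-partition map over two nodes (a real cluster uses 4096 partitions).
toy_map = {0: ["node-a", "node-b"], 1: ["node-b", "node-a"],
           2: ["node-a", "node-b"], 3: ["node-b", "node-a"]}
print(route_request(toy_map, 2))   # -> "node-a": the client talks to the master directly
```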
Finally, in the presence of node arrivals and departures, there is sometimes a need to move large portions of data in order to rebalance the data within the cluster. These are typically long-running tasks, and they have to run at the right level of priority without impacting the SLAs committed to the regular transactions, which are typically very short, sub-millisecond reads and writes at very high throughput and requiring low latency. So the interplay of long-running tasks and transactions is also handled in this system, by prioritizing these different tasks appropriately.
Aerospike actually uses a divide-and-conquer algorithm. In the data partitioning scheme used by Aerospike, every key is hashed using the RIPEMD-160 function, which is a cryptographic hash and is extremely random. Out of the digest created by RIPEMD-160, which is 160 bits or 20 bytes, 12 bits (randomized bits, as I explained) are used to determine the partition ID for that key. Twelve bits give 4096, or 4K, partitions, and each of these partitions can then be mapped to nodes in the cluster.
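As a rough illustration of this hashing scheme (not the exact server code), the sketch below derives a partition ID from a 160-bit digest. Which 12 of the 160 bits are used is an implementation detail; here the low 12 bits are taken, and SHA-1 is used as a stand-in only if the local Python build does not expose RIPEMD-160.

```python
import hashlib

N_PARTITIONS = 4096  # 2**12

def partition_for_key(key: bytes) -> int:
    """Hash the key to a 160-bit digest and use 12 of its bits as the partition ID.

    RIPEMD-160 is what the talk describes; some OpenSSL builds no longer expose it,
    so this sketch falls back to SHA-1 (also 160 bits) purely for illustration.
    Which 12 bits are taken is an implementation detail; here we use the low 12.
    """
    try:
        digest = hashlib.new("ripemd160", key).digest()
    except ValueError:
        digest = hashlib.sha1(key).digest()  # stand-in only, not what the server uses
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)

print(partition_for_key(b"customer:42"))  # some value in 0..4095
```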
On the right side of this picture, you see a table at the top which maps the 4096 partitions to nodes in a five-node cluster. If the cluster keeps two copies, the first two columns of this partition map are relevant; if it keeps three copies, the first three columns, and so on.
In this particular case, node five goes out of the cluster, either because the node went down for a rolling upgrade or because a network glitch isolated it from the rest of the nodes in the cluster. The partition map is suitably adjusted to remove node five, which you see in the second row on the right side of the picture, and node five is added back into the partition map when the node comes back.
Now, this kind of mechanism is essentially computable given the list of partitions, which is always 4096, and the IDs of the nodes. The partition map can be assigned either to minimize the data migrations that happen when nodes arrive and depart, which is essentially the original algorithm that Aerospike used, or to balance partitions uniformly across nodes, which I'll describe next.
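Here is one way such a computable map could be sketched, using a rendezvous-style hash. This illustrates the property that the map is derivable from node IDs and partition IDs alone; it is not Aerospike's actual assignment algorithm.

```python
import hashlib

def succession_list(partition_id: int, node_ids: list[str]) -> list[str]:
    """Order nodes for one partition using a rendezvous-style hash.

    Illustrative stand-in, not Aerospike's exact algorithm: the point is that the
    ordering depends only on the node IDs and the partition ID, so every node (and
    every client) can compute the same map independently, with no coordination.
    """
    def score(node_id: str) -> int:
        h = hashlib.sha1(f"{node_id}:{partition_id}".encode()).digest()
        return int.from_bytes(h, "big")
    return sorted(node_ids, key=score, reverse=True)

nodes = ["n1", "n2", "n3", "n4", "n5"]
# First entry is the master, second the replica (replication factor 2), and so on.
print(succession_list(0, nodes)[:2])
# If "n5" leaves the cluster, only partitions that n5 held get reshuffled.
print(succession_list(0, [n for n in nodes if n != "n5"])[:2])
```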
What we discovered was that in large clusters running up to 200 nodes, where you end up with just about 20 partitions per node on average, there could be a skew of four or five partitions between two nodes. So one node could have 18 partitions and another 22 or 23. What this created was an imbalance in the transactions directed to each node, which are proportional to the amount of data on the node because the whole thing is fully randomized. This meant the node with more partitions hit its limits earlier, and once one node hits its limit, all the nodes effectively perform at the lower transaction level that the first node hit.
Now, in order to adjust this, at Aerospike we implemented a uniform partition balance algorithm. This algorithm assigns the partitions randomly up to a point, and then, as it continues assigning partitions to nodes, it starts to balance them out.
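A hedged sketch of the balancing idea follows. The cap and the swap logic here are simplifications, purely to show how evenness is bought at the cost of a few extra moves; the real uniform-balance algorithm is more involved.

```python
from collections import Counter

def balance_masters(assignment: dict[int, list[str]], node_ids: list[str]) -> dict[int, list[str]]:
    """Even out master-partition counts across nodes (illustrative only).

    Takes a partition-ID -> succession-list map (as sketched above) and moves master
    ownership off overloaded nodes until no node holds more than ceil(P / N) masters.
    The real algorithm also balances replica columns and interacts with rack awareness.
    """
    cap = -(-len(assignment) // len(node_ids))        # ceil(P / N)
    counts = Counter(repl[0] for repl in assignment.values())
    for repl in assignment.values():
        if counts[repl[0]] <= cap:
            continue
        target = min(node_ids, key=lambda n: counts[n])
        counts[repl[0]] -= 1
        counts[target] += 1
        if target in repl:                            # target already holds a replica: swap roles
            i = repl.index(target)
            repl[0], repl[i] = repl[i], repl[0]
        else:                                         # otherwise hand the master copy over
            repl[0] = target
    return assignment
```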
The trade-off, therefore, is that you make data migration more expensive. The goal of minimizing data migrations is relaxed in order to generate uniform data partitions. And as you can see in this picture, more readjustment is required when we use rack awareness, which we will describe in a few minutes.
The most important thing to notice here is that there are trade-offs, and Aerospike provides choices so that each deployment can take advantage of whatever its particular requirements happen to be. More details on all of this, and on the algorithms described in this talk, are available both in internal documents and in papers published at VLDB, the latest being the one published in Vancouver in August of 2023, and you can definitely get access to those.
The system, based on the partitioning algorithms I described earlier, is self-managed. Essentially there are multiple components. One is the heartbeat subsystem, which discovers when nodes arrive at and depart from the cluster. Another is the clustering subsystem, which, without a single dedicated master, is able to elect a master temporarily in order to form a cluster and then maintain the list of nodes in the cluster.
The partition map that I talked about is based on the presence of these nodes in the cluster, and as nodes are added and removed, the partition map changes with each arrival and departure. An exchange subsystem then manages the asynchronous copying, or migration, of partition data from one node to another. You can see an example in the picture on the left, where we are adding a node to a three-node cluster with two copies, or replication factor two.
In steady state, each of those three nodes holds one third of the master partitions and one third of the replica partitions. When you add a new node, one twelfth of the master partitions from each existing node has to be delivered to the new node, and one twelfth of the replica partitions as well. When all of these movements, or migrations, are completed, the data rebalancing is complete: you end up with a four-node cluster with one fourth of the master partitions and one fourth of the replica partitions on each node, which means half the data is on each node. It's a two-copy system with four nodes, so that makes sense.
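The arithmetic in that example is easy to check; for instance, in Python:

```python
from fractions import Fraction

def masters_moved_per_node(old_nodes: int, new_nodes: int) -> Fraction:
    """Fraction of all master partitions each existing node hands to the new node."""
    return Fraction(1, old_nodes) - Fraction(1, new_nodes)

# Growing a 3-node cluster to 4: each existing node ships 1/3 - 1/4 = 1/12 of the
# master partitions (and, with replication factor 2, another 1/12 of the replicas).
print(masters_moved_per_node(3, 4))       # 1/12
print(3 * masters_moved_per_node(3, 4))   # 1/4: the new node ends up with its full share
```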
Now, having described scaling out, we also focus a lot in Aerospike on scaling up. What that means is the ability of the system to take advantage of every bit of processing, storage, and networking capacity available within a node. In order to do that, we leverage SSDs very heavily.
One of the most important things about Aerospike's algorithms is our ability to treat the data on SSD as real-time data. What we mean by real time is that we are able to deliver sub-millisecond access times for both reads and writes on data stored in SSDs. To make this happen, we have implemented specialized algorithms that leverage random-access reads from SSDs. We split the data and the index such that the index is kept fully resident in memory. Because the index is fully resident and points to the actual data item on the SSD, we are able to read it within a millisecond.
Now, writes cannot be done in place on SSDs because of wear leveling issues: you could burn out portions of the SSD if you overwrite those cells too often. So we implemented a log-structured file system with large-block writes and copy-on-write semantics, so that SSD writes are gathered in memory in large buffers and then flushed to disk.
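The following toy sketch (hypothetical class and field names, with a plain file standing in for the SSD) shows the two ideas together: an in-memory index that points directly at record offsets, and writes that are buffered and flushed as large appends rather than overwritten in place.

```python
import os
import tempfile

class HybridStore:
    """Toy sketch of the hybrid-memory layout: index in DRAM, data on 'flash'.

    The index maps a record digest to (offset, length) in the data file, so a read
    is a single positioned read from the device. Writes are never done in place:
    records accumulate in a RAM buffer and are flushed as one large append. (The
    real system adds background defragmentation to reclaim space; omitted here.)
    """
    FLUSH_AT = 128 * 1024                    # flush threshold for the write buffer

    def __init__(self, path: str):
        self.f = open(path, "a+b")
        self.index = {}                      # digest -> (offset, length), kept in memory
        self.buf = bytearray()               # large-block write buffer
        self.buf_base = self.f.seek(0, os.SEEK_END)

    def put(self, digest: bytes, value: bytes) -> None:
        self.index[digest] = (self.buf_base + len(self.buf), len(value))
        self.buf += value
        if len(self.buf) >= self.FLUSH_AT:
            self.flush()

    def flush(self) -> None:
        self.f.write(bytes(self.buf))        # one large append, never an in-place overwrite
        self.f.flush()
        self.buf_base += len(self.buf)
        self.buf.clear()

    def get(self, digest: bytes) -> bytes:
        off, length = self.index[digest]
        if off >= self.buf_base:             # record is still in the unflushed buffer
            start = off - self.buf_base
            return bytes(self.buf[start:start + length])
        self.f.seek(off)                     # the index points straight at the data on disk
        return self.f.read(length)

store = HybridStore(os.path.join(tempfile.gettempdir(), "hybrid_demo.dat"))
store.put(b"digest-of-key-1", b"hello")
store.flush()
print(store.get(b"digest-of-key-1"))         # b'hello'
```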
What this also allows us to do is deliver very heavy read loads in the presence of very heavy ingestion, or write, loads. Because of the way we do the writing, access to the SSDs is highly parallel, and you can put dozens of SSDs in some of the larger nodes. With multithreading, and with a continuous defragmentation process running in the background so we don't run out of space on the drives, we can sustain a very high throughput of writes, for example.
Other databases are not able to deliver this uniform, predictable, sub-millisecond performance for reads and writes across the entire database because they typically use a buffer-pool-based design, where data has to be brought from storage into a page cache before it can be read. That means some initial reads will always miss the page cache. That never happens in Aerospike, because the index is fully resident and data access goes directly to the SSDs.
The end result of all of this is that we have increased the amount of real-time data you can store in a node. It is now proportional to the amount of SSD storage available on the node, not just the DRAM, which means the number of nodes in a large Aerospike cluster is typically an order of magnitude smaller than in comparable systems that have to store all of the real-time data in DRAM.
Just storing the data on SSDs increases the amount of data available for real-time access per node by 10X or even more. To make this actually work in practice, we also have to be able to run 10 times the number of transactions per second on each node. SSDs can be accessed in parallel, as I explained already, and this means those transactions also need to flow in parallel through memory.
And Aerospike uses multithreading. It's written in C and it's highly efficient. We use techniques like CPU pinning, NUMA pinning, and binding threads to specific network queues, using parallel network queues and making sure a single CPU doesn't bottleneck on dispatching network requests. There's a whole set of techniques we use to drive very high throughput, in transactions per second, through a single node.
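As a small, Linux-only illustration of the thread-pinning idea (this is not the server's C code, and it omits NUMA and NIC-queue binding), one might do something like:

```python
import os
import threading

def pin_current_thread(cpu: int) -> None:
    """Pin the calling thread to a single core (Linux only; illustrative sketch).

    With pid 0, sched_setaffinity applies to the calling thread. The real server
    goes much further: NUMA-aware allocation and binding NIC queues to the same
    cores so a single CPU never bottlenecks on dispatching network requests.
    """
    os.sched_setaffinity(0, {cpu})

def transaction_worker(cpu: int) -> None:
    pin_current_thread(cpu)
    # ... in a real service loop this thread would now drain the network queue
    #     that the NIC steers to this same core ...
    print(f"worker bound to CPU {cpu}: {os.sched_getaffinity(0)}")

if __name__ == "__main__":
    cores = range(min(4, os.cpu_count() or 1))
    threads = [threading.Thread(target=transaction_worker, args=(c,)) for c in cores]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```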
You can see on the right some basic numbers for a simple NUMA, CPU-pinned system. This was done in collaboration with Intel a while ago; about 8.6 million transactions per second can be pushed through the system. However, Intel has also developed technology on the NIC called Application Device Queues (ADQ), which enables the application's network queues to be linked directly to CPU threads. That means we can deliver many more transactions per second when using those kinds of NICs, where isolated queues are aligned with threads and CPUs.
Now, in practice it's very unlikely that we would need to run this many transactions per second on a single node, but the headroom exists to compress all of the real-time data into a smaller set of nodes. That's essentially how Aerospike is able to run very high throughput, low latency workloads at very high scale, on clusters as small as 10 nodes, or even fewer, all the way up to 200 and more.
All of the performance I talked about, in terms of both scale out and scale up, Aerospike has used to develop transaction algorithms that provide strong consistency. This high performance lets us take certain routes in building these algorithms that are probably not available to other systems, which may be much slower at leveraging the underlying infrastructure.
Here is an example of how you implement a roster-based strong consistency scheme. Aerospike does not use a quorum in the usual sense: our quorum is essentially write all copies and read one copy, because we can get away with it thanks to the performance of writes and reads, and thanks to the networking optimizations I talked about, as well as the CPU and storage optimizations. The algorithms we've designed enable the system to remain available in certain common failure situations.
Now let me take you through a couple of interesting rules. In this particular case, I have mapped a single partition onto a roster. A roster is the set of nodes that form the cluster; here it's a five-node cluster, nodes one to five, and the particular partition P has its master copy on node five and its replica on node four. When the system splits, in this case, the cluster splits into one sub-cluster of nodes one, two, and three and another of nodes four and five. This is a two-copy system, and this partition P has both the master and the replica in the smaller sub-cluster, with no data for this partition on any of the other three nodes. So, based on our rules, this partition stays alive for both reads and writes, with strong consistency, in the smaller sub-cluster of nodes four and five.
Now, suppose the split happens in a different way: the node with the master copy, node five, splits away. Either it's down for a rolling upgrade, or it got disconnected by a network glitch and is running by itself. Either way it's isolated, and the other four nodes, 1, 2, 3, and 4, remain together as a cluster. Remember that the entire roster is known to every node, so they know that one node is missing; all of that can be computed.
Now, in this particular case, the rule is that the four-node side is the majority cluster, leaving node five in the minority. Even though node five has the master copy of the partition, its data becomes unavailable. In the other cluster, the one with nodes 1, 2, 3, and 4, the replica gets promoted to master and an additional replica of the partition is created, because every write has to be written to two copies to make sure it is consistent and durable. That's essentially what happens.
Now, if node five rejoins this cluster, say we are doing a rolling upgrade: node five went down, got upgraded, came back and rejoined, and node four is going to be taken down next. Between node five, which has the old copy from before the cluster split, and node three, which has all the new data, you have all of the latest data required to answer any read or write request. However, a read might end up being slower in the situation where node four is unavailable but node five and node three are available, because you might have to check with both node five and node three to get the latest value for a particular read: node three could have the latest copy, and node five might have an older copy, because it was out for a little while.
Now, all of this is actually fairly complex, and we have documents and papers we can share with you, both publicly published papers and internal Aerospike white papers. I encourage you to read some of these in order to learn more about it.
There are also cases, shown in the last row here, where the split brain, the way in which the cluster splits due to network glitches, ends up with no availability for partition P. In this case, the cluster with nodes one, two, and three gets split from another sub-cluster which is just node four, and a third sub-cluster which is just node five. Here, partition P is not going to be available for reads or writes anywhere; this is essentially a consequence of the CAP theorem.
There are cases where availability will be sacrificed in a strongly consistent system; that's how the CAP theorem works, because network partitions are inevitable. So fundamentally, with all these rules, Aerospike can maintain 100% availability of a partition somewhere in a two-way split brain. It may not be available in both of the split-brain clusters, but it will be in at least one of the two. And we maintain 100% availability of the system during a rolling upgrade, even with replication factor two.
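The availability rules just described can be restated, in simplified form, as a small predicate. This is a paraphrase of the examples in the talk, not the complete Aerospike rule set:

```python
def partition_available(roster: set[str], sub_cluster: set[str], replicas: list[str]) -> bool:
    """Decide whether one partition can serve strongly consistent reads and writes
    in a given sub-cluster after a split (simplified illustration).

    Rules as paraphrased from the talk:
      - if the sub-cluster holds the partition's entire replica set, it stays
        available there, even if that sub-cluster is the minority;
      - otherwise the sub-cluster must be a strict majority of the roster and hold
        at least one replica, which it promotes to master and re-replicates.
    """
    holds_all = set(replicas) <= sub_cluster
    majority = len(sub_cluster) > len(roster) / 2
    holds_one = any(r in sub_cluster for r in replicas)
    return holds_all or (majority and holds_one)

roster = {"n1", "n2", "n3", "n4", "n5"}
replicas = ["n5", "n4"]                                                  # master on n5, replica on n4
print(partition_available(roster, {"n4", "n5"}, replicas))               # True: full replica set present
print(partition_available(roster, {"n1", "n2", "n3", "n4"}, replicas))   # True: majority plus one replica
print(partition_available(roster, {"n5"}, replicas))                     # False: isolated master
print(partition_available(roster, {"n1", "n2", "n3"}, replicas))         # False: the three-way split case
```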
Here's a simple example of how we do writes. Writes, as I mentioned, use a write-all mechanism. In this particular case it's a three-copy system. The write goes from the client to the master, and the master coordinates the transaction across the other two copies, in parallel of course. Once all of the copies have responded to the master, the master commits the write transaction and then communicates the commit state to all the replicas.
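A bare-bones sketch of that commit flow, with stand-in objects in place of real network calls, might look like this:

```python
def replicated_write(master, replicas, key, value):
    """Sketch of the 'write all, read one' commit flow described above.

    master and replicas are stand-in objects with apply() and commit() methods; in
    the real system these are network round trips done in parallel, and a failure
    kicks off the re-replication rules discussed earlier.
    """
    master.apply(key, value)                        # master writes its own copy
    acks = [r.apply(key, value) for r in replicas]  # fan out to every replica copy
    if not all(acks):
        raise RuntimeError("a replica did not acknowledge; the write is not committed")
    master.commit(key)                              # only now is the write committed...
    for r in replicas:
        r.commit(key)                               # ...and the commit state propagated

class Node:
    def __init__(self):
        self.pending, self.data = {}, {}
    def apply(self, key, value):
        self.pending[key] = value
        return True
    def commit(self, key):
        self.data[key] = self.pending.pop(key)

m, r1, r2 = Node(), Node(), Node()
replicated_write(m, [r1, r2], "acct:7", {"balance": 100})
print(m.data, r1.data, r2.data)                     # all three copies hold the committed value
```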
Now, the important thing here is that we can support linearizability. We can also support a slightly weaker form of strong consistency, which is session, or sequential, consistency; that applies to reads, typically. The writes are always consistent; there's no data loss, and so on.
Aerospike also supports rack awareness. Rack awareness has a very good advantage: in this particular case, if you have two copies and two racks, all the data is available in each rack, so reads can be served with low latency from either rack. If you're using sequential consistency, you can just access the local copy. Writes, on the other hand, essentially have to go through both copies.
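A tiny illustration of the rack-local read preference (hypothetical names, assuming one copy per rack as in the example above):

```python
def choose_read_node(replicas_by_rack: dict[str, str], local_rack: str) -> str:
    """Prefer the copy in the caller's own rack for reads (illustrative sketch).

    Assumes one copy per rack, as in the two-copy, two-rack example above. This is
    only appropriate when the relaxed (session or sequential) read mode is acceptable;
    writes still have to go to every copy regardless of rack.
    """
    return replicas_by_rack.get(local_rack) or next(iter(replicas_by_rack.values()))

copies = {"rack-1": "n2", "rack-2": "n7"}
print(choose_read_node(copies, "rack-1"))   # n2: low-latency read from the local rack
print(choose_read_node(copies, "rack-3"))   # falls back to any rack that has a copy
```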
Now, in certain situations, you can allow one of the racks to continue on its own. In this particular case, we can configure the majority rack in a two-copy system to continue with both availability and consistency when the other rack is disconnected or down. We will use that in the following example of a distributed setup.
Here is a system typically used in certain kinds of applications, like interbank money transfers, where it is okay to trade off what I would call higher write latency. In this case, the system is a three-copy system with three racks distributed across the continental United States. Because it's a three-copy system with three racks, every site has a full copy, so local apps can read locally and always get reads of less than one millisecond in steady state. A write, as I described earlier, has to go to all three copies.
It's going to take somewhere between 100 and 200 milliseconds, depending on networking delays, speed-of-light considerations, and so on. The end result is a completely active system. If an entire rack goes away, either in a split-brain mode or down for maintenance, the other two racks can continue to run with both availability and strong consistency.
There are no conflicts in the system; conflicts are completely avoided given the write-all, read-one strategy. When a rack goes out, the system continues with no operator intervention, and when the rack comes back and rejoins, again no operator intervention is needed. All of the data movement, the migrations I described and how they work, is taken care of automatically, so it delivers very high uptime.
Of course, you do pay for it by keeping three copies across three racks, but the uptime is really worth it in this particular situation, and the trade-off is a higher write latency, which is acceptable to these applications. There are other setups, which I am not going to describe in this presentation, that are asynchronous: you do not have to trade off write latency, but you trade off somewhat on consistency, using asynchronous replication across multiple clusters. This one is basically synchronous active-active replication within a single cluster split into racks.
Finally, to give you an idea of the kind of performance and results possible in the system described so far, here is a petabyte-scale benchmark. It's a petabyte database that compresses to about one fourth, so it's 250 terabytes of data for a single copy; with a second copy, given replication factor two, that's 500 terabytes, while the uncompressed size is one petabyte for a single copy. It has about half a trillion unique keys. All of this can run on 20 i3en nodes; it would probably be slightly fewer if you used i4i instances these days. Fundamentally, we can run millions of reads per second with latency of less than one millisecond.
Now, if you relax the latency requirement, you can probably go higher in TPS, but we report TPS measurements only while the latency stays under one millisecond; that's how we measure this. As you can see, even in the 80/20 read/write case, we can do about 3 million reads per second with 100% of those reads completing in less than one millisecond, and about three quarters of a million writes per second with 99% under one millisecond. These are pretty impressive results, which I think many of the use cases I described earlier leverage to their advantage.
Finally, to find more information, you can go to our developer site, where you can also find some of the early chapters of the book on Aerospike, as well as join our Discord community and find examples on GitHub, and so on. Thank you very much for your time. I am happy to answer any questions.
Matt Bushell:
Okay, great, Srini. Yeah, I'll coordinate some of the questions coming in. First one: Srini, you mentioned that since Aerospike has 10X fewer nodes, each node tends to do 10X the transactions. How does a node not get overloaded or bottlenecked, and/or how can Aerospike process even more throughput on any one given node? There are a couple of questions there.
Srini Srinivasan:
Yeah. So the interesting thing here is that modern processors are multicore, and there are techniques we use: CPU pinning of threads, NUMA pinning across these processors, and the ability to align network requests to individual processors. And if you use something like Intel's ADQ on the NIC, we can get further streamlining, if you will, of the requests coming into a single node.
A simpler way to describe all of this is to say that we really leverage all the parallelism possible within a node, be it the processors, the networking queues, or the number of SSDs, so we can remove the bottlenecks within the system. Many times we talk about Aerospike running at network speed, because once you get into Aerospike, all of our engineering effort has gone into avoiding bottlenecks within the system. That's how we are able to achieve this level of transactional throughput on a single node.
Matt Bushell:
Interesting. Yeah, so if SSDs are one of the elements, I imagine you could just add more SSDs to any given server node, or set of server nodes, and then increase the throughput, it sounds like.
Srini Srinivasan:
We can increase it so long as we are able to keep the parallelism going on the processor side and the networking side. There are obviously always limits, but I don't think we have hit them, and even when we do hit those limits, there are new advances in SSDs. Aerospike has been running on SSDs for a long time now; we had PCIe, we had NVMe, we got a lot out of those, and we even had something like Intel's persistent memory, which came and went, and there is more on the way with CXL. We intend to make use of all of these technologies as they come along, so that we stay ahead of the curve, so to speak, and leverage them to drive more transactional throughput on larger amounts of data with lower latency.
Matt Bushell:
Or I guess if you're not at DRAM or CPU limits, you could most likely add more SSDs, it sounds like. Okay.
Srini Srinivasan:
Absolutely. In fact, we've had systems with dozens of them, like 50, 60, sometimes 100, though very rarely. Most people have anywhere between a dozen and 20 SSDs, but you can do a lot more, and we've seen that even 10 years ago.
Matt Bushell:
Okay, very good. Next question, maybe basic, but are partitions in Aerospike parlance the same as shards? And what if the records are different sizes? Couldn't there still be hotspots introduced in the Aerospike system?
Srini Srinivasan:
That's actually a very good question. Let me take the first part. You can consider them as shards; we say partition because we have partitioned the key space. Our partitioning, based on the hashing I described, goes to exactly 4096 partitions, and we don't change it.
So the whole philosophy is that Aerospike, using a properly randomized cryptographic hash algorithm, is able to distribute data in the best way possible; it is not really possible for somebody else to do a better job, because it is the math of the algorithm and the way we distribute it, along with adjustments like the uniform partition balance algorithm, that make it happen. So fundamentally, we are able to distribute the data uniformly. Sharding, the way it's typically used, involves understanding the data characteristics and trying to shard differently based on the data. We have tried to stay away from that; the system automatically takes care of it, so applications don't have to worry about clustering and moving data around and so on.
Matt Bushell:
Okay. Next question. You talked about active-active synchronous replication. Does Aerospike not have asynchronous replication, and what are the high-level pros and cons of each?
Srini Srinivasan:
Yeah, today I've been focused on how we deliver a strongly consistent system using Aerospike's high-performance algorithms. The reason I did that is that NoSQL typically has a reputation for giving up consistency in order to deliver performance. Aerospike delivers very high performance and also delivers strong consistency. Additionally, Aerospike has the ability to relax this using an asynchronous replication strategy: we have a technology called XDR, which stands for cross-datacenter replication. It's really about cross-site replication, because data centers, in the presence of cloud regions and so on, don't really make as much sense anymore; they're just one of many kinds of sites.
So we can have asynchronous replication between Aerospike clusters, and that has a different trade-off. Essentially, you trade off consistency in order to get better write performance, for example, and the consistency itself varies. We can set up a system that, in some cases, delivers what is called eventual consistency, or convergence, and in a subset of cases we can also deliver strong eventual consistency. So essentially Aerospike gives you the ability to trade off between strongly consistent active-active systems and, I would say, asynchronous active-active systems that support convergence.
Matt Bushell:
Got it. Got it. Okay, I guess we have time for one more question. You talked about Aerospike's hybrid memory architecture on a slide. Are you telling me that Aerospike cannot do in-memory for its system? Why should I care one way or another?
Srini Srinivasan:
Can you repeat the question please?
Matt Bushell:
Yeah, I think the question has to do with the comparison and contrast between in-memory and hybrid memory, and whether hybrid memory is the only mode Aerospike supports.
Srini Srinivasan:
That's a great question. The reason I brought up hybrid memory was in the scale-up context. Aerospike also has a fully functional in-memory configuration where we put all data and the index in DRAM, and we are able to deliver high performance on that. The only thing is that it will not scale up beyond the amount of DRAM you have in the node.
As we go forward, we will support compression and so on for in-memory systems, which will also allow larger in-memory deployments. But hybrid memory deployments deliver a reduction in server footprint, and some of that is important: putting more data in flash as opposed to DRAM means you use less power and less space, and you have a more sustainable data center for handling huge throughput, so to speak.
One of our customers, for example, moved from a different system, a purely in-memory system that was not Aerospike, to Aerospike's hybrid memory architecture, and out of six data centers, they literally shut down two of them. So there's a huge sustainability argument for running real-time systems on the hybrid memory architecture, and for smaller-scale systems that require high throughput, you can also run an in-memory configuration of Aerospike. In fact, many telcos do that.
Matt Bushell:
Very good. Well, I don't think we can end on a stronger note than saving a customer two out of six data centers. So with that, just to remind folks, we will be making the slides available, and of course there will be a replay of the session. Aerospike invites you to visit us at any of these QR codes. Srini, any final comments?
Srini Srinivasan:
Well, I hope you have learned something about how Aerospike has leveraged some of these technologies to build strongly consistent distributed database systems. I hope this has gotten you curious enough to take a look at us, and we'd be very, very happy to get your feedback. If you have any questions, you can reach out to us through our developer site or aerospike.com. Really appreciate the time you've spent today. Thank you.
Matt Bushell:
Thanks everyone.
About this webinar
Join Aerospike CTO Srini Srinivasan for a deep dive into the four design principles that are essential to a real-time DBMS for mission-critical systems. This webinar will explore:
Optimizations for modern system architectures
Strongly consistent transactions
Massive parallelism with indexing
Geo-distribution for Active-Active systems
Aerospike is a real-time DBMS that has been used in some of the world’s largest mission-critical systems that require the highest levels of performance and availability. In this webinar, you will learn about the techniques that Aerospike uses to achieve these levels of performance and reliability.