IT-Tage and Aerospike webinar: Faster, cheaper, and greener big data
Behrad Babaee:
Thank you very much. Good morning everybody. As I said, I'm very sorry that I can't speak German, but if it's any consolation, English is also my second language, so we are just using a bridge here to communicate. My name is Behrad Babaee, and I'm a Technology Evangelist at Aerospike. Aerospike is a database company; I'm going to talk a little bit about Aerospike during the talk, so you're going to learn more about it as well. The title of this talk is Faster, Cheaper, and Greener Big Data. So without further ado, let's start the conversation.
I'm pretty sure you've seen a Venn diagram like this before: there are things in the world that are fast, there are things that are good, and there are things that are cheap. We all know that the intersection of fast and good is usually very expensive, the intersection of good and cheap is, well, very slow, and the intersection of cheap and fast is, how should I put it, chocolate ice cream. And then we have this triangle in the middle that is the intersection of fast, good, and cheap, and a lot of people believe that, like a unicorn, it does not exist. But if you look at things pragmatically, you could say that fast, cheap, and good unicorns are kind of similar to a rhinoceros, and a rhinoceros is, by definition, a unicorn. So I don't necessarily agree with this model. I think as an individual, as a company, as a consultant, whatever you do, you should always focus on the triangle in the middle and find the things that are inside it, because it's going to benefit your business, it's going to benefit your career, and it's going to benefit your customers.
But the difficulty with something like this is that the definitions of fast, good, and cheap are very elusive; they are constantly changing over time. So whenever you think about these things, you have to be a little careful about what you are describing as fast, good, and cheap. For example, if you look at 1990, the definition of fast was seconds: if you had an application that could do things in the order of seconds, it was considered very, very fast. And because we didn't have many pieces of software back then, something that was simply capable of doing the job was considered really good. It wasn't like now, where we have an abundance of choices that can all do something similar to what we need. So something that was capable, and could do things in seconds, was considered to be fast and good.
The definition of cheap is a relative concept, and I'll define it as we go through this presentation. But to give you an example of something that was fast, good, and cheap in the 1990s: Oracle. I'm not talking only about Oracle itself, I'm talking about relational databases, but Oracle was basically the largest of them at the time. So Oracle was in that triangle. It was fast: it could handle very, very complex queries in the order of seconds. It was capable: the relational model is a very capable model and gave us a lot of benefits. And it was cheap, because if you look at the type of hardware we had in the 1990s, the amount of disk we had was very limited, and access to disk was very slow. What Oracle, and relational databases in general, were trying to do was minimize the number of times you need to access disk, putting data that is related together in the same place on disk so you can read it sequentially. They were optimizing for the most expensive and most limited resource available to your application, and that is why Oracle was considered cheap in the '90s.
But as time went by, and consider that 15 years in the computing industry is a very, very long time, Oracle slowly started to move out of that triangle. Firstly, the hardware it was optimized for, disk, was constantly getting faster, cheaper, and larger. So as disk got cheaper and faster, Oracle started to move away from being cheap, and away from being fast as well, because something that worked in the order of seconds was no longer considered fast enough; we had significantly larger use cases that required lower latency. But because it is still very capable and can do a lot of interesting things, we are still using Oracle. It's just not in that triangle anymore.
If we move forward and look at 2010, the definitions are again slightly different. Fast is now hundreds of milliseconds, possibly the lower end of that range, but the definition of good is now scale-out. A scale-out system is a system that you can run on several instances: instead of running it on one massive machine, you have 20 machines and you run the tasks you want on that scale-out hardware. That also helps reduce the cost and make things cheap: instead of buying one machine that is disproportionately expensive, you have several machines that are relatively cheap.
At that time, around 2010, HBase was right in the middle of that triangle. It was fast, working in hundreds of milliseconds; it was a scale-out system; and it was cheap, because you didn't need to buy a very expensive piece of hardware. But again, if you move forward around five years, the definition of fast changed to tens of milliseconds, and HBase was not able to do tens of milliseconds on large operations. And the definition of good changed from scale-out to scalable. The difference is that a scale-out system is one you build with 20 machines; a scalable system is one where, if 20 machines are not enough, you can make it 21. A disclaimer here: HBase is actually scalable, you can add nodes, but if you've ever done it, you'll know it is actually very, very difficult to scale, and that's why I'm saying it is not.
But it is still very cheap: you can run it on spinning disks and it works amazingly well. It just doesn't have the features we now require to call something fast and good. Another example: 2012. The definition of fast is tens of milliseconds, and the definition of good is modern. What does modern mean? If you look at the history of databases, we had been using the relational model for almost 50 years, and the relational model is amazing, but it has limitations. Sometimes you just want to dump a document somewhere, an aggregation, and just fetch that document and show it; you don't want to make 50 joins to get the data you want. So for some modern use cases, we needed another model. Around that time, lots of new models other than relational were coming out, and MongoDB was right in the middle of that triangle. It was very fast, and it was modern. And when I say modern, don't forget that it still includes capable and scalable: of course, scalable already implies scale-out, but you still must be capable, you must be scalable, and now you should be modern as well. Each new definition is added on top of what we were used to.
And it was cheap as well, because compared to the other technologies in the market at the time, it was considered cheap. But again, moving forward eight years, the definition of fast became milliseconds. If you have large queries and a large dataset on MongoDB, it's going to be hard to reach millisecond latency, and it wasn't considered cheap anymore, mostly because of competition from the rest of the NoSQL market. But it's still a modern database, an amazing product, very capable, and it can do lots of interesting things. Last one, I promise: 2015. The definition of fast was milliseconds, the definition of good was scalable, and, well, cheap is relative, right? Cassandra around that time was in that triangle. It was very fast: it could do things in milliseconds, even very complex, very massive ones on loads of data. It was scalable: you could relatively easily scale it. And it was cheap, because it ran on commodity hardware. But moving forward: firstly, it wasn't very efficient, it wasn't using that commodity hardware efficiently, so you required many, many nodes, so it wasn't considered that cheap anymore; and it wasn't really easy to operate. So it started to move out of that triangle. It is still a very fast database.
Now I want to give you a little background about Aerospike. The first version of Aerospike was released in 2010, and it was released for a very niche part of the market: an industry called Ad Tech. You may not know these companies, but I'm pretty sure you are dealing with them every day. This is the workflow an Ad tech company has to go through. Whenever you visit a webpage, that webpage announces to these Ad tech companies that a person with this ID, which could be something stored in a cookie or your IP address, is visiting this page with this content. Now these Ad tech companies look up this ID in their databases to see if they know the user profile, what kind of things this person is interested in. If they can find this person, they then check whether they are running any campaigns relevant to this person. If they are, they bid for that place on the webpage. If they win that place, they then serve the actual ad to the user.
This is a very, very complex process, but it takes a very, very short time. I'm pretty sure you've had the experience of visiting a website and, before you could see the content you came for, seeing that the ad had already loaded. The reason that happens is that the company that bids first is likely going to win, so these companies try to be as fast as possible. At the time, in 2010, the only way to be that fast was to put everything in RAM. Putting everything in RAM was very, very expensive: RAM was not only expensive, it was very limited per machine. In 2010, you couldn't physically have more than 128 gigabytes of RAM in a machine. We didn't have the massive modules we have today, and motherboards didn't have as many RAM slots as today's do. So you were physically limited to a relatively small amount of RAM, and if you wanted to serve advertising for the internet at that scale, it was very, very expensive.
So the founders of Aerospike thought: okay, we can improve this. Instead of storing the data in RAM, we can store it on SSDs and provide similar latency as if the data were stored in RAM. And they created this technology that was ridiculously fast. In that era, the definition of fast was tens of milliseconds; Aerospike was working at less than a millisecond of latency. But I put it under fast and not under cheap, because although it was cheaper than putting everything in RAM, SSDs were still ridiculously expensive back in 2010. Cheaper than RAM, but still very, very expensive. So unless, like the Ad tech industry, you needed less-than-a-millisecond latency, it wasn't really that cheap. And I didn't say it was good because, although it was modern, with the features modern databases were supposed to have, it wasn't really capable. It was just a key-value store: you could put a key and a value in it and retrieve it very, very fast.
But moving forward seven years, to 2017, and seven years is a long time, they had added a lot of features to the stack: secondary indices, complex data types, backup and recovery, a lot of things you expect a modern database to have. So when these things were added, it was considered fast and good, but it still wasn't cheap, because in 2017 SSDs were cheaper than in 2010, but when you bought a laptop it still came with a spinning disk; if you wanted an SSD, you had to pay extra. But moving forward again, to 2023: now the definition of fast is milliseconds. Aerospike has been working at less than a millisecond of latency for the past 13 years, so it is still considered fast. The definition of good has now changed to something like efficient, meaning you don't want to use a lot of hardware to do the thing you want to do, and Aerospike is capable of that. And now it is considered cheap as well. In fact, in 2022, for the first time, one terabyte of SSD became cheaper than one terabyte of spinning disk had been back in 2010.
So solid-state drives became cheaper, and cheaper, and cheaper, and now they're not that much more expensive than spinning disks. Back in the day, they were around 20 times more expensive; now they are around 20% more expensive. The benefits you get from SSDs are so great that most laptop makers don't even offer an option with a spinning disk. We have SSDs almost everywhere: your phones, everything is running on solid-state drives. So unlike the other technologies, which started inside that triangle and then moved out, Aerospike started from somewhere outside of it. As the hardware it was built on became cheaper, and of course with the engineering the Aerospike team did, adding a lot of features, it moved in, and now Aerospike is right in the middle of that triangle.
But if we want to look into a crystal ball and see what's going to happen in the future, and if we skip the bit where we are maybe waiting for the Third World War, you can guess that the definition of good may change to green. Unless you're living under a rock, you've heard everybody talking about becoming greener, reducing CO₂ emissions, and cutting the toxic waste that the companies we work for produce. So if we consider that fast is still milliseconds and cheap is relative, and we say that good is going to be green, so we replace good with green and now have fast, cheap, and green, something very interesting happens. If you squint a little and look at this diagram, you can see that these bubbles start to join with each other, and it's somehow as if fast, cheap, and green actually mean the same thing.
Let me explain what I mean. Imagine you have a CPU. A CPU can do a certain number of operations per second: if your CPU runs at 2.4 gigahertz, that basically means it can do 2.4 billion operations in one second. Say the CPU I have here can do 50 operations per second, just as an example. Now imagine I create a task that takes 10 operations to complete. That means two things. One is that, because my CPU can do 50 operations per second and one task takes 10, this task is going to take one fifth of a second to complete, basically 200 milliseconds. The other is that, if I run this task continuously, I can run five of these tasks per second on the CPU that I have.
Now, if I have a business that wants to do 25 of these tasks per second, one thing I can do is add more resources to my infrastructure, in this case five of these CPUs, until I can do 25 per second. This is what we've been doing since scalability became a point of concern in software engineering, around 2002, 2003: this is how we solved the problem when we wanted to do more and more. But it is not the only way to get to 25 operations per second for your business. Another way, which sounds a little trivial but which we usually don't think about, is to reduce the number of operations the task takes to complete from 10 to two. If we make the task run five times faster, instead of taking 200 milliseconds it's going to take 40 milliseconds to complete, and now, on one CPU, we can do 25 operations per second.
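The arithmetic above can be sketched in a few lines of Python, using the talk's illustrative numbers (a 50-operations-per-second CPU and a task that costs 10 operations):

```python
# Illustrative numbers from the talk: a CPU doing 50 operations per second,
# and a task that costs 10 operations to complete.
cpu_ops_per_sec = 50
ops_per_task = 10

latency_sec = ops_per_task / cpu_ops_per_sec        # 10/50 = 0.2 s = 200 ms
tasks_per_sec = cpu_ops_per_sec / ops_per_task      # 5 tasks per second per CPU

# Option 1: scale out until we reach the 25 tasks/sec the business needs.
target_tasks_per_sec = 25
cpus_needed = target_tasks_per_sec / tasks_per_sec  # 5 CPUs

# Option 2: make the task 5x more efficient (10 ops -> 2 ops).
efficient_ops_per_task = 2
efficient_latency_sec = efficient_ops_per_task / cpu_ops_per_sec    # 0.04 s = 40 ms
efficient_tasks_per_sec = cpu_ops_per_sec / efficient_ops_per_task  # 25 tasks/sec on one CPU

print(cpus_needed, efficient_tasks_per_sec)  # 5.0 25.0
```

Both options reach the same throughput; the second one does it with one fifth of the hardware.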
So when it became faster, we could reduce the amount of resources we need. And resources have a cost, so if we go back to this Venn diagram, we can say that fast means cheap. But let's be clear: by fast I don't mean adding more RAM to make something faster; I mean making the task itself run faster, making it more efficient. If you can make something run faster like that, you can make it cheaper. But you know what else is cheap? Talk is cheap. So let me show you something, so it makes a little more sense what I mean by cheap. This is a test; it's a video I recorded. If you want, I can show it to you live, it doesn't matter, but this video shows about 30 seconds of a test that I'm running on an Aerospike cluster.
The left-most column shows the number of seconds the test has been running; every second it prints one line, and the first column says which second that is. It's at 3,700 seconds, so the test has been running for more than an hour. As you can see, it's doing around 169,000 write operations and around 31,000 update operations per second; updates are writes too, so both of them cause writes. So I'm doing around 200,000 write operations and around 400,000 to 410,000 read operations per second on this cluster. And to make it a little clearer, I also have a Grafana chart. The video was 30 seconds; this chart covers 30 minutes, starting slightly before 06:20 and finishing slightly before 06:50. In that half an hour it was doing an almost flat line of 200,000 write operations per second, and in the same half an hour around 400,000 read operations per second.
Nothing I've shown you so far means fast; it just means the cluster can handle a lot of throughput. Basically, all of the technologies I mentioned are capable of handling something like this, but none of them can do this. This is a percentile chart; the top chart shows the read percentiles. In that same half an hour in which I was doing 400,000 read operations per second, which is around 720 million requests, imagine I put those 720 million requests in a list and sort them by their response time, from the fastest here to the slowest here. If I go to the middle and take the item there, that's the 50th percentile. If I go towards the slowest end and take the one at the 95 percent mark, that's the 95th percentile.
And it says that even if I go to the 99.9 percent mark of that sorted list, that request is still going to take less than one millisecond. That's something Aerospike can do that other databases cannot. And it's the same for writes: at 99.9, writes are still faster than one millisecond. Cassandra is considered a very, very fast database; at p99, Cassandra works at less than 10 milliseconds, which is 10 times slower than this, and at p99.9 you're usually looking at something around 100 to 200 milliseconds. So this is something other databases cannot do. And the other thing other databases cannot do is this: this test was running on four i4i.4xlarge machines. There's a fifth machine, the machine I was running the test from. It's a little larger, because to drive 600,000 requests per second I needed a machine with lots of CPUs. But these i4i.4xlarge machines are not that large; my laptop is larger. My laptop has 32 virtual CPUs, 128 gigabytes of RAM, and more than four terabytes of disk.
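To make the percentile readout concrete, here is a minimal nearest-rank percentile sketch in Python, run on simulated response times (random placeholders, not real Aerospike measurements):

```python
import random

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an ascending-sorted list (simplified sketch)."""
    idx = min(len(sorted_vals) - 1, int(len(sorted_vals) * p / 100))
    return sorted_vals[idx]

# Simulated sub-millisecond response times in milliseconds -- placeholders,
# not real measurements from the cluster in the talk.
random.seed(7)
latencies = sorted(random.uniform(0.1, 1.2) for _ in range(1_000_000))

p50 = percentile(latencies, 50)     # the request right in the middle
p95 = percentile(latencies, 95)     # 95% of requests completed within this time
p999 = percentile(latencies, 99.9)  # only 1 in 1,000 requests was slower
print(p50, p95, p999)
```

Production systems typically track latency histograms rather than sorting raw request lists, but the definition of p50, p95, and p99.9 is exactly this.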
So these are four laptop-sized machines, and on these four laptop-sized machines I can store three terabytes of data with redundancy, so if a node fails, the database continues with no data loss, and it can handle 600,000 operations a second. The cost of these four machines on AWS, just to give you an idea, is around $35,000. If you wanted to run the same load on something like DynamoDB, it would cost around $2 million. AWS has a calculator: you can put in DynamoDB with three terabytes of data and 600,000 operations per second, 200,000 writes and 400,000 reads, and the number comes out close to $2 million. So yes, it is fast; as you saw, the 99.9th percentile is less than one millisecond, similar to a cache. And as you can see, it's very cheap, because it doesn't need many resources to handle that load.
Okay, so far we've discussed fast and cheap, but what is the point of being green? This started a couple of years ago, when one of our clients published something very interesting. Before I talk about what they published, let me tell you who they are. Criteo is an Ad tech company. Almost all of the largest Ad tech companies are our customers, and Criteo is one of the largest in the world, and certainly the largest in Europe. They serve around 5 billion ads per day, so if you see an ad, it is very likely coming from Criteo, and somehow some data related to that ad was stored in Aerospike. Their revenue is more than $2 billion, and the SLA they have for the process I described, identifying you, finding the campaign, bidding, and serving, is 50 milliseconds. They want to be able to do all of that in 50 milliseconds. And that generates 250 million operations per second on their database. 250 million operations: every four seconds, they're doing a billion operations. Ridiculous, ridiculous numbers.
Before they moved to Aerospike, they had an infrastructure like this. They had a database, an open-source database that many of you might be familiar with, famous for having a database part and a cache part, so the database itself has a cache. And in front of this database-with-a-cache there was another caching technology, and all of this together was serving all of the clients using the database. There were more than 4,000 nodes across two different technologies handling the workload they wanted. When they migrated to Aerospike, they could reduce the number of nodes from 4,000 to 300. Of course, as you can see in the picture, the nodes are slightly larger: Aerospike nodes require a bit more RAM, a few more CPUs, and so on. But if you sum up the number of CPUs on this side and on the other side, the report said they could reduce their hardware resources by 80%.
That's something we're used to at Aerospike, reducing the amount of resources you use by 80%, for the reason I explained: we are faster, so we don't need as many resources as other technologies. But one thing they said that was very interesting was that they could reduce their CO₂ emissions as well. Well, I'm a skeptical man. I thought: okay, now everybody's talking about reducing their CO₂; how much could it really be? How much CO₂ does storing some data actually produce? And that made me do a calculation to see how much CO₂ storing one petabyte of data generates. Before I tell you the process, let me explain why I chose one petabyte. One petabyte sounds like a lot of data, and we do have customers that store petabytes of data, but there are very, very few of them. But if you think about the organization you work for, it is very likely that you have several databases in the order of tens of terabytes, several in the order of a few terabytes, and lots of databases in the order of a few hundred gigabytes. And all of these environments have pre-prod, which usually replicates exactly what is in prod.
Then you have dev environments, then test environments. If you put all of these together, the capacity of all of the databases in your organization is very likely around a petabyte, even though you might not have a single one-petabyte production database. That's the reason I chose one petabyte. And these technologies are all scalable, so if you're not happy with one petabyte, divide everything by two: 500 terabytes might be the number you're looking for. So I said: okay, if I have one petabyte of data and I want to store it on Aerospike, how many resources would I need? I did a sizing exercise and found that with around 20 i4i.32xlarge machines on EC2, I can store one petabyte of data.
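A sizing exercise like this boils down to back-of-the-envelope arithmetic. The sketch below uses a purely hypothetical usable-capacity-per-node figure, chosen only so the arithmetic lands on the 20 nodes mentioned above; real sizing depends on replication factor, compression, headroom, and the actual hardware:

```python
import math

def nodes_needed(dataset_tb, replication_factor, usable_tb_per_node):
    """Back-of-the-envelope node count for a dataset (a sketch, not a real sizing)."""
    # Round up: you can't provision a fraction of a machine.
    return math.ceil(dataset_tb * replication_factor / usable_tb_per_node)

# 1 PB dataset, replication factor 2, and a hypothetical 100 TB usable per node.
print(nodes_needed(1000, 2, 100))  # -> 20 machines
```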
Then I said: okay, for the components inside these machines, I can find the manufacturing emissions of each component. Machines in EC2 have a lifetime of five years, so I can divide that number by five, basically amortizing the manufacturing CO₂ of each component over each year it is in use, and that gives the manufacturing emissions per year. I can also find out how much energy each of these components uses during the year, so I found that as well. And if you know where these components are located, you can go to that country's energy production data and find out how much CO₂ that country emits for each kilowatt-hour of energy. I did this test for Ireland. Ireland is actually one of the reasonably green countries in Europe, so it's a reasonable number.
So I looked at the emissions that the energy consumption of those resources produces, and I came up with something like this: around 45 tons of CO₂ per year from manufacturing those components, and around 65 tons from their energy consumption. Putting them together, that's 110 tons, which is, well, a number. But to be able to say whether it is a good number or a bad number, I needed to compare it with something else. Before joining Aerospike, I used to work for another database company called DataStax. I don't know if you've heard of DataStax; it is the main contributor to the Apache Cassandra project. I've been a Cassandra user and I worked for DataStax for many, many years, so I can consider myself someone who knows Cassandra really, really well.
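The methodology above reduces to a simple formula: annual platform emissions = (manufacturing emissions / hardware lifetime) + (annual energy use × grid carbon intensity). Here is a sketch with illustrative inputs chosen to reproduce the talk's roughly 45 + 65 = 110 tons; the kWh and grid-intensity figures are placeholders, not the paper's actual data:

```python
def platform_emissions_kg(mfg_total_kg, lifetime_years, annual_kwh, grid_kg_per_kwh):
    """Annual platform emissions: amortized manufacturing plus energy use."""
    manufacturing = mfg_total_kg / lifetime_years  # amortized per year of use
    energy = annual_kwh * grid_kg_per_kwh          # emissions from powering the hardware
    return manufacturing + energy

annual_kg = platform_emissions_kg(
    mfg_total_kg=225_000,   # hypothetical: amortizes to 45 t/year over 5 years
    lifetime_years=5,       # EC2 machine lifetime, per the talk
    annual_kwh=260_000,     # hypothetical annual energy use of the cluster
    grid_kg_per_kwh=0.25,   # hypothetical grid carbon intensity (kg CO2 per kWh)
)
print(annual_kg / 1000)  # -> 110.0 tons of CO2 per year
```

Swapping in a different technology changes only the resource count feeding these inputs, which is the whole point of the comparison that follows.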
So I did the same exercise for Cassandra: I worked out how many resources Cassandra would need, and what the energy consumption and manufacturing emissions of those resources would be. The final number was 678 tons of CO₂. So from 110 tons of CO₂, it jumped to 678 tons, which is a massive difference. At that point I started to think: okay, it sounds like a massive number, but what is the equivalent of staying on Cassandra and not moving to a more modern technology like Aerospike? And the numbers were staggering. Every year, it's like burning 280 tons of coal, which is an interesting number, or flying to the moon and back three times, or cutting down 55 hectares of trees. So yes, you can quote me on that: if you're not moving from Cassandra to Aerospike, it's like you're cutting down 55 hectares of trees.
It is something quite significant, and the result was interesting enough that I decided to publish it. It was published in IEEE Access, which is a peer-reviewed scientific journal, and the paper is available there. But it's a very long paper, explaining my whole methodology for all of these calculations, so I have also written an executive summary that is available here. You can scan the QR code to have a look, or just search for the title about reducing CO₂ emissions and you should be able to find it easily; inside it there are links to the actual paper in IEEE Access as well, if you're interested. To describe the framework I suggested: if you have a workload and a technology, where here the workload was one petabyte of data and the technology was either Cassandra or Aerospike, you can always come up with a resource size estimation that the technology needs to handle that workload. Then you can calculate the manufacturing emissions and the energy usage emissions, and put them together to give you the platform emissions.
The interesting thing here is that the platform emissions are driven by the resource size estimation. If you have a technology that requires fewer resources to handle a workload, it means you're going to produce fewer emissions. So if you go back to those Venn diagrams: for the same reason that fast was cheap, because fast meant using fewer resources and fewer resources meant cheap, using fewer resources also means you're green. That's why I'm saying that fast, cheap, and green are basically the same thing. If any one of them is your goal, you're going to achieve the other two. That's a good trade-off to have; most other industries don't have this luxury. If you buy a faster car, it's going to be more expensive, and it certainly won't be greener. The computer industry is usually not like that: if you can improve the efficiency of your applications, you can make them run faster, you can make them cheaper, and you can make them greener. And I want to finish the talk with this last thought: if you are going through hell, perhaps you should consider that you might need to keep going, a little bit faster. Thank you very much.
About this webinar
NoSQL databases have evolved to address not only matters of speed, efficiency, and scale but also a more universal concern: sustainability. Aerospike Database is leading the way in the green data space by helping businesses reduce their carbon footprint with modern, efficient technology. Join us for a lively discussion on the evolution of NoSQL databases and the role of Aerospike Database in promoting sustainability in big data. This webinar is for tech enthusiasts and savvy businesses seeking a competitive edge while promoting sustainability initiatives. It will cover topics such as:
The benefits of using NoSQL databases for big data applications
How Aerospike Database compares to other NoSQL databases
How Aerospike Database can help you reduce your CO₂ emissions