WEBINAR

Aerospike @ PhonePe: An SRE Perspective


Hear from PhonePe’s Site Reliability Engineers as they share their insights on how they manage hundreds of clusters across many data centers using Aerospike’s real-time data platform.

Speakers

Mihir Vyas
Site Reliability Engineer 2, PhonePe
Tharun Moorthy
Site Reliability Engineer 1, PhonePe

Vikram:

So as the next speakers line up, it's going to be a joint talk from Mihir and Tharun from PhonePe's SRE team. They're going to speak about how they efficiently manage so many Aerospike clusters and the nitty-gritty within that. As they're going through that, there are refreshments at the back, so you can step out and grab some. And after the third session, which is going to be done by Tim Fox, Chief Developer Advocate at Aerospike, we also have a dinner organized here, so hold on for that. And then, like I said, if you have any questions, we'll take them after the three sessions are done, okay?

So I also have somebody from the PhonePe team who would like to make an announcement.

Pallavi:

Good evening, everyone. Thanks for attending the session. My name is Pallavi. We have a few people from our team present here as well, so in case you have any questions related to TA, you can reach out to them after the session. Thank you.

Mihir Vyas:

Yeah, so good evening, everyone. I'm Mihir, and Tharun is with me. We are from the SRE team at PhonePe, and today we'll be sharing how we manage and scale thousands of Aerospike servers and ensure their reliability and uptime. This is our table of contents today. Yeah. So, Tharun, could you please tell us what guiding principles we look for while choosing a database?

Tharun Moorthy:

Yeah. So at PhonePe we have our Aerospike databases. When we choose a database, what do we base it on? First of all, we choose a database based on the use case. On top of that, there are certain guiding principles we follow when choosing a database. These are the guiding principles that we are going to walk through now.

On-premises. If you didn't know, how many of you knew that PhonePe has its own data centers? It doesn't run out of the cloud. We have our own data centers. And since PhonePe is completely based on on-premises data centers, we should be able to set up and manage the database we choose in our own data centers, and it should not have any external dependencies. If it depended on some external cloud service that it constantly had to make requests to, that wouldn't help much. So that's the first guiding principle.

Then high availability. When we set up databases, or any service for that matter, how many of you have had a perfectly peaceful week, and then on Friday evening or Saturday morning or Sunday you get an alert or some production issue? The whole week is peaceful, but the weekend gets messed up. Those are unavoidable situations. But even in such situations, the database should not completely lose its data or its operational capability; it should remain operational. That's what we call high availability and fault tolerance, and the database should be capable of that.

And talking about resource utilization: when you allocate certain CPU and memory to a service or a database, it should be able to efficiently utilize what you're allocating to it. We cannot just keep allocating while the database utilizes only 10% or 12% per node and we keep adding nodes. That's not useful.

And then we have security and auditing, where the database should have authentication and authorization, and audit logs to record activity. Then fair load balancing: if you have a distributed cluster, all the nodes should be able to efficiently distribute the operations [inaudible 00:03:58] cluster nodes. So that's fair load balancing.

Coming to low latency at scale: as the reads and writes to the cluster scale, the latency should not keep scaling with them. That's low latency at scale. And cross-site replication: as you all know, if you have multiple data centers, the DB should work across data centers. This probably won't happen, but in case a data center becomes unreachable, then we need to have the data on another site and we should be operationally capable of functioning out of another data center. The DB should give us that ability.

And ease of management. Scaling, upgrades, monitoring, maintenance, everything should work out of the box for us in an easy manner. That's ease of management. And horizontal scaling. Of course, just vertical scaling would not help. How long can you keep adding disks? How long can you keep bumping up the cores? You need to be able to scale horizontally. So that's horizontal scaling.

Mihir Vyas:

So, Tharun, could you please highlight Aerospike's [inaudible 00:05:06] at PhonePe?

Tharun Moorthy:

Sure. Sure, Mihir. So yeah, this is it in a nutshell. We have more than a thousand Aerospike nodes at PhonePe, spread across hundreds of Aerospike clusters. They're a mix of virtual machines and bare metal. You all know what a virtual machine and a bare metal server are, right? So they're a mix of both, but we try to keep each cluster homogeneous. We don't mix things up.

Then, just as an example, our biggest cluster is a roughly 70-node bare metal cluster, where each bare metal is around a 600-gig, 40-core machine with several 1.5-terabyte NVMe disks for the data and several 2.2-terabyte NL-SAS disks for the logs and digest logs. So that's how it is. And certain clusters handle up to 650k QPS on a day-to-day basis.

And we have an in-house change management system. You all are part of engineering, and at your companies you would also have some sort of change management system. When I went through ours, I found, interestingly, that we push fifty-plus Aerospike changes to production at PhonePe, and this excludes the stage changes that we do, the perf tests that we do, and the new cluster setups. It excludes all of that, and still we are able to push so much to production. And we have the trillions of data points that Chaitanya was explaining, and up to 150+ terabytes of data per site. That's the Aerospike footprint at PhonePe.

And yeah, we use a lot of features too. It's a huge infra, so we must be using a lot of features, some of which even Chaitanya highlighted. But let's look at certain features from the infra point of view, from an SRE point of view, which Mihir will walk us through.

Mihir Vyas:

Yeah. So these are the features we use at PhonePe. Starting off with the Aerospike tools, these are divided into four categories: Aerospike cluster management tools, data management tools, performance assessment tools, and finally dump/restore tools. For cluster management tools, we have asadm, asinfo, and the like. asadm is used to get a bird's-eye view of the cluster: the XDR information, how many nodes are present in the cluster, network information, how many records and replicas are present, et cetera.

Then we have something called asinfo, with which we can push a dynamic change even while the Aerospike server itself is running, without having to restart the service; obviously we also make it persistent in the conf so that it survives restarts. Then we have something called AQL under data management, where we can actually log into the cluster and run queries.
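
As a rough illustration of both, here is the kind of thing that looks like on the command line; the host, namespace, and parameter values are hypothetical, and auth/TLS flags are omitted:

```bash
# Illustrative only - host "as-node-1", namespace "payments", and values are made up
# (parameter names can vary by server version).

# Push a dynamic config change to a running node, no restart required:
asinfo -h as-node-1 -v 'set-config:context=namespace;id=payments;high-water-disk-pct=60'

# Remember to mirror the change in aerospike.conf so it persists across restarts.

# Run an ad-hoc query against the cluster with AQL:
aql -h as-node-1 -c "SELECT * FROM payments.txn WHERE PK = 'txn-42'"
```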

Then we have something called asvalidation. This tool is particularly used for CDT bins. Let's say you are migrating from one cluster to another and the versions are different; there is a possibility that your CDTs might become corrupt. This tool will help you identify and fix those CDTs. We faced this once in the past, and this tool helped us with it.

Then we have some tools like asloglatency and asbench. asloglatency is used to parse the log files; it will give you the latency information in a particular format, and you can even provide a timestamp range. We normally use this when we get complaints like, "Hey, there is some latency on Aerospike." Then we have asbench, which, as we obviously know, is for benchmarking purposes. And finally we have asbackup and asrestore: asbackup is for taking a backup, and asrestore is used for restoring that backup, obviously.
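
For a feel of these tools, here are hedged example invocations; log paths, namespace, key counts, and directories are made up, and exact flags vary a bit between tool versions:

```bash
# Parse write latency histograms for namespace "payments" out of the server log:
asloglatency -l /var/log/aerospike/aerospike.log -h '{payments}-write' -t 10

# Quick synthetic load, e.g. an 80/20 read/update workload:
asbench -h as-node-1 -n payments -k 1000000 -w RU,80

# Back up a namespace to a directory, and restore it elsewhere:
asbackup  -n payments -d /backups/payments-$(date +%F)
asrestore -n payments -d /backups/payments-2024-01-01
```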

Then moving on to the second feature, which is an enterprise-only feature: TLS plus auth. All our clusters are TLS enabled, which means the traffic is secure, and we also have auth enabled, so you need a username and password to connect to the cluster, along with the appropriate privileges. Then we have data-in-memory as well; we will touch upon this later because we have one slide dedicated to it.
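
As a rough sketch of what that looks like in aerospike.conf (names, ports, and file paths are illustrative, and the exact security stanza differs between server versions):

```
# Illustrative aerospike.conf excerpt - paths, names, and ports are hypothetical.
security {
    # on servers before 6.0 this is written as: enable-security true
}

network {
    tls phonepe-cluster-tls {
        cert-file /etc/aerospike/tls/server.pem
        key-file  /etc/aerospike/tls/server.key
        ca-file   /etc/aerospike/tls/ca.pem
    }
    service {
        tls-port 4333
        tls-name phonepe-cluster-tls
        tls-authenticate-client false
    }
}
```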

Then we have XDR. XDR is a feature where you can have two clusters in two different data centers and ship records from one data center to another. XDR replication is asynchronous. We have unidirectional as well as bidirectional setups: unidirectional, as the name suggests, is one way, and bidirectional is both ways. With bidirectional, you can have your app write on both sides.
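
For reference, a minimal sketch of an XDR destination in aerospike.conf using the server 5.0+ syntax; the DC name, address, and namespace are hypothetical:

```
xdr {
    dc dc2 {
        node-address-port 10.20.0.11 3000
        namespace payments {
        }
    }
}
# For a bidirectional setup, the cluster in dc2 carries a mirror-image stanza pointing back at dc1.
```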

Then we have AP and CP, which Aerospike calls SC. Before touching that, let me briefly tell you what the CAP theorem is. It says that out of consistency, availability, and partition tolerance, you can only have two at any point in time. Partition tolerance is obviously inevitable because network splits can happen anytime, so you are left with AP and CP. At PhonePe we have both kinds of namespaces: AP as well as SC. Then we have custom monitoring pipelines along with the Aerospike monitoring stack; we will touch upon that later, again because there is a slide for it.
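
A hedged sketch of how the two modes appear in a namespace stanza; names and sizes are made up:

```
# Illustrative namespace stanzas - one AP, one SC.
namespace sessions {              # AP: stays available during partitions
    replication-factor 2
    memory-size 32G
}

namespace ledger {                # CP/SC: strong consistency for acknowledged writes
    replication-factor 2
    memory-size 32G
    strong-consistency true
    # An SC namespace also needs a roster to be set, e.g. via asadm's manage roster commands.
}
```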

Then, just like with XDR, there might sometimes be a requirement to transfer data to some other data source. For that we use the Kafka outbound connector. Your Aerospike cluster is there, your Apache Kafka is there, and this connector sits in between them; it is itself a cluster so that we don't have a single point of failure. Data comes from Aerospike, goes to the Kafka outbound connector, and from there it goes to Apache Kafka.
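
On recent server versions, the connector cluster is addressed much like an XDR destination; a rough sketch (connector hosts, port, and namespace are hypothetical):

```
# Illustrative: ship change notifications for namespace "payments" to a Kafka outbound connector.
xdr {
    dc kafka-pipeline {
        connector true                          # destination is a connector, not another Aerospike cluster
        node-address-port kafka-connector-1 8080
        node-address-port kafka-connector-2 8080
        namespace payments {
        }
    }
}
```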

[inaudible 00:10:25] most of the clusters are on 4.X, but we are slowly migrating to 6.X because of the new features available. And finally, we have integrated HashiCorp Vault as well. Vault is basically a secrets management system, and with it you can create dynamic credentials that expire after a set period of time. So that's about Vault.

Yeah. So, Tharun, could you please explain the role of SRE?

Tharun Moorthy:

Sure. Thanks, Mihir. So yeah, now coming to the main topic, SRE's role in managing Aerospike. We at SRE do a lot of things, and we have classified the broad categories of tasks into five buckets here so that we can present them quickly: create, monitor, maintain, scale, and upgrade.

Let me start with create. Create covers the various cluster setups and the various connector setups; connectors are nothing but the Kafka connectors you just heard Mihir speak about. We SREs are also responsible for setting those up. And namespace creation, because one cluster can have various namespaces in it; it need not have just one namespace. So we do that as well.

And we write scripts and tools. See, automation is the way forward; you do not want to keep doing everything manually. So we write scripts and tools to, say, set up clusters, verify them, and migrate clusters, and we write scripts to monitor the clusters as well. That's what we do. And XDR setups: as Mihir was talking about XDR, we do have clusters with unidirectional XDR as well as bidirectional XDR, and they mean exactly what the words say, uni- and bidirectional. So that's about the create bucket.

Now let's look at the creation flow. How does this happen? The devs raise an approved Jira saying they need an Aerospike cluster. Then we discuss with them what they need and what size: what is the use case, can it be shared with some other cluster, and a lot of other discussion. And then, after approvals are in place, we have a dedicated team that helps us with provisioning the nodes, and we get the nodes provisioned.

And then we use SaltStack here, which helps us in keeping the configurations and rolling them out as well. And since it also involves a peer review process, it avoids a lot of mistakes. Because we are humans and we all make mistakes, and that's completely all right; it's just good that we don't repeat the same mistakes. SaltStack helps us with that because we have also integrated it with GitLab and such, so we have peer reviews before everything is merged and approved. Then we install Aerospike, and we also create the Aerospike roles and users, because I was talking about authentication. Then we set up monitoring, do a peer review, and hand over the cluster. This is the creation flow in a nutshell.
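
To make that concrete, here is a hypothetical sketch of what a Salt state for the Aerospike config rollout could look like; state IDs, paths, and the template name are made up, and in practice restarts would be orchestrated as controlled rolling restarts rather than triggered automatically:

```yaml
# Illustrative Salt state - names and paths are hypothetical.
aerospike-config:
  file.managed:
    - name: /etc/aerospike/aerospike.conf
    - source: salt://aerospike/files/aerospike.conf.j2
    - template: jinja

aerospike-service:
  service.running:
    - name: aerospike
    - enable: True
    # No 'watch' on the config file on purpose: config changes are applied via
    # controlled rolling restarts, not automatic service restarts.
```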

Now, the next slide please. Yeah, now the monitoring stack. We saw that we have a huge Aerospike infra, and monitoring it manually as a human would be a very tedious task; we cannot monitor the clusters 24/7. So Mihir will tell us how we manage the monitoring stack at PhonePe.

Mihir Vyas:

Yeah, hello. So monitoring is a critical aspect of any service we run. The most important thing is the need for monitoring: with it, we can detect issues early. If we detect something early, we can take a call on capacity planning, or if there is some performance degradation, we can take a call on that. And later on, if something does go wrong, the metrics available might even help us file an RCA. And overall, obviously, it helps with the uptime and reliability of your services.

So at PhonePe we have two types of monitoring services. One is Aerospike's and one is in-house, which we have built. Speaking of the Aerospike monitoring stack: what we have is a combination of Aerospike Prometheus exporters, a Prometheus server, and a visualization tool called Grafana. Let's say you have a three-node Aerospike cluster. You will have a Prometheus exporter running on all three nodes, all three nodes send metrics to the Prometheus server, and you can use a tool like Grafana to visualize the metrics. The Grafana dashboards for these metrics are actually open sourced by Aerospike, so you can find them on GitHub as well.
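
A minimal sketch of the Prometheus side of this, assuming hypothetical node names and the exporter's usual default port (which may differ in a given deployment):

```yaml
# Illustrative prometheus.yml scrape job for the Aerospike Prometheus exporters.
scrape_configs:
  - job_name: aerospike
    static_configs:
      - targets:
          - as-node-1:9145
          - as-node-2:9145
          - as-node-3:9145
```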

Then come the custom monitoring daemons. Along with this, we have custom monitoring daemons which will actually send out a phone call to us, or to whoever is on call. I'll highlight some of them, although there are plenty. The first one is Aerospike cluster size: you expect a cluster size of 10 and suddenly it becomes nine. For that, there is an alert configured.

Then there is an alert around the high water mark. As we know, data is distributed in Aerospike, and the high water mark is configured around 50-70% depending on the requirements. Let's assume the high water mark is around 50%. The way we have written this monitoring is, say we have a ten-node cluster: the high-water-mark script will simulate a one-node failure, recalculate usage against the high water mark, and alert us in advance. Because of that, we have a one-node failure tolerance here.
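
A hedged sketch of that "predict one node down" check; the namespace name, threshold, and statistic names (which follow recent server versions) are illustrative, not PhonePe's actual script:

```bash
#!/usr/bin/env bash
NS=payments
HWM_PCT=50

# Current device usage percentage for the namespace, from the local node's stats.
used_pct=$(asinfo -v "namespace/${NS}" -l \
  | awk -F= '/^device_used_bytes/ {u=$2} /^device_total_bytes/ {t=$2} END {printf "%.1f", u*100/t}')

# Current cluster size.
size=$(asinfo -v "statistics" -l | awk -F= '/^cluster_size/ {print $2}')

# If one node drops, its share of data re-replicates across the remaining nodes,
# so usage roughly scales by n/(n-1).
predicted=$(awk -v u="$used_pct" -v n="$size" 'BEGIN {printf "%.1f", u * n / (n - 1)}')

if awk -v p="$predicted" -v h="$HWM_PCT" 'BEGIN {exit !(p >= h)}'; then
    echo "ALERT: predicted usage ${predicted}% would breach the ${HWM_PCT}% high water mark on a one-node failure"
fi
```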

Apart from that, we have some other monitoring as well. For example, clock skew: Aerospike has a tolerance of 27 seconds. So we have a monitoring service around clock skew which uses only Aerospike's own statistic, not the time from the system; it exposes that metric, and we alert on a threshold of one second. And apart from that, some other alerts like disk available percentage, stop writes, et cetera, are also there.
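
A small sketch of that kind of check, assuming the clock-skew statistic name exposed by recent server versions and the one-second threshold described above:

```bash
# Illustrative only - statistic name and threshold per the description above.
skew_ms=$(asinfo -v 'statistics' -l | awk -F= '/^cluster_clock_skew_ms/ {print $2}')
if [ "${skew_ms:-0}" -gt 1000 ]; then
    echo "ALERT: cluster clock skew is ${skew_ms} ms (threshold 1000 ms)"
fi
```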

Now, how these systems work is we are using something called Riemann. Riemann is an event processing system. What it expects is an event with a certain structure, basically key-value pairs: you can pass the state of the system, the host name, the time delay of the service, the service name, and so on. Based on what you send, Riemann will then forward it to either InfluxDB or the Aggregator. You can think of the Aggregator as a rule engine that has rules about which person and which team to send the alert to for a particular host. That is the purpose of the Aggregator. And InfluxDB is, again, like Prometheus, a time-series database, so you can visualize the metrics.

Then finally this particular tool also has the ability to snooze alerts. So let's say I'm not on call but I'm working on something, I'm doing some maintenance, I'll be shutting off the services. That will certainly result in a phone call. So I can snooze the call so that the on-call will not receive the alert.

Then finally, for network monitoring we have something called LibreNMS, which will give us network statistics, and along with that it'll also give some system stats monitoring, like disk I/O, load, et cetera. Then...

Tharun Moorthy:

Yeah. Let me walk you all through the maintain aspect. Under the maintain aspect, the different things we have are fixing production issues, where we'll talk about a couple of production issues next, and backup and restore: we back up the data and restore it when we have to migrate Aerospike clusters.

And then performing truncates. Data cannot be kept on the cluster forever; based on the need, we'll be truncating the data. That's what performing truncates is. And config changes: the various Aerospike config changes that come up, if possible we do them in a dynamic [inaudible 00:18:01], and if not we do a rolling restart to apply those configs. So that's about maintain.

Mihir Vyas:

Yeah, so speaking of scaling. In [inaudible 00:18:11], in order to cope with increasing traffic, we have to scale. Scaling essentially means adding more resources, or you can add more servers to scale out. There are two types of scaling you can do here: horizontal scaling and vertical scaling.

Most likely, if it's a bare metal server, we will do horizontal scaling, because we obviously can't keep adding disks. But if it's a virtual machine, we might even prefer scaling vertically. Maybe, let's say, the virtual machine has around 200 GB of RAM and you have given only 100 GB to a namespace, so you can actually change that. Yeah, that's about it. These are some general operations we do around scaling: adding an Aerospike node, increasing file size, migrating from VM to VM, removing XDR, and so on.

And when and how do we scale? This is very much similar to what Tharun described in the create slide. We manage everything via Git, so somebody will send a merge request adding a new node in some host file, and somebody will review and roll it out accordingly. We add the node in off-peak hours. Note that this doesn't require any downtime, but adding a node triggers a rebalance of records across the cluster, so that might add to the latency. And how we scale is we add the node on the XDR destination side first and then on the source side. So that's our scaling.

Tharun Moorthy:

Yeah. Now talking about upgrades: how many of you here have set up actual production clusters? How many of you are SREs here? Okay. Oh, okay. Only so many SREs we have. Okay, nice. So yeah, we have set up the production cluster, but how many of you have wished that we could keep it up forever? That we need not make any changes there, need not migrate it? It's a very tiring process because it involves a lot of things: you have to do a proof of concept, you have to do perf tests, those perf tests should be with the right configs and all those things, you have to plan the migration, and all that stuff.

So yeah, unfortunately we have to do upgrades, and there are several legitimate reasons why you would want to upgrade: either the OS is going end-of-life, or the service itself is going end-of-life, or there is a bug fix or there are new features in the new version. So you'll have to move someday. We SREs take care of that aspect of Aerospike as well.

Okay, moving on to data-in-memory. Mihir was talking about data-in-memory true versus false. So let's now look at whether PhonePe is largely moving towards data-in-memory true or data-in-memory false, and how we arrived at that.

Mihir Vyas:

Yeah. So we skipped the "storage engine type" part earlier; this is why. When we configure a namespace in Aerospike, there are five different ways you can decide where you want to store it. One is persistent memory, for example Intel Optane, but that is ruled out because we don't have that hardware. Then there is something called All Flash, where even the primary index is on flash. We don't have that either.

So we are left with three options. One is storage engine as device, where the primary index is in memory and we can choose whether the data is also kept in memory, which is called data-in-memory true. With data-in-memory false, the primary index is in memory but the data is only on the flash. And there is one more category, storage engine as memory, where you won't have any persistence at all. So let's say there is a power outage in the DC, you'll lose all your data.
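
For reference, a hedged sketch of the device-backed variant in aerospike.conf; device paths and sizes are illustrative, and the data-in-memory flag applies to pre-7.0 servers like the 4.X/6.X versions discussed here:

```
# Illustrative namespace with storage-engine device.
namespace payments {
    replication-factor 2
    memory-size 64G                  # holds the primary index (and the data too, if data-in-memory true)
    storage-engine device {
        device /dev/nvme0n1
        data-in-memory false         # false: data lives only on flash; true: a full copy also stays in RAM
        write-block-size 128K
    }
}
```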

This particular slide only focuses on storage engine as device, and that is the one we use the most. So, speaking of data-in-memory true versus false: when we did some perf tests, we found that the performance with false was the same as with true. There was no difference as such. Most likely the SSDs we have are also a reason for that. Tharun will be sharing more on these perf test results.

Tharun Moorthy:

Yeah, so this is the architecture of the perf test setup that we had. We used the Java benchmark client, testing versions 4.X and 6.X, and data-in-memory true versus false, with client threads increasing up to 512. We have a dedicated perf test team here at PhonePe who helps us with these things.

So based on these parameters, there were certain findings, and you can look at this graph here. The throughput increases with an increase in the client threads of the benchmarking tool. On the X axis we have the increase in client threads, and on the Y axis we have the QPS. We can see that the throughput increases, and CPU load was also limited to 50%; it didn't go beyond that as the client threads increased. And we were able to touch around two lakh (200,000) QPS. By the way, this was a write-intensive test, not a read-intensive test. So the summary was that the latencies and the QPS were the same for data-in-memory true versus data-in-memory false.

Like Mihir said, we use NVMe SSDs at PhonePe. We concluded this could be one reason, but we would love to hear from Tim and from the Aerospike team as to what they have to say about this.

Yeah, Mihir, please walk us through this.

Mihir Vyas:

Yeah, one second. So, some of the prod issues and how we resolved them. First, "queue too deep." This only happens when you are using storage engine as device. Whenever a write comes, it goes to a write queue, which is there inside the RAM, and from there, if it is successful, it will send an acknowledgement to the client: "Okay, the write is successful."

But the flush from that queue to the device happens asynchronously. So what happened in one of our VM clusters was we saw this error coming in: the writes were failing because the write queue was full. As a quick fix, we increased the write queue size; there is a variable, max-write-cache, and if you tune that you can increase the write queue size. That fixed it, but it was only a quick fix. The permanent fix was that we moved this to a bare metal cluster.
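
A hedged example of that kind of quick fix; the namespace and value are illustrative, and the exact parameter path can differ by server version:

```bash
# Bump the write cache dynamically on a running node:
asinfo -v 'set-config:context=namespace;id=payments;max-write-cache=128M'

# And persist the same setting in the namespace's storage-engine stanza in aerospike.conf
# so it survives restarts:
#     max-write-cache 128M
```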

The second issue we had was around clock skew. Like I said, Aerospike has a tolerance of 27 seconds. At that time we didn't have any monitoring around it, and what happened was the clock skew breached around 40 seconds. As a quick fix we logged in and fixed the time, and after that we deployed the monitoring. From that day, we have never had an outage due to this clock skew issue. That's about outages. Huge shout out to Aerospike's support team for helping us during outages.

Tharun Moorthy:

Along with that we have feature requests to the Aerospike team as well.

Mihir Vyas:

Yeah, so although Aerospike is packed with features, we still have some feature requests which would really help us. Speaking of user creation and password hashes: for example, when I want to create a user now, I have to run a command like create user so-and-so [inaudible 00:25:01] password so-and-so, and the password will be in plain text. If we could have a feature where we can provide a hash instead of the password itself, then somebody standing behind me cannot see it, and somebody who has sudo cannot log in and see in the history file what commands were run, so the password stays secure. Also, because we are using Vault, we might actually store the hash over there.
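
For context, user creation today looks roughly like this (the user, password, and role names below are made up), which is why the password ends up visible in plain text on screen and in shell history:

```bash
# Illustrative only - user, password, and roles are hypothetical.
asadm -e "enable; manage acl create user app_rw password 's3cr3t' roles read-write"
# The feature request is to be able to pass a hash here instead of the plaintext 's3cr3t'.
```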

Then there is something around logging of slow queries, which we don't have. So earlier we had something called a job manager, but now that is not there. So if we can have something around that, that will be useful.

Tharun Moorthy:

And we faced a hotkey issue where we were not able to detect the hotkeys. So it would be helpful if we had some command we could run to find out the current hotkeys in the cluster. The last thing is we would love a centralized dashboard that has all the Aerospike clusters we manage, with the ability to download reports. Where does this help? When we are doing capacity planning, or when we want to find out how many resources Aerospike takes up in our infra, it is really helpful. So yeah, these are the few feature requests that we have. That's all we have to sum up. Thank you. Thank you so much.

Mihir Vyas:

We can start with Q&A.

Vikram:

Mihir and Tharun, any burning questions? One or two around here?

Speaker 5:

Okay. Hello.

Mihir Vyas:

Hi.

Speaker 5:

So your largest data cluster is 70 nodes. So I have a hypothetical question. Let's say there is a partial data center outage and you lose 30 or 40 [inaudible 00:26:36].

Mihir Vyas:

We can't hear.

Speaker 5:

Hello?

Tharun Moorthy:

It is seventy. Yeah.

Speaker 5:

Let's say out of seventy machines, you lose 30 or 40 machines from the cluster. Do you have any automation in place to bring it all back together?

Mihir Vyas:

No, there is no automation around that because manual intervention is superior. And also the way it will be handled depends on what kind of configuration we have. Is it AP or CP? So that will determine the behavior.

Speaker 5:

Okay.

Mihir Vyas:

Also, there is a concept of fast and cold start in Aerospike. That also will kick in here. So because you said we lost the nodes, right? Are they shut off or how are they?

Speaker 5:

Let's say they just died.

Mihir Vyas:

Okay, let's say they are shut off. Then it'll take a lot of time to bring them up. More than four hours, considering the memory size we have given.

Speaker 5:

So as an SRE, do you have the runbooks ready with them for all possible cases?

Mihir Vyas:

Hmm? Sorry?

Speaker 5:

As an SRE, do you have runbooks for everything?

Mihir Vyas:

Yeah, more or less.

Tharun Moorthy:

And since we do a lot of things, we do not have a runbook for everything. So how many of you think SREs do a lot? Thank you for empathizing with us.

Mihir Vyas:

Thank you. Thank you so much.

Tharun Moorthy:

Thank you. Thank you guys.

Mihir Vyas:

We are not done yet. Okay, we are not done yet.

Vikram:

Yes. So if there are any more questions, we'll take them up towards the end. Once again, thanks, Mihir and Tharun. There was some great coordination in how they did the talk.

Mihir Vyas:

Okay, thank you everyone.

Vikram:

And your feature requests are widely heard. We get a lot of feature requests from PhonePe, which in turn actually help us build a great product out there. So we have heard them; the password hash one, especially, has already gone into effect. There are others that are coming up. So yes, we are working on it and they should be out.

Tharun Moorthy:

Thank you. Thank you, Vikram.

Vikram:

Thanks.

Speaker 6:

[inaudible 00:28:15]

Mihir Vyas:

Thank you. Oh.