Observability and management updates in Aerospike

Steve Tuohy:

All right, hi there. Let's go ahead and get started. Few people still are coming on, but we will catch them on the recording if they need to. So I'm Steve Tuohy, Aerospike product marketing director. Thank you for joining us for today's session on Aerospike Observability and Management. We'll jump into the content and demos shortly, but let me give you quickly some of the standard webinar logistics. You are indeed muted, and we're going to keep it that way for today. We do want your feedback and your questions, though. We will try to address those as we go and certainly at the end. Please use the Q and A function in the bottom of your Zoom screen to input questions.

We are recording this session and we'll send that to you if you miss anything or would like to share that with a colleague. At the end we'll also share some ways to engage further for more extensive topics that we don't have time for today. All right, let's move on. So with that out of the way, I'm really excited to have Adam Hevenor join us today. Adam is Aerospike's main man for observability and management. He brings over six years of different O&M experience to Aerospike at different vendors such as VMware, and he's been with Aerospike just about a year. I think what Adam's telling us in this picture is that he is focused on observability even in his free time. Adam, tell us about this shot.

Adam Hevenor:

Yeah, a little bit of an amateur photographer. This is a selfie in Rothenberg from just a few weeks ago, so.

Steve Tuohy:

Awesome. All right, well, I think what I'm hearing is observability is kind of like a selfie stick. It gives you a full view of things.

Adam Hevenor:

Fisheye lens. Yep.

Steve Tuohy:

All right. Well, so Adam's going to give a pretty high-level overview of observability and management at Aerospike, but the bulk of the time is going to be hands-on, demonstrating the tooling in action. The way we set it up or the way Adam's going to lay it out is breaking it into specific jobs to be done. So, Adam, that was a new concept to me when we first started talking about it, so why don't you lay out that concept and dive in?

Adam Hevenor:

Great. Yeah, for those that don't know the Jobs To Be Done framework coined by Clayton Christensen in The Innovator's Dilemma, it's kind of the go-to framework for product managers to think about how users hire different products and services to fulfill jobs they need to be done to accomplish their goals. So we're going to kind of take a look at different features of Aerospike, what the job they're trying to accomplish is, whether they're trying to share an Aerospike cluster across teams, or replicate data to a different data center, troubleshoot, identify and fix issues, or manage config, we'll kind of put those in the framework.

The framework is I want to share cluster across teams so that I can gain operational efficiency in that case. So I'll try to reference those jobs to be done as we get to them in the slides in the demo. As Steve said, we're going to try to make this really as interactive as possible. So I'll hit questions at the end, but feel free to drop them in as we go if you think of them.

Steve Tuohy:

Fantastic.

Adam Hevenor:

Just using this Jobs To Be Done framework, kind of the meta-question here is what are people hiring Aerospike to do? We are selected as the real-time data platform really anytime that large scale of data and total cost of ownership come up. Those are kind of industries that need our speed and performance but also have tremendous amounts of data and need to keep their TCO in check. We are a favorite of a lot of the ad tech companies out there. Online transaction processing is another common use case for us. Fraud detection, IoT, again, these are use cases that really have a high volume of data, they need that low latency, and with that, TCO always becomes a consideration. We'll touch a little bit on graph and things. We're dabbling in the ML space. Graph is now officially GA, and we have also long had geospatial support, so we do get chosen by ML teams as well.

So as I explain our stack, I'm going to kick us off with a poll. So look for a poll. You should get prompted, and please respond with your primary approach for observability. I know there's probably lots of tools that you're using for observability. We'll leave that up for a few minutes, and I will start going through this architecture diagram. So the Aerospike observability stack consists of really sort of four components. The first component is our Aerospike Prometheus Exporter, and that's the piece of our stack that really, regardless of your tool of choice, is consistent across all deployments. So this lives on each of your nodes. It calls APIs within the database to expose relevant metrics, and it puts specific tags on them and exposes them to the rest of the stack.
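To make the exporter's role concrete, here is a minimal Python sketch of reading tagged metrics in the Prometheus text exposition format and summing them by one tag. The sample text, the metric name, and the label names (`cluster_name`, `ns`, `service`) are assumptions made up for illustration, not a copy of real exporter output, and the parser ignores optional timestamps.

```python
import re

# Illustrative sample only: metric and label names are assumptions,
# not real exporter output.
SAMPLE = '''\
# HELP aerospike_namespace_client_read_success Example help text
# TYPE aerospike_namespace_client_read_success counter
aerospike_namespace_client_read_success{cluster_name="demo",ns="test",service="node-1:3000"} 1200
aerospike_namespace_client_read_success{cluster_name="demo",ns="test",service="node-2:3000"} 900
'''

# name, optional {labels}, then a single value (timestamps are not handled).
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def total_by_label(samples, label):
    """Sum sample values grouped by the value of one label."""
    totals = {}
    for _name, labels, value in samples:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals
```

Because every sample carries the same tags, the same grouping works whether the metrics land in Prometheus or in a SaaS tool.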

What we're looking at here is what we refer to as our observability stack. So that includes Time-series database, a Grafana UI, and an alert manager. But you can also configure Exporter with our OpenTelemetry collector, and that gives you the option to choose any monitoring platform of your choice. So if you prefer a SaaS provider like Datadog or New Relic, you can do that easily, and you'll get the same metrics and the same tagging conventions that we're going to review here in those tools as well. So everything that we're talking about here, you can set up in your SaaS offering also. Let's take a look at the poll results. Cool. So it looks like most folks are using this stack, so that's great to hear. I'm going to take a second, just kind of break down the pieces.

The Prometheus piece here, it kind of refers to the Time-series database at its core. The name Prometheus can mean several things. The alert manager is also part of the Prometheus project, and we package alerts for the alert manager, and that's a really important piece of our offering. Then we also include a curated set of dashboards to be used in Grafana. We'll be going through those dashboards in detail and kind of revisiting that Jobs To Be Done framework. But I want to call out all of these dashboards and alerts are included in our license and supported. So if you have technical questions and you are having trouble or have questions about these, you can feel free to reach out to us. We'll leave some links later in the presentation, and you can open support cases on it.

Steve Tuohy:

Adam, I see a few familiar names on the attendee list among our customers. So obviously, this diagram focuses on Prometheus and Grafana, what are some of the most commonly deployed third-party applications that would appear among customers you're working with?

Adam Hevenor:

Yeah. Off the top of my head, Datadog, New Relic, Splunk, Dynatrace, Elastic, these are all pretty common. We definitely see that for some customers having a SaaS, especially in a lot of clusters, it really takes the burden away from managing this yourself. A couple others that we've seen lately, Grafana Labs, Chronosphere, those support Grafana dashboards natively, so you can actually take our dashboards we're going to show here and put them into a SaaS as well.

All right, so this was kind of the setup, not really in our Jobs To Be Done framework. Let's hop onto our first job to be done. I'll hop over to our demo here and kind of kick off the interactive demo portion of things. Really, the job to be done here is understanding what to monitor. When you are setting up your Aerospike and real-time data platform, it can be kind of overwhelming, and we've really put a lot of thought into organizing a set of materials that make this approachable. I like to start with our metric reference. It's kind of in the weeds. The question this answers is what does a metric mean. A lot of our competitors, they'll just have a single line of description about the metric over in some GitHub repo.

We've put a lot of effort into tagging this, detailing when a metric was introduced, providing bi-directional links to other metrics that are relevant, giving you searchability in the metrics and also alerting conditions. We'll talk more about alerts, but anywhere you see this if-then statement we've defined in alerts in the alert manager, and we have a nice how to use this reference section that kind of explains what this looks like when you're over in Grafana. So as mentioned, I've been in the observability space for six or seven years and really never seen anything like this. It really provides that companion guide for building dashboards, understanding dashboards, and as I mentioned, it's independent of what tool you're using. So if you are using Datadog, you'd still use these same details. Of course-

Steve Tuohy:

So step one is read about the 471 or 500 metrics A to Z, right?

Adam Hevenor:

Right. The first thing you do before you use a new product is, of course, to read the directions, and we expect everyone to do that. I'm being sarcastic. That's the in-the-weeds companion. What I have on my screen now is what I call our real-time data overview. So this is a nice dashboard if you're new to Aerospike, it does a good job at highlighting your overall platform. I like to talk about how our observability stack is about observing your real-time data platform, not just monitoring your database. I'm showcasing some of our most popular features here. We have a couple of different clusters, as you can see on my screen.

Each of these clusters has multiple nodes in it and we have those clusters connected through our Cross-Datacenter Replication or XDR. We're kind of getting those high-level queues on all these things from this view. We have multiple users set up. So like I said, this is kind of when you're new to Aerospike, you're just learning about the feature and functionality, this is a good place to start.

All right, I'm going to kind of drill into some of these specifics. If we kind of go back to our agenda, we're going to cover a couple of the different jobs to be done. I'm going to kick us off with the use case around multi-tenancy. So all of the panels on this view are clickable, so they will launch you into the relevant dashboard for that piece of functionality. In this case, we're taking a look at our multi-tenancy view. What I am referring to when I say multi-tenancy is kind of what I mentioned before, that our operators, really to take advantage of all the functionality that Aerospike offers and really gain operational efficiencies, they grow their use case of Aerospike beyond a single application or a single team within their organization, and they expand that out to what we call multiple users.

So each of these users has multiple ways that we can help enforce usage of Aerospike. We have, first and foremost, a quota system. That's what I have on my screen in front of me. You can see in my example here user one is using 53% of the quota for read, 63 for writes, and user two doesn't have a quota set. So it is up to you to set the quota ahead of time, and this gives you insight into how that's being utilized. We have alerts also in place for the quotas, so if you do set a quota, you'll get an alert when that user passes the... We have 50% alert, and then this is an 80% here when you get to the red band.
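The quota bands described here can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, it assumes usage and quota are expressed in the same rate unit, and it takes the 50% and 80% thresholds from the description above rather than from any shipped alert rule.

```python
def quota_usage_percent(observed_rate, quota_rate):
    """Return percent of quota used, or None when no quota is set.

    Assumption: a quota of 0 or None means "no quota configured".
    """
    if not quota_rate:
        return None
    return 100.0 * observed_rate / quota_rate


def quota_band(percent, warn_at=50.0, critical_at=80.0):
    """Classify a usage percentage into the bands shown on the gauge."""
    if percent is None:
        return "no quota"
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"
```

With these assumptions, a user at 53% of a read quota lands in the warning band, and a user with no quota set is reported separately instead of as 0%.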

Steve Tuohy:

So a lot of information here, Adam. This is great. So would the administrator set the quotas in a view like this, or is this just the visibility of it?

Adam Hevenor:

Yeah, the quota configuration is something that you set up in the configuration file. We do have a peek at the configuration tools that we offer in a bit later on the demo. So we'll take a look at how you enable quotas from your IDE. There's kind of some default values that will help get you started and that's kind of the quota setup. Really, our observability stack is about observing and alerting on things happening in your platform. Our CLI tools provide hooks that you can integrate into your IAC platform and kind of manage your configuration with your existing tool chain.

Steve Tuohy:

Got it. So the quotas haven't been set up for the second user here, but the first user has distinct read and write quotas. Is that 53%, what's sort of the time horizon of that?

Adam Hevenor:

Yeah, so right now, we just happen to be looking at the last five minutes. You can look at different time periods. It looks like over the last 24 hours, we've actually exceeded this write quota. I have been using one of our benchmarking tools to put quite a bit of load on our cluster. Another thing we didn't talk about is this concept of chargebacks. This is another kind of job to be done when it comes to sharing your cluster. If you are a large organization and you have team one that has a certain budget, they're sharing the database with team two who's not using it at all, it's good to have a way to communicate how much infrastructure costs they're incurring for the shared database, and that's the chargeback concept. I just put this in as a customization to the dashboard. So that's really easy to do.

You can come in here and edit the queries and set a specific formula for your chargeback formula. We have a number of different metrics that are relevant to chargeback both at the user level and the set level. You'll see another chargeback metric on one of my dashboards in the demo here. It is up to you to kind of come up with an approach for chargebacks. I would kind of describe our approach to multi-tenancy as soft multi-tenancy. So we're really conscious about noisy neighbors. That is what the quotas prevent. Any single workload really cannot disrupt your overall platform. So you don't have to worry about that change your developer pushed up on a Friday afternoon. But when it comes to something like security isolation, we kind of have limited functionality. So opening up your Aerospike to the entire internet is not something we recommend. You probably want to do this within your organization.
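Since the chargeback formula is left to the operator, here is one possible approach as a Python sketch: split a shared infrastructure cost in proportion to each team's usage. The function name, the choice of usage measure, and the even split when nobody used the cluster are all assumptions for illustration, not a formula taken from the dashboards.

```python
def chargeback(total_cost, usage_by_team):
    """Split total_cost across teams in proportion to their usage.

    usage_by_team maps a team name to any non-negative usage measure
    (for example bytes stored or operations served).
    """
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        # Assumption: with no recorded usage, split the cost evenly.
        share = total_cost / len(usage_by_team) if usage_by_team else 0.0
        return {team: share for team in usage_by_team}
    return {
        team: total_cost * used / total_usage
        for team, used in usage_by_team.items()
    }
```

A team that shares the cluster but does not use it, like team two in the example above, is charged nothing under this scheme.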

Steve Tuohy:

Interesting. Yeah, thanks.

Adam Hevenor:

Great. So we kind of hit one of our jobs to be done here. Like I mentioned, multi-tenancy, great way to kind of get more out of Aerospike, use it for more use cases, and again, included in your license to take full advantage of. We see really almost all our customers head this direction over time. Another feature in that category is our Cross-Datacenter Replication. So you'll see from our map here that I have a primary data center in North America and a secondary data center in Europe. We see our customers deploy as many as six or seven replication sites in various topologies. So we're keeping it simple for this demo, but it's a really powerful feature and included in your license. Again, we have a unique approach to pricing, and we don't charge for replicated data.

You only are charged for unique data. So you can really take advantage of that and get all the performance characteristics you need, get all the failover characteristics you need. The why behind XDR is really to provide that failover protection or you can build a backup system if you need the ability to look at an archive of your data. Content distribution is another one. I mentioned some of our customers have lots of sites that they're going to, and this is, like I said, a really common approach to having our data replication in place. Again, we highlight kind of the key metrics here. The real sort of indicator metric here is lag. We're hosted in one of the cloud providers, and not seeing really any lag at the moment. Those undersea cables must be working pretty well, but plenty of throughput.

So again, these are clickable. You can drill in through to the dashboard, see what's going on with your XDR replication. It's worth pointing out this is just one of our jobs to be done. These are part of a curated set. So we have several general dashboards, just taking a look at alerts or namespace, but we also have these jobs to be done around specific deployment and features. We're looking at Cross-Datacenter Replication, multi-tenancy view, maybe you're performing an upgrade or rolling restart, we have dashboards for that as well, as well as troubleshooting latency and miscellaneous things for benchmarking as well. So we keep these tagged and organized and really give you that as a starting point.

Steve Tuohy:

So, Adam, I think it's your second tab that's the overview one, but you can navigate to say that multi-tenancy view via that map. Yeah, let's stay here for a second. So this is all out of the box, right? Day one, hour one, this is what it looks like. If my organization cares about write over one millisecond, right? You've got writes over four milliseconds, there's a count of four there right in the middle or the lower left of the screen. How easy... Oh, you're going to show me.

Adam Hevenor:

Let's do it. Yeah, so you can't read the PromQL. That's fine. Easy enough to come in here, replace one with four, do the right thing and update our label here also and-

Steve Tuohy:

Oh, all right, so it just hopped to 172.

Adam Hevenor:

Hopped to 172. That's somewhat expected. We'll talk more. Well, we won't go into too much detail about how Aerospike can really take advantage of the hardware you're running on. I am running on kind of the smallest instances from the cloud provider for this demo. So we're not going to see the tremendous performance that you can if you are kind of optimized for your storage, but pretty easy to go in here, change. You should have said five milliseconds. But yeah, really simple and easy to do. Here, I'm going to discard that.

Steve Tuohy:

[inaudible 00:25:27] only green on your screen here.

Adam Hevenor:

Yeah, I mean, this is a demo, so of course we're praying to the demo gods. Oh no, something has gone wrong. Yeah, we like to showcase dashboards that are great for demos, but really the important thing with our curated observability offering is alerts. Demos are great to put up in your NOC and let teams get familiar with the product. What's really going to trigger you to engage with the product is an alert. We have a whole set of alerts at various levels that we include with the offering. So I went ahead and did something offscreen (we don't really need to panic) and triggered this namespace set quota alert, so you get details about what is firing on this alert view. I went ahead and triggered something critical, but we have a variety of alerts and we're consistently adding to it.

So we have just info details. When a configuration is updated, for example, we set an alert that that's occurred. If you are failing what we call a best practice check, we throw a warning alert up. If you are experiencing something like client connection errors, we'll throw errors for those. All of these things are indications that something could be going wrong, but actually, we're not seeing anything functionally happen wrong yet. We'll take a look at this particular alert that I triggered here in a second. This is another safeguard we have in place for noisy neighbors. So we'll talk again about our multi-tenancy jobs we have.

So that's the different levels of alerts. As I said, everything here is clickable, so you can click into a particular node, and it may be the case you have a lot of alerts firing, so you can just join and look at the alerts on just that node. Or, in our case, we're going to drill into the namespace view in a little more detail. So again, drilling into these, you'll kind of see some common conventions. We saw this, it's called a bar gauge visual in our user quota. We have the same concept at what we call the set level. For those new to Aerospike, a set is similar to a table in a SQL database. In this case, I can see that one of the quotas has exceeded its configuration value.

Again, everything's clickable. So I can go ahead and drill into this particular set, see that the quota has been exceeded. Again, a set is a table. It's another common way that we see organizations organize their tenants. So I've included another cost metric here or a chargeback you might apply to that set. But this gives you another way to kind of zoom in on what's happening in your particular cluster, in this case, in what we call the set view.

So at this point, you're troubleshooting, you hop over, and you want to attach to your cluster. I'll go ahead and do that and log in with our ASADM tool. As I mentioned, your actual configuration is going to happen at the command line. This is kind of best practice for implementing infrastructure as code. We also like to try to make this experience consistent with what you're seeing in your observability experience. So when I logged in here, you can see right away that I got this warning. Your cluster is currently in stop-writes. Stop-writes are a fail-safe mechanism that prevents the database from using too much memory and encountering an out-of-memory exception or kill.

When you run a database, you're usually trying to use all the resources of that host, and that can mean you can run the risk of crashing, but the stop-writes functionality prevents that. When I log in here, I'll just go ahead and run this show stop-writes command. Sure enough, I can see this particular set is configured on a single node to be exceeding it. So these other ones are kind of running more a normal set of quota, and again, we try to help make this as intuitive as possible. This config value is something you can set dynamically right here from the command line, and we offer tab completion. So you can say manage config, namespace, set, test set, and then our stop-write size.
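The per-node check that the show stop-writes output surfaces can be pictured with a small Python sketch. This models only the comparison of a set's usage against its configured limit on each node. The function name, the node names, and the rule that a limit of 0 means "no limit" are assumptions for illustration; the database's actual trigger logic may differ.

```python
def nodes_over_threshold(usage_by_node, stop_writes_size):
    """Return the nodes where a set's usage has reached the limit.

    usage_by_node maps a node name to the set's usage on that node,
    in the same unit as stop_writes_size.
    Assumption: a limit of 0 (or less) means no limit is enforced.
    """
    if stop_writes_size <= 0:
        return []
    return sorted(
        node for node, used in usage_by_node.items()
        if used >= stop_writes_size
    )
```

In a situation like the one in the demo, only the single node holding more than the limit would be returned, while the other nodes stay clear.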

So I'm not going to spend too much time in the command line, but this is kind of our dynamic approach to configuration setting. We'll kind of take a look next. I mentioned infrastructure as code several times, so this might be what you do in an emergency situation. We'll also take a look at how you can do this in your IDE and run that through a validation check to make sure you're not making any mistakes in your syntax.

Steve Tuohy:

Adam, let me make sure I've got the context here. So the dashboard, you showed us lots of information synthesized nicely, and there I forget if you showed us, you might see that you're in stop-writes and then... Yeah.

Adam Hevenor:

Yeah, kind of the details will be surfaced in your alert name here. So that's kind of the indicator. At this point, if you're new to Aerospike, you may kind of find yourself asking some of these questions, and that's where logging into your ASADM portal, we kind of again put that information right in front of you. It's sometimes the case that you are going to see multiple errors stack up at once. So you may have more of these stop-writes by the time you log into your ASADM portal.

Steve Tuohy:

Got it. So you can configure the alert. So you find out about this proactively, you can see it on the dashboard, and then you're hopping into the command line to actually go one more step, more granularly, and then actually resolve it with the config.

Adam Hevenor:

Exactly, right.

Steve Tuohy:

Got it. Okay.

Adam Hevenor:

I'm going to hop back over to the slides, see if we missed anything. We talked about how to know what to monitor, our curated experience. That's where this all starts. As I mentioned, definitely use the metric reference as your companion guide as you're exploring this. We talked about multi-tenancy. So we have a blog post on the stop-writes topic about to publish, I think. Hopefully it'll be published by the time this webinar ends, or at least by the time the recording drops. So look for that on our blog. We have, as I mentioned, some of those techniques for implementing chargebacks if you need to allocate budget to specific teams.

We touched briefly on our Cross-Datacenter Replication. The why here is the failover capability, improved performance through localization. So if you have a content distribution network, you're operating at the edge, Aerospike is a great choice for that. Also, backup and archiving system, that is another common use case for XDR. There's more to the data replication story. We haven't dug into multi-site clustering with strong consistency, whole other approach and maybe another webinar. So kind of just scraping the surface there.

You saw me kind of briefly scroll through our All-Flash dashboard. This is a use case specific to All-Flash. One of the key differentiators for Aerospike is our hybrid memory architecture. What that really is is taking advantage of the hardware you're running on, whether that's some of the most expensive instances that you can get from your cloud provider, or at least they used to be until GPU shortage was a thing. So you can really take advantage of those high memory systems or you can also take advantage of commodity hardware. If you have older systems that you want to deploy at the edge, you can do that as well.

All right. This is the one more thing. So I'm going to hop over for a last demo then we'll get to Q and A. If you've used Aerospike before, you've probably encountered our aerospike.conf format. You may be familiar with it, but you probably have to go look things up each time. It is similar to JSON but not quite JSON. It's technically a little bit more robust, we have some more features than JSON does. But it's supported really by our C libraries pretty specific to Aerospike. We have found that in Kubernetes and Docker we needed a YAML specification. So we have a conversion tool called asconfig. We have a schema validation set up. So you can load that into your IDE and you'll get auto completion.

So you were asking how do you set up quotas and this is how you do it. You can come in here and enable the security stanza, and just like our metric reference, we have a config reference as well. Once you have this in your IDE, you'll get that tab completion, you'll get validation as well. So really helps to make sure that you're putting in appropriate values, you understand what they are, you get the defaults, big time-saver for managing config. I think we're onto Q and A, so let's see if you've got any questions in the chat.
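To show why a brace-delimited format converts naturally into a structured one, here is a toy Python parser that turns a simplified stanza-style fragment into nested dictionaries. It is not how asconfig works and does not follow its YAML schema. The sample fragment, including its parameter names and values, is illustrative only, and the parser handles just one `key value` pair or one stanza opener per line.

```python
# Illustrative fragment only; not a complete or validated configuration.
SAMPLE_CONF = """
# enable per-user quotas
security {
    enable-quotas true
}
namespace test {
    replication-factor 2
}
"""


def parse_conf(text):
    """Parse a simplified brace-delimited config into nested dicts."""
    root = {}
    stack = [root]  # stack[-1] is the stanza currently being filled
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            # "namespace test {" opens a stanza keyed "namespace test"
            key = " ".join(line[:-1].split())
            child = {}
            stack[-1][key] = child
            stack.append(child)
        elif line == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace")
            stack.pop()
        else:
            name, _, value = line.partition(" ")
            stack[-1][name] = value.strip()
    if len(stack) != 1:
        raise ValueError("unclosed stanza")
    return root
```

Once the config is a nested structure like this, checking it against a schema or serializing it to YAML becomes a mechanical step, which is the kind of workflow the IDE validation described above relies on.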

Steve Tuohy:

All right, yeah, good stuff Adam. So everyone please share your input in the Q and A panel. We had one or two come up that we addressed. Adam, on alerts, there's a comment here on alerts, so I'm just going to reinterpret it. Working with customers, what does that look like? I can appreciate alert fatigue and getting too many, not enough. Many Aerospike customers have this sort of running their business on Aerospike and can tolerate zero downtime and need fast alerts. So how do you see people balancing that for the sort of yellow alerts versus red?

Adam Hevenor:

Yeah, so pretty common approach to alerts is to integrate it with some of your other tooling. Slack and PagerDuty are really common tools for this. So you might have a Slack channel that's collecting all your noncritical alerts. You can go in there, see what's going on. For your critical alerts, a tool like PagerDuty or VictorOps gives you the ability to kind of escalate things as is appropriate. It will send a text message to the person on call, then to their boss and to their boss' boss. So the alert manager provides that as an integration. Then what we see is a lot of people put on additional services like these, you might hook it into a ticketing system like ServiceNow or Jira. Lots of flexibility with the alerts manager.
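The split between a chat channel for noncritical alerts and a paging tool for critical ones can be sketched as a simple routing table in Python. The destination names, the alert names, and the dictionary shape are hypothetical; in practice this routing would be expressed in the alert manager's own configuration rather than in application code.

```python
# Assumption: only critical alerts page someone; everything else goes to chat.
ROUTES = {
    "critical": "pager",
    "error": "chat",
    "warning": "chat",
    "info": "chat",
}


def route_alerts(alerts):
    """Group alert names by destination based on their severity.

    Each alert is a dict with a "name" and an optional "severity".
    Unknown or missing severities fall back to the chat channel.
    """
    routed = {}
    for alert in alerts:
        destination = ROUTES.get(alert.get("severity", "info"), "chat")
        routed.setdefault(destination, []).append(alert["name"])
    return routed
```

Keeping the noisy, lower-severity alerts out of the paging path is one common way to limit alert fatigue while still escalating the critical ones quickly.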

Steve Tuohy:

Awesome, thanks for that.

Adam Hevenor:

We have a question about our latency dashboard. Let's talk about the latency dashboard real quick. Yeah, so the way to kind of take a look at our latency dashboard is the color coding represents the percentage of, in this case, reads that are in that bucket of latency. So looking at this, we have a couple of reads happening at this 16 millisecond latency, but really the red filled in latency items are 99 percentile latencies are happening in our, or 99% of the requests are happening at this low latency level, and kind of the same on the write side. I don't have a particularly interesting workload here. So we kind of end up with a stripe of most everything is fast. There's only an occasional slow read and slow writes.

If you are seeing a real tall spike here with a lot of red on it at the top, that indicates that you have a lot of slow read or write behavior at that timeframe and that is a cause for alarm. You will get an alert on that as well. The kind of details here, the P95, 95% of latencies are below one millisecond for write, and 99.9 are below three milliseconds. So that's kind of how to utilize the latency dashboard. Yeah, so the dashboards I presented today are the latest. There's a couple of customizations going on here with the chargebacks I showed. Those are not included and most people run their Grafana in light mode. We've added a couple of places for branding here. So it will look slightly different when you load things up, but the functionality will be the same.
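The way a bucketed latency view yields figures such as "95% of writes are below one millisecond" can be shown with a short Python sketch. The bucket boundaries and counts are invented sample data, and the function names are hypothetical. Both functions are exact only at bucket boundaries, since a histogram does not record individual latencies.

```python
# Invented sample data: (bucket upper bound in ms, operations in that bucket).
SAMPLE_BUCKETS = [(1, 9600), (2, 300), (4, 96), (8, 3), (16, 1)]


def percentile_bucket(buckets, percentile):
    """Return the smallest bucket bound covering `percentile` percent of ops.

    buckets is a list of (upper_bound_ms, count), sorted by bound,
    with non-cumulative counts. Returns None when there are no ops.
    """
    total = sum(count for _bound, count in buckets)
    if total == 0:
        return None
    running = 0
    for bound, count in buckets:
        running += count
        if 100.0 * running / total >= percentile:
            return bound
    return buckets[-1][0]


def count_over(buckets, threshold_ms):
    """Count operations in buckets whose upper bound exceeds the threshold."""
    return sum(count for bound, count in buckets if bound > threshold_ms)
```

With the sample data, lowering the threshold from four milliseconds to one raises the count of slow operations sharply, which is the same effect seen when a dashboard panel's threshold is edited.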

Steve Tuohy:

Adam, I think you've said it and I know I'm going to say it at the end, but if I'm an existing Aerospike customer, this is all there for me today, right?

Adam Hevenor:

All there for you today. I'll load up our links slide. Yeah, you can get these on our downloads page and get started with it in a free trial, on your existing license, community edition. It works with all of them. So yeah, they're ready to go.

Steve Tuohy:

All right. Well, I think 45 minutes is a good amount and so let's close out. Adam brought me to this conclusion slide. Appreciate that. So thanks everyone for being here. Hope you enjoyed the mostly demo, a little bit of slides. Yeah, just to close it out, upper left. If you're not already using Aerospike, get started. Click the QR code and do the free trial, sandbox. Upper right, if you are a customer or once you get in, as we just said, this is all available to you today in our download section, including all the preset dashboards.

We have a very dynamic and broad community. Different ways to engage. Discord is one of those that we've newly established and a lot of new good interaction, questions and input there. I know Adam's been on there with a couple of observability conversations. Then lastly, Adam mentioned, I think one or maybe mentioned two blogs on this topic that have come out recently or about to come out around the, well, I've got it written here, the show stop-writes and YAML and overall here. Thanks everyone. Good interaction. Thank you, Adam, in particular, any final thoughts to leave people with?

Adam Hevenor:

Bring your fisheye lens.

Steve Tuohy:

Good one. There you go. The selfie stick for the Aerospike database. You heard it here first. All right folks, have a fantastic rest of your day. Take care.

About this webinar

Join us as we explore recent updates to Aerospike’s Observability and Management stack. In this webinar, you’ll learn how to:

  • Set up and monitor a multi-tenant, multi-cluster Aerospike deployment

  • Simplify setting up configurations, diagnosing stop-writes, and managing multi-tenant user quotas

  • Isolate failure states in complex deployments

Speakers

Adam Hevenor
Director of Product Management, AI
Steve Tuohy
Director of Product Marketing