Starburst, Aerospike, and the future of SQL Analytics
George Demarest:
Good morning, good afternoon, good evening, wherever you are. This is George Demarest from Aerospike, and welcome to this BrightTALK Webinar. This is the Starburst, Aerospike and the future of SQL Analytics webinar, and it features presenters from Aerospike and Starburst. I'll start by introducing Rick DeMare from Starburst. Rick has been in the big data space from before it even existed really, was at ThoughtSpot for a number of years, and is the owner of go-to-market and business development for OEM Technologies from Starburst. Good morning, Rick.
Richard DeMare:
Morning, how are you?
George Demarest:
Thanks, and I'm fine. And Matt Bushell. Matt is a veteran of Aerospike and head of product and solutions marketing at Aerospike. And will present the Aerospike story today and we have an exciting announcement that just hit the wires. That is the announcement of Aerospike SQL powered by Starburst, and it's pretty exciting, a new product that is the result of a partnership and work from the two companies. And this presentation will cover the marketplace, the new product, and what each company brings to the table. So with that, I'm going to start the slides here and we're going to start our webinar and I'm going to hand this off to Rick and Rick, why don't you take it away.
Richard DeMare:
Outstanding. Well thank you. Thanks everybody for being here today. Again, my name is Rick DeMare and I'm here to take you through how Starburst and our SQL engine will work independently as well as with our friends here at Aerospike. But prior to doing that, I just want to step us back in time just a little bit because we've been through some times here the past couple of years, and these experiences have changed the expectation of the users of technology.
So if you remember back to the early days, pre COVID, 2019, only about a third of us worked from home but by March of 2020, let's face it, everybody worked from home, crowded malls and restaurants, streets, they were all empty. What we really saw in the economy was a world war type of response to COVID, and this response affected everything and it fostered adoption, mass adoption of new technologies as a result.
Everything literally went online from all communications and entertainment and customer service, what we sold, what we bought, how we interacted with folks, everything went online forcing people to understand and appreciate what these new technologies can actually bring to the marketplace. And it fundamentally changed how the economy operated and these changes are no longer temporary. They will last forever because the user's expectations of how they access data and information have completely changed. And in this COVID economy, it's the ability to adapt to this change that we fundamentally believe our clients can use as a strategic advantage.
So let's talk about the world of data as it stands today. In the data world today, we are faced with these COVID expectations to get information to end users as quickly as possible with a fundamental problem. That fundamental problem I refer to as the four Rs.
How do we get the right data to the right user at the right time, at the right cost, because all these have an impact on user experience and your bottom line? Use cases for analytics in particular today can span multiple data sources. So we've already got to deal with a lot of sources of data to start off with. You add to the fact that data in and of itself is doubling every 18 to 24 months.
And the fact that we're going through this great cloud migration where analytic workloads, 90% of them are moving to the cloud, and you add in the factor that 90% of those workloads that are moving to the cloud are going multi-cloud and you've got this jigsaw puzzle of a problem you need to solve. The fact is that data is literally everywhere in the world today. It's on-prem, it's hybrid, it's in the cloud, it's in multiple regions in the cloud, it's in multiple clouds.
And on-prem, you still have multiple instances of data. You have different ways to store data. You've got these streaming engines on the edge like Aerospike, but you also have enterprise data warehouses on prem, you've got cloud data warehouses, you've got lakes, you've got lake houses, you've got NoSQL storage, you've got operational data, it's coming at you at different volumes, velocities, concurrencies, types, and you got to secure it and govern it and deal with egress costs. It's a mouthful. I said it really fast, because it would take a long time otherwise. It's a lot of stuff that needs to be dealt with in today's data world. And that really sets up the problem statement and the issue that I want to address right off the top. And that is we have to address the elephant in the room when it comes to analytics. For 20 years plus, as George mentioned, I've been doing this a while, right?
For 20 years plus, people have tried to take data and consolidate it in a central data warehouse to perform analytics. It was hard to do 20 years ago. In today's post COVID, hopefully post COVID world, it's literally delusional to think that this can maintain and go the same way. The average enterprise today has multiple analytic systems, as I outlined in the last slide. According to some recent studies, major enterprises have seven to 10 major warehouses or lakes that they support today.
How does that impact what we do? Well, we have these folks called data engineers and their job is to get the data to the analytical systems. Now, unfortunately, 70% of the data engineer's time is wasted, copying, pasting, moving, ETLing, ELTing data to a centralized location, which is delusional. It didn't really ever exist. These processes are manual. They take too long, they're too complex.
In many cases you have systems that are NoSQL stores, so they lack the ability to actually have a SQL interface, which is what analysts want to attach to those systems. So what do you do in those scenarios? Well, you need to have a developer write code to pull the data off and move it someplace, and then you end up with multiple copies of data which impact security. The impact of all this data movement is not only wasted time by your data engineers, but it's the lack of ability to really do analysis on your entire dataset because these centralized data stores, these EDWs, CDWs as they're called, they're really just a narrow view of all the data. At best, it's a small fraction of what your users want. At worst, it's strangling your end user's ability to interact with data and take advantage of the opportunities that could be presented to them.
I will argue that we need to change our focus when it comes to analytics. We need to change it from how fast does that one engine go, right? So I write a query, it goes fast, to how long did it take me to put all that data into that engine prior to me getting the answer? We call that time to insight. How long does it take to get to insight on these systems? And this is what it looks like. We can have a very fast data warehouse, for example, but if I ask a business question and it takes me days, weeks, or months to get the data I need to answer the question into the warehouse, then what's the value to the answer I get? Do I even want to go through the process to get to the answer in the first place? Do I remember why I was asking the question by the time I get the answer? These are the issues we face today.
So I'll ask you a couple of questions. What if the data engineers could reduce complexity, reduce the movement of data to get more reliable, more accurate information quickly and ensure data security and governance? What if the analysts didn't have to wait for data access? What if we reduce their dependency on data engineers to access data? What if they could seamlessly join data from multiple sources and use their own BI and AI tools to access the data directly in a self-serve manner? And what if the business could focus on increasing profitability and growth, because they have speed to insight, they could answer questions very quickly and easily.
They could use their resources more efficiently from an analytics perspective and they could de-risk their business decisions. That's really what we do at Starburst. We are focused on getting people access to information quickly. How do we do it? Well, we're transitioning from this single source of truth to something different.
So these data warehouses, these centralized homes for analytics, like I said, they've been around for 20 years and just because the cloud has evolved doesn't mean the cloud really made it a whole lot better from a technology perspective. What the cloud has really done from a technology perspective is they've proven one main point and that one main point is that you can separate storage from compute.
Now what do I mean by that? Well, in a cloud data warehouse, for example, our friends at Snowflake, what do they do? Right? Well, in the old days, in the Teradata days on-prem, you put storage and compute in one big box and then your infrastructure teams had to battle over how much data, how fast the queries are going to go, by configuring this big piece of hardware. In the cloud, you can separate storage from compute, so you can scale the compute infrastructure and the storage infrastructure separately.
Why? Because the network is fast enough to shuffle the data between those two infrastructures. So in our world here at Starburst, what we ask is, "Hey, if you can separate storage from compute and the network is fast enough to shuffle the data between those two sources or two systems I should say, then why do we care where the data's actually stored?" Think about that. If I can separate storage from compute and the network is fast enough to shuffle the data, why can't I store data anywhere?
That's what we do. We are a compute unit and we can connect to any data source that you like. And what that means, let me see here, is that we can provide you with a single point of access to your data. So instead of having a database connection connect to a single database, we can have our connection connect to many sources of data.
It's a one to many prospect and it looks like this. So from a technical perspective, like I said, we're a query processing engine and we have two main aspects of this engine. We have the query processing engine, which is an ANSI SQL, massively parallelized in-memory system that processes SQL and we have connectors.
On top of that, we've layered enterprise grade security and governance that can do data masking, encryption, query auditing, fine-grained access controls, like row and column level security, but as I said, we can connect to many sources of data. So relational databases, Snowflake, Teradata, things of that nature, data lakes, S3, Azure, things of that nature. We've got streaming engines and of course we've got NoSQL stores like MongoDB and of course just like our friends at Aerospike. So Aerospike did not have a SQL engine until we started working with them, right?
So we are now the SQL engine for Aerospike instances. So you can access Aerospike with your SQL tool of choice because Starburst is now embedded with that technology. And not only can you access a single Aerospike instance, but we can federate across multiple Aerospike instances like so. Our product Starburst itself can federate from multiple Aerospike instances to a relational database to a data lake, to a separate streaming engine or a different streaming engine. So we can federate one query across all of these and we don't have to move data to make that work. Not only can we attach to all these data sources, but we can connect to them anywhere. So I mentioned earlier that data is literally everywhere. Our product can connect to sources that are on-prem, run in a hybrid mode, run in a cloud run in different regions within those clouds, run on different clouds or a combination of all those things.
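The federation idea described here, one query spanning several sources with no copy step in between, can be sketched roughly as follows. This is a minimal illustration, not Starburst itself: Python's built-in `sqlite3` stands in for the query engine, each attached in-memory database stands in for a separate source, and every name in it (`aerospike_east`, `aerospike_west`, `warehouse`, `txns`, `customers`) and every row of data is a hypothetical chosen for the example.

```python
import sqlite3

# One connection plays the role of the single point of access; each attached
# database stands in for an independent source. All names are hypothetical.
conn = sqlite3.connect(":memory:")
for source in ("aerospike_east", "aerospike_west", "warehouse"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {source}")

conn.execute("CREATE TABLE aerospike_east.txns (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE aerospike_west.txns (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, name TEXT)")

# Made-up sample rows, only so the query has something to return.
conn.executemany("INSERT INTO aerospike_east.txns VALUES (?, ?)", [(1, 20.0), (2, 5.0)])
conn.executemany("INSERT INTO aerospike_west.txns VALUES (?, ?)", [(1, 30.0), (3, 7.5)])
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Edsger")],
)

# A single SQL statement reads from all three sources where they live.
FEDERATED_QUERY = """
SELECT c.name, SUM(t.amount) AS total
FROM (
    SELECT customer_id, amount FROM aerospike_east.txns
    UNION ALL
    SELECT customer_id, amount FROM aerospike_west.txns
) AS t
JOIN warehouse.customers AS c ON c.customer_id = t.customer_id
GROUP BY c.name
ORDER BY c.name
"""
totals = conn.execute(FEDERATED_QUERY).fetchall()
print(totals)
```

The point of the sketch is the shape of the query: the analyst writes one statement with source-qualified table names, and the engine resolves each name to wherever that data is stored.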
And we can do this in a manner that solves things like GDPR, compliance, data sovereignty, and also address data egress costs because those cloud companies will nick you in the backside if you start pulling data from them. We're doing this all literally without moving data. Where do we sit? Where do you sit? Well, you sit up top, you can use any popular analytical tool, AI tool, machine learning tool, connect it to Starburst and then leverage your Aerospike data directly or leverage Starburst to access all these data sources from any infrastructure that you like. So that's what we do here at Starburst. I'm going to hand it over to Matt now who's going to bring it all together here from the Aerospike side.
Matt Bushell:
All right, thanks so much Rick. One of the things you said that I really liked was talking about, in a way separating compute and storage, that you can get in a way the best of both worlds. And the reason I like that is if you look at the Aerospike architecture and how we play and how we're deployed and how we access and how we ingest data, Aerospike has high performance storage engines. Our users today are very aware of our very highly patented hybrid memory architecture and our memory storage options between being fully in memory using persistent, again hybrid or even all flash. And so for our customers, we give them these options based on their use cases. And so we view our database's abilities as really world-class and best in class. And so if you were to take a step back, all the things, Rick that you were saying before about the market and our ideally post COVID world, there are statistics out there that show yes, data growth is exploding, but if you look at the underlying data types that are growing the most real-time data is accelerating and it's the use.
And that's one of the things that really helps drive the adoption of Aerospike. And it's because they're again, an explosion of real-time data sources and the need to ingest that data. And that's the beauty of Aerospike as a database and as a data platform to really ingest that data in real time. And so we have these multiple data models today. So a lot of people know us as Key Value, Aerospike 6 really kind of brought to the fore our document capabilities. We have customers that use us for graph and time series use cases today. And now as George had mentioned at the outset, now we have the SQL model part and parcel with our database. And so if you want to access SQL, the best way you can do that today with Aerospike is Aerospike SQL powered by Starburst. And again, there are multiple ways to deploy multiple ways and use cases that Aerospike can be used and you can access Aerospike in many ways. So again, to level set, I just want to show our data platform here.
If you were to look at the combination of really the best of both worlds of Aerospike and Starburst together, there's a reason Aerospike selected Starburst, and it is that underneath Starburst is Trino, and Trino, formerly known as PrestoSQL, has the most committers in the four walls of Starburst, about 80% probably Rick is my understanding. And so it was really a no-brainer to partner with Starburst. And so you get the best of both worlds. You get with the product Aerospike SQL, you get enterprise-grade support, one location source to go to. So a lot of the benefits of the Aerospike data platform, very predictable performance even at very sizable scales, which is important because you want to analyze, you want to use SQL to pull in, access, and analyze large amounts of data. Aerospike 6's highly parallelized secondary indexes really enable very powerful SQL queries because now you can get very granular and very well structured queries to get meaningful data out.
And I think that's part of what you want, you want to democratize your data. And with Aerospike, we say that the most strategic real-time assets are in Aerospike, and now you can unleash them. You can really unleash them with Aerospike SQL powered by Starburst. And so the beauty again is really this participation. I think Rick, you guys talk a lot about being a key in the data mesh. And so now Aerospike can participate in the broader Starburst data mesh and we ourselves, as you saw in the Aerospike architecture, are deployed in many different manners from the edge. If I were to go back to the prior slide, from the edge also to a system of record, be it a feature store, there are multiple deployments of Aerospike and along now is SQL.
So if we were to look at a little bit how Aerospike SQL powered by Starburst works, we would say that one of the key differentiators, one of the beauties of partnering with Starburst and vice versa Aerospike for Starburst is that we are both very highly parallelized systems. And this is not a point that I think people are fully aware of, at least on the Aerospike side. I won't speak for the Starburst side, but the fact that we now can, Aerospike can deliver massively parallel SQL analytics on petabytes of data. And so at a high level architecturally we match up very well. The Starburst Trino workers themselves are highly parallelized and they can access multiple clusters of Aerospike. Again, we are a distributed database data platform. Aerospike has up to 4,096 partitions per namespace, which is how we define a database, and they can also be accessed in a highly parallelized fashion.
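To make the parallelism point concrete, here is a minimal sketch of how a namespace's partitions could be divided among query workers so that each worker reads its own slice concurrently. It is an illustration under stated assumptions: the 4,096 figure comes from the talk, while `split_partitions`, `scan_partitions`, and `parallel_scan` are hypothetical helpers, not the connector's actual scheduling logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

NUM_PARTITIONS = 4096  # partitions per namespace, the figure quoted in the talk


def split_partitions(num_workers: int, num_partitions: int = NUM_PARTITIONS) -> List[range]:
    """Divide partition IDs into contiguous, near-equal ranges, one per worker."""
    base, extra = divmod(num_partitions, num_workers)
    ranges, start = [], 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional partition each.
        size = base + (1 if worker < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def scan_partitions(partition_ids: range) -> int:
    """Hypothetical stand-in for a worker reading its partitions.

    Returns only the number of partitions it was assigned.
    """
    return len(partition_ids)


def parallel_scan(num_workers: int) -> int:
    """Run every worker's scan concurrently and total the partitions covered."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(scan_partitions, split_partitions(num_workers)))


if __name__ == "__main__":
    print([len(r) for r in split_partitions(3)])
    print(parallel_scan(8))
```

Because the ranges never overlap and together cover every partition ID, adding workers shrinks each worker's share without any coordination between them.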
And so you can now start to appreciate what kind of speed and performance a query run with Aerospike SQL powered by Starburst can execute with. So again, you're democratizing access to Aerospike data with tools that people use commonly today. And really if you think about it, every company, Rick is a data company, whether they realize it or not, it's the data that matters the most. And so as Aerospike is increasing its footprint in our customers, so too, do you want to leverage and use that data for more and more participants in the organization. And so that's we think one of the benefits of having this product, this joint product go to market together. And so for us offering the market is this single point of installation, configuration, operation and support. And you can do things like federate queries across Aerospike clusters.
So if we were to look a little bit deeper at some of the key capabilities, yes, you can run ANSI SQL queries now against Aerospike data. You can federate queries across... I'm sorry, were you going to say something? Okay. If you wanted to federate queries across Aerospike clusters, again, Aerospike runs very geo-distributed, you can do so. And again, just as Rick was saying, you don't need to move the data. You can create dashboards against Aerospike data. You can run these very highly performant SQL queries, not only based on parallelism, you can do predicate pushdowns using our secondary indexes. So again, very, very powerful queries can now be levied against Aerospike data. You can also leverage the cost-based optimizations. So whether you structure the joins, how you want to optimize the data being read in by each of the clusters.
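The benefit of predicate pushdown with a secondary index can be shown with a toy model. This is a sketch only: the records, the `age_index` dictionary, and both query functions are hypothetical and say nothing about how Aerospike or the connector implement it. What it shows is the difference in how many records the engine has to read when the filter is evaluated at the store instead of after the fact.

```python
from collections import defaultdict

# Made-up records standing in for a set with "age" and "balance" bins.
records = [{"id": i, "age": 18 + (i % 60), "balance": i * 10} for i in range(1000)]

# Toy secondary index on "age": value -> positions of the matching records.
age_index = defaultdict(list)
for pos, rec in enumerate(records):
    age_index[rec["age"]].append(pos)


def query_without_pushdown(age):
    """The engine pulls every record, then applies the filter itself."""
    read = list(records)
    return [r for r in read if r["age"] == age], len(read)


def query_with_pushdown(age):
    """The predicate is handed to the store, which answers from its index."""
    read = [records[pos] for pos in age_index.get(age, [])]
    return read, len(read)


if __name__ == "__main__":
    _, reads_full = query_without_pushdown(30)
    _, reads_indexed = query_with_pushdown(30)
    print(reads_full, reads_indexed)
```

Both functions return the same rows; only the number of records read differs, which is why pushing the predicate down keeps analytical queries from turning into full scans.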
If you think again about the Aerospike data platform that I showed a couple slides ago, we too also can be really deployed anywhere just like Starburst, whether it's on-prem, on the cloud, or perhaps even more importantly multi-cloud. At the same time we have enterprise grade security and again, the single point of support.
So with that, I'd like to run about five minutes of a demonstration. I will warn you it's a little quiet being a prerecorded demo, but let's take a listen a little bit. What we'll show you is really people running SQL against Aerospike. It's SQL and this will largely be the basics, but again, we'd be happy to show you at a later time more depth. So bear with me as I queue up the video.
Richard DeMare:
While Matt's queuing that up. As he said, it's a SQL demo, so you're going to see basic capabilities here within the system. It's really what these capabilities then enable that make it so powerful.
Matt Bushell:
All right.
Speaker 4:
This particular session is a brief demo of using the Starburst Enterprise against NoSQL database like Aerospike as such. Now as you can see over here, what has happened right now is we have used something like query data, which is part of the Starburst enterprise offering, which in turn helps you to expose the connectivity, the sets that are available on Aerospike tables and it'll help you to interact with the Aerospike database, which is a NoSQL via a SQL construct as such. Now from a setup standpoint, you'll be able to see that I have an Aerospike connector, which has been created over here, which is nothing but my catalog. And under the catalog you have the namespace, which is the test under which I have kept my data in Aerospike. And under that you see the set names available over here. So a couple of set names they're for your references, availability, bank demo, bank transactions, demo and Superstore.
So these are the sets which contain the respective bins, that is the columns as well as their associated data as such. Now without much further ado, what we'll do is we'll start off breaking this demo into two different sessions. One is simple set of queries as well as the complex. So when you say a simple set of queries, what are we talking about? Just like running a select star from and identifying the top 10 records from a given table.
Now what I've done is I've used demo, bank demo and exposed all the data that's available here, which contains bins like age, balance, campaign, contact, day, and default. Now let's assume that you predominantly want to create something like a case statement that groups these columns. So this particular option can be used wherein what I've done over here is I've used the ID, the age and based on the age I have given a range of zero to 25, 26 to 50, 51 to 75 and grouped them under specific groups like group one, group two, group three, and group four.
And that is what you're seeing over here. So let's assume the ID 58 and 58 belongs to group three. So this helps you to do any sort of case functions or a sort of statement as well. In addition to it, most of the ad hoc requirement is all about running the aggregation. So for example, let's assume that you want to aggregate the balance associated with the entire bank demo data that I have right now. So what I'm doing is I'm just running a very quick sum against the balance column within bank demo as a data. And look at this, you have already got the data already shown up. Now the next set of aggregations is like the min max average functions and all that stuff. So even that works seamlessly with a select query as like this wherein we have said select min of age, max of age from the bank demo set that is available on the Aerospike databases.
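The queries narrated in this part of the demo would look roughly like the SQL below. This is a sketch under assumptions: Python's `sqlite3` stands in for the Starburst engine so the statements can actually run, the table name `bank_demo` and the four sample rows are made up, and in the demo environment the set would presumably be addressed through the Aerospike catalog and the `test` namespace instead of a bare table name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank_demo (id INTEGER, age INTEGER, balance INTEGER)")
# Hypothetical sample rows; the real bank demo set is much larger.
conn.executemany(
    "INSERT INTO bank_demo VALUES (?, ?, ?)",
    [(1, 18, 100), (2, 30, 250), (3, 58, 400), (4, 95, 50)],
)

# CASE expression that buckets each record by age, as described in the demo.
age_groups = conn.execute("""
    SELECT id, age,
           CASE
               WHEN age BETWEEN 0 AND 25 THEN 'group 1'
               WHEN age BETWEEN 26 AND 50 THEN 'group 2'
               WHEN age BETWEEN 51 AND 75 THEN 'group 3'
               ELSE 'group 4'
           END AS age_group
    FROM bank_demo
    ORDER BY id
""").fetchall()

# Aggregations: the total balance, then the minimum and maximum age.
total_balance = conn.execute("SELECT SUM(balance) FROM bank_demo").fetchone()[0]
min_age, max_age = conn.execute("SELECT MIN(age), MAX(age) FROM bank_demo").fetchone()

print(age_groups)
print(total_balance, min_age, max_age)
```

Nothing here is specific to a NoSQL store, which is the point the demo is making: once the sets are exposed as tables, ordinary SQL constructs apply.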
So what's happening right now is that the select statement that you're running is converted into an API call, which in turn is communicated to the Aerospike cluster in the backend by our Aerospike connectors, by virtue of which, again, the handshake will cause the data to be sent back to Starburst. And that's why you're able to see the data as the minimum age is 18 and the maximum age is 95 as such.
Now in addition to it, you also want to see the number of records that you have on the particular table as well as the group by conditions or order by conditions; even that is supported. So these examples that I'm showing you are just a few examples. There are many, many more such SQL constructs that can be run using the Starburst Enterprise editions.
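A record count with GROUP BY and ORDER BY, as mentioned here, might be written along these lines. As before, this is only a sketch: `sqlite3` stands in for the engine, `bank_demo` is a hypothetical table name, `contact` is one of the bins named in the demo, and its values and the rows themselves are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank_demo (id INTEGER, contact TEXT, balance INTEGER)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO bank_demo VALUES (?, ?, ?)",
    [
        (1, "cellular", 100),
        (2, "telephone", 250),
        (3, "cellular", 400),
        (4, "unknown", 50),
        (5, "cellular", 75),
    ],
)

# Number of records in the table.
row_count = conn.execute("SELECT COUNT(*) FROM bank_demo").fetchone()[0]

# Records and total balance per contact type, largest group first.
by_contact = conn.execute("""
    SELECT contact, COUNT(*) AS n, SUM(balance) AS total_balance
    FROM bank_demo
    GROUP BY contact
    ORDER BY n DESC, contact
""").fetchall()

print(row_count)
print(by_contact)
```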
Matt Bushell:
Okay. So I think, Rick, we'll stop there, but I think folks can get a good sense of the capabilities at a high level there and it's kind of what you would expect, right? It's fairly straightforward. Okay, so in terms of what this means for you, there are many use cases. I don't need to tell you of SQL and Starburst, but there are many additional use cases that Aerospike is excited about because Aerospike SQL powered by Starburst can really unleash items for different user groups in your organization. So if you are using Aerospike today, you can go to these communities and say, "Hey analysts, now you can run ad hoc SQL queries. You don't need to do a new programming model, you don't need to bother." It's in a way it's kind of good for you if you're running Aerospike that within your organization now you can just say, "Hey, go levy SQL against it," and that's the beauty of it. So create dashboards.
A lot of our customers run a lot of machine learning. And the idea now that we can run so quickly against such big data sets for data scientists that can really help them do EDA and test their models quite a bit more rapidly. I'll get a little bit more, but think about audit and compliance. Those queries may not happen that regularly and so you don't need a new programming model to do that. You don't need to copy your data to a warehouse or worry about accessing it saying, "Oh, that was in my real time Edge system that Aerospike runs." Now you can levy that using Aerospike SQL. And so we really think, again, we're all data companies at some level I mean we're in the data business, if you will, and so now this is again democratizing data even more.
So let's show you a very high level use case of one of Aerospike's customers that uses Aerospike SQL powered by Starburst. And so this global brokerage really is undergoing really a broad modernization effort. And so they have back office systems running on mainframes and they have real-time applications at the edge engaging with customers, making real-time risk calculations, making 360 recommendations. And so Aerospike sits in the middle there and also communicates with the rest of their backend systems because Aerospike serves as the intraday system of record.
And so the use case that they are running is this exact audit and compliance system and they said, "Hey, we are already engaged with Aerospike." And they really were looking for a couple things. So they were looking for a single vendor for this solution. They were looking for enterprise support and they were looking for some performance aspects in terms of data concurrency. They're global, so they needed to be able to have, in this case, five users access data, to return data sets on the order of five, excuse me, 12 gigabytes and to return within a certain time horizon. So these are things that Aerospike SQL was able to execute on. I'm not saying this is the performance limits, but these are just an upfront requirements. And so from a data discovery, and again from an audit and compliance standpoint in a global brokerage, this is just one deployment model, one deployment methodology.
And so if we were to sum up a lot of the things about why Aerospike SQL powered by Starburst, I think Rick would join me in all of these things, it's about enabling existing tools to now query and Aerospike again for fastest time to insight. Again, being massively parallel, it is also about federating queries. Again, you don't need to move data, you can levy them across Aerospike clusters. Again, querying data in place for all the reasons Rick had cited earlier. It is just not if you want to participate in a data mesh, you don't want to spend time ETLing and moving, creating copies of data.
I view this as being very elegant in being able to access your data with SQL and obviously reducing complexity. So there's a lot of benefits to doing this. If you think about what it takes to... Aerospike prides itself on being low, total cost of ownership, low server footprints compared to other systems that also reduces complexity, particularly when you can query data in place. So I think with that, George, this kind of wraps up our formal part of the presentation, so I'll hand it back over to you.
George Demarest:
Very good, thank you Matt and thank you Rick. We do have some questions that have come in, so I'd like to start with one. This is a question really about how Starburst affects the Aerospike database performance. The question is, "We're curious about the footprint of Starburst on the actual Aerospike database. If you run a query using Starburst, does it basically kick off a scan against Aerospike under the hood in a production environment with tight SLA expectations? Would running a Starburst query have a detrimental effect on Aerospike key value performance?" [Inaudible 00:35:14].
Richard DeMare:
Who do you want to take that? Matt, you want to take it? I can take it.
Matt Bushell:
Well, yeah, I mean, I'll take it real quickly. I mean, you don't have to scan the entire data set. That's the beauty of secondary indexes and being highly parallelized is you no longer need that approach. Oops, I don't know why our slides are kicking back off. Scans are very expensive time wise, and so the short answer is no. So I don't know, Rick, how you would answer that.
Richard DeMare:
No, I think that's probably the simplest way to answer the question. The answer is no. You shouldn't have a situation where it's going to infect... The last thing you want to do in any analytics engagement is affect the production availability of your primary data source. So that would be something we would personally caution against and have mechanisms to get around.
Matt Bushell:
Yeah. In terms of secondary indexes, this is part of Aerospike 6 that was launched not even two months ago, and that's kind of part and parcel with how Aerospike levies today and it just makes us a world-class query engine and very efficient, so that's... Yeah, I mean from a query engine perspective, I think the timing of Aerospike SQL is beautiful because it follows very closely on the heels of this. George, were you going to say something else?
George Demarest:
I was, and actually I did put up that slide-
Matt Bushell:
There it is now, yeah.
George Demarest:
And that is that the Aerospike SQL powered by Starburst doesn't sit in the Aerospike database, right? It sits in the Starburst cluster. So we deal with Starburst coordinator and worker nodes and we have some code in there, our connect product for Presto, Trino, but it doesn't live in the Aerospike database. The database is separate. So yes, really no deleterious effect to your key value. That actually brings me to a very similar question for Rick, and that is, "How does Starburst differ or relate to the open source projects, Presto or Trino?"
Richard DeMare:
Yep, that's a good question. So as you saw here that Aerospike has a connector for Trino, we are based off of Trino. That's an open source project that's been around 12 years or so out of Facebook. There's a long story associated to it, but let's just say some of the founders of that open source code, open source project formed Presto, which then changed names to Trino for complex legal reasons. But all those key contributors are founders at Starburst. So we support and are the main contributor to the open source pipeline, open source project for Trino.
What Starburst does is then add security, these massively parallelized connectors that you see on top of these systems, enhanced capabilities. For example, we just purchased a company called Varada for data lake indexing and storage to speed up data lake query capabilities. We add all these additional features as well as enterprise support to an open source project. And if anybody on the line has used open source in the past, you understand that while they can be very powerful, you got to be an engineer to know how to actually manipulate those systems.
So we have simplified the process, put a nice UI on it, security parallelism, and a whole bunch of enhanced features that grows all the time.
George Demarest:
So by any measure, Rick, Starburst is red hot these days. Why is it taking off now, I guess?
Richard DeMare:
Yeah, well look, we've said it a couple of times now. I've been in this space a long time and I've done a couple of data startups in my lifetime and there was always something that slowed the progress of a startup, right? The example I'll give is many, many years ago I was part of the founding go-to-market team at ThoughtSpot, a search-based BI tool. And what we underestimated with that product was the resistance we would get from the analyst community who already thought they were doing query and analysis. They weren't getting that we were trying to do this for more people with search, and it took us a while to get over that hump. I thought there would be a similar hump here at Starburst because what we're really saying is we're addressing the elephant in the room. It's like the warehouse is not the central source for analytics anymore.
So all that data modeling, all that stuff you've been doing on the backend to make that work in an efficient manner that you've been fighting and fighting and fighting to try to keep up for forever, let's just throw it to the side per se. And let's now go to this new principle called data mesh. So the reason Starburst is so hot is that we've hit all this data, we've had this COVID switch, people are like, "This can't work anymore," so they're looking at the data mesh principles and we are a company that can very quickly and easily enable data mesh.
So these things are happening at the same time, and as soon as they tend to try our product, our NPS scores are super high because we actually deliver upon our promises. So it's like this perfect storm of stuff that's coming together all at once. So it's been a pretty fun ride so far.
George Demarest:
So I'm going to send a question Matt's way, it's related, and that is, Matt, we've actually had technology that integrates with Presto and Trino in our Connect product line, but this is different. Can you contrast what Aerospike Connect for Presto/Trino is versus Aerospike SQL powered by Starburst?
Matt Bushell:
Well, I think we touched on it before with that one slide where it's kind of a single install experience, single license purchase, single support experience as well. So a lot of that has been taken away. I think the beauty is now you have optionality. So if you are a Starburst customer today, you buy the Connect product and you're off and running. If you're an Aerospike customer today, you can add Aerospike SQL on top as an additional product purchase. So the experience, you saw some of it running natively within Starburst, which is great, and this is an incremental product to say the least, but that support I think makes it incredibly viable for-
Richard DeMare:
Yeah, and I'd add onto that, Matt. One of the main pieces of this is you have the ability to serve yourself with your own BI tool. One of the things Starburst does is we certify and verify BI, AI, ML connectors to our infrastructure. So you can use your tool of choice to connect to what we do and use it in the most optimum way while leveraging the security on top of it. These are things that don't come out of the box in the open source world.
Matt Bushell:
Yeah, the difference what you were talking about between just open source and Starburst. Absolutely. Yeah.
George Demarest:
I have a follow-up question to that original question about the impact of Starburst on the Aerospike database. So the follow-up question is, "If an index doesn't exist," or I guess, I assume, I mean it hasn't been created, "and analyst users run the query on this data, what would happen?" So it seems pretty straightforward: if they don't have an index, you're not going to get the benefit of having an index, which is performance. And with our secondary indexes, we're seeing 10X, 12X performance gains, and this is some of the magic that Aerospike brings to Starburst environments. We have this real-time engine, this real-time data, and now you can use the Starburst SQL engine to query it using the tools you like. Anything you'd like to add, gents?
Matt Bushell:
I mean, I would just say we have throttling in the product, so you're not going to return crazy amounts of data back to prevent this, I don't want to say abuse, but very suboptimal applications thereof. So I mean, it's a relatively straightforward answer again, but it certainly is possible.
Richard DeMare:
To me, it's just game changing. The ability to go in and access streaming data and analyze that in real time to near real time is, in many applications, risk-based applications, real-time business decision applications, IoT applications, a game changer. You want to know and/or have the ability to programmatically act upon issues when they occur. Delays of even hours or minutes can be impactful in many applications in today's world, and this addresses that.
George Demarest:
Okay. I got a question about customers that are using this technology, "Rick, Matt, what are the data sizes and what are the customers doing, joint customers doing with the data volumes and what are their experiences with these products?"
Matt Bushell:
Yeah. Well, the one example that I gave in the presentation for the global brokerage, the initial rollout is on tens and tens and tens of terabytes right now. The idea is to expose it to petabytes and petabytes of data. The product is brand new. We do have customers today that are leveraging the Aerospike data platform across their entire data sphere, if you will, within their four walls. And so again, the ability to access this should give customers and prospects a lot of peace of mind to be like, "Oh, of course, I'd want to use a highly parallelized storage engine, if you will, like Aerospike with a highly parallelized compute engine for SQL like Starburst."
Richard DeMare:
Yeah, I would add to that, that while the products in their integration together are new, one of the things that made this worthwhile for both organizations is that we know for a fact Aerospike is dealing with petabytes and petabytes and petabytes of data. Well, that's something we do on a regular basis. Just this morning, AWS in EMEA was doing a presentation on data mesh, and they called out our implementation at FINRA, where we're literally looking at 25 petabytes of data across 25-plus sources with an update of a hundred billion rows of new data per day. We're analyzing that for that organization and have been for some time. So these are both systems that can massively scale and do that in a manner that's super efficient. So it's made for the modern data workload.
George Demarest:
So when you get to data at that scale, and I'm paraphrasing another question that has come in, being able to do work on those kinds of data volumes, how do you do it without requiring a lot of data movement, which of course kills everything? And also, does this have some GDPR implications, Rick?
Richard DeMare:
Yeah. So how we do it without moving data is that we query it at its location and/or we query it in our query engine. Remember, we're separating storage and compute. In certain instances, in certain database systems, we'll actually use the database system and its query power. In the case of Snowflake, we'll use their query engine, push that query down, and then just bring back the result set. In others, we're just processing the data. We're the query processing engine. So when we're going against Aerospike, again, they're the data store, we're the processing engine, we're shuffling data across the network, and that is what's actually doing the processing. So the data is not physically leaving any location, we're just physically processing that information to get the answer that we require. What was the second piece of that question? Sorry, George.
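To make the two modes in this answer concrete, here is a minimal toy sketch in Python. It is not Starburst or Trino code: the `Source` class, its `supports_pushdown` flag, and the `query` function are hypothetical names invented for illustration. It only contrasts pushing a filter down to a source that can run it with having the engine fetch rows and filter them itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


@dataclass
class Source:
    """Hypothetical data source; not a real connector API."""
    name: str
    rows: List[Row]
    supports_pushdown: bool

    def run_filter(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # The source evaluates the predicate itself and returns only matches.
        return [r for r in self.rows if predicate(r)]

    def scan(self) -> List[Row]:
        # The source hands back every row; filtering happens in the engine.
        return list(self.rows)


def query(source: Source, predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Return (matching rows, number of rows that crossed the network)."""
    if source.supports_pushdown:
        result = source.run_filter(predicate)
        return result, len(result)
    fetched = source.scan()
    return [r for r in fetched if predicate(r)], len(fetched)


rows = [{"id": i, "amount": i * 10} for i in range(10)]
pushed, moved_pushed = query(Source("with_pushdown", rows, True),
                             lambda r: r["amount"] >= 70)
scanned, moved_scanned = query(Source("engine_side", rows, False),
                               lambda r: r["amount"] >= 70)
print(moved_pushed, moved_scanned)
```

Both paths return the same answer; what differs is how many rows the engine has to receive before it can produce that answer.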
George Demarest:
Oh, just if there's GDPR-
Richard DeMare:
Oh, yeah, there can be GDPR implications and data sovereignty implications and egress implications. So when you're in a multi-cloud scenario, what happens physically is Starburst will have its own cluster, as we've talked about before, its own compute cluster, but in the case of a multi-cloud scenario where GDPR could come up and you can't move data from point A to point B, we would physically be in a position to set up a separate cluster of Starburst in that other instance they may have.
So you have data in AWS, you have data in Azure, and you might have separate clusters of Starburst sitting there. It's a feature we call Stargate. Those would independently, through one SQL query, do their portion of the work, process the data locally, and then the only transfer that would come across would be the end result. So you minimize egress cost, you minimize GDPR requirements because you're sending the results. You are not actually sending the data. Hopefully that made sense. I only play technical on TV.
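The idea of each cluster doing its portion of the work locally and sending only results can be sketched as follows. This is a simplified Python illustration under stated assumptions, not how Starburst is implemented: the region names, sample rows, and the functions `local_partial` and `merge_partials` are all made up for the example.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, int]


def local_partial(rows: List[Row]) -> Tuple[int, int]:
    """Runs inside one region: reduce raw rows to a (sum, count) pair."""
    amounts = [r["amount"] for r in rows]
    return sum(amounts), len(amounts)


def merge_partials(partials: List[Tuple[int, int]]) -> Optional[float]:
    """Runs at the coordinator: combine per-region pairs into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Hypothetical data that must stay in its own cloud or region.
regions: Dict[str, List[Row]] = {
    "aws": [{"amount": 10}, {"amount": 20}, {"amount": 30}],
    "azure": [{"amount": 40}],
}

# Only one small (sum, count) pair per region crosses the boundary.
partials = [local_partial(rows) for rows in regions.values()]
print(merge_partials(partials))
```

In this sketch the raw rows never leave their region; the coordinator sees two numbers per region regardless of how many rows sit behind them, which is the property that keeps egress small.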
George Demarest:
Yeah, sounds good. So we have one more question I think we'll take, and that is about multi-cloud. So multi-cloud has been-
Matt Bushell:
Oh, can I just interject again, just to build a little bit on what Rick said? Aerospike being very distributed, we have our cross data center replication and also our active-active solutions with what we call our multi-site clustering. It's kind of geo or metro clustering at extreme scale, if only because the speed at which you can do active-active across big distances is significant. And the benefit of keeping your data local with our XDR approach is exactly for the reasons that we're excited about and that the question prompted: you don't need to move data, you can keep it in its locality. With security, you can make sure egress is managed appropriately. And so with our predicate pushdowns and our secondary indexes, you can really hyper-target, again, to minimize egress.
So we've done that right before, and now with Starburst, it's again a perfect overlay of capabilities. So, apologies, I should have jumped in-
George Demarest:
That's okay. So you both perfectly teed up that last question, and that is that we've been talking about multi-cloud for a number of years, first aspirationally and theoretically, second as a practical problem to solve. It sounds like multi-cloud, especially with this type of solution, is more and more coming within reach of real customers. Is that your sense of it, Rick?
Richard DeMare:
Yeah, I mean, this is the Stargate feature I mentioned a moment ago. To me, it is the single most exciting feature in the Starburst portfolio because it is a problem that just about every modern data customer today is facing, from traditional enterprises to data natives to small mom-and-pop shops. Nobody wants to get stuck in one cloud environment. Many of these clients still have significant on-prem environments. And then there are regulatory rules, egress costs, as we mentioned, GDPR issues that need to be addressed. The fact that you could use a combination of Starburst or Starburst and Aerospike together and be able to address those issues confidently and affordably, addressing those four Rs, as I mentioned earlier, the right data to the right person at the right time at the right cost, that's game changing stuff. And it's one of the main reasons we are growing at the rate we are growing.
Matt Bushell:
Yeah, I mean Aerospike has customers that run multi-cloud today, they're global multinationals that acquire different divisions, and so they're faced with that all of the time. They also like Aerospike for the flexibility because they don't want to be locked in from a data perspective. We've also seen customers move one way or another, either on-prem into cloud, or actually even the other way around, depending. So I think the beauty is you can be very flexible with Aerospike and so again, a good match with capabilities with Stargate.
Richard DeMare:
Agreed.
George Demarest:
Very good. Well, this has been a great presentation. I think we'll end it there. To finish off, this has been a BrightTALK Webinar, Starburst, Aerospike, and the future of SQL Analytics, with Rick DeMare and Matt Bushell from Starburst and Aerospike respectively. Gentlemen, thank you very much for joining us, joining me, and thanks everyone for attending. With that, we'll end this presentation. Thanks very much.
Richard DeMare:
Thank you.
Matt Bushell:
Thank you.
About this webinar
Starburst and Aerospike provide a glimpse of the future of SQL analytics with a joint solution that provides massively parallel SQL queries against real-time datasets in the Aerospike database. Starburst Enterprise is the hottest SQL engine for a host of data/analytics pipelines and use cases. Aerospike Database is the most sophisticated real-time NoSQL database on the market, providing predictable sub-millisecond data operations at gigabyte-to-petabyte scale, all at an affordable cost.
Join Aerospike Head of Product and Solutions Marketing Matt Bushell and Starburst Director of Embedded/OEM Richard DeMare as they discuss:
– The future of SQL analytics
– The challenge of “analytics anywhere” with real-time data
– An overview and demo of the joint solution
– The expanded partnership between the two industry leaders