
6 things to look for in a graph database

You can view it at https://vimeo.com/942503478

George Demarest:

So roughly speaking, George, a graph database is basically a database that stores a graph in a native graph model and enables you to process and query graph data. What else would you add to that?

George Anadiotis:

Yes, a good thing to keep in mind here is that people often get confused about the distinction between graph databases and graph processing frameworks. So depending on your use case, you may need a graph database, or you may do just as well with a graph processing framework. The difference is basically the following. If people already have a database solution, and probably most do, and they just want to do some graph analytics on top of that, it's possible that they may be able to do that by basically offloading the data to a graph processing framework and just doing their analytics there.

That basically means that they won't be able to build new applications on top of that; all you are able to do with a graph processing framework is some analytics on top of existing data. That may be enough for you, and it may be a good way to get your feet wet in the graph world, let's say. But if you want to do more, then you want a full-blown graph database, and that basically means not just being able to do reads, which is what you do with a graph processing framework, but the whole deal: create, read, update, delete, everything that you need to support in order to build applications on top of a graph data model.
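[Editor's note: to make that read-only pattern concrete, here is a minimal Python sketch, not code from the webinar, using the networkx library as a stand-in processing framework; the exported edge list is hypothetical.]

```python
# Sketch of the "offload and analyze" pattern: data exported from an
# existing database is analyzed in memory, read-only -- there is no
# create/update/delete path back into the source system.
import networkx as nx

# Hypothetical edge list dumped from an existing (non-graph) database.
exported_edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
                  ("dave", "carol")]

G = nx.DiGraph()
G.add_edges_from(exported_edges)

# Analytics on top of existing data, e.g. who is the most connected.
print(nx.degree_centrality(G))
```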

George Demarest:

Sounds good. So for the agenda today, we promised six things, and here are the six. Data modeling and frameworks, which we've touched on. Use cases: there are now fairly clearly defined categories of use cases we'll go over. We'll talk about what developers need to know. We'll then break for questions that you might have on what we've covered. Then we'll talk about the impact of graph on IT. We'll talk about issues of performance and scale, which is a pretty popular topic with graph databases. We'll touch on native graph databases versus multi-model graph database platforms. Aerospike, where I work, is a multi-model database that offers a graph data model and querying, while others like Neo4j and TigerGraph are native graph databases. Which is better? Well, you can decide. At the very end, I'll give a very brief introduction to Aerospike Graph. This is not meant to be an Aerospike product pitch; it's more about talking about, as I mentioned, the evaluation and decision-making around graph platforms. And then we'll take some final questions.

So first let's talk about data model and framework. George, overall, we've selected two of the more common ones, Labeled Property Graph and Resource Description Framework (RDF). Can you please give us an idea of what the differences are and why you would choose one over another?

George Anadiotis:

Yeah, sure. So as opposed to, let's say, the relational data model, where things are pretty straightforward, you have tables, you have views, you have [inaudible 00:04:01], and it doesn't matter what vendor you choose, the data model stays the same. In graph, things are a little bit different, because there are two variants of how people choose to model graphs. I'll start with RDF because it was, historically, let's say, the first one. And I need to emphasize here that RDF was originally conceived as a data model meant to facilitate one use case, which was publishing data on the web. So that means it comes with certain baggage, let's say, both positive and negative. Let's start with the positive.

A very positive one is that things that you model in RDF come with [inaudible 00:04:51]. That means that they're globally referenceable, which in turn makes it a very good choice if you want to facilitate data integration use cases, for example. The key abstraction that you use in the RDF world is the so-called triple. Triples correspond to a subject, a predicate, and an object, which is actually very close to the syntactic structure of language. Again, that reflects the history of RDF. Here we have an example: we have nodes that can represent subjects, we have edges that typically come as predicates, and we have objects that can be other nodes or literals. For example, if you want to [inaudible 00:05:48] a fact such as George lives in Athens, then George is your subject, lives in is your predicate, and Athens is the object. And you can model your object as a literal or as another type of node. It's an interesting concept, and I've briefly talked about the positive sides that it brings.
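[Editor's note: as a hedged sketch of that triple, here is the "George lives in Athens" fact expressed in Python using the rdflib library; the example.org namespace is hypothetical.]

```python
from rdflib import Graph, Literal, Namespace

# Hypothetical namespace: in RDF everything is identified by a URI,
# which is what makes resources globally referenceable.
EX = Namespace("http://example.org/")

g = Graph()

# The triple: subject (George), predicate (livesIn), object (Athens).
# The object can be another node...
g.add((EX.George, EX.livesIn, EX.Athens))
# ...or a plain literal value instead:
g.add((EX.George, EX.livesInCityNamed, Literal("Athens")))

print(g.serialize(format="turtle"))
```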

The negative side of RDF is that it can be quite verbose, especially compared to the alternative, which is the labeled property graph. In labeled property graphs we do away with the concept of the subject-predicate-object triple. So things are not triples. You have two key abstractions that you work with in the LPG world: you have nodes and you have edges, and that's it. And both of these abstractions can have properties. So for example, if you want to model the same fact that I mentioned previously, George lives in Athens, what you would probably do is create two nodes, one node for George and one node for Athens, and connect them with an edge labeled lives in. Or you could also just use one node, George, give it a property lives in, and fill in the value Athens.
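[Editor's note: here is a hedged sketch of both modeling options in Gremlin via the gremlinpython library; the server address and labels are hypothetical, and it assumes a running TinkerPop-compatible endpoint.]

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

# Hypothetical Gremlin endpoint.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Option 1: two nodes connected by a "lives_in" edge.
george = g.addV("Person").property("name", "George").next()
athens = g.addV("City").property("name", "Athens").next()
g.V(george.id).addE("lives_in").to(__.V(athens.id)).iterate()

# Option 2: a single node carrying the fact as a property.
g.addV("Person").property("name", "George").property("lives_in", "Athens").iterate()

conn.close()
```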

So that already hints that there are different ways to model things, but the main takeaway here is that depending on your use case, you may want to go with one or the other model, and that's a choice you need to make actually rather early in your journey.

George Demarest:

So at the very beginning of the presentation, I said this wasn't a graph 102, but a graph 101 concept I should mention is that relationships in a graph database are treated as first-class citizens, whereas if you want to implement a graph in a relational model, there's a lot of extra programming and considerations that you have to make. So that's also unique to graph databases. So you mentioned frameworks and you've named three here, starting with openCypher, and GQL has been in the news very recently. But tell me about these frameworks and their significance.

George Anadiotis:

Yeah, so we already talked a little bit about RDF. I'm not sure I would actually call it a framework; it's more like a modeling abstraction that comes with its own stack of technologies, let's say. What definitely is a framework is TinkerPop. So TinkerPop, as you can see here in the slide that you're sharing right now, also has its own stack. You can basically think of TinkerPop as a specification that comes with different layers and different implementations for each layer. And in the end it's basically a graph processing and querying framework that can sit on top of any backend and lets you do compute and querying on top of that backend using the LPG abstraction.

Now, openCypher on the other hand, and GQL, which you just mentioned, is more of a graph query language. To do the comparison with TinkerPop, let's say, it's not really a framework per se. The specification part is the query language, and from that point on, every vendor is free to implement it as they please. And to add quickly about GQL: GQL is a newly minted standard that tries to bridge the gap between different vendor implementations in the labeled property graph world, because openCypher, again, historically comes from one vendor. It was open sourced in the process, but now lots of vendors have come together to work on it and they have actually standardized it. So that's expected to bring a big boost to the graph data world.

George Demarest:

So the question is for developers or operators, does the underlying framework really affect what they do? Does someone need to know that they're using TinkerPop or whatnot, or is it really more about the query language?

George Anadiotis:

You don't really need to know the internals of the framework. I think the emphasis really is on the query language. So RDF comes with its own query language, which is called SPARQL. TinkerPop has its own query language, which is called Gremlin. And we already talked about openCypher and the new query language GQL. There are differences here obviously, and there are also similarities, but in my mind the biggest difference is in Gremlin. Gremlin, as opposed to both other query languages, is a procedural query language. That basically means that you have to write not just the query, but also, for every query, the specific way that you want it to be executed, like giving instructions to the compiler on how to run your query. You can think of it that way. Whereas SPARQL and openCypher are both declarative query languages, so you just write the patterns that you want to [inaudible 00:11:56] and the compiler does the rest. Again, there are pros and cons in [inaudible 00:12:02] but it's something that you need to know to make it [inaudible 00:12:05].
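[Editor's note: to illustrate the declarative side, here is a hedged sketch of a SPARQL pattern executed through Python's rdflib; you state what you want and the engine plans the execution. The namespace and data are hypothetical.]

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.George, EX.livesIn, EX.Athens))

# Declarative: describe the pattern; the query engine decides how to run it.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person WHERE { ?person ex:livesIn ex:Athens . }
""")
for row in results:
    print(row.person)

# Procedural contrast (Gremlin, sketched as a comment): you spell out the
# traversal steps and their order yourself:
#   g.V().has("City", "name", "Athens").in_("lives_in").values("name")
```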

George Demarest:

Very good. We're going to talk about that in a moment in terms of what developers need to know, but let's talk about graph use cases, because graph databases have really not been around all that long and they are definitely progressing and evolving. Probably the most well-known, or most numerous, are knowledge graphs, which Neo4j really pioneered. But more lately there are transactional systems, OLTP graphs if you will, and analytic systems, more like an OLAP use case. And then finally, another area that is very interesting is the application of graph technologies with artificial intelligence and machine learning. So let's look at each of those individually with some use cases for each. So George, the three we've chosen are linking identities and recommendation systems, security threat analysis, and building knowledge bases. What can you say generally about knowledge graphs? What should people know about them?

George Anadiotis:

In a way, knowledge graphs are the oldest use case of them all. We talked a little bit about the history of RDF and how it was really the first technology that was at least standardized in this space. There was graph processing before that, but the first standard was RDF. And so the initial use case was precisely building interconnected knowledge bases. To be even more specific, it was about publishing data on the web, which is the primary interconnected knowledge base, let's say. And obviously what can be done at web scale can also be done at smaller scale. And this is a use case [inaudible 00:14:10] knowledge graphs shine. And the main reason is that with knowledge graphs you also get the notion of a schema. So you get to define semantics and how things relate to other things, and that's something that knowledge graphs are very good at.

Now, when it comes to... Sorry, go ahead.

George Demarest:

No, you go.

George Anadiotis:

Okay. I was just going to talk a little bit about the other use cases, entity linking and recommendation systems. When you have concepts that are related to each other and you are able to know exactly how they're related, and this is what knowledge graphs enable you to do, you can define exactly how each concept relates to the others. Then obviously that means that you can build better recommendation systems, because you have this a priori knowledge that you don't have if you come with a blank slate. And you can also link different types of concepts. And I think it's the same principle, let's say, that is behind the security threat analysis use case. So if you come with certain a priori knowledge that you can apply, that means that you can do better than just starting from scratch.

George Demarest:

Yeah, we give a life sciences example of a knowledge base. I actually have a friend who used to work at Lincoln Center in New York City and created a graph database of all their performances and all the orchestras and all the pieces they played and all the conductors, and was able to provide an interactive interface into it. So that, as you mentioned, is the oldest. More recently, graph OLAP has become more popular. Can you describe what that means?

George Anadiotis:

Yes, sure. So as the size of data, the volume of data, has been growing, people have been taking note of the fact that many times what really matters in those datasets is not so much the volume but the connections. The value in the analysis many times comes not so much from being able to load infinite volumes, but from actually finding what the valuable connections in the data are. So this is where graph databases, and graph processing frameworks as well, shine. Just to mention a few of the use cases you've lined up here, let's start with what's probably the most famous one: PageRank, the algorithm on which the Google empire is built. And besides being a famous one, it's also a very good example of what graph algorithms and graph analytics can do. It's the whole idea of creating algorithms that find connections between your data points and leveraging those connections to infer things, basically. And you can apply that for pricing analysis or analysis of buyer behavior or... The list goes on and on, basically.
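[Editor's note: as a hedged illustration, here is PageRank on a toy link graph using Python's networkx library; the pages and links are made up for the example.]

```python
import networkx as nx

# A toy link graph: pages pointing at other pages.
G = nx.DiGraph([
    ("home", "pricing"), ("home", "blog"),
    ("blog", "pricing"), ("partner", "pricing"),
])

# PageRank scores each node from the structure of its incoming links --
# the connections, not the raw volume of data, carry the signal.
for page, score in sorted(nx.pagerank(G).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```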

George Demarest:

So the next is something that's close to our heart here at Aerospike, OLTP graphs, that is, graph databases that can handle punishing transactional environments where there are lots of reads and writes and updates. In particular, we have a pretty strong heritage in adtech, where companies are facing the loss of deterministic data, as in cookies from browsers, and identity graphs are one of the ways that these companies are establishing and resolving identity. But there are also recommendation engines. And yet another use case is fraud detection, but in a real-time manner, where the amount of time you have to resolve a fraudulent transaction is in milliseconds rather than seconds. Anything else that you would say about OLTP and graph databases?

George Anadiotis:

I like to think about it by drawing an analogy. In the relational database world, you don't normally use the same [inaudible 00:19:02] in the same database to process your transactional applications and to do your offline analytics. The same pretty much applies in the graph world. So you're looking for a different set of qualities, basically a different set of requirements, in your OLTP versus your OLAP solution. Obviously, in the OLTP use case, you're looking for low latency, transactional support, and all of those things that you would normally expect in such an environment.

George Demarest:

And finally, probably a subject that could take a webinar or 10 on its own is how graph will figure in AI and ML. It's a space that is evolving really quickly. I just did a little bit of research, and applying AI to recommendation systems using graphs is one possibility, or actually adding some reasoning and language models to knowledge graphs. And then of course there are always fraud and cybersecurity applications of this. What are you seeing out there that's interesting, George?

George Anadiotis:

Yeah, you're right. Actually, this is an umbrella term that basically hosts quite different use cases, let's say, not so much in terms of applications, but more in terms of technology and implementation. Just [inaudible 00:20:42] to emphasize here, you mentioned knowledge graph reasoning. Again, this is in fact an old use case, among the oldest ones, that has seen renewed interest. So being able to do reasoning and inference using knowledge graphs and semantics and ontologies and that kind of stuff was one of the archetypical use cases of RDF, let's say, where the idea was that if you could build a very detailed schema, what is called an ontology in that world, and also specify rules for inference, then you could infer new facts in your knowledge base. So if you have a question like, I don't know, is George a person, for example, and you already had a fact in your knowledge base that George is an employee and you have the rule that employees are persons, then you could infer that George is a person.
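[Editor's note: here is a minimal sketch of exactly that inference in Python, using rdflib together with the owlrl reasoner; an editorial illustration, with a hypothetical example.org namespace.]

```python
from rdflib import Graph, Namespace, RDF, RDFS
from owlrl import DeductiveClosure, RDFS_Semantics

EX = Namespace("http://example.org/")
g = Graph()

# Schema rule: every Employee is a Person.
g.add((EX.Employee, RDFS.subClassOf, EX.Person))
# Known fact: George is an Employee.
g.add((EX.George, RDF.type, EX.Employee))

# Materialize the facts implied by the RDFS rules.
DeductiveClosure(RDFS_Semantics).expand(g)

# The graph can now answer "is George a Person?" -> True
print((EX.George, RDF.type, EX.Person) in g)
```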

So it's an old one that has seen renewed interest, mostly because of the advent of large language models and the so-called retrieval-augmented generation [inaudible 00:21:58] that people are building to enhance them. Large language models are obviously a very different paradigm, so you don't have a priori knowledge in the same way that you have in knowledge graphs. You have, like, a big blob on which you train your model, basically. And this is why in fact those two seem to mingle well together. So on the one hand, you have the non-deterministic kind of answers that you can get from large language models. On the other hand, you have the very strict and formal body of knowledge that comes with modeling your knowledge base as a knowledge graph. So if you're able to somehow combine these two, you can maybe have the best of both worlds, and that's a very interesting direction that things seem to be taking.

George Demarest:

Yes. There's also a lot of talk about vector databases and how they may intersect with graph databases. Is there yet a rule about where you go with which, George?

George Anadiotis:

Obviously vector databases could easily be the topic for another webinar or 5 or 10, but let's just say that, as with any other specialized type of data modeling, what we see happening a lot these days is that graph databases are adding vector capabilities to their arsenal. So the idea there being: if you're already using vendor X, instead of going out there and adding yet another vendor that only does vector, maybe you can keep using your existing vendor X, which by the way also gives you some vector capabilities. It's a topic in and of itself whether that's a good idea or not, and under which conditions and so on. But let's just say that it happens.

George Demarest:

Yes, and a little commercial message: Aerospike has announced that we will be releasing vector database capabilities, so I couldn't help myself, but let's move on. So we've talked a little bit about what developers need to know already in terms of query languages, but let's just briefly talk about it again. So Gremlin, part of the open source TinkerPop technology. What can you say about Gremlin?

George Anadiotis:

Gremlin has lots of things going for it. It's vendor neutral, so you can use it no matter what backend you have there, and it's focused on property graph traversals. Gremlin is an interesting one precisely because it's not tied to any specific vendor, but it comes with its own stack. You can think of it in a way as a Java virtual machine for graphs. In the same way that a virtual machine is an intermediate layer and you don't need to know the specifics of the hardware that it's running on, TinkerPop abstracts the specifics of the backend vendor that it's running on. So in the end, you don't need to know or care about it; you just type in your Gremlin query and you're good to go.

What may be hard for some people to swallow about Gremlin is the fact that, as mentioned previously, you need to be very specific about how your query runs. And depending on how you do that, you may even get differences in performance. That basically means that you need to know the shape of your data well, and you even need to know the distribution of your data. To state it in the relational way: depending on which join, let's say, you put first, or in this case, which node you choose to traverse first, you may actually get a difference in performance. So you need to be aware of that, and you need to take it into account when writing your queries.
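[Editor's note: to make that concrete, here is a hedged gremlinpython sketch of two logically equivalent traversals whose cost differs with the data distribution; the endpoint, the labels, and the assumption that people vastly outnumber cities are all hypothetical.]

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

# Hypothetical endpoint with a (Person)-[lives_in]->(City) graph loaded.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Same answer, different work. Starting from every Person scans the big set:
people_a = (g.V().hasLabel("Person")
             .where(__.out("lives_in").has("name", "Athens"))
             .values("name").toList())

# Starting from the single Athens vertex and walking edges backwards
# typically touches far fewer elements when cities are few and people many:
people_b = (g.V().has("City", "name", "Athens")
             .in_("lives_in").values("name").toList())

assert sorted(people_a) == sorted(people_b)
conn.close()
```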

George Demarest:

Okay. We've already touched on GQL, published as an open standard this month, actually just a couple of weeks ago, based on Neo4j's Cypher. How would you relate that to Gremlin?

George Anadiotis:

It has some similarities with Gremlin, so it's also vendor neutral, even though it originated from Neo4j. Eventually it got open sourced, and now it has even got to the point where it's standardized. So vendors are already working on implementations. It's probably easier for most people to get started with, because it was intentionally built to resemble SQL to the extent possible. So it's typically much easier for people to work with, initially at least. And as opposed to Gremlin, you don't need to have that much detailed knowledge about your data to make it work.
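[Editor's note: as a hedged illustration of that SQL-like, declarative style, here is a Cypher query sent from Python through the neo4j driver; the connection details and schema are hypothetical.]

```python
from neo4j import GraphDatabase

# Hypothetical connection details for a Cypher-speaking database.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Cypher/GQL reads much like SQL: declare the pattern, get the rows back.
query = """
MATCH (p:Person)-[:LIVES_IN]->(c:City {name: $city})
RETURN p.name AS name
"""

with driver.session() as session:
    for record in session.run(query, city="Athens"):
        print(record["name"])

driver.close()
```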

George Demarest:

Okay. And finally, it sounds to me like RDF is a bit more specialized. Who is going to be choosing RDF and what are the reasons why they would?

George Anadiotis:

RDF and SPARQL, its standardized query language, also have some things going for them. To begin with, up until very, very recently, like a couple of weeks ago, it was the only real standard in the graph world. And again, interoperability is a huge plus, obviously. So you can switch vendors and just keep your queries running without any changes, and that's a huge win. And obviously the people in the LPG world realized that that was something missing in their stack, so this is why they came together and defined the GQL standard. Besides that, there are also some similarities between GQL, or Cypher, and SPARQL. They're both modeled after SQL, so they try to keep the syntax as similar as possible. Obviously this is not always 100% the case, because there are different data models after all.

Another thing that is noteworthy about SPARQL is that, in fact, it's not just a query language, it's also a protocol. SPARQL uses the HTTP protocol under the hood. That basically means that you get data integration for free. So if you have different nodes, and there can be two or three or more, it doesn't matter how many, what you basically get is federated queries for free, because all you need to know is the address of the node you want to address, and you can just type your query and it will automatically run over the wire remotely.
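[Editor's note: for instance, here is a hedged Python sketch using the SPARQLWrapper library to run a query over HTTP against a remote endpoint; DBpedia is used only as a convenient public example.]

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# All you need is the endpoint's address: SPARQL rides on HTTP, so the
# query runs over the wire on the remote node.
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Athens> rdfs:label ?label .
    } LIMIT 5
""")
endpoint.setReturnFormat(JSON)

for binding in endpoint.query().convert()["results"]["bindings"]:
    print(binding["label"]["value"])
```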

George Demarest:

Got it. All right. So the other thing that developers will come in touch with are IDEs and visualizations. For Gremlin, you have the Gremlin Console, which is basically a command line interface into Gremlin, but then you have a commercial tool called G.V. Aerospike is in a partnership with G.V, and it is available for our platform. And then each vendor has their own console: the Neo4j Browser, the Amazon Neptune visualizer, I think it's called, and TigerGraph and Arango and a number of others. Are there any other commercial IDEs or visualization tools that are especially common, George?

George Anadiotis:

There are, but I think it's important to make a distinction here. You have both IDEs and visualization tools, and these are basically aimed at different audiences. IDEs may also be visual, and the examples that you mentioned here actually showcase that, because if you're working with graphs, having a visual interface even for writing code can be very helpful. However, these tools are basically aimed at developers, people who write code. And not everyone who wants to visualize graph datasets is a developer. So you also have a different set of tools that are aimed at analysts, people who want to explore graph datasets not necessarily by typing in queries, but by way of visual exploration. You can think of it as BI for graph, basically. So you have both of these families of tools, let's say.

George Demarest:

Okay, we're going to talk a little bit about that in a moment. At this point, I want to see whether any questions have come in. One second. What are the relative size and changing emphasis of OLTP versus OLAP versus knowledge graph use cases? Is there one that's growing faster than the others at this point, George, or is it the ocean rising for all boats?

George Anadiotis:

That's actually very hard to say, because at least to the best of my knowledge, I don't think there's any hard data on this. All we have, or just to speak for myself, I only have a sense of what's going on in the market from talking to people and seeing and hearing about implementations. Based on that, I would say that knowledge graphs seem to be... Let's put it this way: knowledge graphs at least have lots of [inaudible 00:33:03] at this point, precisely because of what we already briefly touched upon, that they seem to be a very good tool to work in tandem with large language models, which is what everyone is trying to capitalize on at the moment, because when you marry the strengths of both, you get systems that are quite robust.

That said, I think the other two use cases, graph OLTP and OLAP, are both on the rise. What may be a little bit behind in comparison, for no other reason than the fact that it's a use case that came to the fore later, is graph AI. So that's a relatively new development and one that people are not very familiar with. And just to quickly add to that: one of the many hats I wear is organizing an event around graphs and graph databases and all of that stuff, called Connected Data World. We recently ran a poll and asked our audience to give us their opinion on what use cases they're mostly interested in, what their backgrounds are, and how proficient they are. And by far, the one that people were least familiar with was graph AI, but at the same time it was one that people were very, very much interested in learning more about. So it shows that there's lots of potential.

George Demarest:

I have another question that has come in. It's about use cases. Can you suggest some use cases where it's preferable to use a Labeled Property Graph versus RDF? As usual, the answer is probably it depends, but is there a clear indicator of when you should go for one or the other, George?

George Anadiotis:

Yes, I would say that if you have a use case in which inference, so being able to generate new knowledge based on your domain knowledge, is something that you have a use for, then you should be looking at knowledge graphs and specifically the RDF flavor. The same goes for use cases that are basically data integration. So if you have a scenario in which you have different nodes, different databases, different datasets, and you want to somehow unify them all... There's always the option of the so-called data lake scenario, where you basically do lots of ETL, you dump everything in one common repository, and you're done. Of course, again, there are pros and cons in that scenario. If you want to do that, fine, you don't need a knowledge graph, you don't need a graph database to do that, you just need the data lake, so go and do it. But if you don't want to do that and you want to have a federated data integration scenario, then you should definitely be looking at knowledge graphs, because this is where they shine.

If, on the other hand, you are more interested in OLTP or OLAP analytics, then maybe you should start at least by looking at the LPG solutions, basically because, due to their model, they are less verbose. If we were talking about relational, you could say that you have fewer rows and fewer columns in your database, and that makes it easier to scan. We don't have rows and columns in the graph world, but we have nodes and edges. So if you have fewer nodes and fewer edges, again, it's easier to scan.

George Demarest:

So there's another implementation I should mention, George. Some people elect to do graph use cases without a graph database. Aerospike actually has some big customers, I think PayPal and Adobe, that built graph solutions using Aerospike as the data store before we offered a graph database, so they really did it on their own. I imagine that's a pretty involved and difficult challenge to take on.

George Anadiotis:

Yes, in many ways. First of all, if your backend doesn't come with a graph API or graph processing framework, or if it's not a graph database, then it means that you need to add more components to your stack. So you have overheads in that respect. Obviously you need to learn how it works, you need to integrate it, you need to familiarize yourself with it, and you will always have the ETL and data transfer issue. You will basically have to move your data from wherever it is to wherever you're taking it to do your analysis. So yes, it's possible. In some cases it may make sense, but I think above a certain scale, let's say, it stops making that much sense.

George Demarest:

All right, let's continue with our list of six. The next one is the impact of graph on IT operations, and I just wanted to talk about the personas. Graph databases, like any database, have an operational component and a developer component, and nowadays there's applicability to data scientists. So the roles that we see for operations: obviously at the top is the CIO, but also DBAs, and I imagine graph DBAs are a specialized breed right now, plus IT infrastructure, cloud infrastructure, and DevOps. Then you have the developers themselves, with a CTO, graph developers, and graph architects. And then finally data scientists, with a CDO, data engineers, and data scientists. So I wanted to go through each one of those individually with you, George.

So first, if we look at operations, the operators define the scalability and extensibility expectations for the organization. They need to deal with the expected data volume and growth rate. They have to decide on where to deploy it: in the cloud, on prem? Are they going to use virtual machines or containers? And then any specific runtime requirements: do they have latency requirements for transactions? Do they have a lot of throughput? Are there a lot of reads and writes coming in and out? Do they have uptime SLAs? Is there anything unique to graph databases here that would be new to an IT organization or team?

George Anadiotis:

I'm not sure I would call it exactly new, but let's say that there's a variant of something that I think most relational DBAs, at least, would be familiar with. In the graph world, what you need to be aware of, which is similar to the relational world obviously, is that, again, above a certain scale you basically need to have multiple machines. And what's unique, but also similar to relational, is that you need to somehow split your data according to what makes sense. The idea there is that you want to keep data that somehow belong together on the same node.

And there are lots of strategies for how to do that, and it can be quite involved, let's say. And this is the reason why we see at least some vendors trying to provide some kind of abstraction over that, in the same way that relational vendors try to make life easier for DBAs by not necessarily exposing them to all the nitty-gritty of data distribution and splitting tables and all that. Some graph vendors have been trying to do the same. The idea is that you want to keep nodes that are somehow semantically related close, and on the same machine, so that when you do graph [inaudible 00:41:55], you don't necessarily have to go to another machine, which obviously incurs cost and latency and all of that.
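[Editor's note: as a purely illustrative sketch of why placement matters, not any vendor's actual strategy, here are a few lines of Python that count how many edges a traversal would have to follow across machine boundaries under naive hash partitioning.]

```python
import zlib

# A toy edge list: two clusters of related vertices.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e"), ("e", "f")]
num_machines = 2

def machine(vertex: str) -> int:
    # Naive placement: hash the vertex id. Related vertices often land on
    # different machines -- exactly what locality-aware schemes try to avoid.
    return zlib.crc32(vertex.encode()) % num_machines

cross = sum(1 for u, v in edges if machine(u) != machine(v))
print(f"{cross} of {len(edges)} edges cross a machine boundary")
```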

George Demarest:

So we're going to talk about performance and scale; it's an evergreen topic within IT, but also in graph in particular. But let's just go to the next one. For developers, the considerations are: they are the ones that define the problems that need to be solved and how they'll be implemented as use cases. They're probably the ones that determine the type of workload they're dealing with. Is it a transactional, analytical, or knowledge graph workload, or is there an AI/ML component? They're the ones that have to determine the expected data volume and growth rate, and that will affect how they architect their applications. I imagine they probably work with the data science teams to define the data model or schema per use case, and then integration with other systems. Those are the most obvious developer considerations. Any that are, once again, unique to graph, George?

George Anadiotis:

I think what's unique about the development experience on graph is the fact that people need to come to terms with modeling things as a graph. Most people are used to thinking in terms of tables and joins, and that doesn't always necessarily translate well to the graph world, because in graphs you don't really have joins. That's the whole idea; that's the whole reason why people go to graph databases. You can directly do multi-hop queries, and that's actually a defining characteristic. So what would be prohibitively expensive to run as a SQL query, in terms of even just writing the query, but also running it, can be orders of magnitude faster in graph. So one thing to keep in mind there is that the modeling is different, the queries are different, and you have to start fresh, in a way. You have to learn how to model your domain as a graph.
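[Editor's note: for illustration, here is a hedged gremlinpython sketch of a multi-hop traversal; the endpoint and the "knows" schema are hypothetical. Each additional hop is one more step, where SQL would need another self-join.]

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Hypothetical Gremlin endpoint with (Person)-[knows]->(Person) edges loaded.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Three hops out: friends of friends of friends. In SQL this would be
# three self-joins on a relationship table.
names = (g.V().has("Person", "name", "George")
          .out("knows").out("knows").out("knows")
          .dedup().values("name").toList())
print(names)
conn.close()
```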

George Demarest:

Yeah, one of the graph cliches I've learned is that once you start dealing with graphs, you see graphs everywhere. So I imagine that's probably the same for developers. Let's finally talk about data scientists. They are helping to define the use cases and the data models and schemas. They're the ones that are answering business pain points, or at least technical business pain points. They're the ones that look after the expected benefits from the solution: is it reducing costs, protecting revenue, generating revenue? Are there business performance gains to be made? And they're also probably somewhat determining the transactional or analytical nature of the use cases. Is the data science component pretty universal with graph databases or not necessarily, George?

George Anadiotis:

Some people actually even refer to it as graph data science, and I think it makes sense in my mind. What makes the job of data scientists who work with graphs special is the fact that they get to deal with lots of graph algorithms. We already briefly spoke about PageRank, but there are tons, really. There are centrality algorithms, there are pathfinding algorithms; I think there are even hundreds of them. And to make things a little bit easier for data scientists, there are also graph algorithm libraries, and many solutions, either graph processing frameworks or graph databases, actually come bundled with pre-implemented graph algorithms, precisely to make the job of data scientists easier.

That said, it's great to have an out-of-the-box implementation, but it's not going to be of much use if you don't know what it does, how to use it, and where it's applicable. So long story short, working as a data scientist with graph algorithms gives you a whole new range of capabilities, but also a whole new range of responsibility, because you have to learn what they are, what they do, and what makes sense in which scenario.

George Demarest:

Perfect. So we're running a little bit short on time, and the next one is a big topic. So rather than go into depth, we're just going to talk at a high level about the really unique challenges with the graph data model and distributed computing, and with making graphs high-performance, real-time, et cetera. You have the issue of scaling the data itself. Is it going to be in the terabytes? Is it growing, is it shrinking? You have to scale the compute as well: the number of people, the number of queries coming in. Can you scale storage and compute independently? Because if you go to the cloud, for instance, you don't want your costs for data and compute tied to each other, so you can right-size your budget for your needs.

The deployment architecture also affects scalability and performance. Are you using an in-memory approach? Is it a single-instance system, which clearly limits your ultimate scale but might give you great performance up to a certain data size, or do you need to go to a distributed model? Aerospike and TigerGraph are both distributed by design. Also, once again, are you virtualizing or containerizing your systems? That of course affects performance and scale. There's also the nature of graph queries themselves: dimensionality, whether your graph is very bushy or very tall, whether there are lots of edges per node, and whether you're doing multi-hop queries. Let's take that last one in particular, because it's so unique to graph. So can you talk about multi-hop queries and the challenge that they represent to graph databases?

George Anadiotis:

Yeah, sure. In a sense this is what graph databases [inaudible 00:49:13], but it's also hard to handle: the more hops, basically, the harder it is. And again, to draw the equivalent from the relational world, you can think of it as adding more joins. It may not be as hard or as complex to express in the graph world as it is in the relational world, but still the fact remains that the more hops, the more computational workload you add to your backend, and the more data locality becomes relevant and potentially a problem. So if all the vertices that you're going to traverse in order to run your query are on the same machine, then it's fine; it's predictable, let's say. If you start spanning machines, then things start getting interesting, let's say, which is precisely why data locality is a big issue and why there are lots of strategies for how to deal with it.

George Demarest:

Yes, and I imagine as the number and type of graph use cases continues to grow, that's going to provide continued challenges, especially in distributed environments. There are also performance techniques that the vendors use, including Aerospike: indexing of data, something called index-free adjacency, which we'll talk about in a moment, and then, as you mentioned, data locality, all of which affect performance. So as I said, this is a topic that could also be an hour-long webinar, so we're going to leave it there. The last subject is the topic of native graph databases versus multi-model: what are the differences, and does one have an advantage over the other?

So native databases, like Neo4j or TigerGraph, have an optimized storage layout. They're designed for one purpose, and they work for maximum query efficiency. They have this idea of index-free adjacency for fast traversals. There is a memory penalty you have to pay for that, but like I said, up to a certain size of data, that may be a worthwhile thing to invest in. For multi-model, of course, the general benefit is versatility: the ability to support multiple use cases using multiple data models, and then you get the operational and cost benefits of a single vendor, a single product, a single management interface, single documentation, and all that stuff. I wanted to point out that there's a good blog on this topic from our director of product management, Ishaan Biswas, called Demystifying native vs. multi-model graph database myths. So check that out.

George, it seems like both types of graph databases are making it work. Are there inherent benefits that you would point out of one over the other?

George Anadiotis:

I think in a way this conversation could be very similar regardless of whether we're talking about multi-model versus graph, or multi-model versus vector, or multi-model versus document, or what have you. The benefit of having a specialized database for your specific data model is that obviously it's optimized for that, so its performance is probably going to be a little bit better, or you may be able to do more in terms of the specialized use case. On the other hand, the benefit of having a multi-model database is that you can serve more use cases in general, because you're not limited to one data model, and you have less operational complexity and cost. So I don't think there's a clear-cut answer, "This is what you should do, this is what everyone should do." In all cases, it's always... It depends, basically.

George Demarest:

All right, we're running to the end of our hour. I have a couple of questions that have come in, so I'm going to end pretty quickly now. Oops. Why... Oh, that's why.

Okay, so I'm going to finish with just a really brief introduction of Aerospike Graph. We come from the massive scale space, where we do payment systems and adtech ad auctions at an incredible scale. So that is what Aerospike brings to the table: the ability to support billions of vertices and trillions of edges. We scale compute and storage independently. We are definitely targeting real-time OLTP performance use cases, at least initially; we'll be adding a bunch of OLAP capabilities with our next release coming later this year. We have the Gremlin query language. We have bulk loading, both as a batch and streaming through Spark. And we are a multi-model database. The architecture is such that, as you see, the graph service in the middle is separate from the database. It's just an Aerospike database, although the graph data model is natively rendered within the Aerospike database. The graph service is based upon Apache TinkerPop, it's available in the cloud or on-prem, and you can virtualize.

So I don't want to spend too much time, but that's what we do in the graph space. And I want to get to the final questions. The first, George, is... Carrie has asked about insights or strategies for migrating, from Neo4j specifically, but let's just take migrating in general. How portable is a graph database? For instance, if you wanted to move your graph data from Neptune or Neo4j to Aerospike or whatever, how easy is that, or how hard is that, I guess I should say? It's usually hard.

George Anadiotis:

Yes. It's not always straightforward. Obviously, it's much easier if you're migrating between vendors that support the same graph data model. From one RDF vendor to another RDF vendor it's actually pretty easy. It's one of the key benefits of going with RDF: because the data format is standardized, you can export in RDF, import in RDF, boom, you're done. And the query language is standardized, so you can keep your queries running pretty much exactly the same, unless of course you're using proprietary extensions, which obviously some vendors implement. Those are not portable, but vanilla SPARQL, let's say, is 100% portable. So integration is pretty seamless.

Integration... Sorry, porting. I meant migration. Migration among different labeled property graph vendors is less straightforward, but still possible. You would have to find some intermediate data format that both your vendor of origin and your vendor of destination support. It could be RDF, which some LPG vendors also support. It could be anything, as long as there's common ground; it could be CSV. There are also other graph formats that are supported by some vendors. But either way, you need to port your data, and then you also need to port your queries.

Now, porting your queries was probably the weakest spot up until recently, and we hope that the introduction of the GQL standard will solve that. But again, if you're migrating, let's say, between one vendor that supports the TinkerPop framework and another one that also supports it, it's fine, you won't have many issues; your queries will keep running. If you're migrating from openCypher to openCypher, same. Since Carrie was asking about migrating from Neo4j specifically, I presume that means migrating to something like Aerospike, for example.

Now, one good thing about the TinkerPop framework is that it's in fact multi-language. So in the whole concept of having this virtual machine: on TinkerPop you can in fact also run SPARQL queries and Cypher queries. However, as opposed to native Gremlin queries, they will be interpreted. They will first be translated to Gremlin and then executed. So you add a little bit of complexity and [inaudible 00:58:38] there.

George Demarest:

Okay. It doesn't sound super easy, but it does sound possible. The next question is from an anonymous attendee. Can a graph database serve as a primary OLTP database for a social network? They wonder about an activity stream query, like find the people I follow and return the most recent posts sorted descending as page one of posts; it seems like it would have to read all posts of those users to be able to order them by date descending. I guess the high-level question is: could a graph database function as an OLTP database in a social network? I imagine that would depend on the scale, but what do you think, George?

George Anadiotis:

The quick answer would be yes. That's the sort of use case that a graph database is a very good fit for. Of course the devil is in the details, and as you said, it also depends on scale. So if you have a solution that can only run in memory and your memory is limited, or it can only run on one node and your node starts running out of resources, then it may only get you so far. So yes, in theory, it's very much possible and it's actually a very good idea. In practice, you also need to take scale into consideration and choose wisely.
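[Editor's note: as a hedged sketch of that activity stream query in gremlinpython, with a hypothetical endpoint and User/follows/posted schema, the ordering and pagination happen inside the traversal, so the client doesn't have to read every post.]

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import Order

# Hypothetical endpoint with a schema like
# (User)-[follows]->(User)-[posted]->(Post {created, title}).
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Page one of the activity stream: posts by people I follow, newest first.
page_one = (g.V().has("User", "name", "me")
             .out("follows").out("posted")
             .order().by("created", Order.desc)
             .range(0, 10)
             .values("title").toList())
print(page_one)
conn.close()
```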

George Demarest:

Wonderful. All right, that really brings us to the end of our presentation. I just want to leave some final thoughts. Aerospike is in this space in a big way. We've made a serious investment in graph databases, the graph data model, and Gremlin. And George, you've been very helpful to us in figuring it all out, so thank you for that. To engage with us, you can view a demo, you can go to our product page for Aerospike Graph, there's the product doc, and there are Gremlin tutorials; there's a great online Gremlin tutorial for people who want to learn. You can also have a 60-day free trial of Aerospike Graph, so you can try it out now. So we appreciate your time, and this has been very enlightening, George. Thank you very much for your time, and thank you all for attending. Any final thoughts you have, George, for the graph world?

George Anadiotis:

Thank you for having me first of all, and yes, it's been fun and I enjoyed working with you, not just for the webinar, but also being there and watching you through your graph journey, let's say. I hope you keep going along that journey and hope that people who joined us today had a good start, let's say. And hopefully it's been helpful and we'll be seeing more people join the graph world.

George Demarest:

It's safe to say it's a fascinating topic. Just doing the research for this webinar has really been an eye-opener and your perspectives are very much appreciated. With that, we're going to end here. I want to thank everyone for coming. Oops, one second. We have one more question. Oh, it's, "Excellent webinar." Thank you. Thank you, [inaudible 01:02:00] that. That'll do it for today. Have a good day everybody, and thanks for joining.

George Anadiotis:

Thank you.

About this webinar

Graph databases have emerged as a powerful tool for managing interconnected data, but not all are created equal. Join us for an insightful webinar as we delve into the critical factors to consider when evaluating graph solutions.

  • Data modeling and frameworks

  • Operational (OLTP) vs. analytical (OLAP) use cases

  • Performance and scale

  • What developers need to know

  • Impact to IT operations

  • Native vs. multi-model platforms

Speakers

George Demarest
Director of Product Marketing
George Anadiotis
Founder at LinkedDataOrchestration