
Data Modeling for Low Latency Querying


About this webinar

Join this session with PhonePe to learn how to use data modeling strategically for low latency queries in the Aerospike database. We’ll go into detail on the phases of data modeling and how to model your data at scale for high performance with Aerospike.

Speaker

Chaitanya Reddy
Software Engineer, PhonePe

Speaker 1:

Hello. Good evening, everybody. Good evening, everybody again. I welcome you all to this Aerospike developer meetup. We've got some amazing sessions lined up. I'd like to formally thank PhonePe for letting us host this at their venue, and for being open to non-PhonePe people coming and attending the developer meetup. So thank you for that. To start off, we have our first speaker from PhonePe, Chaitanya Reddy. He's going to speak about data modeling for low-latency querying, and after that we have two more exciting sessions, so hold on for those. If you have questions, we can take one or two right after the session in the interest of time, and towards the end we have a dedicated slot for more. While the sessions are going on, there are refreshments on the side; you can pick them up after the session or quietly in between. So, Chaitanya, over to you. Thank you.

Chaitanya Reddy:

Hey, guys. I'm Chaitanya, and I work as a software engineer on the insurance side at PhonePe. We'll get to the presentation quickly. Today we will talk about data modeling with Aerospike. I think everybody in here has worked with at least one stateful application that needed a database, and when you work with a database, the first thing you end up doing is modeling your data so that it fits your schemas and your queries, right? That's the most important aspect of using any data store, and that's basically why we're talking about data modeling with Aerospike here today. Let me quickly get started.

The agenda is pretty plain today. We'll talk about the very basics of what data modeling is, touch on how to do data modeling in Aerospike, go through some use cases and examples, and also cover how to do bin modeling within Aerospike, with CDT and secondary index examples. So, with data modeling: why do we need it, and why is it such an important aspect of your software application? It's basically the foundation of your persistence; all your storage, retrieval, and representation happen within your data model. It's one of the foundation stones you build on as your data grows and evolves. It's also one of the important things you need for scaling your databases, because the cleaner your data is modeled, the better you can scale: you know how and where you have to fill in the blocks as that evolution happens.

The other thing is performance, which I think everybody needs. When you have a data store, for the major use cases you need queries in single-digit millisecond latencies, or sometimes even lower. How you achieve that depends completely on how you store the data and how you perform your queries. That's where the whole modeling comes into the picture when you want really good performance, and the same applies to your data processing. For example, say you have a batch job where you have to go through tons and tons of data and read one specific key inside some big, long JSON object; that's where you'll end up taking too much time. But if you've modeled your database well around the queries you need, you'll be able to do much better at any kind of analysis or any kind of retrieval job.

So yeah, let's talk about what exactly data modeling is. Data modeling is no rocket science. It's taking all the data you have and giving it a visual representation: saying, okay, for this whole structure of data, I want to put it into a table with these kinds of columns, or maybe into some kind of blob structure I can query on top of. That's the visual representation you give your data. And how you do that is by giving it a logical structure. Take the basic example of a students table: these fields are called roll numbers.

These are called names, these are called phone numbers. That's logically separating your data and understanding the information and the business behind it. And the last thing you do in data modeling is define a proper schema, so that every new piece of input information has a proper schema for what goes where, what data type it should be, and how that data gets used. Very quickly, let's touch on how to start data modeling. There are multiple ways of looking at how one has to do it, but in practice the first thing you end up doing is an aggregation of your data, which we call the conceptual model.

So first you bring the big picture to the table: out of all the business use cases I have, these are my data inputs, these are my data points, these are the rules I want to apply on top of my data. That's capturing the complete information about the kind of data you want to work with. That's conceptual modeling: you don't care how you want to structure it, you don't care what the underlying data store is; your only worry is to have a complete, detailed, and segregated view of the data you want to operate on. Second comes the logical separation, or logical model, where you say: okay, out of all this data, I've identified these data points and these keys.

And then I'll give it a logical structure. For example, if you have a group of data with profile information, some transactional information, some workflow information, all you have to do is first logically separate it and say these data points belong to one group, these belong to another, and so on and so forth. That's logical separation, or logical modeling, on top of your data. Once you're done with that, the final step is building a physical model. Physical model essentially means you work with a specific data store. For example, you work with Aerospike, MySQL, MariaDB, or something else, and every data store comes with its own definitions, its own nuances and nomenclature for how you have to use it.

So representing your logical data on top of the specific data store you work with is the physical model of your data. That's the high-level view of data modeling: start with the conceptual model, then go logical, and then build the physical layer on top of all of it. As for exactly how you do it, there are tons of approaches: people use hierarchical models, entity-relationship models, even hybrid models to structure the data. That's a secondary decision once you've settled on this overall flow. So now that we know what data modeling is, how to do it, and how important it is, let's talk about how to do it in Aerospike.

And for that, let's do a very quick recap of Aerospike's data structures and how they're laid out, so that we understand how to model within Aerospike. I think everybody who has worked with Aerospike knows this picture, so I'll just quickly touch on the nomenclature we have within Aerospike. The big box you see as an AS node is one single node, which can have multiple namespaces, each a logical separation of the database within your node or cluster. Within a namespace you have multiple sets, within sets you have multiple records, and each record has multiple bins; records are identified by keys. That's the complete set of data structures Aerospike offers for storing your data.

We'll go into detail in the further slides, but I'll just quickly touch on what each of these holds. A namespace, again, is an isolation, and a set is also a logical isolation. A record has multiple bins within it, and it's also called an object; we may refer to it as an object in further slides. There are predefined meta bins stored along with each record, which the system needs, written once at create time or maintained as you write: the last-update time, generation, and other meta bins, which Aerospike uses to manage your data and, at some point, to serve your queries as well.

And last come the bins. Bins can hold multiple data types: integers, strings, GeoJSON, and with newer releases we also have lists, maps, and other types as part of your bins. The last piece is the identifier, which is called the key. An Aerospike key is basically a tuple of your namespace, your set, and the unique identifier your application uses to identify the record. That tuple is the key Aerospike holds.

So yeah, now that we know this structure, let's relate it to a very well-known database model, the RDBMS, and see how the data structures map, so that they're easier to talk about in the further slides.

So if you see, a namespace is like a database, in which you can have multiple tables; a set is a table, a record is a row, and a bin is a column. So when I say we can have multiple bins, it's as simple as saying I can have multiple columns in a table. When I say I can have multiple namespaces, it means that just as a DB server can hold multiple databases, a node can hold multiple namespaces. But there are real differences in how independent databases work.

I do not want to over-correlate them; it's just a visual mapping so that we understand how Aerospike's data structures relate to an RDBMS. And if we call an RDBMS a tabular structure, then Aerospike is essentially a row structure, because all you ever have is one single row and an identifier for that row. That's what you end up working with in Aerospike. And the primary index in Aerospike plays a role very similar to the primary key in your RDBMS: just as the primary key indexes your rows there, the Aerospike key feeds the primary index here. That's basically how the two are related.

And similarly, just as a table can have any number of columns of different types, where you define a varchar, an integer, a bigint, and so on in the RDBMS, an Aerospike record can have multiple bins of different types: a string, an integer, and more. That's the same relation you can see in the list here.

So maybe I'll just do a very quick read; I think people will follow easily. This is a simple record example showing the tuple we talked about. The first thing you see is test, which is the namespace. Then user_rankings is the set. The last part is the unique identifier, which is my user ID here, and I can have any type of bin. You can see three tuples; each one is a different record, and what you see at the end is an integer, or a map, or a list, multiple combinations for each record (see the client sketch below). So that's a record example. Now we know what Aerospike is and what data modeling is; next we need to know how to do it at scale with very good performance in Aerospike. That's what we're going to talk about in the next slides.
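As a companion to that slide, here is a minimal sketch of the namespace/set/key tuple and mixed bin types using the Aerospike Java client; the host, bin names, and values are illustrative, not taken from the slide.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class RecordExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // The Aerospike key is the tuple: (namespace, set, user key).
        Key key = new Key("test", "user_rankings", "user1234");

        // Bins in one record can hold different types: an integer and a list here.
        client.put(null, key,
            new Bin("rank", 42),
            new Bin("recent_scores", java.util.Arrays.asList(91L, 87L, 78L)));

        // Primary-index lookup by key returns the whole record.
        Record record = client.get(null, key);
        System.out.println(record.bins);
        client.close();
    }
}
```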

So yeah, before even jumping into the modeling, there are some basic things we have to capture first and have ready, so that before making any choice we understand why the choice is being made. There are a bunch of things I want to list down before starting the data modeling. The first is a very obvious one: list all my read and write query patterns. For example, with a student database, the first thing you'd want to know is: do you query by student name, by student ID, or by the student's roll number, something of that sort?

Those are the read patterns. Do you update by a particular student ID, or by name, independently or together? All of those are your write patterns. So you basically have to identify your read patterns and write patterns first in order to understand the choices you want to make. The other thing, which is very important, is choosing what kind of system you want: a CP system or an AP system, in terms of the CAP theorem. It's basically a trade-off. Do you want a highly consistent system or a highly available system? That also defines what your setup and your modeling choices are going to look like.

Along with these things, you also have to understand what data you want to store and what the traffic patterns are. When I say traffic patterns, I mean usage: at what point, and what kind of requests can come. I can point and say I'll have a burst of requests that always read this kind of data, while other data might be important but very rarely used, or fine to serve at higher latencies; you'd choose a different model for those cases. Hence, you also need to understand your traffic patterns along with your data. The last part is understanding how your data will grow, how your model will grow, and how evolution of the data and the business requirements comes into the picture. Based on that, you can see how extensible things have to be and how you'll extend to further use cases.

So before jumping into the final modeling structures, the major thing I want to touch on is some basics of Aerospike architecture, so that we understand how the data you store or persist in Aerospike gets segregated within the cluster, and therefore how optimally you can query it and write back to it whenever you need. I'll cover three major parts of the architecture: the primary index, the secondary index, and the set index. The primary index is where Aerospike identifies where your data exists, the data you're trying to query; it's a mixed data structure of a hash table and distributed trees.

So if we look at how data is actually fetched: every namespace has a list of partitions, fixed strictly at 4096; Aerospike maintains 4096 partitions for every namespace you create. For every key, Aerospike generates a digest. The key we're talking about is the tuple again: namespace, set, and user key. From that key, Aerospike generates a digest using RIPEMD-160 hashing, which produces a 160-bit, 20-byte digest. Twelve bits of that digest determine which partition the record lives in (2^12 = 4096), and from the partition, Aerospike figures out which node it belongs to. Once you get to the node, there is a hash table keyed by partition ID, and under it a tree where your data actually exists.

Now, from the partition ID, the tree you reach is a red-black tree, which gives O(log n) operations for all the use cases: search, insert, and delete. These trees are also called sprigs; when you work with Aerospike, the word sprig refers to one of these red-black trees holding a partition's index entries. That's the primary index of Aerospike, and it's used for all the lookups, scan operations, and batch operations; every write and read you do goes through it.
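To make the key-to-partition path concrete, here is an illustrative sketch of the arithmetic. It is not the client's exact internal byte layout, and the real digest input also encodes the set name and key type; it assumes the BouncyCastle library for RIPEMD-160.

```java
import java.nio.charset.StandardCharsets;
import org.bouncycastle.crypto.digests.RIPEMD160Digest;

public class PartitionSketch {
    public static void main(String[] args) {
        // Hash some key material with RIPEMD-160 to get a 20-byte (160-bit) digest.
        byte[] input = "user_rankings:user1234".getBytes(StandardCharsets.UTF_8);
        RIPEMD160Digest md = new RIPEMD160Digest();
        md.update(input, 0, input.length);
        byte[] digest = new byte[20];
        md.doFinal(digest, 0);

        // 12 bits of the digest are enough to pick one of 2^12 = 4096 partitions.
        int partitionId = ((digest[1] & 0x0F) << 8) | (digest[0] & 0xFF);
        System.out.println("partition = " + partitionId); // always in 0..4095
    }
}
```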

Let's talk about the secondary index. It has a similar role but a very different structure: it's built completely on a B-tree. Sorry, before even jumping to that, one thing first.

Secondary indexes used to be stored in the process's general memory. Now they're stored in Aerospike's shared memory, so that whenever a restart happens your index still exists and you can pick up from where you left off; this is a feature you have from Aerospike 6.1. Secondary indexes are mainly useful when you want to search by multiple keys. For example, you defined your key as a user ID, but then you decide: oh, I also want to look records up by user email, so I can search by other parameters. That's where you start using secondary indexes; I think we'll see examples in the next section. They can also be used in queries, with expressions, and in multiple other places. The last part of our architecture overview is the set index.

What we said about the primary index covers the complete namespace; a set index is basically a primary index for one particular set, there to optimize queries within a set. It has real performance implications. We used to see problems when searching at the namespace level: say a namespace holds a thousand times more records than the small set you actually care about. When you scan for that set's records without a set index, you end up scanning all the records in the complete namespace, which is overkill. Because of this, there was a hack people used: put a secondary index on the set name within Aerospike, so that you can easily scan that particular set.

But then, with Aerospike version 5.6, we got a set index that can be enabled and disabled at runtime for any set within your namespace. The slide shows a couple of examples of how you can create indexes and how you can enable or disable set indexes. Creating a secondary index needs three major things: your namespace, set, and index name, plus the bin you want to index; enabling a set index just needs the namespace and the set name you want it enabled for.
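As a sketch of both operations, with made-up namespace, set, and bin names: secondary index creation is a client API call, while the runtime set index toggle is a server info/config command (shown in the comment), assuming server 5.6+ for set indexes.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.task.IndexTask;

public class IndexSetup {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Secondary index: namespace, set, index name, and the bin to index.
        IndexTask task = client.createIndex(null,
            "test", "users", "idx_users_email", "email", IndexType.STRING);
        task.waitTillComplete();

        // Set index (enable/disable at runtime) is an info command, e.g.:
        //   asinfo -v "set-config:context=namespace;id=test;set=users;enable-index=true"
        client.close();
    }
}
```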

Okay. So now let's come to the beast in the room, which is Aerospike data modeling. Before starting on what we have to do, a very small disclaimer: nothing we discuss here is a prescription you should always follow when doing data modeling. It's a precursor to understanding what has to be done, and a set of guiding principles for how you can move forward. Everything depends on how your data looks and what your use cases are, and it's better to benchmark: take the data you have and the way you actually want to use it, try it out, benchmark it, and then make a choice based on that.

So the first thing is how you define your namespaces; for a database, that's the isolation of the complete data. Have namespaces segregated at least by business use case or by the services you typically run: it helps you with migrations; it helps when there are compliance requirements that my data should be isolated, or encrypted at some point; and it helps you manage resources, since you can say this namespace gets, say, a hundred gigs, or even a thousand gigs of what's available in the cluster. Then comes the set. A set is typically a logical separation on top of a namespace, which is better when you want simple grouping, right?

Like user profiles in one set, or user active sessions in one set. Those are the kinds of groupings you isolate with sets. The last part is your key. Your Aerospike key has to be unique, but it doesn't always have to be a single field; you can use a key that's a combination of multiple things, depending on the use case. Let's go by example. Say you have data like this: users, and a list of the ads each user has seen on a particular date. With everything listed per user ID, suppose my business query is: for this user ID, how many times has this ad been seen?

If I want to do a query like this with the per-user layout, I basically have to fetch the user's record and do the counting in my application. Instead, if you structure the key so that user ID plus ad ID together form one key, with a count against it, you can do a direct query. Again, there is a trade-off between the two approaches. In the second, I end up with many more records, roughly the number of users times the number of ads I have. It depends on your business use cases; you have to estimate how many records you'll end up with and decide what trade-off you want in the end. That's how you can use a composite key.
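Here is a minimal sketch of the composite-key variant, assuming hypothetical set and bin names: one record per (user ID, ad ID) pair with a counter bin, so the question becomes a single key lookup.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;
import com.aerospike.client.Record;

public class CompositeKeyCounter {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Composite key: userId and adId concatenated into one unique key.
        String userId = "u1", adId = "ad42";
        Key key = new Key("test", "user_ad_views", userId + ":" + adId);

        // Atomically bump the counter and read it back in one operate() call.
        Record r = client.operate(null, key,
            Operation.add(new Bin("views", 1)),
            Operation.get("views"));
        System.out.println("views = " + r.getLong("views"));
        client.close();
    }
}
```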

Again, the hash key is pretty similar to what we discussed on the previous slide: I want to search by one of the data points inside my record. One alternative is to break the data into multiple records, but then you end up updating each record every time there's an update. You could also use secondary indexes and such, but those are costly operations, right? Because everything has to be kept in a RAM-maintained tree and updated every time something changes. Instead, if you can use a hash key, take the value from your original object that you want to search by, hash it, and store a record under that hash pointing to the actual user ID, that may be cheaper. Again, as we keep saying, it's all a trade-off based on how much information and how many business use cases you have on top of these things.
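Here is a sketch of that hash-key idea under assumed names: a small pointer record keyed by a hash of the searchable value (an email here, hashed with SHA-256 purely for illustration), whose single bin holds the real user ID.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class HashKeyLookup {
    public static void main(String[] args) throws Exception {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Hash the value you want to search by.
        byte[] emailHash = MessageDigest.getInstance("SHA-256")
            .digest("alice@example.com".getBytes(StandardCharsets.UTF_8));

        // Pointer record: the hash is the key, the bin points at the real user ID.
        Key byEmail = new Key("test", "users_by_email", emailHash);
        client.put(null, byEmail, new Bin("user_id", "u1"));

        // Lookup later: hash again, read the pointer, then fetch the profile.
        Record ptr = client.get(null, byEmail);
        Record user = client.get(null, new Key("test", "users", ptr.getString("user_id")));
        client.close();
    }
}
```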

Then comes the bin modeling you have to do. We've talked about namespaces, sets, and keys; bin modeling comes last. While creating bins, as we know, we can create multiple bins within a record, and bins vary in size, can use multiple different data types, and can have different usage patterns: one bin can be too big, one bin too small, anything can happen. Modeling them well matters, so that you actually get good query performance when you're working with Aerospike. The first thing to think about is a logical separation, if needed. If your business works with independent fields, for example I work with addresses separately and phone numbers separately, maybe I'd segregate them. But if my business says I never care about those individually, all I ever need is the whole user profile, then I would not split it.

That would be overkill, because I'd pay for it on every access. The second thing: try to use single bins. Aerospike has a very optimized single-bin mode. If you have a single bin that's an integer, and you keep it in memory, there's no bin overhead at all, so the data you store in Aerospike is much smaller. It stores that data directly in the index entry, in the red-black tree we just talked about, so your queries are always very fast, and your updates are very quick too, because there's no bin overhead, no updating of references, and so on and so forth. Then the last part: what's the choice of data type you want to use?

So you have created and segregated your bins; out of all of this, what should the data type be? Should I use a string? A map? A list of whatever that is? That decision rests on your query patterns and how you use the bin. Are you using it for a secondary index? Is this data frequently updated? What's the scale for this particular bin? All of these are questions we have to answer when choosing a data type. Aerospike has the primitives, right? Strings, integers, byte arrays, and so on. Along with those, Aerospike now offers multiple complex data types: maps, lists, and others, with very powerful APIs.

So for example, if you're using a map, you can operate directly on the map; if you're using an integer, you can increment and decrement, as an atomic operation within Aerospike, which takes the heavy lifting off the application: you don't have to manage any of this in your code. We should try these before settling on any kind of model so we know how well we can use them. Aerospike also supports UDFs, but try to prefer the native data types, which include the complex ones as well: prefer the CDTs or the primitives so that your query performance stays really, really quick. So let's touch on the two data types we keep talking about. We have multiple complex data types with very good APIs on top, which you can use for many different use cases.

One of them is the map. Here's an example of a record bin that holds a map. You see the record ID at the top, test, student_ranks, and an ID, and then a map bin called ranks: each rank maps to a subject. It's just a simple map, where rank is the key and subject is the value, and if you want, you can create secondary indexes on top of this particular bin as well.
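Here is a minimal sketch of that ranks map as a bin, with illustrative names; the map policy used in it (key ordering, write flags) is exactly what's described next.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.MapOperation;
import com.aerospike.client.cdt.MapOrder;
import com.aerospike.client.cdt.MapPolicy;
import com.aerospike.client.cdt.MapReturnType;
import com.aerospike.client.cdt.MapWriteFlags;

public class RanksMap {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "student_ranks", "student42");

        // Build a key-ordered map bin of rank -> subject with server-side puts.
        MapPolicy policy = new MapPolicy(MapOrder.KEY_ORDERED, MapWriteFlags.DEFAULT);
        client.operate(null, key,
            MapOperation.put(policy, "ranks", Value.get(1), Value.get("math")),
            MapOperation.put(policy, "ranks", Value.get(2), Value.get("physics")));

        // Read one entry without shipping the whole map to the client.
        Record r = client.operate(null, key,
            MapOperation.getByKey("ranks", Value.get(1), MapReturnType.VALUE));
        System.out.println(r.getValue("ranks")); // prints "math"
        client.close();
    }
}
```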

So yeah. Then, for every map operation you do, there are two main things to understand. One is the map policy, which covers two decisions: do you want an ordered map or an unordered one, key-ordered or key-value-ordered, and so on. And there are write types, where you can say I want create, update, replace, fail, and the other behaviors you want to apply on top of the operation. Most of these operations are supported from server 5.0.2, so whenever you try them, make sure you check server and client compatibility. And each of the map operations comes from a very rich API, like the Java collections, right?

How rich they are. These also have a very rich API: for example, I can get by key, get by range, get by a list of keys, get by multiple keys. All the operations exist within the server, so it's very easy for you to manage, and easy to offload your application's heavy lifting. Then there are multiple ways you can do a filter. In the example I've mentioned here, the keys are dates, and within each there's one more JSON, one more dictionary. Now, to do filters, all I have to do is create an index on top of, say, the event type, which here is something like generic or error, and I can just say: filter with contains where the event type equals generic.
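Here is a hedged sketch of that filter, flattening the slide's nesting to a single-level map (event ID to event type) so a map-values index can reach it; all names are hypothetical.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexCollectionType;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class EventTypeQuery {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Index the values of the "events" map bin (eventId -> eventType).
        client.createIndex(null, "test", "events", "idx_event_type",
            "events", IndexType.STRING, IndexCollectionType.MAPVALUES)
            .waitTillComplete();

        // Query: return records whose map contains the value "generic".
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("events");
        stmt.setFilter(Filter.contains("events", IndexCollectionType.MAPVALUES, "generic"));

        try (RecordSet rs = client.query(null, stmt)) {
            while (rs.next()) {
                System.out.println(rs.getRecord().bins);
            }
        }
        client.close();
    }
}
```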

So with this query, I'm able to return all three entries where the event type is generic, and you can also do a range, like I need generic, and also raw, and something else, or you can do a NOT of all of these. Filtering is very quick, so you can keep all of this within one particular JSON, one particular map, and do all your queries within it. Then there are two major use cases we can build with maps. For example, we have the use case of a sequence-number generator. If you're working with counters where you have to keep counting, maybe for rate limiting, or maybe, say, a vendor requires that you use only a specific sequence for something specific, this is the style: within a map, you do one operation, MapOperation.increment.
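Here is a minimal sketch of that counter, with hypothetical names; MapOperation.increment runs server-side as one atomic operation and hands back the post-increment value, so no client locking is needed.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.MapOperation;
import com.aerospike.client.cdt.MapPolicy;

public class SequenceGenerator {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "sequences", "order_ids");

        // One atomic server-side increment; the result is the new counter value.
        Record r = client.operate(null, key,
            MapOperation.increment(MapPolicy.Default, "counters",
                Value.get("order_seq"), Value.get(1)));

        System.out.println("next sequence = " + r.getLong("counters"));
        client.close();
    }
}
```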

I'm not sure if you're able to see it clearly; I'll share the slides later on. If you see, all I'm doing is take a bin, call MapOperation.increment with the value I want to increment by, and be done with it. The value gets incremented, I get the latest value back, and this is a distributed system, if you see. So I don't have to think about locking, I don't have to manage any of that. This is the [inaudible 00:29:09] method and this is the get method: they happen together in one atomic operation over the record, and I'm done with it. Similarly, you can do whitelist checks or blacklist checks; it's just something you do on a map, checking whether it exists or not. And just as you have maps, you also have lists. So now [inaudible 00:29:29], like how we had the [inaudible 00:29:34] map format.

Similarly, how Java collection...

Lists also have very rich use cases. You can get by index, get by value, do a contains, do range filters; every other thing is possible while using lists. And just as with the map examples, there are a lot of list examples: you can index by, filter by an index range, update something at a specific index, or get something by a specific index as well. Then there are a bunch of use cases here too. Say you're given hundreds of IDs, and you have to hand out one each time. You just put them into a list and say: give me one each time, then take it out of the list, basically do a pop, and use that value each time (see the sketch below, which also covers the queue pattern described next).
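Here is a sketch of that ID-pool pattern (and the queue described next), with made-up names: append to enqueue, remove index 0 and return its value to dequeue, all server-side.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.cdt.ListReturnType;

public class ListQueue {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "pools", "voucher_ids");

        // Enqueue: push IDs onto the list bin.
        client.operate(null, key,
            ListOperation.append("ids", Value.get("id-001")),
            ListOperation.append("ids", Value.get("id-002")));

        // Dequeue: atomically remove the head and get its value back.
        Record r = client.operate(null, key,
            ListOperation.removeByIndex("ids", 0, ListReturnType.VALUE));
        System.out.println("dequeued = " + r.getValue("ids")); // "id-001"
        client.close();
    }
}
```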

Similarly, this pop operation is like a stack, and you can also build a distributed queue without any application code. All you have to do is use two operations: append, and remove-by-index. Your enqueue works by append, your dequeue works by remove at index zero. That's it. So now that we know the data modeling, and we've seen the CDTs and everything, another core aspect is estimation: you should be able to estimate how much data you're going to end up with, because you're using secondary indexes, complex data types, or maybe something that explodes the number of primary index entries. You should be able to estimate how much cluster capacity this will take and make a call on what you have to provision. So there are two estimates that we'll talk about.

One estimate is the primary index estimate, and the other is the secondary index estimate. As we said earlier, the primary index takes 64 bytes for every record you hold. So if you have a replication factor of two, it's 64 bytes times 2 times the number of records you hold. That's the minimum primary index space you need. And where you need it, RAM or disk, depends on how your namespace is modeled: how the storage engine is configured for the namespace, and how the index is configured for the namespace.

Then, for every namespace, there are also sets. If you have enabled the set index, it adds about four megabytes for each set you have, again multiplied by the replication factor. For records, there's roughly a 50-byte overhead in memory and approximately 65 bytes on disk; these are approximations that shift with the other choices you make: data types, complex queries, bin policies, and expiration policies. Then the total storage you might require is: size one record, multiply by the replication factor, and you end up with the total for the records you need.
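As a back-of-envelope check using the numbers from the talk (64 bytes of primary index per record, replication factor 2; the record count is an assumption for illustration):

```java
public class SizingEstimate {
    public static void main(String[] args) {
        long records = 500_000_000L;   // assumed record count
        int replicationFactor = 2;
        long piBytesPerRecord = 64;    // primary index entry size

        long primaryIndexBytes = records * replicationFactor * piBytesPerRecord;
        // 500M x 2 x 64 B = 64 GB, i.e. roughly 59.6 GiB across the cluster.
        System.out.printf("primary index: %.1f GiB%n",
            primaryIndexBytes / (1024.0 * 1024 * 1024));
    }
}
```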

The last piece we'll talk about is the secondary index estimate. Similar to how you did one record's estimate, then multiplied by the replication factor and by the number of records, for every secondary index you've enabled or are using, you calculate the per-entry size and multiply by the number of entries you'll hold. Again, there's a per-entry overhead in bytes: so many bytes for the bin, some bytes for [inaudible 00:32:59]. Secondary indexes work on a B-tree, and the B-tree needs extra room for its own operation, somewhere in the range of 1x to 2x extra space, so you have to account for the B-tree and its operations on top of the per-record value.

And for sizing, the threshold is basically the number of unique values for that particular secondary index, similar to the cardinality of an index in a MySQL DB, right? You can also see the data-type variation is very skewed: for an integer, the per-entry overhead is around 16.25 bytes, whereas for strings, I think, the overhead is pretty high, nearing 30. So again, it depends on what data types you're using, what type of queries you want to perform, and how you want to use these kinds of things. Summing it up: model, then estimate, and then run your AS benchmarks on top of it; that's what you want to do. So yeah, that's...

Speaker 1:

Thanks, Chaitanya, for that informative session on how complex data types can be used within Aerospike to solve your data modeling challenges. We're open to one or two quick burning questions. And like I mentioned, there's a dedicated slot after the two other sessions for any more questions you have. So, one question around here.

Speaker 5:

Have you been able to benchmark bin scans versus set scans? Let's say a record has 1024 bins.

Chaitanya Reddy:

We have actually found it very helpful. What we ended up doing once was running a scan without a set index, and what was happening was that the CPU was spiking so much it was reaching 95%. And then once we enabled-

Speaker 5:

95 when? When you're doing...

Chaitanya Reddy:

When I'm doing a simple expression search, like I'm doing a simple scan which has an expression within it.

Speaker 5:

Set scan?

Chaitanya Reddy:

Without set scan.

Speaker 5:

Without set scan. Okay.

Chaitanya Reddy:

With just a scan, it was going up to 95% to 98%. And the record count we had was pretty low, I think less than 10,000 or 20,000, but the data size within each record was huge, and the CPU was going up to 96%, with latency close to 1.5 seconds, which was very abnormal. Then the moment we enabled the set scan, it dropped significantly. It gave us single-digit latencies, around three to four ms. It actually helped a lot.

Speaker 5:

What about multiple bins in a record? Have you tried to fetch let's say 1024 bins together, a thousand bins together? Do you see a performance dip fetching one bin... Data sizes aside.

Chaitanya Reddy:

Yeah, data aside, we haven't done a complete benchmark on it, but what we have observed is that when you pull a complete record from server to client, that's where we've seen a big gap. Because when you do a read, the Aerospike server actually pulls all the bins of the record into RAM, into its working space, and then returns the bins you asked for. So the server's work isn't reduced even when you ask for only five or ten bins; but on the network exchange between client and server, we have seen a significant drop.

Speaker 5:

Of course, data size and then data on wire.

Chaitanya Reddy:

And the performance. You can see a significant performance improvement if your record sizes are very large. Because if, let's say, you have a very large record, but you're exchanging only, say-

Speaker 5:

Of course. Your NIO itself will kill that.

Chaitanya Reddy:

Yeah, that's basically how that-

Speaker 5:

Thanks, man. That's super informative. Thank you.

Chaitanya Reddy:

Thank you.

Speaker 1:

If there are more questions, Chaitanya is going to be around to take that.