Record Aggregation in Aerospike For Performance and Economy
A strong differentiator for Aerospike vs other key value databases is its DRAM economy. Every object has a 64 byte DRAM footprint no matter what the size of the actual object is. You can manage a billion objects using only 128Gb of DRAM, allowing for 2x replication.
Great news! A billion is a pretty big number and 3 * 512GB nodes gets me to 12bn. Within reason I can have as many objects as I like. I should start making them right away with no further thought required.
Hold your horses, cowboy. It might not be as simple as that.
For instance, what if your objects are very small? Worst case, they’re so small that they’re of the order of 64 bytes, so now your memory footprint is similar to your disk footprint. It might even be the case that your memory to disk ratio is such that your DRAM is full when your disk is half empty. In a bare metal situation, buy less disk / more DRAM for sure, but you might be in the cloud where you’re stuck with certain DRAM / storage ratios. Or maybe these machines got brought to you for re-purposing.
A technique informally known as blocking can help you. You store your objects within larger objects. Your API turns this into an implementation detail. Blocking can reduce your memory footprint, helping with the small object use case.
Aerospike lets you do this by offering a comprehensive List and Map API which allows you to place objects within objects as well as retrieving them in an efficient manner. List structures can be used for storing structured data such as time series of a regular frequency while maps can be used to reduce your key space. The API offered is a distinguishing feature of Aerospike when contrasted with other key value databases.
Let’s look again at our example. Suppose your key space is composed of device ids, and these are fundamentally UUIDs — 128 bit numbers or 32 digit hexadecimal numbers. Let’s say you anticipate you may need to store as many as 15bn of these, but each record is only around 200 bytes. Your DRAM requirement with Aerospike would be of the order of
64(bytes) * 15bn * 2(replication) = ~2Tb
Not the end of the world, but you could do better by working smarter.
Assume also that we want to keep our physical object size below 128kb — a good starting point for optimal block size on a flash device, which is recommended for Aerospike. We can put 655 200 byte objects into 128kb. If each physical object contains 655 actual objects then we require 15bn / 655 = 22.9m container object keys. The question is, how then, given do we map from a device id to the container(physical) object key, and how do we reliably look up a logical object inside the container object. The answer is that we do this using bit-masking.
A 128 bit key space can be converted into a key space of size 2,4 …. 65536… 220 keys by AND-ing the key with a binary number composed of 1,2 … 16 … 20 etc leading ones followed by trailing zeros. For our example we need a bit mask of size equivalent to the first power of two above our key space size which can be calculated as
ceiling(log(22.9 *10⁶) / log(2)) = 25 bits
This gives us a key space of size 225 = ~33.5m so we’ve got our maths correct.
Let’s look at how we make use of this in Aerospike
https://gist.github.com/ken-tune/77e4297e54d296c49fb3a3140dfc3460
This function stores our device metadata inside a physical object. As described, the physical object key is derived using bit masking. Note this is efficient from a network capacity point of view — only the metadata gets sent across the network, not the full physical object.
We also need to see how to retrieve our object
https://gist.github.com/ken-tune/adda49e10bd440f7a9331148122767f3
The construction of the physical key is as before. This time we use the getByKey operation to retrieve the device metadata.
An important point to note is that only the metadata requested is transmitted across the network not the entire physical object. This consideration applies in general to the calls offered by the List/Map API. This is what we mean by ‘economy’ in the article title.
Finally, a code snippet showing how to calculate the bit mask using the ‘natural’ inputs.
https://gist.github.com/ken-tune/8e248c893d3539b0201393df4bb3dbcf
The net benefit of all the above is that the memory footprint will be reduced in this case to
225 (keys) * 64 (DRAM cost per record) * 2 (rep factor) = 4Gb
from
15bn * 64 (DRAM cost per record) * 2 (rep factor) = ~1.8TB
The reduction factor is 447, slightly less than the ‘655’ quoted above as on average, our 128kb blocks will not be completely filled.
Before we close, worth noting our Enterprise only ‘all flash’ capability which allows both index and data to be placed on disk thus reducing DRAM usage to very low levels. This was developed specifically with the use cases of small objects and/or very large numbers of objects (~10¹² = 1 trillion ) in mind. It will engender higher levels of latency (~5ms vs ~1ms at the 95th percentile ) but it’s still competitive vs any other database out there.
The above solution is a good example of a differentiating feature, our List and Map API , providing a distinguishing optimisation under constraints. The technique of ‘blocking’ can also be made use of for time series data which I hope to explore in a future article.
Cover image with thanks to Nana Smirnova