I once worked on a team that was involved with testing and profiling new hardware. Test engineers crafted and executed numerous tests against a phalanx of hardware for each hardware revision, each test collecting an array of time series data totalling GBs of storage. After the test suite was completed, the data from each test would be loaded, analysed, and compiled into a report.
My team was tasked with addressing the length of time it took to perform the post-processing. We examined the database, the network, and anything else we could think of, and we made big improvements. But the test engineers still complained. We then turned to the data analysis pipeline. We found that, in most cases, the analysis was reporting only the minimum, maximum, and mean of most of the time series. To do so, the pipeline had to load each time series into memory and examine each measurement.
No doubt you’re thinking that this pipeline could have been a lot more efficient. We could have kept accumulators that tracked the minimum and maximum values as the measurements were ingested, and maintained a running total with the count of the number of samples to calculate the mean at the end of the test. Not only did the data not have to be read and processed again after the test, but they never had to be individually stored in the first place.
Use Case – Online Ad Targeting
HyperLogLog is an algorithm for estimating the number of distinct elements in a multiset. The HyperLogLog data type was introduced in Aerospike 4.9 as a native data type, encompassing the algorithm and underlying data structures. (The HyperLogLog data type also implements the HyperMinHash algorithm for estimating similarity between sets.) With HyperLogLog (and HyperMinHash), we are able to apply the same principle to new classes of problems. For example, a common big data problem is online ad targeting.
Each time a user performs an action online, information is collected about the user. For site visits, data is typically collected in the form of a set of tags (eg. user characteristics like “male” or “female”, interests like “databases”, or locations like “California”) associated with a user identifier (eg. an email address). This is how advertisers are able to target marketing to a specific demographic or interest. Before launching a marketing campaign, the advertiser wants to know the audience that will be reached by the campaign (ie. the number of unique user profiles that match a specific set of tags over a given period of time).
Suppose we wanted to estimate the number of unique users who searched for the term “aerospike” in a particular month. The naïve approach is to record all the identifiers and associated tags. Then we can count the number of unique identifiers that are associated with the desired tag within the time period. But, like the example of finding the minimum, maximum, and mean values of a time series, we can do better.
Using HyperLogLog (HLL), we can estimate the number of unique values in a set. It is a probabilistic data structure, so it does not yield a precise count. Instead, it estimates a count with a maximum error that is a function of the number of bits used for the HLL bins. In our example, we use 12 bits of HLL index; this yields a maximum error of 1.6250% (see https://docs.aerospike.com/guide/hyperloglog.html#error-bounds-by-hyperloglog-operation).
Suppose, throughout a month, we are given (partial) user profiles. Rather than storing the raw profiles, we can add the profile identifier to HLL data structures instead. We create a new Aerospike record for each tag in a month; the key for this record is “<tag>:<month>”, and the HLL is stored in the bin “hll_bin”. For each partial profile ingested, we add the profile identifier to the corresponding record for each tag contained in the partial profile.
# code - hll_add for tag in profile: ops = [ hll_operations.hll_add(HLL_BIN, [str(i)], HLL_INDEX_BITS) ] _, _, result = client.operate(getkey(tag, month), ops)
Adding the same identifier to a HLL element is idempotent. Thus throughout the month, we simply add partial profiles as we receive them without needing to know if that identifier has previously been added to the HLL elements for the month.
With 12 HLL bits (and 0 HyperMinHash bits), each HLL bin requires just over 3 kB of storage (3,088 bytes to be precise).
Suppose we were not using HLL, but, instead, stored each partial user profile with each tag as a string. Each time a partial profile includes the tag “aerospike”, storing the tag requires 18 bytes (9 overhead plus length 9). If the “aerospike” tag were to be included in 1000 profiles in a month, then that corresponds to 18 kB of storage; using HLL reduces the storage in this case by 83%.
As our usage scales up and we process more partial profiles per month, the storage required for HLL remains constant. However, the storage required for the raw tags increases, and this savings becomes even greater.
Once our partial profiles are ingested over the month, we can get the estimated count of unique profile identifiers in the HLL element.
# code - hll_get_count ops = [ hll_operations.hll_get_count(HLL_BIN) ] _, _, result = client.operate(getkey(tag, month), ops) print 'tag:%s month:%d count:%d' % (tag, month, result[HLL_BIN]) # output tag:aerospike month:0 count:3367 tag:aerospike month:1 count:3324 tag:aerospike month:2 count:3419 tag:aerospike month:3 count:3298 tag:aerospike month:4 count:3298 tag:aerospike month:5 count:3515 tag:aerospike month:6 count:3353 tag:aerospike month:7 count:3412 tag:aerospike month:8 count:3423 tag:aerospike month:9 count:3308 tag:aerospike month:10 count:3326 tag:aerospike month:11 count:3530
Now after a year of collecting data, we wish to estimate the number of unique profile identifiers for a tag over the year. We can’t simply add the monthly estimates since some of those identifiers may be counted in multiple months. But we can calculate a union of multiple HLL data structures.
# code - hll_get_union_count records = [record[HLL_BIN] for record in client.get_many([getkey(tag, time) for time in times])] ops = [ hll_operations.hll_get_union_count(HLL_BIN, records) ] _, _, result = client.operate(getkey(tag, t0), ops) print 'tag:%s months:%d-%d count:%d' % (tag, t0, t0+months-1, result[HLL_BIN]) # output tag:aerospike months:0-11 count:18381
Our data contains geographic tags to identify the location of the users. We’ve already created HLL records for all of our tags, so it’s easy to check how many of our users are located in Vancouver:
tag:vancouver month:0 count:6217
There are two cities on the Pacific coast named Vancouver: one in Canada and the other in Washington. We can get the count of users in each of those:
tag:canada month:0 count:5460 tag:washington month:0 count:6720
But how many of our users are from Vancouver, Washington? We can use HLL intersection to estimate this.
# code - get_hll_intersect_count records = [record[HLL_BIN] for record in client.get_many([getkey(tag, month) for tag in tags[1:]])] ops = [ hll_operations.hll_get_intersect_count(HLL_BIN, records) ] _, _, result = client.operate(getkey(tags, month), ops) print 'tags:%s month:%d count:%d' % (tags, month, result[HLL_BIN]) # output tags:['vancouver', 'washington'] month:0 count:724
HLL estimates that of the 6217 profiles that contain the tag “vancouver”, 724 of them also contain the tag “washington”.
There are many use cases where we are interested in the approximate size of a set, of the union of sets, or of the intersection of sets. For these use cases, HyperLogLog is remarkably efficient, in both time and space, at estimating the cardinality of sets. Operational costs are lower with improvements in either one of the dimensions; the combination allows for significant cost savings that should not be neglected.
Complete sample code can be found at https://github.com/aerospike-examples/hll-python