Probabilistic data structures are pivotal tools for managing large datasets with significant space efficiency. Unlike traditional data structures, they return approximate answers, which makes them suitable for applications where exact precision is not critical. They use hash functions to encode information compactly, allowing rapid responses with minimal memory usage. In big data settings, where handling vast amounts of information is a common challenge, they provide quick operations such as membership tests, frequency counts, and cardinality estimation without the overhead of storing all the data explicitly.
Consider the scenario of a web service handling billions of user interactions daily. Storing each interaction individually would require substantial memory and computational resources. Here, probabilistic data structures like Bloom filters or HyperLogLog can efficiently manage the data with a fraction of the space. Their ability to offer approximate answers suits applications such as network security, database indexing, and real-time analytics, where speed and efficiency are prioritized over exactness.
For developers and computer science students, understanding probabilistic data structures is essential. These structures challenge the traditional paradigms of data handling by offering solutions that balance accuracy and performance. A fundamental grasp of how these structures work, including their reliance on hash functions and their trade-offs between precision and efficiency, is crucial for leveraging them effectively in various technological domains. By integrating probabilistic data structures into their skillset, professionals and students alike can address modern data challenges with innovative and efficient strategies.
Sources:
Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: The Count-Min Sketch and its applications. Journal of Algorithms, 55(1), 58-75.
Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007). HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science.
Common examples of probabilistic data structures
Bloom filters
Bloom filters efficiently test whether an element is a member of a set. They use hash functions to map each element to positions in a bit array, giving a space-efficient way to track membership. A Bloom filter can confirm that an element is definitely absent, but a positive answer is only probable: hash collisions can produce false positives. Despite this trade-off, Bloom filters are invaluable for applications where space efficiency is critical, such as cache filtering or database querying.
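To make the mechanics concrete, here is a minimal Bloom filter sketch in plain Python; the bit-array size, number of hashes, and hashing scheme are illustrative choices rather than a production implementation.
# code - bloom_filter (illustrative sketch)
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive the k bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(b'%d:%s' % (i, item.encode())).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present (false positives allowed).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add('aerospike')
print(bf.might_contain('aerospike'))   # True
print(bf.might_contain('cassandra'))   # almost certainly False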
Count-min sketch
Count-Min Sketch is used to estimate the frequency of elements in a dataset. This data structure employs hash functions to map elements into a two-dimensional array, maintaining space efficiency while allowing for fast updates and queries. Although Count-Min Sketch provides approximate answers and may overestimate frequencies due to hash collisions, its practical application in streaming data scenarios and large datasets makes it highly valuable. It enables real-time performance in monitoring network traffic or tracking item counts in large-scale data processing.
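As an illustration, a bare-bones Count-Min Sketch might look like the sketch below; the width, depth, and hashing scheme are illustrative choices.
# code - count_min_sketch (illustrative sketch)
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(b'%d:%s' % (row, item.encode())).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Take the minimum across rows; collisions can only inflate a count, never deflate it.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

cms = CountMinSketch()
for term in ['aerospike', 'aerospike', 'hll']:
    cms.add(term)
print(cms.estimate('aerospike'))   # at least 2; may overestimate, never underestimates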
Hyperloglog
HyperLogLog estimates the number of distinct elements in a dataset with remarkable space efficiency. Each item is hashed once; part of the hash value selects a register, and the pattern of leading zeros in the remainder indicates how rare a value has been observed, which is enough to estimate cardinality. Because HyperLogLog provides an approximate answer, it shines in big data environments where exact counts of distinct elements are infeasible, such as web analytics and distributed systems.
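A toy implementation shows the core idea: each item is hashed once, the first few bits of the hash pick a register, and each register records the longest run of leading zeros it has seen. This sketch omits the small- and large-range corrections of the full algorithm, and the class name and parameters are illustrative.
# code - hyperloglog (illustrative sketch)
import hashlib

class HyperLogLog:
    def __init__(self, index_bits=12):
        self.p = index_bits
        self.m = 1 << index_bits               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha256(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits measure rarity
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Raw HyperLogLog estimator; the full algorithm adds range corrections.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return int(alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for i in range(100000):
    hll.add('user-%d' % i)
print(hll.count())   # close to 100,000, typically within a couple of percent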
Advantages of probabilistic data structures
Probabilistic data structures offer several significant benefits that make them valuable tools in handling large datasets. They are particularly advantageous in scenarios where space efficiency and real-time performance are priorities.
Scalability: Probabilistic data structures can handle large volumes of data efficiently, making them suitable for big data applications. They allow systems to scale without a corresponding linear increase in resource consumption.
Space efficiency: These structures require significantly less memory compared to traditional data structures. By storing only essential information and using techniques like hash functions, they provide approximate answers while conserving space. This space efficiency is critical when working with massive datasets where storage resources are limited.
Real-time performance: Probabilistic data structures facilitate faster data processing by reducing the computational overhead. They provide quick approximate answers, making them ideal for real-time applications where speed is crucial.
Reduced computation: By offering approximate results, probabilistic data structures reduce the need for intensive computations. This reduction is especially useful in applications where exact precision is not necessary, allowing for more efficient use of computational resources.
These advantages underscore the utility of probabilistic data structures in modern computing environments, particularly in big data analytics and other fields requiring efficient data handling.
Aerospike and probabilistic data structures
Note: the code examples in this post are written in Python. HLL support requires at least version 4.0.0 of the Aerospike Python client.
pip install "aerospike>=4.0.0"
Complete sample code can be found on GitHub.
I once worked on a team that was involved with testing and profiling new hardware. Test engineers crafted and executed numerous tests against a phalanx of hardware for each hardware revision, each test collecting an array of time series data totaling GBs of storage. After the test suite was completed, the data from each test would be loaded, analyzed, and compiled into a report.
My team was tasked with addressing the length of time it took to perform the post-processing. We examined the database, the network, and anything else we could think of, and we made big improvements. But the test engineers still complained. We then turned to the data analysis pipeline. We found that, in most cases, the analysis was reporting only the minimum, maximum, and mean of most of the time series. To do so, the pipeline had to load each time series into memory and examine each measurement.
No doubt you’re thinking that this pipeline could have been a lot more efficient. We could have kept accumulators that tracked the minimum and maximum values as the measurements were ingested, along with a running total and a sample count to compute the mean at the end of the test. Not only would the data not need to be read and processed again after the test, but the individual measurements would never need to be stored in the first place.
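In code, such a single-pass summary might look like the sketch below; the class and variable names are illustrative.
# code - running_summary (illustrative sketch)
class RunningSummary:
    # Tracks the minimum, maximum, and mean of a stream without storing the samples.
    def __init__(self):
        self.minimum = float('inf')
        self.maximum = float('-inf')
        self.total = 0.0
        self.count = 0

    def add(self, value):
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value
        self.count += 1

    @property
    def mean(self):
        return self.total / self.count if self.count else float('nan')

summary = RunningSummary()
for measurement in (3.2, 7.9, 5.1):
    summary.add(measurement)
print(summary.minimum, summary.maximum, summary.mean)   # 3.2 7.9 5.4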
Use case – online ad targeting
HyperLogLog is an algorithm for estimating the number of distinct elements in a multiset. The HyperLogLog data type was introduced in Aerospike 4.9 as a native data type, encompassing the algorithm and underlying data structures. (The HyperLogLog data type also implements the HyperMinHash algorithm for estimating similarity between sets.) With HyperLogLog (and HyperMinHash), we are able to apply the same principle to new classes of problems. For example, a common big data problem is online ad targeting.
Each time a user performs an action online, information is collected about the user. For site visits, data is typically collected as a set of tags (e.g., user characteristics like “male” or “female”, interests like “databases”, or locations like “California”) associated with a user identifier (e.g., an email address). This is how advertisers are able to target marketing to a specific demographic or interest. Before launching a marketing campaign, the advertiser wants to know the size of the audience the campaign will reach (i.e., the number of unique user profiles that match a specific set of tags over a given period of time).
Suppose we wanted to estimate the number of unique users who searched for the term “aerospike” in a particular month. The naïve approach is to record all the identifiers and associated tags. Then we can count the number of unique identifiers that are associated with the desired tag within the time period. But, like the example of finding the minimum, maximum, and mean values of a time series, we can do better.
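For contrast, the naïve exact approach might look like the sketch below (names and structure are illustrative); every identifier has to be kept in memory, so storage grows with the number of distinct users.
# code - naive_exact_count (illustrative sketch)
from collections import defaultdict

# One Python set per (tag, month); every identifier must be retained.
exact = defaultdict(set)

def ingest(profile_id, tags, month):
    for tag in tags:
        exact[(tag, month)].add(profile_id)

ingest('alice@example.com', ['aerospike', 'california'], 0)
ingest('bob@example.com', ['aerospike'], 0)
ingest('alice@example.com', ['aerospike'], 0)   # duplicates are absorbed by the set

print(len(exact[('aerospike', 0)]))   # 2 distinct users searched for "aerospike" in month 0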
Using HyperLogLog (HLL), we can estimate the number of unique values in a set. It is a probabilistic data structure, so it does not yield a precise count. Instead, it estimates a count with a maximum error that is a function of the number of bits used for the HLL bins. In our example, we use 12 bits of HLL index; this yields a maximum error of 1.6250%.
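That figure lines up with the commonly cited relative standard error of HyperLogLog, roughly 1.04/√m, where m is the number of registers (2 raised to the number of index bits).
# code - hll_error (illustrative)
import math

index_bits = 12
m = 2 ** index_bits                  # 4,096 registers
error = 1.04 / math.sqrt(m)          # 0.01625
print('%.4f%%' % (error * 100))      # 1.6250%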
Ingesting data
Suppose, throughout a month, we are given (partial) user profiles. Rather than storing the raw profiles, we can add the profile identifier to HLL data structures instead. We create a new Aerospike record for each tag in a month; the key for this record is “<tag>:<month>,” and the HLL is stored in the bin “hll_bin.” For each partial profile ingested, we add the profile identifier to the corresponding record for each tag contained in the partial profile.
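The snippets that follow assume a connected client and a few helpers along these lines; the namespace, set name, and host are placeholders, and the exact setup in the GitHub sample may differ.
# code - setup (assumed helpers)
import aerospike
from aerospike_helpers.operations import hll_operations

HLL_BIN = 'hll_bin'        # bin that holds the HLL value
HLL_INDEX_BITS = 12        # 2^12 registers, as discussed above
NAMESPACE = 'test'         # placeholder namespace
SET_NAME = 'tags'          # placeholder set name

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def getkey(tag, month):
    # Record key is "<tag>:<month>", as described above.
    return (NAMESPACE, SET_NAME, '%s:%d' % (tag, month))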
# code - hll_add
for tag in profile:
    # Add the profile identifier str(i) to this month's HLL record for each tag in the profile.
    ops = [
        hll_operations.hll_add(HLL_BIN, [str(i)], HLL_INDEX_BITS)
    ]
    _, _, result = client.operate(getkey(tag, month), ops)
Adding the same identifier to an HLL element is idempotent. Thus, throughout the month, we simply add partial profiles as we receive them, without needing to know whether an identifier has already been added to the month’s HLL elements.
Data footprint
With 12 HLL bits (and 0 HyperMinHash bits), each HLL bin requires just over 3 kB of storage (3,088 bytes, to be precise).
Suppose we were not using HLL but, instead, stored each partial user profile with each tag as a string. Each time a partial profile includes the tag “aerospike,” storing the tag in the user profile database requires 18 bytes (9 bytes of overhead plus the 9-byte string). If the “aerospike” tag were included in 1,000 profiles in a month, that corresponds to 18 kB of storage; the 3,088-byte HLL therefore reduces storage in this case by roughly 83% (1 − 3,088/18,000).
As our usage scales up and we process more partial profiles per month, the storage required for HLL remains constant while the storage required for the raw tags keeps growing, so the savings become even greater.
Counting
Once our partial profiles are ingested over the month, we can get the estimated count of unique profile identifiers in the HLL element.
# code - hll_get_count
# Read the estimated number of distinct identifiers stored in the tag's HLL bin.
ops = [
    hll_operations.hll_get_count(HLL_BIN)
]
_, _, result = client.operate(getkey(tag, month), ops)
print('tag:%s month:%d count:%d' % (tag, month, result[HLL_BIN]))
# output
tag:aerospike month:0 count:3367
tag:aerospike month:1 count:3324
tag:aerospike month:2 count:3419
tag:aerospike month:3 count:3298
tag:aerospike month:4 count:3298
tag:aerospike month:5 count:3515
tag:aerospike month:6 count:3353
tag:aerospike month:7 count:3412
tag:aerospike month:8 count:3423
tag:aerospike month:9 count:3308
tag:aerospike month:10 count:3326
tag:aerospike month:11 count:3530
Union
Now after a year of collecting data, we wish to estimate the number of unique profile identifiers for a tag over the year. We can’t simply add the monthly estimates since some of those identifiers may be counted in multiple months. But we can calculate a union of multiple HLL data structures.
# code - hll_get_union_count
# Fetch the stored HLL values for this tag across all months, then estimate the cardinality of their union.
records = [record[2][HLL_BIN]
           for record in client.get_many([getkey(tag, time) for time in times])]
ops = [
    hll_operations.hll_get_union_count(HLL_BIN, records)
]
_, _, result = client.operate(getkey(tag, t0), ops)
print('tag:%s months:%d-%d count:%d' % (tag, t0, t0 + months - 1, result[HLL_BIN]))
# output
tag:aerospike months:0-11 count:18381
Intersection
Our data contains geographic tags to identify the location of the users. We’ve already created HLL records for all of our tags, so it’s easy to check how many of our users are located in Vancouver:
tag:vancouver month:0 count:6217
There are two cities on the Pacific coast named Vancouver: one in Canada and the other in Washington. We can get the count of users in each of those:
tag:canada month:0 count:5460
tag:washington month:0 count:6720
But how many of our users are from Vancouver, Washington? We can use HLL intersection to estimate this.
# code - hll_get_intersect_count
# Fetch the HLL values for the remaining tags, then intersect them with the first tag's HLL.
records = [record[2][HLL_BIN]
           for record in client.get_many([getkey(tag, month) for tag in tags[1:]])]
ops = [
    hll_operations.hll_get_intersect_count(HLL_BIN, records)
]
_, _, result = client.operate(getkey(tags[0], month), ops)
print('tags:%s month:%d count:%d' % (tags, month, result[HLL_BIN]))
# output
tags:['vancouver', 'washington'] month:0 count:724
HLL estimates that, of the 6,217 profiles containing the tag “vancouver”, 724 also contain the tag “washington”.
Conclusion
There are many use cases where we are interested in the approximate size of a set, of a union of sets, or of an intersection of sets. For these, HyperLogLog is remarkably efficient, in both time and space, at estimating cardinality. An improvement in either dimension lowers operational costs; together, they add up to significant savings that should not be overlooked.
Complete sample code can be found at https://github.com/aerospike-examples/hll-python