Bloom and Cuckoo filters: Trading Memory for Certainty
This article provides a conceptual overview of why we use bloom and cuckoo filters rather than how to implement them. We do briefly explore their implementation at the end.
Let’s jump straight in. Sometimes we want to test for whether something belongs to a set or not. For instance
- does this email belong to the set of emails we’ve marked as spam?
- does this ip address belong to the set of all ip addresses a user has accessed the API with in the past?
- does this location geohash belong to the set of previously known locations?
Using the examples above, we can allow emails through that haven’t been marked previously as spam. We can request extra authentication if a user is accessing a service from a new, unrecognised ip address. We can send a notification if a device deviates from a known trail or path. Being able to efficiently identify whether something does or does not belong clearly has practical applications.
Part 1: Hash Map
Regardless of name and implementation, a hash map essentially maps one value to another value.
Why is this important? If we want to determine whether an item belongs to a set or not, we can do so definitively with 0 false positives and 0 false negatives. Even better, hash maps generally run in O(1) constant time while hash tries in O(log n) time. So we have a data structure that is arguably ideal for determining set membership. It can determine with 100% certainty whether it belongs or not and in addition it can do so relatively quickly. So why do we need Bloom and Cuckoo filters then?
Answer: when memory becomes a limiting factor.
Imagine we are the app developers of Pokémon Go and we want to detect if a user has visited a location that nobody in the history of the game has been to. If you can imagine, that’s a whole lot of location points from say ≈30 million daily users. At a thousand location points a day over the course of a year thats ≈11 trillion data points. To be able to continuously record and check new location points against them would take an unbounded amount of disk and memory space.
With a Bloom or Cuckoo filter memory use is set. We can check against a record of all those location points while only using a predetermined amount of memory or disk space. However, it comes at a cost. Instead of being able to determine if an item belongs to a set with certainty, now we are only able to determine it probabilistically.
Bloom and Cuckoo filters are memory optimised with tradeoff being probabilistic set membership.
What probabilistic set membership means is we now need to consider the notions of false positives (tells us item belongs to a set when in fact it doesn’t, also known as Type 1 error), and false negative (tells us item doesnt not belong to a set when in fact it does, or Type II error). We can control our false positive rate by adjusting certain parameters of the algorithm like total memory allocated. However, false negatives remain at 0%. That means, if these algorithms tells us an item does not belong to a set, we can be 100% certain that it is the correct.
Intuitively, what this means is that we can make the following 2 inferences:
- an item is definitely not in the set
- it could be in the set, but ultimately we don’t know
As a concrete example, suppose we have a complete set of all malicious ip addresses in the world and we want to determine if a new ip address accessing our server is malicious or not. Using a filter we could conclude either:
- the new ip address is definitely not malicious → allow
- the new ip address could be malicious, but we don’t know for sure → block
Note that if we block on the possibility of it being malicious we will end up also blocking legitimate new users.
We describe conceptually what bloom and cuckoo filters are below; however, we omit implementation details for clarity.
Part 2: Bloom filter
A bloom filter is a Bit Array with all bits set to 0. Every time a new element is added, we hash it, use the result as an index into the array and set the bits to 1.
In practice though, bloom filters are often used with more than 1 hash function. So it looks more like this.
For testing set membership, we simply hash the input and check whether any single bit contains 0. If one of the bits contains a 0 then we conclude definitively that the item is not part of the set. Otherwise we conclude it may possibly be part of the set.
Now imagine if we add keep elements to the bit array indefinitely. Eventually all the bits in the array will be set to 1. At this point testing of any input will give us a result of 1 regardless of what the input is. The proportion of the bit array that is filled is referred to as the load factor.
As load factor increases, the usefulness of a bloom filter decreases
If we continually add to the bloom filter then there is no way to prevent this eventuality. But there are ways to mitigate this. The following is outside the conceptual scope of the article but is included for implementation reference.
First, we introduce some terms:
- n → number of inserted elements
- m → size of bit array
- k → number of hash functions
- ε → false positive rate
Given an acceptable false positive rate ε, we then work backwards and derive approximations of other values:
Part 3: Cuckoo Filter
Based on a specific implementation of a hash map called a cuckoo hash, wherein conflicts are resolved by kicking out previous results and rehashing it under a different bit array. A Cuckoo hash uses two arrays (or buckets), each with its own associated hash function. Best visualised below:
Whenever a conflict occurs, we kick out the previous occupant and rehash it under the other table — this continues until there are no more conflicts.
In the event of an infinite loop being formed between the two tables, insertion fails and we resolve this by rebuilding both tables with new hash functions.
The difference between a cuckoo filter and cuckoo hash however is that instead of storing the actual inputs, we store a fingerprint of that input (using a third separate hash function).
The indices for each array is determined as follows:
- Index 1: hash(input)
- Index 2: hash(input) xor hash(fingerprint)
We test for set membership in a cuckoo filter by checking for its fingerprint in both arrays. As with bloom filters, if a fingerprint it not found in either array then it tells us definitely it is not part of a set. However, if it is found then it tells us only that is it possibly part of a set.