Bloom Filters & Probabilistic Data Structures
A Bloom filter answers one question: "Is this element possibly in the set, or definitely not?" It does this using dramatically less memory than storing the actual set, at the cost of a small, tunable probability of false positives. It never produces false negatives — if the filter says "no," the element is guaranteed absent.
This trade-off turns out to be extraordinarily useful. Every query you run against a database, every URL your browser checks against a phishing list, every key lookup in an LSM-tree storage engine — all of these use Bloom filters or their descendants to avoid expensive operations that would almost certainly return "not found."
How Bloom Filters Work
The Data Structure
A Bloom filter is a bit array of
Initial state (m = 16 bits):
[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15Insert Operation
To insert an element, compute all
Insert "alice" with k=3 hash functions:
h1("alice") = 2
h2("alice") = 7
h3("alice") = 13
[0][0][1][0][0][0][0][1][0][0][0][0][0][1][0][0]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
^ ^ ^Insert "bob":
h1("bob") = 4
h2("bob") = 7 ← already set by "alice"
h3("bob") = 11
[0][0][1][0][1][0][0][1][0][0][0][1][0][1][0][0]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15Lookup Operation
To check membership, compute all
Lookup "alice": bits 2, 7, 13 → all 1 → "possibly present" ✓
Lookup "charlie": bits 1, 5, 9 → bit 1 is 0 → "definitely not present" ✗
Lookup "dave": bits 2, 4, 11 → all 1 → "possibly present" (FALSE POSITIVE!)False Positives Are Fundamental
A Bloom filter can report an element as present when it is not (false positive). This happens because other insertions set the same bit positions. A Bloom filter can never report an element as absent when it is actually present (false negative). You cannot delete elements from a standard Bloom filter because clearing a bit might affect other elements.
The False Positive Math
The probability that a specific bit is still 0 after inserting
The false positive probability — the chance that all
Optimal Number of Hash Functions
The false positive rate is minimized when:
With the optimal
Sizing a Bloom Filter
Given a desired false positive rate
| False positive rate | Bits per element ( | Hash functions ( |
|---|---|---|
| 10% | 4.79 | 3 |
| 1% | 9.58 | 7 |
| 0.1% | 14.38 | 10 |
| 0.01% | 19.17 | 13 |
Memory Efficiency
A Bloom filter with a 1% false positive rate uses only 9.6 bits per element — regardless of element size. Storing 1 billion URLs (average 50 bytes each) as a set would take ~50 GB. A Bloom filter for the same set with 1% FP rate takes ~1.2 GB — a 40x reduction.
Implementation
import math
import mmh3 # MurmurHash3
from bitarray import bitarray
class BloomFilter:
def __init__(self, expected_items: int, fp_rate: float = 0.01):
self.fp_rate = fp_rate
self.size = self._optimal_size(expected_items, fp_rate)
self.num_hashes = self._optimal_hashes(self.size, expected_items)
self.bit_array = bitarray(self.size)
self.bit_array.setall(0)
self.count = 0
@staticmethod
def _optimal_size(n: int, p: float) -> int:
"""Calculate optimal bit array size."""
m = -(n * math.log(p)) / (math.log(2) ** 2)
return int(math.ceil(m))
@staticmethod
def _optimal_hashes(m: int, n: int) -> int:
"""Calculate optimal number of hash functions."""
k = (m / n) * math.log(2)
return int(math.ceil(k))
def _get_hash_indices(self, item: str) -> list[int]:
"""Generate k hash indices using double hashing."""
h1 = mmh3.hash(item, seed=0) % self.size
h2 = mmh3.hash(item, seed=42) % self.size
return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]
def add(self, item: str) -> None:
for idx in self._get_hash_indices(item):
self.bit_array[idx] = 1
self.count += 1
def might_contain(self, item: str) -> bool:
"""Returns True if item might be in the set, False if definitely not."""
return all(self.bit_array[idx] for idx in self._get_hash_indices(item))
@property
def current_fp_rate(self) -> float:
"""Estimate current false positive rate."""
bits_set = self.bit_array.count(1)
return (bits_set / self.size) ** self.num_hashes
# Usage
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.001)
bf.add("user:12345")
bf.add("user:67890")
print(bf.might_contain("user:12345")) # True (correct)
print(bf.might_contain("user:99999")) # False (or very rarely True)Counting Bloom Filters
Standard Bloom filters do not support deletion — you cannot unset a bit without potentially removing other elements. Counting Bloom filters replace each bit with a counter:
Standard: [0][1][0][1][1][0][0][1] ← 1 bit per position
Counting: [0][2][0][1][3][0][0][1] ← n bits (usually 4) per positionclass CountingBloomFilter:
def __init__(self, expected_items: int, fp_rate: float = 0.01):
self.size = self._optimal_size(expected_items, fp_rate)
self.num_hashes = self._optimal_hashes(self.size, expected_items)
self.counters = [0] * self.size
def add(self, item: str) -> None:
for idx in self._get_hash_indices(item):
self.counters[idx] += 1
def remove(self, item: str) -> None:
"""Remove an element. Only call if the element was previously added."""
if not self.might_contain(item):
raise ValueError("Cannot remove element not in filter")
for idx in self._get_hash_indices(item):
self.counters[idx] = max(0, self.counters[idx] - 1)
def might_contain(self, item: str) -> bool:
return all(self.counters[idx] > 0
for idx in self._get_hash_indices(item))Counter Overflow
With 4-bit counters, the maximum count is 15. If a counter overflows, it stays at the maximum value and deletion becomes unsafe. The probability of overflow is negligible for well-sized filters, but monitor it in production.
Trade-off: Counting Bloom filters use 3-4x more memory than standard Bloom filters (4 bits per counter vs 1 bit).
Cuckoo Filters
Cuckoo filters are a modern alternative that support deletion, use less space than counting Bloom filters, and often have better lookup performance.
The key insight: each element has exactly two possible bucket locations, and you can compute one from the other using only the fingerprint (not the original element):
This XOR property means you can relocate entries without knowing the original elements.
import hashlib
import struct
from typing import Optional
class CuckooFilter:
def __init__(self, capacity: int, bucket_size: int = 4,
fingerprint_bits: int = 8, max_kicks: int = 500):
self.capacity = capacity
self.bucket_size = bucket_size
self.fp_bits = fingerprint_bits
self.max_kicks = max_kicks
self.buckets: list[list[int]] = [[] for _ in range(capacity)]
self.count = 0
def _fingerprint(self, item: str) -> int:
h = int(hashlib.sha256(item.encode()).hexdigest(), 16)
fp = h % ((1 << self.fp_bits) - 1) + 1 # Never 0
return fp
def _hash1(self, item: str) -> int:
h = int(hashlib.md5(item.encode()).hexdigest(), 16)
return h % self.capacity
def _hash2(self, index: int, fingerprint: int) -> int:
return (index ^ hash(fingerprint)) % self.capacity
def insert(self, item: str) -> bool:
fp = self._fingerprint(item)
i1 = self._hash1(item)
i2 = self._hash2(i1, fp)
if len(self.buckets[i1]) < self.bucket_size:
self.buckets[i1].append(fp)
self.count += 1
return True
if len(self.buckets[i2]) < self.bucket_size:
self.buckets[i2].append(fp)
self.count += 1
return True
# Must kick existing entries
import random
i = random.choice([i1, i2])
for _ in range(self.max_kicks):
j = random.randrange(len(self.buckets[i]))
fp, self.buckets[i][j] = self.buckets[i][j], fp
i = self._hash2(i, fp)
if len(self.buckets[i]) < self.bucket_size:
self.buckets[i].append(fp)
self.count += 1
return True
return False # Filter is too full
def lookup(self, item: str) -> bool:
fp = self._fingerprint(item)
i1 = self._hash1(item)
i2 = self._hash2(i1, fp)
return fp in self.buckets[i1] or fp in self.buckets[i2]
def delete(self, item: str) -> bool:
fp = self._fingerprint(item)
i1 = self._hash1(item)
i2 = self._hash2(i1, fp)
for bucket_idx in [i1, i2]:
if fp in self.buckets[bucket_idx]:
self.buckets[bucket_idx].remove(fp)
self.count -= 1
return True
return FalseBloom Filter vs Cuckoo Filter
| Feature | Bloom Filter | Cuckoo Filter |
|---|---|---|
| Deletion | Not supported | Supported |
| Space (at ❤️% FP) | More efficient | Less efficient |
| Space (at >3% FP) | Less efficient | More efficient |
| Lookup speed | 2 hash probes | |
| Insert speed | Amortized | |
| Load factor | N/A | Typically ~95% |
HyperLogLog
HyperLogLog (HLL) answers a different question: "How many distinct elements are in this set?" It does so using
The Intuition
If you hash elements uniformly, the maximum number of leading zeros in any hash tells you roughly how many distinct elements you have seen. If the longest run of leading zeros is
hash("alice") = 00101... → 2 leading zeros
hash("bob") = 10011... → 0 leading zeros
hash("charlie") = 00001... → 3 leading zeros
hash("dave") = 01100... → 1 leading zero
Max leading zeros = 3 → estimate ≈ 2^3 = 8 distinct elementsThis estimate has high variance. HyperLogLog reduces variance by splitting elements into
where
import hashlib
import math
class HyperLogLog:
def __init__(self, precision: int = 14):
"""precision p: uses 2^p registers, ~1.04/sqrt(2^p) standard error."""
self.p = precision
self.m = 1 << precision # Number of registers
self.registers = [0] * self.m
self.alpha = self._compute_alpha(self.m)
@staticmethod
def _compute_alpha(m: int) -> float:
if m == 16:
return 0.673
elif m == 32:
return 0.697
elif m == 64:
return 0.709
else:
return 0.7213 / (1 + 1.079 / m)
def _hash(self, item: str) -> int:
h = hashlib.sha1(item.encode()).hexdigest()
return int(h[:16], 16) # 64-bit hash
def add(self, item: str) -> None:
h = self._hash(item)
# First p bits → register index
idx = h >> (64 - self.p)
# Remaining bits → count leading zeros
remaining = h & ((1 << (64 - self.p)) - 1)
leading_zeros = (64 - self.p) - remaining.bit_length() + 1
self.registers[idx] = max(self.registers[idx], leading_zeros)
def estimate(self) -> int:
"""Return estimated cardinality."""
raw = self.alpha * self.m ** 2 * (
sum(2 ** (-r) for r in self.registers) ** -1
)
# Small range correction
if raw <= 2.5 * self.m:
zeros = self.registers.count(0)
if zeros > 0:
return int(self.m * math.log(self.m / zeros))
return int(raw)
def merge(self, other: 'HyperLogLog') -> 'HyperLogLog':
"""Merge two HLLs (e.g., from different servers)."""
assert self.p == other.p
result = HyperLogLog(self.p)
result.registers = [
max(a, b) for a, b in zip(self.registers, other.registers)
]
return resultStandard error of HyperLogLog:
| Precision ( | Registers ( | Memory | Standard Error |
|---|---|---|---|
| 10 | 1,024 | 1 KB | 3.25% |
| 14 | 16,384 | 12 KB | 0.81% |
| 16 | 65,536 | 48 KB | 0.41% |
Redis uses HyperLogLog natively with the PFADD and PFCOUNT commands — 12 KB per counter, ~0.81% error.
Count-Min Sketch
A Count-Min Sketch (CMS) estimates the frequency of elements in a stream using sub-linear memory. Unlike a Bloom filter (which answers "is this in the set?"), a CMS answers "approximately how many times has this element appeared?"
import hashlib
import sys
class CountMinSketch:
def __init__(self, width: int, depth: int):
"""
width (w): number of columns. Determines accuracy: error ≤ e*N/w
depth (d): number of rows/hash functions. Determines confidence: 1 - 1/e^d
"""
self.w = width
self.d = depth
self.table = [[0] * width for _ in range(depth)]
self.total = 0
def _hash(self, item: str, i: int) -> int:
h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
return int(h, 16) % self.w
def add(self, item: str, count: int = 1) -> None:
self.total += count
for i in range(self.d):
j = self._hash(item, i)
self.table[i][j] += count
def estimate(self, item: str) -> int:
"""Returns an overestimate of the true count."""
return min(
self.table[i][self._hash(item, i)]
for i in range(self.d)
)
@staticmethod
def from_error_bounds(epsilon: float, delta: float) -> 'CountMinSketch':
"""Create CMS with error ≤ epsilon*N with probability ≥ 1-delta."""
import math
w = int(math.ceil(math.e / epsilon))
d = int(math.ceil(math.log(1 / delta)))
return CountMinSketch(w, d)The error guarantee: for any element, the estimated count
with probability
Real-World Usage
LSM-Tree Storage Engines (LevelDB, RocksDB, Cassandra)
Before reading an SSTable from disk, check a Bloom filter to see if the key could be there. This avoids unnecessary disk I/O for keys that do not exist in that level.
Query for key "user:123":
Level 0 Bloom filter → "maybe" → read SSTable → not found
Level 1 Bloom filter → "no" → skip (saved disk read!)
Level 2 Bloom filter → "maybe" → read SSTable → found!See Storage Engines for how this fits into the LSM-tree architecture.
Google Chrome Safe Browsing
Chrome checks URLs against a Bloom filter of known malicious URLs stored locally. Only if the Bloom filter says "maybe" does it make a network call to verify.
Content Delivery Networks
CDNs use Bloom filters to decide which objects to cache. Only cache an object after it has been requested more than once (the first request adds it to the filter; subsequent requests that hit the filter trigger caching).
Database Query Optimization
PostgreSQL uses Bloom filters for join optimization, and Cassandra uses them to avoid reading SSTables that cannot contain a partition key.
Stream Processing
Count-Min Sketch is used in network monitoring to detect heavy hitters (IPs sending the most traffic) and in advertising to track approximate impression counts.
Comparison of Probabilistic Data Structures
| Structure | Question Answered | Memory | Error Type |
|---|---|---|---|
| Bloom Filter | "Is x in the set?" | False positives | |
| Cuckoo Filter | "Is x in the set?" (with delete) | False positives | |
| HyperLogLog | "How many distinct elements?" | Estimation error | |
| Count-Min Sketch | "How often does x appear?" | Overestimates | |
| MinHash | "How similar are sets A and B?" | Estimation error | |
| Top-K (Space-Saving) | "What are the most frequent elements?" | May miss elements |
Further Reading
- Storage Engines — How Bloom filters are used in LSM-trees
- Redis Internals — Redis native HyperLogLog and Bloom filter modules
- Consistent Hashing — Another hash-based distributed systems primitive
- Distributed Locking — Probabilistic approaches to distributed coordination
- Burton H. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors" (1970)
- Philippe Flajolet et al., "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (2007)
- Bin Fan et al., "Cuckoo Filter: Practically Better Than Bloom" (2014)
- Graham Cormode and S. Muthukrishnan, "An improved data stream summary: the count-min sketch and its applications" (2005)