Bloom filter.

Sometimes you need to know whether a thing is in a set, and a hash set would be too big. A bloom filter answers either "probably in there" or "definitely not in there" — never just plain "in there." It uses a bit array and a handful of hash functions, takes ~10 bits per item regardless of item size, and powers cache fronts, deduplication pipelines, and the very first step of every Bitcoin SPV lookup.

speed 1.7s

step 1 / 9 · 0/16 bits set

filter m=16 bits · k=3 hashes · insert apple, banana, cherry · query 4 words no false negatives · some false positives · no deletes

INSERT

TRUE POSITIVE

TRUE NEGATIVE

FALSE POSITIVE

Empty bit array · 16 bits · 3 hash functions

A bloom filter is just a bit array and a fixed set of hash functions. Insertion: hash the item with all k functions, set those k bits. Membership query: hash and check all k bits. If any is 0, the item is definitely not in the set. If all are 1, the item is probably in the set — but maybe not. Bloom filters trade certainty for space.

Bloom filter: A space-efficient probabilistic membership test. No false negatives (says "definitely not" only when it knows). Some false positives (says "maybe yes" when actually no).
m (bits): Size of the bit array. Bigger m → lower false-positive rate. Linear in space cost.
k (hash functions): Number of bits flipped per insert and checked per query. Optimal k for given m and n items: k* = (m/n) * ln(2). Too many or too few both hurt FP rate.

Why bloom filters are everywhere

In a distributed database, every read might need to check several disk files to see if a key is present. Most files don\'t have the key. A bloom filter in front of each file lets you skip the disk read entirely for the common case where the answer is no. Cassandra and RocksDB do this — every SSTable carries a bloom filter sized for ~1% FP. The 1% of false positives still hit disk; the 99% true negatives don\'t. That\'s a massive read amplification reduction.

Chrome\'s safe-browsing feature ships a bloom filter of "known bad URLs" in your browser. Most URLs aren\'t bad; the filter says "definitely safe" and no network call is needed. When a URL maybe-matches, Chrome asks Google\'s server for the real answer. Result: privacy plus low bandwidth.

Sizing one yourself

Given n expected items and a target FP rate p, the formulas are:

m = -n · ln(p) / (ln 2)² — bits needed.
k = (m/n) · ln 2 — optimal number of hash functions.

For 1 million items at 1% FP: m ≈ 9.6 million bits = 1.2 MB; k = 7. For 1 million items at 0.01% FP: 19 million bits = 2.4 MB; k = 13. The cost grows logarithmically with the FP rate you demand — going from 1% to 0.01% only doubles the size.

Cheap trick to get k hash values cheaply: hash the item once with a 128-bit hash (xxhash, murmur), split into two 64-bit halves h1, h2, and synthesize h_i = (h1 + i*h2) mod m. Empirically as good as truly independent hashes.

What you can\'t do with a bloom filter

Delete. Clearing a bit might clear it for another item that hashed to the same slot. There\'s no way to know. Use a counting bloom filter (4 bits per slot instead of 1, incremented/decremented) or switch to a cuckoo filter.
Iterate. The filter doesn\'t store the items, just bits. You can\'t list what\'s in it.
Get the value. It\'s a set, not a map. Bloom filters can\'t store associated values.
Trust a positive. Always verify with the source of truth — disk, database, network. The filter only tells you the verification might be worth doing.

Variants worth knowing

Counting bloom filter. Each slot is a small counter (e.g. 4 bits) instead of one bit. Supports deletion at the cost of 4x space.
Scalable bloom filter. Starts small and adds new filter layers when full. Useful when you don\'t know n upfront.
Cuckoo filter. Stores small fingerprints in a hash table with cuckoo-hashing displacement. Supports delete. Slightly better space for low FP rates. Becoming the modern default for new designs.
Quotient filter. Cache-friendly, mergeable, deletable. Common in storage engines.
HyperLogLog. Adjacent topic — instead of "is X in the set?" it answers "how many distinct things are in the set?" with ~12 KB total for billions of items.

What this simplifies

Toy size. 16 bits is comically small. The 25% FP rate we measured is roughly what you\'d get; real systems run at 0.1-1% by sizing the filter much larger.
Made-up hashes. Real bloom filters use murmur, xxhash, or similar fast non-cryptographic hashes. The k-hashes-from-one trick mentioned above is the standard.
Single-threaded. Concurrent insertion is non-trivial — atomic bit-set works for the data, but high-contention spots still need care.

Go deeper

Probabilistic data structures →

HyperLogLog, count-min sketch, t-digest, cuckoo filter. The family of "approximately correct, much smaller" structures you reach for when exact answers are too expensive.

Open the Codex →

Found this useful?