14 · 9 steps
Visualize / 14

Bloom filter.

Sometimes you need to know whether a thing is in a set, and a hash set would be too big. A bloom filter answers either "probably in there" or "definitely not in there" — never just plain "in there." It uses a bit array and a handful of hash functions, takes ~10 bits per item regardless of item size, and powers cache fronts, deduplication pipelines, and the very first step of every Bitcoin SPV lookup.


step 1 / 9 · 0/16 bits set
filter m=16 bits · k=3 hashes · insert apple, banana, cherry · query 4 words no false negatives · some false positives · no deletes
INSERT
TRUE POSITIVE
TRUE NEGATIVE
FALSE POSITIVE
INPUT— idle —BIT ARRAY · m = 16 0bit 0 0bit 1 0bit 2 0bit 3 0bit 4 0bit 5 0bit 6 0bit 7 0bit 8 0bit 9 0bit 10 0bit 11 0bit 12 0bit 13 0bit 14 0bit 15FILL RATIO · 0/16 = 0%RUNNING TALLYinserted0true pos.0true neg.0false pos.0SIZING REFERENCEFP RATE = (1 − e^(−kn/m))^km=10n · k=7~1%m=15n · k=10~0.1%m=20n · k=14~0.01%m=24n · k=17~0.001%PER-ITEM COSTstoring the strings~30 bytes eachhash set~50 bytes eachbloom @ 1% FP10 bits eachbloom @ 0.1% FP15 bits eachREAL-WORLD USES• Cassandra SSTables• Postgres pg_bloom• RocksDB / LevelDB• Chrome safe-browsing• Bitcoin SPV wallets• CDN cache front• Spotify de-dup• Akamai & Cloudflare
Empty bit array · 16 bits · 3 hash functions

A bloom filter is just a bit array and a fixed set of hash functions. Insertion: hash the item with all k functions, set those k bits. Membership query: hash and check all k bits. If any is 0, the item is definitely not in the set. If all are 1, the item is probably in the set — but maybe not. Bloom filters trade certainty for space.

Bloom filter
A space-efficient probabilistic membership test. No false negatives (says "definitely not" only when it knows). Some false positives (says "maybe yes" when actually no).
m (bits)
Size of the bit array. Bigger m → lower false-positive rate. Linear in space cost.
k (hash functions)
Number of bits flipped per insert and checked per query. Optimal k for given m and n items: k* = (m/n) * ln(2). Too many or too few both hurt FP rate.

Why bloom filters are everywhere

In a distributed database, every read might need to check several disk files to see if a key is present. Most files don\'t have the key. A bloom filter in front of each file lets you skip the disk read entirely for the common case where the answer is no. Cassandra and RocksDB do this — every SSTable carries a bloom filter sized for ~1% FP. The 1% of false positives still hit disk; the 99% true negatives don\'t. That\'s a massive read amplification reduction.

Chrome\'s safe-browsing feature ships a bloom filter of "known bad URLs" in your browser. Most URLs aren\'t bad; the filter says "definitely safe" and no network call is needed. When a URL maybe-matches, Chrome asks Google\'s server for the real answer. Result: privacy plus low bandwidth.

Sizing one yourself

Given n expected items and a target FP rate p, the formulas are:

  • m = -n · ln(p) / (ln 2)² — bits needed.
  • k = (m/n) · ln 2 — optimal number of hash functions.

For 1 million items at 1% FP: m ≈ 9.6 million bits = 1.2 MB; k = 7. For 1 million items at 0.01% FP: 19 million bits = 2.4 MB; k = 13. The cost grows logarithmically with the FP rate you demand — going from 1% to 0.01% only doubles the size.

Cheap trick to get k hash values cheaply: hash the item once with a 128-bit hash (xxhash, murmur), split into two 64-bit halves h1, h2, and synthesize h_i = (h1 + i*h2) mod m. Empirically as good as truly independent hashes.

What you can\'t do with a bloom filter

  • Delete. Clearing a bit might clear it for another item that hashed to the same slot. There\'s no way to know. Use a counting bloom filter (4 bits per slot instead of 1, incremented/decremented) or switch to a cuckoo filter.
  • Iterate. The filter doesn\'t store the items, just bits. You can\'t list what\'s in it.
  • Get the value. It\'s a set, not a map. Bloom filters can\'t store associated values.
  • Trust a positive. Always verify with the source of truth — disk, database, network. The filter only tells you the verification might be worth doing.

Variants worth knowing

  • Counting bloom filter. Each slot is a small counter (e.g. 4 bits) instead of one bit. Supports deletion at the cost of 4x space.
  • Scalable bloom filter. Starts small and adds new filter layers when full. Useful when you don\'t know n upfront.
  • Cuckoo filter. Stores small fingerprints in a hash table with cuckoo-hashing displacement. Supports delete. Slightly better space for low FP rates. Becoming the modern default for new designs.
  • Quotient filter. Cache-friendly, mergeable, deletable. Common in storage engines.
  • HyperLogLog. Adjacent topic — instead of "is X in the set?" it answers "how many distinct things are in the set?" with ~12 KB total for billions of items.

What this simplifies

  • Toy size. 16 bits is comically small. The 25% FP rate we measured is roughly what you\'d get; real systems run at 0.1-1% by sizing the filter much larger.
  • Made-up hashes. Real bloom filters use murmur, xxhash, or similar fast non-cryptographic hashes. The k-hashes-from-one trick mentioned above is the standard.
  • Single-threaded. Concurrent insertion is non-trivial — atomic bit-set works for the data, but high-contention spots still need care.
Go deeper

Probabilistic data structures →

HyperLogLog, count-min sketch, t-digest, cuckoo filter. The family of "approximately correct, much smaller" structures you reach for when exact answers are too expensive.

Open the Codex →
Found this useful?