9 min read · Guide · Data structures
How it works · Data structures

How a hash table turns a key into an answer in O(1).

A function that turns any key into an array index. Constant time on average, linear time at worst. The whole craft is keeping the average case alive.

Parts01–08 InteractiveLive insertion PrereqArrays / hashing

What is a hash table?

A hash function turns each key into an array index.

A hash table is an array of buckets indexed by a hash function applied to the key. Hans Peter Luhn proposed the idea at IBM in 1953. Today: every language's map / dict / HashMap, every database index, every cache, every router's flow table. The two collision strategies are chaining (linked lists per bucket) and open addressing (probe to the next bucket).

The whole hash table is built on one trick. You have a key — a string, a number, anything hashable. You apply a hash function that turns it into an integer. You modulo by the table size. The result is the index where that key's value lives. Look-up: same hash, same index, fetch in O(1). No search, no traversal — just arithmetic.

The catch is that the hash function maps an unbounded keyspace to a bounded number of slots. Two keys can hash to the same slot — a collision. The table's strategy for resolving collisions is what defines its character.


Watch three collision strategies fill the same table

Same keys, different layouts: linear, quadratic, and chaining.

Below: an 11-slot hash table. Type a key and watch where it lands. Switch strategies and reseed — same keys, different layouts. Open addressing slots stay flat; chaining grows lists at busy slots.

0 / 11 slots · load factor 0.00
0
1
2
3
4
5
6
7
8
9
10

Open addressing vs chaining: two ways to handle collisions

Probe for another slot, or grow a list inside the slot.

When two keys collide, you have two options. Either you find another empty slot in the same array (open addressing) or you turn the slot into a list and append (chaining). Both work. They have very different operational shapes.

Open addressing

All in the array.

No pointers, cache-friendly, simple memory layout — every key sits in the same flat array. The price is sensitivity to load factor: as the table fills, probe distances grow rapidly and performance degrades. Linear probing has clustering issues; quadratic and double hashing scatter better. Used by Python dict, Java's IdentityHashMap.

Chaining

A list per slot.

Each slot holds a linked list (or small array). Collisions extend the list rather than spilling into other slots. Tolerates higher load factors gracefully. Costs memory for pointers and loses cache locality. Used by Java's HashMap, Go's map.


Load factor and when the table has to grow

Once it gets too full, the table allocates a bigger array and rehashes.

The load factor is occupied slots divided by total slots. Past a threshold (typically 0.75 for chaining, 0.5 for open addressing), performance starts to fall off. The fix is to allocate a bigger array, walk every key, rehash with the new size, copy.

Rehashing is O(n) and pauses the table. Production hash tables amortise the cost — Java grows by 2× every threshold breach, so resizes happen rarely and amortise to O(1) per insert. Some implementations incrementally rehash — moving a few keys per operation rather than all at once — to keep p99 latency tight, the same trick Redis uses for its dictionary resize.

Pick a prime size

When the table size is prime, hashes mod size distribute more evenly. Powers of two are cache-friendly but require a stronger hash function to avoid patterns. Most modern implementations pick power-of-two and rely on the hash function — but the prime-size convention persists in older code.


What makes a good hash function

It spreads keys evenly, scrambles on any change, and runs fast.

A good hash function spreads inputs uniformly across the output range, changes wildly when one bit of input changes (avalanche), and runs fast. Modern non-cryptographic hashes — xxHash, MurmurHash, CityHash — manage all three in a few ns per key. Use them.

Don't use cryptographic hashes (SHA-256) for hash tables — they're overkill on speed and unnecessary on bit quality. Don't use the older Java String.hashCode() as your only line of defence — it's known to cluster on real-world strings; the JDK now post-mixes it specifically because of this.


HashDoS: when an attacker forces every key to collide

Per-process random seeds make crafted-collision attacks fail.

If your hash function is deterministic and the attacker knows the algorithm, they can craft keys that all collide. A web server putting JSON keys into a hash map is suddenly running O(n²) per request. This is HashDoS, and it took down major frameworks until they fixed it.

The fix is per-process random seeds — every server starts with a different hash function. Python, Ruby, Java, Rust, and Go all do this. The seed is invisible to the attacker, so collision-engineering attacks fail.


Where hash tables show up, usually out of sight

Language maps, database joins, caches, kernel tables, and more.

Every language's map / dict / object is a hash table. Database query optimizers use them for joins. Caches use them for keys. JVM string interning uses one. The kernel's file-descriptor table is one. You almost certainly used a dozen hash table operations to load this page.

memcached

Distributed cache

Hash table per node, consistent-hash across nodes. The simulator above scales horizontally to a thousand boxes by sharding the keyspace — same primitives.

postgres

Hash join

Build a hash table on the smaller relation, probe with the larger. The optimizer picks this when the build side fits in memory and the join is on equality — see also database indexing for the B-tree alternative.

javascript

Object

An object is a hash table from string keys to values. V8 has a complex storage strategy — hidden classes for stable shapes, dictionary mode for dynamic ones — but the abstraction underneath is a hash map.


What slows a hash table down

Mutable keys, weak hashes, and repeated resizing all hurt.

mutable keys

Don't.

Modifying a key after it's in the table changes its hash — the old slot still references it, but lookups from the new hash miss. The table looks corrupt. Most languages refuse mutable keys; some (Python with frozenset) require explicitly immutable wrappers.

low-quality hash

Bad clustering.

If your hash function maps similar inputs to similar outputs, sequential keys land in the same area and probe distances explode. Either use a known-good hash or make sure your custom one passes basic distribution tests.

resize storms

Pre-size when you can.

If you know you'll insert a million keys, allocate the table at that size up front. Letting it grow from default-16 means dozens of rehashes; that's a lot of garbage for the collector and a lot of CPU.

How Java, Python, Go, and Rust build their hash maps

Same idea, very different defaults.

Java HashMap
Open addressing turned to chained buckets in 8.0; if a chain exceeds 8 entries it converts to a tree (red-black) for O(log n) worst case. Initial capacity 16, load factor 0.75.
Python dict
Open addressing with perturbation. Since 3.6 (CPython 3.6+, guaranteed since 3.7) keys keep insertion order — implemented via two arrays (one ordered hash table, one ordered key list). Load factor caps at ~2/3.
Go map
Open addressing with bucket-of-8 cells (each bucket holds up to 8 entries). Resize doubles capacity, incremental rehashing during writes. Source: runtime/map.go.
Rust HashMap
SwissTable algorithm (Google, 2017): single array of 16-byte groups, SIMD lookup of 16 metadata bytes at once. Faster than chained tables by ~30%. Default since Rust 1.36.
C++ std::unordered_map
Mandated chained — the spec requires reference stability, which open addressing breaks. Performance lags SwissTable variants (Abseil's flat_hash_map, F14 from Folly) by ~2× in benchmarks.
Postgres hash index
On-disk hash index. Optimised for equality lookup but rarely used in practice — B-tree handles equality fast enough plus range, so most workloads pick B-tree by default.

The SwissTable revolution. Google's 2017 CppCon talk on SwissTable changed the field. Modern hash tables use SIMD to compare 16 metadata bytes at once, getting ~3x throughput on modern x86. Rust, Abseil (Google), and F14 (Meta) all implement variants. Java and Python's stdlib hashmaps haven't adopted SwissTable yet but conversations are active.


When hash collisions became a security vulnerability

The 2003 attack and its 2011 sequel.

The original (2003). Crosby and Wallach showed that an attacker who can choose hash table keys (e.g., via HTTP form fields parsed into a server-side hash) can craft inputs that all collide into the same bucket. Insertion becomes O(n²); a 1,000-key form takes seconds to parse. Quiet vulnerability for years.

2011 — the sequel. Klink and Wäldecke at the 28C3 conference demonstrated practical exploits against PHP, Java, Python, Ruby, ASP.NET. A few KB of carefully crafted HTTP form data could pin a server's CPU at 100% for minutes. Most languages patched within days by adopting SipHash (Aumasson and Bernstein, 2012) — a randomised hash that takes a per-process key, making the input-to-bucket mapping unpredictable to attackers.

Today. Every modern language ships a randomised hash for hash tables. Python uses SipHash since 3.4 (2014); Rust uses SipHash; Go uses a randomised AES-based hash; Java's HashMap uses a hash with per-process randomisation since Java 8. The vulnerability is largely closed; the lesson — never use raw input as a hash without randomisation — is permanently in security-101 curricula.



A closing note

Hash tables are the data structure that makes most systems possible at the scale they run. One function from key to index, plus one strategy for collisions, plus one threshold for resize. The literature is decades deep, but you can ship correct, fast hash-table code knowing those three pieces.

Found this useful?