How a hash table turns a key into an answer in O(1).
A function that turns any key into an array index. Constant time on average, linear time at worst. The whole craft is keeping the average case alive.
What is a hash table?
A hash function turns each key into an array index.
A hash table is an array of buckets indexed by a hash function applied to the key. Hans Peter Luhn proposed the idea at IBM in 1953. Today: every language's map / dict / HashMap, every database index, every cache, every router's flow table. The two collision strategies are chaining (linked lists per bucket) and open addressing (probe to the next bucket).
The whole hash table is built on one trick. You have a key — a string, a number, anything hashable. You apply a hash function that turns it into an integer. You modulo by the table size. The result is the index where that key's value lives. Look-up: same hash, same index, fetch in O(1). No search, no traversal — just arithmetic.
The catch is that the hash function maps an unbounded keyspace to a bounded number of slots. Two keys can hash to the same slot — a collision. The table's strategy for resolving collisions is what defines its character.
Watch three collision strategies fill the same table
Same keys, different layouts: linear, quadratic, and chaining.
Below: an 11-slot hash table. Type a key and watch where it lands. Switch strategies and reseed — same keys, different layouts. Open addressing slots stay flat; chaining grows lists at busy slots.
Open addressing vs chaining: two ways to handle collisions
Probe for another slot, or grow a list inside the slot.
When two keys collide, you have two options. Either you find another empty slot in the same array (open addressing) or you turn the slot into a list and append (chaining). Both work. They have very different operational shapes.
All in the array.
No pointers, cache-friendly, simple memory layout — every key sits in the same flat array. The price is sensitivity to load factor: as the table fills, probe distances grow rapidly and performance degrades. Linear probing has clustering issues; quadratic and double hashing scatter better. Used by Python dict, Java's IdentityHashMap.
A list per slot.
Each slot holds a linked list (or small array). Collisions extend the list rather than spilling into other slots. Tolerates higher load factors gracefully. Costs memory for pointers and loses cache locality. Used by Java's HashMap, Go's map.
Load factor and when the table has to grow
Once it gets too full, the table allocates a bigger array and rehashes.
The load factor is occupied slots divided by total slots. Past a threshold (typically 0.75 for chaining, 0.5 for open addressing), performance starts to fall off. The fix is to allocate a bigger array, walk every key, rehash with the new size, copy.
Rehashing is O(n) and pauses the table. Production hash tables amortise the cost — Java grows by 2× every threshold breach, so resizes happen rarely and amortise to O(1) per insert. Some implementations incrementally rehash — moving a few keys per operation rather than all at once — to keep p99 latency tight, the same trick Redis uses for its dictionary resize.
When the table size is prime, hashes mod size distribute more evenly. Powers of two are cache-friendly but require a stronger hash function to avoid patterns. Most modern implementations pick power-of-two and rely on the hash function — but the prime-size convention persists in older code.
What makes a good hash function
It spreads keys evenly, scrambles on any change, and runs fast.
A good hash function spreads inputs uniformly across the output range, changes wildly when one bit of input changes (avalanche), and runs fast. Modern non-cryptographic hashes — xxHash, MurmurHash, CityHash — manage all three in a few ns per key. Use them.
Don't use cryptographic hashes (SHA-256) for hash tables — they're overkill on speed and unnecessary on bit quality. Don't use the older Java String.hashCode() as your only line of defence — it's known to cluster on real-world strings; the JDK now post-mixes it specifically because of this.
HashDoS: when an attacker forces every key to collide
Per-process random seeds make crafted-collision attacks fail.
If your hash function is deterministic and the attacker knows the algorithm, they can craft keys that all collide. A web server putting JSON keys into a hash map is suddenly running O(n²) per request. This is HashDoS, and it took down major frameworks until they fixed it.
The fix is per-process random seeds — every server starts with a different hash function. Python, Ruby, Java, Rust, and Go all do this. The seed is invisible to the attacker, so collision-engineering attacks fail.
Where hash tables show up, usually out of sight
Language maps, database joins, caches, kernel tables, and more.
Every language's map / dict / object is a hash table. Database query optimizers use them for joins. Caches use them for keys. JVM string interning uses one. The kernel's file-descriptor table is one. You almost certainly used a dozen hash table operations to load this page.
Distributed cache
Hash table per node, consistent-hash across nodes. The simulator above scales horizontally to a thousand boxes by sharding the keyspace — same primitives.
Hash join
Build a hash table on the smaller relation, probe with the larger. The optimizer picks this when the build side fits in memory and the join is on equality — see also database indexing for the B-tree alternative.
Object
An object is a hash table from string keys to values. V8 has a complex storage strategy — hidden classes for stable shapes, dictionary mode for dynamic ones — but the abstraction underneath is a hash map.
What slows a hash table down
Mutable keys, weak hashes, and repeated resizing all hurt.
Don't.
Modifying a key after it's in the table changes its hash — the old slot still references it, but lookups from the new hash miss. The table looks corrupt. Most languages refuse mutable keys; some (Python with frozenset) require explicitly immutable wrappers.
Bad clustering.
If your hash function maps similar inputs to similar outputs, sequential keys land in the same area and probe distances explode. Either use a known-good hash or make sure your custom one passes basic distribution tests.
Pre-size when you can.
If you know you'll insert a million keys, allocate the table at that size up front. Letting it grow from default-16 means dozens of rehashes; that's a lot of garbage for the collector and a lot of CPU.
How Java, Python, Go, and Rust build their hash maps
Same idea, very different defaults.
- Java HashMap
- Open addressing turned to chained buckets in 8.0; if a chain exceeds 8 entries it converts to a tree (red-black) for O(log n) worst case. Initial capacity 16, load factor 0.75.
- Python dict
- Open addressing with perturbation. Since 3.6 (CPython 3.6+, guaranteed since 3.7) keys keep insertion order — implemented via two arrays (one ordered hash table, one ordered key list). Load factor caps at ~2/3.
- Go map
- Open addressing with bucket-of-8 cells (each bucket holds up to 8 entries). Resize doubles capacity, incremental rehashing during writes. Source: runtime/map.go.
- Rust HashMap
- SwissTable algorithm (Google, 2017): single array of 16-byte groups, SIMD lookup of 16 metadata bytes at once. Faster than chained tables by ~30%. Default since Rust 1.36.
- C++ std::unordered_map
- Mandated chained — the spec requires reference stability, which open addressing breaks. Performance lags SwissTable variants (Abseil's flat_hash_map, F14 from Folly) by ~2× in benchmarks.
- Postgres hash index
- On-disk hash index. Optimised for equality lookup but rarely used in practice — B-tree handles equality fast enough plus range, so most workloads pick B-tree by default.
The SwissTable revolution. Google's 2017 CppCon talk on SwissTable changed the field. Modern hash tables use SIMD to compare 16 metadata bytes at once, getting ~3x throughput on modern x86. Rust, Abseil (Google), and F14 (Meta) all implement variants. Java and Python's stdlib hashmaps haven't adopted SwissTable yet but conversations are active.
When hash collisions became a security vulnerability
The 2003 attack and its 2011 sequel.
The original (2003). Crosby and Wallach showed that an attacker who can choose hash table keys (e.g., via HTTP form fields parsed into a server-side hash) can craft inputs that all collide into the same bucket. Insertion becomes O(n²); a 1,000-key form takes seconds to parse. Quiet vulnerability for years.
2011 — the sequel. Klink and Wäldecke at the 28C3 conference demonstrated practical exploits against PHP, Java, Python, Ruby, ASP.NET. A few KB of carefully crafted HTTP form data could pin a server's CPU at 100% for minutes. Most languages patched within days by adopting SipHash (Aumasson and Bernstein, 2012) — a randomised hash that takes a per-process key, making the input-to-bucket mapping unpredictable to attackers.
Today. Every modern language ships a randomised hash for hash tables. Python uses SipHash since 3.4 (2014); Rust uses SipHash; Go uses a randomised AES-based hash; Java's HashMap uses a hash with per-process randomisation since Java 8. The vulnerability is largely closed; the lesson — never use raw input as a hash without randomisation — is permanently in security-101 curricula.
Hash tables are the data structure that makes most systems possible at the scale they run. One function from key to index, plus one strategy for collisions, plus one threshold for resize. The literature is decades deep, but you can ship correct, fast hash-table code knowing those three pieces.
Further reading on hash tables
Primary sources, in order.
- WikipediaHash tableSurprisingly thorough. Good visual explanations of probing, chaining, and Robin Hood hashing.
- probablydanceA new fast hash tableMalte Skarupke walks through real-world hash-table tuning — why Robin Hood hashing wins, where it loses.
- Semicolony guideDatabase indexingHash indexes vs B-tree indexes — same trade studied at storage scale.
- Semicolony simulatorBloom filterA different shape on the same hashing primitive — probabilistic membership.