
How search engines work.

A few trillion pages indexed. You type three words. A hundred milliseconds later you have a ranked list of the ten most relevant. The whole machine is a pipeline: crawl, parse, tokenize, build an inverted index, then at query time do a giant set intersection and rank.


The 7 stages

INDEX TIME
URL frontier: ~10B URLs queued, prioritized
Fetcher workers: ~100k parallel, respect robots.txt
Parser: HTML → text + outlinks
Tokenize + normalize: lowercase, stem, strip stopwords
Inverted index: term → list of (docID, position)

QUERY TIME
Query: tokenize query, intersect postings
Rank: BM25, PageRank, personalization

1. URL frontier

A priority queue of billions of URLs. Newly discovered links go in, scored by domain authority, freshness, and depth. The frontier is the limiting resource for crawl coverage: a page that never reaches the front of the queue never gets fetched.
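The priority-queue idea can be sketched in a few lines. This is a toy, not how any real crawler is built: the `Frontier` class, the score weights, and the example URLs are all made up for illustration, and a production frontier would also handle per-host politeness and persistence.

```python
import heapq
import itertools


class Frontier:
    """Toy URL frontier: a max-priority queue with de-duplication."""

    def __init__(self):
        self._heap = []                # entries: (-score, seq, url)
        self._seen = set()             # every URL ever enqueued
        self._seq = itertools.count()  # tie-breaker: equal scores pop in FIFO order

    def add(self, url, authority, freshness, depth):
        """Enqueue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # Hypothetical weights: favor authoritative, fresh, shallow pages.
        score = 0.6 * authority + 0.3 * freshness - 0.1 * depth
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, next(self._seq), url))
        return True

    def pop(self):
        """Return the highest-scoring URL, or None if the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


frontier = Frontier()
frontier.add("https://example.com/", authority=0.9, freshness=0.5, depth=0)
frontier.add("https://example.org/deep/page", authority=0.2, freshness=0.9, depth=4)
frontier.add("https://example.com/", authority=0.9, freshness=0.5, depth=0)  # duplicate, ignored
print(frontier.pop())  # https://example.com/
```

The `_seen` set is what keeps the crawler from re-queuing the same link every time another page points at it.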

Why an inverted index, not a forward one

A forward index says "doc 17 contains words [the, quick, brown, fox]." A query for "fox" would have to scan every document. An inverted index turns it around: "fox" → [docs 17, 42, 99, 102, …]. Now finding a term's postings list is a single dictionary lookup, roughly constant-time per term, and the remaining cost scales with the length of those lists rather than the size of the corpus. The work moves from query time to index time (where you can afford it), and queries never touch the raw documents, only the index.
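Here is a minimal sketch of that inversion, assuming whitespace tokenization and a tiny illustrative stopword list. It stores only doc IDs, not the (docID, position) pairs a full index keeps, and the function names are invented for this example.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative only


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords (no stemming here)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_index(docs):
    """Map each term to a sorted list of the doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """AND query: intersect postings lists, starting from the shortest."""
    lists = [index.get(term, []) for term in tokenize(query)]
    if not lists:
        return []
    lists.sort(key=len)
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)
    return sorted(result)


docs = {
    17: "the quick brown fox",
    42: "a fox and a hound",
    99: "quick thinking",
}
index = build_index(docs)
print(index["fox"])                # [17, 42]
print(search(index, "quick fox"))  # [17]
```

Starting the intersection from the shortest list keeps the working set small, which is the same reason real engines process the rarest term first.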

The shard-and-merge trick

Google's index is too big for one machine, so it's sharded across thousands. Each query is sent to every shard in parallel; each returns its top-K candidates; an aggregator merges and re-ranks. With 1000 shards each handling 1/1000 of the index, the work per shard stays small, so query time barely scales with corpus size as long as shards are added in step with the corpus. Adding more shards is how you add more pages without slowing down.
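The scatter-gather pattern looks roughly like this. It is a sketch under loud assumptions: each "shard" is just an in-memory dict, threads stand in for network calls to separate machines, and the score is a bare term count rather than a real ranking function.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def shard_top_k(shard, term, k):
    """Score one shard locally and return its k best (score, doc_id) pairs."""
    scored = ((text.split().count(term), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def scatter_gather(shards, term, k):
    """Query every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: shard_top_k(s, term, k), shards))
    # The aggregator only sees k hits per shard, never the full shard contents.
    return heapq.nlargest(k, (hit for part in partials for hit in part))


shards = [
    {1: "fox fox hound", 2: "cat"},
    {3: "fox", 4: "fox fox fox"},
    {5: "dog"},
]
print(scatter_gather(shards, "fox", 2))  # [(3, 4), (2, 1)]
```

Because every shard returns at most k hits, the merge step handles at most k × (number of shards) candidates no matter how large the corpus gets.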

Ranking is the hard part

Finding documents that contain your terms is the easy step. Picking the 10 best out of millions of candidates is where the real engineering lives. Hundreds of signals contribute: term frequency, document length, link authority, query intent classification, click-through history, location, language, recency, spam scores. Google's ranking is a layered ML model trained on user behaviour. Open-source equivalents (Elasticsearch with custom plugins, OpenSearch, Vespa) ship with BM25 + boost functions you tune yourself.
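To make the BM25 baseline concrete, here is a small from-scratch scorer. It assumes whitespace tokenization, a non-empty corpus, the commonly used defaults k1 = 1.2 and b = 0.75, and the smoothed IDF form ln(1 + (N - n + 0.5) / (n + 0.5)). The function name is invented here; it is not the code any of the engines above actually run.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Return {doc_id: BM25 score} for a non-empty dict of doc_id -> text."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms get a larger IDF weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long docs need more hits for the same score.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {1: "fox fox fox", 2: "fox hound cat", 3: "dog cat bird"}
scores = bm25_scores(docs, "fox")
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # [1, 2, 3]
```

Term frequency saturates (the third "fox" adds less than the first), which is the main thing BM25 fixes over raw counting. Signals such as link authority or recency would be layered on top of a score like this one.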

Go deeper

Search internals →

BM25 derivation, term-at-a-time vs document-at-a-time scoring, vector search with embeddings, RAG architectures, hybrid lexical + semantic ranking.
