10 / 19
Playbook / 10

Web crawler at scale

A crawler is mostly a queue with manners. The mechanics of fetching a page are trivial; the design problem is how to walk billions of URLs without melting any one host, without re-fetching pages you already have, and without falling into one of the half-dozen traps the open web sets for naive crawlers. Walk this once and you will have a working model for the URL frontier, politeness, dedup, and the freshness trade-off — all of which recur in scrapers, indexers, and link-checkers of every shape.


1 · Clarifying questions

Five minutes. A crawler can mean a search-engine index, a price scraper, a backup service, or a security scanner — the scope question matters more here than in most designs.

How many URLs in the index horizon?Pin the order of magnitude. 10B unique URLs is "small search engine"; 100B is "Google-scale". We'll assume 10B.
Politeness budget per host?Usually 1 request per second per host as a default; some hosts publish a Crawl-Delay, some need a much lower rate. The crawler must respect both.
Re-crawl interval?Freshness is per-page. News fronts need minutes; long-tail pages can sit untouched for weeks. Budget by priority class.
Content types?HTML by default. PDFs and plain text usually yes. Images, video, binaries usually no — they dominate bandwidth without paying back in index value.
What's stored vs discarded?Store the raw bytes (for re-parsing later), the extracted text, and the link graph. Discard duplicates and near-duplicates.
Respect robots.txt?Yes, always. The cost of not doing so is being blocked at the network layer by hosts and CDNs.
JavaScript-rendered pages?Headless Chrome is roughly 10× the cost of a plain HTTP fetch. Default off; turn on per-domain when a heuristic flags an empty body.
Multi-region?Yes for fetch (latency to far-away hosts dominates). Storage can be single-region with cross-region replicas.

2 · Capacity math, on a napkin

Before any boxes. The numbers drive the frontier size, the bandwidth budget, and the storage tier choice.

NumberCalculationResult
Unique URLs in indexgiven10B
Average page sizeHTML + headers, post-gzip~100 KB
Raw storage10B × 100 KB~1 PB
Crawl rate (sustained)1B URLs/day / 86,400~12K URLs/sec
Peak burst (4×)4 × sustained~48K URLs/sec
Bandwidth (sustained)12K × 100 KB~1 GB/sec → ~80 TB/day
DNS lookups (uncached)~12K/seccache reduces by ~100×
Frontier size (in-flight)~1B URLs × 64 B/entry~64 GB sharded
Dedup Bloom filter10B URLs × 10 bits~12 GB, FPR ~1%
Parsed text store10B × 10 KB extracted~100 TB

The numbers tell the story before the architecture does. 1 GB/sec of sustained ingest is "this is a bandwidth problem before it is a CPU problem". 12 GB of Bloom filter is "we can keep dedup in memory; we don't need a database round-trip on every URL".

3 · API and data model

A crawler is mostly an internal system; the public surface is the seed-injection API and the operator dashboard. The data model is where the design choices live.

Endpoints (internal)

POST /v1/seeds                           # inject URLs into the frontier
{"urls": ["https://example.com/", ...], "priority": "normal"}
→ 202 {"accepted": 1024, "duplicates": 17}

GET  /v1/url/:hash                       # look up crawl status
→ 200 {"url":"...", "last_crawled":"...", "status":200, "etag":"..."}

POST /v1/policy/host                     # override politeness
{"host":"slow-site.example", "rps": 0.2}
→ 200

GET  /v1/stats                           # operator dashboard feed
→ 200 {"fetched_24h": 1.02e9, "errors_24h": 4.3e7, ...}

Schema

urls                                       -- canonical record per URL
  url_hash       BYTEA PRIMARY KEY          -- SHA-256(canonical_url)
  canonical_url  TEXT NOT NULL
  host           TEXT NOT NULL
  last_crawled   TIMESTAMP NULL
  etag           TEXT NULL
  content_hash   BYTEA NULL                 -- dedup near-duplicates
  status         SMALLINT NULL              -- last HTTP status
  priority       SMALLINT NOT NULL
  next_crawl_at  TIMESTAMP NULL
  INDEX (host, next_crawl_at)               -- scheduler pull

frontier                                    -- the live queue
  host           TEXT NOT NULL
  scheduled_at   TIMESTAMP NOT NULL
  url_hash       BYTEA NOT NULL
  priority       SMALLINT NOT NULL
  PRIMARY KEY (host, scheduled_at, url_hash)

host_politeness                             -- per-host gate
  host             TEXT PRIMARY KEY
  last_request_at  TIMESTAMP NOT NULL
  allowed_after    TIMESTAMP NOT NULL
  rps              REAL NOT NULL            -- crawl-delay-derived
  robots_fetched   TIMESTAMP NULL

crawled_content                             -- pointer to object store
  url_hash       BYTEA PRIMARY KEY
  fetched_at     TIMESTAMP NOT NULL
  content_addr   TEXT NOT NULL              -- s3://bucket/aa/bb/
  bytes          INT NOT NULL
  content_type   TEXT NOT NULL

The hot path reads host_politeness and writes to frontier. Raw bytes go to object storage keyed by content hash, which gives near-duplicate dedup for free.

4 · High-level architecture

The boxes-and-arrows. Crawlers split cleanly into the fetch loop (URL frontier → DNS → fetcher → store) and the discover loop (parser → dedup → frontier). The scheduler bridges them.

Two loops, one queue. The fetch loop drains the frontier through the politeness gate; the discover loop pushes newly-extracted URLs back into the frontier after dedup. The renderer is an opt-in path that runs about 10× slower than plain HTTP.

5 · The hard part — the URL frontier

The frontier is the queue of URLs waiting to be crawled. Sounds simple. The reason it is the heart of every serious crawler is that it has to satisfy three constraints at once: enforce politeness per host, schedule by priority, and answer "have I seen this URL?" in O(1).

Per-host queues

Politeness is per-host, so the frontier is too. Conceptually, one FIFO per host, with a wall-clock allowed_after on each queue. A single global queue would let one popular host starve the others; a single global rate cap would let one fast host hog the bandwidth and leave slow hosts unvisited.

In practice the frontier is implemented as a two-tier structure (Mercator-style): a back queue per host holds the URLs in order, and a front queue holds the host identifiers in priority order. The scheduler pops a host from the front queue, pops a URL from that host's back queue, and re-inserts the host into the front queue with the next allowed_after time. Heap-based, O(log H) where H is the number of hosts.

Dedup with Bloom filters

Before adding a URL to the frontier, ask: have we seen this URL before? At 10B URLs, a hash set in memory is too big (~600 GB), and a database round-trip per URL kills throughput. A Bloom filter at 10 bits per entry is ~12 GB total, fits in memory on one shard, and gives a ~1% false-positive rate. A false positive means we skip a URL we have not seen — annoying but not fatal; the URL gets re-discovered later through another link.

The memory–accuracy trade-off. Doubling bits per entry (to 20) drops the false-positive rate to ~0.05% and the filter grows to 24 GB. Halving to 5 bits raises it to ~10% and the filter shrinks to 6 GB. Most production crawlers sit at 8–12 bits per entry — enough accuracy that you do not skip much, small enough to fit in RAM. See the Bloom filter simulator for the curve.

Canonical URLs

Same page, many URLs: ?utm_source=…, trailing slashes, scheme variants, session IDs. Canonicalise before hashing. Strip known tracking parameters, normalise the host case, sort query parameters, follow <link rel="canonical"> when the parser finds one. Without canonicalisation, dedup misses by 2–5×.

Priority and freshness

Front-queue priority is a combination of static importance (PageRank-like) and freshness deficit (time since last crawl ÷ expected re-crawl interval). News fronts float to the top; long-tail product pages sink. Re-crawl intervals are learned per page from observed change rate — pages that change every fetch get re-fetched frequently; pages that have not changed in a year get a 30-day TTL.

6 · Failure modes & runbook

The open web is hostile. Most of a crawler's code is failure handling, not happy path. The interview-room version of the operational story.

FailureSymptomMitigation
Host returns 5xxRepeated failure on one hostExponential backoff per host (1 min → 1 h → 6 h). Quarantine after N consecutive failures; revisit in 24 h.
DNS resolution failsNXDOMAIN or timeoutRetry with backoff; quarantine the host for 1 h on persistent failure. Log to dead-letter for operator review.
robots.txt blocksDisallow rule matches the URLRespect it. Mark the URL as status = -1 and skip permanently. Re-check robots.txt every 24 h per host.
Content too largeContent-Length > 10 MB or transfer exceeds limitTruncate at the limit; record truncated = true. Do not retry — the limit is a policy choice, not a transient error.
Redirect loop3xx chain longer than 5 hopsTrack redirect depth in the fetcher; abort at depth 5. Record the URL as status = 599 (custom) for operator visibility.
Crawler trap (infinite calendar / faceted nav)One host produces unlimited URLs per crawlPer-host crawl-depth limit (default 8). URL-similarity heuristic: if > 100 URLs differ only in a date or query parameter, sample 10 and drop the rest.
Bloom filter overflowFalse-positive rate above targetRotate to a fresh, larger filter; keep the old one for a grace period as a secondary check.
Politeness violationHost returns 429 / blocks our IP rangeHalve the per-host rps immediately. Page if it happens to a top-1000 host.

7 · Cost & operability

A ballpark in 2026 cloud prices. The cost shape is unusual: bandwidth and storage dominate; CPU is comparatively cheap.

LineEstimateNote
Egress / ingress bandwidth~$30K / month~80 TB/day in; inbound is usually free, but cross-AZ transfer adds up.
Fetcher compute (500 pods)~$8K / monthI/O-bound; HTTP/2 keepalives reduce per-connection cost.
Renderer compute (50 pods, opt-in)~$5K / monthHeadless Chrome is ~10× CPU and memory per page vs plain fetch.
Raw content storage (1 PB)~$15K / month~$15 / TB / month on S3 standard; ~$4 / TB on S3 IA for older crawls.
Frontier + URL DB~$4K / monthSharded KV; mostly RAM-bound for the politeness state.
Bloom filter hosts~$1K / month3 × 32 GB RAM nodes for the dedup filter and its replicas.
Total~$65K / monthPer billion URLs/month indexed: ~$2 / M URLs.
The freshness audit cost nobody plans for. Forcing a re-crawl of a subset to validate the index (have these pages changed?) is a hidden line item. A 1% sample of 10B URLs is 100M re-fetches — about $0.30 in compute but ~$1K in bandwidth if the pages have changed. Most teams discover this when they try to justify why "the index feels stale" cannot be fixed without a budget conversation.

8 · Trade-offs & "what would you change at 10×"

The interviewer's last 5 minutes. These are the answers that distinguish a competent design from a thoughtful one.

If…Then…
Focused crawl (single vertical)Drop the generic priority function; replace with a relevance classifier on the fetched page. Stop following links from low-scoring pages. Cuts crawl volume by 10–100× for the same index value.
Broad crawl (search index)The architecture above. Politeness and dedup dominate; freshness is per-priority-tier.
JavaScript rendering everywhereRoughly 10× cost on compute and memory. Usually not worth it; instead, render on demand when a heuristic (empty body, single <script> tag, content-length mismatch) flags a likely SPA.
Domain-specific path (news, e-commerce)Add a sidecar with per-domain extractors and sitemap consumers. Sitemaps cut frontier discovery cost by 10–100× and let you respect <lastmod> for freshness.
Multi-regionFetchers in each region crawl their local web (latency wins). Storage is single-region with cross-region replicas; the frontier is partitioned by host country, not by request origin.
10× the URL countThe Bloom filter is the first thing that breaks. Switch to a partitioned filter, one per shard, with consistent hashing. The frontier also needs to spill to disk — see Mercator's two-tier disk-backed queue.
"What would a more senior answer add?"Talk about adversarial hosts (cloaking, IP-based rate limits, captchas), the legal / TOS layer, and the cost of running a near-duplicate detector (MinHash / SimHash) on top of exact-hash dedup. The honest next-tier answer is the moderation and abuse pipeline — most candidates skip it.

Further reading

  • Heydon & Najork — "Mercator: A Scalable, Extensible Web Crawler" (1999). The canonical paper on the two-tier frontier. Old but the model is still right.
  • Olston & Najork — "Web Crawling" (2010). A survey covering politeness, freshness, dedup, and the trade-offs between them. The best single reference if you read only one.
  • Common Crawl — monthly crawl reports. Public data and operator writeups from one of the few open large-scale crawlers; useful for sanity-checking your numbers.
  • Google — "Sitemaps protocol". The other half of the freshness story; respecting sitemaps cuts wasted fetches dramatically.
  • Cloudflare — "How we built our internet-scale crawler" (2023). A modern operator's view; covers bot detection and the politeness-versus-freshness tension.
  • Adjacent: Bloom filter simulator. The memory–FPR curve you cite when defending the dedup choice.
  • Adjacent: How DNS works. A crawler is the worst-case DNS client; caching is non-optional.
  • Adjacent: Rate limiter simulator. The per-host politeness gate is the same algorithm at a smaller scale.
Found this useful?