Web crawler at scale
A crawler is mostly a queue with manners. The mechanics of fetching a page are trivial; the design problem is how to walk billions of URLs without melting any one host, without re-fetching pages you already have, and without falling into one of the half-dozen traps the open web sets for naive crawlers. Walk this once and you will have a working model for the URL frontier, politeness, dedup, and the freshness trade-off — all of which recur in scrapers, indexers, and link-checkers of every shape.
1 · Clarifying questions
Five minutes. A crawler can mean a search-engine index, a price scraper, a backup service, or a security scanner — the scope question matters more here than in most designs.
| How many URLs in the index horizon? | Pin the order of magnitude. 10B unique URLs is "small search engine"; 100B is "Google-scale". We'll assume 10B. |
| Politeness budget per host? | Usually 1 request per second per host as a default; some hosts publish a Crawl-Delay, some need a much lower rate. The crawler must respect both. |
| Re-crawl interval? | Freshness is per-page. News fronts need minutes; long-tail pages can sit untouched for weeks. Budget by priority class. |
| Content types? | HTML by default. PDFs and plain text usually yes. Images, video, binaries usually no — they dominate bandwidth without paying back in index value. |
| What's stored vs discarded? | Store the raw bytes (for re-parsing later), the extracted text, and the link graph. Discard duplicates and near-duplicates. |
| Respect robots.txt? | Yes, always. The cost of not doing so is being blocked at the network layer by hosts and CDNs. |
| JavaScript-rendered pages? | Headless Chrome is roughly 10× the cost of a plain HTTP fetch. Default off; turn on per-domain when a heuristic flags an empty body. |
| Multi-region? | Yes for fetch (latency to far-away hosts dominates). Storage can be single-region with cross-region replicas. |
2 · Capacity math, on a napkin
Before any boxes. The numbers drive the frontier size, the bandwidth budget, and the storage tier choice.
| Number | Calculation | Result |
|---|---|---|
| Unique URLs in index | given | 10B |
| Average page size | HTML + headers, post-gzip | ~100 KB |
| Raw storage | 10B × 100 KB | ~1 PB |
| Crawl rate (sustained) | 1B URLs/day / 86,400 | ~12K URLs/sec |
| Peak burst (4×) | 4 × sustained | ~48K URLs/sec |
| Bandwidth (sustained) | 12K × 100 KB | ~1 GB/sec → ~80 TB/day |
| DNS lookups (uncached) | ~12K/sec | cache reduces by ~100× |
| Frontier size (in-flight) | ~1B URLs × 64 B/entry | ~64 GB sharded |
| Dedup Bloom filter | 10B URLs × 10 bits | ~12 GB, FPR ~1% |
| Parsed text store | 10B × 10 KB extracted | ~100 TB |
The numbers tell the story before the architecture does. 1 GB/sec of sustained ingest is "this is a bandwidth problem before it is a CPU problem". 12 GB of Bloom filter is "we can keep dedup in memory; we don't need a database round-trip on every URL".
3 · API and data model
A crawler is mostly an internal system; the public surface is the seed-injection API and the operator dashboard. The data model is where the design choices live.
Endpoints (internal)
POST /v1/seeds # inject URLs into the frontier
{"urls": ["https://example.com/", ...], "priority": "normal"}
→ 202 {"accepted": 1024, "duplicates": 17}
GET /v1/url/:hash # look up crawl status
→ 200 {"url":"...", "last_crawled":"...", "status":200, "etag":"..."}
POST /v1/policy/host # override politeness
{"host":"slow-site.example", "rps": 0.2}
→ 200
GET /v1/stats # operator dashboard feed
→ 200 {"fetched_24h": 1.02e9, "errors_24h": 4.3e7, ...}Schema
urls -- canonical record per URL
url_hash BYTEA PRIMARY KEY -- SHA-256(canonical_url)
canonical_url TEXT NOT NULL
host TEXT NOT NULL
last_crawled TIMESTAMP NULL
etag TEXT NULL
content_hash BYTEA NULL -- dedup near-duplicates
status SMALLINT NULL -- last HTTP status
priority SMALLINT NOT NULL
next_crawl_at TIMESTAMP NULL
INDEX (host, next_crawl_at) -- scheduler pull
frontier -- the live queue
host TEXT NOT NULL
scheduled_at TIMESTAMP NOT NULL
url_hash BYTEA NOT NULL
priority SMALLINT NOT NULL
PRIMARY KEY (host, scheduled_at, url_hash)
host_politeness -- per-host gate
host TEXT PRIMARY KEY
last_request_at TIMESTAMP NOT NULL
allowed_after TIMESTAMP NOT NULL
rps REAL NOT NULL -- crawl-delay-derived
robots_fetched TIMESTAMP NULL
crawled_content -- pointer to object store
url_hash BYTEA PRIMARY KEY
fetched_at TIMESTAMP NOT NULL
content_addr TEXT NOT NULL -- s3://bucket/aa/bb/
bytes INT NOT NULL
content_type TEXT NOT NULL The hot path reads host_politeness and writes to frontier. Raw bytes go to object storage keyed by content hash, which gives near-duplicate dedup for free.
4 · High-level architecture
The boxes-and-arrows. Crawlers split cleanly into the fetch loop (URL frontier → DNS → fetcher → store) and the discover loop (parser → dedup → frontier). The scheduler bridges them.
Two loops, one queue. The fetch loop drains the frontier through the politeness gate; the discover loop pushes newly-extracted URLs back into the frontier after dedup. The renderer is an opt-in path that runs about 10× slower than plain HTTP.
5 · The hard part — the URL frontier
The frontier is the queue of URLs waiting to be crawled. Sounds simple. The reason it is the heart of every serious crawler is that it has to satisfy three constraints at once: enforce politeness per host, schedule by priority, and answer "have I seen this URL?" in O(1).
Per-host queues
Politeness is per-host, so the frontier is too. Conceptually, one FIFO per host, with
a wall-clock allowed_after on each queue. A single global queue would
let one popular host starve the others; a single global rate cap would let one fast
host hog the bandwidth and leave slow hosts unvisited.
In practice the frontier is implemented as a two-tier structure (Mercator-style):
a back queue per host holds the URLs in order, and a front queue
holds the host identifiers in priority order. The scheduler pops a host from the
front queue, pops a URL from that host's back queue, and re-inserts the host into
the front queue with the next allowed_after time. Heap-based, O(log H)
where H is the number of hosts.
Dedup with Bloom filters
Before adding a URL to the frontier, ask: have we seen this URL before? At 10B URLs, a hash set in memory is too big (~600 GB), and a database round-trip per URL kills throughput. A Bloom filter at 10 bits per entry is ~12 GB total, fits in memory on one shard, and gives a ~1% false-positive rate. A false positive means we skip a URL we have not seen — annoying but not fatal; the URL gets re-discovered later through another link.
Canonical URLs
Same page, many URLs: ?utm_source=…, trailing slashes, scheme variants,
session IDs. Canonicalise before hashing. Strip known tracking parameters, normalise
the host case, sort query parameters, follow <link rel="canonical">
when the parser finds one. Without canonicalisation, dedup misses by 2–5×.
Priority and freshness
Front-queue priority is a combination of static importance (PageRank-like) and freshness deficit (time since last crawl ÷ expected re-crawl interval). News fronts float to the top; long-tail product pages sink. Re-crawl intervals are learned per page from observed change rate — pages that change every fetch get re-fetched frequently; pages that have not changed in a year get a 30-day TTL.
6 · Failure modes & runbook
The open web is hostile. Most of a crawler's code is failure handling, not happy path. The interview-room version of the operational story.
| Failure | Symptom | Mitigation |
|---|---|---|
| Host returns 5xx | Repeated failure on one host | Exponential backoff per host (1 min → 1 h → 6 h). Quarantine after N consecutive failures; revisit in 24 h. |
| DNS resolution fails | NXDOMAIN or timeout | Retry with backoff; quarantine the host for 1 h on persistent failure. Log to dead-letter for operator review. |
| robots.txt blocks | Disallow rule matches the URL | Respect it. Mark the URL as status = -1 and skip permanently. Re-check robots.txt every 24 h per host. |
| Content too large | Content-Length > 10 MB or transfer exceeds limit | Truncate at the limit; record truncated = true. Do not retry — the limit is a policy choice, not a transient error. |
| Redirect loop | 3xx chain longer than 5 hops | Track redirect depth in the fetcher; abort at depth 5. Record the URL as status = 599 (custom) for operator visibility. |
| Crawler trap (infinite calendar / faceted nav) | One host produces unlimited URLs per crawl | Per-host crawl-depth limit (default 8). URL-similarity heuristic: if > 100 URLs differ only in a date or query parameter, sample 10 and drop the rest. |
| Bloom filter overflow | False-positive rate above target | Rotate to a fresh, larger filter; keep the old one for a grace period as a secondary check. |
| Politeness violation | Host returns 429 / blocks our IP range | Halve the per-host rps immediately. Page if it happens to a top-1000 host. |
7 · Cost & operability
A ballpark in 2026 cloud prices. The cost shape is unusual: bandwidth and storage dominate; CPU is comparatively cheap.
| Line | Estimate | Note |
|---|---|---|
| Egress / ingress bandwidth | ~$30K / month | ~80 TB/day in; inbound is usually free, but cross-AZ transfer adds up. |
| Fetcher compute (500 pods) | ~$8K / month | I/O-bound; HTTP/2 keepalives reduce per-connection cost. |
| Renderer compute (50 pods, opt-in) | ~$5K / month | Headless Chrome is ~10× CPU and memory per page vs plain fetch. |
| Raw content storage (1 PB) | ~$15K / month | ~$15 / TB / month on S3 standard; ~$4 / TB on S3 IA for older crawls. |
| Frontier + URL DB | ~$4K / month | Sharded KV; mostly RAM-bound for the politeness state. |
| Bloom filter hosts | ~$1K / month | 3 × 32 GB RAM nodes for the dedup filter and its replicas. |
| Total | ~$65K / month | Per billion URLs/month indexed: ~$2 / M URLs. |
8 · Trade-offs & "what would you change at 10×"
The interviewer's last 5 minutes. These are the answers that distinguish a competent design from a thoughtful one.
| If… | Then… |
|---|---|
| Focused crawl (single vertical) | Drop the generic priority function; replace with a relevance classifier on the fetched page. Stop following links from low-scoring pages. Cuts crawl volume by 10–100× for the same index value. |
| Broad crawl (search index) | The architecture above. Politeness and dedup dominate; freshness is per-priority-tier. |
| JavaScript rendering everywhere | Roughly 10× cost on compute and memory. Usually not worth it; instead, render on demand when a heuristic (empty body, single <script> tag, content-length mismatch) flags a likely SPA. |
| Domain-specific path (news, e-commerce) | Add a sidecar with per-domain extractors and sitemap consumers. Sitemaps cut frontier discovery cost by 10–100× and let you respect <lastmod> for freshness. |
| Multi-region | Fetchers in each region crawl their local web (latency wins). Storage is single-region with cross-region replicas; the frontier is partitioned by host country, not by request origin. |
| 10× the URL count | The Bloom filter is the first thing that breaks. Switch to a partitioned filter, one per shard, with consistent hashing. The frontier also needs to spill to disk — see Mercator's two-tier disk-backed queue. |
| "What would a more senior answer add?" | Talk about adversarial hosts (cloaking, IP-based rate limits, captchas), the legal / TOS layer, and the cost of running a near-duplicate detector (MinHash / SimHash) on top of exact-hash dedup. The honest next-tier answer is the moderation and abuse pipeline — most candidates skip it. |
Further reading
- Heydon & Najork — "Mercator: A Scalable, Extensible Web Crawler" (1999). The canonical paper on the two-tier frontier. Old but the model is still right.
- Olston & Najork — "Web Crawling" (2010). A survey covering politeness, freshness, dedup, and the trade-offs between them. The best single reference if you read only one.
- Common Crawl — monthly crawl reports. Public data and operator writeups from one of the few open large-scale crawlers; useful for sanity-checking your numbers.
- Google — "Sitemaps protocol". The other half of the freshness story; respecting sitemaps cuts wasted fetches dramatically.
- Cloudflare — "How we built our internet-scale crawler" (2023). A modern operator's view; covers bot detection and the politeness-versus-freshness tension.
- Adjacent: Bloom filter simulator. The memory–FPR curve you cite when defending the dedup choice.
- Adjacent: How DNS works. A crawler is the worst-case DNS client; caching is non-optional.
- Adjacent: Rate limiter simulator. The per-host politeness gate is the same algorithm at a smaller scale.