From "what's a load balancer" to "I can design Twitter."
Fifteen stages, ~80 topics, every one linked to a Semicolony deep dive, simulator, or worked-problem playbook entry. The architecture diagram below is the map; each stage card down the page lights up the part of the diagram it owns. Read top to bottom, or jump to wherever you're stuck.
Trade-off vocabulary
The three pairs of words every interview uses.
Senior interviews are conducted in trade-off language. Before any architecture sketch, you should be able to use these six words precisely: performance, scalability, latency, throughput, availability, consistency. Most candidates blur them.
Core
Performance vs scalability
Performance is how fast one request gets through. Scalability is how the system holds up when you add a hundred more. They look the same when load is low and diverge sharply once you hit a bottleneck.
Latency is the time one operation takes. Throughput is how many you can do per second. Little's Law ties them: concurrency = throughput × latency. Bumping one almost always costs the other.
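As a rough illustration of Little's Law, the sketch below plugs in made-up numbers (2,000 requests/s, 50 ms latency, a pool of 100 workers). The function names are hypothetical and the figures are assumptions chosen only to keep the arithmetic easy.

```python
def required_concurrency(throughput_per_s: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = throughput x latency."""
    return throughput_per_s * latency_s


def max_throughput(concurrency: float, latency_s: float) -> float:
    """Rearranged: with a fixed worker pool, throughput = concurrency / latency."""
    return concurrency / latency_s


# Assumed workload: 2,000 requests/s at 50 ms average latency.
in_flight = required_concurrency(2_000, 0.050)
print(f"{in_flight:.0f} requests in flight on average")

# Same pool of 100 workers, but latency doubles to 100 ms:
# the achievable throughput halves.
print(f"{max_throughput(100, 0.100):.0f} requests/s")
```

The second call is the interview-relevant direction: if concurrency is capped (threads, connections), any latency regression shows up directly as lost throughput.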
You cannot have both consistency and availability during a network partition. CAP says you pick. PACELC extends the question to the no-partition case (where the trade is latency vs consistency). Most real systems settle for "mostly consistent, occasionally stale" with a hard limit on staleness.
Weak, eventual, session, strong, linearisable. Five strength bands every database picks from. Postgres, DynamoDB, Spanner, Cassandra all live in different regions of the spectrum.
Failover (cold, warm, hot), replication shapes, and the math behind "five nines." Plus why redundancy past a point gives you less availability, not more.
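The arithmetic behind the "nines" can be sketched in a few lines. This assumes component failures are independent, which real systems often violate (shared racks, shared deploys), so treat the parallel figure as an upper bound. The helper names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_per_year_s(availability: float) -> float:
    """Seconds of allowed downtime per (non-leap) year."""
    return (1 - availability) * SECONDS_PER_YEAR


def in_series(*parts: float) -> float:
    """All components must be up: availabilities multiply."""
    return math.prod(parts)


def in_parallel(*parts: float) -> float:
    """At least one replica must be up (assumes independent failures)."""
    return 1 - math.prod(1 - a for a in parts)


print(round(downtime_per_year_s(0.99999) / 60, 2))  # 5.26 minutes for five nines
print(round(in_series(0.999, 0.999), 6))            # 0.998001: a chain is weaker than its links
print(round(in_parallel(0.99, 0.99), 6))            # 0.9999: two 99% replicas give four nines
```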
Everything before your origin server gets to think.
Most "fast" sites are not fast at the origin. They are fast because three-quarters of every page is served from a cache that sits within a few milliseconds of the user. Before you talk about anything happening at your app servers, know what is happening at the edge.
Core
DNS resolution
Names become IPs through a chain of cached lookups. Every request starts here. TTL is the dial that trades freshness for query volume.
A network of edge caches that sit close to users. Pull-CDNs fetch on first miss and cache; push-CDNs are pre-populated by the origin. Most sites use pull because it requires no extra plumbing.
How the request gets to an app server that's actually free.
Once a request crosses your edge, a load balancer picks which app instance handles it. The L4 vs L7 distinction, active-active vs active-passive, and the algorithms (round-robin, least-connections, consistent hash) are the bread and butter of every system-design discussion.
Core
Layer-4 vs layer-7 load balancing
L4 picks a backend based on TCP/UDP headers (fast, opaque to the request). L7 reads the HTTP path / headers and routes intelligently (slower, smarter). Both are used together in practice.
Active-active runs N healthy load balancers in parallel, each taking traffic. Active-passive keeps a hot standby that takes over when the primary fails. The trade is cost vs failover speed.
Nginx, HAProxy, Envoy. A proxy in front of your origin that handles TLS, compression, caching, rate limiting, and routing. Often colocated with the load balancer; logically a different layer.
When the request needs to land on the same backend repeatedly (sessions, sharded caches, WebSocket connections), consistent hashing limits key movement to roughly K/N of the K keys when you add or remove one of N nodes.
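A minimal hash-ring sketch shows why only a fraction of keys move. The `HashRing` class, the choice of SHA-256, and the 100 virtual nodes per backend are all assumptions for illustration, not a description of any particular load balancer.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns many points so load spreads evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["app-1", "app-2", "app-3", "app-4"])
before = {f"user:{i}": ring.lookup(f"user:{i}") for i in range(1000)}
ring.add("app-5")
moved = sum(1 for k, node in before.items() if ring.lookup(k) != node)
print(f"{moved} of 1000 keys moved after adding a fifth node")
```

With plain `hash(key) % N`, going from four to five nodes would remap most keys; here only the keys now owned by the new node change hands.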
Monoliths, microservices, and what service discovery is for.
Most systems start as a monolith and split into services when team size makes a single deploy painful. The interesting parts are not "should we use microservices" (the question is usually misframed) but how services find each other, how they share schemas, and how they handle partial failure.
Core
Monolith vs microservices
A monolith is one deployable. Microservices are many. The trade is communication cost + operational surface vs deploy velocity + ownership clarity. The right shape depends on team size, not on system size.
How does service A find a healthy instance of service B? Client-side (Consul, Eureka) puts the lookup in the caller. Server-side (an LB) puts it in the proxy. Kubernetes uses both.
If a service holds no per-user state, any instance can handle any request, which makes horizontal scaling trivial. The state has to live somewhere; push it to the data tier or to an external session store.
Postgres, MySQL, the parts every senior engineer should know.
A single relational database can take a system surprisingly far. Senior engineers know how far before reaching for sharding or NoSQL. ACID, replication shapes, MVCC, isolation levels, and the page cache are the vocabulary the data-layer discussion runs on.
Core
ACID & isolation levels
Read uncommitted, read committed, repeatable read, snapshot, serializable. Each protects you from one more class of anomaly at one more class of cost.
Multi-version concurrency control. Readers see a snapshot; writers create a new version. The key to non-blocking reads in Postgres, Oracle, and most modern engines.
The two storage shapes every modern engine picks between. B-trees for read-heavy with mixed updates (Postgres, MySQL). LSM-trees for write-heavy with periodic compaction (Cassandra, RocksDB).
One node accepts writes, N replicas serve reads. Async replication is fast but can return stale reads; sync is consistent but blocks on a slow replica. Most systems run async with read-your-writes pinning.
Split one database into multiple by function — users in one, products in another. Often the first scaling step before sharding, since each database stays simple.
KV, document, wide-column, graph — when each one fits.
NoSQL is not one thing. It is four shapes, each tuned for a different access pattern. The right way to pick is: write down your reads (the queries you actually need), then pick the shape that serves them with one round trip.
Core
Key-value stores
Redis, Memcached, DynamoDB. Look up by exact key. The simplest shape; the highest throughput per node. Used for session stores, leaderboards, rate-limiter counters.
MongoDB, Couchbase, DynamoDB-with-secondary-indexes. JSON documents keyed by ID with optional indexes on inner fields. Good for catalogs, user profiles, content systems.
Cassandra, ScyllaDB, Bigtable. Rows have a partition key + a sort key, with columns added dynamically. Tuned for very high write throughput and time-series.
Start with SQL. Move to NoSQL when one specific access pattern needs throughput or scale that SQL cannot give, and you can accept the loss of joins / strong consistency on that workload. Mixing both in one system is common.
Sharding is the move every senior interview eventually reaches for. The hard part is not splitting a table; it is picking a shard key that survives growth, avoids hot spots, and tolerates re-sharding. Denormalization is the smaller cousin: duplicate data to make the read fast.
Core
Sharding strategies
Hash partitioning spreads load evenly but breaks range scans. Range partitioning supports scans but invites hot shards. Geographic / tenant-based partitioning works when the access pattern aligns naturally.
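The difference between the first two strategies fits in two small functions. This is a sketch with hypothetical names and boundaries; real systems usually add a lookup layer so shards can be split or moved.

```python
import bisect
import hashlib
from typing import Sequence


def hash_shard(key: str, num_shards: int) -> int:
    """Hash partitioning: even spread, but adjacent keys land on unrelated shards."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key: str, upper_bounds: Sequence[str]) -> int:
    """Range partitioning: upper_bounds[i] is the exclusive upper bound of shard i.

    Keys at or above the last bound go to the final shard, so there are
    len(upper_bounds) + 1 shards in total.
    """
    return bisect.bisect_right(upper_bounds, key)


# Assumed boundaries: shard 0 holds keys < "h", shard 1 holds "h".."o", shard 2 the rest.
bounds = ["h", "p"]
for name in ["alice", "harry", "zoe"]:
    print(name, "range shard:", range_shard(name, bounds),
          "hash shard:", hash_shard(name, 3))
```

A scan over "all names starting with a to g" touches one range shard but every hash shard, which is the trade the paragraph above describes.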
When one key gets 100x the traffic of the average, your evenly-sharded system grinds on that one shard. Mitigations: a caching layer in front, per-key replication, or fanning out the writes ahead of time.
Duplicate fields across tables so the read does not need a join. Write becomes more work; read becomes one query. Trade explicit consistency for read latency.
Five layers. Each one is the right answer to a different question.
Every fast system caches at four or five layers. Knowing where to cache, what invalidation strategy fits, and what happens when the cache goes cold under a thundering herd is what the caching discussion is really about.
Core
Client-side cache
HTTP cache headers, service workers, the browser memory + disk caches. The fastest cache is the one that never sends a request.
Redis or Memcached sitting between your service and the database. Cache-aside (read-through with explicit fill), write-through (write hits cache and DB), and write-behind (cache absorbs writes, batches to DB) are the three patterns.
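Cache-aside is the pattern most often asked about, so here is a minimal sketch of it. An in-process dict stands in for Redis or Memcached, and `load_from_db` is a hypothetical callable supplied by the caller; neither reflects a specific client library.

```python
import time


class CacheAside:
    """Read path: check the cache, fall back to the database, then fill the cache."""

    def __init__(self, load_from_db, ttl_s: float = 60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_s
        self._clock = clock
        self._cache = {}  # key -> (value, expires_at); stand-in for Redis/Memcached

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                      # cache hit
        value = self._load(key)                  # miss or expired: go to the database
        self._cache[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key) -> None:
        """Call after a write so the next read refills from the database."""
        self._cache.pop(key, None)


db_reads = []
cache = CacheAside(lambda k: db_reads.append(k) or f"row-for-{k}")
cache.get("user:1")
cache.get("user:1")
print(len(db_reads))  # 1: the second read was served from the cache
```

Write-through and write-behind differ only on the write path: the former updates cache and database together, the latter acknowledges from the cache and flushes later.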
When the cache is full, who goes? LRU is the standard. LFU is better for stable hot-key sets. Frequency-aware policies dominate modern caches: Caffeine uses Window-TinyLFU, and Redis 4.0+ offers an approximate LFU policy.
TTL-based is simple. Event-driven (pub/sub on writes) is stronger but couples your services. Write-through gives you no staleness; cache-aside accepts brief staleness in exchange for simplicity.
When the work doesn't need to finish before the response.
Anything that can be done after the response should be. Email, image processing, indexing, analytics, anything with a "this will arrive within five minutes" SLA — it goes through a queue. The interesting bits are at-least-once vs exactly-once, ordering guarantees, and how the queue itself doesn't collapse under load.
Core
Message queues vs task queues
A message queue is a dumb pipe (RabbitMQ, SQS); a task queue layers application semantics on top (Celery, Sidekiq, Resque). Pick the simpler one that meets your needs.
Partitioned, append-only, consumer-tracked log. The substrate under most modern event-driven systems. Different shape from a queue: messages are replayable, partitions enforce per-key ordering.
At-most-once (drop on failure), at-least-once (retry until acked, may duplicate), exactly-once (the holy grail, expensive). Exactly-once usually means at-least-once + idempotent consumers.
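The "at-least-once + idempotent consumers" idea can be sketched as a consumer that remembers which message IDs it has already processed. The class name and the in-memory set are assumptions; a production version would keep the IDs in durable storage and record them atomically with the side effect.

```python
class IdempotentConsumer:
    """Drops redeliveries by remembering processed message IDs."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in memory; durable storage in practice

    def consume(self, message_id: str, payload) -> bool:
        """Return True if the handler ran, False if the message was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a crash or
        # exception leaves the message eligible for redelivery.
        self._seen.add(message_id)
        return True


sent_emails = []
consumer = IdempotentConsumer(sent_emails.append)
consumer.consume("msg-1", "welcome email")
consumer.consume("msg-1", "welcome email")  # broker redelivers after a lost ack
print(sent_emails)  # ['welcome email']
```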
What happens when consumers can't keep up? Buffering is the wrong answer. Slow producers, drop work intentionally, or shed load — anything except letting queues grow without bound.
When a message has retried too many times, move it to a separate queue for human inspection. Without this, a single poison message can crash all consumers.
Every system-design discussion eventually asks "what protocol do these services speak?". The answer depends on call shape (single request, streaming, bidirectional), latency budget, payload size, and operational tooling.
Core
TCP
Ordered, reliable, congestion-controlled bytes. The default. The cost is a three-way handshake (one round trip before any data flows) and head-of-line blocking.
Fire-and-forget datagrams. No ordering, no retries, no congestion control. The right answer for DNS, real-time gaming, video, anything where late is worse than missing.
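HTTP/1.1 = one in-flight request at a time per connection (connections are kept alive and reused). HTTP/2 = multiplexed streams over one TCP connection, but head-of-line blocking remains at the TCP layer. HTTP/3 = QUIC over UDP, which folds the transport and TLS handshakes together and removes cross-stream HoL blocking.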
REST is HTTP/JSON: human-readable, debuggable, slow. gRPC is HTTP/2 + Protobuf: schema-first, fast, harder to debug at the wire. Most companies use REST at the public edge, gRPC between internal services.
Persistent bidirectional connection (WebSocket) or server-to-client push (SSE). The right answer for chat, presence, live dashboards, anything pushed from server.
Powers of two, latency numbers, five-minute back-of-envelope.
Every system-design round eventually asks you to estimate. DAU to QPS, QPS to storage, storage to shards, shards to instance count. The math is simple; the practice is what makes it second-nature in the room.
Core
Powers of two
Internalise: 2^10 ≈ 1K, 2^20 ≈ 1M, 2^30 ≈ 1B, 2^40 ≈ 1T. Each step of ten in the exponent is roughly a factor of a thousand, so four numbers cover twelve orders of magnitude.
L1: 0.5 ns. RAM: 100 ns. SSD: 100 µs. Network round-trip same DC: 500 µs. Cross-continental: 100 ms. The factor-of-1000 gaps are what make the difference between a snappy system and a slow one.
A common trick: DAU × actions-per-user / 86400 = average QPS. Multiply by 3-5x for peak. A 10M-DAU service doing 5 actions/user averages ~580 QPS; peaks land between roughly 1,700 and 2,900.
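The DAU-to-QPS arithmetic, written out for the 10M-DAU example. The function name is hypothetical, and the 3x and 5x peak factors are the rule-of-thumb multipliers from the text rather than measured values.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau: int, actions_per_user: float) -> float:
    """Average queries per second, assuming load is spread evenly over the day."""
    return dau * actions_per_user / SECONDS_PER_DAY


avg = average_qps(10_000_000, 5)
print(round(avg))                        # 579
print(round(avg * 3), round(avg * 5))    # 1736 2894
```

In the room, rounding 86,400 up to 100,000 gives 500 QPS in your head, which is close enough for sizing.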
Average record size × records-per-day × days-of-retention × replication-factor × index-overhead. The replication and index multipliers are the ones candidates usually forget.
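The storage formula can be checked with a few lines. Every input below is an assumption picked for illustration (1 KB records, 50M records per day, one year of retention, 3 replicas, 30% index overhead); substitute your own.

```python
def storage_bytes(record_bytes: int, records_per_day: int, retention_days: int,
                  replication_factor: int, index_overhead: float) -> float:
    """Total bytes on disk; index_overhead is a multiplier (1.3 means +30%)."""
    return (record_bytes * records_per_day * retention_days
            * replication_factor * index_overhead)


raw = storage_bytes(1_000, 50_000_000, 365, 1, 1.0)
total = storage_bytes(1_000, 50_000_000, 365, 3, 1.3)
print(f"{raw / 1e12:.1f} TB raw")        # 18.2 TB raw
print(f"{total / 1e12:.1f} TB on disk")  # 71.2 TB on disk
```

The two multipliers together nearly quadruple the raw figure, which is why leaving them out produces an estimate that is off by more than a rounding error.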
The point where backend engineering stops being about correctness and starts being about correctness under failure. The network is not reliable, latency is not zero, machines fail. Every senior loop reaches into this stage.
Core
Idempotence
A request is idempotent if doing it twice has the same effect as doing it once. Every safe retry strategy starts here.
Naive retries turn into a retry storm when the original cause is overload. Exponential backoff + jitter spreads retries out so the dependency recovers instead of drowning further.
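One common way to combine backoff and jitter is to sleep for a random time between zero and an exponentially growing ceiling (sometimes called "full jitter"). The sketch below assumes that variant; the function name, defaults, and the injectable `sleep` parameter are hypothetical, and it retries on any exception, which real code should narrow to retryable errors.

```python
import random
import time


def call_with_retries(fn, max_attempts: int = 5, base_s: float = 0.1,
                      cap_s: float = 10.0, sleep=time.sleep):
    """Call fn(), retrying failures with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles each attempt (0.1, 0.2, 0.4, ...) up to cap_s.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            # Random delay in [0, ceiling] so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))


# Usage: a flaky call that succeeds on the third try.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("dependency overloaded")
    return "ok"

print(call_with_retries(flaky, sleep=lambda s: None))  # ok
```

Retries are only safe when the operation is idempotent, which is why idempotence comes first in this stage.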
How a distributed system decides a node is dead. Naive timeouts cause split brain; gossip + suspicion levels (SWIM, phi-accrual) plus leases is the production answer.
Security questions in system-design interviews are usually not deep cryptography. They are "how does service A authenticate to service B", "where are the secrets", and "what happens if a token leaks". Cover those and you're competitive.
Core
TLS — what it actually does
Encryption in transit + identity verification. Not the same as encryption at rest. mTLS extends identity verification to both directions (client and server each present a certificate) and is the standard between internal services.
Authorization (OAuth) is "what can you do". Authentication (OIDC, layered on top) is "who are you". Three legs (auth code with PKCE) for browsers, two legs (client credentials) for service-to-service.
A signed token that carries claims. The pitfalls: no built-in revocation, and easy to misconfigure (for example, accepting alg=none). The standard pattern is short-lived access tokens paired with long-lived refresh tokens.
Token bucket, leaky bucket, fixed window, sliding window. Four shapes that behave differently under the same load. Apply at edge, gateway, and per-service for defence in depth.
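Of the four, the token bucket is the one most often sketched on a whiteboard. The version below is a single-process illustration with hypothetical names; a shared limiter would keep the bucket state in something like Redis and update it atomically.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float, clock=time.monotonic):
        self._rate = rate_per_s
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill lazily based on elapsed time, never above capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_s=5, capacity=10)
burst = [bucket.allow() for _ in range(12)]
print(burst.count(True))  # 10: the burst drains the bucket, then requests are rejected
```

The capacity sets how large a burst is tolerated, and the refill rate sets the sustained limit; fixed and sliding windows trade that burst control for simpler counting.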
Logs, metrics, traces — and the methodologies (USE, RED) that organise them.
Three signals (logs, metrics, traces) plus two methodologies (USE for resource saturation, RED for request health) cover most of what an on-call rotation actually needs. The harder skill is not collecting the data; it is knowing what to put on the dashboard so the right thing is obvious during an incident.
Core
The three signals
Logs are events. Metrics are aggregated numbers. Traces tie a request to its constituent calls. Together they give the three windows into what your system is actually doing.
Fourteen canonical interview problems, end to end.
Reading about components and designing with them are different skills. The way to bridge the gap is to work the canonical problems out loud and defend every choice. The book to read first is Designing Data-Intensive Applications. The problems below are the laps every senior candidate practises.
Core
The 6-step framework
Scope → estimate → API → data model → high-level design → deep dive. The repeatable 45-minute pass you can do in your sleep after enough practice.