Multi-page · for backend engineers
AI systems

The AI stack, from the engineer's seat.

Not how to train a model. How the model you call runs in production: how a prompt becomes tokens, how embeddings turn meaning into coordinates, why serving is a memory problem, and how retrieval and agents bolt real systems onto a next-token predictor. Same level the rest of the codex works at — what is the system actually doing, and where do the costs hide.

All five sub-pages are live. Each links to its plain-English ELI5 front door and the matching simulator where one exists.


Live deep dives

Start here.

01 Live

How LLMs work

A language model is a next-token predictor wrapped in a loop. Tokenization, embeddings, the transformer block, attention, and autoregressive decoding: the whole path from your prompt to output emitted one token at a time, with no maths you do not need.

tokens · embeddings · attention · transformer block · decoding
Read
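As a taste of "a next-token predictor wrapped in a loop", here is a minimal sketch. The lookup table `TOY_LOGITS` and the helpers `next_token` and `generate` are invented for illustration; a real model scores the whole context with a transformer rather than looking up the last token, and usually samples rather than always taking the top score.

```python
# Toy stand-in for a model: maps the last token to scores over next tokens.
# The table and its tokens are hypothetical, chosen only to show the loop.
TOY_LOGITS = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.2},
    "cat": {"sat": 2.0, "</s>": 0.5},
    "sat": {"</s>": 3.0},
}


def next_token(context):
    """Greedy choice: return the highest-scoring next token."""
    scores = TOY_LOGITS.get(context[-1], {"</s>": 0.0})
    return max(scores, key=scores.get)


def generate(prompt, max_new_tokens=8):
    """Autoregressive decoding: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == "</s>":  # end-of-sequence token stops the loop
            break
        tokens.append(tok)
    return tokens


print(generate(["<s>"]))
```

Each pass feeds the model everything generated so far, which is why output arrives one token at a time.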
02 Live

Embeddings & vector search

Turn text into coordinates so "find similar" becomes "find nearby". What an embedding is, why cosine distance works, and how approximate nearest-neighbour indexes (HNSW, IVF) make search over a billion vectors fast enough to serve.

embeddings · cosine · ANN · HNSW · IVF
Read
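To make "find similar becomes find nearby" concrete, here is a small sketch of exact (brute-force) cosine search. The three-dimensional vectors and document names are made up; real embeddings have hundreds or thousands of dimensions, and ANN indexes such as HNSW or IVF exist to avoid scanning every vector the way this does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=1):
    """Exact search: score every stored vector, return the k best names."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings, invented for this example.
docs = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(nearest([1.0, 0.0, 0.0], docs, k=2))
```

The scan is linear in the number of vectors, which is fine for thousands and unworkable for a billion.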
03 Live

Inference & serving

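Why serving an LLM is more a memory problem than a compute one: token-by-token decode is usually limited by memory bandwidth and KV-cache capacity, while prefill is the compute-heavy phase. The KV cache, prefill vs decode, continuous batching, PagedAttention, and why throughput and latency pull in opposite directions on the same GPU.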

KV cache · prefill/decode · batching · PagedAttention · vLLM
Read
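A back-of-envelope sketch of why the KV cache dominates serving memory. The formula counts one key and one value vector per layer, per KV head, per token. The configuration below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an assumed example shape, not a claim about any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Bytes needed to hold keys and values for every cached token."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (
        keys_and_values * layers * kv_heads * head_dim
        * seq_len * batch * bytes_per_value
    )


# Assumed model shape, for illustration only.
per_token = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1, batch=1)
per_request = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1)

print(per_token / 2**20, "MiB per token")
print(per_request / 2**30, "GiB per 4096-token request")
```

Under these assumptions each cached token costs 0.5 MiB and a single 4096-token request costs 2 GiB, so the number of concurrent requests a GPU can hold is set by memory long before arithmetic throughput runs out.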
04 Live

Retrieval-augmented generation

Give the model an open-book exam. The ingestion and retrieval halves, chunking trade-offs, hybrid search, re-ranking, and how to tell whether a wrong answer came from retrieval or from generation.

chunking · hybrid search · re-ranking · evaluation · hallucination
Read
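One common way to combine keyword and vector results in hybrid search is reciprocal rank fusion (RRF); the sketch below shows that technique and is not necessarily what the sub-page uses. The document IDs are invented, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from two retrievers over the same corpus.
keyword_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that both retrievers rank well rise to the top, and because only ranks are used, the two retrievers' raw scores never need to be on the same scale.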
05 Live

Agents & tool use

What turns a chat model into something that takes actions. Tool calling, the plan-act-observe loop, memory, MCP, and the guardrails that keep an autonomous loop from doing real damage.

tool calling · ReAct · memory · MCP · guardrails
Read
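The plan-act-observe loop can be sketched in a few lines. Here `scripted_policy` is a hypothetical stand-in for the model, `TOOLS` is an invented one-entry tool registry, and the two guardrails shown (a tool allowlist and a step budget) are only the simplest examples of the idea.

```python
def add(a, b):
    return a + b


# Allowlist: the loop can only call tools registered here.
TOOLS = {"add": add}


def scripted_policy(observations):
    """Stand-in for the model: call a tool once, then give a final answer."""
    if not observations:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": f"The sum is {observations[-1]}"}


def run_agent(policy, tools, max_steps=5):
    """Plan (ask the policy), act (run the tool), observe (record the result)."""
    observations = []
    for _ in range(max_steps):  # step budget keeps the loop from running forever
        action = policy(observations)
        if "final" in action:
            return action["final"]
        name = action["tool"]
        if name not in tools:  # refuse anything outside the allowlist
            observations.append(f"error: unknown tool {name}")
            continue
        observations.append(tools[name](**action["args"]))
    return "stopped: step budget exhausted"


print(run_agent(scripted_policy, TOOLS))
```

The loop itself is small; what varies between real agents is what the policy sees, which tools are exposed, and how strict the guardrails are.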