The AI stack, from the engineer's seat.
Not how to train a model. How the model you call runs in production: how a prompt becomes tokens, how embeddings turn meaning into coordinates, why serving is a memory problem, and how retrieval and agents bolt real systems onto a next-token predictor. Same level the rest of the codex works at — what is the system actually doing, and where do the costs hide.
All five sub-pages are live. Each links to its plain-English ELI5 front door and the matching simulator where one exists.
Start here.
How LLMs work
A language model is a next-token predictor wrapped in a loop. Tokenization, embeddings, the transformer block, attention, and autoregressive decoding — the whole path from your prompt to one word at a time, with no maths you do not need.
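To make "a next-token predictor wrapped in a loop" concrete, here is a minimal sketch of greedy autoregressive decoding. The lookup table `TOY_SCORES` and the helper names are hypothetical stand-ins for a real model, which would score every token in its vocabulary from the whole context rather than from the last token alone.

```python
# Toy stand-in for a model: maps the last token to scores for the next one.
# Assumption: a real LLM conditions on the full context, not just the last token.
TOY_SCORES = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}


def next_token(context):
    """Pick the highest-scoring next token (greedy decoding)."""
    scores = TOY_SCORES.get(context[-1], {"</s>": 1.0})
    return max(scores, key=scores.get)


def generate(prompt, max_new_tokens=10):
    """Append one predicted token at a time until end-of-sequence or the cap."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == "</s>":
            break
        tokens.append(tok)  # the new token becomes part of the next step's input
    return tokens


print(generate(["<s>"]))
```

The point is the shape of the loop: each step sees everything generated so far, emits one token, and feeds it back in. Sampling strategies change how `next_token` picks, not the loop around it.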
Embeddings & vector search
Turn text into coordinates so "find similar" becomes "find nearby". What an embedding is, why cosine distance works, and how approximate nearest-neighbour indexes (HNSW, IVF) make search over a billion vectors fast enough to serve.
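As a rough illustration of "find similar" becoming "find nearby", the sketch below does a brute-force cosine-similarity search over three hand-made vectors. The vectors in `DOCS` are invented for the example, not output from any real embedding model, and the linear scan is exactly what an approximate index such as HNSW or IVF exists to avoid at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors):
    """Exact nearest neighbour by scanning every stored vector."""
    return max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
DOCS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(nearest([1.0, 0.0, 0.0], DOCS))
print(nearest([0.0, 0.0, 1.0], DOCS))
```

Cosine similarity ignores vector length and compares direction only, which is why a vector and a scaled copy of it score as identical.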
Inference & serving
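Why serving an LLM is mostly a memory problem, not a compute one: decode speed is set by how fast weights and cached state can be read, not by how fast the GPU can multiply. The KV cache, prefill vs decode, continuous batching, PagedAttention, and why throughput and latency pull in opposite directions on the same GPU.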
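A back-of-envelope sketch of where the memory goes: the KV cache stores one key and one value vector per layer, per KV head, per token. The model shape below (32 layers, 32 KV heads, head dimension 128, 2-byte values) is an assumed, roughly 7B-class configuration chosen for illustration, not a measurement of any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size: keys + values for every layer, head and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
batch_of_16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)

print(per_token / 1024, "KiB per token")
print(one_request / 1024**3, "GiB for one 4096-token request")
print(batch_of_16 / 1024**3, "GiB for 16 such requests")
```

Under these assumptions a single long request costs gigabytes of cache on top of the weights, and the cost scales linearly with batch size. That is the pressure that batching and paging schemes are designed to manage.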
Retrieval-augmented generation
Give the model an open-book exam. The ingestion and retrieval halves, chunking trade-offs, hybrid search, re-ranking, and how to tell whether a wrong answer came from retrieval or from generation.
Agents & tool use
What turns a chat model into something that takes actions. Tool calling, the plan-act-observe loop, memory, MCP, and the guardrails that keep an autonomous loop from doing real damage.
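The plan-act-observe loop can be sketched in a few lines. Everything here is hypothetical: `toy_policy` stands in for the model deciding what to do next, `TOOLS` is a tiny allowlist, and the step cap is one simple example of a guardrail. Real tool-calling APIs and MCP define their own message formats, which this sketch does not attempt to reproduce.

```python
def add(a, b):
    return a + b


TOOLS = {"add": add}  # allowlist: the loop can only call what is registered here


def toy_policy(history):
    """Hypothetical stand-in for the model: call a tool once, then answer."""
    if not any(kind == "observation" for kind, _ in history):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": f"The sum is {history[-1][1]}"}


def run_agent(task, policy, tools, max_steps=5):
    """Plan-act-observe loop with a hard step limit as a basic guardrail."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = policy(history)                      # plan
        if "final" in action:
            return action["final"]
        name = action["tool"]
        if name not in tools:                         # refuse unregistered tools
            history.append(("observation", f"error: unknown tool {name!r}"))
            continue
        result = tools[name](**action["args"])        # act
        history.append(("observation", result))       # observe
    return "stopped: step limit reached"


print(run_agent("add 2 and 3", toy_policy, TOOLS))
```

The step limit and the allowlist are the two cheapest controls to add: one bounds how long the loop can run, the other bounds what it can touch.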