The AI stack, from the engineer's seat.

Not how to train a model. How the model you call runs in production: how a prompt becomes tokens, how embeddings turn meaning into coordinates, why serving is a memory problem, and how retrieval and agents bolt real systems onto a next-token predictor. Same level the rest of the codex works at — what is the system actually doing, and where do the costs hide.

All five sub-pages are live. Each links to its plain-English ELI5 front door and the matching simulator where one exists.

Live deep dives

Start here.

01 Live

How LLMs work

A language model is a next-token predictor wrapped in a loop. Tokenization, embeddings, the transformer block, attention, and autoregressive decoding — the whole path from your prompt to one word at a time, with no maths you do not need.

tokens ·embeddings ·attention ·transformer block ·decoding

Read

02 Live

Embeddings & vector search

Turn text into coordinates so "find similar" becomes "find nearby". What an embedding is, why cosine distance works, and how approximate nearest-neighbour indexes (HNSW, IVF) make search over a billion vectors fast enough to serve.

embeddings ·cosine ·ANN ·HNSW ·IVF

Read

03 Live

Inference & serving

Why serving an LLM is a memory problem, not a compute one. The KV cache, prefill vs decode, continuous batching, PagedAttention, and why throughput and latency pull in opposite directions on the same GPU.

KV cache ·prefill/decode ·batching ·PagedAttention ·vLLM

Read

04 Live

Retrieval-augmented generation

Give the model an open-book exam. The ingestion and retrieval halves, chunking trade-offs, hybrid search, re-ranking, and how to tell whether a wrong answer came from retrieval or from generation.

chunking ·hybrid search ·re-ranking ·evaluation ·hallucination

Read

05 Live

Agents & tool use

What turns a chat model into something that takes actions. Tool calling, the plan-act-observe loop, memory, MCP, and the guardrails that keep an autonomous loop from doing real damage.

tool calling ·ReAct ·memory ·MCP ·guardrails

Read

Start with the mental model

How LLMs work

A prompt turns into tokens, the tokens run through stacked transformer layers, and the model samples one token at a time and feeds it back in. Tokenization, embeddings, attention, and autoregressive decoding — the loop that everything else on this page sits on top of.

The AI stack, from the engineer's seat.

Start here.

How LLMs work

Embeddings & vector search

Inference & serving

Retrieval-augmented generation

Agents & tool use

Where this connects.