From NAND to a working CPU

One gate is universal. NAND(A, B) = NOT(A AND B) can be composed into every Boolean function — and from there into adders, ALUs, registers, RAM, and a CPU that runs real instructions. Six stages, one URL. Click the bits.

stage
1 / 6
gates
1
cycle

The single gate that builds everything

NAND = NOT(A AND B). Output is 0 only when both inputs are 1. Click the input bits to toggle them. Sheffer proved in 1913 that this one gate is enough to express any Boolean function.

A
B
NAND
out
1
ABNAND
001
011
101
110

What you're looking at

Six tabs climb from a single gate to a running CPU, and each one is live. The boxes are wires carrying a 0 or a 1; the ones you can click are inputs, the "out" boxes light up when their value is 1. Stage 1 is one NAND with its truth table; stage 2 derives NOT, AND, OR, and XOR from it; stages 3 and 4 wire full-adders into a 4-bit adder; stage 5 is an ALU you give an op to; and stage 6 is a small CPU with A, D, and PC registers, program memory, and eight RAM cells you step through. The header counts gates used and, on the last stage, clock cycles.

Start on stage 1 and set both inputs to 1, the one case where NAND outputs 0. Then jump to stage 6 and Step through the program seven times. What should surprise you is that nothing new gets added at any tab: the CPU computing 2 + 3 into RAM[0] is built entirely from the same gate you toggled at the start. The whole climb is composition, not new primitives, which is the point the article below unpacks.


One gate, every computer

The Sheffer stroke is functionally complete. So is NOR. CMOS picked NAND for a transistor reason.

Henry Sheffer proved in 1913 that a single binary operator — what he called the "stroke" and what we now write as NAND — is enough to express every Boolean function of any number of variables. NOT is NAND(A, A). AND is NOT(NAND(A, B)). OR is NAND(NOT A, NOT B). Once you have those three you have everything: XOR, multiplexers, decoders, comparators, adders. The same result holds for NOR; the two are the only single-operator complete sets among the sixteen binary Boolean functions.

CMOS — the process every commodity logic chip has used since the late 1980s — happens to build a 2-input NAND with four transistors: two PMOS in parallel pulling the output high, two NMOS in series pulling it low. NOR is the dual and also takes four. But in silicon, electrons (carriers in NMOS) move about three times faster than holes (carriers in PMOS) for the same geometry, and NAND puts the slow PMOS pair in parallel while NOR puts it in series. That makes NAND faster at the same area and the same drive strength, so EDA tools synthesise netlists that are NAND-dominant by default.

You can still buy the chip that started the era. Texas Instruments shipped the SN7400 in 1968: four 2-input NAND gates in a 14-pin DIP package, TTL logic, about 10 ns per gate. The 7400 series became the standard cell library of an entire industry. Every chip in your phone, your laptop's SSD controller, your car's engine management unit, your fridge's compressor controller is functionally a forest of NAND-equivalent gates — usually billions of them.


Adder by adder, ALU by ALU

A half-adder is XOR plus AND. An N-bit ripple-carry adder is N full-adders. Everything else is mux.

A half-adder is the smallest piece of arithmetic in any computer: two 1-bit inputs, two 1-bit outputs. The sum bit is A XOR B; the carry bit is A AND B. A full-adder takes a carry-in too and chains two half-adders, ORing the two carries to make carry-out. Build N of these, wire each one's carry-out into the next one's carry-in, and you have a ripple-carry adder. The pattern is identical from a 4-bit calculator chip in 1971 to the 64-bit add unit in an Apple M-series core in 2026.

The catch with ripple-carry is the wait. Each full-adder cannot finish until its carry-in arrives, so the worst-case delay is proportional to N — adding two 64-bit numbers in pure ripple would take 64 gate delays. Real CPUs do not pay that. They use carry-lookahead (compute all the carries in parallel from the inputs and a pair of generate/propagate signals), carry-select (compute the high half twice — once assuming carry=0, once assuming carry=1 — and pick at the end), or a Brent-Kung or Kogge-Stone prefix tree that gets the depth down to log N gates. The building blocks are still NAND-derived gates; the topology changes.

An ALU is the adder, a few bitwise units (AND, OR, NOT, XOR row-by-row), and a multiplexer that picks which output to expose based on the operation code. The Hack ALU from Nand2Tetris computes 18 useful functions of two 16-bit inputs out of just six control bits — by adding "zero this input" and "negate this input" pre-stages and a "negate the output" post-stage to a single ADD/AND core. The MIPS R2000 in 1985 used the same trick at higher precision. Mead and Conway's 1980 VLSI textbook (Introduction to VLSI Systems) is the canonical reference for how this maps to actual silicon.


Registers, memory, and the stored-program idea

Cross-couple two NAND gates and you have a memory cell. Address a grid of them and you have RAM.

Combinational logic alone has no memory: outputs are a pure function of current inputs. To get state you cross-couple a pair of NAND gates so each one's output feeds the other's input — an SR latch. Add a clock and a data input and you have a D flip-flop: one bit of stable, addressable memory. A register is W of these flip-flops side by side, sharing a clock and a write-enable line. RAM is a 2D grid of flip-flop cells (or, in DRAM, a single transistor and a capacitor per cell) addressed by row and column decoders that are themselves NAND-built. Even SRAM cache cells in your L1 are six transistors arranged in two cross-coupled inverters with two access transistors — gate-level circuitry all the way down.

John von Neumann's 1945 EDVAC report described the architecture every general-purpose CPU has used since: a single memory holds both program and data; a program counter tracks the next instruction; the CPU fetches an instruction, decodes it, executes it on the ALU, optionally writes back to memory, increments the PC, and repeats. The Intel 4004 in 1971 ran this loop on 2,300 transistors at 740 kHz. The Apple M4 in 2024 runs the same loop on roughly 30 billion transistors at around 4 GHz across a dozen cores. Everything between — register renaming, branch prediction, out-of-order execution, speculation, prefetchers — is performance engineering layered on top of the fetch-decode-execute cycle.

The Hack architecture from Noam Nisan and Shimon Schocken's 2005 textbook (and the Nand2Tetris course built on it) is the smallest pedagogical CPU that actually works. 16-bit instructions, two registers (A and D), one ALU, one program ROM, one data RAM, one memory-mapped screen, one keyboard. You can build the whole thing in a hardware description language in a weekend and run a tic-tac-toe game on it. The point of the exercise is not that you will design CPUs — almost nobody does — but that after you've wired the gates yourself, nothing the production chip does feels like magic anymore.


Why a working engineer should think one layer down

You don't need to design chips. You need to know what the layer below charges.

When you trace a perf regression to a misprediction and discover the branch predictor is paying for an indirect call you added, you are looking at the cost of a pipeline flush — the consequence of a CPU that fetches instructions before it knows where they go. When a 4 KiB random read on local DRAM costs 80 ns and the same read on a remote NUMA socket costs 200 ns, you are looking at the cost of physical wires: signals propagating across a chip and across an interposer. When a Redis instance pinned to one core does 200k ops/sec and the same workload spread across eight cores does 400k, you are looking at cache coherence traffic — MESI invalidations rippling across an interconnect.

These costs are only legible if you know the substrate has them. The senior engineer who can think one layer down does not redesign the substrate; they pick the design that the substrate is built to be good at. They keep hot data in cache lines, lay out structs to avoid false sharing, prefer branch-predictable patterns, choose data structures with linear access over random access when memory bandwidth is the bottleneck, and reach for SIMD when the inner loop is a vector op. RISC-V's open ISA, the explosive interest in custom silicon (Apple, Amazon Graviton, Google TPU), and the resurgence of hardware-software co-design at AI labs all reward this kind of thinking.

Patterson and Hennessy's Computer Organization and Design (now in its sixth edition) is the canonical entry point for the layer this simulator skims. Nisan and Schocken's Nand2Tetris is the experiential one. The gates beneath the abstraction never went away — they are just so densely packed and so reliable that we mostly stopped noticing. The point of this page is to make them visible again for one URL's worth of scrolling.

Found this useful?