Transistors, gates, and the ALU
Every computer is a tower of abstractions. At the bottom is one simple object: a transistor, a switch with no moving parts. Four of them wired together make a CMOS NAND gate. Four NANDs make an XOR. Eight XORs plus a handful of ANDs and ORs make a 4-bit adder. A 4-bit adder, an AND-block, an OR-block, an XOR-block, and a shifter, gathered behind a multiplexer, make an ALU. Five layers, ten billion repetitions, and the result is a chip that can add a billion 64-bit numbers per second. This page walks the bottom three layers.
The abstraction stack
Engineers don't actually reason about transistors when designing CPUs. They reason about gates. Logic designers, in turn, rarely reason about individual gates: they reason about adders, multiplexers, and registers. The ALU designer treats the adder as a black box. The CPU designer treats the ALU as a black box. The compiler writer treats the CPU as a black box. The application developer treats the compiler as a black box. This is the move that makes computing tractable.
Every layer up multiplies the number of objects below by ~10. A modern Apple M3 Max ships ~92 billion transistors. A few hundred thousand gates per ALU. A few dozen ALUs per core. Sixteen cores. The compounding is what made this whole edifice possible.
code a + b
↓
ISA ADD x10, x11, x12 // RISC-V instruction
↓
microarch ALU lane 2, op = 0b0000 // dispatched µop
↓
gate 32-bit ripple-carry adder // ~150 NAND-equivalent gates
↓
transistor ~600 CMOS transistors
↓
silicon 0.4 V applied to a 3 nm gate oxide
A transistor is a switch
The transistor in every modern CPU is a MOSFET, a Metal-Oxide-Semiconductor Field-Effect Transistor. It has three terminals: gate, source, and drain. Apply a voltage to the gate and a thin layer of charge forms in the silicon directly underneath, opening a conducting channel between source and drain. Take the voltage away and the channel disappears. It's a voltage-controlled switch.
There are two flavours, complementary to each other:
- NMOS. The channel turns on when the gate voltage is high. Strong at pulling outputs down to ground; weak at pulling them up. The "pull-down" half of the pair.
- PMOS. The channel turns on when the gate voltage is low. Strong at pulling outputs up to the supply rail; weak at pulling them down. The "pull-up" half of the pair.
Use NMOS alone and you can build logic, but it draws current through its pull-up load whenever an output is low. Use PMOS alone and the same problem reverses. Wire the two together so that exactly one of them is on at any given input (the famous CMOS structure) and no steady current flows in either state. Static power consumption drops to leakage current. CMOS won the 1980s because of this single property: at a given clock speed it dissipated roughly 100× less power than NMOS-only logic.
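What remains is dynamic power, spent charging and discharging capacitance each time a node switches. It scales as P ∝ C·V²·f; this is the equation behind the 5 GHz frequency wall and the post-2005 turn to multicore.
A gate is a function of two bits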
Four transistors wired correctly (two NMOS in series, two PMOS in parallel) implement a CMOS NAND gate: the output is low only when both inputs are high. The truth table:
| A | B | NAND |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
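The truth table above can be reproduced from the switch behaviour of the transistors themselves. The following is a minimal sketch, assuming an idealized model in which each transistor is either fully on or fully off; the function names `nmos_on`, `pmos_on`, and `cmos_nand` are made up for this illustration.

```python
def nmos_on(gate: int) -> bool:
    """Idealized NMOS: conducts when its gate is high."""
    return gate == 1


def pmos_on(gate: int) -> bool:
    """Idealized PMOS: conducts when its gate is low."""
    return gate == 0


def cmos_nand(a: int, b: int) -> int:
    # Pull-down network: two NMOS in series between output and ground.
    pull_down = nmos_on(a) and nmos_on(b)
    # Pull-up network: two PMOS in parallel between supply and output.
    pull_up = pmos_on(a) or pmos_on(b)
    # Exactly one network conducts for every input, so there is no
    # steady path from supply to ground and the output never floats.
    assert pull_up != pull_down
    return 1 if pull_up else 0


for a in (0, 1):
    for b in (0, 1):
        print(a, b, cmos_nand(a, b))
```

The internal assertion is the CMOS property in miniature: for every input combination one network is on and the other is off.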
NAND is the universal gate
Every Boolean function can be built using only NANDs. NOT, AND, OR, XOR, an entire adder, an entire ALU: all from one gate type. This matters in chip design because you can lay out the silicon with a single uniform pattern of NANDs and route the interconnect to taste.
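One way to see the universality claim is to write the other gates using nothing but a `nand` function. This is a sketch of the standard textbook constructions; the helper names are chosen for this example, and bits are plain Python integers 0 and 1.

```python
def nand(a: int, b: int) -> int:
    """The only primitive: output is 0 only when both inputs are 1."""
    return 1 - (a & b)


def not_(a: int) -> int:
    # Tie both inputs together: 1 NAND.
    return nand(a, a)


def and_(a: int, b: int) -> int:
    # NAND followed by an inverter: 2 NANDs.
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    # De Morgan: A OR B = NOT(NOT A AND NOT B): 3 NANDs.
    return nand(not_(a), not_(b))


def xor(a: int, b: int) -> int:
    # Shares the first NAND between both branches: 4 NANDs in total.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, not_(a), and_(a, b), or_(a, b), xor(a, b))
```

The `xor` construction uses four NANDs, matching the count given at the top of the page.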
Two gates make an adder
Adding two single bits has four cases: 0 + 0, 0 + 1, 1 + 0, 1 + 1. The first three give a single sum bit; the last gives a sum of 0 with a carry of 1. The circuit that computes this is the half-adder: an XOR for the sum and an AND for the carry. The full-adder is one step beyond — it also accepts a carry-in from the bit below it.
sum = A ⊕ B ⊕ Cin
cout = (A · B) | (Cin · (A ⊕ B))
| A | B | Cin | SUM | COUT |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |
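The table can be checked against a small model. The sketch below builds a full-adder from two half-adders and an OR, which is one common construction rather than the only one; `half_adder` and `full_adder` are names chosen for this example.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two single bits: XOR and AND."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, cout) for two bits plus a carry-in."""
    s1, c1 = half_adder(a, b)      # c1 = A AND B
    s, c2 = half_adder(s1, cin)    # c2 = Cin AND (A XOR B)
    return s, c1 | c2              # a carry arises from either stage


# Print the same eight rows as the truth table.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            print(a, b, cin, s, cout)
```

Read as a two-bit number, (COUT, SUM) is simply the count of inputs that are 1, which is a quick way to sanity-check any row.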
Four full-adders make a ripple-carry adder
Chain four full-adders together, with each one's carry-out feeding the next one's carry-in, and you can add two 4-bit numbers. The carry has to propagate from the least-significant bit all the way up before the most-significant sum bit settles, which is where the name ripple-carry comes from. The propagation delay grows linearly with the number of bits, which is why 64-bit ALUs use carry-lookahead or parallel-prefix structures that compute the carries in O(log n) gate delays instead (carry-select adders are a cheaper middle ground at roughly O(√n)).
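The chaining can be sketched in a few lines. This is an illustrative model, not a timing-accurate one: the loop order mirrors the direction the carry ripples, and the names `full_adder` and `ripple_carry_add` are chosen for this example.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, cout) for two bits plus a carry-in."""
    p = a ^ b
    return p ^ cin, (a & b) | (cin & p)


def ripple_carry_add(x: int, y: int, width: int = 4) -> tuple[int, int]:
    """Add two `width`-bit numbers one bit at a time, LSB first.

    Returns (result, carry_out), where result holds the low `width` bits.
    """
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        # Stage i cannot finish until stage i-1 has produced its carry.
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(0b1011, 0b0110))  # 11 + 6 = 17 -> (1, 1)
```

The worst case is an input like 0b1111 + 0b0001, where the carry generated at bit 0 has to pass through every stage before the final carry-out is known.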
An ALU is a multiplexer over fixed-function blocks
An Arithmetic Logic Unit is a small bundle of pre-built circuits (an adder, a logical-AND, a logical-OR, an XOR, a shifter) fed the same two inputs in parallel. A multiplexer at the output, controlled by an opcode, selects which block's result is the answer. In this basic design all the blocks always run, so power is paid for operations you didn't pick. This is why a simple integer ALU has roughly the same energy cost regardless of which integer instruction was issued.
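The "compute everything, select one" structure can be sketched as follows. The opcode names, the 4-bit default width, and the choice of a left shift by `b mod width` are assumptions made for this illustration, not details taken from any particular ALU.

```python
def alu(op: str, a: int, b: int, width: int = 4) -> int:
    """Toy ALU: every block computes its result, then a mux picks one."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # All fixed-function blocks see the same two inputs and all "run".
    block_outputs = {
        "ADD": (a + b) & mask,             # adder, carry-out discarded
        "AND": a & b,
        "OR": a | b,
        "XOR": a ^ b,
        "SHL": (a << (b % width)) & mask,  # hypothetical shifter behaviour
    }
    # The multiplexer: the opcode selects which block's output is used.
    return block_outputs[op]


print(alu("ADD", 0b1001, 0b1000))  # 9 + 8 = 17, truncated to 4 bits
print(alu("XOR", 0b1100, 0b1010))
```

Building the whole dictionary before indexing it is deliberate: it mirrors hardware in which the unselected results are computed and then thrown away.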
From 2,300 to 153 billion
The Intel 4004 in 1971 had 2,300 transistors and ran at 740 kHz. AMD's Instinct MI300X GPU in 2024 has 153 billion transistors, spread across a package of stacked chiplets, and runs at 2.1 GHz. Plotted on a log scale, this is roughly a doubling every two years: Moore's law, observed for 53 years and counting. The nominal feature size has dropped from 10 µm to ~3 nm: a single transistor now occupies a smaller footprint than a virus.
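The "doubling every two years" figure can be checked directly from the two data points in the paragraph above. This is a back-of-the-envelope sketch; the function name `doubling_period_years` is made up for the example.

```python
import math


def doubling_period_years(n_start: float, n_end: float, years: float) -> float:
    """Average time per doubling, assuming steady exponential growth."""
    doublings = math.log2(n_end / n_start)
    return years / doublings


doublings = math.log2(153e9 / 2_300)
period = doubling_period_years(2_300, 153e9, 2024 - 1971)
print(f"{doublings:.1f} doublings, one every {period:.2f} years")
```

The ratio between the two transistor counts is almost exactly 2²⁶, so 53 years works out to a doubling period very close to two years.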
Where Moore's law actually lives now
"3 nm" is a marketing name. The actual gate length on TSMC's N3 process is closer to 18 nm; the name describes a generation, not a measurement. The real engineering now happens in three places:
- FinFET → GAAFET. Intel and TSMC's N3E still use FinFETs (fins of silicon wrapped on three sides by a gate). Samsung's 3 nm and TSMC N2 introduce gate-all-around (GAAFET, "nanosheets") — the gate wraps the channel on all four sides, allowing better control at smaller scales.
- 3D stacking. Memory has been stacked for a decade (HBM is 8–16 DRAM dies on top of a silicon interposer). Logic is starting to follow. AMD's V-Cache stacks an L3 die on top of the CPU die. Apple's "Ultra" chips fuse two dies through a silicon interposer with 2.5 TB/s of bandwidth.
- Chiplets. Instead of one monolithic die, build many small dies and bond them. AMD's Zen 4 / Zen 5 use this for I/O + compute separation. Yields are higher (small dies have fewer defects) and you can mix process nodes (compute on N5, I/O on N7).
Frequency stopped scaling around 2005 (~3.8 GHz on a Pentium 4) because dynamic power scales with the square of voltage and linearly with frequency, and pushing frequency higher also requires raising voltage, so in practice power climbs roughly with the cube of frequency. Voltage has barely moved since (around 1.0 V). The remaining ways forward are parallelism (more cores), specialization (NPUs, AMX, GPU), and energy-per-operation (the gap between Apple silicon and x86 is largely about how much energy each instruction costs).
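The scaling argument can be made concrete with the relation P ∝ C·V²·f. The sketch below uses made-up round numbers for capacitance and voltage, and it assumes, as a simplification, that supply voltage must rise in proportion to frequency; real voltage-frequency curves are not exactly linear.

```python
def dynamic_power(c_farads: float, v_volts: float, f_hertz: float) -> float:
    """Dynamic switching power, P = C * V^2 * f (activity factor folded into C)."""
    return c_farads * v_volts ** 2 * f_hertz


# Hypothetical baseline: 1 nF of switched capacitance, 1.0 V, 3 GHz.
C, V, F = 1e-9, 1.0, 3e9
base = dynamic_power(C, V, F)

# Double the frequency at constant voltage: power doubles.
same_voltage = dynamic_power(C, V, 2 * F)

# Double the frequency and (by assumption) the voltage with it: power is 8x.
scaled_voltage = dynamic_power(C, 2 * V, 2 * F)

print(f"baseline:        {base:.1f} W")
print(f"2x f, same V:    {same_voltage / base:.1f}x")
print(f"2x f, 2x V:      {scaled_voltage / base:.1f}x")
```

The factor of eight in the last case is where the "cube of frequency" rule of thumb comes from: one factor of two from f and two more from V².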
Common misconceptions
- "More transistors means a faster CPU." Not directly. Most transistors today go into cache (a 36 MB L3 needs ~1.8 billion transistors for its six-transistor SRAM bit cells alone, before tags and ECC), interconnect, and power-management circuitry. The portion doing actual arithmetic is a small minority of the die.
- "NAND vs NOR doesn't matter." Both are universal, but NAND is preferred in CMOS. A NAND puts its two NMOS transistors in series and its two PMOS in parallel; a NOR does the opposite. PMOS is the slower transistor, so stacking two of them in series, as NOR does, makes the pull-up slow unless the PMOS devices are made much wider. The net effect is that NAND is reasonably balanced, while NOR is fast pulling down but slow pulling up.
- "The clock speed is the speed of the chip." The clock determines when state can change. Inside one cycle, signals propagate through several layers of gates; the clock period must be long enough to accommodate the worst-case path. A 3 GHz chip has ~333 ps to do everything in a cycle, which is roughly 30 gate delays.
- "Smaller transistors are always faster." Smaller transistors switch faster but leak more (gate oxide is thinner) and the wiring between them dominates delay at scale. Past ~7 nm, interconnect resistance is the limit, not gate switching.
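The cycle-budget arithmetic in the clock-speed item can be worked through as follows. The 11 ps per-gate delay is an assumed value, chosen only because it is what ~333 ps divided by ~30 gates implies; real gate delays depend on the process, the gate type, and the load.

```python
def cycle_time_ps(clock_hz: float) -> float:
    """Length of one clock period in picoseconds."""
    return 1e12 / clock_hz


def gate_delay_budget(clock_hz: float, gate_delay_ps: float) -> int:
    """How many gates can sit in series on the worst-case path in one cycle."""
    return int(cycle_time_ps(clock_hz) // gate_delay_ps)


period = cycle_time_ps(3e9)
print(f"3 GHz period: {period:.0f} ps")
print(f"gate budget at 11 ps/gate: {gate_delay_budget(3e9, 11)}")
```

Raising the clock shrinks the budget proportionally, which is why deeper pipelines (fewer gates per stage) and higher clock speeds historically went together.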
Numbers worth remembering
| Quantity | Value | Notes |
|---|---|---|
| Transistors, Intel 4004 (1971) | 2,300 | 10 µm process |
| Transistors, Apple M3 Max (2023) | ~92 billion | TSMC N3 process |
| Transistors, AMD MI300X (2024) | ~153 billion | Total across a multi-chiplet package, not a single die |
| NAND transistor count | 4 (CMOS, 2-input) | 2 NMOS + 2 PMOS |
| Full-adder transistor count | ~28 (textbook) | Real designs use ~24 with sharing |
| Single-cycle gate budget at 3 GHz | ~333 ps | ~30 gate delays |
| Modern CMOS gate oxide thickness | ~1 nm | ≈ 4 atoms thick |
| Switching energy, modern transistor | ~1 fJ | 10⁻¹⁵ joules |
| Dynamic power scaling | P ∝ C·V²·f | Why frequency stalled at ~5 GHz |
Further reading
- Nisan & Schocken — The Elements of Computing Systems (nand2tetris) — free textbook + course that builds a whole computer from NAND gates up. The cleanest practical path through this material.
- Bryant & O'Hallaron — Computer Systems: A Programmer's Perspective — Chapter 4 covers the digital-logic foundations briefly and then ties them to real ISAs.
- Wikipedia — Transistor count — periodically-updated table of every notable chip's transistor count and process node.
- Weste & Harris — CMOS VLSI Design. The graduate textbook on actually laying out CMOS at scale. Most engineers will never need this depth, but it's the reference.
- Patterson & Hennessy — Computer Organization and Design (RISC-V Edition), Appendix A. A self-contained 60-page tour of digital logic: gates, latches, adders, register files, memories.
- Bottom Up Computer Science — a free online book that starts from gates and works upward toward operating systems.
- Ben Eater — 8-bit computer from scratch — long video series building a working CPU on breadboards. Slow, thorough, the right antidote to abstraction.