Transistors, gates, and the ALU

Every computer is a tower of abstractions. At the bottom is one simple object: a transistor, a switch with no moving parts. Four of them wired together make a CMOS NAND gate. Four NANDs make an XOR. Eight XORs plus a handful of ANDs and ORs make a 4-bit adder. A 4-bit adder, an AND-block, an OR-block, an XOR-block, and a shifter, gathered behind a multiplexer, make an ALU. Five layers, ten billion repetitions, and the result is a chip that can add a billion 64-bit numbers per second. This page walks the bottom three layers.


The abstraction stack

Engineers don't actually reason about transistors when designing CPUs. They reason about gates. Logic designers mostly don't reason about individual gates either; they reason about adders, multiplexers, and registers. The ALU designer treats the adder as a black box. The CPU designer treats the ALU as a black box. The compiler writer treats the CPU as a black box. The application developer treats the compiler as a black box. This is the move that makes computing tractable.

Every layer up multiplies the number of objects below by ~10. A modern Apple M3 Max ships ~92 billion transistors. A few hundred thousand gates per ALU. A few dozen ALUs per core. Sixteen cores. The compounding is what made this whole edifice possible.

code         a + b
  ↓
ISA          ADD x10, x11, x12          // RISC-V instruction
  ↓
microarch    ALU lane 2, op = 0b0000    // dispatched µop
  ↓
gate         32-bit ripple-carry adder  // ~150 NAND-equivalent gates
  ↓
transistor   ~600 CMOS transistors
  ↓
silicon      0.4 V applied to a 3 nm gate oxide

A transistor is a switch

The transistor in every modern CPU is a MOSFET, a Metal-Oxide-Semiconductor Field-Effect Transistor. It has three terminals: gate, source, and drain. Apply a voltage to the gate and a thin layer of charge forms in the silicon directly underneath, opening a conducting channel between source and drain. Take the voltage away and the channel disappears. It's a voltage-controlled switch.

There are two flavours, complementary to each other:

  • NMOS. The channel turns on when the gate voltage is high. Strong at pulling outputs down to ground; weak at pulling them up. The "pull-down" half of the pair.
  • PMOS. The channel turns on when the gate voltage is low. Strong at pulling outputs up to the supply rail; weak at pulling them down. The "pull-up" half of the pair.

Use NMOS alone and you can build logic, but the circuit draws current through its pull-up load whenever an output is low. Use PMOS alone and the same problem reverses. Wire the two together so that exactly one of them is on for any given input (the CMOS structure) and no current path exists from supply to ground in either resting state. Static power consumption drops to leakage current. CMOS won the 1980s largely because of this property: at a given clock speed it dissipated roughly 100× less power than NMOS-only logic.
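The complementary behaviour can be sketched at switch level. The following is a minimal Python model, not a circuit simulator: the helper names (`nmos`, `pmos`, `cmos_inverter`, `cmos_nand`) are illustrative, and the NAND assumes the standard four-transistor arrangement of two PMOS in parallel and two NMOS in series.

```python
def nmos(gate: int) -> bool:
    """An NMOS switch conducts when its gate is high."""
    return gate == 1


def pmos(gate: int) -> bool:
    """A PMOS switch conducts when its gate is low."""
    return gate == 0


def cmos_inverter(a: int) -> int:
    pull_up = pmos(a)     # connects the output to the supply rail
    pull_down = nmos(a)   # connects the output to ground
    # Exactly one network conducts, so there is no static supply-to-ground path.
    assert pull_up != pull_down
    return 1 if pull_up else 0


def cmos_nand(a: int, b: int) -> int:
    pull_up = pmos(a) or pmos(b)      # two PMOS in parallel
    pull_down = nmos(a) and nmos(b)   # two NMOS in series
    assert pull_up != pull_down
    return 1 if pull_up else 0


for a in (0, 1):
    for b in (0, 1):
        print(a, b, cmos_nand(a, b))
```

The internal assertions never fire for any input, which is the property the paragraph describes: the pull-up and pull-down networks are never on at the same time in a resting state.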

Why this matters today: CMOS still leaks during a switch (short-circuit current as both transistors are briefly partially-on) and through the gate oxide (~1 nm thin in modern processes). Dynamic power scales as C × V² × f; this is the equation behind the 5 GHz frequency wall and the post-2005 turn to multicore.

A gate is a function of two bits

Four transistors wired correctly (two NMOS in series, two PMOS in parallel) implement a 2-input CMOS NAND gate: the output is low only when both inputs are high. Other gate types map the same two inputs through different functions. The NAND truth table:

  A  B  NAND
  0  0  1
  0  1  1
  1  0  1
  1  1  0

NAND is the universal gate

Every Boolean function can be built using only NANDs. NOT, AND, OR, XOR, an entire adder, an entire ALU: all from one gate type. This matters in chip design because you can lay out the silicon with a single uniform pattern of NANDs and route the interconnect to taste.

  NOT A    = NAND(A, A)
  A AND B  = NAND(NAND(A, B), NAND(A, B))
  A OR B   = NAND(NAND(A, A), NAND(B, B))

The construction uses De Morgan's laws. NOT A is just NAND(A, A), since A · A = A. AND is NAND followed by NOT (which is itself a NAND with both inputs tied together). OR is built by inverting both inputs first, then NANDing.
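The NAND-only constructions can be checked exhaustively. This is a small Python sketch in which every function is built from `nand` alone. The function names are illustrative, and the four-NAND XOR is one common arrangement, not the only one.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    return 1 - (a & b)


def not_(a: int) -> int:
    return nand(a, a)                  # NAND with both inputs tied together


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))            # NAND followed by NOT


def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))      # invert both inputs, then NAND


def xor_(a: int, b: int) -> int:
    n = nand(a, b)                     # shared intermediate signal
    return nand(nand(a, n), nand(b, n))  # four NANDs in total


# Compare each construction against Python's built-in bit operators.
for a, b in product((0, 1), repeat=2):
    assert not_(a) == 1 - a
    assert and_(a, b) == (a & b)
    assert or_(a, b) == (a | b)
    assert xor_(a, b) == (a ^ b)
print("all NAND constructions match")
```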

Two gates make an adder

Adding two single bits has four cases: 0 + 0, 0 + 1, 1 + 0, 1 + 1. The first three give a single sum bit; the last gives a sum of 0 with a carry of 1. The circuit that computes this is the half-adder: an XOR for the sum and an AND for the carry. The full-adder is one step beyond — it also accepts a carry-in from the bit below it.

sum  = A ⊕ B ⊕ Cin
cout = (A · B) + (Cin · (A ⊕ B))

  A  B  Cin  SUM  COUT
  0  0  0    0    0
  0  0  1    1    0
  0  1  0    1    0
  0  1  1    0    1
  1  0  0    1    0
  1  0  1    0    1
  1  1  0    0    1
  1  1  1    1    1
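The half-adder and full-adder can be written directly from the equations above. This is a minimal Python sketch with illustrative function names, using integers 0 and 1 as bits. The full-adder here is composed of two half-adders and an OR, which yields the same carry expression.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry): XOR for the sum, AND for the carry."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, cout) from two half-adders and an OR."""
    s1, c1 = half_adder(a, b)       # c1 = A · B
    s, c2 = half_adder(s1, cin)     # c2 = Cin · (A ⊕ B)
    return s, c1 | c2


# Print the full truth table: A, B, Cin, SUM, COUT.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            print(a, b, cin, s, cout)
```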

Four full-adders make a ripple-carry adder

Chain four full-adders together, with each one's carry-out feeding the next one's carry-in, and you can add two 4-bit numbers. The carry has to propagate from the least-significant bit all the way up before the most-significant sum bit settles, which is where the name ripple-carry comes from. The propagation delay grows linearly with the number of bits, which is why 64-bit ALUs use carry-lookahead (parallel-prefix) structures that compute the carries in O(log n) gate delays instead.

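Worked example, 6 + 5 = 11. The carry row shows the carry into each bit position; its leftmost entry is the final carry-out:

  carry  0 1 0 0 0
  A        0 1 1 0   (6)
  B        0 1 0 1   (5)
  sum      1 0 1 1   (11)

For 0b1111 + 0b0001 (15 + 1) the carry ripples through all four bits and leaves as the carry-out, so the 4-bit sum wraps around to 0b0000. A 64-bit ripple-carry adder has to pass the carry through all 64 stages, roughly 64 gate delays. A 64-bit Kogge-Stone (parallel-prefix carry-lookahead) adder takes ~7. This is why wide adders in fast designs do not rely on ripple-carry alone.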
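The ripple can be traced in a few lines. The sketch below chains one full-adder per bit, least-significant bit first. The function name `ripple_carry_add` and the returned list of carries are illustrative choices, not part of any particular hardware description.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a: int, b: int, width: int = 4) -> tuple[int, int, list[int]]:
    """Add two unsigned integers bit by bit.

    Returns (sum, carry_out, carries), where carries[i] is the carry
    into bit i and carries[width] is the final carry-out.
    """
    carry = 0
    total = 0
    carries = [carry]
    for i in range(width):                      # LSB first, like the hardware chain
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        s, carry = full_adder(bit_a, bit_b, carry)
        total |= s << i
        carries.append(carry)
    return total, carry, carries


print(ripple_carry_add(6, 5))    # (11, 0, [0, 0, 0, 1, 0])
print(ripple_carry_add(15, 1))   # (0, 1, [0, 1, 1, 1, 1])
```

In the second call every stage produces a carry, which is the worst case for delay: the top sum bit cannot settle until the carry has passed through all the stages below it.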

An ALU is a multiplexer over fixed-function blocks

An Arithmetic Logic Unit is a small bundle of pre-built circuits (an adder, a subtractor, a logical-AND, a logical-OR, an XOR, and left and right shifters) fed the same two inputs in parallel. A multiplexer at the output, controlled by an opcode, selects which block's result is the answer. In this simple design all the blocks always run, so power is paid for operations you didn't pick. This is why such an ALU has roughly the same energy cost regardless of which integer instruction was issued.

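Example outputs, consistent with A = 0b1100 (12), B = 0b0011 (3), and shifts by one position:

  Block  Result  Decimal
  ADD    1111    15
  SUB    1001    9
  AND    0000    0
  OR     1111    15
  XOR    1111    15
  SHL    11000   24
  SHR    0110    6

With the opcode set to ADD, the multiplexer routes the adder's result to the output: OUT = 0b1111 = 15.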
All seven blocks compute in parallel. The opcode picks one to route to the output. Modern x86 cores have multiple copies of the ALU — typically four — so they can retire four integer ops in the same cycle. Apple M4 has six.
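The "compute everything, select one" structure can be sketched in Python. The function name `alu`, the string opcodes, and the fixed shift amount of one are assumptions made for illustration. The inputs 12 and 3 are inferred from the example values listed above, which they reproduce.

```python
def alu(op: str, a: int, b: int, width: int = 4) -> int:
    mask = (1 << width) - 1
    # Every block computes its result, whether or not it is selected.
    blocks = {
        "ADD": (a + b) & mask,
        "SUB": (a - b) & mask,
        "AND": a & b,
        "OR": a | b,
        "XOR": a ^ b,
        "SHL": a << 1,   # left unmasked so 12 << 1 shows as 24, as in the example
        "SHR": a >> 1,
    }
    # The multiplexer: the opcode routes exactly one result to the output.
    return blocks[op]


for op in ("ADD", "SUB", "AND", "OR", "XOR", "SHL", "SHR"):
    print(op, alu(op, 0b1100, 0b0011))
```

Building the whole dictionary before indexing it mirrors the hardware: the unselected results are computed and then discarded.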

From 2,300 to 153 billion

The Intel 4004 in 1971 had 2,300 transistors and ran at 740 kHz. AMD's Instinct MI300X GPU in 2024 has 153 billion transistors and runs at 2.1 GHz. Plotted on a log scale, this is roughly a doubling every two years — Moore's law, observed for 53 years and counting. The feature size has dropped from 10 µm to ~3 nm: a single transistor now occupies a smaller footprint than a virus.

  Year  Chip               Transistors  Process
  1971  Intel 4004         2K           10 µm
  1978  Intel 8086         29K          3 µm
  1985  Intel 386          275K         1.5 µm
  1993  Intel Pentium      3M           800 nm
  2000  Pentium 4          42M          180 nm
  2008  Core i7 (Nehalem)  731M         45 nm
  2014  Apple A8           2.0B         20 nm
  2020  Apple M1           16B          5 nm
  2023  Apple M3 Max       92B          3 nm
  2024  AMD MI300X (GPU)   153B         5 nm CoWoS
Two scales get lost when this is charted, because a log axis flattens the gap. From the 4004 to the M3 Max is 40 million× more transistors. From 10 µm to 3 nm is roughly 3,000× shorter in nominal feature size. Because area scales with the square of length, a 3,000× linear shrink corresponds to about 10⁷× more transistors per unit area, the same order of magnitude as the growth in transistor count. The hardware industry has compounded for fifty years.
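The doubling period and the two ratios follow from the figures quoted in this section. This is a small Python sketch of the arithmetic. The helper name `doubling_period_years` is illustrative, and the feature sizes are the nominal node names, not measured dimensions.

```python
import math


def doubling_period_years(n_start: float, n_end: float, years: float) -> float:
    """Years per doubling, assuming steady exponential growth."""
    return years / math.log2(n_end / n_start)


# Intel 4004 (1971, 2,300 transistors) to AMD MI300X (2024, 153 billion).
period = doubling_period_years(2300, 153e9, 2024 - 1971)
print(f"doubling period: {period:.2f} years")

# Intel 4004 to Apple M3 Max (92 billion transistors).
count_ratio = 92e9 / 2300
print(f"transistor count ratio: {count_ratio:.1e}")

# Nominal feature size, 10 µm down to 3 nm.
linear_shrink = 10e-6 / 3e-9
print(f"linear shrink: {linear_shrink:.0f}x")
print(f"implied density gain: {linear_shrink ** 2:.1e}x")
```

The doubling period comes out close to two years, in line with the usual statement of Moore's law.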

Where Moore's law actually lives now

"3 nm" is a marketing name. The actual gate length on TSMC's N3 process is closer to 18 nm; the name describes a generation, not a measurement. The real engineering now happens in three places:

  • FinFET → GAAFET. Intel and TSMC's N3E still use FinFETs (fins of silicon wrapped on three sides by a gate). Samsung's 3 nm and TSMC N2 introduce gate-all-around (GAAFET, "nanosheets") — the gate wraps the channel on all four sides, allowing better control at smaller scales.
  • 3D stacking. Memory has been stacked for a decade (HBM stacks 8–16 DRAM dies and mounts the stack on a silicon interposer). Logic is starting to follow. AMD's V-Cache stacks an L3 die on top of the CPU die. Apple's "Ultra" chips fuse two dies through a silicon interposer with 2.5 TB/s of bandwidth.
  • Chiplets. Instead of one monolithic die, build many small dies and bond them. AMD's Zen 4 / Zen 5 use this for I/O + compute separation. Yields are higher (small dies have fewer defects) and you can mix process nodes (compute on N5, I/O on N6).

Frequency stopped scaling around 2005 (~3.8 GHz on a Pentium 4) because dynamic power scales as C × V² × f, and running at a higher frequency also requires a higher supply voltage, so in practice power grows roughly with the cube of frequency. Voltage has barely moved since (around 1.0 V). The remaining ways forward are parallelism (more cores), specialization (NPUs, AMX, GPU), and energy-per-operation (the gap between Apple silicon and x86 is largely about how much energy each instruction costs).
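Dynamic power is commonly modeled as P = C × V² × f. The sketch below illustrates why frequency is expensive. The switched capacitance of 1 nF and the assumption that voltage must rise in proportion to frequency are hypothetical round numbers chosen for illustration, not measurements of any real chip.

```python
import math


def dynamic_power(c_farads: float, v_volts: float, f_hertz: float) -> float:
    """Dynamic switching power, P = C * V^2 * f, in watts."""
    return c_farads * v_volts ** 2 * f_hertz


C_EFF = 1e-9  # hypothetical effective switched capacitance, 1 nF

base = dynamic_power(C_EFF, 1.0, 3e9)        # 3 GHz at 1.0 V
same_v = dynamic_power(C_EFF, 1.0, 6e9)      # double f, same V
scaled_v = dynamic_power(C_EFF, 2.0, 6e9)    # double f, V doubled with it

print(f"baseline: {base:.1f} W")
print(f"2x frequency, same voltage: {same_v / base:.1f}x power")
print(f"2x frequency, 2x voltage:   {scaled_v / base:.1f}x power")
```

At fixed voltage, power is linear in frequency. The roughly cubic growth appears only when the voltage has to rise along with the clock.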

Common misconceptions

  • "More transistors means a faster CPU." Not directly. Most transistors today go into cache (a 36 MB L3 is roughly 2 billion transistors by itself, at six transistors per SRAM bit plus tags), interconnect, and power-management circuitry. The portion doing actual arithmetic is a small minority of the die.
  • "NAND vs NOR doesn't matter." Both are universal, but NAND is preferred in CMOS. A NAND puts its two NMOS transistors in series (higher pull-down resistance) and its two PMOS transistors in parallel; a NOR does the opposite, with NMOS in parallel and PMOS in series. PMOS is the slower transistor, so stacking it in series hurts more. The net effect is that NAND has roughly balanced rise and fall times, while NOR is fast pulling down but slow pulling up.
  • "The clock speed is the speed of the chip." The clock determines when state can change. Inside one cycle, signals propagate through several layers of gates; the clock period must be long enough to accommodate the worst-case path. A 3 GHz chip has ~333 ps to do everything in a cycle, which is roughly 30 gate delays.
  • "Smaller transistors are always faster." Smaller transistors switch faster but leak more (gate oxide is thinner) and the wiring between them dominates delay at scale. Past ~7 nm, interconnect resistance is the limit, not gate switching.
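The clock-period figure in the third point is simple arithmetic. This Python sketch uses the 3 GHz clock and the budget of about 30 gate delays quoted there. The helper names are illustrative.

```python
def cycle_time_ps(freq_hz: float) -> float:
    """Clock period in picoseconds."""
    return 1e12 / freq_hz


def per_gate_delay_ps(freq_hz: float, gate_delays_per_cycle: int) -> float:
    """Average delay each gate may take if the cycle is split evenly."""
    return cycle_time_ps(freq_hz) / gate_delays_per_cycle


period = cycle_time_ps(3e9)
per_gate = per_gate_delay_ps(3e9, 30)
print(f"clock period at 3 GHz: {period:.0f} ps")
print(f"budget per gate delay: {per_gate:.1f} ps")
```

A budget of roughly 11 ps per gate is what makes the longest chain of logic between two registers, not the average one, the thing that sets the clock speed.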

Numbers worth remembering

  Quantity                             Value              Notes
  Transistors, Intel 4004 (1971)       2,300              10 µm process
  Transistors, Apple M3 Max (2023)     ~92 billion        TSMC N3 process
  Transistors, AMD MI300X (2024)       ~153 billion       Total across a multi-die (chiplet) package
  NAND transistor count                4 (CMOS, 2-input)  2 NMOS + 2 PMOS
  Full-adder transistor count          ~28 (textbook)     Real designs use ~24 with sharing
  Single-cycle gate budget at 3 GHz    ~333 ps            ~30 gate delays
  Modern CMOS gate oxide thickness     ~1 nm              ≈ 4 atoms thick
  Switching energy, modern transistor  ~1 fJ              10⁻¹⁵ joules
  Dynamic power scaling                P ∝ C·V²·f         Why frequency stalled at ~5 GHz
