
Clocks, latches, flip-flops

A clock is a square wave. Every register in the CPU samples its input on one edge of that wave and holds the value until the next. This is the move that turns a soup of gates into a deterministic state machine. The price is that every signal must settle within one clock period, which sets a hard ceiling on how fast a chip can run. By 2005 this ceiling had been hit. The story since then has been about getting more work done per cycle and more cycles in parallel — not about going faster.


Why a clock

Imagine a circuit that's just gates wired together — no clock. Inputs arrive, signals propagate through gate delays, outputs eventually settle. The eventually is the problem. Different gates have different delays. Different paths through the circuit have different lengths. Inputs that look simultaneous on paper arrive nanoseconds apart in silicon. Without a synchronising agent, the circuit's output depends on which signals raced to where, which depends on temperature, voltage, and manufacturing variation. You can't ship that.

The fix is to insert state elements — flip-flops — at every boundary between stages of logic. Each flop has two inputs: a data input D and a clock input. On every rising edge of the clock, the flop snaps a fresh sample of D into its output Q and holds it there until the next edge. Logic between flops is "combinational" and doesn't care when signals arrive, as long as they all settle before the next clock edge. This is synchronous design, and it's the discipline every modern CPU is built on.

It helps to split every digital circuit into two kinds of logic. Combinational logic is memoryless: its output is a pure function of its current inputs. An adder, a multiplexer, a comparator: feed the same bits in and you always get the same bits out, after some propagation delay. There is no notion of "before" or "after." Sequential logic is the opposite: its output depends on the inputs and on what it has stored. A counter that reads 41 and steps to 42 only knows to produce 42 because it remembered 41. The thing that lets a circuit remember is the state element, and the thing that decides when the remembered value updates is the clock. A CPU is sequential logic at the top level: islands of combinational logic separated by ranks of state elements that all step together.
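The distinction can be sketched in a few lines of Python. This is only a software analogy, not a hardware description: `add` stands in for a block of combinational logic, and the hypothetical `Counter` class stands in for a register plus its increment logic, with `tick()` playing the role of the clock edge.

```python
def add(a: int, b: int) -> int:
    """Combinational: the output depends only on the current inputs."""
    return a + b


class Counter:
    """Sequential: the output depends on stored state, updated per clock tick."""

    def __init__(self, start: int = 0) -> None:
        self.value = start  # the state element

    def tick(self) -> int:
        # On each clock edge the stored value is replaced by value + 1.
        self.value = add(self.value, 1)
        return self.value


counter = Counter(start=41)
print(add(2, 3), add(2, 3))  # same inputs give the same output every time
print(counter.tick())        # 42, because the counter remembered 41
```

Calling `add` twice with the same arguments always agrees; calling `tick()` twice does not, because the stored value changed in between.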

The state elements themselves are built out of the same transistors as everything else (see transistors), wired into a feedback loop. A pair of inverters whose outputs feed each other's inputs has two stable states, output high or output low, and will sit in whichever one it is pushed into. That bistable loop is the seed of every memory cell on the chip. Add a way to force the loop into a chosen state, gated by the clock, and you have built a place to store a bit. Everything from a single status flag to a 512-bit vector register is that idea, replicated.

async (chaos):  inputs ────[gates]──── outputs
                              propagation race; output undefined timing

sync (modern):  D ──→[FF]── combinational ──→[FF]── combinational ──→[FF]── Q
                clk          one clock cycle           clk           one cycle  clk

                Each flip-flop snapshots its input on the rising clock edge.
                Time advances in lockstep across the whole chip.

A D flip-flop in motion

The D flip-flop is the workhorse. Flip D or toggle CLK however you like: Q updates only when the clock goes from 0 to 1, the rising edge, and only with whatever value D happened to hold at that instant.

The word "edge" is the whole point. A plain D latch is transparent: while its enable is high, Q tracks D continuously, like an open gate. That is dangerous in a pipeline, because data could race straight through two stages in a single clock phase. A D flip-flop closes that door. It is built from two latches in series, a master and a slave, driven by opposite clock phases. While the clock is low the master is transparent and samples D; on the rising edge the master freezes and the slave opens, passing the captured value to Q. The net effect is that Q changes only at the instant of the edge and is otherwise frozen. That single, well-defined update moment is what lets a designer reason about the chip one cycle at a time.
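The master-slave arrangement described above can be modelled behaviourally. The sketch below is a simplification under stated assumptions: it ignores gate delays, setup and hold, and metastability, and the class names `DLatch` and `DFlipFlop` and the helper `trace` are made up for illustration.

```python
class DLatch:
    """Level-sensitive latch: transparent while enable is true, holds otherwise."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, enable: bool) -> int:
        if enable:
            self.q = d
        return self.q


class DFlipFlop:
    """Positive-edge-triggered flop built from a master and a slave latch."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d: int, clk: int) -> int:
        # The master is transparent while the clock is low and samples D.
        m = self.master.update(d, enable=(clk == 0))
        # The slave is transparent while the clock is high and drives Q.
        return self.slave.update(m, enable=(clk == 1))


def trace(samples):
    """Feed (d, clk) pairs to a fresh flop and return Q after each pair."""
    ff = DFlipFlop()
    return [ff.update(d, clk) for d, clk in samples]


# D=1 while CLK is low, rising edge, D drops while CLK is high, CLK falls, rising edge.
print(trace([(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]))  # [0, 1, 1, 1, 0]
```

Q picks up the 1 only at the first rising edge, ignores D falling while the clock is high, and picks up the 0 only at the second rising edge.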

[Figure: timing diagram with three traces, CLK (clock), D (data), and Q (output), with a capture marked at each rising edge.]
Q samples D at each rising edge (dashed lines) and holds the value flat until the next edge. Changes on D between edges are ignored.
The flip-flop is the smallest amount of memory in a CPU. A 64-bit register is 64 of these in parallel, sharing a clock. Apple M4 has roughly 100,000 such registers per core.

Registers and the clocked datapath

A single flip-flop stores one bit. Stack a row of them and share one clock line across the row and you have a register: a word of state that updates atomically on the edge. The program counter, the general-purpose registers, the status flags, the pipeline latches that sit between stages, every architectural and microarchitectural register is a bank of D flip-flops. When an instruction "writes a register," the result sits on the flops' D inputs through the cycle and is captured at the next edge. Reading a register is just wiring its Q outputs into the combinational logic downstream.

This gives the canonical shape of a synchronous datapath: register, then a cloud of combinational logic, then another register. The first register launches a stable value at the edge. The combinational logic, an adder, a shifter, an address calculation, has the rest of the cycle to compute on that value. The second register captures the result at the following edge. As long as the slowest path through the logic settles before that edge arrives, the circuit is correct. The clock period is the contract: everything between two flops must finish inside one period.

[Figure: reg A (flip-flops) feeding combinational logic (add / shift / compare) feeding reg B (flip-flops), both registers driven by CLK. One clock period: launch at A's edge, capture at B's edge. The slowest path here sets the maximum clock frequency.]
The unit of synchronous design: register to combinational logic to register. The longest delay through the middle block is the critical path that bounds the clock.

The slowest such path on the whole chip is the critical path, and it sets the maximum clock frequency directly. If the worst register-to-register delay is 280 ps, the chip cannot run faster than about 3.57 GHz (1 / 280 ps) no matter what, because at higher frequencies the result would not be stable by the time the capturing edge arrives. Most of the work of "making a chip faster" is finding the critical path and shortening it: splitting it across more pipeline stages so each stage does less (the subject of pipelining), using faster logic styles, or rebalancing where the registers sit. The clock period and the critical path are two views of the same constraint.
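As a rough illustration of how the critical path bounds frequency, the sketch below takes a table of register-to-register delays and reports the slowest one. The path names and every delay except the 280 ps figure from the text are invented for the example.

```python
def critical_path(path_delays_ps: dict[str, float]) -> tuple[str, float]:
    """Return the name and delay of the slowest register-to-register path."""
    name = max(path_delays_ps, key=path_delays_ps.get)
    return name, path_delays_ps[name]


def max_frequency_ghz(path_delays_ps: dict[str, float]) -> float:
    """Highest clock frequency at which the slowest path still fits in one period."""
    _, delay_ps = critical_path(path_delays_ps)
    return 1000.0 / delay_ps  # 1000 ps per ns, and 1/ns is GHz


# Hypothetical paths; only the 280 ps worst case comes from the text.
paths = {"alu_add": 280.0, "shifter": 190.0, "addr_calc": 240.0}
name, delay = critical_path(paths)
print(f"critical path: {name} ({delay} ps), f_max = {max_frequency_ghz(paths):.2f} GHz")
```

Shortening any path other than `alu_add` changes nothing here, which is why timing work concentrates on the single worst path.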

Setup and hold

A flop's setup time is the minimum interval before the clock edge during which D must already be stable. Its hold time is the minimum interval after the edge during which D must remain stable. Violate either and the flop can go metastable: the output is neither 0 nor 1 for an unpredictable interval, possibly nanoseconds, possibly much longer. In a chip that runs at gigahertz, downstream logic then reads garbage, and sooner or later that means a crash.

Of these two numbers, setup is the one that bounds the clock speed. Write the timing budget out and it reads: the clock period must be at least the launching flop's clock-to-Q delay, plus the longest combinational delay, plus the capturing flop's setup time, plus any clock skew between the two flops. Every term on the right is a fixed cost of physics and the chosen transistor library. Shrink the period below their sum and the data is still moving when the edge samples it, so you violate setup and the flop goes metastable. That inequality, not any single component, is the wall a designer runs into.
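The timing budget above can be written as a small check. This is a sketch, not a timing tool: the function names are made up, and the example delays (40 ps clock-to-Q, 200 ps logic, 50 ps setup, 20 ps skew) are illustrative assumptions rather than figures from any real library.

```python
def min_clock_period_ps(t_clk_to_q: float, t_comb_max: float,
                        t_setup: float, t_skew: float) -> float:
    """Smallest period that satisfies the setup budget for one path."""
    return t_clk_to_q + t_comb_max + t_setup + t_skew


def setup_slack_ps(period: float, t_clk_to_q: float, t_comb_max: float,
                   t_setup: float, t_skew: float) -> float:
    """Positive slack means the data settles in time; negative means a violation."""
    return period - min_clock_period_ps(t_clk_to_q, t_comb_max, t_setup, t_skew)


# Hypothetical path: all four delays are assumptions for illustration.
path = dict(t_clk_to_q=40.0, t_comb_max=200.0, t_setup=50.0, t_skew=20.0)

print(min_clock_period_ps(**path))                       # 310.0 ps
print(setup_slack_ps(1000.0 / 3.0, **path) > 0)          # True: fits at 3 GHz
print(setup_slack_ps(1000.0 / 5.0, **path) > 0)          # False: fails at 5 GHz
```

The only free variable a designer controls at run time is the period; everything else has to be bought back with shorter logic, a faster library, or better clock distribution.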

Hold is the mirror image and is sneakier. A setup violation is a max-delay problem you can fix by slowing the clock; a hold violation is a min-delay problem that no clock speed fixes. If a path is too short, new data launched at one edge can race through the logic and reach the next flop before that flop has finished holding the value it just captured. Designers fix this by deliberately padding fast paths with buffers, which is one of the few times in engineering you add delay on purpose. Because hold depends on the relationship between the data path and the clock path, it can only be checked after the physical layout is known, which is why hold bugs are a classic late-stage scramble.
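A matching sketch for the hold side, under the same caveats: the function names and all delay values are hypothetical, and `t_skew` here means how much later the capturing flop sees the edge than the launching flop. Note that the clock period does not appear anywhere in the check.

```python
import math


def hold_slack_ps(t_clk_to_q_min: float, t_comb_min: float,
                  t_hold: float, t_skew: float) -> float:
    """Positive slack: the old value is held long enough. Negative: violation."""
    return t_clk_to_q_min + t_comb_min - t_hold - t_skew


def buffers_needed(slack_ps: float, buffer_delay_ps: float) -> int:
    """Number of delay buffers to insert on the data path to reach zero slack."""
    if slack_ps >= 0:
        return 0
    return math.ceil(-slack_ps / buffer_delay_ps)


# Hypothetical short path: fast flop, almost no logic, capture clock 20 ps late.
slack = hold_slack_ps(t_clk_to_q_min=25.0, t_comb_min=5.0, t_hold=30.0, t_skew=20.0)
print(slack)                        # -20.0 ps: new data arrives too early
print(buffers_needed(slack, 10.0))  # 2 buffers of 10 ps each close the gap
```

Because the period is absent from `hold_slack_ps`, slowing the clock cannot rescue a failing path; only added delay or a rebalanced clock tree can.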

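[Figure: the D waveform around a rising CLK edge, with the setup window (tsu = 50 ps) before the edge and the hold window (th = 30 ps) after it, 80 ps in total. D stable across both windows gives a valid sample.]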
At 3 GHz the clock period is ~333 ps. A typical modern flop has tsu ≈ 50 ps and th ≈ 30 ps, leaving ~250 ps for combinational logic between flops. That's roughly 25 gate delays. Increase the clock to 5 GHz and the period drops to 200 ps — minus 80 ps of setup-and-hold, you have only 120 ps left, or about 12 gate delays. This is the wall.
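The arithmetic in that paragraph can be reproduced directly. The sketch follows the text's simplified accounting (period minus setup minus hold) and assumes a 10 ps gate delay; a fuller budget would also subtract clock-to-Q delay and skew, so treat the results as optimistic upper bounds.

```python
def logic_budget_ps(f_ghz: float, t_setup_ps: float = 50.0,
                    t_hold_ps: float = 30.0) -> float:
    """Time left for combinational logic, using the text's simplified accounting."""
    period_ps = 1000.0 / f_ghz
    return period_ps - t_setup_ps - t_hold_ps


def gate_delays(f_ghz: float, gate_delay_ps: float = 10.0) -> int:
    """Whole gate delays that fit in the budget (10 ps per gate is an assumption)."""
    return int(logic_budget_ps(f_ghz) // gate_delay_ps)


for f in (3.0, 5.0):
    print(f"{f} GHz: {logic_budget_ps(f):.0f} ps budget, {gate_delays(f)} gate delays")
```

Going from 3 GHz to 5 GHz shortens the period by 40% but roughly halves the usable logic depth, because the flop overhead does not shrink with the period.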

Why frequency stalled

Dynamic power dissipation in CMOS follows a simple equation:

P = α · C · V² · f

where α is the activity factor (fraction of gates switching per cycle), C is the effective capacitance being switched, V is the supply voltage, and f is the clock frequency. The catch: faster transistors need more voltage to switch reliably (below ~0.7 V you start losing margin against noise and threshold variation), and the equation has V squared. Push frequency up by 25%; voltage rises ~10%; power increases by roughly 50%. Heat density goes up at the same rate. By 2005 a Pentium 4 at 3.8 GHz was dissipating ~115 W in a die of ~1 cm² — the heat density of a stovetop element.

[Figure: dynamic power relative to a baseline of 1.0 V at 3.0 GHz (100%).]
Move from that baseline to 1.3 V at 5 GHz and power roughly triples (1.3² × 5/3 ≈ 2.8). The chip industry's response after 2005 was to keep voltage low, keep frequency near the cliff, and add cores. A second core at the same frequency adds 100% of the per-core power but doubles the throughput on parallel work. A 25% frequency bump on a single core adds 50% power for a 25% sequential speedup, a much worse trade.
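The scaling claims in this section follow from the power equation with α and C held fixed. The sketch below computes power relative to a baseline of 1.0 V at 3.0 GHz; the function name and the choice of baseline defaults are assumptions made for the example.

```python
def relative_dynamic_power(v: float, f_ghz: float,
                           v0: float = 1.0, f0_ghz: float = 3.0) -> float:
    """P / P0 for P = alpha * C * V^2 * f, with alpha and C unchanged."""
    return (v / v0) ** 2 * (f_ghz / f0_ghz)


# +25% frequency with +10% voltage: about 1.5x the power.
print(round(relative_dynamic_power(1.1, 3.75), 3))   # 1.512

# 1.3 V at 5 GHz against the 1.0 V, 3.0 GHz baseline: about 2.8x.
print(round(relative_dynamic_power(1.3, 5.0), 3))    # 2.817

# Two cores at the baseline operating point: 2x power for up to 2x parallel throughput.
print(2 * relative_dynamic_power(1.0, 3.0))          # 2.0
```

The quadratic voltage term is what makes the single-core route expensive: the frequency gain is linear, but the voltage needed to reach it is paid for twice over.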

Frequency milestones

The trajectory in numbers, from 740 kHz in 1971 to ~6 GHz boost in 2024. The plateau between 2002 and 2018 is the wall in action — single-core frequency barely budged for sixteen years while transistor counts kept doubling.

| Year | Chip | Top frequency |
|------|------|---------------|
| 1971 | Intel 4004 | 740 kHz |
| 1989 | Intel 486 | 25 MHz |
| 1995 | Pentium Pro | 200 MHz |
| 2000 | Pentium III | 1 GHz (AMD's Athlon crossed 1 GHz days earlier) |
| 2002 | Pentium 4 | 3.06 GHz · Hyper-Threading debut |
| 2005 | Pentium 4 670 | 3.8 GHz · about where the wall arrived |
| 2011 | Intel Core i7-2600K | 3.4 GHz · 4 cores |
| 2018 | Intel i9-9900K | 5.0 GHz boost · 8 cores |
| 2024 | Intel i9-14900KS | 6.2 GHz boost · 24 cores |
| 2024 | Apple M4 (P-core) | 4.4 GHz · 14 cores · sustained |

Clock skew

In a chip the size of a fingernail, the speed of light matters. The clock signal has to travel from the PLL through a tree of buffers and wires to every flip-flop on the die. Wires have RC delay. Buffers have propagation delay. Left unbalanced, the edge reaching a flop near the corner can lag the edge at the centre by hundreds of picoseconds. The difference in arrival time between two flops is clock skew.

[Figure: one flop at the centre of the die and one at the corner, with ~50 ps of skew between them.]
Two flops on different ends of the die receive the clock at slightly different times. Modern chips fight this with H-tree distribution networks that equalise wire lengths, deliberately balanced buffer chains, and per-region active deskew. Even so, every chip ships with some residual skew, typically tens of picoseconds, and that comes straight out of the timing budget.

PLLs and dynamic frequency

The clock that arrives on the die is not the clock the chip uses. A reference oscillator running at tens of MHz up to about 100 MHz (often a quartz crystal on the motherboard) feeds a phase-locked loop that synthesises the multi-gigahertz core clock and locks its phase to the reference. The PLL is the analog island in an otherwise digital chip: voltage-controlled oscillator, charge pump, low-pass filter, all tuned at design time.

Modern CPUs change frequency aggressively at runtime. Intel calls this SpeedStep + Turbo Boost. AMD calls it Precision Boost. Apple uses power-state-management ("E-cores" run at one PLL plan, "P-cores" at another). The transition takes microseconds — the PLL has to re-lock at the new frequency, so frequency changes are batched. A typical laptop CPU sits idle at ~400 MHz to save power and ramps to 4–5 GHz when work shows up.

The "5 GHz" on a spec sheet is a maximum, not an average. Modern boost frequencies are sustained for milliseconds and only when one or two cores are active. All-core sustained frequency is typically 30–40% lower because thermal and power limits dominate.

Clock domains and crossing between them

The neat picture of "one clock for the whole chip" stopped being true decades ago. A modern SoC is carved into dozens of clock domains, each a region of logic driven by its own clock at its own frequency. Each CPU core has a domain so it can boost or idle independently. The last-level cache runs in another. The memory controller tracks the DRAM's clock. The PCIe lanes, the USB blocks, the display engine, the GPU, the neural engine, the on-chip interconnect, each sits in a domain tuned to its own job. Splitting the chip this way lets idle blocks drop to a crawl while busy ones run flat out, which is the single biggest lever on a chip's power.

The trouble starts when a signal has to cross from one domain to another. The receiving flop's clock has no fixed relationship to the moment the sending domain changed the data, so sooner or later the data will change inside the receiver's setup-and-hold window. That is a metastability event by construction, and across billions of crossings per second it will happen. You cannot prevent it, only bound how often it causes a failure.

The standard fix for a single-bit crossing is a two-flop synchroniser: two flip-flops in series in the receiving domain. If the first flop goes metastable on a crossing, it almost always settles to a clean 0 or 1 within one clock period, so the second flop captures a stable value. This does not eliminate failure; it pushes the mean time between failures out to years or centuries, which is good enough to ship. Multi-bit crossings need more care, because the bits can resolve on different cycles and produce a value that was never sent. Designers move whole words across with handshakes, Gray-coded counters, or asynchronous FIFOs rather than a flop per bit. Getting these crossings wrong is one of the most common sources of intermittent, impossible-to-reproduce hardware bugs.
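Two pieces of that paragraph lend themselves to a short sketch. The first function uses the widely quoted synchroniser model MTBF = e^(t_r/τ) / (T_w · f_clk · f_data), where t_r is the time allowed for metastability to resolve, τ is the flop's resolution time constant, and T_w is its vulnerable window; the text does not give this formula, and every parameter value below is an illustrative assumption. The second function is the standard binary-to-Gray conversion, which guarantees that consecutive counter values differ in exactly one bit.

```python
import math


def synchroniser_mtbf_s(t_resolve_s: float, tau_s: float, t_window_s: float,
                        f_clk_hz: float, f_data_hz: float) -> float:
    """Mean time between synchroniser failures, in seconds (common textbook model)."""
    return math.exp(t_resolve_s / tau_s) / (t_window_s * f_clk_hz * f_data_hz)


def to_gray(n: int) -> int:
    """Binary to Gray code: adjacent integers map to codes one bit apart."""
    return n ^ (n >> 1)


# Hypothetical flop and traffic parameters, chosen only to show the trend.
params = dict(tau_s=10e-12, t_window_s=20e-12, f_clk_hz=3e9, f_data_hz=1e8)
period_s = 1 / params["f_clk_hz"]

no_wait = synchroniser_mtbf_s(0.0, **params)
one_cycle = synchroniser_mtbf_s(period_s, **params)
print(f"improvement from one extra cycle of settling: {one_cycle / no_wait:.3g}x")

print([format(to_gray(n), "03b") for n in range(8)])
```

The exponential in the first function is why a second flop helps so much: each extra period of settling time multiplies the MTBF by e^(T/τ). The Gray sequence shows why a multi-bit counter can cross safely: a sample taken mid-transition is off by at most one count, never a value that was not adjacent to the one sent.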

From one fast clock to many cores

For thirty years the easy way to make software faster was to wait. Each new process node let transistors switch faster, the clock went up, and existing code sped up with no effort from anyone. Dennard scaling was the reason: as transistors shrank, you could keep power density constant while raising frequency. Around 2005 that bargain broke. Leakage current stopped falling with size, voltage could no longer drop in step, and the P = α · C · V² · f term turned every frequency gain into a heat problem the package could not shed. The single fast clock had hit a ceiling set by physics, not ambition.

The industry's answer was to stop chasing frequency and spend the still-doubling transistor budget on more cores instead. Two cores at the same clock roughly double throughput on parallel work while adding power linearly, a far better trade than a small single-thread gain bought at quadratic power. Every chip since is a study in this shift: more cores, wider pipelines that do more per cycle, big and little cores on the same die, and aggressive per-domain frequency scaling so only the busy parts burn power. The catch landed on software. A faster clock sped up every program for free; more cores only help code that can be split across them, which is why concurrency, parallelism, and the cost of synchronising threads became the defining problems of the multicore era. The clock did not get faster, so the work had to get wider, and that pushed complexity straight up the stack into the software people write.

The flip-flop family

| Element | Behaviour | Where it's used |
|---------|-----------|-----------------|
| SR latch | Set / Reset, level-sensitive. Q goes high when S is asserted and low when R is asserted; both high is forbidden. | Building block; rarely used directly today. |
| D latch | Transparent when enable is high. Q follows D while enable is high; holds when enable goes low. | Cheap state element in synchronous logic, but two-phase clocking needed. |
| D flip-flop | Edge-triggered. Q samples D only on the rising (or falling) clock edge; holds otherwise. | The default state element in modern CPUs. Almost every register is a row of D flip-flops. |
| JK flip-flop | Set / Reset / Toggle. J=K=1 toggles. More flexible than D but uses more transistors. | Counters, state machines in older designs. |
| T flip-flop | Toggle. Q flips on every clock edge if T is high. | Frequency dividers, counters. |

Modern CPUs are built on the positive-edge-triggered D flip-flop, almost universally. Other shapes appear in specialised places: scan flops for testability, synchroniser flops for clock-domain crossings, latch-based pipelines where designers want to "borrow time" from one stage to relax timing in another. Apple's silicon is reportedly unusual in using a substantial number of pulse latches alongside flops to shave picoseconds off critical paths.
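The frequency-divider use of the T flip-flop is easy to see in a behavioural sketch. The `TFlipFlop` class and the `count_rising_edges` helper are hypothetical names, and the model ignores all timing; it only shows that with T held high, Q completes one cycle for every two clock cycles.

```python
class TFlipFlop:
    """Toggle flip-flop: Q inverts on each rising clock edge while T is high."""

    def __init__(self) -> None:
        self.q = 0
        self._prev_clk = 0

    def update(self, t: int, clk: int) -> int:
        rising = self._prev_clk == 0 and clk == 1
        if rising and t:
            self.q ^= 1
        self._prev_clk = clk
        return self.q


def count_rising_edges(wave: list[int]) -> int:
    """Count 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)


clk = [0, 1] * 8                      # eight clock cycles, two samples each
ff = TFlipFlop()
q = [ff.update(1, c) for c in clk]    # T tied high

print(count_rising_edges(clk), count_rising_edges(q))  # 8 4
```

Chaining a second such stage on Q would halve the rate again, which is the basis of a ripple counter.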

Common misconceptions

  • "The chip has one clock." Modern chips have dozens of clock domains: each core, the L3, the memory controller, the PCIe controller, the display engine, the GPU, the NPU, the SoC fabric. They run at different frequencies and cross domains through synchronisers, often 2-flop synchronisers, which cannot eliminate metastability failures but push the mean time between them out far enough to ship.
  • "Higher frequency means faster." Only if work-per-cycle is the same. Apple silicon at 4.4 GHz can match or beat x86 at 5.5 GHz on the same workload because each cycle accomplishes more: wider dispatch, a larger reorder buffer, a larger register file.
  • "Asynchronous chips would be better." A handful of true asynchronous chips have been built (the University of Manchester's AMULET ARM processors, Achronix's early FPGAs). They handle variable workloads gracefully and use less power on light tasks. They're vastly harder to design and verify; the tooling industry never followed.
  • "Hold violations don't happen because hold time is so small." They do, especially after layout: a late clock edge at the capturing flop combined with a fast data path can deliver new data before the previous value has been held long enough. Every modern timing tool checks both setup and hold.

Numbers worth remembering

| Quantity | Value | Notes |
|----------|-------|-------|
| Clock period at 3 GHz | ~333 ps | The whole budget per cycle |
| Clock period at 5 GHz | 200 ps | Where the wall sits |
| Modern flip-flop setup time | ~30–80 ps | Process- and library-dependent |
| Modern flip-flop hold time | ~5–30 ps | Often near zero in current libraries |
| Combinational budget at 3 GHz | ~250 ps | ~25 gate delays |
| Single CMOS gate delay (3 nm) | ~10 ps | Fanout-of-1, no wire load |
| Idle laptop CPU frequency | ~400 MHz | ~10× power savings vs boost |
| Dynamic power | P ∝ C·V²·f | The equation behind the wall |
| Pentium 4 670 (2005) TDP | 115 W | 3.8 GHz, ~1 cm² die |
| Number of clock domains, modern SoC | ~30–80 | Each core, cache, fabric, PCIe lane group, etc. |
