Clocks, latches, flip-flops
A clock is a square wave. Every register in the CPU samples its input on one edge of that wave and holds the value until the next. This is the move that turns a soup of gates into a deterministic state machine. The price is that every signal must settle within one clock period, which sets a hard ceiling on how fast a chip can run. By 2005 this ceiling had been hit. The story since then has been about getting more work done per cycle and more cycles in parallel — not about going faster.
Why a clock
Imagine a circuit that's just gates wired together — no clock. Inputs arrive, signals propagate through gate delays, outputs eventually settle. The eventually is the problem. Different gates have different delays. Different paths through the circuit have different lengths. Inputs that look simultaneous on paper arrive nanoseconds apart in silicon. Without a synchronising agent, the circuit's output depends on which signals raced to where, which depends on temperature, voltage, and manufacturing variation. You can't ship that.
The fix is to insert state elements — flip-flops — at every
boundary between stages of logic. Each flop has two inputs: a data input
D and a clock input. On every rising edge of the clock, the flop
snaps a fresh sample of D into its output Q and holds it
there until the next edge. Logic between flops is "combinational" and doesn't
care when signals arrive, as long as they all settle before the next clock edge.
This is synchronous design, and it's the discipline every modern
CPU is built on.
It helps to split every digital circuit into two kinds of logic. Combinational logic is memoryless: its output is a pure function of its current inputs. Take an adder, a multiplexer, or a comparator: feed the same bits in and you always get the same bits out, after some propagation delay. There is no notion of "before" or "after." Sequential logic is the opposite: its output depends on the inputs and on what it has stored. A counter that reads 41 and steps to 42 only knows to produce 42 because it remembered 41. The thing that lets a circuit remember is the state element, and the thing that decides when the remembered value updates is the clock. A CPU is sequential logic at the top level: islands of combinational logic separated by ranks of state elements that all step together.
The state elements themselves are built out of the same transistors as everything else (see transistors), wired into a feedback loop. A pair of inverters whose outputs feed each other's inputs has two stable states, output high or output low, and will sit in whichever one it is pushed into. That bistable loop is the seed of every memory cell on the chip. Add a way to force the loop into a chosen state, gated by the clock, and you have built a place to store a bit. Everything from a single status flag to a 512-bit vector register is that idea, replicated.
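As a rough sketch of that bistable loop, the Python model below cross-couples two NOR gates (the inverter pair with a set and a reset input added) and re-evaluates them until the loop settles. The names `nor` and `sr_latch` and the iterate-until-stable approach are illustrative assumptions, not a circuit-accurate simulation.

```python
def nor(a: int, b: int) -> int:
    """Two-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1


def sr_latch(s: int, r: int, q: int) -> int:
    """Settle a cross-coupled NOR latch and return the new Q.

    s, r: set and reset inputs (both 1 at once is forbidden).
    q:    the value the loop currently holds.
    """
    if s and r:
        raise ValueError("S and R must not both be asserted")
    q_bar = 1 - q
    # Re-evaluate both gates until the feedback loop stops changing.
    for _ in range(4):
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q


q = 0
q = sr_latch(s=1, r=0, q=q)   # force the loop high
q = sr_latch(s=0, r=0, q=q)   # inputs released: the loop remembers
print(q)                      # 1
q = sr_latch(s=0, r=1, q=q)   # force the loop low
q = sr_latch(s=0, r=0, q=q)   # still remembered
print(q)                      # 0
```

With both inputs released the function returns whatever it was handed, which is the "sits in whichever state it is pushed into" behaviour in miniature.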
async (chaos):  inputs ────[gates]──── outputs
                propagation race; output undefined timing
sync (modern):  D ──→[FF]── combinational ──→[FF]── combinational ──→[FF]── Q
                     clk   one clock cycle   clk      one cycle      clk
Each flip-flop snapshots its input on the rising clock edge.
Time advances in lockstep across the whole chip.
A D flip-flop in motion
The D flip-flop is the workhorse. Flip the input D as often as you like
while the clock is steady and Q does not move. Q updates only when the
clock goes from 0 to 1, the rising edge, and only with whatever value
D happened to hold at that instant.
The word "edge" is the whole point. A plain D latch is transparent: while its
enable is high, Q tracks D continuously, like an open gate. That
is dangerous in a pipeline, because data could race straight through two stages in a
single clock phase. A D flip-flop closes that door. It is built from two latches
in series, a master and a slave, driven by opposite clock phases. While the clock is low
the master is transparent and samples D; on the rising edge the master
freezes and the slave opens, passing the captured value to Q. The net effect
is that Q changes only at the instant of the edge and is otherwise frozen.
That single, well-defined update moment is what lets a designer reason about the chip one
cycle at a time.
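The master and slave arrangement described above can be sketched in a few lines of Python. This is a behavioural toy with assumed class names (`DLatch`, `DFlipFlop`) and a made-up stimulus, and it ignores real timing such as setup, hold, and clock-to-Q delay.

```python
class DLatch:
    """Level-sensitive latch: transparent while enable is 1, holds otherwise."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, enable: int) -> int:
        if enable:
            self.q = d
        return self.q


class DFlipFlop:
    """Rising-edge D flip-flop built from a master and a slave latch."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d: int, clk: int) -> int:
        # The master is transparent while the clock is low...
        m = self.master.update(d, enable=1 - clk)
        # ...and the slave is transparent while the clock is high.
        return self.slave.update(m, enable=clk)


ff = DFlipFlop()
# (clk, d) pairs: D wiggles freely, but Q should move only on 0 -> 1 edges.
stimulus = [(0, 1), (1, 1), (1, 0), (0, 0), (0, 1), (0, 0), (1, 0), (1, 1)]
trace = [ff.update(d, clk) for clk, d in stimulus]
print(trace)   # [0, 1, 1, 1, 1, 1, 0, 0]
```

Q changes at the second and seventh steps only, the two places where the clock rises, and each time it takes the value D held just before the edge.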
Registers and the clocked datapath
A single flip-flop stores one bit. Stack a row of them and share one clock line across the row and you have a register: a word of state that updates atomically on the edge. The program counter, the general-purpose registers, the status flags, the pipeline latches that sit between stages: every architectural and microarchitectural register is a bank of D flip-flops. When an instruction "writes a register," the result sits on the flops' D inputs through the cycle and is captured at the next edge. Reading a register is just wiring its Q outputs into the combinational logic downstream.
This gives the canonical shape of a synchronous datapath: register, then a cloud of combinational logic, then another register. The first register launches a stable value at the edge. The combinational logic, an adder, a shifter, an address calculation, has the rest of the cycle to compute on that value. The second register captures the result at the following edge. As long as the slowest path through the logic settles before that edge arrives, the circuit is correct. The clock period is the contract: everything between two flops must finish inside one period.
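A minimal sketch of that register, logic, register shape, assuming a hypothetical `Register` class and an `add_one` function standing in for the combinational cloud. It models only the cycle-by-cycle behaviour, not the delays inside a cycle.

```python
class Register:
    """A bank of D flip-flops modelled as one integer that updates on tick()."""

    def __init__(self, width: int, reset: int = 0) -> None:
        self.mask = (1 << width) - 1
        self.d = reset   # value waiting on the D inputs
        self.q = reset   # value currently visible on the Q outputs

    def tick(self) -> None:
        """Rising clock edge: capture D into Q."""
        self.q = self.d & self.mask


def add_one(x: int) -> int:
    """Stand-in for the combinational logic between the two registers."""
    return x + 1


launch = Register(width=8, reset=41)
capture = Register(width=8)

history = []
for cycle in range(3):
    # During the cycle: the logic settles on the value the first register launched.
    capture.d = add_one(launch.q)
    launch.d = launch.q + 10      # next operand queued at the first register
    # At the edge: both registers capture together.
    launch.tick()
    capture.tick()
    history.append((launch.q, capture.q))

print(history)   # [(51, 42), (61, 52), (71, 62)]
```

The capturing register always holds the result for the operand launched one cycle earlier, which is the one-period contract in action.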
The slowest such path on the whole chip is the critical path, and it sets the maximum clock frequency directly. If the worst register-to-register delay is 280 ps, the chip cannot run faster than about 3.5 GHz no matter what, because at higher frequencies the result would not be stable by the time the capturing edge arrives. Most of the work of "making a chip faster" is finding the critical path and shortening it: splitting it across more pipeline stages so each stage does less (the subject of pipelining), using faster logic styles, or rebalancing where the registers sit. The clock period and the critical path are two views of the same constraint.
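The arithmetic behind the 280 ps example can be checked with a short sketch. The path names and the two shorter delays are invented for illustration; only the 280 ps figure comes from the text.

```python
def max_frequency_ghz(path_delays_ps: dict) -> tuple:
    """Return the critical path's name and the clock ceiling it implies."""
    name = max(path_delays_ps, key=path_delays_ps.get)
    period_ps = path_delays_ps[name]
    return name, 1000.0 / period_ps   # a period of 1000 ps corresponds to 1 GHz


# Hypothetical register-to-register delays, in picoseconds.
paths = {"alu_add": 280.0, "branch_compare": 240.0, "load_align": 265.0}
critical, f_max = max_frequency_ghz(paths)
print(f"{critical}: {f_max:.2f} GHz")   # alu_add: 3.57 GHz
```

Shortening any path other than `alu_add` leaves the result unchanged, which is why effort goes into the critical path first.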
Setup and hold
A flop's setup time is the minimum interval before the clock edge
during which D must already be stable. Its hold time
is the minimum interval after the edge during which D must remain
stable. Violate either and the flop can go metastable: the output
is neither 0 nor 1 for an unpredictable interval, possibly nanoseconds, possibly
much longer. In a chip that sees billions of edges per second, an unguarded
violation is a crash waiting to happen.
Setup is the constraint that bounds the clock speed. Write the timing budget out and it reads: the clock period must be at least the launching flop's clock-to-Q delay, plus the longest combinational delay, plus the capturing flop's setup time, plus any clock skew between the two flops. Every term on the right is a fixed cost of physics and the chosen transistor library. Shrink the period below their sum and the data is still moving when the edge samples it, so you violate setup and the flop can go metastable. That inequality, not any single component, is the wall a designer runs into.
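Written as code, the setup budget is a one-line sum. The sketch below uses assumed picosecond values that do not come from any real cell library; the function and parameter names are illustrative.

```python
def min_period_ps(t_clk_to_q: float, t_comb_max: float,
                  t_setup: float, t_skew: float) -> float:
    """Smallest clock period (ps) that satisfies the setup constraint."""
    return t_clk_to_q + t_comb_max + t_setup + t_skew


def setup_slack_ps(period: float, **terms: float) -> float:
    """Positive slack means the data settles before the capturing edge."""
    return period - min_period_ps(**terms)


# Assumed example values in picoseconds, not taken from any real library.
terms = dict(t_clk_to_q=40.0, t_comb_max=250.0, t_setup=30.0, t_skew=13.0)

print(min_period_ps(**terms))            # 333.0
print(setup_slack_ps(400.0, **terms))    # 67.0
print(setup_slack_ps(300.0, **terms))    # -33.0
```

A negative slack at a 300 ps period is the setup violation described above: the only remedies are a longer period or a smaller term on the right.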
Hold is the mirror image and is sneakier. A setup violation is a max-delay problem you can fix by slowing the clock; a hold violation is a min-delay problem that no clock speed fixes. If a path is too short, new data launched at one edge can race through the logic and reach the next flop before that flop has finished holding the value it just captured. Designers fix this by deliberately padding fast paths with buffers, which is one of the few times in engineering you add delay on purpose. Because hold depends on the relationship between the data path and the clock path, it can only be checked after the physical layout is known, which is why hold bugs are a classic late-stage scramble.
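The hold check can be sketched the same way. Note that the clock period never appears in it, which is the sense in which no clock speed fixes a hold violation. The delay values, the buffer delay, and the function names are assumptions for illustration; `t_skew` here means how much later the clock reaches the capturing flop than the launching one.

```python
import math


def hold_slack_ps(t_clk_to_q_min: float, t_comb_min: float,
                  t_hold: float, t_skew: float) -> float:
    """Positive slack: new data arrives after the hold window has closed."""
    return (t_clk_to_q_min + t_comb_min) - (t_hold + t_skew)


def buffers_needed(slack_ps: float, t_buffer: float) -> int:
    """How many delay buffers pad a short path back to non-negative slack."""
    if slack_ps >= 0:
        return 0
    return math.ceil(-slack_ps / t_buffer)


# Assumed picosecond values for a very short flop-to-flop path.
slack = hold_slack_ps(t_clk_to_q_min=20.0, t_comb_min=5.0,
                      t_hold=15.0, t_skew=40.0)
print(slack)                                  # -30.0
print(buffers_needed(slack, t_buffer=12.0))   # 3
```

The skew term is only known once the clock tree is laid out, so the number of padding buffers cannot be settled before layout either.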
Why frequency stalled
Dynamic power dissipation in CMOS follows a simple equation:
P = α · C · V² · f
where α is the activity factor (fraction of gates switching per cycle), C is the effective capacitance being switched, V is the supply voltage, and f is the clock frequency. The catch: faster transistors need more voltage to switch reliably (below ~0.7 V you start losing margin against noise and threshold variation), and the equation has V squared. Push frequency up by 25%; voltage rises ~10%; power increases by roughly 50%. Heat density goes up at the same rate. By 2005 a Pentium 4 at 3.8 GHz was dissipating ~115 W in a die of ~1 cm², the heat density of a stovetop element.
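The "25% more frequency costs about 50% more power" figure follows directly from the equation. In the sketch below the baseline values for α, C, V, and f are placeholders; only the ratio matters.

```python
def dynamic_power(alpha: float, c: float, v: float, f: float) -> float:
    """Dynamic CMOS power, P = alpha * C * V^2 * f."""
    return alpha * c * v ** 2 * f


# Baseline values are placeholders; only the ratio is meaningful here.
base = dynamic_power(alpha=0.1, c=1e-9, v=1.0, f=3.0e9)
pushed = dynamic_power(alpha=0.1, c=1e-9, v=1.10, f=3.0e9 * 1.25)

ratio = pushed / base
print(f"{ratio:.2f}x")   # 1.51x
```

The frequency term contributes 1.25 and the squared voltage term contributes 1.21; their product is the 1.51.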
Frequency milestones
The trajectory in numbers, from 740 kHz in 1971 to ~6 GHz boost in 2024. The plateau between 2002 and 2018 is the wall in action — single-core frequency barely budged for sixteen years while transistor counts kept doubling.
| Year | Chip | Top frequency |
|---|---|---|
| 1971 | Intel 4004 | 740 kHz |
| 1989 | Intel 486 | 25 MHz |
| 1995 | Pentium Pro | 200 MHz |
| 2000 | Pentium III | 1 GHz, Intel's first gigahertz part (AMD's Athlon reached 1 GHz days earlier) |
| 2002 | Pentium 4 | 3.06 GHz — Hyper-Threading debut |
| 2005 | Pentium 4 670 | 3.8 GHz — about where the wall arrived |
| 2011 | Intel Core i7-2600K | 3.4 GHz · 4 cores |
| 2018 | Intel i9-9900K | 5.0 GHz boost · 8 cores |
| 2024 | Intel i9-14900KS | 6.2 GHz boost · 24 cores |
| 2024 | Apple M4 (P-core) | 4.4 GHz · 10 cores · sustained |
Clock skew
In a chip the size of a fingernail, the speed of light matters. The clock signal has to travel from the PLL through a tree of buffers and wires to every flip-flop on the die. Wires have RC delay. Buffers have propagation delay. By the time the edge reaches a flop near the corner, it can be hundreds of picoseconds behind the edge at the centre. This is clock skew.
PLLs and dynamic frequency
The clock that arrives on the die is not the clock the chip uses. A slow reference clock, typically tens of MHz up to about 100 MHz and usually derived from a quartz crystal on the motherboard, feeds a phase-locked loop that synthesises the multi-gigahertz core clock and locks its phase to the reference. The PLL is the analog island in an otherwise digital chip: voltage-controlled oscillator, charge pump, low-pass filter, all tuned at design time.
Modern CPUs change frequency aggressively at runtime. Intel calls this SpeedStep + Turbo Boost. AMD calls it Precision Boost. Apple uses power-state-management ("E-cores" run at one PLL plan, "P-cores" at another). The transition takes microseconds — the PLL has to re-lock at the new frequency, so frequency changes are batched. A typical laptop CPU sits idle at ~400 MHz to save power and ramps to 4–5 GHz when work shows up.
Clock domains and crossing between them
The neat picture of "one clock for the whole chip" stopped being true decades ago. A modern SoC is carved into dozens of clock domains, each a region of logic driven by its own clock at its own frequency. Each CPU core has a domain so it can boost or idle independently. The last-level cache runs in another. The memory controller tracks the DRAM's clock. The PCIe lanes, the USB blocks, the display engine, the GPU, the neural engine, the on-chip interconnect: each sits in a domain tuned to its own job. Splitting the chip this way lets idle blocks drop to a crawl while busy ones run flat out, which is the single biggest lever on a chip's power.
The trouble starts when a signal has to cross from one domain to another. The receiving flop's clock has no fixed relationship to the moment the sending domain changed the data, so sooner or later the data will change inside the receiver's setup-and-hold window. That is a metastability event by construction, and across billions of crossings per second it will happen. You cannot prevent it, only bound how often it causes a failure.
The standard fix for a single-bit crossing is a two-flop synchroniser: two flip-flops in series in the receiving domain. If the first flop goes metastable on a crossing, it almost always settles to a clean 0 or 1 within one clock period, so the second flop captures a stable value. This does not eliminate failure; it pushes the mean time between failures out to years or centuries, which is good enough to ship. Multi-bit crossings need more care, because the bits can resolve on different cycles and produce a value that was never sent. Designers move whole words across with handshakes, Gray-coded counters, or asynchronous FIFOs rather than a flop per bit. Getting these crossings wrong is one of the most common sources of intermittent, impossible-to-reproduce hardware bugs.
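To see why Gray-coded counters help with multi-bit crossings, compare how many bits change per step. The sketch below uses the standard binary-reflected Gray code; the helper names are illustrative.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions that differ between a and b."""
    return bin(a ^ b).count("1")


# A plain binary pointer stepping 7 -> 8 flips four bits at once, so a
# receiver sampling mid-transition could see any mix of old and new bits.
print(bits_changed(7, 8))   # 4

# The Gray-coded pointer flips exactly one bit on every step, so a sample
# taken mid-transition is always either the old value or the new one.
steps = [bits_changed(to_gray(i), to_gray(i + 1)) for i in range(15)]
print(steps)                # fifteen 1s
```

With only one bit in flight per step, each bit can go through its own two-flop synchroniser without the word ever resolving to a value that was never sent.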
From one fast clock to many cores
For thirty years the easy way to make software faster was to wait. Each new process node let
transistors switch faster, the clock went up, and existing code sped up with no effort from
anyone. Dennard scaling was the reason: as transistors shrank, you could keep power density
constant while raising frequency. Around 2005 that bargain broke. Leakage current stopped
falling with size, voltage could no longer drop in step, and the
P = α · C · V² · f term turned every frequency gain into a heat problem the
package could not shed. The single fast clock had hit a ceiling set by physics, not
ambition.
The industry's answer was to stop chasing frequency and spend the still-doubling transistor budget on more cores instead. Two cores at the same clock roughly double throughput on parallel work while adding power linearly, a far better trade than a small single-thread gain bought at quadratic power. Every chip since is a study in this shift: more cores, wider pipelines that do more per cycle, big and little cores on the same die, and aggressive per-domain frequency scaling so only the busy parts burn power. The catch landed on software. A faster clock sped up every program for free; more cores only help code that can be split across them, which is why concurrency, parallelism, and the cost of synchronising threads became the defining problems of the multicore era. The clock did not get faster, so the work had to get wider, and that pushed complexity straight up the stack into the software people write.
The flip-flop family
| Element | Behaviour | Where it's used |
|---|---|---|
| SR latch | Set / Reset, level-sensitive. Asserting S drives Q high, asserting R drives it low; both high at once is forbidden. | Building block; rarely used directly today. |
| D latch | Transparent when enable is high. Q follows D while enable is high; holds when enable goes low. | Cheap state element, but latch-based synchronous designs need two-phase clocking. |
| D flip-flop | Edge-triggered. Q samples D only on the rising (or falling) clock edge; holds otherwise. | The default state element in modern CPUs. Almost every register is a row of D flip-flops. |
| JK flip-flop | Set / Reset / Toggle. J=K=1 toggles. More flexible than D but uses more transistors. | Counters, state machines in older designs. |
| T flip-flop | Toggle. Q flips on every clock edge if T is high. | Frequency dividers, counters. |
Modern CPUs are built almost universally on the positive-edge-triggered D flip-flop. Other shapes appear in specialised places: scan flops for testability, two-flop synchronisers for clock-domain crossings, latch-based pipelines where designers want to "borrow time" from one stage to relax timing in another. Apple's silicon is unusual in using a substantial number of pulse latches alongside flops to shave picoseconds off critical paths.
Common misconceptions
- "The chip has one clock." Modern chips have dozens of clock domains: each core, the L3, the memory controller, the PCIe controller, the display engine, the GPU, the NPU, the SoC fabric. They run at different frequencies and cross domains through synchronisers, often 2-flop synchronisers that cannot eliminate metastability failures but push the mean time between them out far enough to ship.
- "Higher frequency means faster." Only if work-per-cycle is the same. Modern Apple silicon at 4.4 GHz often beats x86 at 5.5 GHz on the same workload because each cycle accomplishes more: wider dispatch, larger reorder buffer, larger register file.
- "Asynchronous chips would be better." A handful of true asynchronous chips have shipped (the University of Manchester's AMULET ARM processors, Achronix's early asynchronous FPGAs). They handle variable workloads gracefully and use less power on light tasks. They're vastly harder to design and verify; the tooling industry never followed.
- "Hold violations don't happen because hold time is so small." They do, especially after layout: a fast clock path to one flop combined with a fast data path can deliver new data before the previous edge has been "held" long enough. Every modern timing tool checks both.
Numbers worth remembering
| Quantity | Value | Notes |
|---|---|---|
| Clock period at 3 GHz | ~333 ps | The whole budget per cycle |
| Clock period at 5 GHz | 200 ps | Where the wall sits |
| Modern flip-flop setup time | ~30–80 ps | Process- and library-dependent |
| Modern flip-flop hold time | ~5–30 ps | Often near zero in current libraries |
| Combinational budget at 3 GHz | ~250 ps | ~25 gate delays |
| Single CMOS gate delay (3 nm) | ~10 ps | Fanout-of-1, no wire load |
| Idle laptop CPU frequency | ~400 MHz | ~10× power savings vs boost |
| P ∝ C·V²·f | — | The equation behind the wall |
| Pentium 4 670 (2005) TDP | 115 W | ~3.8 GHz, ~1 cm² die |
| Number of clock domains, modern SoC | ~30–80 | Each core, cache, fabric, PCIe lane group, etc. |
Further reading
- Wikipedia — Flip-flop (electronics) — comprehensive coverage of every flop variant and their timing parameters.
- Wikipedia — Metastability — what happens when setup or hold is violated, plus how 2-flop synchronisers bound the failure rate.
- Agner Fog — Microarchitecture of Intel, AMD, and VIA CPUs — Section 1 covers the pipeline frontend, including the relationship between clock and stage budgets.
- Weste & Harris — CMOS VLSI Design. Chapter 7 on sequential circuits is the canonical reference for setup/hold, clock distribution, and PLL design.
- Patterson & Hennessy — Computer Organization and Design (RISC-V Edition), Appendix A.7. Brief, engineer-facing introduction to flip-flops and their use as state elements.
- Wikipedia — Dynamic voltage and frequency scaling — how modern chips trade frequency for power and back.
- Chips and Cheese — modern measurements of frequency curves vs power on every recent CPU and GPU.