CPU Cache Simulator: MESI coherence

Four cores, each with a tiny per-core L1, sharing a small L3 backed by DRAM. Every cache line carries a MESI state. Issue reads and writes manually, or run a scripted scenario — shared reads, false sharing, producer-consumer — and watch the protocol invalidate, fetch, or hit. The cost panel tallies cycles using realistic latency numbers.


Caches

core 0 L1 · 4 lines: slots 0-3, all I
core 1 L1 · 4 lines: slots 0-3, all I
core 2 L1 · 4 lines: slots 0-3, all I
core 3 L1 · 4 lines: slots 0-3, all I
shared L3 · 12 lines: slots 0-11, all I

Controls

Same address → same line. Step the same line through different cores to watch the MESI transitions.

Cost panel

cycles ticked: 0
L1 hits: 0
L3 hits: 0
DRAM misses: 0
coherence msgs: 0
writebacks: 0
L1 hit rate: 0%
amortised cy/op: 0
Assumed cycle costs: L1 hit 4 cy, L3 hit 45 cy, DRAM fetch 250 cy, coherence message 30 cy. These are ballpark figures: real values vary by chip, but recent designs such as Apple M4, Intel Raptor Lake, and AMD Zen 5 fall within roughly 30% of them.
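
As a rough illustration of how such a tally could work, the sketch below weights each event count by the assumed costs listed above and divides by the number of operations. The plain weighted sum and the names COSTS, total_cycles, and amortised_cycles_per_op are assumptions made for this example; the simulator's own accounting may differ.

```python
# Assumed per-event costs in cycles, taken from the cost panel text.
COSTS = {
    "l1_hit": 4,
    "l3_hit": 45,
    "dram_fetch": 250,
    "coherence_msg": 30,
}


def total_cycles(counts):
    """Weighted sum of event counts (hypothetical accounting)."""
    return sum(COSTS[event] * n for event, n in counts.items())


def amortised_cycles_per_op(counts, ops):
    """Average cycles per operation; 0.0 when nothing has run yet."""
    return total_cycles(counts) / ops if ops else 0.0


if __name__ == "__main__":
    counts = {"l1_hit": 8, "l3_hit": 1, "dram_fetch": 1}
    print(total_cycles(counts))                 # 8*4 + 45 + 250 = 327
    print(amortised_cycles_per_op(counts, 10))  # 32.7
```

Under these numbers a single DRAM fetch costs as much as 62.5 L1 hits, which is why a handful of misses dominates the amortised figure.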

Bus log

No bus activity yet. Issue an op or run a scenario.

What you're looking at

The top board is four cores, each with four L1 cache lines, over one shared L3. Every line shows its slot, the address it holds, and a MESI letter: M (this core has the only copy, and it is dirty), E (the only copy, clean), S (clean, possibly also held by other cores), I (invalid: the slot is empty or its copy is stale). Issue a read or write from the manual controls, or play a scenario, and the cost panel tallies L1 hits, L3 hits, DRAM misses, and coherence messages using realistic cycle costs. The bus log narrates each operation's outcome.

Start by issuing the same address as a read from core 0, then from core 1: core 0's copy goes to E, then both copies flip to S. Now write that address from core 0 and watch it invalidate core 1's copy and jump to M. The scenario that should surprise you is false sharing, where two cores write different bytes of the same 64-byte line. Nothing is actually shared, yet the coherence-message counter climbs about one per write while the L1 hit rate stays near zero, because each store invalidates the other core's copy of the whole line. That invisible ping-pong is what padding variables onto separate lines exists to kill.

What is MESI cache coherence?

MESI is the baseline protocol that multi-core CPUs use, usually in an extended form such as MOESI or MESIF, to keep the cores' caches consistent. Each cache line carries a state: Modified (this core has the only copy, and it is dirty), Exclusive (this core has the only copy, and it is clean), Shared (a clean copy that other cores may also hold), Invalid (no valid copy here). When a core reads or writes, the protocol exchanges messages that update the states across all caches, so every read sees the most recent write to that line. The simulator above tracks every line's state on every operation.

Two states tend to surprise people. Exclusive exists so that a write to a never-shared line needs no bus traffic at all (E silently upgrades to M). Shared means that other cores may hold the line, not that any of them necessarily does; the coherence cost of the next write depends on what the other cores actually hold.
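
The transitions described above can be sketched as a small state machine for a single address. This is a minimal model, not the simulator's code: the Line class, its method names, and the rule of counting one coherence message whenever an operation finds copies in other cores are all assumptions for illustration, and writebacks are left out.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Line:
    """MESI state of one address in each core's cache (hypothetical model)."""

    def __init__(self, n_cores=4):
        self.states = [State.I] * n_cores
        self.messages = 0  # coherence messages counted so far

    def _other_holders(self, core):
        return [c for c, s in enumerate(self.states)
                if c != core and s is not State.I]

    def read(self, core):
        if self.states[core] is not State.I:
            return  # M, E or S: local hit, no bus traffic
        holders = self._other_holders(core)
        if holders:
            # Another core has the line: all copies become Shared.
            # (A Modified holder would also write back; not modelled here.)
            self.messages += 1
            for c in holders:
                self.states[c] = State.S
            self.states[core] = State.S
        else:
            self.states[core] = State.E  # only copy, clean

    def write(self, core):
        if self.states[core] in (State.M, State.E):
            self.states[core] = State.M  # E -> M is silent: no message
            return
        holders = self._other_holders(core)
        if holders:
            # Assumption: one invalidation message per write that
            # actually finds other copies.
            self.messages += 1
            for c in holders:
                self.states[c] = State.I
        self.states[core] = State.M


if __name__ == "__main__":
    line = Line()
    line.read(0)   # core 0: E
    line.read(1)   # cores 0 and 1: S
    line.write(0)  # core 0: M, core 1: I
    print([s.name for s in line.states], line.messages)  # ['M', 'I', 'I', 'I'] 2
```

A fresh line that is read and then written by the same core goes I, E, M with the message counter still at zero, which is the silent upgrade.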

How false sharing wrecks throughput

MESI works at the granularity of a cache line, which the simulator takes to be 64 bytes. Two threads on different cores writing different variables that happen to live on the same line will invalidate each other on every store. The simulator's false sharing scenario reproduces this: cores 0 and 1 alternately write addresses 0x200 and 0x208, both in the same line. Watch the coherence-message counter climb roughly one per op while the L1 hit count stays stuck: every write turns into a write miss because the other core's previous write invalidated the line.
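
To see why 0x200 and 0x208 collide, the sketch below maps addresses to 64-byte lines and replays an alternating write pattern. It tracks only which core last wrote each line, which is a deliberate simplification; the function names and the padded address 0x240 are assumptions chosen for this example.

```python
LINE_SIZE = 64  # bytes per cache line, as in the scenario above


def line_of(addr, line_size=LINE_SIZE):
    """Index of the cache line that contains this byte address."""
    return addr // line_size


def run_writes(ops, line_size=LINE_SIZE):
    """Replay (core, addr) writes; return (l1_hits, invalidations).

    Simplified model: a write hits only if the same core was the last
    writer of that line; otherwise the previous owner is invalidated.
    The first write to a line is a cold miss and counts as neither.
    """
    owner = {}  # line index -> core that currently holds it Modified
    hits = invalidations = 0
    for core, addr in ops:
        line = line_of(addr, line_size)
        prev = owner.get(line)
        if prev == core:
            hits += 1
        elif prev is not None:
            invalidations += 1
        owner[line] = core
    return hits, invalidations


if __name__ == "__main__":
    same_line = [(0, 0x200), (1, 0x208)] * 4  # both addresses in line 8
    padded = [(0, 0x200), (1, 0x240)] * 4     # lines 8 and 9
    print(run_writes(same_line))  # (0, 7)
    print(run_writes(padded))     # (6, 0)
```

With the two variables on one line, seven of the eight writes invalidate the other core and none hit. Moving the second variable 64 bytes away turns every write after the two cold misses into a hit.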

In production code the fix is to pad or align each per-thread variable so that it sits on its own cache line. The Linux kernel uses ____cacheline_aligned; Java has @Contended; Rust has crossbeam_utils::CachePadded; .NET uses StructLayout with an explicit Size. On Apple silicon the cache line is 128 bytes, so a 64-byte pad still leaves pairs of neighbouring variables on the same line.
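
The 128-byte caveat is easy to check with a little arithmetic. The sketch below lays out per-thread variables a fixed stride apart, starting from a line-aligned base (an assumption), and reports which cache line each one lands on. The helper names are hypothetical.

```python
def line_indices(n_vars, stride, line_size):
    """Cache-line index of each variable when placed `stride` bytes apart.

    Assumes the first variable starts at a line-aligned address.
    """
    return [(i * stride) // line_size for i in range(n_vars)]


def shares_a_line(n_vars, stride, line_size):
    """True if at least two variables land on the same cache line."""
    idx = line_indices(n_vars, stride, line_size)
    return len(set(idx)) < len(idx)


if __name__ == "__main__":
    print(line_indices(4, 64, 64))    # [0, 1, 2, 3]
    print(line_indices(4, 64, 128))   # [0, 0, 1, 1]
    print(shares_a_line(4, 128, 128))  # False
```

A 64-byte stride isolates every variable on 64-byte lines but pairs them up on 128-byte lines, so the pad has to match the line size of the target chip.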

What the simulator simplifies

  • One cache level per core. Real chips have L1 (split I/D) and L2 per core, plus a shared L3. Here, L1 stands in for the whole per-core hierarchy.
  • Round-robin replacement. Real caches use pseudo-LRU or RRIP. The simpler policy makes the eviction visible in the four lines we have to work with.
  • Tag-only addressing. The simulator treats each address as a line index. Real caches split addresses into tag, set index, and byte offset (covered in the deep dive).
  • No prefetching. Real CPUs aggressively pull lines ahead of demand. Adding it would obscure the MESI mechanics, which are the point.
  • No NUMA, no MOESI/MESIF. Plain MESI with one socket is enough to demonstrate the protocol; the deep dive covers the extensions.
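
Two of the simplifications above can be made concrete with a short sketch: how a real cache would split an address into tag, set index, and byte offset, and how round-robin replacement behaves with four slots. The 64-set geometry and the names split_address and RoundRobinL1 are assumptions for illustration, not values or code from the simulator.

```python
def split_address(addr, line_size=64, n_sets=64):
    """Split a byte address into (tag, set index, byte offset).

    Both line_size and n_sets must be powers of two; 64 sets is a
    hypothetical geometry chosen for the example.
    """
    offset_bits = line_size.bit_length() - 1
    index_bits = n_sets.bit_length() - 1
    offset = addr & (line_size - 1)
    index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


class RoundRobinL1:
    """Tiny fully associative cache that evicts slots in rotation."""

    def __init__(self, n_slots=4):
        self.slots = [None] * n_slots
        self.next_victim = 0  # slot to overwrite on the next miss

    def access(self, line):
        """Return True on a hit; on a miss, fill the next slot in turn."""
        if line in self.slots:
            return True
        self.slots[self.next_victim] = line
        self.next_victim = (self.next_victim + 1) % len(self.slots)
        return False


if __name__ == "__main__":
    print(split_address(0x200))  # (0, 8, 0)
    print(split_address(0x208))  # (0, 8, 8): same line, different byte
    cache = RoundRobinL1()
    for line in (0, 1, 2, 3, 4):
        cache.access(line)
    print(cache.slots)           # [4, 1, 2, 3]
```

Round-robin ignores how recently a line was used: in the run above, line 0 is evicted by the fifth access even if it had just been hit, which is the behaviour pseudo-LRU is designed to avoid.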

FAQ

Why does writing on Exclusive not show any coherence message?
Exclusive means no other core holds the line, so MESI knows there's no one to notify. The write silently upgrades the state to Modified with zero bus traffic. This is the optimisation that makes single-threaded workloads cheap on multi-core hardware.

Why is my L1 hit rate so low in the false-sharing scenario?
Every write on a Shared or Invalid line is a coherence event. After the other core writes, your line is Invalid; you have to refetch. Hit rate stays near 0 because consecutive writes on the same line by different cores cancel each other's caching. Padding the variables onto separate lines is the architectural fix.

Why does the producer-consumer scenario stay at high hit rate?
After core 0 writes the line and the readers fetch it, all four cores hold the line Shared. Subsequent reads hit L1 with no bus traffic. The next write from core 0 invalidates everyone, so the cycle repeats, but each invalidation amortises across many reads. This is why batched producer-consumer beats fine-grained shared updates.
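
The amortisation can be put in numbers with a back-of-the-envelope sketch. It assumes a pattern that is not taken from the simulator: one write by the producer, then each of three readers reads the line a fixed number of times, with the write and each reader's first read counted as non-hits. The function name is hypothetical.

```python
def l1_hit_rate(reads_per_reader, n_readers=3):
    """L1 hit rate for one write followed by repeated reads (assumed pattern)."""
    if reads_per_reader < 1:
        raise ValueError("each reader must read at least once")
    ops = 1 + n_readers * reads_per_reader     # one write plus all reads
    hits = n_readers * (reads_per_reader - 1)  # first read per reader misses
    return hits / ops


if __name__ == "__main__":
    for r in (1, 10, 100):
        print(r, round(l1_hit_rate(r), 3))
```

With one read per write the hit rate is zero, the same as false sharing. At ten reads per reader it is 27 hits out of 31 operations, and it keeps approaching 100% as the batch grows.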
