How Kafka stores its log on disk.
A topic is a directory of append-only files, replicated across brokers, indexed by offset. That is the entire system. The mental model that confuses you stops confusing you the moment you accept it.
Kafka is a log, written to disk, replicated
Not a queue. Not a database. A directory of files.
Every confusing thing about Kafka stops being confusing the moment you stop thinking of it as a message broker and start thinking of it as a distributed, append-only file. Producers append. Consumers read by byte offset. Brokers replicate. Old bytes get deleted on a schedule or rewritten by compaction. The wire protocol, the durability model, the consumer-group semantics — all of it falls out of that one primitive.
Jay Kreps made this case explicitly in 2013 in "The Log: What every software engineer should know about real-time data's unifying abstraction" on the LinkedIn engineering blog. The thesis: the log is the simplest possible storage abstraction, and almost every interesting distributed system — replicated databases, stream processors, event-sourced services — is implementable as N consumers tailing a shared log. Kafka was the productisation of that idea.
The follow-on companies built around the same insight: Confluent (Kreps's company, the commercial Kafka distribution), Redpanda (a C++ rewrite of the broker protocol, no JVM, no ZooKeeper), WarpStream (Kafka semantics over S3 — no local disks at all), Amazon MSK (Kafka as a managed AWS service). They all expose the same API because the API is the log. What differs is the storage substrate.
The file layout
A topic is a directory. Each partition is a subdirectory. Each segment is three files.
SSH into a Kafka broker, look at /var/lib/kafka/data/, and the entire storage model is sitting there in front of you. A topic orders with three partitions creates three directories: orders-0/, orders-1/, orders-2/. Inside each, the partition's log is split into segment files, each named by its base offset:
orders-0/ 00000000000000000000.log 00000000000000000000.index 00000000000000000000.timeindex 00000000000000487253.log 00000000000000487253.index 00000000000000487253.timeindex ...
The .log file holds the records themselves — a length-prefixed framing of (timestamp, key, value, headers), with the whole record batch optionally compressed (LZ4, Snappy, Zstd). The .index is a sparse map from message offset to byte position in the log file, written every ~4096 bytes by default (log.index.interval.bytes). The .timeindex is a parallel sparse map from timestamp to offset, used for time-based reads (consumer.offsetsForTimes()).
The active segment is the one being written to. When it hits log.segment.bytes (default 1 GB) or log.segment.ms (default 7 days) it gets sealed, its base offset goes on a new file, and writes flow into the new active segment. Sealed segments are immutable — they can be read, deleted, or rewritten by compaction, but never appended to. This immutability is what makes the consumer's job trivial: an offset always points at the same byte forever.
Reads exploit two facts. First, sequential disk reads on modern NVMe push 3+ GB/s, and the page cache makes the recent tail of every partition essentially free. Second, the sparse offset index means a random fetch (give me offset 487260) is one binary search in the index file (a few KB), one seek (within the same 4 KB sector of the log file), and one sequential read. Nothing here needs CPU.
Replication and the in-sync replica set
Leader writes. Followers fetch. ISR is who's keeping up.
Every partition has a leader replica and N follower replicas, one on each of N brokers (N = replication factor, typically 3). All produce and consume traffic goes to the leader. Followers run a fetch loop: every few hundred milliseconds they ask the leader "what do you have past offset X?", pull the bytes, append to their own segment file. This is the same wire protocol consumers use, with one extra flag.
The in-sync replica set (ISR) is the leader plus every follower that has fetched within the last replica.lag.time.max.ms (default 30 seconds). Fall behind by more than that and you get evicted from the ISR. Catch up and you rejoin. The ISR is metadata, maintained by the controller, and the entire durability model is built on it.
A produce request specifies acks. acks=0 is fire-and-forget — the producer doesn't wait for any response. Used essentially only for telemetry. acks=1 waits for the leader to write to its local log and ack. Sub-millisecond latency on a healthy cluster. The risk: leader crashes between ack and replicating, and the message vanishes. acks=all waits for every member of the ISR to ack. That is the durability you want for revenue-impacting writes. Cost: latency goes from ~1 ms to ~5-20 ms intra-AZ, ~50-200 ms cross-AZ.
The knob that matters is min.insync.replicas (typically 2 on RF=3 clusters). With acks=all and min.insync.replicas=2, a produce fails if fewer than 2 replicas are in the ISR. That's the right behaviour: you'd rather refuse the write than ack it to a single replica that can crash and lose it.
When a leader dies, the controller (one elected broker, or now KRaft) picks a new leader from the ISR. Any member of the ISR has every committed record, by definition, so no data is lost. If the ISR is empty — every follower fell behind, then the leader died — Kafka asks you to choose. unclean.leader.election.enable=false (the default since 0.11) refuses to elect a non-ISR replica; the partition goes offline until an ISR member returns. CP behaviour. Flipping it to true elects a lagged replica anyway, accepting silent data loss but staying up. AP behaviour. The right choice depends entirely on what the topic carries.
Retention: by time, by size, or by compaction
Three policies. Two delete old bytes. One rewrites them.
Logs would grow forever without a reclamation strategy. Kafka offers three, configured per topic via cleanup.policy.
Time-based retention (cleanup.policy=delete, the default) deletes any sealed segment older than log.retention.hours (default 168, i.e. 7 days). A background thread — the log cleaner — walks each partition every log.retention.check.interval.ms (default 5 minutes), checks the last-modified timestamp on each sealed segment, and unlinks the file. Deletion is cheap: it's whole-file unlink(), no rewriting. The active segment is never deleted, which is why a low-traffic partition can keep a multi-day-old message indefinitely.
Size-based retention uses log.retention.bytes, which caps the partition's total size. Useful when ingest rate is unpredictable and you'd rather drop old data than fill the disk. Most production clusters set both: keep 7 days or 100 GB per partition, whichever comes first.
Log compaction (cleanup.policy=compact) is the surprising one. Instead of deleting old segments by age, the log cleaner rewrites them, keeping only the latest record for each key. A compacted segment is a snapshot of the latest value per key across all of history — no event log, just current state. The active segment stays append-only; compaction runs on sealed segments in the background.
Compaction is what makes Kafka usable as a state store, not just an event bus. The __consumer_offsets internal topic is compacted: each (group, topic, partition) key has its latest committed offset preserved, older offsets dropped. Kafka Connect's offsets topic, Kafka Streams's changelog topics, and any "rebuild state from topic" pattern in event-sourced systems all rely on this. Kreps's 2013 article called this out as the move that promoted Kafka from a log to a database substrate — without compaction, state-rebuild-from-topic is unbounded; with it, the topic is a bounded snapshot you can replay in linear time.
You can combine: cleanup.policy=compact,delete compacts AND enforces a max retention. Useful when state should be compacted but also forgotten eventually (GDPR, expired sessions).
Zero-copy, page cache, and 600 MB/s per broker
The broker barely touches the bytes.
Kafka's throughput is the consequence of refusing to look at message bytes. A naive design copies every record from disk to a JVM byte array, deserialises it, re-serialises it, writes it to a socket buffer. Four copies, two allocations, one full GC pause per minute. That design tops out around 50 MB/s per broker and falls over under load.
The Kafka design uses the sendfile(2) syscall. Consumer fetches offset 487260 from broker. Broker opens the corresponding segment file, computes the byte range from the offset index, and hands the file descriptor + range to sendfile(). The kernel reads the bytes from page cache (which is essentially always hot for the tail of a recent log) and DMAs them straight to the NIC. The bytes never enter userspace, never enter the JVM heap, never allocate a Java object. The producer side is symmetric — bytes arrive on the socket, get appended to the active segment, and live in the page cache for the next consumer's sendfile() a few hundred microseconds later.
The result is that broker throughput is bottlenecked by disk and NIC, not by CPU. A single broker on commodity hardware sustains 600+ MB/s on a 10 GbE link with ~5% CPU utilisation. LinkedIn has published benchmarks showing 2 million writes/second across three brokers on three cheap machines, back in 2014. Modern clusters at LinkedIn, Uber, and Netflix process trillions of messages per day. The architecture has not fundamentally changed.
The page cache is the unsung hero. Linux happily uses every byte of unused RAM as a disk cache, and a Kafka broker with 64 GB of RAM and 4 GB of JVM heap has 60 GB of page cache, which is enough to hold the recent tail of every active partition. Consumer lag stays in cache. Catching up from yesterday hits the SSD. Replaying from a week ago hits HDDs (if you have a tiered storage tier, like Confluent's S3 backend or AWS MSK's tiered storage). The cost gradient is built into the system.
KRaft and the post-ZooKeeper era
Kafka grew up and ate its own dog food.
For its first decade, Kafka leaned on ZooKeeper for cluster metadata: topic configs, partition assignments, broker membership, ISR state, controller election. Every operational decision a Kafka cluster made (who's the leader of partition 7? is broker 4 alive? what's the ISR of partition 12?) went through ZooKeeper. This was a 2010-era choice — distributed consensus was hard, ZooKeeper was the only mature option, and running it as a separate cluster seemed fine.
It was not fine. ZooKeeper became the operational headache of every production Kafka deployment. Two systems to monitor. Two failure modes. Two upgrade cycles. Two security boundaries. Worse: ZooKeeper's data model doesn't scale to millions of partitions, because every partition's state lives as a znode and ZooKeeper's getChildren() on a path with 500k znodes takes seconds. Large Kafka deployments hit this wall around 200k partitions per cluster, and the workaround was always "run more, smaller clusters", which is its own operational nightmare.
KRaft (Kafka Raft) is Kafka's own Raft implementation, introduced in Kafka 2.8 (2021) and production-ready in 3.3 (2022), with ZooKeeper removed entirely in 4.0 (2025). The metadata that used to live in ZooKeeper now lives in a Kafka topic — specifically the __cluster_metadata topic — replicated by a Raft quorum of "controller" brokers. The metadata IS a Kafka log, written by the same code paths as user data. Kafka eating its own substrate.
Operationally, KRaft replaces "operate two systems" with "operate one." Scaling-wise, KRaft handles millions of partitions per cluster (Confluent's published numbers: 2M+ on a single cluster). Latency-wise, controller failover dropped from ~tens-of-seconds (ZooKeeper session timeouts, znode reloads, controller bootstrap) to ~hundreds-of-milliseconds (Raft leader election). The migration story dominated Kafka's architectural conversation for the early 2020s and is now mostly complete — KRaft is the default for every new cluster, and ZooKeeper-backed clusters are end-of-life in 4.0.
The historical irony is sharp. Kafka's original design — the log as the unifying abstraction — was the right answer all along, including for Kafka's own metadata. It took ten years to apply it to itself.
The shape of a Kafka cluster on disk is the shape of the design. Directories of segment files, replicated by a fetch loop, reclaimed by age or compaction, served via sendfile from the page cache, coordinated by a Raft quorum that is itself a log. Everything else in the product — exactly-once semantics, transactions, Connect, Streams, ksqlDB, tiered storage — is a feature layered on that primitive. The companion Kafka overview covers producers, consumers, and rebalancing. This page is the storage layer. Pair them.