13 min read · Guide · Distributed systems
How it works · Distributed systems · Streaming

How Kafka works: a log, not a queue

A queue that doesn't lose its memory. A log file that scales horizontally. A pipe whose readers move at their own speed. Kafka is a small idea — the partitioned, replayable log — given enormous engineering.

Parts01 – 08 Interactive3 partitions · 2 consumers PrereqLogs / TCP

What is Kafka?

Underneath, it is an append-only log that many consumers can read.

Apache Kafka is a distributed, partitioned, replicated log. Jay Kreps and the LinkedIn team released it in 2011; the design was inspired by database commit logs. Kafka shines at high-throughput pub/sub (Netflix, Uber, LinkedIn each handle trillions of events per day) because it does sequential disk writes, lets the OS page cache do most of the read-hot work, and uses sendfile() to avoid copying data through user space.

Strip away the marketing and Kafka is one idea: an append-only, immutable log. Producers append to the end; consumers read forward; nothing ever gets edited or deleted (until retention expires). That single shape — a sequence of bytes with an ever-growing offset — is what makes the rest possible.

A queue forgets a message once a consumer reads it. Kafka does not. Two consumers can read the same message at independent speeds, weeks apart, and a third one added next month can replay the entire history from offset zero. The log is the source of truth; the consumer's position is just a number. That one difference drives most of the Kafka vs RabbitMQ decision.

append-only

Writes are sequential.

Producers always write to the end. No edits. No deletes (except by retention or compaction). This is what lets Kafka match the throughput of the underlying disk — sequential writes are an order of magnitude faster than random ones.

offset-addressed

Position, not time.

Every record gets a monotonic 64-bit offset within its partition. A consumer "position" is just a saved offset. Replay = seek to N and read forward. Restart = same. No "what was the last message I processed at 3:42pm?" — it's just an integer.

durable

Disks survive reboots.

Records hit disk before the producer's ack returns. Replicated to followers. A broker can lose power and come back, and every record persists at its original offset. The log is a contract.


Topics fan out into partitions

Partitions are how a single topic scales across brokers.

A single log on a single broker is fast — but it is one machine's throughput. Kafka scales by sharding. A topic is a logical name (orders); underneath, it is split into a fixed number of partitions. Each partition is its own append-only log, lives on its own broker (the leader), and is replicated to followers.

The shard key is the producer's choice. With no key, Kafka round-robins records across partitions. With a key, the producer hashes it (default murmur2 mod N) and that key always lands on the same partition. The implication: ordering is per-partition, not per-topic. If you need user X's events in order, key them by user ID and they all funnel to one partition. Order is preserved.

Pick partition count carefully

Partition count caps consumer-group parallelism — at most one consumer per partition per group. Too few partitions and you can't scale consumers. Too many and you pay per-partition overhead and slow rebalances. Aim for partitions = peak consumer count × 2–3; resizing later is non-trivial.


A worked example: one topic, three partitions, two consumers

How partitions get divided up across a consumer group.

Below: topic orders with three partitions and a consumer group billing with two consumers. Start the producer and watch records hash to partitions by key. Take a consumer offline to trigger rebalance — ownership shifts, the survivor catches up the lag, the order each consumer sees stays per-partition.

topic orders 0 records produced
Producer
hash(key) mod 3 → partition · paused
orders.0 → consumer-1 lag 0
— empty —
orders.1 → consumer-1 lag 0
— empty —
orders.2 → consumer-2 lag 0
— empty —
Consumer group · billing
consumer-1
owns: orders.0 · orders.1
orders.0: offset 0orders.1: offset 0
consumer-2
owns: orders.2
orders.2: offset 0

How Kafka replicates partitions for durability

Leaders, followers, and the in-sync replica set (ISR).

Each partition has a leader broker (the only one that handles reads and writes for that partition) and a configurable number of followers that pull from the leader and stay caught up. The set of followers fully caught up at any moment is the in-sync replica set — the ISR.

A producer can ask for one of three durability levels via the acks setting: acks=0 (fire and forget — fast and unsafe), acks=1 (leader has it — survives leader crash only if a replica caught up first), or acks=all (every member of the ISR has it — survives any broker loss inside the ISR). Production typically uses acks=all with min.insync.replicas=2: if the ISR shrinks to one, refuse the write rather than risk silent loss.

When a leader fails, a controller picks the new leader from the ISR — never from a stale follower. Picking from outside the ISR is "unclean leader election" and means accepting the silent loss of records. Disable it.


Consumer groups, offsets, and rebalances

How a group splits the work and tracks where each consumer is.

A consumer group is a label. Every consumer with the same group.id is part of one cooperative; the group divides the topic's partitions among its members so each partition is owned by exactly one consumer at a time. Two consumers in different groups read independently and don't step on each other.

Each consumer commits its progress as offsets — a per-partition position kept in a special internal topic called __consumer_offsets. Crash, restart, redeploy: the new consumer reads its committed offset and resumes from exactly there.

When the group changes — a consumer joins, leaves, or dies — the group coordinator triggers a rebalance: ownership of every partition is reassigned. Old protocol versions stopped the world for the duration; the modern cooperative sticky assignor (KIP-429) keeps unaffected consumers running. Rebalance is the single most common source of "why is my consumer stuck?".


Kafka's three delivery guarantees

At-most-once, at-least-once, and exactly-once, and what each costs.

Distributed message systems can theoretically offer three semantics. Kafka does all three — picked by configuration, not by magic.

  1. at-most-once

    Lose, don't duplicate

    Producer acks=0; consumer commits offset before processing. If anything fails, the record is gone. Almost never what you want; sometimes acceptable for metrics or telemetry.

  2. at-least-once

    Duplicate, don't lose

    Producer acks=all; consumer commits offset only after processing. Crashes during processing mean replay; downstream code must be idempotent. The default for most production pipelines.

  3. exactly-once

    Read-process-write, transactionally

    Real-but-narrow: Kafka transactions atomically commit (a) the offset of records consumed and (b) the records produced. Holds end-to-end only for streams that consume from Kafka and produce to Kafka. The moment you write to a database or send an email, you're back to at-least-once + idempotency.


Why Kafka is fast

Sequential disk writes, the page cache, and zero-copy sendfile.

Kafka claims throughput numbers that look impossible — millions of messages per second per broker, on commodity hardware. The trick is that none of the work is clever. It is well-aligned with what hardware does fastest.

Producer batching

Many records per request.

Producers buffer records (per-partition, up to linger.ms) and send them in batches. Compression then runs over the whole batch — much higher ratio than per-record. One TCP round trip carries thousands of records.

Zero-copy fetches

Disk to socket, no copies.

When a consumer fetches, the broker uses sendfile() — a Linux syscall that transfers bytes directly from page cache to the network socket without copying through user space. Pair with the page cache (recent records still in RAM) and broker CPU is barely involved.


Where Kafka gets hard in production

The operational problems teams hit once Kafka is under real load.

A short, depressingly consistent list of how production Kafka deployments end up in incident review.

01 · HOT PARTITION

One key, all the keys.

Most records share a key (e.g., a single tenant has 90% of traffic). That partition takes 90% of the load while others sleep. Fix: include a secondary key, or use no key for non-ordered records. Hash partitioning with skewed inputs is just slow.

02 · POISON PILL

One bad record stalls a partition.

A single record causes a deserialization exception. The consumer can't commit past it; retrying forever. The partition's lag grows without bound. Fix: a dead-letter queue with skip-and-record logic, or schema validation at the producer.

03 · REBALANCE STORMS

Hot deploy = paused group.

Rolling deploys cause sequential rebalances; each pauses the group. With the old eager protocol, a deploy of N consumers freezes processing N times. Use the modern cooperative-sticky assignor and tune session.timeout.ms.


How a partition log is stored as segment files

A partition isn't a single file — that would make retention painful. It's a directory of segments, each a fixed-size append-only file that gets rolled when it reaches a threshold (default 1 GiB or 7 days). Old segments are deleted whole; the live one is appended to. Two sidecar indexes make any offset findable in O(log n).

# Per-partition directory on disk
/var/kafka/orders-7/
├── 00000000000000000000.log         # records, base offset 0
├── 00000000000000000000.index       # offset → byte position
├── 00000000000000000000.timeindex   # timestamp → offset
├── 00000000000000123456.log         # next segment, base offset 123456
├── 00000000000000123456.index
├── 00000000000000123456.timeindex
├── 00000000000000789012.log         # ACTIVE segment (being appended)
├── 00000000000000789012.index
├── 00000000000000789012.timeindex
└── leader-epoch-checkpoint          # epoch → start offset (Part 10)

The filename is the base offset — the offset of the first record inside. To find offset 500,000 the broker binary-searches the directory listing, lands in …123456.log, then does a binary search of …123456.index for the largest indexed offset ≤ 500,000, then a linear scan from that byte position. Indexes are sparse — by default one entry per 4 KiB of records — which keeps them fitting comfortably in page cache.

Record batch

Many records, one header.

Producers don't append individual records. They build a batch — up to batch.size bytes or linger.ms milliseconds of accumulated records, gzip/snappy/lz4/zstd compressed as a unit, with a single CRC and timestamp. The broker stores the batch unmodified; it never decompresses to write to disk. Compression amortises the per-record overhead and is why throughput scales with batch size.

Zero-copy fetch

Disk → socket, no userspace touch.

When a consumer fetches, the broker calls sendfile(2) to splice page cache directly to the socket. The bytes never enter Kafka's JVM heap. This is why Kafka can saturate a 25 Gbit NIC with single-digit CPU. It also means brokers can't peek at messages — there's no decompression, no transformation, no schema check. That stays at the producer or consumer.

Compaction · the other retention

Set cleanup.policy=compact and Kafka periodically rewrites old segments, keeping only the last record per key. Tombstones (null values) actually delete a key. The compacted log becomes a change-data capture stream — every key's current value is somewhere in the log, even if the original write was years ago. This is how Kafka backs replicated state stores in stream-processing frameworks like Kafka Streams, Flink, and ksqlDB.


ISR, high water mark, and leader epoch explained

Every partition has a leader and zero or more followers. Producers write only to the leader; followers fetch from it like any other consumer. The replication protocol revolves around three offsets: the leader's log end offset (LEO), each follower's LEO, and the high water mark (HW) — the highest offset acknowledged by every member of the in-sync replica set. Consumers can only read up to the HW; anything past it might still be lost.

  1. 01

    In-Sync Replicas (ISR): the trust circle

    A follower is "in-sync" if it has fetched all records up to the leader's LEO within replica.lag.time.max.ms (default 30 seconds). Slow replicas are evicted from the ISR; once they catch up, they re-enter. Producers configured with acks=all wait for every member of the ISR to acknowledge — durability scales with ISR size, not replica count.

  2. 02

    High Water Mark: the visible horizon

    HW = min(LEO across all ISR). It only advances when every in-sync replica confirms. Consumers see only records ≤ HW — this is what makes Kafka durable: nothing visible was unreplicated. The price is latency: HW lags the leader's LEO by exactly one round-trip to the slowest healthy follower.

  3. 03

    Leader epoch: what stops ghost records

    When a leader fails, an ISR follower becomes the new leader and increments its epoch. Old leaders that come back must truncate any records they had past the new leader's HW at the moment of failover. Without epoch tracking (pre-2018, KIP-101), a deposed leader could have records that the new leader never knew about — silent data loss. Modern Kafka stores leader-epoch-checkpoint per partition and uses it to refuse stale writes.

  4. 04

    Unclean leader election: the ack=all loophole

    If the entire ISR dies, Kafka has two choices: stay unavailable until at least one ISR member returns, or elect a non-ISR replica with stale data. unclean.leader.election.enable=false is the default and the right answer for systems that need correctness. Set it to true only when availability outweighs the chance of losing the records past the last common offset.

acks setting Producer waits for Loss possible if… Throughput
acks=0Nothing. Fire and forget.Network drop, broker crash, almost anything.Highest
acks=1Leader writes to its log.Leader crashes before replicating.High
acks=all (-1)Every ISR member acknowledges.Only if entire ISR + producer in-flight all die.Lower (set min.insync.replicas≥2)
A field-tested config

For durability-critical pipelines: acks=all, replication.factor=3, min.insync.replicas=2, unclean.leader.election.enable=false, enable.idempotence=true. This is the LinkedIn / Confluent recommended baseline. It tolerates one broker failure with zero data loss and gives at-most-once duplicate avoidance per producer instance.


With KRaft, cluster metadata lives in Kafka itself

For its first decade Kafka delegated cluster metadata — broker membership, topic config, partition assignments — to ZooKeeper, an external Paxos-backed coordination service. ZooKeeper worked, but it was a separate cluster to deploy, monitor, secure, and scale. Each topic created hundreds of znodes; clusters with tens of thousands of partitions could see ten-minute controller failovers as ZooKeeper read every znode back into memory.

KRaft (KIP-500, GA in 3.3) replaces ZooKeeper with an internal Raft-based metadata log. Three to five brokers are designated controllers; they form a Raft quorum, replicate the metadata log among themselves, and serve metadata requests to the rest of the cluster. The controller log is itself a compacted Kafka topic — same segments, same replication, same indexes as Part 09. The system is now self-hosting.

What got faster

Controller failover, partition limits.

Recovery time scales with the size of the change log, not the size of the metadata. Failover that took minutes now takes single-digit seconds. The practical partition ceiling per cluster moved from ~200,000 to millions, because controllers no longer have to stream the full metadata back from ZooKeeper on takeover — they already have it on local disk.

What got simpler

One system to deploy.

No external ZooKeeper ensemble; no zookeeper.connect string; no separate JVM to upgrade. Brokers and controllers can run on the same nodes (combined mode for small clusters) or be split (recommended above ~5 brokers). The kafka-storage tool generates a cluster ID and formats the metadata log on first start.

Tiered storage · the other modern shift

KIP-405 (GA in 3.6) lets brokers offload old segments to S3-compatible object storage while keeping the active segment on local SSD. Retention can stretch to months or years for the price of object storage. Consumers transparently fetch from the remote tier when they read past the local horizon. The entire log file format from Part 09 is reused — only the storage backend changes.

KRaft: how Kafka dropped ZooKeeper

The biggest architectural change in Kafka's history.

The legacy. For its first decade Kafka stored cluster metadata (topics, partitions, replicas, ACLs, configs) in ZooKeeper. Two systems to operate, two failure domains, two security models. ZooKeeper became a known scaling bottleneck — large clusters hit limits at ~200,000 partitions because of metadata-replication overhead.

KRaft (KIP-500, KIP-833). Apache Kafka 3.3 (2022) made KRaft mode production-ready; 4.0 (2025) removes ZooKeeper entirely. Kafka now stores metadata in an internal Kafka topic (__cluster_metadata) replicated via Raft consensus. Same Raft algorithm as etcd, CockroachDB, or Consul, embedded in Kafka.

What it changes operationally. Bootstrap is faster (one process to start instead of two), failover is faster (~100ms instead of ~10s), and large-cluster scaling no longer hits ZooKeeper's 200k-partition wall. Confluent's published benchmarks show million-partition clusters working comfortably.

Migration path. Existing ZooKeeper-mode clusters can migrate to KRaft via a documented in-place process. Most managed services (Confluent Cloud, AWS MSK, Aiven) had completed migration by 2024.



A closing note

Kafka is one of the rare pieces of infrastructure that gets simpler the more carefully you read it. The whole system is a partitioned, replicated, append-only log with cooperative readers. Producers append. Followers replicate. Consumers track an integer. Every operational concept — replication, lag, rebalance, exactly-once — is one of those four primitives in a different role. Once you see that, the rest is just tuning knobs around the log.

Read
further.

Related Async architecture High water mark In-Sync Replicas
Found this useful?