19 / 20

Topics / 19

Streaming systems

Batch jobs run when scheduled and emit one answer. Streaming jobs run forever and emit a sequence of answers, each one slightly better than the last. The shift sounds like a tooling swap, but it is a different physics. Time stops being a sort key and becomes the input. The answer at minute 9 may need revising at minute 11 because a packet arrived late. Exactly-once stops being a guarantee about message delivery and becomes a guarantee about what survives a checkpoint. This chapter walks through that physics: event time vs processing time, watermarks, the four window shapes, exactly-once as a property of state, and the architecture choices Flink, Kafka Streams, and Spark Structured Streaming each made when they sat down to ship.

Why streaming is not just fast batch

The clean way to think about batch is "the data sits still, the query moves". You point a query at a finite table; it scans, aggregates, returns. The world is paused while the job runs. The clean way to think about streaming is "the query sits still, the data moves". The query is a long-running operator graph; events arrive forever; the answer is whatever the operators currently hold.

That sounds like a tooling difference until you write your first non-trivial stream job. Then the question "what is the answer right now" stops having a clean meaning. The job has seen events with event timestamps between 10:00 and 10:05. Two more late events with timestamps at 09:58 arrive at 10:07. Was the answer at 10:05 wrong? It was correct given what had arrived. It is also wrong now. Streaming forces you to answer: are emitted results provisional or final, and if provisional, how do consumers revise?

Batch lets you avoid that question by waiting until the input is closed. Streaming never closes. Every concept that follows (watermarks, windows, triggers, exactly-once, state) exists to give you tools for that one decision.

Event time vs processing time

Every event in a stream carries two timestamps. The event time is when the event actually happened in the world: when the user clicked the button, when the sensor reading was taken, when the order was placed. The processing time is when the streaming engine got to see it.

These two clocks diverge for predictable reasons. A user clicks a button at 10:00:00.000 from a phone with intermittent cellular signal. The click sits in the device's outbox for eight minutes. It hits the API at 10:08:00.000, gets enqueued in Kafka at 10:08:00.040, and a Flink job consumes it at 10:08:00.120. The event time is 10:00:00.000; the processing time is 10:08:00.120; the eight-minute gap is the lag.

In a well-behaved pipeline lag is bounded: a few seconds, maybe a few minutes during incidents. In a pipeline that ingests from mobile clients, IoT devices, or partner systems, lag is unbounded in principle and routinely measured in hours. Choosing which clock you compute on is the single biggest design decision in a streaming job.

processing-time aggregation:    answers fast,  drifts under load,  wrong on replays
event-time aggregation:         answers slow,  stable across replays,  correct in the limit
ingest-time aggregation:        the bad middle  — uses one clock and pretends it's the other

Almost every business question is in event time. "How many users signed up today" means "today in their local day", not "today as seen by the cluster's CPU clock". The fast, tempting answer in processing time is almost always wrong. It is just wrong in ways that are easy to miss when the system is healthy and devastating when it is not. The Dataflow paper (Akidau et al., 2015) calls this out as the lesson learned from a decade of MillWheel and FlumeJava at Google.

Watermarks

Once you commit to event time, you need a way to answer the question "have I seen all the events for this window yet, or should I keep waiting". The window from 10:00 to 10:05 cannot emit a final answer at 10:05:00 in processing time; an event with event time 10:04 might arrive at 10:07.

A watermark is the engine's claim about progress in event time. "Watermark at 10:06" means the engine believes no event with event time earlier than 10:06 will arrive from now on. When the watermark crosses 10:05, the 10:00-10:05 window can fire.

The watermark is a heuristic. The engine cannot know what is still in flight, so it can only guess. Watermark generators fall into three families:

Bounded-out-of-orderness. Trust the source within a fixed lag. "Watermark = max(event time seen) − 30s". Simple, predictable, wrong on outliers. Good default for pipelines whose source skew is well-understood.

Heuristic on histogram. Compute the 99th percentile of arrival-to-event lag over a sliding window and use that as the offset. Adapts to source weather; harder to reason about because the watermark itself drifts.

Source-provided. Some sources (Kafka with per-partition idle detection, Kinesis with sequence numbers, Pulsar with topic-level checkpoints) can assert progress directly: "this partition has emitted everything up to offset N, no earlier records will come". Most accurate when available; requires source cooperation.

The late event problem. No watermark strategy eliminates late arrivals. It only chooses which late arrivals to ignore. Flink lets you set allowedLateness per window so a window stays open for that long after firing, accepting updates. Beam's triggers extend the same idea: a window can fire early, fire on-time, or fire late, each producing a pane. The consumer has to deal with revisions.

The four windows

Windowing is how you carve infinite streams into bounded groups for aggregation. Four shapes cover almost everything that ships.

Tumbling. Fixed size, no overlap. The 10:00-10:05 window, then 10:05-10:10, then 10:10-10:15. Each event belongs to exactly one window. The default for periodic rollups like "requests per minute" or "revenue per hour".

Sliding. Fixed size, overlapping by a stride. A 5-minute window every 1 minute gives you a continuously updated 5-minute moving total. Each event belongs to multiple windows (in this case five). Cost scales with size/stride. A 1-hour window sliding every second is 3600 windows per event and almost always a sign you wanted a session window or a continuous query instead.

Session. Gap-driven. A session ends when no event arrives for some idle gap (say, 30 minutes). Sessions have no fixed boundary in advance; they are computed from the data. Ideal for user-behaviour analytics ("how long did this user stay on the site"), fraud rings, login sessions. Hard to get right when events arrive out of order, because a late event can merge two sessions that were already finalized.

Global. One window, all events, forever. Useless on its own, since you would never get a result, but combined with custom triggers it expresses things like "emit a running count every 1000 events" or "emit only when the value crosses a threshold". The escape hatch for everything the other three do not cover.

// Flink, Java
input
  .keyBy(e -> e.userId)
  .window(EventTimeSessionWindows.withGap(Duration.ofMinutes(30)))
  .reduce(SessionStats::merge);

// Kafka Streams, Java
input
  .groupByKey()
  .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
  .aggregate(SessionStats::new, SessionStats::add, SessionStats::merge);

// Spark Structured Streaming, Scala
input
  .withWatermark("eventTime", "30 minutes")
  .groupBy(session_window($"eventTime", "30 minutes"), $"userId")
  .agg(...);

The API surface is similar across the three engines. The semantics around session merging under late events, though, diverge in ways that bite production. Flink merges sessions correctly even with late events landing between two already-fired windows. Kafka Streams (through 3.x) requires you to model session merging manually if you want it. Spark Structured Streaming added session windows in 3.2 with explicit watermark-based eviction.

Triggers, panes, and refinements

A window says "this is the group of events I am aggregating". A trigger says "this is when to emit a result for that group". The Beam / Dataflow model separates the two cleanly. Flink folds triggers into the windowing API. Kafka Streams uses time-based emission with suppression operators. Spark Structured Streaming uses watermarks and the query's output mode.

A pane is one emitted result for one window. A window can produce many panes over its lifetime: an early pane when 50% of expected events have arrived, an on-time pane when the watermark passes, late panes as stragglers come in. The downstream consumer can either replace (accumulate-and-retract: subtract the old, add the new) or append (accumulating: just emit the new total). The right choice depends on whether your sink can express updates.

Sinks that can naturally express updates: keyed KV stores, upsert-capable databases, compacted Kafka topics. Sinks that can only append: append-only logs, change feeds, analytics tables with time-partitioning. If your sink is append-only, late panes turn into "correction" records the consumer has to know how to apply. This is where most late-arriving-event logic gets ugly in practice.

State, the real cost of stateful streaming

Streaming operators that compute joins, windowed aggregates, sessions, or custom keyed logic all need state, a memory of what they have seen for each key. State is bounded by what you can keep on the operator nodes, and it has to survive failures, so it has to be checkpointed somewhere durable.

The state-storage spectrum:

In-memory only. Fastest, lost on restart. Usable for stateless transformations and small lookups; useless for production state.

Local embedded store, periodic checkpoint. Flink with RocksDB on local disk, periodically snapshotted to S3. Kafka Streams with RocksDB and changelog topics. Reads stay local (microseconds); checkpoints are async (the operator never blocks on a snapshot). The dominant pattern at scale.

Remote state. Faust-style with Redis; some Apache Beam runners; some cloud-managed engines. Reads cross the network on every operator step. Simpler operations, lower throughput, easier scaling. Sometimes worth it for state that needs to be queried from outside the streaming job.

State is the operations bill nobody budgets. A windowed aggregate on a high-cardinality key (user id, device id, session id) can hold tens of gigabytes per operator instance. A checkpoint to S3 can take minutes if the state is large and the network is slow. When checkpoints exceed the checkpoint interval, back-pressure rises and the job falls behind. The operations team that never realised streaming jobs had state at all gets a midnight page.

Exactly-once is a property of state, not of delivery

"Exactly-once" is the most contested phrase in distributed systems marketing. The honest version: networks deliver messages zero or more times, and you cannot guarantee any specific message is delivered exactly once. What you can guarantee is that the effect of the message on durable state appears exactly once after recovery. The message itself may have been delivered three times; the side-effect appears once.

Streaming engines achieve this through three pieces working together:

1. Replayable input. The source must support reading from a checkpointed position. Kafka offsets, Kinesis sequence numbers, Pulsar message ids. If the source cannot replay, exactly-once is impossible at the streaming layer.

2. Atomic state checkpoint. The engine periodically takes a consistent snapshot of all operator state plus the input position. On failure, all state is restored to the last snapshot and input is replayed from there. Flink uses asynchronous barrier snapshotting (Chandy-Lamport variant); Kafka Streams uses changelog topics; Spark Structured Streaming uses micro-batch boundaries with WAL.

3. Transactional sink. The sink must either support idempotent writes (so a replay is a no-op) or participate in a two-phase commit with the engine. Kafka has transactional producers; databases with primary keys can do upsert; Iceberg/Delta tables can do atomic table commits. Sinks that are plain append (S3 files, plain HTTP POST) need external dedup or accept at-least-once.

Take away any of the three and exactly-once degrades to at-least-once. Most teams that "have exactly-once" on paper actually have at-least-once with idempotent sinks doing the heavy lifting. That's fine, and often easier to reason about than the engine's transactional protocol, but worth being honest about.

Joins in streams

Joining two streams is where the design space explodes. Three common cases:

Stream-stream window join. Match events from stream A and stream B that arrived within a time bound of each other ("clicks within 5 minutes of an impression"). State holds both streams' recent events keyed by the join key; an arrival from either side emits matches and ages out old events past the window. Expensive, since state grows with the window size times event rate.

Stream-table join (enrichment). Each event in the stream looks up the current value of a key in a table held in state. The table is updated by a changelog topic or external source. Cheap on the stream side, but the table can be large — a global user table at a consumer-scale company is hundreds of millions of rows, hundreds of GB. Solutions: global tables (replicate everything to every operator, fine for thousands or low millions of rows) vs partitioned tables (co-partition by the join key, fine for high-cardinality tables but constrains key choice).

Temporal table join. "Match this event against the value the table had at the event's timestamp, not the current value." Needed for event-time-correct enrichment when the table itself changes over time (e.g., currency conversion using the rate at the time of the order, not now). Requires keeping a versioned table, which most engines model as a slowly-changing dimension.

Flink, Kafka Streams, Spark Structured Streaming

The three engines you actually choose between in 2026.

Flink is a true continuous-execution engine: operators run forever, processing one record at a time. Latency is record-by-record, typically sub-100ms. Watermarks, event-time windows, rich state, and exactly-once via aligned-barrier snapshotting are all native. The cost is operational complexity: you run a JobManager and TaskManagers, deal with checkpoint tuning, and need engineers comfortable with Java/Scala operator DAGs. The strongest choice when latency matters or when stateful event-time semantics are central.

Kafka Streams is a library, not a cluster. You write a Java/Scala app that reads from Kafka, processes, writes back to Kafka. State is kept in local RocksDB and backed up via changelog topics. There is no separate cluster to operate. Your app is the cluster. Latency is comparable to Flink on simple workloads; complex stateful joins are harder to express. Best fit for organizations already standardized on Kafka, especially for in-place enrichment, simple aggregations, and as-a-library deployment.

Spark Structured Streaming is micro-batch by default (continuous mode is experimental and limited). It runs as a sequence of small batch jobs, typically every few hundred ms to a few seconds. Latency is bounded by batch interval; throughput is extremely high. State management uses RocksDB or HDFS; exactly-once relies on idempotent sinks. The natural choice when streaming is the new face on an existing Spark warehouse, when the team knows Spark, or when latency in the seconds is acceptable.

                    Flink              Kafka Streams       Spark Structured
latency             10-100 ms          10-100 ms           seconds (micro-batch)
deployment          cluster            library             cluster
state backend       RocksDB local      RocksDB local       RocksDB / HDFS
event-time native   yes                yes                 yes (since 2.3)
exactly-once        ABS + 2PC sinks    transactions        idempotent sinks
session windows     correct            manual merge        since 3.2
operational cost    high               low (just an app)   medium (Spark cluster)

The lambda and kappa debates

Nathan Marz's lambda architecture (2011) proposed running batch and streaming in parallel: the batch layer recomputes the canonical answer slowly and correctly; the streaming layer fills the gap with an approximate answer until batch catches up; a serving layer merges both. The motivation was honest: streaming engines of the day could not be trusted for correctness, batch engines could not be trusted for latency, so run both.

Jay Kreps's kappa architecture (2014) proposed throwing batch away: if streaming can do exactly-once with replayable input, then a re-run of the same stream over the same input produces the same answer, which is a batch job. The serving layer becomes unnecessary; one pipeline does both real-time and historical recomputation.

In 2026, the practical answer is hybrid in a different way. Streaming engines now ship honest exactly-once on stateful work, so kappa is achievable. But warehouses (Snowflake, BigQuery, Databricks) are too convenient to ignore, so most production data platforms run a streaming layer for low-latency state (sessions, fraud flags, real-time dashboards) and feed the same source events into the warehouse for historical analytics. Whether you call that lambda or kappa is a matter of taste; the architecture is the same.

When not to stream

Three patterns that look like they want streaming and almost never do.

"Real-time dashboards" that refresh every minute. A minute is forever in streaming terms. A SQL query over an OLAP warehouse refreshed every minute is cheaper to build, cheaper to operate, and easier to debug than a Flink job.

"We need to react to events as they happen." If "react" means "emit a notification within a few seconds of the triggering event", a serverless function or a Kafka consumer with simple business logic is enough. You need streaming when the reaction depends on aggregated state — counts over windows, joins across streams, sessions.

"We want to process events one at a time instead of batching them." Latency is not the same as throughput. Batches of 100 with 50ms intervals can have lower per-event latency than naive one-at-a-time processing, thanks to amortized syscall cost and locality. Let what the consumer needs to see drive the choice, not what feels modern.

Operational realities

Three things every streaming-system operator learns the hard way.

Backfills are still a thing. A bug ships, the pipeline produces wrong output for six hours, you fix it. Now you have six hours of bad records to revise. If your engine supports replay and your sinks support upsert, you re-run from the right offset. If not, you write a one-off batch job over the source data to overwrite the bad output. Either way, exactly-once on the live path does not save you from backfill plumbing.

Schema evolution breaks state. The Avro schema for an event gains a new field. The streaming operator's keyed state still has the old shape on disk. The schema registry will deserialize new events fine; the operator's state migration is up to you. Practical defenses: keep state schemas separate from event schemas, version explicitly, plan migrations as part of every schema change.

Out-of-order is not the worst case. The worst case is "a single key with catastrophically more traffic than the others". Hot keys defeat partitioning; the operator handling that key is saturated; back-pressure rises; the whole job slows down. Fixes are workload-specific (pre-aggregation, key salting, separating hot keys onto dedicated partitions) and all of them complicate exactly-once. The simplest fix is monitoring per-key throughput and reacting before the hot key takes over.

What ships in 2026

The streaming landscape stabilized over the past five years. Flink owns the latency-sensitive stateful streaming niche, especially at scale; the open-source release cadence and the Confluent / Ververica backing keep it healthy. Kafka Streams has become the default for organizations whose data already lives in Kafka and whose use case is enrichment, simple aggregations, or simple joins. Spark Structured Streaming is the path of least resistence for teams whose batch warehouse runs on Spark and who can tolerate multi-second latency.

The interesting movements are on either flank. RisingWave and Materialize bring SQL-native streaming with materialised views as the programming model: write a CREATE MATERIALIZED VIEW, the system keeps it fresh, query it like a table. Both are still small ecosystems but represent where the SQL-shaped surface of streaming is going. Apache Beam remains the cross-runner abstraction but has lost steam as Flink and Spark consolidated. New projects more often pick one runner directly.

The decision is rarely "which engine is best" and almost always "which engine fits the team and the source". A team running Kafka with five engineers ships Kafka Streams faster than they ship Flink. A team running on Databricks ships Structured Streaming faster than either. A team building a latency-sensitive product where event-time correctness is in the spec ships Flink and pays the operations bill.

Streaming systems

Why streaming is not just fast batch

Event time vs processing time

Watermarks

The four windows

Triggers, panes, and refinements

State, the real cost of stateful streaming

Exactly-once is a property of state, not of delivery

Joins in streams

Flink, Kafka Streams, Spark Structured Streaming

The lambda and kappa debates

When not to stream

Operational realities

What ships in 2026

Further reading