MapReduce, embarrassing parallelism, productised.

MapReduce isn't a new algorithm. It's a programming model small enough to fit on a napkin: split input into records, apply a map function to each, group by key, apply a reduce function to each group. The paper's contribution is the engineering around it — a runtime that hides parallelism, failure, and data movement behind those two functions, letting an analyst on a single machine express a job that runs on ten thousand.

Authors Jeffrey Dean, Sanjay Ghemawat

Year 2004

Venue OSDI

PDF →

TL;DR

Programmers write two pure functions: map(key, value) → list((key, value)) and reduce(key, list(value)) → list(value). The MapReduce runtime splits the input into chunks, runs map tasks in parallel across a cluster, shuffles intermediate output by key, runs reduce tasks in parallel, and writes the result. Workers that fail or run slowly are re-executed; the framework guarantees that the final output is the same as a single-threaded execution. Google ran tens of thousands of MapReduce jobs a day by 2004, processing petabytes — and the paper made distributed batch processing accessible to a generation of engineers who had never written for a cluster before.

The problem

Google had thousands of programmers writing one-off cluster jobs: building the search index, computing PageRank, processing crawl logs, extracting structured data from web pages. Each job dealt with the same three problems — partitioning the input across machines, handling failed workers, and moving data between map and reduce phases — and each programmer solved them differently and badly.

The paper observes that "most computations involve applying a map operation to each logical record [...] to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key." Once you notice this, you can factor out everything else into a library. The shape of the user code shrinks to two functions; the shape of the runtime grows enormously, but the runtime is written once.

The key idea

The model assumes pure functional map and reduce — no shared state, no order dependence within a single key. Given that, the runtime is free to: split inputs however it wants, run any number of map tasks in parallel, restart failed tasks anywhere, and shuffle intermediate data through arbitrary network paths. The output is deterministic.

The architecture is master-worker. A single master assigns map and reduce tasks to workers, tracks task state, and handles failures. Map outputs are written to local disk on each worker, partitioned by reduce-task index. Reducers pull their partition from every mapper (the "shuffle"). The master is a single point of failure, but in 2004 a cluster of 1,800 machines crashed often enough that the master was rarely the bottleneck.

Fault tolerance is brutal-simple: a failed map task is re-executed; a failed reduce task is re-executed; intermediate data on disk is treated as recoverable but maps may be re-run to regenerate it. Stragglers (slow workers) are speculatively re-executed in parallel, and whichever copy finishes first is used.

Locality is the secret. Map tasks are scheduled on workers that already hold the input chunk on local disk (because GFS replicates chunks across the cluster). The paper reports that more than 90% of the data is read locally on Google's clusters, meaning no network bandwidth is consumed by the map phase. This is the single optimisation that made MapReduce feasible at petabyte scale on commodity gigabit Ethernet.

Contributions

The model. Map and Reduce as the only primitives a user writes. Every framework that came after — Hadoop, Spark, Flink, Beam, BigQuery — is either MapReduce-as-described or a generalisation (DAGs of map-reduce-like stages).

The runtime. Master-worker with task re-execution and speculative duplication. The paper's sections on fault tolerance, locality, and backup tasks read like a checklist that every batch framework has since implemented.

The engineering economics. The paper makes the case that batch processing on commodity hardware with re-execution-on-failure is cheaper than HPC-style supercomputers with hardware reliability. This is now the dominant assumption underneath every cloud-scale batch system.

Backup tasks. Section 3.6 introduces speculative re-execution of straggling tasks. The technique is now standard in Spark, Tez, Flink, and most database query engines. The paper measures: enabling backup tasks reduces total job time by ~44% on a representative workload.

Criticisms and limitations

The shuffle is the bottleneck. With M map tasks and R reduce tasks, the network carries M × R partitioned streams. For a job with thousands of mappers and reducers, this approaches network saturation. Spark and Flink later improved this with broadcast joins, partition pruning, and in-memory shuffle services.

Iterative algorithms (PageRank, machine-learning training, graph processing) require many MapReduce jobs chained together, each writing its output back to GFS and reading it again. This pays the shuffle cost on every iteration. Spark's RDD abstraction (Zaharia et al, NSDI 2012) addressed this by keeping intermediate data in memory between stages.

The two-function API is too rigid for many workloads. Real jobs want broadcast variables, joins, filters, custom partitioners — features that require user code outside the map/reduce boundary. Hadoop added Pig and Hive on top; Spark and Flink simply replaced the model with DAGs of operations.

Where it shows up today

Apache Hadoop MapReduce — the open-source descendant, written shortly after this paper at Yahoo. Powered the early big-data ecosystem (2007–2014) and is still used in legacy pipelines.

Apache Spark — a generalisation of MapReduce that keeps intermediate data in memory and supports DAGs of operations. The dominant batch engine of the 2010s and 2020s. Databricks, Palantir, Amazon EMR.

Apache Flink, Apache Beam, Google Dataflow — stream-first frameworks that subsume MapReduce as a special case (batch is a bounded stream).

BigQuery, Snowflake, Redshift — analytical databases whose execution engines use MapReduce-style shuffle stages internally, hidden behind SQL.

Follow-up reading

The Google File System — Ghemawat et al · 2003 · SOSP. The storage layer underneath. MapReduce depends on it for locality and for fault-tolerant intermediate storage.
Resilient Distributed Datasets — Zaharia et al · 2012 · NSDI. The Spark paper. RDDs as the in-memory generalisation of MapReduce.
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks — Isard et al · 2007 · EuroSys. Microsoft's DAG-based answer to MapReduce. Predates Spark by 5 years.
FlumeJava: Easy, Efficient Data-Parallel Pipelines — Chambers et al · 2010 · PLDI. Google's in-house MapReduce successor. Composes pipelines of MR stages and optimises them.
MillWheel: Fault-Tolerant Stream Processing at Internet Scale — Akidau et al · 2013 · VLDB. The streaming MapReduce. Became the Dataflow / Beam model.

More annotated papers

Back to the papers index

Foundational distributed-systems and database papers, read and annotated.

← All papers

Found this useful?