The Google File System, commodity disks, custom semantics.

GFS is the file system Google built for itself when it discovered that none of the existing distributed file systems matched its workload. The paper is unusually direct about the trade-offs: a single master, append-mostly writes, weak consistency for concurrent writes, and 64-MB chunks. By rejecting POSIX faithfulness, GFS got something the academic file systems of the 1990s never reached: production scale on commodity hardware.

→ The original PDF (research.google, free)

Authors Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung

Year 2003

Venue SOSP

PDF →

TL;DR

GFS stores large files (typically multi-GB) split into fixed-size 64-MB chunks. Each chunk is replicated three times across chunkservers on commodity Linux boxes. A single master holds all metadata (file namespace, chunk locations) in memory — its state is small enough to fit because chunks are large. Clients read and write chunks directly to/from chunkservers; the master is only consulted for metadata. The consistency model is deliberately weak: appends are atomic and at-least-once, concurrent random writes can produce inconsistent regions, and applications are expected to use checksums and unique-ID-keyed records to detect duplicates. This trade-off bought Google a file system that scaled to thousands of disks and absorbed hundreds of failures per day.

The problem

In 2001 Google was outgrowing every available distributed file system. NFS, AFS, Coda, and the like assumed POSIX semantics, small files, and reliable disks. Google's actual workload had none of those properties: files were huge (multi-GB index shards, log files), writes were almost always appends from many producers, reads were either streaming or random-but-batched, and the disks failed constantly — at thousands of machines, hundreds of disks per day.

They also had freedom most academic systems didn't: they wrote both the file system and every program that used it. They could rewrite their applications to use any API the file system offered. That let them ask: if we don't have to be POSIX-compliant, what's the simplest design that scales?

The key idea

The architecture is master + chunkservers. The master holds every file's chunk list, every chunk's server locations, and the namespace tree. All of that fits in memory (the paper estimates ~64 bytes of metadata per 64-MB chunk). A single master simplifies everything — locking, replication policy, garbage collection — and the master is consulted only for metadata, not for data.

Data flows directly between clients and chunkservers. To write a chunk, a client gets the list of replicas from the master, pushes the data to the closest replica, which forwards in a pipeline to the next replica, and so on. One designated primary chunkserver serialises the order of writes within that chunk; the others apply writes in the same order.

The append-record API is the workhorse. The application calls RecordAppend and the file system appends the record atomically to at least one chunkserver. If a replica fails, the operation is retried and may produce duplicate records or padding, but the application is expected to handle this by tagging each record with a unique ID and ignoring duplicates on read. The paper makes the case that this matches application needs better than strict POSIX semantics.

Append is the contract. Reading the paper, the most surprising design choice is not the single master — it's how weak the write semantics are. Concurrent random writes can produce regions where different replicas hold different bytes. The system guarantees only that appends succeed atomically at least once. This sounds broken until you notice that Google's applications all wanted to append log records, not modify them in place. The file system was redesigned around the access pattern, not the other way around.

Contributions

Single master, scaled by metadata compactness. The 64-MB chunk size is the load-bearing design choice. It means a petabyte of data has ~16 million chunks, and the master can fit all of that metadata in 1 GB of RAM. Most distributed file systems before GFS struggled to scale metadata; GFS sidesteps the problem by making each unit of metadata cover 64 MB of data.

Replication as the consistency mechanism. Three copies per chunk, written via the primary-secondary pipeline. No quorum reads, no Paxos in the data path. The master uses a heartbeat protocol to detect failed chunkservers and re-replicates chunks that fall below their target.

The append API. Atomically appending records of unknown size, with at-least-once semantics, is the API every multi-producer log system has wanted since. It maps directly onto Kafka's log, BigQuery's sorted run files, and the various write-ahead-log libraries that succeeded.

Lazy reclamation. Deleted files are not immediately removed; the master renames them to a hidden name and reclaims chunks lazily in the background. The same pattern shows up later in HDFS and in object stores like S3.

Criticisms and limitations

The single master is the obvious limitation. As cluster sizes grew, the master became a bottleneck — operationally for failover, performance-wise for metadata-heavy workloads (many small files, many metadata operations per second). Google's successor, Colossus, distributes the master across many machines.

Small files are inefficient. Each file consumes at least one chunk's worth of metadata, and many applications later wanted millions of small files. GFS effectively forced applications to pack their small files into large container files.

The weak consistency model puts the burden on applications. Every multi-writer use case has to handle deduplication, padding bytes, and inconsistent regions. Google could afford this because they controlled both layers; an open-source descendant in HDFS later tightened the semantics for the same reason.

Where it shows up today

Apache HDFS — a near-clone of GFS, written at Yahoo for Hadoop in 2006. Same single-master design, same 64-MB chunks (later configurable), same replication factor of 3. Still ships with most Hadoop distributions.

Google Colossus — the unannounced successor to GFS, in production at Google since around 2010. Distributes the master, but the storage tier is recognisably descendant.

Object stores — S3, GCS, Azure Blob Storage — borrow GFS's append-only, replicate-then-serve ideas while presenting a flatter REST API. The internal architecture of any cloud blob store is essentially GFS-shaped.

Apache Ozone, MinIO, Ceph's RGW layer — all descend conceptually from GFS, with various tweaks (erasure coding, distributed masters).

Follow-up reading

MapReduce: Simplified Data Processing on Large Clusters — Dean & Ghemawat · 2004 · OSDI. The batch processing layer that consumed GFS. Annotated.
Bigtable: A Distributed Storage System for Structured Data — Chang et al · 2006 · OSDI. The structured-storage layer above GFS. Annotated.
The Hadoop Distributed File System — Shvachko et al · 2010 · MSST. The open-source clone. Worth reading for the operational details of running a single-master FS at scale.
Colossus: Successor to the Google File System (GFS) — Quinlan · 2010 · QCon talk. The unpublished successor. Notes from a Google talk; no formal paper.
Ceph: A Scalable, High-Performance Distributed File System — Weil et al · 2006 · OSDI. A contemporary, peer-to-peer alternative to GFS. CRUSH hashing instead of central master.

More annotated papers

Back to the papers index

Foundational distributed-systems and database papers, read and annotated.

← All papers

Found this useful?