Burns et al · 2016
Paper · Systems · Container orchestration

Borg, Omega, Kubernetes, one decade, three schedulers.

Three container-orchestration systems, each shipped by Google, each lasting a decade. Borg ran every Google service from 2004. Omega was the 2013 attempt to fix Borg's monolithic scheduler. Kubernetes is the open-source descendant that distilled the lessons into a product everyone outside Google could actually use. The paper is the rare candid retrospective from the architects of all three.

Authors Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes
Year 2016
Venue ACM Queue

TL;DR

Borg (production from ~2004) was a single monolithic scheduler that placed every job on every machine in a Google cluster. By 2010 the scheduler was a bottleneck and adding features required modifying its core. Omega (2013) decomposed Borg into a shared persistent state store and multiple specialised schedulers running optimistic concurrency control against it. Each scheduler could be optimised for its workload (batch, service, gpu) and they coexisted. Kubernetes (2014) was the open-source rewrite designed for non-Google environments. It kept Omega's key insights — declarative API, shared state via etcd, controller pattern — but added pod and label abstractions that hid Borg's "job" concept and made the system more accessible. The paper documents what each generation got right and what they had to throw away.

The problem

Google had been running container-orchestration internally since 2004 — long before Docker existed, long before "Kubernetes" was a word. Borg was an internal system; very few outside Google knew the details. By 2013 the team realised two things. First, Borg's monolithic scheduler had become a development bottleneck — every new feature touched core code. Second, the lessons of Borg deserved an open-source release, but a literal port of Borg wouldn't work for outside companies.

The paper documents this transition in unusually frank terms: what Borg got right, what Borg got wrong, what Omega tried to fix and what it had to compromise on, what Kubernetes deliberately copied and what it deliberately changed. Most retrospectives are sanitised; this one is closer to engineering postmortem honesty.

The key idea

Borg: monolithic scheduler, prescriptive API. A single scheduler decides where every task runs. The API is imperative — you submit a job descriptor, the scheduler places it. Resource limits, priority classes, anti-affinity, gang scheduling are all built into the central scheduler. This worked while Google had one team owning the scheduler; it stopped working when many teams wanted to extend it.

Omega: shared state, optimistic concurrency. The architectural insight is that the scheduler doesn't need to be a single component. A persistent state store holds the cluster's current allocation. Multiple schedulers read from it concurrently, propose changes optimistically, and commit using transactional updates. Conflicts are detected and the loser retries. Each scheduler can be specialised — batch, service, ML, etc. — without coordinating with the others except through the state store.

Kubernetes: declarative API, the controller pattern. Kubernetes inherits Omega's shared-state design (etcd is the store) but makes the API declarative: you tell the system what you want (a Deployment with three replicas), and controllers continuously reconcile the desired state with the observed state. Components are loosely coupled — the scheduler, kubelet, controllers all talk to the same API server, not to each other. The pod abstraction (a co-scheduled group of containers) replaces Borg's alloc concept and becomes the unit everyone reasons about.

Labels and selectors. The paper points out that Borg's tight coupling between jobs and machines made workload reconfiguration expensive. Kubernetes' labels are a deliberately decoupled scheme: every object has labels; selectors query objects by label. A Service's endpoints are "all pods with label app=foo"; if a pod changes labels, the Service's endpoint set changes automatically. This was a substantial departure from Borg.

The reconciliation loop. The controller pattern — controllers that watch state and continuously work to make observed reality match desired state — is the single most-borrowed idea from this paper. It's the foundation of every Kubernetes controller, every operator, every GitOps tool, and most modern infrastructure-as-code systems. The paper documents how Borg's imperative job submission gave way to Kubernetes' declarative resource specs, and how that change cascades through every component.

Contributions

Three generations of design. The paper documents the evolution honestly — what each generation got right, what it had to abandon. Most internal retrospectives stay internal; this one is public.

The Omega architecture. Shared state, optimistic concurrency, multiple specialised schedulers. The architectural pattern shows up in Kubernetes' API server + multiple controllers, in Mesos' two-level scheduling, and in modern multi-cluster systems.

Pods. The unit of co-scheduled, co-located containers. The paper makes the case that Borg's allocs were too rigid; pods are the same idea, refined.

Declarative API + controller pattern. The most-imitated idea from Kubernetes. Tells the whole industry that infrastructure is declarative, not imperative.

The honest "things we got wrong" list. The paper enumerates specific Borg decisions the team would revisit: job-centric resource model, opaque scheduling logic, in-place updates that didn't survive scheduler crashes, the absence of pod-like abstractions. Useful for any team designing a new orchestration layer.

Criticisms and limitations

The paper is short. Each generation gets a few pages. Some claims (especially about Omega's production performance) are stated without detail. The follow-up Borg paper (Verma et al, EuroSys 2015) provides more depth on Borg specifically.

It's a paper, not a manual. Some claims about Kubernetes are aspirational — the paper was written when Kubernetes was a year old. Several "we intend to" statements have become "we tried that and it didn't work".

Optimistic concurrency at cluster scale is hard. The paper presents shared-state-with-OCC as the clear win, but production Kubernetes (etcd-backed) has had performance issues at large cluster sizes. Multiple etcd-backed clusters of 5000+ nodes have run into write-throughput limits that the original paper didn't anticipate.

The "Kubernetes is for everyone" claim was always optimistic. The paper underemphasises how much Google-internal context Borg assumed (predictable workloads, central operations, deeply integrated security). Open-source Kubernetes inherits some of those assumptions awkwardly — multi-tenancy, network security, RBAC are all areas where the project has struggled.

Where it shows up today

Kubernetes — the descendant. Every cluster running CNCF Kubernetes traces its lineage to this paper.

Apache Mesos and Marathon — independent scheduler stack with similar two-level scheduling principles.

HashiCorp Nomad — a competitor scheduler that explicitly compares itself to Borg and Kubernetes in its own design docs.

Apache Airflow, Argo Workflows, Temporal — workflow orchestrators built on the Kubernetes controller pattern.

Crossplane, ArgoCD, Flux — GitOps tooling that uses the reconciliation-loop pattern at a higher level (Git as the desired state).

Every major cloud provider's managed Kubernetes (GKE, EKS, AKS) and their internal scheduler implementations (GCE's Borg-derived scheduler, AWS's Fargate, Azure's ACI).

Follow-up reading

More annotated papers
Back to the papers index
Foundational distributed-systems and database papers, read and annotated.
Found this useful?