12 / 19
Playbook / 12

Distributed scheduler

Cron, at scale. The mechanics are old — fire a job at a wall-clock time — but the interesting questions show up the moment one node isn't enough. Who owns which schedule. What happens when the owner dies mid-firing. Whether a missed tick gets dropped or caught up. How "fire once" survives a network partition. Most of the work is in the semantics, not the timer.


1 · Clarifying questions

Scheduler design lives or dies on the semantics. Pin them down before drawing boxes.

How many schedules?Order of magnitude matters more than the exact number. 10K vs 10M changes whether one node can hold the full set in memory.
What firing semantics?At-most-once (drop on failure), at-least-once (retry, target dedupes), or exactly-once (the hardest, usually needs target idempotency). Default to at-least-once.
How long does a job run?Seconds (call an HTTP endpoint) or hours (kick off a batch). Long-running jobs change the lease model and the "is it still running?" check.
Do we retry on failure?Yes, with backoff and a cap. Confirm the cap (e.g. 5 attempts over 30 min) so we can size the firings table.
What about missed firings?If the scheduler was down across a firing time, do we skip it or fire late with a catch-up marker? Most products want fire-late; some (e.g. "send me a notification at 9am") want skip.
Timezone story?Crons are stored with an IANA timezone. DST transitions are explicit: "2:30 AM every day" is undefined on the spring-forward day.
Multi-tenant?Yes. One tenant with a million schedules cannot starve the others — fairness is a design constraint, not a follow-up.
Latency budget for the fire?P99 within ~1 s of the scheduled time. Tighter budgets (sub-100 ms) need a different design.

2 · Capacity math, on a napkin

The numbers set the floor for the schedule store, the scheduler tier, and the worker tier. Each scales for a different reason.

NumberCalculationResult
Scheduled jobsgiven10M
Firings / sec (avg)amortised across the day~1K
Firings / sec (peak)~5× over midnight + top-of-hour~5K
Concurrent jobs in flight5K peak × 30 s avg duration~150K
Schedule row sizecron + target + payload + metadata~10 KB
Schedule store size10M × 10 KB~100 GB
Firings table / day~5K avg × 86,400 × ~1 KB~430 GB / day raw
Firings retention30 days at 80% compression~2.5 TB
Midnight cron burst10× the average~50K firings in 1 s

The thundering-herd line is the one to call out. A naive design that fires every "midnight UTC" cron at exactly 00:00:00.000 turns a 1K-rps system into a 50K-rps spike. Jitter the fire times by a few seconds and the spike flattens.

3 · API and data model

A narrow surface. The schedule is the contract; the firings table is the audit log.

Endpoints

POST /v1/schedules                    # create
{
  "cron_expr": "0 9 * * MON-FRI",
  "timezone": "America/New_York",
  "target_url": "https://acme.com/webhooks/morning-report",
  "payload": {"team": "ops"},
  "idempotency_key": "morning-report-ops"
}
→ 201 {"id": "sch_01HX...", "next_fire_at": "2026-05-19T13:00:00Z"}

GET    /v1/schedules/:id              # inspect
DELETE /v1/schedules/:id              # cancel
GET    /v1/schedules/:id/firings      # history (paginated)

Schema

schedules
  id              ULID PRIMARY KEY
  owner_id        BIGINT NOT NULL
  cron_expr       VARCHAR(120) NOT NULL
  timezone        VARCHAR(40)  NOT NULL    -- IANA, e.g. America/New_York
  target_url      TEXT         NOT NULL
  payload         JSONB        NULL
  idempotency_key VARCHAR(120) NOT NULL    -- forwarded to target on fire
  status          ENUM('active','paused','deleted') NOT NULL
  next_fire_at    TIMESTAMP    NOT NULL    -- precomputed; the hot index
  shard_id        SMALLINT     NOT NULL    -- hash(id) % N

  INDEX (shard_id, next_fire_at) WHERE status = 'active'  -- the only hot read
  INDEX (owner_id, status)

firings
  id              ULID PRIMARY KEY
  schedule_id     ULID NOT NULL
  fire_at         TIMESTAMP NOT NULL       -- the scheduled time
  attempt         SMALLINT  NOT NULL
  status          ENUM('pending','running','succeeded','failed','dead') NOT NULL
  started_at      TIMESTAMP NULL
  finished_at     TIMESTAMP NULL
  result_code     INT       NULL
  result_excerpt  TEXT      NULL

  INDEX (schedule_id, fire_at)
  INDEX (status, fire_at) WHERE status IN ('pending','running')

leases
  shard_id        SMALLINT PRIMARY KEY
  leader_id       VARCHAR(64) NOT NULL
  expires_at      TIMESTAMP   NOT NULL
  fenced_at       TIMESTAMP   NULL         -- monotonic fencing token

Two writes per firing: claim (mark running, set started_at) and finalise (set status, finished_at). The claim is what makes "fire once" work — a second leader trying to claim the same row loses on the optimistic-concurrency check.

4 · High-level architecture

Three tiers, each with its own scaling axis. The scheduler tier reads the index of next_fire_at; the worker tier executes; the durable store is the source of truth that both sides defer to when they disagree.

A shard is the unit of ownership. One scheduler leader per shard scans next_fire_at, claims rows, and pushes onto the fire queue. Workers are stateless and partition-aware — at-least-once delivery from the queue, idempotency at the target.

5 · Leader election, leases, and the missed-tick rule

The deep dive. One leader per shard is what keeps a firing from going out twice in the normal case. The lease is what keeps a dead leader from stalling that shard forever.

Single-leader-per-shard

Each shard has one elected leader via etcd / Zookeeper. The leader holds a lease with a TTL (e.g. 15 s) and renews it every 5 s. Only the leader for shard k scans rows where shard_id = k and claims firings. A non-leader is idle for that shard. This is the cleanest way to avoid double-firing — exactly one process is allowed to issue a claim at any time.

Lease loss and fencing

A leader that GCs for 20 s loses its lease. A new leader takes over the shard. The old leader might still try to enqueue firings it had in-flight before noticing the loss. Fencing tokens — a monotonically increasing number attached to every claim — let the firings table reject writes from a stale leader. Without fencing, the lease is advisory; with it, it's enforced.

Time triggers vs interval triggers

A cron expression resolves to wall-clock times. An interval trigger ("every 5 min from now") resolves to monotonic times. Mixing them is a common source of drift — if the system stores intervals as wall-clock and the host clock jumps, the next fire skews. Store the trigger as either-or, not both.

The missed-tick rule

Spell this out on the page. When the scheduler comes up after an outage, it scans for rows where next_fire_at < now(). The policy is per-schedule:

  • Fire-late. Default for most jobs. Enqueue one firing per missed slot up to a cap (e.g. last 10), tagged with a catch_up marker so the target can decide whether to skip.
  • Fire-most-recent. Common for "send the daily report". Skip intermediate missed slots; fire once for the most recent missed time.
  • Skip. For user-facing notifications. A "9am reminder" delivered at 2pm is worse than not delivering it.
Why one leader per shard, not one global leader. A single global leader is simpler but caps throughput at one process and makes failover a fleet-wide event. Sharding by shard_id lets failover affect only 1/N of the schedules and lets the scheduler tier scale horizontally with the row count.

6 · Failure modes & runbook

The honest part. Most of these are about getting at-least-once right without turning it into many-times.

FailureSymptomMitigation
Leader dies mid-firingFiring row in running with no progressLease expires (~15 s); new leader picks up; re-fires with same idempotency_key — target dedupes.
Schedule store partitionsOne shard's firings stallScheduler for that shard backs off, retries; alarm at 30 s of stall. Other shards unaffected.
Target returns 5xxFiring flips to failedExponential backoff, capped attempts. After cap, move to dead; alert owner; DLQ for inspection.
Clock skew between leadersOne leader fires a job a millisecond before the next leader doesIdempotency key on the target dedupes within a fire-window (e.g. 60 s). Without idempotency, this is a correctness bug.
Thundering herd at midnight50K firings in < 1 s; queue backs up; targets get hammeredAdd per-schedule jitter (0–60 s) at create time. Rate-limit per target hostname in the worker tier.
One tenant submits 1M schedulesThat tenant's shard saturates; neighbours starvePer-tenant rate limit on create; per-tenant concurrency cap in the worker pool. Fairness, not first-come-first-served.
etcd unreachableNo new leases granted; existing leaders keep going until their lease expiresHold steady for the lease TTL; degrade by pausing new schedules; page oncall before existing leases expire.
Worker stuck on a slow targetConcurrency exhausted on that worker podPer-target timeout (e.g. 30 s); circuit-break per-target after a threshold of consecutive failures.

7 · Cost & operability

The shape of the bill, and the levers that move it. The schedule store dominates; the rest scales with what's actually firing, not with what's scheduled.

LineEstimateNote
Schedule store (100 GB, sharded Postgres / DynamoDB)~$3K / monthBiggest line. Cost scales with stored schedules, not firings.
Firings log (2.5 TB at 30 d)~$500 / monthAppend-only; columnar or partitioned tables; older partitions to cold storage.
Scheduler tier (32 pods, 1 vCPU each)~$500 / monthCheap. Scales with shard count, not job count.
Worker tier (150K concurrent peaks)~$2K / monthScales with concurrent firings × average duration, not with schedule count.
Queue (Kafka / SQS)~$400 / month5K msg/s peak; negligible.
etcd / leader election~$200 / monthSmall cluster; the cost is operational, not compute.
Total~$6.6K / monthPer million schedules: ~$66 / M-schedules / month.
The anti-pattern to name. A user who registers a schedule that fires "every second" turns a sub-cent-per-month customer into a $10/month customer and eats real worker capacity. Cap minimum interval (e.g. 1 min) in the create API, and surface the option as a paid feature for anything tighter.

8 · Trade-offs & what's next

The last 5 minutes. These are the second-order problems that distinguish a one-tenant scheduler from a platform.

ConcernWhat it means
Fairness across tenantsWeighted fair queueing in the worker pool. One tenant with a million schedules gets a proportional slice, not 100%.
PrioritisationA "priority" field on the schedule; higher-priority firings preempt lower in the queue. Useful for paid tiers and SLA-tagged workloads.
The fire-window conceptInstead of "fire at exactly T", "fire within [T, T+window]". Lets the scheduler smooth load and gives operators a knob during incidents.
Time-zoned crons and DSTStore the cron + the IANA zone, not the resolved UTC time. Recompute next_fire_at after every firing using the zone's current offset.
Exactly-once via target idempotencyThe realistic path. The scheduler is at-least-once; the target uses idempotency_key to dedupe within a window. This is the same pattern Stripe uses for charges.
Saga rollback as an alternativeFor multi-step jobs where dedupe is hard, model the work as a Saga — every step has a compensating action. Costs more state, but tolerates partial failure cleanly.
"What would you change at 10×?"Reshard. Push the firings log into a stream store with longer retention. Add a regional scheduler tier so timezone-clustered loads (e.g. APAC mornings) stay regional.

Further reading

  • Google — "Reliable Cron across the Planet" (2016). The canonical writeup. How Google's internal cron service uses Paxos for leader election and survives datacenter failover. Read this once.
  • Airbnb — "Chronos" (deprecated, but the writeup is good). A distributed cron built on Mesos. The lessons on dependencies between jobs are worth skimming even though the project is dead.
  • Quartz Scheduler — "Clustering" docs. The Java-world reference. JDBC-backed lease via a row lock; a useful baseline before designing your own.
  • Marc Brooker — "Leases, fencing tokens, and you". The clearest treatment of why a lease alone is not enough and what a fencing token buys you.
  • Stripe — "Idempotency Keys at Stripe". The pattern the scheduler relies on at the target. Required reading for at-least-once with a clean user experience.
  • Adjacent: Idempotency. The deep dive on the dedupe contract.
  • Adjacent: Distributed locks. What the lease is, and what it isn't.
  • Adjacent: Time and clocks. Why "fire at 2:30 AM" is harder than it looks.
Found this useful?