Distributed scheduler
Cron, at scale. The mechanics are old — fire a job at a wall-clock time — but the interesting questions show up the moment one node isn't enough. Who owns which schedule. What happens when the owner dies mid-firing. Whether a missed tick gets dropped or caught up. How "fire once" survives a network partition. Most of the work is in the semantics, not the timer.
1 · Clarifying questions
Scheduler design lives or dies on the semantics. Pin them down before drawing boxes.
| How many schedules? | Order of magnitude matters more than the exact number. 10K vs 10M changes whether one node can hold the full set in memory. |
| What firing semantics? | At-most-once (drop on failure), at-least-once (retry, target dedupes), or exactly-once (the hardest, usually needs target idempotency). Default to at-least-once. |
| How long does a job run? | Seconds (call an HTTP endpoint) or hours (kick off a batch). Long-running jobs change the lease model and the "is it still running?" check. |
| Do we retry on failure? | Yes, with backoff and a cap. Confirm the cap (e.g. 5 attempts over 30 min) so we can size the firings table. |
| What about missed firings? | If the scheduler was down across a firing time, do we skip it or fire late with a catch-up marker? Most products want fire-late; some (e.g. "send me a notification at 9am") want skip. |
| Timezone story? | Crons are stored with an IANA timezone. DST transitions are explicit: "2:30 AM every day" is undefined on the spring-forward day. |
| Multi-tenant? | Yes. One tenant with a million schedules cannot starve the others — fairness is a design constraint, not a follow-up. |
| Latency budget for the fire? | P99 within ~1 s of the scheduled time. Tighter budgets (sub-100 ms) need a different design. |
2 · Capacity math, on a napkin
The numbers set the floor for the schedule store, the scheduler tier, and the worker tier. Each scales for a different reason.
| Number | Calculation | Result |
|---|---|---|
| Scheduled jobs | given | 10M |
| Firings / sec (avg) | amortised across the day | ~1K |
| Firings / sec (peak) | ~5× over midnight + top-of-hour | ~5K |
| Concurrent jobs in flight | 5K peak × 30 s avg duration | ~150K |
| Schedule row size | cron + target + payload + metadata | ~10 KB |
| Schedule store size | 10M × 10 KB | ~100 GB |
| Firings table / day | ~5K avg × 86,400 × ~1 KB | ~430 GB / day raw |
| Firings retention | 30 days at 80% compression | ~2.5 TB |
| Midnight cron burst | 10× the average | ~50K firings in 1 s |
The thundering-herd line is the one to call out. A naive design that fires every "midnight UTC" cron at exactly 00:00:00.000 turns a 1K-rps system into a 50K-rps spike. Jitter the fire times by a few seconds and the spike flattens.
3 · API and data model
A narrow surface. The schedule is the contract; the firings table is the audit log.
Endpoints
POST /v1/schedules # create
{
"cron_expr": "0 9 * * MON-FRI",
"timezone": "America/New_York",
"target_url": "https://acme.com/webhooks/morning-report",
"payload": {"team": "ops"},
"idempotency_key": "morning-report-ops"
}
→ 201 {"id": "sch_01HX...", "next_fire_at": "2026-05-19T13:00:00Z"}
GET /v1/schedules/:id # inspect
DELETE /v1/schedules/:id # cancel
GET /v1/schedules/:id/firings # history (paginated)Schema
schedules
id ULID PRIMARY KEY
owner_id BIGINT NOT NULL
cron_expr VARCHAR(120) NOT NULL
timezone VARCHAR(40) NOT NULL -- IANA, e.g. America/New_York
target_url TEXT NOT NULL
payload JSONB NULL
idempotency_key VARCHAR(120) NOT NULL -- forwarded to target on fire
status ENUM('active','paused','deleted') NOT NULL
next_fire_at TIMESTAMP NOT NULL -- precomputed; the hot index
shard_id SMALLINT NOT NULL -- hash(id) % N
INDEX (shard_id, next_fire_at) WHERE status = 'active' -- the only hot read
INDEX (owner_id, status)
firings
id ULID PRIMARY KEY
schedule_id ULID NOT NULL
fire_at TIMESTAMP NOT NULL -- the scheduled time
attempt SMALLINT NOT NULL
status ENUM('pending','running','succeeded','failed','dead') NOT NULL
started_at TIMESTAMP NULL
finished_at TIMESTAMP NULL
result_code INT NULL
result_excerpt TEXT NULL
INDEX (schedule_id, fire_at)
INDEX (status, fire_at) WHERE status IN ('pending','running')
leases
shard_id SMALLINT PRIMARY KEY
leader_id VARCHAR(64) NOT NULL
expires_at TIMESTAMP NOT NULL
fenced_at TIMESTAMP NULL -- monotonic fencing tokenTwo writes per firing: claim (mark running, set started_at) and finalise (set status, finished_at). The claim is what makes "fire once" work — a second leader trying to claim the same row loses on the optimistic-concurrency check.
4 · High-level architecture
Three tiers, each with its own scaling axis. The scheduler tier reads the index of
next_fire_at; the worker tier executes; the durable store is the source
of truth that both sides defer to when they disagree.
A shard is the unit of ownership. One scheduler leader per shard scans next_fire_at, claims rows, and pushes onto the fire queue. Workers are stateless and partition-aware — at-least-once delivery from the queue, idempotency at the target.
5 · Leader election, leases, and the missed-tick rule
The deep dive. One leader per shard is what keeps a firing from going out twice in the normal case. The lease is what keeps a dead leader from stalling that shard forever.
Single-leader-per-shard
Each shard has one elected leader via etcd / Zookeeper. The leader holds a lease with
a TTL (e.g. 15 s) and renews it every 5 s. Only the leader for shard k
scans rows where shard_id = k and claims firings. A non-leader is idle
for that shard. This is the cleanest way to avoid double-firing — exactly one process
is allowed to issue a claim at any time.
Lease loss and fencing
A leader that GCs for 20 s loses its lease. A new leader takes over the shard. The old leader might still try to enqueue firings it had in-flight before noticing the loss. Fencing tokens — a monotonically increasing number attached to every claim — let the firings table reject writes from a stale leader. Without fencing, the lease is advisory; with it, it's enforced.
Time triggers vs interval triggers
A cron expression resolves to wall-clock times. An interval trigger ("every 5 min from now") resolves to monotonic times. Mixing them is a common source of drift — if the system stores intervals as wall-clock and the host clock jumps, the next fire skews. Store the trigger as either-or, not both.
The missed-tick rule
Spell this out on the page. When the scheduler comes up after an outage, it scans
for rows where next_fire_at < now(). The policy is per-schedule:
- Fire-late. Default for most jobs. Enqueue one firing per missed slot up to a cap (e.g. last 10), tagged with a
catch_upmarker so the target can decide whether to skip. - Fire-most-recent. Common for "send the daily report". Skip intermediate missed slots; fire once for the most recent missed time.
- Skip. For user-facing notifications. A "9am reminder" delivered at 2pm is worse than not delivering it.
shard_id lets failover affect only 1/N of the schedules
and lets the scheduler tier scale horizontally with the row count.6 · Failure modes & runbook
The honest part. Most of these are about getting at-least-once right without turning it into many-times.
| Failure | Symptom | Mitigation |
|---|---|---|
| Leader dies mid-firing | Firing row in running with no progress | Lease expires (~15 s); new leader picks up; re-fires with same idempotency_key — target dedupes. |
| Schedule store partitions | One shard's firings stall | Scheduler for that shard backs off, retries; alarm at 30 s of stall. Other shards unaffected. |
| Target returns 5xx | Firing flips to failed | Exponential backoff, capped attempts. After cap, move to dead; alert owner; DLQ for inspection. |
| Clock skew between leaders | One leader fires a job a millisecond before the next leader does | Idempotency key on the target dedupes within a fire-window (e.g. 60 s). Without idempotency, this is a correctness bug. |
| Thundering herd at midnight | 50K firings in < 1 s; queue backs up; targets get hammered | Add per-schedule jitter (0–60 s) at create time. Rate-limit per target hostname in the worker tier. |
| One tenant submits 1M schedules | That tenant's shard saturates; neighbours starve | Per-tenant rate limit on create; per-tenant concurrency cap in the worker pool. Fairness, not first-come-first-served. |
| etcd unreachable | No new leases granted; existing leaders keep going until their lease expires | Hold steady for the lease TTL; degrade by pausing new schedules; page oncall before existing leases expire. |
| Worker stuck on a slow target | Concurrency exhausted on that worker pod | Per-target timeout (e.g. 30 s); circuit-break per-target after a threshold of consecutive failures. |
7 · Cost & operability
The shape of the bill, and the levers that move it. The schedule store dominates; the rest scales with what's actually firing, not with what's scheduled.
| Line | Estimate | Note |
|---|---|---|
| Schedule store (100 GB, sharded Postgres / DynamoDB) | ~$3K / month | Biggest line. Cost scales with stored schedules, not firings. |
| Firings log (2.5 TB at 30 d) | ~$500 / month | Append-only; columnar or partitioned tables; older partitions to cold storage. |
| Scheduler tier (32 pods, 1 vCPU each) | ~$500 / month | Cheap. Scales with shard count, not job count. |
| Worker tier (150K concurrent peaks) | ~$2K / month | Scales with concurrent firings × average duration, not with schedule count. |
| Queue (Kafka / SQS) | ~$400 / month | 5K msg/s peak; negligible. |
| etcd / leader election | ~$200 / month | Small cluster; the cost is operational, not compute. |
| Total | ~$6.6K / month | Per million schedules: ~$66 / M-schedules / month. |
8 · Trade-offs & what's next
The last 5 minutes. These are the second-order problems that distinguish a one-tenant scheduler from a platform.
| Concern | What it means |
|---|---|
| Fairness across tenants | Weighted fair queueing in the worker pool. One tenant with a million schedules gets a proportional slice, not 100%. |
| Prioritisation | A "priority" field on the schedule; higher-priority firings preempt lower in the queue. Useful for paid tiers and SLA-tagged workloads. |
| The fire-window concept | Instead of "fire at exactly T", "fire within [T, T+window]". Lets the scheduler smooth load and gives operators a knob during incidents. |
| Time-zoned crons and DST | Store the cron + the IANA zone, not the resolved UTC time. Recompute next_fire_at after every firing using the zone's current offset. |
| Exactly-once via target idempotency | The realistic path. The scheduler is at-least-once; the target uses idempotency_key to dedupe within a window. This is the same pattern Stripe uses for charges. |
| Saga rollback as an alternative | For multi-step jobs where dedupe is hard, model the work as a Saga — every step has a compensating action. Costs more state, but tolerates partial failure cleanly. |
| "What would you change at 10×?" | Reshard. Push the firings log into a stream store with longer retention. Add a regional scheduler tier so timezone-clustered loads (e.g. APAC mornings) stay regional. |
Further reading
- Google — "Reliable Cron across the Planet" (2016). The canonical writeup. How Google's internal cron service uses Paxos for leader election and survives datacenter failover. Read this once.
- Airbnb — "Chronos" (deprecated, but the writeup is good). A distributed cron built on Mesos. The lessons on dependencies between jobs are worth skimming even though the project is dead.
- Quartz Scheduler — "Clustering" docs. The Java-world reference. JDBC-backed lease via a row lock; a useful baseline before designing your own.
- Marc Brooker — "Leases, fencing tokens, and you". The clearest treatment of why a lease alone is not enough and what a fencing token buys you.
- Stripe — "Idempotency Keys at Stripe". The pattern the scheduler relies on at the target. Required reading for at-least-once with a clean user experience.
- Adjacent: Idempotency. The deep dive on the dedupe contract.
- Adjacent: Distributed locks. What the lease is, and what it isn't.
- Adjacent: Time and clocks. Why "fire at 2:30 AM" is harder than it looks.