12 / 19

Playbook / 12

Distributed scheduler

Cron, at scale. The mechanics are old — fire a job at a wall-clock time — but the interesting questions show up the moment one node isn't enough. Who owns which schedule. What happens when the owner dies mid-firing. Whether a missed tick gets dropped or caught up. How "fire once" survives a network partition. Most of the work is in the semantics, not the timer.

1 · Clarifying questions

Scheduler design lives or dies on the semantics. Pin them down before drawing boxes.

How many schedules?	Order of magnitude matters more than the exact number. 10K vs 10M changes whether one node can hold the full set in memory.
What firing semantics?	At-most-once (drop on failure), at-least-once (retry, target dedupes), or exactly-once (the hardest, usually needs target idempotency). Default to at-least-once.
How long does a job run?	Seconds (call an HTTP endpoint) or hours (kick off a batch). Long-running jobs change the lease model and the "is it still running?" check.
Do we retry on failure?	Yes, with backoff and a cap. Confirm the cap (e.g. 5 attempts over 30 min) so we can size the firings table.
What about missed firings?	If the scheduler was down across a firing time, do we skip it or fire late with a catch-up marker? Most products want fire-late; some (e.g. "send me a notification at 9am") want skip.
Timezone story?	Crons are stored with an IANA timezone. DST transitions are explicit: "2:30 AM every day" is undefined on the spring-forward day.
Multi-tenant?	Yes. One tenant with a million schedules cannot starve the others — fairness is a design constraint, not a follow-up.
Latency budget for the fire?	P99 within ~1 s of the scheduled time. Tighter budgets (sub-100 ms) need a different design.

2 · Capacity math, on a napkin

The numbers set the floor for the schedule store, the scheduler tier, and the worker tier. Each scales for a different reason.

Number	Calculation	Result
Scheduled jobs	given	10M
Firings / sec (avg)	amortised across the day	~1K
Firings / sec (peak)	~5× over midnight + top-of-hour	~5K
Concurrent jobs in flight	5K peak × 30 s avg duration	~150K
Schedule row size	cron + target + payload + metadata	~10 KB
Schedule store size	10M × 10 KB	~100 GB
Firings table / day	~5K avg × 86,400 × ~1 KB	~430 GB / day raw
Firings retention	30 days at 80% compression	~2.5 TB
Midnight cron burst	10× the average	~50K firings in 1 s

The thundering-herd line is the one to call out. A naive design that fires every "midnight UTC" cron at exactly 00:00:00.000 turns a 1K-rps system into a 50K-rps spike. Jitter the fire times by a few seconds and the spike flattens.

3 · API and data model

A narrow surface. The schedule is the contract; the firings table is the audit log.

Endpoints

POST /v1/schedules                    # create
{
  "cron_expr": "0 9 * * MON-FRI",
  "timezone": "America/New_York",
  "target_url": "https://acme.com/webhooks/morning-report",
  "payload": {"team": "ops"},
  "idempotency_key": "morning-report-ops"
}
→ 201 {"id": "sch_01HX...", "next_fire_at": "2026-05-19T13:00:00Z"}

GET    /v1/schedules/:id              # inspect
DELETE /v1/schedules/:id              # cancel
GET    /v1/schedules/:id/firings      # history (paginated)

Schema

schedules
  id              ULID PRIMARY KEY
  owner_id        BIGINT NOT NULL
  cron_expr       VARCHAR(120) NOT NULL
  timezone        VARCHAR(40)  NOT NULL    -- IANA, e.g. America/New_York
  target_url      TEXT         NOT NULL
  payload         JSONB        NULL
  idempotency_key VARCHAR(120) NOT NULL    -- forwarded to target on fire
  status          ENUM('active','paused','deleted') NOT NULL
  next_fire_at    TIMESTAMP    NOT NULL    -- precomputed; the hot index
  shard_id        SMALLINT     NOT NULL    -- hash(id) % N

  INDEX (shard_id, next_fire_at) WHERE status = 'active'  -- the only hot read
  INDEX (owner_id, status)

firings
  id              ULID PRIMARY KEY
  schedule_id     ULID NOT NULL
  fire_at         TIMESTAMP NOT NULL       -- the scheduled time
  attempt         SMALLINT  NOT NULL
  status          ENUM('pending','running','succeeded','failed','dead') NOT NULL
  started_at      TIMESTAMP NULL
  finished_at     TIMESTAMP NULL
  result_code     INT       NULL
  result_excerpt  TEXT      NULL

  INDEX (schedule_id, fire_at)
  INDEX (status, fire_at) WHERE status IN ('pending','running')

leases
  shard_id        SMALLINT PRIMARY KEY
  leader_id       VARCHAR(64) NOT NULL
  expires_at      TIMESTAMP   NOT NULL
  fenced_at       TIMESTAMP   NULL         -- monotonic fencing token

Two writes per firing: claim (mark running, set started_at) and finalise (set status, finished_at). The claim is what makes "fire once" work — a second leader trying to claim the same row loses on the optimistic-concurrency check.

4 · High-level architecture

Three tiers, each with its own scaling axis. The scheduler tier reads the index of next_fire_at; the worker tier executes; the durable store is the source of truth that both sides defer to when they disagree.

A shard is the unit of ownership. One scheduler leader per shard scans next_fire_at, claims rows, and pushes onto the fire queue. Workers are stateless and partition-aware — at-least-once delivery from the queue, idempotency at the target.

5 · Leader election, leases, and the missed-tick rule

The deep dive. One leader per shard is what keeps a firing from going out twice in the normal case. The lease is what keeps a dead leader from stalling that shard forever.

Single-leader-per-shard

Each shard has one elected leader via etcd / Zookeeper. The leader holds a lease with a TTL (e.g. 15 s) and renews it every 5 s. Only the leader for shard k scans rows where shard_id = k and claims firings. A non-leader is idle for that shard. This is the cleanest way to avoid double-firing — exactly one process is allowed to issue a claim at any time.

Lease loss and fencing

A leader that GCs for 20 s loses its lease. A new leader takes over the shard. The old leader might still try to enqueue firings it had in-flight before noticing the loss. Fencing tokens — a monotonically increasing number attached to every claim — let the firings table reject writes from a stale leader. Without fencing, the lease is advisory; with it, it's enforced.

Time triggers vs interval triggers

A cron expression resolves to wall-clock times. An interval trigger ("every 5 min from now") resolves to monotonic times. Mixing them is a common source of drift — if the system stores intervals as wall-clock and the host clock jumps, the next fire skews. Store the trigger as either-or, not both.

The missed-tick rule

Spell this out on the page. When the scheduler comes up after an outage, it scans for rows where next_fire_at < now(). The policy is per-schedule:

Fire-late. Default for most jobs. Enqueue one firing per missed slot up to a cap (e.g. last 10), tagged with a catch_up marker so the target can decide whether to skip.
Fire-most-recent. Common for "send the daily report". Skip intermediate missed slots; fire once for the most recent missed time.
Skip. For user-facing notifications. A "9am reminder" delivered at 2pm is worse than not delivering it.

Why one leader per shard, not one global leader. A single global leader is simpler but caps throughput at one process and makes failover a fleet-wide event. Sharding by shard_id lets failover affect only 1/N of the schedules and lets the scheduler tier scale horizontally with the row count.

6 · Failure modes & runbook

The honest part. Most of these are about getting at-least-once right without turning it into many-times.

Failure	Symptom	Mitigation
Leader dies mid-firing	Firing row in `running` with no progress	Lease expires (~15 s); new leader picks up; re-fires with same `idempotency_key` — target dedupes.
Schedule store partitions	One shard's firings stall	Scheduler for that shard backs off, retries; alarm at 30 s of stall. Other shards unaffected.
Target returns 5xx	Firing flips to `failed`	Exponential backoff, capped attempts. After cap, move to `dead`; alert owner; DLQ for inspection.
Clock skew between leaders	One leader fires a job a millisecond before the next leader does	Idempotency key on the target dedupes within a fire-window (e.g. 60 s). Without idempotency, this is a correctness bug.
Thundering herd at midnight	50K firings in < 1 s; queue backs up; targets get hammered	Add per-schedule jitter (0–60 s) at create time. Rate-limit per target hostname in the worker tier.
One tenant submits 1M schedules	That tenant's shard saturates; neighbours starve	Per-tenant rate limit on create; per-tenant concurrency cap in the worker pool. Fairness, not first-come-first-served.
etcd unreachable	No new leases granted; existing leaders keep going until their lease expires	Hold steady for the lease TTL; degrade by pausing new schedules; page oncall before existing leases expire.
Worker stuck on a slow target	Concurrency exhausted on that worker pod	Per-target timeout (e.g. 30 s); circuit-break per-target after a threshold of consecutive failures.

7 · Cost & operability

The shape of the bill, and the levers that move it. The schedule store dominates; the rest scales with what's actually firing, not with what's scheduled.

Line	Estimate	Note
Schedule store (100 GB, sharded Postgres / DynamoDB)	~$3K / month	Biggest line. Cost scales with stored schedules, not firings.
Firings log (2.5 TB at 30 d)	~$500 / month	Append-only; columnar or partitioned tables; older partitions to cold storage.
Scheduler tier (32 pods, 1 vCPU each)	~$500 / month	Cheap. Scales with shard count, not job count.
Worker tier (150K concurrent peaks)	~$2K / month	Scales with concurrent firings × average duration, not with schedule count.
Queue (Kafka / SQS)	~$400 / month	5K msg/s peak; negligible.
etcd / leader election	~$200 / month	Small cluster; the cost is operational, not compute.
Total	~$6.6K / month	Per million schedules: ~$66 / M-schedules / month.

The anti-pattern to name. A user who registers a schedule that fires "every second" turns a sub-cent-per-month customer into a $10/month customer and eats real worker capacity. Cap minimum interval (e.g. 1 min) in the create API, and surface the option as a paid feature for anything tighter.

8 · Trade-offs & what's next

The last 5 minutes. These are the second-order problems that distinguish a one-tenant scheduler from a platform.

Concern	What it means
Fairness across tenants	Weighted fair queueing in the worker pool. One tenant with a million schedules gets a proportional slice, not 100%.
Prioritisation	A "priority" field on the schedule; higher-priority firings preempt lower in the queue. Useful for paid tiers and SLA-tagged workloads.
The fire-window concept	Instead of "fire at exactly T", "fire within [T, T+window]". Lets the scheduler smooth load and gives operators a knob during incidents.
Time-zoned crons and DST	Store the cron + the IANA zone, not the resolved UTC time. Recompute `next_fire_at` after every firing using the zone's current offset.
Exactly-once via target idempotency	The realistic path. The scheduler is at-least-once; the target uses `idempotency_key` to dedupe within a window. This is the same pattern Stripe uses for charges.
Saga rollback as an alternative	For multi-step jobs where dedupe is hard, model the work as a Saga — every step has a compensating action. Costs more state, but tolerates partial failure cleanly.
"What would you change at 10×?"	Reshard. Push the firings log into a stream store with longer retention. Add a regional scheduler tier so timezone-clustered loads (e.g. APAC mornings) stay regional.

Distributed scheduler

1 · Clarifying questions

2 · Capacity math, on a napkin

3 · API and data model

Endpoints

Schema

4 · High-level architecture

5 · Leader election, leases, and the missed-tick rule

Single-leader-per-shard

Lease loss and fencing

Time triggers vs interval triggers

The missed-tick rule

6 · Failure modes & runbook

7 · Cost & operability

8 · Trade-offs & what's next

Further reading

Event ingestion