08 / 19

Playbook / 08

Ride matching

An Uber-shape problem. Drivers heartbeat their location every few seconds; riders request a trip; the system has to pick a driver, quote an ETA, and run the trip through its lifecycle. The depth comes from the geospatial index — geohash, S2, quadtree — and the dispatch loop's failure modes when a driver declines, when two riders compete for the same driver, and when a region surges.

1 · Clarifying questions

Five minutes of disambiguation before any design work. The interviewer is partly watching to see if you ask before you assume.

"Cities or global?"	Global, but matching is a single-city / single-region operation. Cross-region matching doesn't exist in the real product.
"How many drivers / riders?"	10M active drivers worldwide, ~3M heartbeating right now. 100M monthly riders, 10M MAUs in flight on a peak Friday evening.
"Heartbeat frequency?"	Every 4 seconds when on-duty; every 30 seconds when in a trip (no need for fine resolution).
"Match radius?"	Start at 1.5 km; expand to 3 km, then 5 km if no driver accepts. Reject after ~10 seconds with no match.
"Latency budget?"	End-to-end match (rider tap → driver accept push) under 2 seconds P95. Dispatch logic itself under 300 ms.
"Multi-rider / pool?"	Out of scope for this design — single rider, single driver. Pool/UberX-shared is a different matching algorithm.
"Pricing in scope?"	Surge multiplier is in scope (it's a regional state). Fare calculation per route is not.
"Map / routing in scope?"	Use it as a service. Treat `getETA(from, to)` as a 50 ms blackbox call to the maps team.

2 · Capacity math

Five numbers, derived from the clarifications.

Number	Derivation	Value
Driver heartbeats	3M drivers × 1 / 4 s	~750k QPS
Trip requests	10M MAU × 1.5 trips/day ÷ 86 400 s, peak factor 5×	~900 QPS sustained, 4500 QPS peak
Dispatch fan-out	Per request: 3–10 candidates × 1 RPC each	~30k driver-RPCs / second peak
Geo-index hot data	3M drivers × 64 bytes (lat, lng, ID, state, heading)	~200 MB — fits in one Redis shard's RAM
Trip log writes	~900 trips/s × ~10 lifecycle events	~9k writes/s to the trips table

The 750k heartbeats/s is the load to design for. Every other QPS number is at least 100× smaller. The choice of geospatial index has to handle a sustained write rate at that scale.

3 · The geospatial-index choice

The single biggest design decision. Four mainstream options:

Index	Shape	Write cost	Read cost	Range query
Geohash	Z-order base-32 string (e.g. `9q8yyk`)	O(1) — prefix lookup	O(1)	Tricky near cell boundaries — must query 9 cells
Google S2	64-bit cell ID via Hilbert curve on the sphere	O(1)	O(1)	Native spherical math; built-in cap/region cover
Quadtree	Recursive 2D subdivision	O(log n) — tree descent	O(log n)	Natural for k-NN; harder to shard
R-tree	Bounding-box tree	O(log n)	O(log n)	Best for arbitrary polygons; overkill for points

For ride matching the choice is between geohash (simplest) and S2 (Google's, used by Uber's H3 successor — H3 is hex-cell-based but same family). Both store driver positions in O(1) cells; both shard by cell-ID prefix.

The cell-size trade. Cell precision 7 (geohash) ≈ 150 m × 150 m; precision 6 ≈ 1.2 km × 600 m. Match radius 1.5 km → query the rider's cell at precision 6 plus its 8 neighbours (the "9-cell pattern"). Smaller cells = fewer drivers per cell but more cells to query. The sweet spot is "median cell holds 3–10 candidates".

The hot-data layer is Redis with geospatial commands (GEOADD, GEORADIUS, GEOSEARCH) — these internally use a sorted-set keyed by geohash, with the score being the geohash integer. Sharding is by region (Manhattan, San Francisco, Tokyo are separate Redis clusters); within a region, by geohash-prefix. The full picture sits behind a "location service" layer that abstracts the storage so the dispatch loop sees a single nearbyDrivers(lat, lng, radius_m) RPC.

4 · API and data model

External APIs (rider / driver apps)

// Driver side
PUT  /driver/v1/heartbeat
  body: { driver_id, lat, lng, heading, state: "available|busy|offline" }
  response: { acknowledged_at }

POST /driver/v1/trip/{trip_id}/accept       // driver responds to dispatch
POST /driver/v1/trip/{trip_id}/decline
POST /driver/v1/trip/{trip_id}/arrive

// Rider side
POST /rider/v1/trip
  body: { rider_id, pickup_lat, pickup_lng, dest_lat, dest_lng, vehicle_type }
  response: { trip_id, status, estimated_pickup_seconds, surge_multiplier }

GET  /rider/v1/trip/{trip_id}               // poll; or use websocket push
POST /rider/v1/trip/{trip_id}/cancel

Core tables (just the columns that matter)

drivers           (id PK, name, vehicle_type, rating, current_region)
driver_status     (driver_id PK, state, last_heartbeat_at, last_lat, last_lng)
trips             (id PK, rider_id, driver_id, state, requested_at, pickup_geohash,
                   dropoff_geohash, surge_multiplier, fare, region)
trip_events       (trip_id, ts, event_type, payload_json)    -- the audit log
dispatch_queue    (trip_id PK, attempts, last_offered_driver_id, expires_at)
regional_surge    (region, multiplier, updated_at)

Two tiers of storage. Hot tier: Redis (driver positions, dispatch queue). Warm tier: a sharded SQL store like Vitess or a managed Spanner (trips, trip_events). The trip_events table is append-only and is the source of truth for the trip lifecycle — every state change is a row.

5 · High-level architecture

The dispatch service is the hot path. It calls the location service to find nearby drivers, asks surge for the regional multiplier, asks maps for ETA, and writes the trip lifecycle to Spanner / Vitess. Driver heartbeats flow directly to the location service (skipping the API gateway's per-request auth — the gateway is bypassed by an internal heartbeat ingest, secured with mTLS).

6 · Deep dive — the dispatch loop

This is the part interviewers will probe. The loop is short to describe but has six distinct race conditions.

The happy path, sequentially

Rider's POST hits the trip service; trip is created with state PENDING_DISPATCH and assigned an idempotency key.
Trip service emits to a per-region Kafka topic; the regional dispatch worker picks it up.
Dispatch worker asks the location service: "nearbyDrivers(pickup_lat, pickup_lng, radius=1.5 km, max=10, type=UberX)".
Worker ranks the candidates — heuristic: ETA + driver acceptance rate + match-fairness score. Picks the top one.
Worker reserves the driver in Redis with a TTL (10 s), changes trip state to OFFERED, and pushes the offer to the driver app over WebSocket.
Driver app responds ACCEPT (or doesn't). On accept: state → ASSIGNED, ride lifecycle begins. On decline / timeout: release the reservation, pick the next candidate, re-offer (up to N tries, then expand the radius).
Trip state moves through ARRIVING → PICKED_UP → ON_TRIP → COMPLETED as the driver app posts events.

The six races

Race	How it manifests	Mitigation
Two riders, one driver	Two dispatch workers offer the same driver simultaneously	Atomic `SETNX` on the driver-reservation key. The loser falls back to the next candidate.
Driver double-accepts	Network retry on accept RPC	Accept is idempotent on (driver_id, trip_id). Second accept returns the same state.
Stale driver heartbeat	Driver's phone died; index still shows them as available	TTL on heartbeat entries — 15 seconds. Stale entries auto-expire from the geo index.
Rider cancels mid-dispatch	Rider taps cancel after offer sent but before driver accept	Cancel atomically transitions trip state. Driver-app receives `OFFER_REVOKED` push; dispatch loop exits.
Driver app crashes mid-trip	Trip in `ON_TRIP` but no heartbeats arriving	Separate "trip health" worker watches for missing-heartbeat-during-trip; alerts the safety team after 60 s.
Region surge spike	Demand suddenly 10×; every dispatch loop expands radius and exhausts candidates	Surge service raises the multiplier; some riders drop out at the new price; supply re-equilibrates. Slow but self-stabilising.

7 · Deep dive — surge pricing

Surge is a regional, slowly-changing state. The dispatch loop reads the regional multiplier per request; the surge service updates it every ~30 seconds based on a demand:supply ratio.

Component	Signal	Update rate
Demand	Trip-request rate in the region (last 1 min)	30 s sliding window
Supply	Available-driver count in the region (live geo index)	5 s sampling
Multiplier	min(3.0, demand / supply × tuned_coefficient)	30 s refresh
Storage	Cell-keyed Redis hash (region → multiplier)	—
Audit	Every update emitted to Kafka; OLAP rollup nightly	—

The interview signal is acknowledging that surge needs hysteresis — without it, a flash of demand can spike then crash the price in seconds, which is a terrible UX. A simple low-pass filter (EMA over a 5-minute window) gets you most of the way. The "fairness" wrinkle — surge applied to a request that completes a minute later — is a product question, not a design one.

8 · Failure modes

Failure	Symptom	Runbook
Redis geo-shard down	One region's dispatch goes to zero match-rate	Failover to replica (RTO ~10 s). If replica also down, dispatch returns "no drivers available"; trip stays in PENDING and retries after 30 s. Don't lose the rider request.
Spanner / Vitess shard partition	Trip-state writes fail for some region	Buffer trip events in Kafka; replay when shard returns. Trip lifecycle continues from in-memory state; persistence is eventually-consistent during the partition.
Maps service slow	P99 ETA latency climbs from 50 ms to 5 s	Circuit-break Maps; fall back to a coarse Haversine distance / 30 km h estimate. Worse UX but the dispatch loop completes.
Driver app reconnects	WebSocket reconnect storm after a network blip	Sharded WebSocket fleet with sticky sessions; exponential backoff in the app; per-region rate limit on connect.
Hot region (sports event)	One region's Redis hits CPU 95%	Per-region cluster sized for 5× normal peak. Vertical scale-up is fast (Redis Cluster). The choke point is the dispatch service, not Redis — autoscale dispatch workers on Kafka lag.
Fraudulent driver heartbeats	Driver app modified to spam heartbeats at false positions	Server-side validation: reject heartbeats moving faster than 200 km h. Cross-check with map-matching service. Suspicious driver routed to safety review.

9 · Cost & operability

Component	Driver per month	Order of magnitude
Redis geo-index	5 regions × 8 nodes × ~$300	~$12k
Spanner / Vitess	~10k writes/s × 1 KB × 30 days × storage tier	~$30k
Kafka	~50k events/s × peak factor	~$8k
WebSocket fleet	10M concurrent connections × shard ratio	~$25k
Maps service (external bill)	~900 ETA calls/s × $0.005 / 1000 calls	~$10k
Compute (dispatch + trip)	~200 cores worth, multi-region	~$15k

The maps bill is the surprise line. Reduce it with aggressive caching of common origin–destination pairs and approximate (straight-line × correction factor) ETAs during low-stakes phases (rider sees "5–8 min"); call the real maps service only at dispatch time.

10 · Trade-offs & what would change at 10×

If…	Then…
10× trip volume	Shard Redis geo by sub-region (Manhattan East vs West). Add a per-cell dispatch worker pool. Move trip events from Spanner to a write-optimised log (BigQuery streaming or ScyllaDB).
10× drivers	Switch from 2D geohash to H3 (hexagonal cells, Uber's open-source). Better uniformity at high density. Re-shard the heartbeat ingest by cell-prefix.
Sub-second match SLA	Pre-rank drivers per cell in the background (top-N waiting list per cell, refreshed every 5 s). Dispatch becomes a cache lookup instead of a recompute.
Strict regulator (e.g. London)	Per-region driver-availability audit log. Mandatory rest-period enforcement (pause heartbeat acceptance). Receipt-level data residency.
Multi-modal (car + bike + scooter)	Separate vehicle-type sub-indexes per cell. Type-specific match radius (bikes: 500 m, cars: 1.5 km). Surge per type.
What would a senior+ answer add?	The matching-fairness layer. Greedy nearest-driver maximises efficiency but starves the drivers far from city-centre. A real Uber dispatch uses a bipartite-matching solver every few seconds across a small set of pending requests + candidate drivers — what's called batch matching. That's the senior-level wrinkle.

Ride matching

1 · Clarifying questions

2 · Capacity math

3 · The geospatial-index choice

4 · API and data model

External APIs (rider / driver apps)

Core tables (just the columns that matter)

5 · High-level architecture

6 · Deep dive — the dispatch loop

The happy path, sequentially

The six races

7 · Deep dive — surge pricing

8 · Failure modes

9 · Cost & operability

10 · Trade-offs & what would change at 10×

Further reading

Next — Typeahead / search autocomplete