08 / 19
Playbook / 08

Ride matching

An Uber-shape problem. Drivers heartbeat their location every few seconds; riders request a trip; the system has to pick a driver, quote an ETA, and run the trip through its lifecycle. The depth comes from the geospatial index — geohash, S2, quadtree — and the dispatch loop's failure modes when a driver declines, when two riders compete for the same driver, and when a region surges.


1 · Clarifying questions

Five minutes of disambiguation before any design work. The interviewer is partly watching to see if you ask before you assume.

"Cities or global?"Global, but matching is a single-city / single-region operation. Cross-region matching doesn't exist in the real product.
"How many drivers / riders?"10M active drivers worldwide, ~3M heartbeating right now. 100M monthly riders, 10M MAUs in flight on a peak Friday evening.
"Heartbeat frequency?"Every 4 seconds when on-duty; every 30 seconds when in a trip (no need for fine resolution).
"Match radius?"Start at 1.5 km; expand to 3 km, then 5 km if no driver accepts. Reject after ~10 seconds with no match.
"Latency budget?"End-to-end match (rider tap → driver accept push) under 2 seconds P95. Dispatch logic itself under 300 ms.
"Multi-rider / pool?"Out of scope for this design — single rider, single driver. Pool/UberX-shared is a different matching algorithm.
"Pricing in scope?"Surge multiplier is in scope (it's a regional state). Fare calculation per route is not.
"Map / routing in scope?"Use it as a service. Treat getETA(from, to) as a 50 ms blackbox call to the maps team.

2 · Capacity math

Five numbers, derived from the clarifications.

NumberDerivationValue
Driver heartbeats3M drivers × 1 / 4 s~750k QPS
Trip requests10M MAU × 1.5 trips/day ÷ 86 400 s, peak factor 5×~900 QPS sustained, 4500 QPS peak
Dispatch fan-outPer request: 3–10 candidates × 1 RPC each~30k driver-RPCs / second peak
Geo-index hot data3M drivers × 64 bytes (lat, lng, ID, state, heading)~200 MB — fits in one Redis shard's RAM
Trip log writes~900 trips/s × ~10 lifecycle events~9k writes/s to the trips table

The 750k heartbeats/s is the load to design for. Every other QPS number is at least 100× smaller. The choice of geospatial index has to handle a sustained write rate at that scale.

3 · The geospatial-index choice

The single biggest design decision. Four mainstream options:

IndexShapeWrite costRead costRange query
GeohashZ-order base-32 string (e.g. 9q8yyk)O(1) — prefix lookupO(1)Tricky near cell boundaries — must query 9 cells
Google S264-bit cell ID via Hilbert curve on the sphereO(1)O(1)Native spherical math; built-in cap/region cover
QuadtreeRecursive 2D subdivisionO(log n) — tree descentO(log n)Natural for k-NN; harder to shard
R-treeBounding-box treeO(log n)O(log n)Best for arbitrary polygons; overkill for points

For ride matching the choice is between geohash (simplest) and S2 (Google's, used by Uber's H3 successor — H3 is hex-cell-based but same family). Both store driver positions in O(1) cells; both shard by cell-ID prefix.

The cell-size trade. Cell precision 7 (geohash) ≈ 150 m × 150 m; precision 6 ≈ 1.2 km × 600 m. Match radius 1.5 km → query the rider's cell at precision 6 plus its 8 neighbours (the "9-cell pattern"). Smaller cells = fewer drivers per cell but more cells to query. The sweet spot is "median cell holds 3–10 candidates".

The hot-data layer is Redis with geospatial commands (GEOADD, GEORADIUS, GEOSEARCH) — these internally use a sorted-set keyed by geohash, with the score being the geohash integer. Sharding is by region (Manhattan, San Francisco, Tokyo are separate Redis clusters); within a region, by geohash-prefix. The full picture sits behind a "location service" layer that abstracts the storage so the dispatch loop sees a single nearbyDrivers(lat, lng, radius_m) RPC.

4 · API and data model

External APIs (rider / driver apps)

// Driver side
PUT  /driver/v1/heartbeat
  body: { driver_id, lat, lng, heading, state: "available|busy|offline" }
  response: { acknowledged_at }

POST /driver/v1/trip/{trip_id}/accept       // driver responds to dispatch
POST /driver/v1/trip/{trip_id}/decline
POST /driver/v1/trip/{trip_id}/arrive

// Rider side
POST /rider/v1/trip
  body: { rider_id, pickup_lat, pickup_lng, dest_lat, dest_lng, vehicle_type }
  response: { trip_id, status, estimated_pickup_seconds, surge_multiplier }

GET  /rider/v1/trip/{trip_id}               // poll; or use websocket push
POST /rider/v1/trip/{trip_id}/cancel

Core tables (just the columns that matter)

drivers           (id PK, name, vehicle_type, rating, current_region)
driver_status     (driver_id PK, state, last_heartbeat_at, last_lat, last_lng)
trips             (id PK, rider_id, driver_id, state, requested_at, pickup_geohash,
                   dropoff_geohash, surge_multiplier, fare, region)
trip_events       (trip_id, ts, event_type, payload_json)    -- the audit log
dispatch_queue    (trip_id PK, attempts, last_offered_driver_id, expires_at)
regional_surge    (region, multiplier, updated_at)

Two tiers of storage. Hot tier: Redis (driver positions, dispatch queue). Warm tier: a sharded SQL store like Vitess or a managed Spanner (trips, trip_events). The trip_events table is append-only and is the source of truth for the trip lifecycle — every state change is a row.

5 · High-level architecture

Rider appDriver appAPI gatewayTrip servicelifecycle, idempotencyDispatch servicematching loopLocation servicegeo index facadeSurge servicedemand:supply ratioMaps / ETAexternal serviceSpanner / VitessKafka — eventsRedis (geo)Redis (state)OLAP (surge)

The dispatch service is the hot path. It calls the location service to find nearby drivers, asks surge for the regional multiplier, asks maps for ETA, and writes the trip lifecycle to Spanner / Vitess. Driver heartbeats flow directly to the location service (skipping the API gateway's per-request auth — the gateway is bypassed by an internal heartbeat ingest, secured with mTLS).

6 · Deep dive — the dispatch loop

This is the part interviewers will probe. The loop is short to describe but has six distinct race conditions.

The happy path, sequentially

  1. Rider's POST hits the trip service; trip is created with state PENDING_DISPATCH and assigned an idempotency key.
  2. Trip service emits to a per-region Kafka topic; the regional dispatch worker picks it up.
  3. Dispatch worker asks the location service: "nearbyDrivers(pickup_lat, pickup_lng, radius=1.5 km, max=10, type=UberX)".
  4. Worker ranks the candidates — heuristic: ETA + driver acceptance rate + match-fairness score. Picks the top one.
  5. Worker reserves the driver in Redis with a TTL (10 s), changes trip state to OFFERED, and pushes the offer to the driver app over WebSocket.
  6. Driver app responds ACCEPT (or doesn't). On accept: state → ASSIGNED, ride lifecycle begins. On decline / timeout: release the reservation, pick the next candidate, re-offer (up to N tries, then expand the radius).
  7. Trip state moves through ARRIVING → PICKED_UP → ON_TRIP → COMPLETED as the driver app posts events.

The six races

RaceHow it manifestsMitigation
Two riders, one driverTwo dispatch workers offer the same driver simultaneouslyAtomic SETNX on the driver-reservation key. The loser falls back to the next candidate.
Driver double-acceptsNetwork retry on accept RPCAccept is idempotent on (driver_id, trip_id). Second accept returns the same state.
Stale driver heartbeatDriver's phone died; index still shows them as availableTTL on heartbeat entries — 15 seconds. Stale entries auto-expire from the geo index.
Rider cancels mid-dispatchRider taps cancel after offer sent but before driver acceptCancel atomically transitions trip state. Driver-app receives OFFER_REVOKED push; dispatch loop exits.
Driver app crashes mid-tripTrip in ON_TRIP but no heartbeats arrivingSeparate "trip health" worker watches for missing-heartbeat-during-trip; alerts the safety team after 60 s.
Region surge spikeDemand suddenly 10×; every dispatch loop expands radius and exhausts candidatesSurge service raises the multiplier; some riders drop out at the new price; supply re-equilibrates. Slow but self-stabilising.

7 · Deep dive — surge pricing

Surge is a regional, slowly-changing state. The dispatch loop reads the regional multiplier per request; the surge service updates it every ~30 seconds based on a demand:supply ratio.

ComponentSignalUpdate rate
DemandTrip-request rate in the region (last 1 min)30 s sliding window
SupplyAvailable-driver count in the region (live geo index)5 s sampling
Multipliermin(3.0, demand / supply × tuned_coefficient)30 s refresh
StorageCell-keyed Redis hash (region → multiplier)
AuditEvery update emitted to Kafka; OLAP rollup nightly

The interview signal is acknowledging that surge needs hysteresis — without it, a flash of demand can spike then crash the price in seconds, which is a terrible UX. A simple low-pass filter (EMA over a 5-minute window) gets you most of the way. The "fairness" wrinkle — surge applied to a request that completes a minute later — is a product question, not a design one.

8 · Failure modes

FailureSymptomRunbook
Redis geo-shard downOne region's dispatch goes to zero match-rateFailover to replica (RTO ~10 s). If replica also down, dispatch returns "no drivers available"; trip stays in PENDING and retries after 30 s. Don't lose the rider request.
Spanner / Vitess shard partitionTrip-state writes fail for some regionBuffer trip events in Kafka; replay when shard returns. Trip lifecycle continues from in-memory state; persistence is eventually-consistent during the partition.
Maps service slowP99 ETA latency climbs from 50 ms to 5 sCircuit-break Maps; fall back to a coarse Haversine distance / 30 km h estimate. Worse UX but the dispatch loop completes.
Driver app reconnectsWebSocket reconnect storm after a network blipSharded WebSocket fleet with sticky sessions; exponential backoff in the app; per-region rate limit on connect.
Hot region (sports event)One region's Redis hits CPU 95%Per-region cluster sized for 5× normal peak. Vertical scale-up is fast (Redis Cluster). The choke point is the dispatch service, not Redis — autoscale dispatch workers on Kafka lag.
Fraudulent driver heartbeatsDriver app modified to spam heartbeats at false positionsServer-side validation: reject heartbeats moving faster than 200 km h. Cross-check with map-matching service. Suspicious driver routed to safety review.

9 · Cost & operability

ComponentDriver per monthOrder of magnitude
Redis geo-index5 regions × 8 nodes × ~$300~$12k
Spanner / Vitess~10k writes/s × 1 KB × 30 days × storage tier~$30k
Kafka~50k events/s × peak factor~$8k
WebSocket fleet10M concurrent connections × shard ratio~$25k
Maps service (external bill)~900 ETA calls/s × $0.005 / 1000 calls~$10k
Compute (dispatch + trip)~200 cores worth, multi-region~$15k

The maps bill is the surprise line. Reduce it with aggressive caching of common origin–destination pairs and approximate (straight-line × correction factor) ETAs during low-stakes phases (rider sees "5–8 min"); call the real maps service only at dispatch time.

10 · Trade-offs & what would change at 10×

If…Then…
10× trip volumeShard Redis geo by sub-region (Manhattan East vs West). Add a per-cell dispatch worker pool. Move trip events from Spanner to a write-optimised log (BigQuery streaming or ScyllaDB).
10× driversSwitch from 2D geohash to H3 (hexagonal cells, Uber's open-source). Better uniformity at high density. Re-shard the heartbeat ingest by cell-prefix.
Sub-second match SLAPre-rank drivers per cell in the background (top-N waiting list per cell, refreshed every 5 s). Dispatch becomes a cache lookup instead of a recompute.
Strict regulator (e.g. London)Per-region driver-availability audit log. Mandatory rest-period enforcement (pause heartbeat acceptance). Receipt-level data residency.
Multi-modal (car + bike + scooter)Separate vehicle-type sub-indexes per cell. Type-specific match radius (bikes: 500 m, cars: 1.5 km). Surge per type.
What would a senior+ answer add?The matching-fairness layer. Greedy nearest-driver maximises efficiency but starves the drivers far from city-centre. A real Uber dispatch uses a bipartite-matching solver every few seconds across a small set of pending requests + candidate drivers — what's called batch matching. That's the senior-level wrinkle.

Further reading

  • Uber Engineering — "H3: Uber's Hexagonal Hierarchical Spatial Index" (2018). The geospatial index Uber moved to. Why hexagons beat squares (uniform distance to neighbours; better for k-NN).
  • Uber Engineering — "Marketplace: Matching Drivers and Riders". Public writeup of the batch-matching algorithm. Hungarian algorithm at the core.
  • Google S2 Geometry library docs. The math behind cell-IDs from Hilbert curves on the sphere. Worth one read if you'd bring it up in interviews.
  • Lyft Engineering — "Real-time Driver-Rider Matching". The other side's writeup — different in detail but the same architectural shape.
  • DoorDash — "Matching algorithms at scale" (2022). Slightly different problem (dasher-driver-rider three-sided) but useful for the matching-as-optimisation framing.
  • Adjacent: Consistent hashing. Used for sharding the driver fleet across geo-index nodes.
  • Adjacent: Spatial indexing. Standalone mechanism explainer for geohash, S2, quadtrees.
  • Adjacent: Notifications. The WebSocket / push fleet that ride matching reuses.
Found this useful?