Ride matching
An Uber-shape problem. Drivers heartbeat their location every few seconds; riders request a trip; the system has to pick a driver, quote an ETA, and run the trip through its lifecycle. The depth comes from the geospatial index — geohash, S2, quadtree — and the dispatch loop's failure modes when a driver declines, when two riders compete for the same driver, and when a region surges.
1 · Clarifying questions
Five minutes of disambiguation before any design work. The interviewer is partly watching to see if you ask before you assume.
| "Cities or global?" | Global, but matching is a single-city / single-region operation. Cross-region matching doesn't exist in the real product. |
| "How many drivers / riders?" | 10M active drivers worldwide, ~3M heartbeating right now. 100M monthly riders, 10M MAUs in flight on a peak Friday evening. |
| "Heartbeat frequency?" | Every 4 seconds when on-duty; every 30 seconds when in a trip (no need for fine resolution). |
| "Match radius?" | Start at 1.5 km; expand to 3 km, then 5 km if no driver accepts. Reject after ~10 seconds with no match. |
| "Latency budget?" | End-to-end match (rider tap → driver accept push) under 2 seconds P95. Dispatch logic itself under 300 ms. |
| "Multi-rider / pool?" | Out of scope for this design — single rider, single driver. Pool/UberX-shared is a different matching algorithm. |
| "Pricing in scope?" | Surge multiplier is in scope (it's a regional state). Fare calculation per route is not. |
| "Map / routing in scope?" | Use it as a service. Treat getETA(from, to) as a 50 ms blackbox call to the maps team. |
2 · Capacity math
Five numbers, derived from the clarifications.
| Number | Derivation | Value |
|---|---|---|
| Driver heartbeats | 3M drivers × 1 / 4 s | ~750k QPS |
| Trip requests | 10M MAU × 1.5 trips/day ÷ 86 400 s, peak factor 5× | ~900 QPS sustained, 4500 QPS peak |
| Dispatch fan-out | Per request: 3–10 candidates × 1 RPC each | ~30k driver-RPCs / second peak |
| Geo-index hot data | 3M drivers × 64 bytes (lat, lng, ID, state, heading) | ~200 MB — fits in one Redis shard's RAM |
| Trip log writes | ~900 trips/s × ~10 lifecycle events | ~9k writes/s to the trips table |
The 750k heartbeats/s is the load to design for. Every other QPS number is at least 100× smaller. The choice of geospatial index has to handle a sustained write rate at that scale.
3 · The geospatial-index choice
The single biggest design decision. Four mainstream options:
| Index | Shape | Write cost | Read cost | Range query |
|---|---|---|---|---|
| Geohash | Z-order base-32 string (e.g. 9q8yyk) | O(1) — prefix lookup | O(1) | Tricky near cell boundaries — must query 9 cells |
| Google S2 | 64-bit cell ID via Hilbert curve on the sphere | O(1) | O(1) | Native spherical math; built-in cap/region cover |
| Quadtree | Recursive 2D subdivision | O(log n) — tree descent | O(log n) | Natural for k-NN; harder to shard |
| R-tree | Bounding-box tree | O(log n) | O(log n) | Best for arbitrary polygons; overkill for points |
For ride matching the choice is between geohash (simplest) and S2 (Google's, used by Uber's H3 successor — H3 is hex-cell-based but same family). Both store driver positions in O(1) cells; both shard by cell-ID prefix.
The hot-data layer is Redis with geospatial commands
(GEOADD, GEORADIUS, GEOSEARCH) — these
internally use a sorted-set keyed by geohash, with the score being the geohash
integer. Sharding is by region (Manhattan, San Francisco, Tokyo are separate Redis
clusters); within a region, by geohash-prefix. The full picture sits behind a
"location service" layer that abstracts the storage so the dispatch loop sees a
single nearbyDrivers(lat, lng, radius_m) RPC.
4 · API and data model
External APIs (rider / driver apps)
// Driver side
PUT /driver/v1/heartbeat
body: { driver_id, lat, lng, heading, state: "available|busy|offline" }
response: { acknowledged_at }
POST /driver/v1/trip/{trip_id}/accept // driver responds to dispatch
POST /driver/v1/trip/{trip_id}/decline
POST /driver/v1/trip/{trip_id}/arrive
// Rider side
POST /rider/v1/trip
body: { rider_id, pickup_lat, pickup_lng, dest_lat, dest_lng, vehicle_type }
response: { trip_id, status, estimated_pickup_seconds, surge_multiplier }
GET /rider/v1/trip/{trip_id} // poll; or use websocket push
POST /rider/v1/trip/{trip_id}/cancelCore tables (just the columns that matter)
drivers (id PK, name, vehicle_type, rating, current_region)
driver_status (driver_id PK, state, last_heartbeat_at, last_lat, last_lng)
trips (id PK, rider_id, driver_id, state, requested_at, pickup_geohash,
dropoff_geohash, surge_multiplier, fare, region)
trip_events (trip_id, ts, event_type, payload_json) -- the audit log
dispatch_queue (trip_id PK, attempts, last_offered_driver_id, expires_at)
regional_surge (region, multiplier, updated_at)Two tiers of storage. Hot tier: Redis (driver positions, dispatch queue). Warm tier: a sharded SQL store like Vitess or a managed Spanner (trips, trip_events). The trip_events table is append-only and is the source of truth for the trip lifecycle — every state change is a row.
5 · High-level architecture
The dispatch service is the hot path. It calls the location service to find nearby drivers, asks surge for the regional multiplier, asks maps for ETA, and writes the trip lifecycle to Spanner / Vitess. Driver heartbeats flow directly to the location service (skipping the API gateway's per-request auth — the gateway is bypassed by an internal heartbeat ingest, secured with mTLS).
6 · Deep dive — the dispatch loop
This is the part interviewers will probe. The loop is short to describe but has six distinct race conditions.
The happy path, sequentially
- Rider's POST hits the trip service; trip is created with state
PENDING_DISPATCHand assigned an idempotency key. - Trip service emits to a per-region Kafka topic; the regional dispatch worker picks it up.
- Dispatch worker asks the location service: "
nearbyDrivers(pickup_lat, pickup_lng, radius=1.5 km, max=10, type=UberX)". - Worker ranks the candidates — heuristic: ETA + driver acceptance rate + match-fairness score. Picks the top one.
- Worker reserves the driver in Redis with a TTL (10 s), changes trip state to
OFFERED, and pushes the offer to the driver app over WebSocket. - Driver app responds
ACCEPT(or doesn't). On accept: state →ASSIGNED, ride lifecycle begins. On decline / timeout: release the reservation, pick the next candidate, re-offer (up to N tries, then expand the radius). - Trip state moves through
ARRIVING → PICKED_UP → ON_TRIP → COMPLETEDas the driver app posts events.
The six races
| Race | How it manifests | Mitigation |
|---|---|---|
| Two riders, one driver | Two dispatch workers offer the same driver simultaneously | Atomic SETNX on the driver-reservation key. The loser falls back to the next candidate. |
| Driver double-accepts | Network retry on accept RPC | Accept is idempotent on (driver_id, trip_id). Second accept returns the same state. |
| Stale driver heartbeat | Driver's phone died; index still shows them as available | TTL on heartbeat entries — 15 seconds. Stale entries auto-expire from the geo index. |
| Rider cancels mid-dispatch | Rider taps cancel after offer sent but before driver accept | Cancel atomically transitions trip state. Driver-app receives OFFER_REVOKED push; dispatch loop exits. |
| Driver app crashes mid-trip | Trip in ON_TRIP but no heartbeats arriving | Separate "trip health" worker watches for missing-heartbeat-during-trip; alerts the safety team after 60 s. |
| Region surge spike | Demand suddenly 10×; every dispatch loop expands radius and exhausts candidates | Surge service raises the multiplier; some riders drop out at the new price; supply re-equilibrates. Slow but self-stabilising. |
7 · Deep dive — surge pricing
Surge is a regional, slowly-changing state. The dispatch loop reads the regional multiplier per request; the surge service updates it every ~30 seconds based on a demand:supply ratio.
| Component | Signal | Update rate |
|---|---|---|
| Demand | Trip-request rate in the region (last 1 min) | 30 s sliding window |
| Supply | Available-driver count in the region (live geo index) | 5 s sampling |
| Multiplier | min(3.0, demand / supply × tuned_coefficient) | 30 s refresh |
| Storage | Cell-keyed Redis hash (region → multiplier) | — |
| Audit | Every update emitted to Kafka; OLAP rollup nightly | — |
The interview signal is acknowledging that surge needs hysteresis — without it, a flash of demand can spike then crash the price in seconds, which is a terrible UX. A simple low-pass filter (EMA over a 5-minute window) gets you most of the way. The "fairness" wrinkle — surge applied to a request that completes a minute later — is a product question, not a design one.
8 · Failure modes
| Failure | Symptom | Runbook |
|---|---|---|
| Redis geo-shard down | One region's dispatch goes to zero match-rate | Failover to replica (RTO ~10 s). If replica also down, dispatch returns "no drivers available"; trip stays in PENDING and retries after 30 s. Don't lose the rider request. |
| Spanner / Vitess shard partition | Trip-state writes fail for some region | Buffer trip events in Kafka; replay when shard returns. Trip lifecycle continues from in-memory state; persistence is eventually-consistent during the partition. |
| Maps service slow | P99 ETA latency climbs from 50 ms to 5 s | Circuit-break Maps; fall back to a coarse Haversine distance / 30 km h estimate. Worse UX but the dispatch loop completes. |
| Driver app reconnects | WebSocket reconnect storm after a network blip | Sharded WebSocket fleet with sticky sessions; exponential backoff in the app; per-region rate limit on connect. |
| Hot region (sports event) | One region's Redis hits CPU 95% | Per-region cluster sized for 5× normal peak. Vertical scale-up is fast (Redis Cluster). The choke point is the dispatch service, not Redis — autoscale dispatch workers on Kafka lag. |
| Fraudulent driver heartbeats | Driver app modified to spam heartbeats at false positions | Server-side validation: reject heartbeats moving faster than 200 km h. Cross-check with map-matching service. Suspicious driver routed to safety review. |
9 · Cost & operability
| Component | Driver per month | Order of magnitude |
|---|---|---|
| Redis geo-index | 5 regions × 8 nodes × ~$300 | ~$12k |
| Spanner / Vitess | ~10k writes/s × 1 KB × 30 days × storage tier | ~$30k |
| Kafka | ~50k events/s × peak factor | ~$8k |
| WebSocket fleet | 10M concurrent connections × shard ratio | ~$25k |
| Maps service (external bill) | ~900 ETA calls/s × $0.005 / 1000 calls | ~$10k |
| Compute (dispatch + trip) | ~200 cores worth, multi-region | ~$15k |
The maps bill is the surprise line. Reduce it with aggressive caching of common origin–destination pairs and approximate (straight-line × correction factor) ETAs during low-stakes phases (rider sees "5–8 min"); call the real maps service only at dispatch time.
10 · Trade-offs & what would change at 10×
| If… | Then… |
|---|---|
| 10× trip volume | Shard Redis geo by sub-region (Manhattan East vs West). Add a per-cell dispatch worker pool. Move trip events from Spanner to a write-optimised log (BigQuery streaming or ScyllaDB). |
| 10× drivers | Switch from 2D geohash to H3 (hexagonal cells, Uber's open-source). Better uniformity at high density. Re-shard the heartbeat ingest by cell-prefix. |
| Sub-second match SLA | Pre-rank drivers per cell in the background (top-N waiting list per cell, refreshed every 5 s). Dispatch becomes a cache lookup instead of a recompute. |
| Strict regulator (e.g. London) | Per-region driver-availability audit log. Mandatory rest-period enforcement (pause heartbeat acceptance). Receipt-level data residency. |
| Multi-modal (car + bike + scooter) | Separate vehicle-type sub-indexes per cell. Type-specific match radius (bikes: 500 m, cars: 1.5 km). Surge per type. |
| What would a senior+ answer add? | The matching-fairness layer. Greedy nearest-driver maximises efficiency but starves the drivers far from city-centre. A real Uber dispatch uses a bipartite-matching solver every few seconds across a small set of pending requests + candidate drivers — what's called batch matching. That's the senior-level wrinkle. |
Further reading
- Uber Engineering — "H3: Uber's Hexagonal Hierarchical Spatial Index" (2018). The geospatial index Uber moved to. Why hexagons beat squares (uniform distance to neighbours; better for k-NN).
- Uber Engineering — "Marketplace: Matching Drivers and Riders". Public writeup of the batch-matching algorithm. Hungarian algorithm at the core.
- Google S2 Geometry library docs. The math behind cell-IDs from Hilbert curves on the sphere. Worth one read if you'd bring it up in interviews.
- Lyft Engineering — "Real-time Driver-Rider Matching". The other side's writeup — different in detail but the same architectural shape.
- DoorDash — "Matching algorithms at scale" (2022). Slightly different problem (dasher-driver-rider three-sided) but useful for the matching-as-optimisation framing.
- Adjacent: Consistent hashing. Used for sharding the driver fleet across geo-index nodes.
- Adjacent: Spatial indexing. Standalone mechanism explainer for geohash, S2, quadtrees.
- Adjacent: Notifications. The WebSocket / push fleet that ride matching reuses.