Design Mint
Aggregate every user's bank, brokerage, credit card, and loan account into one place. The technical core is unglamorous — scheduled sync jobs against thousands of partner APIs, each with its own auth model, error semantics, and rate limits. The interesting parts: making retries safe when transactions are involved, categorising spend without seeing every user's data, and never losing a transaction.
1 · Clarifying questions
| Functional scope? | Connect financial accounts (via Plaid-shape aggregator), pull transactions on a schedule, categorise, show net worth + budgets, alert on unusual activity. Read-only access; no money movement. |
| Scale? | 30M users, ~5 accounts each → 150M connected accounts. Each refreshed nightly + on-demand → ~150M sync jobs/day. ~50 transactions/account/month average → ~250M new transactions/day. |
| Latency? | Login-to-dashboard P99 ≤ 500 ms (data is already in our DB). On-demand sync ≤ 30 s end-to-end. Nightly sync runs over a 6-hour window. |
| Consistency? | Per-user strong consistency on balances + transactions. A double-count is much worse than a 1-minute-stale balance. |
| Failure tolerance? | Partner APIs fail constantly (small banks especially). Per-bank availability matters more than per-user. |
| Compliance? | SOC 2, PCI-DSS-adjacent (we don't store card numbers but we do store account/routing). State-by-state regulation. Encryption at rest is mandatory; encryption-in-use is the bar at this tier. |
| Multi-region? | US-only product; multi-region within US for DR. |
2 · Capacity math, on a napkin
| Number | Calculation | Result |
|---|---|---|
| Users | given | 30M |
| Connected accounts | 30M × 5 | 150M |
| Sync jobs/day | 150M nightly + ~10% on-demand | ~165M/day |
| Sync QPS (avg / window) | 165M / 6 h window | ~7,600 sustained over the night |
| Sync peak (during work hours) | on-demand bursts | ~2K QPS |
| New transactions/day | 150M × 50 / 30 | ~250M |
| Transaction size (row) | id + amount + meta + categorisation | ~600 B |
| New storage/day | 250M × 600 B × 3 (repl + index) | ~450 GB/day → ~165 TB/year |
| Read QPS dashboards | 30M × 5 logins × 10 reads / 86,400 | ~17K reads/sec average |
| Partner API rate limits | varies: 10–1000 req/s per bank | biggest design constraint |
The dominant constraint isn't our infrastructure — it's the partner APIs. Each bank has its own rate limit; some are 10 req/sec across all our users. The scheduler is built around respecting those limits while still finishing the nightly sweep on time.
3 · API and data model
Public API
POST /v1/accounts/link # initiate bank connection
{ "institution_id": "chase", "credential_type": "oauth" }
→ 200 { "link_token": "lt_...", "expires_at": "..." }
# Client uses link_token in partner SDK; we get a callback with credentials_id
POST /v1/accounts/link/complete
{ "link_token": "lt_...", "credentials_id": "cred_..." }
→ 201 { "account_ids": ["a_...", "a_..."], "sync_started": true }
POST /v1/accounts/:id/sync # on-demand refresh
→ 202 { "job_id": "j_..." }
GET /v1/accounts/:id # balance + last sync
GET /v1/accounts/:id/transactions # paginated
?since=2026-05-01&limit=100
GET /v1/categories
PUT /v1/transactions/:id/category # user overrideStorage
institutions -- one row per supported bank
institution_id VARCHAR(64) PK
name TEXT
partner VARCHAR(32) -- plaid | finicity | direct
rate_limit_qps INT
oauth_config JSONB
status VARCHAR(16) -- live | degraded | down
credentials -- encrypted secrets per user×institution
credentials_id UUID PK
user_id BIGINT
institution_id VARCHAR(64)
encrypted_token BYTEA -- envelope-encrypted; KMS-managed key
encryption_kid VARCHAR(64)
status VARCHAR(16) -- active | expired | revoked
accounts -- one row per connected account
account_id UUID PK
user_id BIGINT
credentials_id UUID
type VARCHAR(16) -- checking | savings | credit | brokerage | loan
partner_id TEXT -- partner's id; UNIQUE with credentials_id
balance DECIMAL(18,2)
available DECIMAL(18,2)
currency CHAR(3)
last_synced_at TIMESTAMP
next_sync_at TIMESTAMP -- scheduler index
status VARCHAR(16)
INDEX (next_sync_at, status) -- drives the scheduler
transactions -- partitioned by user_id, then date
txn_id UUID PK
account_id UUID
partner_txn_id TEXT -- UNIQUE (account_id, partner_txn_id)
-- the idempotency key
amount DECIMAL(18,2)
currency CHAR(3)
posted_at DATE -- partition key suffix
description TEXT
merchant TEXT
category_id VARCHAR(32)
user_category VARCHAR(32) -- user override; takes precedence
INDEX (account_id, posted_at DESC)
sync_jobs -- audit trail
job_id UUID PK
account_id UUID
started_at, finished_at TIMESTAMP
status VARCHAR(16) -- queued | running | success | failed
error_class VARCHAR(64)
txns_added INT4 · High-level architecture
Scheduler scans accounts where next_sync_at <= now and enqueues sync jobs to a queue partitioned by institution_id. Each institution has a dedicated worker pool sized to its rate limit — overflow waits in queue rather than overwhelming the partner. Workers fetch new transactions, dedupe by partner_txn_id, write to Postgres, then categorise. Credentials are envelope-encrypted with KMS keys; workers decrypt on demand and never log plaintext.
5 · The hard part — idempotent sync against fragile partners
Why idempotency is hard here
A sync run pulls transactions, writes them to our DB, then updates last_synced_at. If the worker crashes after writing some transactions but before updating last_synced_at, the next run will pull the same transactions again. Without idempotency we either lose data (skip-ahead) or double-count it (re-insert).
The idempotency key is the partner's transaction ID — every bank gives one back. The transactions table has a unique index on (account_id, partner_txn_id). Re-inserts become no-ops. Updates use an UPSERT pattern:
INSERT INTO transactions
(txn_id, account_id, partner_txn_id, amount, posted_at, description, ...)
VALUES (...)
ON CONFLICT (account_id, partner_txn_id)
DO UPDATE SET
amount = EXCLUDED.amount,
description = EXCLUDED.description,
posted_at = EXCLUDED.posted_at,
updated_at = now()
RETURNING txn_id;Pending vs posted
Most banks return pending transactions that later turn into posted transactions with a different (or sometimes the same!) partner ID. The system has to recognise that "Pending: STARBUCKS $4.20" and "Posted: STARBUCKS $4.20 2026-05-21" are the same transaction. Heuristics: same account, same amount, same merchant string, posted-date close to pending-date. When a posted transaction matches a pending one, the pending row is deleted and the posted one wins.
Rate-limit-aware scheduling
One bank with 10 req/sec across all our users and 5M of our users on that bank → finishing the nightly sweep takes 5,000,000 / 10 / 86,400 ≈ 5.8 days, which doesn't fit a 6-hour window. The fix: stagger refresh frequency by institution. High-rate-limit banks (Chase, BofA, Wells Fargo) refresh nightly. Low-rate-limit small banks refresh every 3 days; on-demand refresh always works, regardless.
The scheduler maintains a per-institution token bucket. Workers acquire a token before each partner call, sleep if the bucket is empty. The bucket replenishes at the institution's rate limit. Centralised in Redis so all worker instances share the same view.
Retry classification
| Error | Retry? | How long |
|---|---|---|
RATE_LIMITED | Yes | Exponential backoff respecting Retry-After |
INVALID_CREDENTIALS | No | Mark credentials expired; prompt user to re-auth |
MFA_REQUIRED | No | Push notification to user; suspend sync until resolved |
INSTITUTION_DOWN | Yes | Retry in 1 h, then 6 h, then daily; banner in UI |
TIMEOUT | Yes | Up to 3 retries; classify as INSTITUTION_DOWN after |
UNKNOWN_5xx | Yes | Exponential backoff up to 24 h |
6 · Categorisation
Each transaction gets a category (Groceries, Rent, Dining, Transport, …). Three layers:
- User override. If the user has corrected this merchant before, use that. Saved per (user, merchant) with a normalised merchant string.
- Global rules. Maintained list mapping merchant strings to categories ("STARBUCKS" → Dining). Curated centrally, updated weekly.
- ML model. Trained on aggregate categorised data. Runs only on transactions that miss the override and rules. Outputs category + confidence; below threshold, the transaction is left "Uncategorised" until the user labels it.
Privacy: the model is trained on aggregated, anonymised data. Per-user behaviour is never used for cross-user inference. Compliance audits this regularly.
7 · Failure modes & runbook
| Failure | Symptom | Mitigation |
|---|---|---|
| Partner returns wrong data | Balance jumps to wildly wrong value | Sanity check vs prior balance; if delta > threshold, mark account "needs review", don't overwrite. Alert ops. |
| Partner returns duplicate IDs | Same partner_txn_id, different details | Treat as update (UPSERT); flag the row for review if material details change. |
| Bank silently breaks API contract | Sync starts returning empty results for all users on that bank | Per-institution health monitor; pages on-call when success rate < 95%. Often requires partner engagement. |
| Credentials expire | Sync fails with INVALID_CREDENTIALS | Email + in-app prompt to re-link. Stop trying after 3 failures. |
| KMS unavailable | Cannot decrypt credentials; sync halts | KMS is multi-region with auto-failover; this is a P1 incident if it happens. |
| Scheduler stalls | Nightly sweep doesn't complete by morning | Auto-scale workers; alert if queue depth doesn't drain by SLO time. Stagger institutions to spread load. |
| Categorisation model drift | Users see wrong categories after a deploy | Canary the model on 5% of users; compare correction rate vs. control. Auto-rollback on regression. |
| Data leak risk (sensitive logs) | Audit reveals plaintext PAN in error logs | Log redaction filter at the SDK level; CI test that scans logs for known regexes. Treat any incident as a P0. |
8 · Cost & SLOs
| Line | Estimate | Note |
|---|---|---|
| API + scheduler + workers (~500 pods) | ~$20K/month | Stateless; bursty during the sync window |
| Postgres / Aurora (sharded by user) | ~$40K/month | ~5 TB hot; tier older transactions to warehouse |
| Redis (token buckets + dedupe cache) | ~$3K/month | Small; mostly in-memory rate-limit state |
| Partner API fees (Plaid, Finicity) | ~$300K/month | Per-user × per-account fees — the dominant line |
| KMS operations | ~$5K/month | One decrypt per sync per account |
| Data warehouse + categorisation training | ~$15K/month | Snowflake-shape; ML training on aggregates |
SLOs
- Dashboard load P99: 500 ms. Cached aggregates per user; recomputed on transaction insert.
- On-demand sync P99: 30 s. User taps Refresh; we expect to be done in 30 s for top-10 banks; longer for niche institutions.
- Nightly sweep completion: 100% of accounts by 8 AM local. Hard SLO; missing it means stale dashboards at morning peak.
- Transaction freshness P99: 24 h. Bank-dependent; some banks publish only daily.
- Data durability: 11 nines. Multi-region Aurora; daily backups; quarterly restore drills.
- Zero plaintext credential persistence. Verified by quarterly audit.
9 · Trade-offs & "what would you change at 10×"
| If… | Then… |
|---|---|
| 10× users (300M) | Per-institution worker pools become large enough to fill capacity. Partner fees become the dominant strategic conversation — direct integration with the top-20 banks instead of via aggregator. |
| Add money movement (ACH, Zelle) | Whole new compliance surface: BSA/AML, KYC, transaction monitoring, OFAC. The technical changes are small; the org changes are huge. Out-of-scope here. |
| Real-time push instead of polling | Some banks support webhooks; subscribe where possible. Removes the rate-limit problem but creates a delivery-reliability problem (we have to verify the webhook actually fired). |
| Open Banking (PSD2-style mandate) | The whole architecture gets simpler if regulators mandate a standard API. Most of the per-institution complexity goes away. |
| Investment-account specifics | Brokerages have positions + cost basis + lot tracking + corporate actions (stock splits, mergers). A whole sub-system; usually a separate workstream from cash-flow features. |
| "What would a more senior answer add?" | The risk + observability story: per-institution error budgets, fraud-detection layer (unusual transactions flagged before the user sees them), the data-subject-rights workflow (GDPR-style "delete all my data" with proof), the SOC 2 evidence pipeline that proves the controls work. At this tier the regulator-facing story IS the product. |
Further reading
- Plaid Engineering — public posts on auth, transactions, identity verification. The clearest reference for the aggregator model this design assumes.
- "Idempotent APIs" — Stripe Engineering. The canonical writeup on the idempotency-key pattern.
- Truelayer / Tink engineering blogs. European open-banking-flavoured versions of the same design.
- Adjacent: Distributed scheduler. The scheduling layer in depth.
- Adjacent: Idempotence. Why retries are safe.
- Adjacent: OAuth / OIDC. The credentials story.
- Adjacent: Napkin math. The capacity math.