17 / 19
Playbook / 17

Design Mint

Aggregate every user's bank, brokerage, credit card, and loan account into one place. The technical core is unglamorous — scheduled sync jobs against thousands of partner APIs, each with its own auth model, error semantics, and rate limits. The interesting parts: making retries safe when transactions are involved, categorising spend without seeing every user's data, and never losing a transaction.


1 · Clarifying questions

Functional scope?Connect financial accounts (via Plaid-shape aggregator), pull transactions on a schedule, categorise, show net worth + budgets, alert on unusual activity. Read-only access; no money movement.
Scale?30M users, ~5 accounts each → 150M connected accounts. Each refreshed nightly + on-demand → ~150M sync jobs/day. ~50 transactions/account/month average → ~250M new transactions/day.
Latency?Login-to-dashboard P99 ≤ 500 ms (data is already in our DB). On-demand sync ≤ 30 s end-to-end. Nightly sync runs over a 6-hour window.
Consistency?Per-user strong consistency on balances + transactions. A double-count is much worse than a 1-minute-stale balance.
Failure tolerance?Partner APIs fail constantly (small banks especially). Per-bank availability matters more than per-user.
Compliance?SOC 2, PCI-DSS-adjacent (we don't store card numbers but we do store account/routing). State-by-state regulation. Encryption at rest is mandatory; encryption-in-use is the bar at this tier.
Multi-region?US-only product; multi-region within US for DR.

2 · Capacity math, on a napkin

NumberCalculationResult
Usersgiven30M
Connected accounts30M × 5150M
Sync jobs/day150M nightly + ~10% on-demand~165M/day
Sync QPS (avg / window)165M / 6 h window~7,600 sustained over the night
Sync peak (during work hours)on-demand bursts~2K QPS
New transactions/day150M × 50 / 30~250M
Transaction size (row)id + amount + meta + categorisation~600 B
New storage/day250M × 600 B × 3 (repl + index)~450 GB/day → ~165 TB/year
Read QPS dashboards30M × 5 logins × 10 reads / 86,400~17K reads/sec average
Partner API rate limitsvaries: 10–1000 req/s per bankbiggest design constraint

The dominant constraint isn't our infrastructure — it's the partner APIs. Each bank has its own rate limit; some are 10 req/sec across all our users. The scheduler is built around respecting those limits while still finishing the nightly sweep on time.

3 · API and data model

Public API

POST /v1/accounts/link                  # initiate bank connection
{ "institution_id": "chase", "credential_type": "oauth" }
→ 200 { "link_token": "lt_...", "expires_at": "..." }
# Client uses link_token in partner SDK; we get a callback with credentials_id

POST /v1/accounts/link/complete
{ "link_token": "lt_...", "credentials_id": "cred_..." }
→ 201 { "account_ids": ["a_...", "a_..."], "sync_started": true }

POST /v1/accounts/:id/sync              # on-demand refresh
→ 202 { "job_id": "j_..." }

GET  /v1/accounts/:id                   # balance + last sync
GET  /v1/accounts/:id/transactions      # paginated
  ?since=2026-05-01&limit=100

GET  /v1/categories
PUT  /v1/transactions/:id/category      # user override

Storage

institutions                    -- one row per supported bank
  institution_id  VARCHAR(64) PK
  name            TEXT
  partner         VARCHAR(32)   -- plaid | finicity | direct
  rate_limit_qps  INT
  oauth_config    JSONB
  status          VARCHAR(16)   -- live | degraded | down

credentials                     -- encrypted secrets per user×institution
  credentials_id  UUID PK
  user_id         BIGINT
  institution_id  VARCHAR(64)
  encrypted_token BYTEA         -- envelope-encrypted; KMS-managed key
  encryption_kid  VARCHAR(64)
  status          VARCHAR(16)   -- active | expired | revoked

accounts                        -- one row per connected account
  account_id      UUID PK
  user_id         BIGINT
  credentials_id  UUID
  type            VARCHAR(16)   -- checking | savings | credit | brokerage | loan
  partner_id      TEXT          -- partner's id; UNIQUE with credentials_id
  balance         DECIMAL(18,2)
  available       DECIMAL(18,2)
  currency        CHAR(3)
  last_synced_at  TIMESTAMP
  next_sync_at    TIMESTAMP     -- scheduler index
  status          VARCHAR(16)
  INDEX (next_sync_at, status)  -- drives the scheduler

transactions                    -- partitioned by user_id, then date
  txn_id          UUID PK
  account_id      UUID
  partner_txn_id  TEXT          -- UNIQUE (account_id, partner_txn_id)
                                -- the idempotency key
  amount          DECIMAL(18,2)
  currency        CHAR(3)
  posted_at       DATE          -- partition key suffix
  description     TEXT
  merchant        TEXT
  category_id     VARCHAR(32)
  user_category   VARCHAR(32)   -- user override; takes precedence
  INDEX (account_id, posted_at DESC)

sync_jobs                       -- audit trail
  job_id          UUID PK
  account_id      UUID
  started_at, finished_at TIMESTAMP
  status          VARCHAR(16)   -- queued | running | success | failed
  error_class     VARCHAR(64)
  txns_added      INT

4 · High-level architecture

Scheduler scans accounts where next_sync_at <= now and enqueues sync jobs to a queue partitioned by institution_id. Each institution has a dedicated worker pool sized to its rate limit — overflow waits in queue rather than overwhelming the partner. Workers fetch new transactions, dedupe by partner_txn_id, write to Postgres, then categorise. Credentials are envelope-encrypted with KMS keys; workers decrypt on demand and never log plaintext.

5 · The hard part — idempotent sync against fragile partners

Why idempotency is hard here

A sync run pulls transactions, writes them to our DB, then updates last_synced_at. If the worker crashes after writing some transactions but before updating last_synced_at, the next run will pull the same transactions again. Without idempotency we either lose data (skip-ahead) or double-count it (re-insert).

The idempotency key is the partner's transaction ID — every bank gives one back. The transactions table has a unique index on (account_id, partner_txn_id). Re-inserts become no-ops. Updates use an UPSERT pattern:

INSERT INTO transactions
  (txn_id, account_id, partner_txn_id, amount, posted_at, description, ...)
VALUES (...)
ON CONFLICT (account_id, partner_txn_id)
DO UPDATE SET
  amount = EXCLUDED.amount,
  description = EXCLUDED.description,
  posted_at = EXCLUDED.posted_at,
  updated_at = now()
RETURNING txn_id;

Pending vs posted

Most banks return pending transactions that later turn into posted transactions with a different (or sometimes the same!) partner ID. The system has to recognise that "Pending: STARBUCKS $4.20" and "Posted: STARBUCKS $4.20 2026-05-21" are the same transaction. Heuristics: same account, same amount, same merchant string, posted-date close to pending-date. When a posted transaction matches a pending one, the pending row is deleted and the posted one wins.

Rate-limit-aware scheduling

One bank with 10 req/sec across all our users and 5M of our users on that bank → finishing the nightly sweep takes 5,000,000 / 10 / 86,400 ≈ 5.8 days, which doesn't fit a 6-hour window. The fix: stagger refresh frequency by institution. High-rate-limit banks (Chase, BofA, Wells Fargo) refresh nightly. Low-rate-limit small banks refresh every 3 days; on-demand refresh always works, regardless.

The scheduler maintains a per-institution token bucket. Workers acquire a token before each partner call, sleep if the bucket is empty. The bucket replenishes at the institution's rate limit. Centralised in Redis so all worker instances share the same view.

Retry classification

ErrorRetry?How long
RATE_LIMITEDYesExponential backoff respecting Retry-After
INVALID_CREDENTIALSNoMark credentials expired; prompt user to re-auth
MFA_REQUIREDNoPush notification to user; suspend sync until resolved
INSTITUTION_DOWNYesRetry in 1 h, then 6 h, then daily; banner in UI
TIMEOUTYesUp to 3 retries; classify as INSTITUTION_DOWN after
UNKNOWN_5xxYesExponential backoff up to 24 h

6 · Categorisation

Each transaction gets a category (Groceries, Rent, Dining, Transport, …). Three layers:

  • User override. If the user has corrected this merchant before, use that. Saved per (user, merchant) with a normalised merchant string.
  • Global rules. Maintained list mapping merchant strings to categories ("STARBUCKS" → Dining). Curated centrally, updated weekly.
  • ML model. Trained on aggregate categorised data. Runs only on transactions that miss the override and rules. Outputs category + confidence; below threshold, the transaction is left "Uncategorised" until the user labels it.

Privacy: the model is trained on aggregated, anonymised data. Per-user behaviour is never used for cross-user inference. Compliance audits this regularly.

7 · Failure modes & runbook

FailureSymptomMitigation
Partner returns wrong dataBalance jumps to wildly wrong valueSanity check vs prior balance; if delta > threshold, mark account "needs review", don't overwrite. Alert ops.
Partner returns duplicate IDsSame partner_txn_id, different detailsTreat as update (UPSERT); flag the row for review if material details change.
Bank silently breaks API contractSync starts returning empty results for all users on that bankPer-institution health monitor; pages on-call when success rate < 95%. Often requires partner engagement.
Credentials expireSync fails with INVALID_CREDENTIALSEmail + in-app prompt to re-link. Stop trying after 3 failures.
KMS unavailableCannot decrypt credentials; sync haltsKMS is multi-region with auto-failover; this is a P1 incident if it happens.
Scheduler stallsNightly sweep doesn't complete by morningAuto-scale workers; alert if queue depth doesn't drain by SLO time. Stagger institutions to spread load.
Categorisation model driftUsers see wrong categories after a deployCanary the model on 5% of users; compare correction rate vs. control. Auto-rollback on regression.
Data leak risk (sensitive logs)Audit reveals plaintext PAN in error logsLog redaction filter at the SDK level; CI test that scans logs for known regexes. Treat any incident as a P0.

8 · Cost & SLOs

LineEstimateNote
API + scheduler + workers (~500 pods)~$20K/monthStateless; bursty during the sync window
Postgres / Aurora (sharded by user)~$40K/month~5 TB hot; tier older transactions to warehouse
Redis (token buckets + dedupe cache)~$3K/monthSmall; mostly in-memory rate-limit state
Partner API fees (Plaid, Finicity)~$300K/monthPer-user × per-account fees — the dominant line
KMS operations~$5K/monthOne decrypt per sync per account
Data warehouse + categorisation training~$15K/monthSnowflake-shape; ML training on aggregates

SLOs

  • Dashboard load P99: 500 ms. Cached aggregates per user; recomputed on transaction insert.
  • On-demand sync P99: 30 s. User taps Refresh; we expect to be done in 30 s for top-10 banks; longer for niche institutions.
  • Nightly sweep completion: 100% of accounts by 8 AM local. Hard SLO; missing it means stale dashboards at morning peak.
  • Transaction freshness P99: 24 h. Bank-dependent; some banks publish only daily.
  • Data durability: 11 nines. Multi-region Aurora; daily backups; quarterly restore drills.
  • Zero plaintext credential persistence. Verified by quarterly audit.

9 · Trade-offs & "what would you change at 10×"

If…Then…
10× users (300M)Per-institution worker pools become large enough to fill capacity. Partner fees become the dominant strategic conversation — direct integration with the top-20 banks instead of via aggregator.
Add money movement (ACH, Zelle)Whole new compliance surface: BSA/AML, KYC, transaction monitoring, OFAC. The technical changes are small; the org changes are huge. Out-of-scope here.
Real-time push instead of pollingSome banks support webhooks; subscribe where possible. Removes the rate-limit problem but creates a delivery-reliability problem (we have to verify the webhook actually fired).
Open Banking (PSD2-style mandate)The whole architecture gets simpler if regulators mandate a standard API. Most of the per-institution complexity goes away.
Investment-account specificsBrokerages have positions + cost basis + lot tracking + corporate actions (stock splits, mergers). A whole sub-system; usually a separate workstream from cash-flow features.
"What would a more senior answer add?"The risk + observability story: per-institution error budgets, fraud-detection layer (unusual transactions flagged before the user sees them), the data-subject-rights workflow (GDPR-style "delete all my data" with proof), the SOC 2 evidence pipeline that proves the controls work. At this tier the regulator-facing story IS the product.

Further reading

  • Plaid Engineering — public posts on auth, transactions, identity verification. The clearest reference for the aggregator model this design assumes.
  • "Idempotent APIs" — Stripe Engineering. The canonical writeup on the idempotency-key pattern.
  • Truelayer / Tink engineering blogs. European open-banking-flavoured versions of the same design.
  • Adjacent: Distributed scheduler. The scheduling layer in depth.
  • Adjacent: Idempotence. Why retries are safe.
  • Adjacent: OAuth / OIDC. The credentials story.
  • Adjacent: Napkin math. The capacity math.
Next

Netflix — video streaming at scale

Adaptive bitrate, CDN economics, recommendation pipeline. The video-shape of every shape.

Next design
Found this useful?