17 / 19

Playbook / 17

Design Mint

Aggregate every user's bank, brokerage, credit card, and loan account into one place. The technical core is unglamorous — scheduled sync jobs against thousands of partner APIs, each with its own auth model, error semantics, and rate limits. The interesting parts: making retries safe when transactions are involved, categorising spend without seeing every user's data, and never losing a transaction.

1 · Clarifying questions

Functional scope?	Connect financial accounts (via Plaid-shape aggregator), pull transactions on a schedule, categorise, show net worth + budgets, alert on unusual activity. Read-only access; no money movement.
Scale?	30M users, ~5 accounts each → 150M connected accounts. Each refreshed nightly + on-demand → ~150M sync jobs/day. ~50 transactions/account/month average → ~250M new transactions/day.
Latency?	Login-to-dashboard P99 ≤ 500 ms (data is already in our DB). On-demand sync ≤ 30 s end-to-end. Nightly sync runs over a 6-hour window.
Consistency?	Per-user strong consistency on balances + transactions. A double-count is much worse than a 1-minute-stale balance.
Failure tolerance?	Partner APIs fail constantly (small banks especially). Per-bank availability matters more than per-user.
Compliance?	SOC 2, PCI-DSS-adjacent (we don't store card numbers but we do store account/routing). State-by-state regulation. Encryption at rest is mandatory; encryption-in-use is the bar at this tier.
Multi-region?	US-only product; multi-region within US for DR.

2 · Capacity math, on a napkin

Number	Calculation	Result
Users	given	30M
Connected accounts	30M × 5	150M
Sync jobs/day	150M nightly + ~10% on-demand	~165M/day
Sync QPS (avg / window)	165M / 6 h window	~7,600 sustained over the night
Sync peak (during work hours)	on-demand bursts	~2K QPS
New transactions/day	150M × 50 / 30	~250M
Transaction size (row)	id + amount + meta + categorisation	~600 B
New storage/day	250M × 600 B × 3 (repl + index)	~450 GB/day → ~165 TB/year
Read QPS dashboards	30M × 5 logins × 10 reads / 86,400	~17K reads/sec average
Partner API rate limits	varies: 10–1000 req/s per bank	biggest design constraint

The dominant constraint isn't our infrastructure — it's the partner APIs. Each bank has its own rate limit; some are 10 req/sec across all our users. The scheduler is built around respecting those limits while still finishing the nightly sweep on time.

3 · API and data model

Public API

POST /v1/accounts/link                  # initiate bank connection
{ "institution_id": "chase", "credential_type": "oauth" }
→ 200 { "link_token": "lt_...", "expires_at": "..." }
# Client uses link_token in partner SDK; we get a callback with credentials_id

POST /v1/accounts/link/complete
{ "link_token": "lt_...", "credentials_id": "cred_..." }
→ 201 { "account_ids": ["a_...", "a_..."], "sync_started": true }

POST /v1/accounts/:id/sync              # on-demand refresh
→ 202 { "job_id": "j_..." }

GET  /v1/accounts/:id                   # balance + last sync
GET  /v1/accounts/:id/transactions      # paginated
  ?since=2026-05-01&limit=100

GET  /v1/categories
PUT  /v1/transactions/:id/category      # user override

Storage

institutions                    -- one row per supported bank
  institution_id  VARCHAR(64) PK
  name            TEXT
  partner         VARCHAR(32)   -- plaid | finicity | direct
  rate_limit_qps  INT
  oauth_config    JSONB
  status          VARCHAR(16)   -- live | degraded | down

credentials                     -- encrypted secrets per user×institution
  credentials_id  UUID PK
  user_id         BIGINT
  institution_id  VARCHAR(64)
  encrypted_token BYTEA         -- envelope-encrypted; KMS-managed key
  encryption_kid  VARCHAR(64)
  status          VARCHAR(16)   -- active | expired | revoked

accounts                        -- one row per connected account
  account_id      UUID PK
  user_id         BIGINT
  credentials_id  UUID
  type            VARCHAR(16)   -- checking | savings | credit | brokerage | loan
  partner_id      TEXT          -- partner's id; UNIQUE with credentials_id
  balance         DECIMAL(18,2)
  available       DECIMAL(18,2)
  currency        CHAR(3)
  last_synced_at  TIMESTAMP
  next_sync_at    TIMESTAMP     -- scheduler index
  status          VARCHAR(16)
  INDEX (next_sync_at, status)  -- drives the scheduler

transactions                    -- partitioned by user_id, then date
  txn_id          UUID PK
  account_id      UUID
  partner_txn_id  TEXT          -- UNIQUE (account_id, partner_txn_id)
                                -- the idempotency key
  amount          DECIMAL(18,2)
  currency        CHAR(3)
  posted_at       DATE          -- partition key suffix
  description     TEXT
  merchant        TEXT
  category_id     VARCHAR(32)
  user_category   VARCHAR(32)   -- user override; takes precedence
  INDEX (account_id, posted_at DESC)

sync_jobs                       -- audit trail
  job_id          UUID PK
  account_id      UUID
  started_at, finished_at TIMESTAMP
  status          VARCHAR(16)   -- queued | running | success | failed
  error_class     VARCHAR(64)
  txns_added      INT

4 · High-level architecture

Scheduler scans accounts where next_sync_at <= now and enqueues sync jobs to a queue partitioned by institution_id. Each institution has a dedicated worker pool sized to its rate limit — overflow waits in queue rather than overwhelming the partner. Workers fetch new transactions, dedupe by partner_txn_id, write to Postgres, then categorise. Credentials are envelope-encrypted with KMS keys; workers decrypt on demand and never log plaintext.

5 · The hard part — idempotent sync against fragile partners

Why idempotency is hard here

A sync run pulls transactions, writes them to our DB, then updates last_synced_at. If the worker crashes after writing some transactions but before updating last_synced_at, the next run will pull the same transactions again. Without idempotency we either lose data (skip-ahead) or double-count it (re-insert).

The idempotency key is the partner's transaction ID — every bank gives one back. The transactions table has a unique index on (account_id, partner_txn_id). Re-inserts become no-ops. Updates use an UPSERT pattern:

INSERT INTO transactions
  (txn_id, account_id, partner_txn_id, amount, posted_at, description, ...)
VALUES (...)
ON CONFLICT (account_id, partner_txn_id)
DO UPDATE SET
  amount = EXCLUDED.amount,
  description = EXCLUDED.description,
  posted_at = EXCLUDED.posted_at,
  updated_at = now()
RETURNING txn_id;

Pending vs posted

Most banks return pending transactions that later turn into posted transactions with a different (or sometimes the same!) partner ID. The system has to recognise that "Pending: STARBUCKS $4.20" and "Posted: STARBUCKS $4.20 2026-05-21" are the same transaction. Heuristics: same account, same amount, same merchant string, posted-date close to pending-date. When a posted transaction matches a pending one, the pending row is deleted and the posted one wins.

Rate-limit-aware scheduling

One bank with 10 req/sec across all our users and 5M of our users on that bank → finishing the nightly sweep takes 5,000,000 / 10 / 86,400 ≈ 5.8 days, which doesn't fit a 6-hour window. The fix: stagger refresh frequency by institution. High-rate-limit banks (Chase, BofA, Wells Fargo) refresh nightly. Low-rate-limit small banks refresh every 3 days; on-demand refresh always works, regardless.

The scheduler maintains a per-institution token bucket. Workers acquire a token before each partner call, sleep if the bucket is empty. The bucket replenishes at the institution's rate limit. Centralised in Redis so all worker instances share the same view.

Retry classification

Error	Retry?	How long
`RATE_LIMITED`	Yes	Exponential backoff respecting `Retry-After`
`INVALID_CREDENTIALS`	No	Mark credentials expired; prompt user to re-auth
`MFA_REQUIRED`	No	Push notification to user; suspend sync until resolved
`INSTITUTION_DOWN`	Yes	Retry in 1 h, then 6 h, then daily; banner in UI
`TIMEOUT`	Yes	Up to 3 retries; classify as INSTITUTION_DOWN after
`UNKNOWN_5xx`	Yes	Exponential backoff up to 24 h

6 · Categorisation

Each transaction gets a category (Groceries, Rent, Dining, Transport, …). Three layers:

User override. If the user has corrected this merchant before, use that. Saved per (user, merchant) with a normalised merchant string.
Global rules. Maintained list mapping merchant strings to categories ("STARBUCKS" → Dining). Curated centrally, updated weekly.
ML model. Trained on aggregate categorised data. Runs only on transactions that miss the override and rules. Outputs category + confidence; below threshold, the transaction is left "Uncategorised" until the user labels it.

Privacy: the model is trained on aggregated, anonymised data. Per-user behaviour is never used for cross-user inference. Compliance audits this regularly.

7 · Failure modes & runbook

Failure	Symptom	Mitigation
Partner returns wrong data	Balance jumps to wildly wrong value	Sanity check vs prior balance; if delta > threshold, mark account "needs review", don't overwrite. Alert ops.
Partner returns duplicate IDs	Same partner_txn_id, different details	Treat as update (UPSERT); flag the row for review if material details change.
Bank silently breaks API contract	Sync starts returning empty results for all users on that bank	Per-institution health monitor; pages on-call when success rate < 95%. Often requires partner engagement.
Credentials expire	Sync fails with INVALID_CREDENTIALS	Email + in-app prompt to re-link. Stop trying after 3 failures.
KMS unavailable	Cannot decrypt credentials; sync halts	KMS is multi-region with auto-failover; this is a P1 incident if it happens.
Scheduler stalls	Nightly sweep doesn't complete by morning	Auto-scale workers; alert if queue depth doesn't drain by SLO time. Stagger institutions to spread load.
Categorisation model drift	Users see wrong categories after a deploy	Canary the model on 5% of users; compare correction rate vs. control. Auto-rollback on regression.
Data leak risk (sensitive logs)	Audit reveals plaintext PAN in error logs	Log redaction filter at the SDK level; CI test that scans logs for known regexes. Treat any incident as a P0.

8 · Cost & SLOs

Line	Estimate	Note
API + scheduler + workers (~500 pods)	~$20K/month	Stateless; bursty during the sync window
Postgres / Aurora (sharded by user)	~$40K/month	~5 TB hot; tier older transactions to warehouse
Redis (token buckets + dedupe cache)	~$3K/month	Small; mostly in-memory rate-limit state
Partner API fees (Plaid, Finicity)	~$300K/month	Per-user × per-account fees — the dominant line
KMS operations	~$5K/month	One decrypt per sync per account
Data warehouse + categorisation training	~$15K/month	Snowflake-shape; ML training on aggregates

SLOs

Dashboard load P99: 500 ms. Cached aggregates per user; recomputed on transaction insert.
On-demand sync P99: 30 s. User taps Refresh; we expect to be done in 30 s for top-10 banks; longer for niche institutions.
Nightly sweep completion: 100% of accounts by 8 AM local. Hard SLO; missing it means stale dashboards at morning peak.
Transaction freshness P99: 24 h. Bank-dependent; some banks publish only daily.
Data durability: 11 nines. Multi-region Aurora; daily backups; quarterly restore drills.
Zero plaintext credential persistence. Verified by quarterly audit.

9 · Trade-offs & "what would you change at 10×"

If…	Then…
10× users (300M)	Per-institution worker pools become large enough to fill capacity. Partner fees become the dominant strategic conversation — direct integration with the top-20 banks instead of via aggregator.
Add money movement (ACH, Zelle)	Whole new compliance surface: BSA/AML, KYC, transaction monitoring, OFAC. The technical changes are small; the org changes are huge. Out-of-scope here.
Real-time push instead of polling	Some banks support webhooks; subscribe where possible. Removes the rate-limit problem but creates a delivery-reliability problem (we have to verify the webhook actually fired).
Open Banking (PSD2-style mandate)	The whole architecture gets simpler if regulators mandate a standard API. Most of the per-institution complexity goes away.
Investment-account specifics	Brokerages have positions + cost basis + lot tracking + corporate actions (stock splits, mergers). A whole sub-system; usually a separate workstream from cash-flow features.
"What would a more senior answer add?"	The risk + observability story: per-institution error budgets, fraud-detection layer (unusual transactions flagged before the user sees them), the data-subject-rights workflow (GDPR-style "delete all my data" with proof), the SOC 2 evidence pipeline that proves the controls work. At this tier the regulator-facing story IS the product.