Cloud Codex / 05
Managed databases.
Once you've decided to put your data on someone else's hardware, the next decision is which shape of database. Managed Postgres covers most needs. DynamoDB-shape covers most of the others. Specialised engines — graph, time series, vector — show up in narrower spots. The boring answer is usually the right one: pick managed Postgres unless you have a clear reason not to.
1 · The shapes
- Relational (SQL). Tables, rows, joins, ACID transactions. The right default for anything with structured relationships. Postgres or MySQL underneath, almost always.
- Key-value. Get / put by key. Massive scale, low latency, no joins. Sessions, rate limits, hot lookups.
- Document. JSON-ish documents indexed by collection. Flexible schema, OK for nested data, poor for joins.
- Wide-column. Sparse, partitioned tables (think Cassandra-shape). Very high write throughput; querying limited to the partition key plus secondary indexes you defined up front.
- Graph. Nodes and edges, queries that traverse relationships. Niche: fraud detection, recommendations, knowledge graphs.
- Time-series. Optimised for append-only time-indexed data. Metrics, IoT telemetry, financial ticks.
- Vector. Embeddings + nearest-neighbour search. Newer category; pretty much every database now claims to do this.
2 · The AWS canonical version
| Shape | AWS service | Notes |
|---|---|---|
| Relational (Postgres / MySQL) | RDS | Managed engine on EC2 underneath. Patches, backups, Multi-AZ failover. The boring, safe default. |
| Relational, cloud-native | Aurora (Postgres or MySQL compatible) | AWS-rewritten storage engine. AWS claims up to 5× the throughput of stock MySQL and up to 3× that of stock Postgres; more expensive than RDS, better failover. Default for new builds at any scale. |
| Relational, serverless | Aurora Serverless v2 | Scales capacity per second. Good for spiky/dev workloads; not always cheaper than provisioned. |
| Key-value / document | DynamoDB | Fully managed, single-digit-ms latency at any scale, pay per request or provisioned. The right pick for the "I need a fast hash table at planet scale" problem. |
| Document (Mongo API) | DocumentDB | Mongo-compatible, AWS-managed. Use it if you want the Mongo programming model without running Mongo. |
| Wide-column | Keyspaces (Cassandra API) | Cassandra-compatible, serverless. Replaces self-managed Cassandra for the same workloads. |
| Search | OpenSearch | Fork of Elasticsearch. Logs, full-text search, dashboards. |
| Cache | ElastiCache (Redis / Memcached) | The fast in-front-of-DB layer. Redis for everything serious, Memcached for the rare cases where you need only a simple, non-persistent in-memory cache. |
| Time-series | Timestream | Append-only, time-partitioned. Mostly used in IoT pipelines. |
| Graph | Neptune | Property graph + RDF. Niche. |
| Vector / embeddings | pgvector on RDS or Aurora Postgres, OpenSearch k-NN, plus standalone (Pinecone / Weaviate) | Pick the one your existing DB already supports unless you have a serious vector workload. |
| Analytics | Redshift, Athena (serverless on S3) | Redshift for warehouse, Athena when "warehouse" is overkill. |
3 · GCP and Azure equivalents
| Shape | AWS | GCP | Azure |
|---|---|---|---|
| Managed Postgres / MySQL | RDS / Aurora | Cloud SQL / AlloyDB (Aurora-shape) | Azure DB for PostgreSQL / MySQL |
| Globally consistent SQL | Aurora Global / Aurora DSQL | Spanner | Cosmos DB (SQL API) with strong consistency |
| Key-value / doc, single-digit-ms | DynamoDB | Firestore (in Datastore mode) / Bigtable | Cosmos DB |
| Document (Mongo) | DocumentDB | Firestore (Native mode) / MongoDB Atlas (third-party) | Cosmos DB (Mongo API) |
| Wide-column | Keyspaces | Bigtable | Cosmos DB (Cassandra API) |
| Search | OpenSearch | Elasticsearch (3rd party) / Cloud Search | Azure AI Search |
| Cache | ElastiCache | Memorystore (Redis / Memcached) | Azure Cache for Redis |
| Warehouse | Redshift | BigQuery | Synapse Analytics / Fabric |
| Time-series | Timestream | Bigtable + tooling, or InfluxDB on GCE | Azure Data Explorer (ADX) |
| Graph | Neptune | (No first-party; use Neo4j on GKE) | Cosmos DB (Gremlin API) |
Spanner and BigQuery are the GCP standouts. Spanner is the reference point for globally distributed SQL: it offers externally consistent (linearisable) transactions across regions, and it was built to run Google's own advertising systems. BigQuery is arguably the most ergonomic data warehouse on the market. Both are reasons to pick GCP for a specific workload even in an AWS-default shop.
4 · How to pick
- Does the data have relationships you'll want to query (joins)? Managed Postgres. Almost always Aurora-shape for new builds.
- Is the access pattern a key lookup at huge scale with sub-10ms P99? DynamoDB / Firestore / Cosmos. Plan your access patterns up front; you can't add ad-hoc queries later without a redesign.
- Do you need ACID transactions across globally-distributed regions? Spanner. Aurora DSQL (AWS's newer entry in the same space). CockroachDB self-managed if multi-cloud.
- Is it append-heavy time-indexed data? Timestream, ADX, or Postgres with TimescaleDB extension.
- Is it search-shaped (full text, faceting, log analytics)? OpenSearch / Azure AI Search.
- Is it a warehouse query (large scans, OLAP)? Redshift / BigQuery / Snowflake. Don't run OLAP on your transactional DB past a certain size.
The decision worth defending. "Pick managed Postgres unless you have a specific reason not to." Postgres is the default at every scale up to billions of rows; it handles JSON, full-text, geospatial, and vector workloads via extensions. The interesting question in a design interview is which specific workload would not fit Postgres, and why.
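The checklist above can be sketched as a small decision function. This is only an illustration: the `Workload` fields and the order of the checks are assumptions made for the example, not a rule from any vendor, and real decisions weigh more than six booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    needs_joins: bool = False
    key_lookup_at_scale: bool = False
    global_acid: bool = False
    time_series: bool = False
    search_shaped: bool = False
    olap_scans: bool = False


def pick_database(w: Workload) -> str:
    """Walk the checklist and fall back to managed Postgres."""
    # Specialised needs are checked first (an assumed ordering) so that
    # they can override the default.
    if w.global_acid:
        return "Spanner / Aurora DSQL / CockroachDB"
    if w.olap_scans:
        return "Redshift / BigQuery / Snowflake"
    if w.search_shaped:
        return "OpenSearch / Azure AI Search"
    if w.time_series:
        return "Timestream / ADX / Postgres + TimescaleDB"
    # Key-value stores only win when joins are not needed.
    if w.key_lookup_at_scale and not w.needs_joins:
        return "DynamoDB / Firestore / Cosmos DB"
    # No specific reason to leave the default.
    return "Managed Postgres"
```

The useful property is the last line: every branch above it is a "specific reason", and anything that matches none of them lands on Postgres.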
5 · What breaks
- RDS storage runs out. Disk fills up over a weekend; instance goes into storage-full state; nobody can write. Mitigation: enable storage auto-scaling. (Aurora's storage layer is decoupled from the instance and grows automatically up to the cluster limit, so it largely avoids this problem.)
- DynamoDB hot partition. If your partition key isn't well-distributed (e.g. all writes go to user_123), you'll see throttling. The fix is a better partition key, not more capacity.
- Aurora connection limit. Aurora limits connections by instance size. A poorly-tuned connection pool (or no pool, looking at you Lambda) hits the ceiling first. RDS Proxy or pgbouncer in between.
- DynamoDB scan. The escape hatch for "I forgot to design my access pattern." Cheap in dev, ruinous in production at scale. Real queries hit indexes; scans don't.
- Aurora reader lag. Read replicas are eventually consistent (lag is typically tens of milliseconds, but it spikes under load). Read-your-own-writes from a replica is the most-debugged bug in cloud-Postgres setups. Pin reads that follow recent writes to the primary, i.e. use the cluster (writer) endpoint rather than the reader endpoint.
- BigQuery / Redshift cost spike. A single bad query can scan terabytes. Mitigations: BigQuery slot reservations, Redshift workload management, query review at code-review time.
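For the hot-partition case above, one common form of "a better partition key" is write sharding: append a computed suffix so that writes for one logical key spread over several physical partition keys. The sketch below is a minimal illustration in plain Python with no AWS calls; `NUM_SHARDS`, `sharded_key`, `all_shard_keys`, and the `#` separator are all hypothetical names and choices made for the example.

```python
import hashlib

NUM_SHARDS = 8  # assumed shard count; tune to the write rate of the hot key


def sharded_key(base_key: str, item_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Spread writes for one logical key across num_shards partition keys."""
    # Hash a per-item attribute so the same item always maps to the same shard.
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % num_shards
    return f"{base_key}#{shard}"


def all_shard_keys(base_key: str, num_shards: int = NUM_SHARDS) -> list[str]:
    """Every partition key a reader must query to see the whole logical key."""
    return [f"{base_key}#{i}" for i in range(num_shards)]
```

The trade-off is on the read side: a reader that wants everything for `user_123` now has to query each key returned by `all_shard_keys("user_123")` and merge the results, so this only pays off when the key is write-hot.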
6 · Cost note
The database is often the biggest line on the cloud bill after compute. Three things to watch:
- RDS/Aurora reserved instances. Same 30–60% savings story as compute. Steady-state DB instances should be reserved, full stop.
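- DynamoDB on-demand vs provisioned. On-demand is convenient but can be several times more expensive per request than well-tuned provisioned capacity (5–7× is the usual rule of thumb; the exact ratio depends on utilisation and current pricing). Tables with predictable traffic should be provisioned with auto-scaling.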
- Snapshots, backups, point-in-time recovery. All cost money. PITR especially is a per-GB-month charge that adds up on big DBs. Set retention deliberately, not at "default forever."
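The on-demand versus provisioned comparison above comes down to simple arithmetic, sketched here. All prices are parameters, not real AWS prices, and the model assumes one capacity unit serves one request per second, which is a simplification; plug in current figures from the pricing page before drawing conclusions.

```python
import math

SECONDS_PER_HOUR = 3600


def on_demand_cost(requests: float, price_per_million: float) -> float:
    """Pay-per-request cost for a given number of requests."""
    return requests / 1_000_000 * price_per_million


def provisioned_cost(capacity_units: int, hours: float, price_per_unit_hour: float) -> float:
    """Cost of holding capacity_units for the given number of hours."""
    return capacity_units * hours * price_per_unit_hour


def cost_ratio(avg_rps: float, utilisation: float, hours: float,
               price_per_million: float, price_per_unit_hour: float) -> float:
    """On-demand cost divided by provisioned cost for the same traffic."""
    requests = avg_rps * SECONDS_PER_HOUR * hours
    # Provisioned capacity is sized so the average rate sits at the target
    # utilisation (assumption: 1 capacity unit = 1 request per second).
    units = math.ceil(avg_rps / utilisation)
    return (on_demand_cost(requests, price_per_million)
            / provisioned_cost(units, hours, price_per_unit_hour))
```

With placeholder prices, the ratio is well above 1 when the table runs at high utilisation and drops below 1 when provisioned capacity sits mostly idle, which is why "well-tuned" carries the weight in that bullet.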
Further reading
- AWS Database Blog. Aurora internals posts are surprisingly detailed.
- "Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases" (SIGMOD 2017). The paper that explains the redo-log-only architecture.
- "Dynamo: Amazon's Highly Available Key-value Store" (SOSP 2007). The original Dynamo paper. DynamoDB descends from it, plus a couple of decades of engineering.
- Adjacent: Databases Codex. Engine internals: B-tree, LSM, MVCC, WAL.
- Adjacent: CAP / PACELC. The consistency-availability trade-offs each managed DB makes.
- Adjacent: Consistency patterns. Where each managed DB sits on the five-band spectrum.