Cloud Codex / 05

Managed databases.

Once you've decided to put your data on someone else's hardware, the next decision is which shape of database. Managed Postgres covers most needs. DynamoDB-shape covers most of the others. Specialised engines — graph, time series, vector — show up in narrower spots. The boring answer is usually the right one: pick managed Postgres unless you have a clear reason not to.


1 · The shapes

  • Relational (SQL). Tables, rows, joins, ACID transactions. The right default for anything with structured relationships. Postgres or MySQL underneath, almost always.
  • Key-value. Get / put by key. Massive scale, low latency, no joins. Sessions, rate limits, hot lookups.
  • Document. JSON-ish documents indexed by collection. Flexible schema, OK for nested data, poor for joins.
  • Wide-column. Sparse, partitioned tables (think Cassandra-shape). Very high write throughput; querying limited to the partition key plus secondary indexes you defined up front.
  • Graph. Nodes and edges, queries that traverse relationships. Niche: fraud detection, recommendations, knowledge graphs.
  • Time-series. Optimised for append-only time-indexed data. Metrics, IoT telemetry, financial ticks.
  • Vector. Embeddings + nearest-neighbour search. Newer category; pretty much every database now claims to do this.
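
To make the first two shapes concrete, here is a minimal sketch using only Python's standard library. SQLite stands in for a managed Postgres or MySQL, and a plain dict stands in for a key-value store. The table names, keys, and helper functions are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Relational shape: two tables, a join, and a transaction.
# SQLite is only a stand-in for a managed Postgres/MySQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL REFERENCES users(id),
                         total_cents INTEGER NOT NULL);
""")

with conn:  # one transaction: all inserts commit together, or none do
    conn.execute("INSERT INTO users VALUES (1, 'ada')")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(10, 1, 2500), (11, 1, 4000)])


def total_spent(name: str) -> int:
    """Answer a question that spans two tables with a join."""
    row = conn.execute(
        """SELECT COALESCE(SUM(o.total_cents), 0)
           FROM users u JOIN orders o ON o.user_id = u.id
           WHERE u.name = ?""",
        (name,),
    ).fetchone()
    return row[0]


# Key-value shape: get and put by key, nothing else. No joins, no ad-hoc
# filters; anything you want to look up must be reachable through its key.
sessions: dict[str, dict] = {}


def put(key: str, value: dict) -> None:
    sessions[key] = value


def get(key: str):
    return sessions.get(key)


put("session:abc", {"user_id": 1})
```

The relational side can answer questions nobody planned for by writing a new query. The key-value side can only answer the questions its keys were designed around.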

2 · The AWS canonical version

| Shape | AWS service | Notes |
| --- | --- | --- |
| Relational (Postgres / MySQL) | RDS | Managed engine on EC2 underneath. Patches, backups, Multi-AZ failover. The boring, safe default. |
| Relational, cloud-native | Aurora (Postgres or MySQL compatible) | AWS-rewritten storage engine. AWS claims up to 5× the throughput of stock MySQL and 3× that of stock Postgres. More expensive than RDS, better failover. Default for new builds at any scale. |
| Relational, serverless | Aurora Serverless v2 | Scales capacity per second. Good for spiky/dev workloads; not always cheaper than provisioned. |
| Key-value / document | DynamoDB | Fully managed, single-digit-ms latency at any scale, billed per request or by provisioned capacity. The right pick for the "I need a fast hash table at planet scale" problem. |
| Document (Mongo API) | DocumentDB | Mongo-compatible, AWS-managed. Use it if you want the Mongo programming model without running Mongo. |
| Wide-column | Keyspaces (Cassandra API) | Cassandra-compatible, serverless. Replaces self-managed Cassandra for the same workloads. |
| Search | OpenSearch | Fork of Elasticsearch. Logs, full-text search, dashboards. |
| Cache | ElastiCache (Redis / Memcached) | The fast in-front-of-DB layer. Redis for everything serious, Memcached for the rare cases where a plain in-memory cache is all you need. |
| Time-series | Timestream | Append-only, time-partitioned. Mostly used in IoT pipelines. |
| Graph | Neptune | Property graph + RDF. Niche. |
| Vector / embeddings | pgvector on RDS or Aurora, OpenSearch k-NN, plus standalone (Pinecone / Weaviate) | Pick the one your existing DB already supports unless you have a serious vector workload. |
| Analytics | Redshift, Athena (serverless on S3) | Redshift for the warehouse, Athena when "warehouse" is overkill. |

3 · GCP and Azure equivalents

| Shape | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Managed Postgres / MySQL | RDS / Aurora | Cloud SQL / AlloyDB (Aurora-shape) | Azure DB for PostgreSQL / MySQL |
| Globally consistent SQL | Aurora Global / Aurora DSQL | Spanner | Cosmos DB (SQL API) with strong consistency |
| Key-value / doc, single-digit-ms | DynamoDB | Firestore (in Datastore mode) / Bigtable | Cosmos DB |
| Document (Mongo) | DocumentDB | Firestore (Native mode) / MongoDB Atlas (third-party) | Cosmos DB (Mongo API) |
| Wide-column | Keyspaces | Bigtable | Cosmos DB (Cassandra API) |
| Search | OpenSearch | Elasticsearch (third-party) / Cloud Search | Azure AI Search |
| Cache | ElastiCache | Memorystore (Redis / Memcached) | Azure Cache for Redis |
| Warehouse | Redshift | BigQuery | Synapse Analytics / Fabric |
| Time-series | Timestream | Bigtable + tooling, or InfluxDB on GCE | Azure Data Explorer (ADX) |
| Graph | Neptune | (No first-party; use Neo4j on GKE) | Cosmos DB (Gremlin API) |

Spanner and BigQuery are the GCP standouts. Spanner is the best-known commercially available RDBMS with globally linearisable (externally consistent) transactions; Google's ads business runs on it, through the F1 database built on top. BigQuery is the most ergonomic data warehouse on the market by a comfortable margin. Both are reasons to pick GCP for a specific workload even in an AWS-default shop.

4 · How to pick

  1. Does the data have relationships you'll want to query (joins)? Managed Postgres. Almost always Aurora-shape for new builds.
  2. Is the access pattern a key lookup at huge scale with sub-10ms P99? DynamoDB / Firestore / Cosmos. Plan your access patterns up front; you can't add ad-hoc queries later without a redesign.
  3. Do you need ACID transactions across globally-distributed regions? Spanner, or Aurora DSQL (AWS's newer entry in the same space). Self-managed CockroachDB if you need multi-cloud.
  4. Is it append-heavy time-indexed data? Timestream, ADX, or Postgres with TimescaleDB extension.
  5. Is it search-shaped (full text, faceting, log analytics)? OpenSearch / Azure AI Search.
  6. Is it a warehouse query (large scans, OLAP)? Redshift / BigQuery / Snowflake. Don't run OLAP on your transactional DB past a certain size.

The decision worth defending. "Pick managed Postgres unless you have a specific reason not to." Postgres is the default at every scale up to billions of rows; it handles JSON, full-text, geospatial, and vector workloads via extensions. The interesting question in a design interview is which specific workload would not fit Postgres, and why.
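
The six questions can be written down as a small lookup, which is handy as a checklist in a design review. This is only a sketch: the `Workload` fields, the `RULES` table, and `pick_databases` are hypothetical names, and the recommendations are copied from the list above rather than derived from any measurement. It returns every match, since one system often needs more than one store, and falls back to managed Postgres when nothing else applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical yes/no answers to the six questions, in list order."""
    needs_joins: bool = False                # 1. relationships queried with joins
    huge_scale_key_lookup: bool = False      # 2. key lookup at huge scale, sub-10ms P99
    global_acid: bool = False                # 3. ACID across globally distributed regions
    append_heavy_time_indexed: bool = False  # 4. append-heavy, time-indexed data
    search_shaped: bool = False              # 5. full text, faceting, log analytics
    olap_scans: bool = False                 # 6. large scans, OLAP


# (Workload attribute, recommendation) pairs in the same order as the list.
RULES = [
    ("needs_joins", "managed Postgres (Aurora-shape for new builds)"),
    ("huge_scale_key_lookup", "DynamoDB / Firestore / Cosmos DB"),
    ("global_acid", "Spanner / Aurora DSQL / CockroachDB"),
    ("append_heavy_time_indexed", "Timestream / ADX / Postgres + TimescaleDB"),
    ("search_shaped", "OpenSearch / Azure AI Search"),
    ("olap_scans", "Redshift / BigQuery / Snowflake"),
]

DEFAULT = "managed Postgres"


def pick_databases(w: Workload) -> list[str]:
    """Return every matching recommendation; default to managed Postgres."""
    matches = [rec for attr, rec in RULES if getattr(w, attr)]
    return matches or [DEFAULT]
```

For example, `pick_databases(Workload(needs_joins=True, olap_scans=True))` returns the Postgres entry followed by the warehouse entry, which matches the advice not to run OLAP on the transactional database.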

5 · What breaks

  • RDS storage runs out. Disk fills up over a weekend; the instance goes into the storage-full state; nobody can write. Mitigation: enable storage auto-scaling. (Aurora's storage layer is separate from the instance and grows automatically, so it doesn't have this problem.)
  • DynamoDB hot partition. If your partition key isn't well-distributed (e.g. all writes go to user_123), you'll see throttling. The fix is a better partition key, not more capacity.
  • Aurora connection limit. Aurora caps connections according to instance size. A poorly tuned connection pool (or no pool at all, looking at you, Lambda) hits the ceiling first. Put RDS Proxy or PgBouncer in between.
  • DynamoDB scan. The escape hatch for "I forgot to design my access pattern." Cheap in dev, ruinous in production at scale: a scan reads (and bills for) every item in the table, whereas a query reads only the items under one partition key.
  • Aurora reader lag. Read replicas are eventually consistent (lag is typically single-digit ms, but it spikes). Reading your own writes from a replica is the most-debugged bug in cloud-Postgres setups. Pin reads that follow a recent write to the primary, or send them to the cluster (writer) endpoint.
  • BigQuery / Redshift cost spike. A single bad query can scan terabytes. Mitigations: BigQuery slot reservations, Redshift workload management, query review at code-review time.
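
The reader-lag mitigation (send reads that follow a recent write to the primary) can be sketched as a small router. Everything here is an assumption for illustration: the `ReadRouter` class, the `"primary"` / `"replica"` labels, and the `pin_seconds` window, which in practice would need to sit above the worst replica lag you actually observe. The clock is injectable so the behaviour can be tested without sleeping.

```python
import time
from typing import Callable


class ReadRouter:
    """Route a session's reads to the primary for a short window after it writes."""

    def __init__(self, pin_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.pin_seconds = pin_seconds  # assumed window; tune to observed lag
        self.clock = clock
        self._last_write: dict[str, float] = {}

    def record_write(self, session_id: str) -> None:
        """Call after every successful write made by this session."""
        self._last_write[session_id] = self.clock()

    def endpoint_for_read(self, session_id: str) -> str:
        """Return 'primary' while the session is pinned, otherwise 'replica'."""
        last = self._last_write.get(session_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"
```

Only the session that wrote is pinned; every other session keeps reading from replicas, so the primary absorbs a small amount of extra read traffic rather than all of it. A long-lived process would also want to evict old entries from `_last_write`, which this sketch leaves out.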

6 · Cost note

The database is often the biggest line on the cloud bill after compute. Three things to watch:

  • RDS/Aurora reserved instances. Same 30–60% savings story as compute. Steady-state DB instances should be reserved, full stop.
  • DynamoDB on-demand vs provisioned. On-demand is convenient and 5–7× more expensive per request than well-tuned provisioned. Tables with predictable traffic should be provisioned with auto-scaling.
  • Snapshots, backups, point-in-time recovery. All cost money. PITR especially is a per-GB-month charge that adds up on big DBs. Set retention deliberately, not at "default forever."
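
The on-demand versus provisioned trade-off reduces to one line of arithmetic. Suppose on-demand costs k times as much per request as provisioned capacity running at full utilisation. At average utilisation u, provisioned capacity costs 1/u times its fully-utilised price per request actually served, so provisioned wins whenever u > 1/k. The sketch below encodes that, plus the reserved-instance saving. The function names are hypothetical, and the model assumes the 5–7× figure above is measured against fully utilised provisioned capacity; it ignores auto-scaling lag and throttling.

```python
def break_even_utilisation(on_demand_multiple: float) -> float:
    """Average utilisation above which provisioned capacity is cheaper.

    on_demand_multiple: how many times more on-demand costs per request
    than provisioned capacity at 100% utilisation (the k in u > 1/k).
    """
    if on_demand_multiple <= 0:
        raise ValueError("on_demand_multiple must be positive")
    return 1.0 / on_demand_multiple


def cheaper_mode(avg_utilisation: float, on_demand_multiple: float) -> str:
    """Return 'provisioned' or 'on-demand' under the simple model above."""
    if not 0 < avg_utilisation <= 1:
        raise ValueError("avg_utilisation must be in (0, 1]")
    if avg_utilisation > break_even_utilisation(on_demand_multiple):
        return "provisioned"
    return "on-demand"


def reserved_annual_saving(on_demand_annual_cost: float, discount: float) -> float:
    """Annual saving from reserving a steady-state instance at a given discount."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_annual_cost * discount
```

At a multiple of 5 the break-even is 20% average utilisation; at 7 it is about 14%. A table that sits well below that most of the day is a reasonable on-demand candidate, and a table with predictable traffic above it is not.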

Further reading

  • AWS Database Blog. Aurora internals posts are surprisingly detailed.
  • "Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases" (SIGMOD 2017). The paper that explains the redo-log-only architecture.
  • "Dynamo: Amazon's Highly Available Key-value Store" (SOSP 2007). The original Dynamo paper. DynamoDB descends from it, plus a couple of decades of engineering.
  • Adjacent: Databases Codex. Engine internals: B-tree, LSM, MVCC, WAL.
  • Adjacent: CAP / PACELC. The consistency-availability trade-offs each managed DB makes.
  • Adjacent: Consistency patterns. Where each managed DB sits on the five-band spectrum.