Handbook · Vol. IV · 2026 Track I · The data layer · piece 3 of 4 Primer

NoSQL databases.

Key-value, document, wide-column, graph, and time-series. The five shapes, which workloads each one fits, and the rule that decides every choice: where you want the schema enforced.

"NoSQL" is not one thing. It's five distinct data models, each with a different sweet spot, and choosing the wrong one is one of the most expensive mistakes a system can make.

Relational databases (PostgreSQL, MySQL) are remarkable general-purpose tools — they handle most workloads well. But not every workload looks like rows-and-relationships. When the data model, the access pattern, or the scale ceiling no longer fits, NoSQL stores enter as specialised alternatives. This module covers the five families of NoSQL stores, the trade-offs they accept in exchange for their gains, and the decision rules that map workloads to stores.

[Figure: Five data models, different shapes of data. Key-value (k → v; Redis, DynamoDB) · Document (JSON / BSON; MongoDB, Couchbase) · Wide-column (row → many columns; Cassandra, Bigtable) · Graph (nodes and edges; Neo4j, Neptune) · Time-series (timestamp, metric, value; InfluxDB, Timescale)]
Five data shapes, five engines optimised for them. The shape comes from the access pattern, not the data — pick by the queries you'll run, not the rows you'll store.

Why NoSQL — and when not

The word "NoSQL" was coined as a marketing label and stuck despite being misleading. Most NoSQL stores eventually added a SQL-like query language; the meaningful split is not "SQL vs NoSQL" but "relational vs specialised." Three reasons drive the move to a specialised store:

Schema flexibility
Rigid relational schemas force a migration for every structural change. Document stores let each document carry its own shape: useful for catalog data, user-generated content, or anything heterogeneous by nature.
Horizontal scale
Relational databases scale to one big box very well; beyond that, sharding is manual and painful. Several NoSQL stores (Cassandra, DynamoDB, MongoDB sharded clusters) are designed to scale close to linearly across nodes, provided the partition key spreads load evenly.
Specialised access
Time-series, full-text, graph traversals, geospatial — these query shapes are slow on a relational engine and fast on a specialised one. Sometimes the price of using the wrong tool is 100×.

The honest counter: most apps that "need" NoSQL actually need an indexed Postgres. Rules of thumb that say "use Mongo for JSON" and "use Cassandra at scale" lead to bills and migration projects later. Default to Postgres. Reach for NoSQL when you have a specific shape of data or query that Postgres does badly.

The five families, plus search indexes

Key-value

The simplest possible model: a flat namespace of keys mapped to opaque values. Sub-millisecond GET/SET. Used as caches (Redis), session stores (Redis, DynamoDB), feature flags. Doesn't scale to "find all keys matching a pattern" without help.
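The gap between a point lookup and a pattern lookup can be illustrated with a minimal in-memory sketch. The `KeyValueStore` class and its method names are hypothetical and stand in for a real engine; the point is only that `get` is a single hash probe while a pattern match has no index to lean on and must inspect every key.

```python
from fnmatch import fnmatchcase


class KeyValueStore:
    """Toy in-memory key-value store: flat namespace, opaque values."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # Point lookup: one hash probe, independent of how many keys exist.
        return self._data.get(key, default)

    def keys_matching(self, pattern):
        # Pattern lookup: nothing to narrow the search, so every key is checked.
        return sorted(k for k in self._data if fnmatchcase(k, pattern))


store = KeyValueStore()
store.set("u:42", {"name": "Ada"})
store.set("sess:abc", {"user": 42})
store.set("cart:99", ["sku-1", "sku-2"])

print(store.get("u:42"))
print(store.keys_matching("sess:*"))
```

In practice the "help" is usually a second structure maintained at write time, such as a set of session keys stored under its own key, so the pattern never has to be evaluated against the whole namespace.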

Document

Each record is a JSON document. Indexes can be defined on any field; queries look like MongoDB filters or SQL. Best when the schema varies from document to document, when a single document holds everything you need (no joins), and when nested fields matter.
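As a rough illustration of filtering on nested fields across documents of different shapes, here is a small sketch. The helpers `get_path` and `find` are made up for this example and only mimic the flavour of a document filter; they are not any driver's API, and a real store would answer this from an index rather than a scan.

```python
def get_path(doc, dotted_path):
    """Follow a dotted path such as 'addr.city' into nested dicts."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the path is absent in this document
        current = current[part]
    return current


def find(collection, filters):
    """Return documents where every dotted path equals the wanted value."""
    return [
        doc for doc in collection
        if all(get_path(doc, path) == wanted for path, wanted in filters.items())
    ]


users = [
    {"_id": 42, "name": "Ada", "tags": ["admin"], "addr": {"city": "London"}},
    {"_id": 43, "name": "Lin", "addr": {"city": "Oslo", "zip": "0150"}},
    {"_id": 44, "name": "Sam"},  # no address at all: shapes vary per document
]

print([doc["_id"] for doc in find(users, {"addr.city": "Oslo"})])
```

Note that the reader, not the store, decides what a missing field means; here it is treated as `None`.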

Wide-column

A two-dimensional map: row key → column → value. Each row can have millions of columns, and you query by row key + column range. Built for time-series-shaped writes and predictable reads. Cassandra, Bigtable, ScyllaDB.
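The "row key plus column range" access pattern can be sketched with a sorted list of column names per row. `WideRow`, `table`, and `put` are hypothetical names for this illustration; real wide-column engines keep the ordering on disk, but the shape of the query is the same.

```python
import bisect


class WideRow:
    """One row: column names kept sorted so a column range is a slice."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def put(self, column, value):
        if column not in self._values:
            bisect.insort(self._names, column)
        self._values[column] = value

    def slice(self, start, end):
        """Columns with start <= name < end, in sorted order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_left(self._names, end)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


table = {}  # row key -> WideRow


def put(row_key, column, value):
    table.setdefault(row_key, WideRow()).put(column, value)


put("u42", "name", "Ada")
put("u42", "post:0002", "second post")
put("u42", "post:0001", "first post")
put("u42", "post:0003", "third post")

# One row-key lookup, then one contiguous range of columns.
print(table["u42"].slice("post:0001", "post:0003"))
```

Zero-padding the column suffix is what makes lexical order match numeric order in this sketch.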

Graph

Nodes and edges as the primary primitive. Optimised for "friend of friend," shortest path, fraud-ring detection. Slow at heavy aggregation; fast at relationships you would dread expressing in SQL.
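A "friend of friend" query is a bounded breadth-first traversal. The sketch below uses a plain adjacency dictionary with invented names (`friends`, `within_hops`) to show the shape of the query; a graph database runs the same idea over stored edges.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency map.
friends = {
    "ada": {"bob", "cy"},
    "bob": {"ada", "dee"},
    "cy": {"ada", "dee", "eve"},
    "dee": {"bob", "cy"},
    "eve": {"cy"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


hops = within_hops(friends, "ada", 2)
friends_of_friends = sorted(name for name, d in hops.items() if d == 2)
print(friends_of_friends)
```

Each extra hop is one more loop iteration here, whereas in SQL it is typically one more self-join or a recursive query.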

Time-series

Append-mostly, timestamp-keyed, retention-based. Specialised compression (delta encoding plus run-length encoding) can make storage around 10× more efficient than in a generic store. InfluxDB, TimescaleDB, Prometheus, Bigtable for monitoring.
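To see why delta plus run-length encoding works so well on timestamps, consider this minimal sketch. It covers only a regularly spaced timestamp column and is not the format of any particular engine; the function names are illustrative.

```python
def delta_encode(values):
    """Keep the first value, then store each step to the next value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]


def decode(runs):
    """Invert both steps: expand the runs, then accumulate the deltas."""
    deltas = [v for v, count in runs for _ in range(count)]
    out, total = [], 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        out.append(total)
    return out


# Eight samples taken one second apart.
timestamps = [1700000000 + i for i in range(8)]
encoded = run_length_encode(delta_encode(timestamps))
print(encoded)
assert decode(encoded) == timestamps
```

Eight large integers collapse to two pairs because the interval never changes; jittery intervals compress less, which is one reason real engines layer further schemes on top.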

Search index

Inverted index over text and structured fields. Full-text search, faceted filters, relevance ranking. Elasticsearch and OpenSearch. Almost always a secondary store fed by CDC, never the source of truth.
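An inverted index maps each term to the set of documents that contain it, so a multi-word query becomes a set intersection. The sketch below is a bare-bones illustration with invented helper names; it leaves out stemming, relevance ranking, and facets.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Red running shoes",
    2: "Blue running jacket",
    3: "Red rain jacket",
}
index = build_index(docs)

print(search(index, "red jacket"))
print(search(index, "running"))
```

Because the index is derived entirely from `docs`, it can be rebuilt at any time, which is the property that makes a search index a natural secondary store.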

The big trade-off table

| | Postgres / MySQL | Document (Mongo) | Wide-column (Cassandra) | Key-value (DynamoDB) | Graph (Neo4j) |
|---|---|---|---|---|---|
| Joins | Native | Limited | None | None | Native |
| Transactions | Full ACID | Single-doc; multi-doc since 4.0 | Single-row only | Single-item; multi-item via TransactWriteItems | Full ACID |
| Secondary index | Excellent | Good | Limited (poor under load) | GSI (eventually consistent), LSI | Tunable |
| Horizontal scale | Vertical first; sharding manual | Sharded clusters | Linear, by design | Linear, by design | Limited |
| Consistency | Strong | Tunable per write | Tunable (QUORUM, ONE, ALL) | Tunable per read | Strong |
| Best at | Anything OLTP, any reporting | JSON-shaped, schema variation | High-write, time-series | Sub-ms point lookups, high QPS | Relationship traversal |
| Bad at | 10M+ writes/sec, deeply nested JSON | Cross-doc joins, ad-hoc analytics | Anything ad-hoc; needs rigid query shape | Range scans, complex filters | Bulk aggregation |
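The Cassandra consistency levels named in the table (ONE, QUORUM, ALL) can be reasoned about with simple replica arithmetic: a read is guaranteed to touch at least one replica that acknowledged the latest write when the read and write replica counts together exceed the replication factor. The sketch below encodes only that overlap rule, with hypothetical function names; it ignores node failures, repair, and timestamp conflicts.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(n, w, r):
    """True when any r replicas must share a node with any w replicas."""
    return r + w > n


N = 3  # assumed replication factor for the example
LEVELS = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}

for w_name, w in LEVELS.items():
    for r_name, r in LEVELS.items():
        verdict = "overlap" if read_overlaps_write(N, w, r) else "may be stale"
        print(f"write={w_name:<6} read={r_name:<6} {verdict}")
```

With three replicas, QUORUM on both sides gives 2 + 2 > 3, while ONE on both sides gives 1 + 1, which leaves room for a read to land on a replica the write has not reached yet.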

Modelling for the access pattern, not the data

The single biggest mental shift when moving from relational to NoSQL: you do not model the data; you model the queries. In Postgres you normalise into 3NF, then write a JOIN. In Cassandra or DynamoDB you write the queries first, then design the tables to answer those queries directly — even if it means duplicating data across multiple tables, each one shaped for one access pattern.

For example, a chat app in Cassandra has at minimum two tables: messages_by_room partitioned by room id, ordered by time, for the message-list view; and messages_by_user partitioned by user id for "all my messages across rooms." The same message appears in both tables. Storage doubles. Reads become single-partition. This is not a workaround — it's the design pattern.
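The chat example can be sketched in memory to show the write-twice, read-once shape. The two dictionaries below borrow the table names from the text, while the `Message` fields and `post_message` are assumptions made for illustration; a real Cassandra schema would express the partitioning and ordering in the table definitions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    room_id: str
    user_id: str
    sent_at: int  # assumed: a sortable timestamp
    body: str


# One "table" per access pattern, each keyed by its partition key.
messages_by_room = defaultdict(list)
messages_by_user = defaultdict(list)


def post_message(msg):
    # The same message is written to both tables: storage doubles.
    messages_by_room[msg.room_id].append(msg)
    messages_by_user[msg.user_id].append(msg)
    # Keep each partition ordered by time, as the clustering order would.
    messages_by_room[msg.room_id].sort(key=lambda m: m.sent_at)
    messages_by_user[msg.user_id].sort(key=lambda m: m.sent_at)


post_message(Message("r1", "ada", 2, "second"))
post_message(Message("r1", "bob", 1, "first"))
post_message(Message("r2", "ada", 3, "elsewhere"))

# Each view is a single-partition read with no join.
print([m.body for m in messages_by_room["r1"]])
print([m.body for m in messages_by_user["ada"]])
```

The two writes are not atomic in this sketch, which mirrors the real trade-off: the application has to tolerate or repair the case where one write succeeds and the other fails.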

The hard cases

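Eventual consistency surprises. A write returns success, the user reads, and sees old data. In MongoDB this happens when reads are routed to secondaries (the default read preference, primary, avoids it); the lag is often on the order of 100ms but can grow under load. In Cassandra it happens when the write and read consistency levels do not overlap, for example ONE and ONE; QUORUM writes with QUORUM reads overlap on at least one replica, at the cost of latency. Either read at a level that overlaps the write acknowledgement, or document the lag explicitly to consumers.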
Hot partitions. A poorly chosen partition key concentrates traffic on one node. In DynamoDB this surfaces as "ProvisionedThroughputExceededException" or throttled requests. In Cassandra it surfaces as one node at 90% CPU while the rest sit idle. Mitigate with composite keys or write sharding (append a random suffix). DynamoDB on-demand mode absorbs some skew, but per-partition throughput limits still apply.
Schema drift in document stores. Mongo will accept any shape for any document. Five years later, you have 17 schema variants in one collection and reads have to handle them all. Mitigate with schema validation rules, application-level versioning, and periodic backfill jobs. The flexibility cuts both ways.
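The write-sharding mitigation for hot partitions can be sketched as follows. `SHARDS`, `sharded_key`, `write`, and `read_all` are illustrative names, and the dictionary stands in for the store's partitions; the sketch shows only how a random suffix spreads writes and what that costs on the read side.

```python
import random
from collections import defaultdict

SHARDS = 4  # hypothetical shard count; tune to the write rate you need to spread

# Stand-in for the store: partition key -> items in that partition.
partitions = defaultdict(list)


def sharded_key(hot_key, shard):
    return f"{hot_key}#{shard}"


def write(hot_key, item, rng=random):
    # A random suffix spreads writes for one logical key over SHARDS partitions.
    shard = rng.randrange(SHARDS)
    partitions[sharded_key(hot_key, shard)].append(item)


def read_all(hot_key):
    # The cost moves to reads: every suffix has to be queried and merged.
    items = []
    for shard in range(SHARDS):
        items.extend(partitions.get(sharded_key(hot_key, shard), []))
    return items


rng = random.Random(7)  # seeded only to make the demo repeatable
for event_id in range(100):
    write("room:lobby", event_id, rng)

print({key: len(items) for key, items in sorted(partitions.items())})
print(len(read_all("room:lobby")))
```

The read fan-out grows with the shard count, so the suffix range is worth keeping as small as the write rate allows.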

Practical defaults

  1. Default is Postgres. JSONB columns handle 90% of "I need flexible schema" requirements with full SQL on top.
  2. Reach for Redis when latency or QPS dominates the requirement and the data fits in RAM.
  3. Reach for Cassandra/DynamoDB when write throughput exceeds what a single RDBMS primary, or a manually sharded RDBMS you are willing to operate, can sustain (typically >100k writes/sec).
  4. Reach for Elasticsearch when search is part of the product, not an afterthought. Feed it via CDC; never make it the system of record.
  5. Reach for a graph DB when the queries are explicitly relationship-shaped — second-degree connections, shortest path, ring detection.
  6. Reach for a time-series DB when you have ingestion rate > 50k points/sec and retention windows in months.
  7. Polyglot persistence is fine. Most non-trivial systems end up with 2-3 stores. The cost is operational complexity; pay it deliberately.