Time-series databases
Time-series workloads are the easy case in disguise. The data is append-only, almost always timestamped in order, mostly numeric, and read in time ranges. A general-purpose database can hold it, but a TSDB uses every one of those properties: time-partitioned chunks, columnar compression of regular value streams, automatic downsampling, and retention policies that drop old data instead of vacuuming around it.
What makes time-series different
In a transactional workload, writes are scattered across the keyspace and you have to keep
B-trees balanced. In a time-series workload, writes arrive at the head of a timeline that
advances by one second every second. The newest data is almost always the only writable
region, and rows for the same series (cpu_idle for host A, say) arrive
thousands of times per day with timestamps that keep increasing.
That regularity is what TSDBs exploit. They partition by time so the active window is small and hot, compress consecutive values together because adjacent samples differ by very little, and avoid in-place updates almost entirely. A purpose-built engine typically has no MVCC, no vacuum chain, and no row versioning (TimescaleDB, which runs inside Postgres, is the exception). A row, once written, is effectively immutable until the retention policy drops the chunk that holds it. Queries are almost always "give me this metric between t1 and t2", so any structure that lets you skip whole files by their time range pays for itself.
Chunks, partitions, shards
Every serious TSDB partitions storage by time first, by series second. A chunk is a contiguous time window (an hour, a day, a week) that holds samples from many series. Prometheus calls them blocks (2-hour windows by default, compacted up). TimescaleDB calls them hypertable chunks. InfluxDB calls them shard groups. Druid calls them segments.
The benefits compound. A query for the last hour touches one chunk, or two if it straddles a boundary. A query for "last week, at 5-minute resolution" touches seven or eight daily chunks and skips everything older. Retention drops a whole chunk at a time: a single file unlink instead of a billion row deletes. Downsampling re-encodes one chunk into a coarser-resolution one, after which the original can be discarded. And because chunks are immutable, they can be checksummed, cached, and replicated as opaque blobs.
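A minimal sketch of the idea, assuming fixed one-day chunks held in a dictionary. The names `ChunkStore`, `chunk_start`, and `CHUNK_SECONDS` are hypothetical and not taken from any of the products above. It shows the two operations the paragraph describes: skipping whole chunks by time range, and dropping expired chunks in one step.

```python
from dataclasses import dataclass, field

CHUNK_SECONDS = 24 * 3600  # assumed chunk width: one day


def chunk_start(ts: int) -> int:
    """Round a Unix timestamp down to the start of its chunk."""
    return ts - ts % CHUNK_SECONDS


@dataclass
class ChunkStore:
    # Maps chunk start time -> list of (timestamp, value) samples.
    chunks: dict[int, list[tuple[int, float]]] = field(default_factory=dict)

    def append(self, ts: int, value: float) -> None:
        self.chunks.setdefault(chunk_start(ts), []).append((ts, value))

    def query(self, t1: int, t2: int) -> list[tuple[int, float]]:
        """Return samples with t1 <= ts < t2, skipping chunks outside the range."""
        out = []
        for start in sorted(self.chunks):
            if start + CHUNK_SECONDS <= t1 or start >= t2:
                continue  # the whole chunk is outside the range and is never scanned
            out.extend(s for s in self.chunks[start] if t1 <= s[0] < t2)
        return out

    def apply_retention(self, now: int, keep_seconds: int) -> int:
        """Drop every chunk that ends at or before the cutoff; return how many were dropped."""
        cutoff = now - keep_seconds
        expired = [s for s in self.chunks if s + CHUNK_SECONDS <= cutoff]
        for s in expired:
            del self.chunks[s]  # stands in for unlinking one file
        return len(expired)
```

In a real engine the dictionary entry would be a file or directory on disk, and the range check would read chunk metadata rather than iterate over keys, but the cost model is the same: work is proportional to the chunks that overlap the query, not to the total data stored.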
Columnar storage and Gorilla compression
Inside a chunk, a TSDB stores each series as two columns, timestamps and values, rather than as rows. Adjacent timestamps differ by a fixed scrape interval (10 s, 30 s, 1 m). The delta-of-delta is almost always zero. Adjacent floating-point values for the same metric differ by very little, so XOR-ing consecutive IEEE 754 doubles produces long runs of zero bits that compress easily.
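The two transforms can be sketched in a few lines. This is only the preprocessing step that makes the data compressible, not a Gorilla encoder: a real implementation goes on to write the results with variable-length bit codes. The function names are illustrative.

```python
import struct


def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Return the second differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values: list[float]) -> list[int]:
    """XOR each value's bit pattern with its predecessor's."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


def meaningful_bits(x: int) -> int:
    """Width of the span from the highest to the lowest set bit (0 if x is 0)."""
    if x == 0:
        return 0
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


# Regular 10-second scrapes: every delta-of-delta is zero.
print(delta_of_delta([1701390000, 1701390010, 1701390020, 1701390030]))  # [0, 0]

# A repeated value XORs to zero; 42.0 -> 42.5 flips a single mantissa bit.
print([meaningful_bits(x) for x in xor_stream([42.0, 42.0, 42.5])])  # [0, 1]
```

An encoder only has to store the meaningful bits plus a short header saying where they sit, which is why slowly changing values cost so little.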
Facebook's Gorilla paper (VLDB 2015) showed this could push 16 bytes per sample (an 8-byte timestamp plus an 8-byte double) down to about 1.37 bytes on average. Prometheus implements a close variant, and InfluxDB's TSM format uses similar delta and XOR encodings. The headline result: a year of one-second samples for a single metric, 31.5 million points, compresses to roughly 43 MB instead of the roughly 500 MB that raw 16-byte samples would need. At fleet scale (a million series, a year) that is the difference between a single host of storage and a small rack.
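The arithmetic behind those figures is easy to reproduce. The sketch below assumes 16 raw bytes per sample and the 1.37-byte average quoted above; real ratios depend on how regular the data is.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 one-second samples
RAW_BYTES_PER_SAMPLE = 16           # 8-byte timestamp + 8-byte double
GORILLA_BYTES_PER_SAMPLE = 1.37     # average quoted from the Gorilla paper


def yearly_bytes(bytes_per_sample: float, series: int = 1) -> float:
    """Bytes needed to hold one year of one-second samples for `series` series."""
    return SECONDS_PER_YEAR * bytes_per_sample * series


raw_mb = yearly_bytes(RAW_BYTES_PER_SAMPLE) / 1e6
packed_mb = yearly_bytes(GORILLA_BYTES_PER_SAMPLE) / 1e6
print(f"one series:     {raw_mb:.0f} MB raw, {packed_mb:.0f} MB compressed")

raw_tb = yearly_bytes(RAW_BYTES_PER_SAMPLE, series=1_000_000) / 1e12
packed_tb = yearly_bytes(GORILLA_BYTES_PER_SAMPLE, series=1_000_000) / 1e12
print(f"million series: {raw_tb:.0f} TB raw, {packed_tb:.0f} TB compressed")
```

That works out to roughly 505 MB versus 43 MB for one series, and roughly 505 TB versus 43 TB for a million.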
raw:
    1701390000 42.13
    1701390010 42.18
    1701390020 42.21
    1701390030 42.19
gorilla:
    timestamps: base=1701390000 Δ=10 ΔΔ=0,0,0 → 1 bit each
    values: base=42.13 XOR streams, 0 0 0 1 → ~4 bits each
Indexing by labels, not rows
A series in a TSDB is identified by a metric name plus a set of label key/value pairs:
http_requests_total{method="GET",status="200",host="api-7"}. The number of distinct
label combinations is the cardinality of the metric (the cartesian product of the label
values is its upper bound), and it dominates everything else
about TSDB performance. A metric with a thousand hosts × ten status codes × four methods is
40 000 series. Add a label like user_id and you might add a million series.
This is where most TSDB outages start.
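The multiplication is worth making explicit. The sketch below computes the worst-case series count as the product of distinct values per label. The function name and the label counts are illustrative, and the real count is the number of combinations actually observed, which is usually lower.

```python
from math import prod


def max_series(distinct_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values.values())


labels = {"host": 1000, "status": 10, "method": 4}
print(max_series(labels))  # 40000

# One unbounded label multiplies the bound instead of adding to it.
labels["user_id"] = 1_000_000
print(max_series(labels))  # 40000000000
```

Not every user hits every host with every method, so the observed count will be far below that bound, but it only takes one series per user to add a million.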
The index is an inverted map from each label/value pair to the set of series IDs that carry
it. Asking for {job="api", status="500"} intersects two posting lists.
Prometheus and Mimir (which reuses the Prometheus TSDB) keep sorted posting lists of series IDs, VictoriaMetrics maintains its own inverted index, and Druid uses compressed bitmaps for its dimension indexes.
InfluxDB shipped TSI as a successor to its earlier in-memory index for the same reason. The
one rule everyone learns the hard way: never put unbounded values (a request ID, a
user ID, a free-text path) into a label.
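A toy version of that inverted index, using Python sets in place of the sorted posting lists or bitmaps a production engine would use. `LabelIndex` and its methods are hypothetical names for illustration.

```python
from collections import defaultdict


class LabelIndex:
    """Inverted index from (label, value) pairs to series IDs."""

    def __init__(self) -> None:
        self._postings: dict[tuple[str, str], set[int]] = defaultdict(set)
        self._next_id = 0

    def add_series(self, labels: dict[str, str]) -> int:
        """Register a new series and return its ID."""
        sid = self._next_id
        self._next_id += 1
        for pair in labels.items():
            self._postings[pair].add(sid)
        return sid

    def select(self, **matchers: str) -> set[int]:
        """Return the IDs of series that carry every label=value pair given."""
        lists = [self._postings.get(pair, set()) for pair in matchers.items()]
        if not lists:
            return set()
        lists.sort(key=len)  # intersect starting from the smallest list
        result = set(lists[0])
        for postings in lists[1:]:
            result &= postings
        return result


index = LabelIndex()
a = index.add_series({"job": "api", "status": "200"})
b = index.add_series({"job": "api", "status": "500"})
c = index.add_series({"job": "worker", "status": "500"})
print(index.select(job="api", status="500"))  # {1}
```

Every distinct label value adds an entry to `_postings`, which is why an unbounded label makes the index itself, not just the sample data, grow without limit.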
Downsampling and retention
Old data is less interesting at full resolution. A dashboard panel showing the last year doesn't render a billion points; it averages them into a few hundred. So instead of paying storage for the raw samples forever, most TSDBs run a background process that resamples them. A common policy: keep raw 1-second data for two weeks, 1-minute aggregates for three months, 1-hour aggregates for two years, daily aggregates forever.
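The resampling step itself is simple. Here is a minimal sketch that averages samples into fixed-width buckets. The function name is illustrative, and production systems usually keep more than the mean (for example min, max, and count) so that later queries can still ask about peaks.

```python
from collections import defaultdict
from statistics import fmean


def downsample(samples: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns (bucket_start, mean) pairs sorted by time.
    """
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
print(downsample(raw, 60))  # [(0, 2.0), (60, 6.0)]
```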
Different products spell it differently. InfluxDB 1.x uses continuous queries plus retention policies (2.x replaces continuous queries with tasks). TimescaleDB has continuous aggregates and compression policies. Prometheus itself does no downsampling and leaves it to long-term-storage layers such as Thanos. Druid does roll-up at ingestion time. The shape is the same: each tier covers a longer window of wall-clock time at coarser resolution, so every tier with a finite retention has a fixed size and total storage grows very slowly even as the system runs forever.
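To see how much a tiered policy saves, count the points kept per series under the example policy above (raw 1-second data for two weeks, 1-minute aggregates for three months, 1-hour aggregates for two years). The sketch assumes three months is 90 days and two years is 730 days; the daily tier is left out of the total because it has no end date and adds 365 points per year.

```python
DAY = 24 * 3600

# (name, resolution in seconds, retention window in seconds)
TIERS = [
    ("raw 1 s", 1, 14 * DAY),
    ("1 min", 60, 90 * DAY),
    ("1 hour", 3600, 730 * DAY),
]


def points_per_series(resolution: int, window: int) -> int:
    """Number of points one series occupies in a tier once the tier is full."""
    return window // resolution


tiered_total = 0
for name, resolution, window in TIERS:
    points = points_per_series(resolution, window)
    tiered_total += points
    print(f"{name:8s} {points:>10,d} points")

raw_two_years = points_per_series(1, 730 * DAY)
print(f"tiered total: {tiered_total:,d}")            # 1,356,720
print(f"raw for two years: {raw_two_years:,d}")      # 63,072,000
```

Under these assumptions the tiered policy holds about 2% of the points that two years of raw one-second data would, and the raw tier accounts for most of what remains.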
The product landscape
There are a handful of shapes of system in this space, and most of the named products fit one of them.
| System | Shape | What it's good at |
|---|---|---|
| Prometheus | Pull, single-node, 2-hour blocks | Operational metrics. Excellent for short retention, alerting, exporters. Not built for years of data or high-cardinality user data. |
| InfluxDB (v2/v3) | Push, TSM storage engine in v2, IOx in v3 | Application metrics, IoT. Good UX; queried with Flux in v2 and with SQL or InfluxQL in v3. The v3 rewrite onto Arrow + Parquet realigns it with the data-lake stack. |
| TimescaleDB | Postgres extension with hypertables | You need full SQL, joins to relational data, ACID. Pays a B-tree tax on writes but gets every Postgres feature for free. |
| Druid | OLAP segments with roll-up at ingest | Real-time analytics over event streams. High-fan-out dashboards. Less suited to per-second operational metrics. |
| ClickHouse | Columnar OLAP with MergeTree | Used as a TSDB by Uber, Cloudflare, Discord. Not strictly a TSDB but the columnar + time-ordered partitioning fits the shape and the query speed is extreme. |
| VictoriaMetrics, Mimir, Thanos | Long-term Prometheus stores | Sit behind Prometheus to take data off-node (via remote write or block upload) and provide global query, longer retention, and in some cases downsampling. |
Where the model breaks
TSDBs assume time moves forward, samples are small and regular, and the cardinality of label combinations stays bounded. When those assumptions break, the system falls over in predictable ways. Backfilling a week-old log file is slow, or rejected outright, because the chunk it lands in is already compacted; Prometheus's "out of order sample" errors trace back here. High-cardinality labels (user IDs, request IDs) explode the index until the host runs out of memory, which is what per-scrape guards such as Prometheus's sample_limit exist to contain. Joins across unrelated metrics are expensive because the index isn't built for them. A relational store is usually the better tool when the data is naturally relational.
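The first two failure modes can be sketched as guards on the write path. This assumes a strict engine that rejects any sample not newer than the last one for its series and that caps the number of active series. `HeadAppender`, `OutOfOrderError`, and `SeriesLimitError` are hypothetical names, not any product's API, and some engines accept late samples within a configured window instead of rejecting them.

```python
class OutOfOrderError(ValueError):
    """Raised when a sample is not newer than the last one for its series."""


class SeriesLimitError(RuntimeError):
    """Raised when a new series would exceed the configured cardinality cap."""


class HeadAppender:
    def __init__(self, max_series: int) -> None:
        self.max_series = max_series
        self.last_ts: dict[str, int] = {}  # series key -> newest timestamp seen

    def append(self, series: str, ts: int) -> None:
        last = self.last_ts.get(series)
        if last is None:
            # A previously unseen series costs index memory, so check the cap first.
            if len(self.last_ts) >= self.max_series:
                raise SeriesLimitError(f"series limit {self.max_series} reached")
        elif ts <= last:
            raise OutOfOrderError(f"{series}: {ts} is not newer than {last}")
        self.last_ts[series] = ts
```

Both checks are cheap at write time, which is why engines prefer to fail fast here rather than reopen a compacted chunk or let the index grow unchecked.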
The usual escape hatch is to treat the TSDB as a fast layer over recent data and push historical, high-cardinality, or join-heavy work to a data warehouse (ClickHouse, BigQuery, Snowflake) that doesn't care about the time-series specialisations.
Further reading
- Pelkonen et al. — Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015)
- Prometheus — local storage docs
- Ganesh Vernekar — Prometheus TSDB internals (series)
- TimescaleDB — hypertables and chunks
- InfluxData — InfluxDB 3.0 system architecture
- Apache Druid — design overview
- ClickHouse — MergeTree engine