Time-series databases

Time-series workloads are the easy case in disguise. The data is append-only, timestamped and almost always in order, mostly numeric, and read in time ranges. A general-purpose database can hold it, but a TSDB exploits every one of those properties: time-partitioned chunks, columnar compression of regular value streams, automatic downsampling, and retention policies that drop old data instead of vacuuming around it.


What makes time-series different

In a transactional workload, writes are scattered across the keyspace and you have to keep B-trees balanced. In a time-series workload, writes arrive at the head of a timeline that advances by one second every second. The newest data is almost always the only writable region, and rows for the same series (cpu_idle for host A, say) arrive thousands of times per day with timestamps that keep increasing.

That regularity is what TSDBs exploit. They partition by time so the active window is small and hot, compress consecutive values together because adjacent samples differ by very little, and avoid in-place updates. A purpose-built TSDB typically has no MVCC, no vacuum chain, no row versioning. A row, once written, is immutable until the retention policy drops the chunk that holds it. Queries are almost always "give me this metric between t1 and t2", so any structure that lets you skip whole files by their time range pays for itself.

Chunks, partitions, shards

Every serious TSDB partitions storage by time first, by series second. A chunk is a contiguous time window (an hour, a day, a week) that holds samples from many series. Prometheus calls them blocks (2-hour windows by default, compacted up). TimescaleDB calls them hypertable chunks. InfluxDB calls them shard groups. Druid calls them segments.

The benefits compound. A query for the last hour touches one block. A query for "last week, at 5-minute resolution" touches seven daily blocks and skips everything older. Retention drops a whole chunk at a time, a single file unlink instead of a billion row deletes. Downsampling re-encodes one chunk into a coarser-resolution one and discards the original. And because chunks are immutable, they can be checksummed, cached, and replicated as opaque blobs.
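A minimal sketch of why time partitioning pays off, assuming each chunk carries only a start and end timestamp as metadata. The Chunk class, the file names, and the eight-day retention are hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass

DAY = 86_400  # seconds


@dataclass(frozen=True)
class Chunk:
    """Metadata for one immutable chunk covering [start, end) in Unix seconds."""
    path: str
    start: int
    end: int


def chunks_for_range(chunks, t1, t2):
    """Return only the chunks whose window overlaps the query range [t1, t2)."""
    return [c for c in chunks if c.start < t2 and c.end > t1]


def expired_chunks(chunks, now, retention_seconds):
    """Return chunks that lie entirely before the retention cutoff."""
    cutoff = now - retention_seconds
    return [c for c in chunks if c.end <= cutoff]


# Ten hypothetical daily chunks covering day 0 through day 9.
chunks = [Chunk(f"chunk-{d:03d}", d * DAY, (d + 1) * DAY) for d in range(10)]
now = 10 * DAY

last_hour = chunks_for_range(chunks, now - 3600, now)     # 1 chunk
last_week = chunks_for_range(chunks, now - 7 * DAY, now)  # 7 chunks
to_drop = expired_chunks(chunks, now, retention_seconds=8 * DAY)  # 2 chunks
```

Neither function opens a chunk: both the query planner and the retention job decide from metadata alone, which is the property the paragraph above relies on.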

The non-obvious rule. Pick a chunk size where one chunk fits comfortably in memory during a write burst. Too small and the chunk-count overhead (one file per chunk per series, in some designs) crushes the filesystem. Too large and a single corrupted chunk takes too much data down with it. Prometheus's 2-hour default is a deliberate compromise; TimescaleDB defaults to seven-day chunks and lets you tune the interval.

Columnar storage and Gorilla compression

Inside a chunk, a TSDB stores each series as two columns, timestamps and values, rather than as rows. Adjacent timestamps differ by a fixed scrape interval (10 s, 30 s, 1 m). The delta-of-delta is almost always zero. Adjacent floating-point values for the same metric differ by very little, so XOR-ing consecutive IEEE 754 doubles produces long runs of zero bits that compress easily.

Facebook's Gorilla paper (VLDB 2015) showed this could push 16 bytes per sample (an 8-byte timestamp plus an 8-byte double) down to about 1.37 bytes on average. Prometheus implements a close variant, and InfluxDB's TSM storage format uses similar delta and XOR encodings. The headline result: a year of one-second samples for a single metric, 31.5 million points, compresses to roughly 43 MB instead of the 500 MB a row store would need. At fleet scale (a million series, a year) that is the difference between a single host of storage and a small rack.
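The arithmetic behind those figures, as a quick check. The 16-byte raw size and the 1.37-byte average come from the paragraph above; the decimal-megabyte convention is an assumption.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 one-second samples

RAW_BYTES_PER_SAMPLE = 16        # 8-byte timestamp + 8-byte double
GORILLA_BYTES_PER_SAMPLE = 1.37  # average reported for Gorilla-style encoding

raw_mb = SECONDS_PER_YEAR * RAW_BYTES_PER_SAMPLE / 1e6          # about 505 MB
gorilla_mb = SECONDS_PER_YEAR * GORILLA_BYTES_PER_SAMPLE / 1e6  # about 43 MB

# A million series for a year: MB per series * 1e6 series / 1e6 MB per TB.
fleet_raw_tb = raw_mb          # about 505 TB
fleet_gorilla_tb = gorilla_mb  # about 43 TB

print(f"one series: {raw_mb:.0f} MB raw, {gorilla_mb:.0f} MB compressed")
```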

raw:    1701390000  42.13
        1701390010  42.18
        1701390020  42.21
        1701390030  42.19

gorilla:
  timestamps:  base=1701390000  Δ=10  ΔΔ=0,0,0       → 1 bit each
  values:      base=42.13       XOR streams, 0 0 0 1 → ~4 bits each
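The two transforms can be sketched in a few lines. This shows only the delta-of-delta and XOR steps, not the variable-length bit packing that produces the final byte counts. The helper names are hypothetical, and the sample values are chosen to be exactly representable in binary so the XOR results are easy to read; decimal readings such as 42.13 leave noisier low-order bits.

```python
import struct


def delta_of_delta(timestamps):
    """Return the second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x):
    """Reinterpret a Python float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR each value's bit pattern with its predecessor's."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


def meaningful_bits(x):
    """Count the bits from the highest set bit down to the lowest set bit."""
    if x == 0:
        return 0
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


timestamps = [1701390000, 1701390010, 1701390020, 1701390030]
values = [42.0, 42.0, 42.5, 43.0]  # illustrative, exactly representable

dod = delta_of_delta(timestamps)                      # [0, 0]
widths = [meaningful_bits(x) for x in xor_stream(values)]  # [0, 1, 2]
```

A zero in either stream is what the encoder stores as a single bit; a non-zero XOR costs only its run of meaningful bits plus a small header.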

Indexing by labels, not rows

A series in a TSDB is identified by a metric name plus a set of label key/value pairs: http_requests_total{method="GET",status="200",host="api-7"}. The number of distinct label combinations, at worst the size of the cartesian product of the label values, is the cardinality of the metric, and it dominates everything else about TSDB performance. A metric with a thousand hosts × ten status codes × four methods is 40 000 series. Add a label like user_id and you might add a million series. This is where most TSDB outages start.
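The multiplication is worth doing before adding a label. A minimal sketch, assuming you know roughly how many distinct values each label takes; the result is an upper bound, since real cardinality is the number of combinations that actually occur.

```python
from math import prod


def series_cardinality(label_value_counts):
    """Upper bound on series count: product of distinct values per label."""
    return prod(label_value_counts.values())


labels = {"host": 1_000, "status": 10, "method": 4}
base = series_cardinality(labels)  # 40_000

# Hypothetical: one extra label with a million distinct values.
labels_with_user = {**labels, "user_id": 1_000_000}
worst_case = series_cardinality(labels_with_user)  # 40_000_000_000
```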

The index is an inverted map from each label/value pair to the set of series IDs that carry it. Asking for {job="api", status="500"} intersects two posting lists. Prometheus, VictoriaMetrics, and Mimir all keep these as sorted posting lists or bitmap-style sets so that intersection stays cheap. InfluxDB shipped TSI as a successor to its earlier in-memory index for the same reason. The one rule everyone learns the hard way: never put unbounded values (a request ID, a user ID, a free-text path) into a label.
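A toy version of that inverted index, using plain Python sets where a real system would use sorted lists or compressed bitmaps. The LabelIndex class and its methods are hypothetical and only support exact-match selectors.

```python
from collections import defaultdict


class LabelIndex:
    """Maps each (label, value) pair to the set of series IDs carrying it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._series = {}

    def add(self, series_id, labels):
        """Register a series and add it to one posting list per label pair."""
        self._series[series_id] = dict(labels)
        for pair in labels.items():
            self._postings[pair].add(series_id)

    def select(self, **matchers):
        """Return IDs of series matching every label=value matcher."""
        lists = [self._postings.get(pair, set()) for pair in matchers.items()]
        if not lists:
            return set(self._series)
        lists.sort(key=len)  # start from the smallest posting list
        result = set(lists[0])
        for postings in lists[1:]:
            result &= postings
        return result


index = LabelIndex()
index.add(1, {"job": "api", "status": "200", "host": "api-7"})
index.add(2, {"job": "api", "status": "500", "host": "api-7"})
index.add(3, {"job": "web", "status": "500", "host": "web-1"})

errors_on_api = index.select(job="api", status="500")  # {2}
```

Every new label value adds a posting list, and every new series adds an entry to several of them, which is why unbounded label values inflate the index so quickly.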

Downsampling and retention

Old data is less interesting at full resolution. A dashboard panel showing the last year doesn't render a billion points; it averages them into a few hundred. So instead of paying storage for the raw samples forever, most TSDBs run a background process that resamples them. A common policy: keep raw 1-second data for two weeks, 1-minute aggregates for three months, 1-hour aggregates for two years, daily aggregates forever.
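A minimal sketch of the resampling step, assuming samples are (timestamp, value) pairs and that a plain average is an acceptable aggregate. The downsample function is hypothetical; real systems usually keep min, max, sum, and count per bucket so that later queries can still compute rates and percentile bounds.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


# Three minutes of synthetic 1-second samples: a sawtooth from 0 to 59.
raw = [(t, float(t % 60)) for t in range(180)]

per_minute = downsample(raw, 60)  # [(0, 29.5), (60, 29.5), (120, 29.5)]
```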

Different products spell it differently. InfluxDB uses continuous queries plus retention policies. TimescaleDB has continuous aggregates and compression policies. Prometheus defers downsampling to long-term-storage layers like Cortex/Mimir/Thanos. Druid does roll-up at ingestion time. The shape is the same: each tier covers a longer span of wall-clock time at coarser resolution, so total storage grows far more slowly than the raw ingest rate even as the system runs forever.

Why this matters. Without downsampling, your TSDB grows linearly forever and any query that spans a year scans every raw sample. With it, a query served at a suitable resolution stays cheap regardless of how long the system has run. For long time ranges, this is the closest thing TSDBs have to indexing.
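To put numbers on it, here is the per-series point count for a tiered policy of raw 1-second data for two weeks, 1-minute aggregates for three months, and 1-hour aggregates for two years. Taking three months as 90 days and two years as 730 days is an assumption, and a daily tier kept forever would add 365 points per series per year on top.

```python
DAY = 86_400  # seconds

# (tier name, resolution in seconds, retention in seconds)
TIERS = [
    ("raw 1s", 1, 14 * DAY),
    ("1-minute", 60, 90 * DAY),
    ("1-hour", 3_600, 730 * DAY),
]


def points_retained(resolution_seconds, retention_seconds):
    """Number of points one series holds in a tier at steady state."""
    return retention_seconds // resolution_seconds


per_tier = {name: points_retained(res, keep) for name, res, keep in TIERS}
tiered_total = sum(per_tier.values())  # 1_356_720 points

# The same two years kept entirely at 1-second resolution.
raw_only_total = points_retained(1, 730 * DAY)  # 63_072_000 points
```

The tiered total is fixed once the longest tier has filled, while the raw-only figure keeps growing with every additional year.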

The product landscape

There are roughly four shapes of system in this space, and most of the named products fit one of them.

| System | Shape | What it's good at |
|---|---|---|
| Prometheus | Pull, single-node, 2-hour blocks | Operational metrics. Excellent for short retention, alerting, exporters. Not built for years of data or high-cardinality user data. |
| InfluxDB (v2/v3) | Push, TSM storage engine, IOx in v3 | Application metrics, IoT. Good UX, with Flux in v2 and SQL in v3 as query languages. The v3 rewrite onto Arrow + Parquet realigns it with the data-lake stack. |
| TimescaleDB | Postgres extension with hypertables | You need full SQL, joins to relational data, ACID. Pays a B-tree tax on writes but gets every Postgres feature for free. |
| Druid | OLAP segments with roll-up at ingest | Real-time analytics over event streams. High-fan-out dashboards. Less suited to per-second operational metrics. |
| ClickHouse | Columnar OLAP with MergeTree | Used as a TSDB by Uber, Cloudflare, Discord. Not strictly a TSDB but the columnar + time-ordered partitioning fits the shape and the query speed is extreme. |
| VictoriaMetrics, Mimir, Thanos | Long-term Prometheus stores | Sit behind Prometheus to take blocks off-node and provide global query, longer retention, and downsampling. |

Where the model breaks

TSDBs assume time moves forward, samples are small and regular, and the cardinality of label combinations stays bounded. When those assumptions break, the system falls over in predictable ways. Backfilling a week-old log file is slow because the chunk it lands in is already compacted, and samples that arrive behind the head of a series may be rejected outright (Prometheus's "out of order sample" error). High-cardinality labels (user IDs, request IDs) explode the index until the host runs out of memory; Prometheus's "sample limit exceeded" scrape error is a guard against exactly this. Joins across unrelated metrics are expensive because the index isn't built for them. A relational store is usually the better tool when the data is naturally relational.
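The "time moves forward" assumption often shows up as a check in the write path. A minimal sketch, assuming a per-series in-memory head that only accepts strictly newer timestamps; the SeriesHead class and OutOfOrderError are hypothetical and do not reproduce any product's actual behaviour or tolerance windows.

```python
class OutOfOrderError(Exception):
    """Raised when a sample is not newer than the series head."""


class SeriesHead:
    """Append-only buffer for one series; timestamps must strictly increase."""

    def __init__(self):
        self.samples = []

    def append(self, ts, value):
        """Append a sample, rejecting anything at or behind the newest one."""
        if self.samples and ts <= self.samples[-1][0]:
            raise OutOfOrderError(
                f"timestamp {ts} is not newer than head {self.samples[-1][0]}"
            )
        self.samples.append((ts, value))


head = SeriesHead()
head.append(1701390000, 42.13)
head.append(1701390010, 42.18)

try:
    head.append(1701390005, 42.15)  # late sample, lands behind the head
    rejected = False
except OutOfOrderError:
    rejected = True
```

Because the head never has to insert into the middle, appends stay cheap and the compressed stream stays valid; the cost is that late data needs a separate backfill path.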

The usual escape hatch is to treat the TSDB as a fast layer over recent data and push historical, high-cardinality, or join-heavy work to a data warehouse (ClickHouse, BigQuery, Snowflake) that doesn't care about the time-series specialisations.
