
SSDs and the FTL

An SSD looks like a fast block device to the OS. Inside, it's a far stranger machine: NAND flash that can be read in pages, written once per page until the whole block is erased, and worn out after a few thousand erase cycles. The Flash Translation Layer is the firmware that hides all of this — it remaps writes to fresh blocks, garbage-collects stale ones, levels wear across the chip, and presents the illusion of a normal block device. Understanding the FTL is the difference between a drive that lasts a decade and one that bricks itself in months.


NAND cells store charge

A NAND flash cell is a transistor with a floating gate — a pocket of silicon isolated from the rest of the circuit by oxide. To program a cell, you tunnel electrons through the oxide onto the floating gate; to erase, you tunnel them back off. The amount of charge on the gate determines a threshold voltage, which is what the read circuit measures.

| Cell type | Bits/cell | States | Endurance | Cost / GB | Used in |
|---|---|---|---|---|---|
| SLC | 1 | 2 | ~100,000 P/E | ~$3/GB | High-end industrial; obsolete for consumer |
| MLC | 2 | 4 | ~10,000 P/E | ~$0.30/GB | Older enterprise; phasing out |
| TLC | 3 | 8 | ~3,000 P/E | ~$0.07/GB | Mainstream consumer + enterprise |
| QLC | 4 | 16 | ~1,000 P/E | ~$0.05/GB | Read-heavy bulk storage; cold tiers |
| PLC | 5 | 32 | ~150 P/E | <$0.04/GB | Experimental; archive-class workloads |

Each step from SLC → MLC → TLC → QLC packs more bits per cell by squeezing more distinct voltage levels into the same physical range. The cost: narrower windows mean less margin against drift, more error correction overhead, and fewer program/erase cycles before the oxide degrades. Modern consumer drives use TLC with a fast SLC-mode cache region for burst writes.
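
The squeeze can be made concrete with a little arithmetic. The sketch below assumes an idealized cell whose fixed voltage range is split evenly between states (real threshold distributions are uneven); the function names are illustrative only.

```python
def states_per_cell(bits: int) -> int:
    """Number of distinct charge levels needed to store `bits` bits."""
    return 2 ** bits


def relative_margin(bits: int) -> float:
    """Width of one state's voltage window relative to SLC.

    Assumes a fixed total voltage range divided evenly between states,
    which is an idealization.
    """
    return states_per_cell(1) / states_per_cell(bits)


for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4), ("PLC", 5)]:
    print(f"{name}: {states_per_cell(bits):2d} states, "
          f"{relative_margin(bits):.1%} of the SLC window per state")
```

Under this simplification each extra bit halves the window available to every state, which is the margin loss the paragraph describes.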

The asymmetry that makes everything weird

Three different operations, three different sizes, three different speeds:

| Operation | Granularity | Time |
|---|---|---|
| Page read | 4–16 KB | ~50–100 µs |
| Page program (write) | 4–16 KB | ~500 µs–1 ms |
| Block erase | 256 KB–4 MB | ~3–10 ms |

Two things follow from this. First, you can't overwrite in place: a previously programmed page can't be written again until its whole block has been erased. Second, the erase block is tens to hundreds of times larger than the page, so erasing one block wipes many pages at once, including any that still hold valid data. The FTL's whole job is to defer and amortize this.
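
To see why deferring matters, compare a naive in-place update of one page with simply programming a fresh page. This is a rough sketch: the constants are picked from within the ranges in the table above (16 KB pages, 4 MB blocks, 75 µs read, 750 µs program, 5 ms erase) and are assumptions, not figures for any particular chip.

```python
PAGE_KB = 16
BLOCK_KB = 4096
T_READ_US = 75.0      # assumed, within the ~50-100 µs range
T_PROG_US = 750.0     # assumed, within the ~500 µs-1 ms range
T_ERASE_US = 5000.0   # assumed, within the ~3-10 ms range


def pages_per_block(block_kb: int = BLOCK_KB, page_kb: int = PAGE_KB) -> int:
    return block_kb // page_kb


def in_place_update_us() -> float:
    """Read every other page, erase the block, program everything back."""
    n = pages_per_block()
    return (n - 1) * T_READ_US + T_ERASE_US + n * T_PROG_US


def out_of_place_update_us() -> float:
    """Program one fresh page and leave the old copy stale."""
    return T_PROG_US


print(f"pages per block: {pages_per_block()}")
print(f"in-place update: {in_place_update_us() / 1000:.1f} ms")
print(f"out-of-place update: {out_of_place_update_us() / 1000:.2f} ms")
```

With these assumed numbers the in-place path is a few hundred times slower and also burns an erase cycle on every small write.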

The Flash Translation Layer

The FTL maintains a mapping table from logical block address (what the OS sees) to physical page. When the OS writes LBA 0 twice, the FTL writes the new data to a fresh physical page and updates the mapping; the old page is marked stale. No erase is needed at write time.

Eventually free pages run out and the FTL has to reclaim space. It picks a block with many stale pages, copies the still-valid pages elsewhere (called garbage collection), and erases the now-empty block. The erased block re-enters the free pool. This is where write amplification comes from: every garbage-collected valid page is a write the user didn't ask for.
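
A toy model makes the remapping step concrete. The class below is a deliberately minimal sketch of a page-level mapping table; `ToyFTL` and its fields are hypothetical names, and a real FTL also tracks blocks, persists its table, and runs garbage collection.

```python
class ToyFTL:
    """Page-level mapping: logical block address -> physical page number."""

    def __init__(self, num_pages):
        self.mapping = {}                    # LBA -> physical page
        self.stale = set()                   # pages holding superseded data
        self.free = list(range(num_pages))   # erased pages, in program order
        self.flash = {}                      # physical page -> contents

    def write(self, lba, data):
        if not self.free:
            raise RuntimeError("no free pages: garbage collection needed")
        page = self.free.pop(0)              # always program a fresh page
        self.flash[page] = data
        old = self.mapping.get(lba)
        if old is not None:
            self.stale.add(old)              # the previous copy is now garbage
        self.mapping[lba] = page
        return page

    def read(self, lba):
        return self.flash[self.mapping[lba]]


ftl = ToyFTL(num_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")
print(ftl.mapping, ftl.stale, ftl.read(0))
```

After the second write, LBA 0 points at physical page 1 and page 0 is marked stale; nothing was erased along the way.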

User writes 1 GB sequentially:
  WAF ≈ 1.0 — fresh blocks fill in order, no GC needed

User writes 1 GB randomly across the disk:
  WAF ≈ 2–4 — every block has many invalid pages; GC has to copy survivors

User fills disk to 95%, then writes randomly:
  WAF ≈ 5–20 — almost no spare capacity; GC has to compact constantly

Host-visible endurance scales as 1/WAF. For NAND that can absorb 1 PB of writes:
  at WAF=1  → 1 PB of host writes
  at WAF=10 → 100 TB of host writes

How write amplification eats endurance
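
The same arithmetic as a small helper, reading the 1 PB figure as the total volume of writes the NAND can absorb (an interpretation of the example above). The function names are illustrative.

```python
def write_amplification(nand_bytes_written: float, host_bytes_written: float) -> float:
    """WAF = bytes physically programmed / bytes the host asked to write."""
    return nand_bytes_written / host_bytes_written


def host_write_budget_tb(nand_budget_tb: float, waf: float) -> float:
    """How much the host can write before the NAND budget is used up."""
    return nand_budget_tb / waf


for waf in (1.0, 3.0, 10.0):
    print(f"WAF {waf:4.1f}: {host_write_budget_tb(1000.0, waf):7.1f} TB of host writes")
```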

Wear leveling

Each NAND block can only be erased a few thousand times before the oxide degrades and bits start sticking. Without wear leveling, hot blocks (those holding frequently-rewritten data) would die long before cold blocks. The FTL tracks erase counts and rotates writes across the entire physical address space so wear is uniform.

Two flavours:

  • Dynamic. Applied only at write time: pick the free block with the lowest erase count. Cheap to implement, but it doesn't help with cold data that never moves (read-only files that never need rewriting), so the blocks under that data stay barely worn.
  • Static. Periodically copy cold data to high-erase-count blocks, freeing the low-erase-count ones for hot data. More expensive in WAF but extends drive life significantly. Standard on enterprise SSDs.
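
Both flavours reduce to simple selection rules over per-block erase counts. The following is a sketch under assumptions: the function names, the dictionary of erase counts, and the wear-gap threshold of 100 are all hypothetical, and real firmware uses more elaborate bookkeeping.

```python
def pick_free_block(erase_counts, free_blocks):
    """Dynamic wear leveling: write to the least-worn free block."""
    return min(free_blocks, key=lambda b: erase_counts[b])


def static_move(erase_counts, cold_blocks, free_blocks, threshold=100):
    """Static wear leveling: relocate cold data onto a worn block.

    Returns (source, destination) if the wear gap exceeds `threshold`,
    otherwise None. After the move, the lightly worn source block can be
    erased and reused for hot data.
    """
    if not cold_blocks or not free_blocks:
        return None
    src = min(cold_blocks, key=lambda b: erase_counts[b])   # least-worn cold block
    dst = max(free_blocks, key=lambda b: erase_counts[b])   # most-worn free block
    if erase_counts[dst] - erase_counts[src] > threshold:
        return (src, dst)
    return None


erase_counts = {0: 12, 1: 950, 2: 400, 3: 15, 4: 900}
free_blocks = [1, 2, 4]
cold_blocks = [0, 3]
print(pick_free_block(erase_counts, free_blocks))
print(static_move(erase_counts, cold_blocks, free_blocks))
```

In this example the dynamic rule picks block 2, while the static rule proposes moving the cold data in block 0 onto heavily worn block 1.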

NVMe — many queues, deep queues

Modern NVMe drives support 64+ submission/completion queue pairs, each capable of holding 1024+ outstanding requests. The CPU places a command in a submission queue in host memory and rings a memory-mapped doorbell; the drive DMAs the data; an MSI-X interrupt signals completion. No host CPU does any data copying. IOPS is bounded by queue depth ÷ per-request latency (Little's law):

Queue depth 32 at 80 µs per request → 400,000 IOPS → about 1.5 GB/s with 4 KB blocks.

A real Gen5 NVMe drive specced at "1.5M IOPS random read" achieves that with queue depth ~128 and ~85 µs per request. Drop the queue to depth 1 (a single outstanding request) and IOPS collapses to ~12,000, a drop of more than 100×. This is why database workloads tune their IO concurrency carefully.
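
The queue-depth arithmetic is Little's law rearranged: requests in flight divided by time per request gives requests per second. A minimal sketch with illustrative function names; it assumes latency stays constant as queue depth grows, which real drives only approximate.

```python
def iops(queue_depth: int, latency_us: float) -> float:
    """Little's law: throughput = requests in flight / latency per request."""
    return queue_depth * 1_000_000 / latency_us


def throughput_gb_s(iops_value: float, block_bytes: int = 4096) -> float:
    """Decimal gigabytes per second for a given request size."""
    return iops_value * block_bytes / 1e9


for depth, latency in [(32, 80.0), (128, 85.0), (1, 85.0)]:
    rate = iops(depth, latency)
    print(f"QD {depth:3d} @ {latency:.0f} µs: {rate:>11,.0f} IOPS, "
          f"{throughput_gb_s(rate):.2f} GB/s")
```

With 4,096-byte blocks, 400,000 IOPS comes to about 1.64 decimal GB/s, which is roughly 1.5 GiB/s in binary units.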

Over-provisioning

Every SSD reserves ~7–28% of its raw NAND capacity as over-provisioning — space the OS never sees, used by the FTL for garbage collection and bad-block replacement. Consumer drives keep ~7% (so a "1 TB" drive has ~1075 GB of NAND); enterprise drives keep 25%+ for sustained-write performance.

More over-provisioning means lower WAF means longer life and more consistent latency. Production datacenter drives are often under-formatted — using a 4 TB drive as 3 TB to give the FTL more elbow room. The latency tail under heavy write workloads improves dramatically.
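
The spare-area figures are easy to check. The sketch below expresses over-provisioning as a fraction of raw NAND, matching the wording above; note that vendors often quote it relative to user capacity instead, which gives slightly larger percentages. The 4,300 GB raw figure for the "4 TB" drive is an assumed value for illustration.

```python
def spare_fraction(raw_gb: float, user_gb: float) -> float:
    """Fraction of raw NAND that the host cannot address."""
    return (raw_gb - user_gb) / raw_gb


# Consumer "1 TB" drive with ~1075 GB of NAND.
print(f"consumer: {spare_fraction(1075, 1000):.1%}")

# A "4 TB" drive under-formatted to 3 TB, assuming ~4300 GB of raw NAND.
print(f"under-formatted: {spare_fraction(4300, 3000):.1%}")
```

Under these assumptions, under-formatting lifts the spare area from about 7% to about 30% of the NAND, which is the extra elbow room the FTL gets.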

TRIM and the OS contract

When the OS deletes a file, the SSD doesn't know: the LBAs simply aren't referenced anymore, but from the FTL's view they still hold valid data. Without help, garbage collection treats them as live and copies them around forever. The TRIM command (or its NVMe equivalent, the Deallocate operation of the Dataset Management command) tells the SSD which LBAs are free.

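Modern operating systems send it for you. Linux can issue discards inline (the discard mount option on ext4/xfs/btrfs) or, more commonly, in batches via a periodic fstrim run; macOS enables TRIM automatically on Apple-supplied SSDs; Windows sends TRIM when files are deleted and also re-trims on a schedule. Without TRIM, an SSD's sustained write performance degrades as it appears full from the FTL's view, even when the OS has logically freed most of the disk.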
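
The effect on garbage collection can be shown with a toy count of the pages the FTL believes are valid. This is an illustrative model only; the LBA numbers and the helper name are made up.

```python
def pages_gc_must_copy(block_lbas, ftl_valid_lbas):
    """Pages in a block that the FTL still believes hold live data."""
    return sum(1 for lba in block_lbas if lba in ftl_valid_lbas)


block_lbas = [10, 11, 12, 13]        # LBAs stored in one physical block
ftl_valid = {10, 11, 12, 13}         # FTL's view before the delete

# The OS deletes a file occupying LBAs 10-12 but sends no TRIM:
# the FTL's view is unchanged, so GC would copy all four pages.
without_trim = pages_gc_must_copy(block_lbas, ftl_valid)

# With TRIM, the FTL drops those LBAs and only one page needs copying.
ftl_valid -= {10, 11, 12}
with_trim = pages_gc_must_copy(block_lbas, ftl_valid)

print(without_trim, with_trim)
```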

Power loss and the supercap

Enterprise SSDs include a small bank of capacitors — a "supercap" — that holds enough charge to flush all in-flight writes from DRAM cache to NAND on power loss. This makes fsync honest: a write that returned successfully really is durable, even if the host crashes immediately after.

Consumer SSDs typically don't have a supercap. A write returns success when it lands in the SSD's DRAM cache; if power fails before that data reaches NAND, it's lost. Most filesystems and databases tolerate this because they assume the ACID guarantees come from the application layer (write-ahead log, etc.) — but a server-class workload running on a consumer SSD can lose data on a power event.

Why this matters: when a database or filesystem benchmark says a Samsung 990 Pro does 700,000 fsync/sec, it's lying about the durability part. An enterprise drive with proper power-loss protection sustains ~10,000–30,000 real fsyncs/sec. That gap of roughly 20–70× is the cost of doing it correctly.
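
A rough way to see what your own stack reports is to time a loop of small writes, each followed by fsync. This sketch uses only the Python standard library; `measure_fsync_rate` is an illustrative helper, and the number it returns reflects what the OS and drive acknowledge, not proof that data reached NAND. Point `directory` at the filesystem under test, since the default temporary directory may live in RAM on some systems.

```python
import os
import tempfile
import time


def measure_fsync_rate(n_writes=200, payload=b"x" * 4096, directory=None):
    """Return acknowledged write+fsync operations per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, payload)
            os.fsync(fd)          # ask the OS to flush this file to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return n_writes / elapsed


if __name__ == "__main__":
    print(f"{measure_fsync_rate():,.0f} fsync/s (as acknowledged)")
```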

Why TLC won the market — and what QLC concedes

The cell-types table above looks like a menu of equal trade-offs. In practice the industry has converged hard on TLC for almost everything, with QLC creeping into the capacity tier. The reason is non-linear: every additional bit per cell cuts endurance by roughly 3–10× (compare the rows of the table), halves the read margin, and adds latency (because the read circuit has to discriminate more voltage levels with the same noise floor).

Why most drives are TLC: it is the sweet spot. Endurance is good enough for a 5-year consumer warranty even at 0.5 DWPD, density is 50% higher than MLC (3 bits per cell versus 2), and the cost per bit is acceptable. QLC sits one rung below: fine for cold reads, marginal for sustained writes. Many "QLC" drives actually run a small portion as pseudo-SLC cache to mask the underlying latency, then migrate cold data to QLC blocks in the background. That's why a "QLC" drive can advertise SLC-like peak write speeds for the first ~50 GB and then collapse to 100 MB/s sustained.
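
The cache cliff is easy to model as a two-speed write. In this sketch the 50 GB cache and 100 MB/s sustained rate come from the text, while the 5 GB/s cached speed is an assumed figure; the model also ignores the background draining that partially refills the cache during long writes.

```python
def write_time_s(total_gb, cache_gb=50.0, cache_speed_gb_s=5.0, sustained_gb_s=0.1):
    """Seconds to write `total_gb` when only the first `cache_gb` go fast."""
    fast = min(total_gb, cache_gb)
    slow = max(total_gb - cache_gb, 0.0)
    return fast / cache_speed_gb_s + slow / sustained_gb_s


for size in (10, 50, 200):
    t = write_time_s(size)
    print(f"{size:4d} GB: {t:7.0f} s, average {size / t:.2f} GB/s")
```

Under these assumptions a 50 GB copy finishes in seconds, while a 200 GB copy takes around 25 minutes because three quarters of it runs at the sustained rate.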

Garbage collection — the algorithm

NAND can program a page (write) but only erase a block (a group of typically a few hundred pages). To rewrite a page in place, you'd have to erase the whole block, costing milliseconds of latency. So SSDs never overwrite: they write to a fresh page and remap. Eventually the FTL runs out of fresh pages and has to reclaim them by erasing blocks. That's garbage collection.

Garbage collection (greedy variant):
  1. Pick the block with the highest "stale page" ratio.
  2. Read all VALID pages from that block.
  3. Write them to a fresh block.
  4. Update the FTL mapping for the rewritten LBAs.
  5. Erase the now-empty old block.

Cost per reclaimed block:
  - (V × tPROG) for valid-page rewrites
  - 1 × tBERS for the block erase
  - V × FTL update transactions

  where V = valid pages in the chosen block, tPROG ≈ 0.5 ms, tBERS ≈ 3-5 ms

The choice of which block to GC is the policy. Greedy minimises immediate WAF, but blocks full of cold, valid data are never selected and so never get rewritten, which hurts wear leveling. Real FTLs use cost-benefit policies that weight stale-ratio against block age, balancing GC efficiency against even wear distribution. Some interfaces (Open-Channel SSDs, driven on Linux by LightNVM, and NVMe Zoned Namespaces, ZNS) expose the block structure to the host so the database can do its own GC, eliminating the FTL/host double-bookkeeping.
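
The two victim-selection policies can be sketched side by side. The cost-benefit score used here, age × (1 − u) / (2u) with u the fraction of valid pages, is one formulation from the flash file system literature and is an assumption on my part; actual FTL policies are proprietary. `Block` and the function names are hypothetical, and the erase time of 4 ms is the midpoint of the 3-5 ms range above.

```python
from dataclasses import dataclass

T_PROG_MS = 0.5   # page program time, from the listing above
T_BERS_MS = 4.0   # block erase time, assumed midpoint of 3-5 ms


@dataclass
class Block:
    block_id: int
    valid: int     # pages still referenced by the mapping table
    stale: int     # pages superseded by newer writes
    age: float     # time since the block was last written, arbitrary units


def greedy_victim(blocks):
    """Pick the block with the highest stale-page ratio."""
    return max(blocks, key=lambda b: b.stale / (b.valid + b.stale))


def cost_benefit_victim(blocks):
    """Pick the block maximising age * (1 - u) / (2u), u = valid fraction."""
    def score(b):
        u = b.valid / (b.valid + b.stale)
        if u == 0:
            return float("inf")   # nothing to copy: always worth reclaiming
        return b.age * (1 - u) / (2 * u)
    return max(blocks, key=score)


def reclaim_cost_ms(block):
    """Valid-page rewrites plus one erase, as in the cost listing above."""
    return block.valid * T_PROG_MS + T_BERS_MS


blocks = [
    Block(0, valid=10, stale=246, age=1.0),    # hot, almost entirely stale
    Block(1, valid=64, stale=192, age=50.0),   # old, mostly stale
    Block(2, valid=200, stale=56, age=5.0),    # mostly valid
]
print(greedy_victim(blocks).block_id, cost_benefit_victim(blocks).block_id)
print(reclaim_cost_ms(blocks[0]), "ms")
```

Greedy takes block 0 because it is cheapest to reclaim right now; the age-weighted score prefers block 1, whose remaining data has been stable for longer.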

NVMe vs SATA — the protocol gap

SATA was designed for spinning rust. Its command queue is 32 entries deep. There is one queue. SCSI/AHCI's per-command overhead is ~3 µs. None of this matters at 500 MB/s; all of it matters at 14 GB/s.

NVMe (2011) rewrote the storage protocol from scratch to assume flash. The big ideas:

| Feature | AHCI (SATA) | NVMe |
|---|---|---|
| Command queues | 1 queue, 32 entries | Up to 64K queues, each 64K entries |
| Per-command CPU overhead | ~3 µs (locked, single queue) | ~0.3 µs (lockless, per-core queue) |
| Interrupt model | Single line, all completions through one core | MSI-X, one vector per queue, pinned per core |
| Doorbell mechanism | Centralised | One MMIO doorbell per submission queue |
| Random 4 KB IOPS (top consumer drive) | ~100,000 | ~1.5 million |

The IOPS gap isn't just bandwidth; it's CPU concurrency. NVMe lets each core drive its own queue with no cross-core synchronisation. That's why a modern NVMe stack can saturate the drive by having each core submit to its own queue, while SATA's single shared queue fundamentally cannot.
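
A crude ceiling follows from the per-command overheads in the table. The model below is a back-of-the-envelope sketch, not a benchmark: it assumes submission cost is the only bottleneck, that a locked shared queue serialises all cores, and that per-core queues scale linearly. The function name is illustrative.

```python
def cpu_bound_iops(overhead_us: float, cores: int, shared_queue: bool) -> float:
    """Upper bound on IOPS imposed by host-side per-command overhead."""
    per_core = 1_000_000 / overhead_us
    # A single locked queue serialises submissions, so extra cores don't help.
    return per_core if shared_queue else per_core * cores


ahci = cpu_bound_iops(3.0, cores=8, shared_queue=True)
nvme = cpu_bound_iops(0.3, cores=8, shared_queue=False)
print(f"AHCI ceiling: {ahci:,.0f} IOPS")
print(f"NVMe ceiling: {nvme:,.0f} IOPS")
```

Even this simple model puts the AHCI ceiling in the low hundreds of thousands of IOPS, while the NVMe host-side ceiling sits far above what any single drive delivers.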

DWPD and how to read the spec sheet

DWPD — Drive Writes Per Day — is the headline endurance number. A 1 TB drive rated 1 DWPD over 5 years means the manufacturer guarantees you can write 1 TB per day, every day, for 5 years (1825 TB total). Past that you're outside warranty; the drive will still work, usually for a while.
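
Converting between DWPD and total terabytes written (TBW) is a one-line calculation. A small sketch with illustrative function names, assuming a 365-day year as in the example above:

```python
def tbw(capacity_tb: float, dwpd: float, warranty_years: float = 5.0) -> float:
    """Total terabytes written that the warranty covers."""
    return capacity_tb * dwpd * 365 * warranty_years


def dwpd_from_tbw(tbw_tb: float, capacity_tb: float, warranty_years: float = 5.0) -> float:
    """Inverse conversion, for spec sheets that quote TBW only."""
    return tbw_tb / (capacity_tb * 365 * warranty_years)


print(tbw(1.0, 1.0))                       # the 1 TB, 1 DWPD example
print(round(dwpd_from_tbw(600.0, 1.0), 2)) # a drive specced at 600 TBW
```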

| Class | Typical DWPD | Use case |
|---|---|---|
| Consumer (read-mostly) | 0.1-0.3 | Laptop, gaming desktop. Power-loss not protected. |
| Consumer (gaming/prosumer) | 0.5-1.0 | Workstation, content creation. |
| Enterprise (read-mostly) | 1.0-3.0 | Read-heavy DB, log storage, web servers. |
| Enterprise (mixed) | 3.0-10.0 | OLTP, virtualisation hosts. |
| Enterprise (write-intensive) | 10.0-30.0+ | Hot caches, journaling, write-amplified workloads. |

A 30 DWPD enterprise drive vs a 0.3 DWPD consumer drive isn't using better NAND: both are TLC. The enterprise drive has 25-30% over-provisioning, a supercap for power-loss protection, and ECC margin tuned aggressively. Same NAND, two orders of magnitude more endurance in the rated spec (30 ÷ 0.3 = 100×), ~3x the price per GB. The DWPD number is essentially "how much over-provisioning did the vendor agree to honour."

Common misconceptions

  • "SSDs are random-access devices." Their interface is, but the underlying NAND is fundamentally sequential at the block level: pages within a block are programmed in order and can't be rewritten until the whole block is erased. The FTL hides this with great cleverness, but workload patterns still leak through. Sequential write patterns hit far higher throughput than random writes of the same volume.
  • "Filling an SSD doesn't matter." It matters a lot. Once free space drops below ~20%, the FTL has less room for garbage collection. WAF spikes; sustained write performance collapses; latency tail explodes. Keep production SSDs below ~70% utilization.
  • "All NVMe drives are the same." Random read performance varies 10× across drives at the same Gen5 spec. Sustained write throughput varies 50×. The cheap drive's headline number is achieved for the first few GB; after that, the SLC cache fills and TLC speeds dominate.
  • "NVMe replaced SCSI." NVMe replaced SCSI as the high-end interface, but legacy SCSI is still everywhere — SAS in enterprise storage, parallel SCSI in some industrial gear, and the entire VMware vSAN protocol stack. SCSI's RDMA cousin, iSER, also remains in production for shared-storage clusters.

Numbers worth remembering

| Quantity | Value |
|---|---|
| NAND page size | 4–16 KB |
| NAND erase block size | 256 KB – 4 MB |
| Page read time | ~50–100 µs |
| Page program time | ~500 µs – 1 ms |
| Block erase time | ~3–10 ms |
| TLC endurance | ~3,000 P/E cycles |
| QLC endurance | ~1,000 P/E cycles |
| Typical consumer over-provisioning | ~7% |
| Typical enterprise over-provisioning | ~28% |
| NVMe Gen5 sequential read | ~14 GB/s |
| NVMe Gen5 random read latency | ~50–100 µs (bounded by the NAND page read) |
| NVMe queue depth, max | ~64K queues × 64K depth |
| NVMe doorbell write latency | ~1 µs |
| Enterprise SSD fsync rate | ~10,000–30,000/s |
