What is the FTL in an SSD?

The Flash Translation Layer is the firmware on every SSD that maps logical block addresses (what the OS sees) to physical NAND locations. It exists because NAND can be read or programmed in pages but only erased in much larger blocks, and pages can't be overwritten without erasing the whole block first. The FTL handles wear leveling, garbage collection, and crash consistency entirely behind the SSD's interface.

What is write amplification?

Write amplification is the ratio of physical NAND bytes written to user bytes written. If your application writes 1 GB but the SSD has to write 2.5 GB internally (due to garbage collection, wear leveling, and metadata), the WAF is 2.5. WAF is the dominant factor in SSD endurance: a drive rated for 1 PB of writes at WAF=1 only delivers 400 TB at WAF=2.5.

What is the difference between SLC and TLC NAND?

SLC stores 1 bit per cell (2 voltage levels). TLC stores 3 bits per cell (8 voltage levels). TLC packs 3× more capacity per silicon area but needs more careful programming and reading because the voltage windows are narrower. SLC endurance is ~100,000 program/erase cycles; TLC is ~3,000. Most modern consumer SSDs use TLC NAND with a small SLC cache region for burst writes.

SSDs and the FTL

An SSD looks like a fast block device to the OS. Inside, it's a far stranger machine: NAND flash that can be read in pages, written once per page until the whole block is erased, and worn out after a few thousand erase cycles. The Flash Translation Layer is the firmware that hides all of this — it remaps writes to fresh blocks, garbage-collects stale ones, levels wear across the chip, and presents the illusion of a normal block device. Understanding the FTL is the difference between a drive that lasts a decade and one that bricks itself in months.

NAND cells store charge

A NAND flash cell is a transistor with a floating gate — a pocket of silicon isolated from the rest of the circuit by oxide. To program a cell, you tunnel electrons through the oxide onto the floating gate; to erase, you tunnel them back off. The amount of charge on the gate determines a threshold voltage, which is what the read circuit measures.

Cell type	Bits/cell	States	Endurance	Cost / GB	Used in
SLC	1	2	~100,000 P/E	~$3/GB	High-end industrial; obsolete for consumer
MLC	2	4	~10,000 P/E	~$0.30/GB	Older enterprise; phasing out
TLC	3	8	~3,000 P/E	~$0.07/GB	Mainstream consumer + enterprise
QLC	4	16	~1,000 P/E	~$0.05/GB	Read-heavy bulk storage; cold tiers
PLC	5	32	~150 P/E	<$0.04/GB	Experimental; archive-class workloads

Each step from SLC → MLC → TLC → QLC packs more bits per cell by squeezing more distinct voltage levels into the same physical range. The cost: narrower windows mean less margin against drift, more error correction overhead, and fewer program/erase cycles before the oxide degrades. Modern consumer drives use TLC with a fast SLC-mode cache region for burst writes.
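The relationship between bits per cell and voltage states is a power of two. The sketch below makes that concrete; the function names and the assumption that a fixed voltage range is split evenly between states are illustrative simplifications, not datasheet facts.

```python
def voltage_states(bits_per_cell: int) -> int:
    """Number of distinct threshold-voltage states a cell must hold."""
    return 2 ** bits_per_cell


def relative_window(bits_per_cell: int) -> float:
    """Width of one state's voltage window relative to SLC.

    Assumption (illustration only): the usable voltage range is fixed
    and divided evenly between the states.
    """
    return voltage_states(1) / voltage_states(bits_per_cell)


for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4), ("PLC", 5)]:
    print(f"{name}: {voltage_states(bits):2d} states, "
          f"window = {relative_window(bits):.4f} of SLC")
```

Under this simplified model each extra bit halves the window available to each state, which is where the shrinking margin against drift comes from.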

The asymmetry that makes everything weird

Three different operations, three different sizes, three different speeds:

Operation	Granularity	Time
Page read	4–16 KB	~50–100 µs
Page program (write)	4–16 KB	~500 µs–1 ms
Block erase	256 KB–4 MB	~3–10 ms

Two things follow from this. First, you can't overwrite in place: a programmed page cannot be written again until its whole block has been erased. Second, the erase block is tens to hundreds of times larger than the page (256 KB / 4 KB = 64 pages; 4 MB / 16 KB = 256 pages), so erasing one block wipes out many pages at once, including any that still hold live data. The FTL's whole job is to defer and amortize this.
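To see why in-place updates are a non-starter, here is a rough cost comparison using the low end of the timings in the table above. The function names are hypothetical, and the model assumes a naive controller that reads the other pages out, erases the block, and programs every page back.

```python
KB, MB = 1024, 1024 * 1024

# Timings in milliseconds, taken from the low end of the table above.
PAGE_READ_MS = 0.05
PAGE_PROGRAM_MS = 0.5
BLOCK_ERASE_MS = 3.0


def pages_per_block(block_bytes: int, page_bytes: int) -> int:
    return block_bytes // page_bytes


def in_place_update_ms(pages: int) -> float:
    """Update ONE page in place: read the other pages, erase, reprogram all."""
    survivors = pages - 1
    return survivors * PAGE_READ_MS + BLOCK_ERASE_MS + pages * PAGE_PROGRAM_MS


def remapped_update_ms() -> float:
    """Out-of-place update: program one fresh page and update the mapping."""
    return PAGE_PROGRAM_MS


for block, page in [(256 * KB, 4 * KB), (4 * MB, 16 * KB)]:
    n = pages_per_block(block, page)
    print(f"{n:3d} pages/block: in-place {in_place_update_ms(n):6.1f} ms "
          f"vs remapped {remapped_update_ms():.1f} ms")
```

The gap grows with the number of pages per block, which is why remapping rather than rewriting is the only practical design.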

The Flash Translation Layer

The FTL maintains a mapping table from logical block address (what the OS sees) to physical page. When the OS writes LBA 0 twice, the FTL writes the new data to a fresh physical page and updates the mapping; the old page is marked stale. No erase is needed at write time.

Eventually free pages run out and the FTL has to reclaim space. It picks a block with many stale pages, copies the still-valid pages elsewhere (called garbage collection), and erases the now-empty block. The erased block re-enters the free pool. This is where write amplification comes from: every garbage-collected valid page is a write the user didn't ask for.
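The mapping-plus-garbage-collection loop described above can be sketched as a toy page-mapped FTL. Everything here is an illustrative assumption: the class name `ToyFTL`, the tiny geometry (8 blocks of 4 pages, 16 logical pages), and the simple "pick the block with the fewest valid pages" victim choice. The model relies on the logical capacity being half the physical capacity, so a victim with spare room always exists. Real firmware is far more involved.

```python
import random


class ToyFTL:
    """Page-mapped FTL over a tiny NAND array (illustrative only)."""

    def __init__(self, num_blocks=8, pages_per_block=4, logical_pages=16):
        self.ppb = pages_per_block
        self.logical_pages = logical_pages        # capacity the "OS" sees
        # owner[b][p] = LBA stored in that page, or None if free or stale
        self.owner = [[None] * pages_per_block for _ in range(num_blocks)]
        self.used = [0] * num_blocks              # pages programmed since erase
        self.free = list(range(1, num_blocks))    # erased blocks
        self.active = 0                           # block currently being filled
        self.map = {}                             # LBA -> (block, page)
        self.user_writes = self.nand_writes = self.erases = 0

    def _program(self, lba):
        if self.used[self.active] == self.ppb:    # active block is full
            self.active = self.free.pop(0)
        old = self.map.get(lba)
        if old is not None:
            self.owner[old[0]][old[1]] = None     # old copy becomes stale
        b, p = self.active, self.used[self.active]
        self.owner[b][p] = lba
        self.used[b] += 1
        self.map[lba] = (b, p)
        self.nand_writes += 1

    def _gc(self):
        full = [b for b in range(len(self.used))
                if b != self.active and self.used[b] == self.ppb]
        # Victim = full block with the fewest still-valid pages.
        victim = min(full, key=lambda b: sum(x is not None for x in self.owner[b]))
        for lba in [x for x in self.owner[victim] if x is not None]:
            self._program(lba)                    # copy survivors elsewhere
        self.owner[victim] = [None] * self.ppb    # erase
        self.used[victim] = 0
        self.erases += 1
        self.free.append(victim)

    def write(self, lba):
        if not 0 <= lba < self.logical_pages:
            raise ValueError("LBA out of range")
        if self.used[self.active] == self.ppb and len(self.free) <= 1:
            self._gc()                            # reclaim before we run dry
        self.user_writes += 1
        self._program(lba)

    def waf(self):
        return self.nand_writes / self.user_writes


seq = ToyFTL()
for _ in range(4):                                # overwrite LBAs 0..15 in order
    for lba in range(16):
        seq.write(lba)

rng = random.Random(0)
rnd = ToyFTL()
for _ in range(64):                               # same volume, random LBAs
    rnd.write(rng.randrange(16))

print(f"sequential WAF = {seq.waf():.2f}, random WAF = {rnd.waf():.2f}")
```

In the sequential run every victim block is already fully stale, so no survivors are copied and WAF stays at 1. In the random run victims usually still hold some valid pages, and each copied page counts as a NAND write the user never issued.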

User writes 1 GB sequentially:
  WAF ≈ 1.0 — fresh blocks fill in order, no GC needed

User writes 1 GB randomly across the disk:
  WAF ≈ 2–4 — every block has many invalid pages; GC has to copy survivors

User fills disk to 95%, then writes randomly:
  WAF ≈ 5–20 — almost no spare capacity; GC has to compact constantly

Drive endurance scales as 1/WAF. A 1 PB drive at WAF=1 → 1 PB lifetime;
                              at WAF=10 → 100 TB.

How write amplification eats endurance

Worked example: 100 GB of user writes at a WAF (write amplification factor) of 2.5× produce 250 GB of actual NAND writes. That is 150% extra wear and a drive lifetime multiplier of 1/2.5 (0.4×).

Production guidance: keep SSDs below ~70% utilization, avoid frequent random small writes, use filesystem TRIM/discard so the FTL knows which blocks are free, and prefer append-only patterns (LSM-tree storage engines, log-structured filesystems), which keep WAF near 1.
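The endurance arithmetic in this section is simple enough to write down directly. A minimal sketch, with hypothetical function names, using the figures quoted in the text:

```python
GB, TB, PB = 1e9, 1e12, 1e15


def nand_bytes_written(user_bytes: float, waf: float) -> float:
    """Physical NAND writes caused by a given volume of user writes."""
    return user_bytes * waf


def extra_wear_pct(waf: float) -> float:
    """Wear beyond what the user data alone would cause, in percent."""
    return (waf - 1.0) * 100.0


def effective_endurance(rated_bytes_at_waf1: float, waf: float) -> float:
    """User-visible write budget once amplification is accounted for."""
    return rated_bytes_at_waf1 / waf


print(f"NAND writes : {nand_bytes_written(100 * GB, 2.5) / GB:.1f} GB")
print(f"extra wear  : {extra_wear_pct(2.5):.0f}%")
for waf in (1.0, 2.5, 10.0):
    tb = effective_endurance(1 * PB, waf) / TB
    print(f"WAF {waf:4.1f} -> {tb:.0f} TB of user writes")
```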

Wear leveling

Each NAND block can only be erased a few thousand times before the oxide degrades and bits start sticking. Without wear leveling, hot blocks (those holding frequently-rewritten data) would die long before cold blocks. The FTL tracks erase counts and rotates writes across the entire physical address space so wear is uniform.

Two flavours:

Dynamic. Applies only when writing: pick the free block with the lowest erase count. Cheap to implement, but it doesn't help with cold-but-stuck data (read-only files that never need rewriting), whose blocks never return to the free pool.
Static. Periodically copy cold data to high-erase-count blocks, freeing the low-erase-count ones for hot data. More expensive in WAF but extends drive life significantly. Used by all enterprise SSDs.
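The two flavours differ mainly in which blocks they are allowed to touch. The sketch below shows one possible selection rule for each; the `Block` record, the function names, and the `spread_threshold` trigger are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    block_id: int
    erase_count: int
    is_free: bool
    holds_cold_data: bool = False


def pick_dynamic(blocks: List[Block]) -> Block:
    """Dynamic: among FREE blocks only, write to the least-worn one."""
    free = [b for b in blocks if b.is_free]
    return min(free, key=lambda b: b.erase_count)


def pick_static_move(blocks: List[Block],
                     spread_threshold: int = 100) -> Optional[Tuple[Block, Block]]:
    """Static: move cold data from the least-worn cold block into the
    most-worn free block, but only if the wear gap is large enough."""
    cold = [b for b in blocks if b.holds_cold_data and not b.is_free]
    free = [b for b in blocks if b.is_free]
    if not cold or not free:
        return None
    src = min(cold, key=lambda b: b.erase_count)
    dst = max(free, key=lambda b: b.erase_count)
    if dst.erase_count - src.erase_count < spread_threshold:
        return None
    return src, dst


blocks = [
    Block(0, erase_count=50, is_free=False, holds_cold_data=True),
    Block(1, erase_count=900, is_free=True),
    Block(2, erase_count=400, is_free=True),
    Block(3, erase_count=700, is_free=False),
]
print("dynamic target:", pick_dynamic(blocks).block_id)
move = pick_static_move(blocks)
print("static move   :", (move[0].block_id, move[1].block_id))
```

After the static move, block 0 (barely worn) is erased and becomes available for hot data, while the cold data parks on the worn block 1, where it will rarely trigger further erases.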

NVMe — many queues, deep queues

Modern NVMe drives support 64+ submission/completion queue pairs, each capable of holding 1024+ outstanding requests. The CPU places a command in a submission queue in host memory and rings a memory-mapped doorbell; the drive DMAs the data; an MSI-X interrupt signals completion. No host CPU does any data copying. By Little's law, IOPS is bounded by queue depth ÷ per-request latency:

Worked example: queue depth 32 at 80 µs per request gives 32 / 80 µs = 400,000 IOPS. With 4 KB blocks that is about 1.6 GB/s (1.5 GiB/s) of throughput.

A real Gen5 NVMe drive specced at "1.5 M IOPS random read" achieves that with queue depth ~128 and ~85 µs per request. Drop the queue to depth 1 (a single outstanding request) and IOPS collapses to ~12,000, a drop of more than 100×. This is why database workloads tune their IO concurrency carefully.
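The queue-depth figures in this section follow from Little's law: sustained IOPS is the number of requests in flight divided by the time each one takes. A small sketch with hypothetical function names, assuming latency stays constant as queue depth changes (on real drives it rises under load):

```python
def iops(queue_depth: int, latency_us: float) -> float:
    """Little's law: requests in flight / time per request."""
    return queue_depth / (latency_us * 1e-6)


def throughput_bytes_per_s(iops_value: float, block_bytes: int = 4096) -> float:
    return iops_value * block_bytes


for qd, lat in [(32, 80), (128, 85), (1, 85)]:
    rate = iops(qd, lat)
    gib = throughput_bytes_per_s(rate) / 2**30
    print(f"QD {qd:3d} @ {lat} µs -> {rate:>11,.0f} IOPS, {gib:.2f} GiB/s at 4 KB")
```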

Over-provisioning

Every SSD reserves ~7–28% of its raw NAND capacity as over-provisioning — space the OS never sees, used by the FTL for garbage collection and bad-block replacement. Consumer drives keep ~7% (so a "1 TB" drive has ~1075 GB of NAND); enterprise drives keep 25%+ for sustained-write performance.

More over-provisioning means lower WAF means longer life and more consistent latency. Production datacenter drives are often under-formatted — using a 4 TB drive as 3 TB to give the FTL more elbow room. The latency tail under heavy write workloads improves dramatically.
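Over-provisioning is usually quoted as spare capacity relative to user-visible capacity. The sketch below uses that convention; the function name is hypothetical, and the raw capacity of the "4 TB" drive is an assumption (7% above its advertised size), since the text does not state it.

```python
def over_provisioning_pct(raw_gb: float, user_gb: float) -> float:
    """Spare NAND as a percentage of user-visible capacity."""
    return (raw_gb - user_gb) / user_gb * 100.0


# Consumer example from the text: "1 TB" drive with ~1075 GB of NAND.
print(f"consumer 1 TB : {over_provisioning_pct(1075, 1000):.1f}%")

# Under-formatting: a 4 TB drive used as 3 TB.
# ASSUMPTION: the drive ships with 7% factory over-provisioning.
raw_4tb = 4000 * 1.07
print(f"4 TB as 4 TB  : {over_provisioning_pct(raw_4tb, 4000):.1f}%")
print(f"4 TB as 3 TB  : {over_provisioning_pct(raw_4tb, 3000):.1f}%")
```

Giving up a quarter of the visible capacity moves this drive from single-digit to roughly 40% spare area under the stated assumption, which is why under-formatting is such an effective lever on WAF.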

TRIM and the OS contract

When the OS deletes a file, the SSD doesn't know: the LBAs simply aren't referenced anymore, but from the FTL's view they still hold valid data. Without help, garbage collection treats them as live and copies them around forever. The ATA TRIM command (or its equivalents: the Deallocate attribute of the NVMe Dataset Management command, and SCSI UNMAP) tells the SSD which LBAs are free.

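Modern Linux supports TRIM either continuously, through the discard mount option on ext4/xfs/btrfs, or in batches, by running fstrim periodically (many distributions ship a weekly fstrim.timer); macOS does it automatically on Apple-supplied SSDs; Windows sends TRIM on delete and re-trims in a scheduled weekly optimization job. Without TRIM, an SSD's sustained write performance degrades as it appears full from the FTL's view, even when the OS has logically freed most of the disk.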
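The mismatch between the filesystem's view and the FTL's view can be modelled in a few lines. This is a deliberately simplified sketch: `FtlView` and `delete_file` are hypothetical names, and the model tracks only which LBAs the FTL still considers live.

```python
class FtlView:
    """What the FTL believes is live data (hypothetical, LBA-granular)."""

    def __init__(self):
        self.live_lbas = set()

    def write(self, lba):
        self.live_lbas.add(lba)

    def trim(self, lbas):
        self.live_lbas.difference_update(lbas)

    def pages_gc_must_preserve(self):
        return len(self.live_lbas)


def delete_file(ftl, file_lbas, send_trim):
    """The filesystem frees the LBAs in its own metadata either way;
    the drive only learns about it if a TRIM is sent."""
    if send_trim:
        ftl.trim(file_lbas)


without_trim, with_trim = FtlView(), FtlView()
for ftl in (without_trim, with_trim):
    for lba in range(1000):
        ftl.write(lba)

delete_file(without_trim, range(200, 1000), send_trim=False)
delete_file(with_trim, range(200, 1000), send_trim=True)

print("GC must preserve, no TRIM  :", without_trim.pages_gc_must_preserve())
print("GC must preserve, with TRIM:", with_trim.pages_gc_must_preserve())
```

Both filesystems consider 80% of the space free, but only the drive that received the TRIM can skip those pages during garbage collection.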

Power loss and the supercap

Enterprise SSDs include a small bank of capacitors — a "supercap" — that holds enough charge to flush all in-flight writes from DRAM cache to NAND on power loss. This makes fsync honest: a write that returned successfully really is durable, even if the host crashes immediately after.

Consumer SSDs typically don't have a supercap. A write returns success when it lands in the SSD's DRAM cache; if power fails before that data reaches NAND, it's lost. Most filesystems and databases tolerate this because they assume the ACID guarantees come from the application layer (write-ahead log, etc.) — but a server-class workload running on a consumer SSD can lose data on a power event.

Why this matters: when a database or filesystem benchmark says a Samsung 990 Pro does 700,000 fsync/sec, it's lying about the durability part. An enterprise drive with proper power-loss protection sustains ~10,000–30,000 real fsyncs/sec. That gap of roughly 20–70× is the cost of doing it correctly.
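A quick way to sanity-check an fsync figure on your own hardware is to time a loop of small writes, each followed by `os.fsync`. This is a rough sketch, not a rigorous benchmark: the function name and defaults are hypothetical, and the result reflects whatever filesystem holds the temporary directory (on a RAM-backed tmpfs the number says nothing about the SSD).

```python
import os
import tempfile
import time


def measure_fsync_rate(n_ops: int = 200, payload: bytes = b"x" * 4096) -> float:
    """Return fsync calls per second for small appended writes."""
    with tempfile.NamedTemporaryFile() as f:
        fd = f.fileno()
        start = time.perf_counter()
        for _ in range(n_ops):
            os.write(fd, payload)   # append 4 KB
            os.fsync(fd)            # ask the OS to push it to stable storage
        elapsed = time.perf_counter() - start
    return n_ops / elapsed


if __name__ == "__main__":
    print(f"{measure_fsync_rate():,.0f} fsync/s")
```

If the result is implausibly high for the drive class, some layer (the filesystem, a virtual disk, or the drive itself) is probably acknowledging the flush before the data is on NAND.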

Why TLC won the market — and what QLC concedes

The cell-types table above looks like a menu of equal trade-offs. In practice the industry has converged hard on TLC for almost everything, with QLC creeping into the capacity tier. The reason is non-linear: every additional bit per cell cuts endurance by roughly 3–10× (compare the endurance column in the table), halves the voltage window available to each state, and adds latency (because the read circuit has to discriminate more voltage levels with the same noise floor).

Why most drives are TLC. Sweet spot. Endurance is good enough for a 5-year consumer warranty even at 0.5 DWPD, density is 1.5× that of MLC (3 bits per cell versus 2), and the cost per bit is acceptable. QLC sits one rung below: fine for cold reads, marginal for sustained writes. Many "QLC" drives actually run a small portion as pseudo-SLC cache to mask the underlying latency, then migrate cold data to QLC blocks in the background. That's why a "QLC" drive can advertise SLC-like peak write speeds for the first ~50 GB and then collapse to 100 MB/s sustained.

Garbage collection — the algorithm

NAND can program a page (write) but only erase a block (a group of tens to hundreds of pages, often 256-512). To rewrite a page in place, you'd have to erase the whole block, costing milliseconds of latency. So SSDs never overwrite: they write to a fresh page and remap. Eventually the FTL runs out of fresh pages and has to reclaim them by erasing blocks. That's garbage collection.

Garbage collection (greedy variant):
  1. Pick the block with the highest "stale page" ratio.
  2. Read all VALID pages from that block.
  3. Write them to a fresh block.
  4. Update the FTL mapping for the rewritten LBAs.
  5. Erase the now-empty old block.

Cost per reclaimed block:
  - (V × tR) for valid-page reads
  - (V × tPROG) for valid-page rewrites
  - 1 × tBERS for the block erase
  - V × FTL update transactions

  where V = valid pages in the chosen block, tR ≈ 50-100 µs,
        tPROG ≈ 0.5 ms, tBERS ≈ 3-5 ms
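The cost list above translates directly into a small calculator. The function names and default timings are illustrative assumptions (mid-range values from this page). The read term covers step 2 of the algorithm, and the steady-state WAF formula assumes every victim block has the same number of valid pages, which real workloads do not guarantee.

```python
def gc_cost_ms(valid_pages: int,
               t_prog_ms: float = 0.5,
               t_bers_ms: float = 4.0,
               t_read_ms: float = 0.075) -> float:
    """Time to reclaim one block: read + reprogram V pages, then erase."""
    return valid_pages * (t_read_ms + t_prog_ms) + t_bers_ms


def steady_state_waf(valid_pages: int, pages_per_block: int) -> float:
    """If every victim holds V valid pages out of N, each GC pass frees
    N - V pages for user data while the block absorbs N page programs
    in total, so WAF = N / (N - V)."""
    reclaimed = pages_per_block - valid_pages
    if reclaimed <= 0:
        raise ValueError("victim has no stale pages; GC makes no progress")
    return pages_per_block / reclaimed


for v in (0, 64, 128, 192):
    print(f"V={v:3d}: GC cost {gc_cost_ms(v):6.1f} ms, "
          f"WAF {steady_state_waf(v, 256):.2f}")
```

Both numbers climb together as victims get fuller: the fuller the block the FTL is forced to pick, the longer each reclaim takes and the less free space it yields.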

The choice of which block to GC is the policy. Greedy minimises immediate WAF but never rewrites cold blocks, which hurts wear leveling. Real FTLs use cost-benefit policies that weight stale-ratio against block age, balancing GC efficiency against even wear distribution. Some host-managed interfaces (Open-Channel SSDs via Linux's LightNVM, NVMe Zoned Namespaces) expose the block structure to the host so the database can do its own GC, eliminating the FTL/host double-bookkeeping.
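To make the policy difference concrete, the sketch below scores two candidate blocks under a greedy rule and under one classic cost-benefit formulation from the log-structured storage literature, age × (1 − u) / 2u, where u is the fraction of valid pages. The formula choice, the function names, and the example numbers are assumptions for illustration; production FTLs use their own proprietary variants.

```python
def greedy_score(valid: int, total: int) -> float:
    """Greedy: prefer the block with the highest stale-page ratio."""
    return 1.0 - valid / total


def cost_benefit_score(valid: int, total: int, age: float) -> float:
    """Benefit (space freed, weighted by how long the data has been
    stable) divided by cost (read + write of the valid fraction)."""
    u = valid / total
    if u == 0:
        return float("inf")          # nothing to copy: always worth taking
    return age * (1.0 - u) / (2.0 * u)


# (name, valid pages, total pages, age since last write, arbitrary units)
candidates = [
    ("hot_block", 64, 256, 10),      # mostly stale, but rewritten recently
    ("cold_block", 96, 256, 1000),   # a bit fuller, untouched for a long time
]

greedy_pick = max(candidates, key=lambda c: greedy_score(c[1], c[2]))
cb_pick = max(candidates, key=lambda c: cost_benefit_score(c[1], c[2], c[3]))
print("greedy picks      :", greedy_pick[0])
print("cost-benefit picks:", cb_pick[0])
```

Greedy takes the block with the most stale pages right now. The age-weighted score prefers the cold block, on the theory that the hot block will shed more valid pages on its own if left a little longer, and that cold blocks need to be recycled occasionally for wear to stay even.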

NVMe vs SATA — the protocol gap

SATA was designed for spinning rust. Its command queue is 32 entries deep. There is one queue. AHCI's per-command overhead is ~3 µs. None of this matters at 500 MB/s; all of it matters at 14 GB/s.

NVMe (2011) rewrote the storage protocol from scratch to assume flash. The big ideas:

Feature	AHCI (SATA)	NVMe
Command queues	1 queue, 32 entries	Up to 64K queues, each 64K entries
Per-command CPU overhead	~3 µs (locked, single queue)	~0.3 µs (lockless, per-core queue)
Interrupt model	Single line, all completions through one core	MSI-X, one vector per queue, pinned per core
Doorbell mechanism	Centralised	One MMIO doorbell per submission queue
Random 4 KB IOPS (top consumer drive)	~100,000	~1.5 million

The IOPS gap isn't just bandwidth; it's CPU concurrency. NVMe lets each core drive its own queue with no cross-core synchronisation. That's why a modern NVMe stack can saturate the drive with every core submitting independently; SATA, with its single shared queue, fundamentally cannot.

DWPD and how to read the spec sheet

DWPD — Drive Writes Per Day — is the headline endurance number. A 1 TB drive rated 1 DWPD over 5 years means the manufacturer guarantees you can write 1 TB per day, every day, for 5 years (1825 TB total). Past that you're outside warranty; the drive will still work, usually for a while.
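Converting between DWPD and total bytes written (TBW) is a one-line calculation. A minimal sketch with hypothetical function names; the 2 TB / 1200 TBW example at the end is made up purely to show the reverse conversion.

```python
def dwpd_to_tbw(capacity_tb: float, dwpd: float, warranty_years: float = 5) -> float:
    """Total terabytes the warranty covers."""
    return capacity_tb * dwpd * 365 * warranty_years


def tbw_to_dwpd(capacity_tb: float, tbw: float, warranty_years: float = 5) -> float:
    """Drive writes per day implied by a TBW rating."""
    return tbw / (capacity_tb * 365 * warranty_years)


print(f"1 TB @ 1 DWPD, 5 y   : {dwpd_to_tbw(1, 1):.0f} TB")
print(f"1 TB @ 0.3 DWPD, 5 y : {dwpd_to_tbw(1, 0.3):.1f} TB")
# Hypothetical spec sheet: 2 TB drive rated for 1200 TBW over 5 years.
print(f"2 TB, 1200 TBW, 5 y  : {tbw_to_dwpd(2, 1200):.2f} DWPD")
```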

Class	Typical DWPD	Use case
Consumer (read-mostly)	0.1-0.3	Laptop, gaming desktop. Power-loss not protected.
Consumer (gaming/prosumer)	0.5-1.0	Workstation, content creation.
Enterprise (read-mostly)	1.0-3.0	Read-heavy DB, log storage, web servers.
Enterprise (mixed)	3.0-10.0	OLTP, virtualisation hosts.
Enterprise (write-intensive)	10.0-30.0+	Hot caches, journaling, write-amplified workloads.

A 30 DWPD enterprise drive vs a 0.3 DWPD consumer drive isn't using better NAND; both are TLC. The enterprise drive has 25-30% over-provisioning, a supercap for power-loss protection, and ECC margin tuned aggressively. Same NAND, two orders of magnitude more endurance in the rated spec (30 / 0.3 = 100×), ~3x the price per GB. The DWPD number is essentially "how much over-provisioning did the vendor agree to honour."

Common misconceptions

"SSDs are random-access devices." Their interface is, but the underlying NAND is fundamentally sequential at the block level — every write to a block requires erasing it first. The FTL hides this with great cleverness, but workload patterns still leak through. Sequential write patterns hit far higher throughput than random writes of the same volume.
"Filling an SSD doesn't matter." It matters a lot. Once free space drops below ~20%, the FTL has less room for garbage collection. WAF spikes; sustained write performance collapses; latency tail explodes. Keep production SSDs below ~70% utilization.
"All NVMe drives are the same." Random read performance varies 10× across drives at the same Gen5 spec. Sustained write throughput varies 50×. The cheap drive's headline number is achieved for the first few GB; after that, the SLC cache fills and TLC speeds dominate.
"NVMe replaced SCSI." NVMe replaced SCSI as the high-end interface, but legacy SCSI is still everywhere — SAS in enterprise storage, parallel SCSI in some industrial gear, and the entire VMware vSAN protocol stack. SCSI's RDMA cousin, iSER, also remains in production for shared-storage clusters.

Numbers worth remembering

Quantity	Value
NAND page size	4–16 KB
NAND erase block size	256 KB – 4 MB
Page program time	~500 µs – 1 ms
Block erase time	~3–10 ms
TLC endurance	~3,000 P/E cycles
QLC endurance	~1,000 P/E cycles
Typical consumer over-provisioning	~7%
Typical enterprise over-provisioning	~28%
NVMe Gen5 sequential read	~14 GB/s
NVMe Gen5 random read latency	~50–100 µs
NVMe queue depth, max	~64K queues × 64K depth
NVMe doorbell write latency	~1 µs
Enterprise SSD fsync rate	~10,000–30,000/s
