File systems

A filesystem is the thing that lets you say open("/var/log/app.log") instead of "read 4 KB starting at byte 3,489,792 on the third disk." It builds the abstraction of files and directories on top of a flat array of fixed-size blocks, and it has to keep that abstraction intact when the power dies mid-write. This page walks the whole stack: what an inode stores, how directories map names to inodes, how blocks get allocated, where the page cache sits, and how journaling and copy-on-write each survive a crash. The differences between ext4, XFS, btrfs, and ZFS only start to matter once you understand what they all share.

The abstraction: files and directories over blocks

A disk, at the bottom, is a numbered array of fixed-size blocks. SSDs and spinning disks both present this view: block 0, block 1, block 2, up to however many the device holds, each block usually 512 bytes or 4 KB. There are no files down there, no names, no folders, no permissions. A file is a fiction the filesystem maintains. Its job is to take this flat array and build the thing you actually use: named files of arbitrary length, grouped into a tree of directories, each with an owner, permission bits, and a modification time, all of it reachable by a path like /home/nilesh/notes.txt.

Everything that follows is in service of that translation. When a program asks to read 1,000 bytes starting at byte 9,000 of notes.txt, the filesystem has to figure out which block or blocks on the device hold those bytes, fetch them, and hand back exactly that range. When the program appends data, the filesystem has to find a free block, write the bytes there, and record that the file now owns one more block. The cleverness is in the bookkeeping: the data structures that make that lookup fast, the allocation policy that keeps a file's blocks near each other, and the recovery scheme that keeps the bookkeeping consistent when the machine loses power halfway through an update.
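The offset-to-block step is plain integer arithmetic. A minimal sketch, assuming a 4 KB block size; the function names are made up for illustration, and the result is a logical block index within the file, which the filesystem still has to map to a physical block on the device:

```python
BLOCK_SIZE = 4096  # assumed block size in bytes


def locate(offset: int, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Return (logical block index within the file, byte offset inside that block)."""
    return divmod(offset, block_size)


def blocks_touched(offset: int, length: int, block_size: int = BLOCK_SIZE) -> range:
    """Logical blocks that a read of `length` bytes starting at `offset` must fetch."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


if __name__ == "__main__":
    print(locate(9000))                      # (2, 808)
    print(list(blocks_touched(9000, 1000)))  # [2]
    print(list(blocks_touched(4000, 200)))   # [0, 1]: this read straddles a block boundary
```

A read that straddles a boundary needs two blocks, which is one reason aligned I/O is cheaper than unaligned I/O.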

On a Unix system the bookkeeping splits into a small number of on-disk structures, the same set across nearly every design. There is a superblock that describes the filesystem as a whole; inodes that hold per-file metadata; directories that map names to inodes; and the data blocks that hold file contents. The superblock is the first thing read at mount and the last thing written at unmount. It records the total size, the block size, how many inodes exist, a magic number that says "yes, this is an ext4 filesystem," and a clean/dirty flag that tells the kernel whether the last unmount finished properly.

Inodes and what they store

The inode is the centre of the design. One inode per file, identified by a number, and it holds everything about the file except its name. Owner and group, the permission bits, the file's size in bytes, three timestamps (atime for last access, mtime for last content change, ctime for last inode change), a link count, and — the part that does the real work — a description of where the file's data blocks live on the device. The name lives somewhere else entirely, which is the single most important thing to absorb about Unix filesystems and the reason hard links and atomic renames behave the way they do.

The three-hop lookup: directory entry to inode number, inode to metadata, metadata to data blocks.
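Most of what the inode stores is visible from userspace through stat. A short sketch using Python's os.stat on a throwaway temp file (the dictionary keys are just labels chosen here); note that nothing in the result is a filename:

```python
import os
import stat
import tempfile
import time


def describe_inode(path: str) -> dict:
    """Collect the per-file metadata the kernel reports from the inode."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "mode": stat.filemode(st.st_mode),   # e.g. "-rw-------"
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "size_bytes": st.st_size,
        "link_count": st.st_nlink,
        "atime": time.ctime(st.st_atime),    # last access
        "mtime": time.ctime(st.st_mtime),    # last content change
        "ctime": time.ctime(st.st_ctime),    # last inode change (on Unix)
        "blocks_512b": st.st_blocks,         # allocated space, in 512-byte units
    }


def demo() -> dict:
    """Stat a small temporary file and return its inode fields."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello")
        f.flush()
        return describe_inode(f.name)


if __name__ == "__main__":
    for field, value in demo().items():
        print(f"{field:12} {value}")
```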

On many filesystems inodes live in a fixed-size table laid out when the filesystem is created; XFS and btrfs allocate them on demand instead. On ext4 the inode count is decided at mkfs time and cannot grow, which is why a partition can report plenty of free space yet refuse to create a file: it has run out of inodes, not bytes. That happens on filesystems holding millions of tiny files. The fix is to format with a higher inode density, but you have to decide up front. Check it with df -i, which reports inode usage the way df reports block usage.
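The same numbers df -i prints are available programmatically through statvfs. A small sketch, assuming a POSIX system; the helper name and dictionary keys are illustrative. Filesystems that allocate inodes on demand may report a total of zero, so the percentage is left as None in that case:

```python
import os


def inode_usage(path: str = "/") -> dict:
    """Report inode totals for the filesystem that contains `path`."""
    vfs = os.statvfs(path)
    total = vfs.f_files   # total inodes on the filesystem
    free = vfs.f_ffree    # free inodes
    used = total - free
    percent = round(100.0 * used / total, 1) if total else None
    return {"total": total, "free": free, "used": used, "percent_used": percent}


if __name__ == "__main__":
    print(inode_usage("/"))
```

A monitoring check that alerts on percent_used as well as on bytes used catches the "plenty of space, cannot create a file" failure before it happens.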

One subtlety lives in how the inode describes data blocks. Old designs used direct and indirect block pointers: the inode held a handful of direct block numbers, then a pointer to a block full of pointers (single indirect), then a pointer to a block of pointers to blocks of pointers (double indirect), and so on. That works but it is metadata-heavy for large files, and reading a big file means chasing pointer blocks. Modern filesystems use extents instead, which we get to below.
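To put a number on "metadata-heavy", here is a rough calculation under assumed parameters: an ext2-style layout with 12 direct pointers, 4-byte block numbers, and 4 KB blocks, so each pointer block holds 1,024 pointers. The function names are hypothetical.

```python
def max_file_blocks(block_size: int = 4096, pointer_size: int = 4,
                    direct: int = 12, levels: int = 3) -> int:
    """Largest file, in blocks, addressable with direct plus 1..levels indirect trees."""
    ptrs = block_size // pointer_size
    return direct + sum(ptrs ** k for k in range(1, levels + 1))


def pointer_blocks_needed(data_blocks: int, block_size: int = 4096,
                          pointer_size: int = 4, direct: int = 12) -> int:
    """Count the pointer (metadata) blocks needed to map `data_blocks` data blocks."""
    ptrs = block_size // pointer_size
    remaining = max(0, data_blocks - direct)
    total = 0
    level = 1
    while remaining > 0:
        if level > 3:
            raise ValueError("file too large for triple-indirect addressing")
        covered = min(remaining, ptrs ** level)
        n = covered
        for _ in range(level):   # one pass per depth of this indirect tree
            n = -(-n // ptrs)    # ceiling division
            total += n
        remaining -= covered
        level += 1
    return total


if __name__ == "__main__":
    one_gib = (1 << 30) // 4096            # 262,144 data blocks
    print(pointer_blocks_needed(one_gib))  # 257
    print(max_file_blocks() * 4096)        # maximum file size in bytes, about 4 TiB
```

Under these assumptions a contiguous 1 GiB file needs 257 pointer blocks, each a separate read before the data it describes can be fetched.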

Directories as name-to-inode maps

A directory is not a special kind of object in the kernel's eyes. It is an ordinary file whose contents happen to be a list of entries, each entry pairing a name with an inode number. That is the whole of it. /home/nilesh/notes.txt resolves by reading the inode for /, scanning its directory data for the entry named home to get that inode number, reading that inode, scanning its data for nilesh, and so on down the path until the final component yields the inode for notes.txt. Every path resolution is a walk of this kind, which is why path lookup is a hot operation and why the kernel caches its results aggressively.
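The walk is easy to model. This toy sketch keeps inodes in a dictionary and treats a directory's data as a name-to-inode map; the inode numbers and table contents are invented for illustration (inode 2 for the root follows the ext-family convention), and it ignores symlinks, mount points, permissions, and caching:

```python
ROOT_INO = 2  # assumed root inode number

# inode number -> inode; a directory's "data" is a map from name to inode number
INODES = {
    2: {"type": "dir", "entries": {"home": 11}},
    11: {"type": "dir", "entries": {"nilesh": 12}},
    12: {"type": "dir", "entries": {"notes.txt": 13}},
    13: {"type": "file", "data": b"remember the milk\n"},
}


def resolve(path: str, inodes: dict = INODES, root: int = ROOT_INO) -> int:
    """Walk an absolute path one component at a time and return the final inode number."""
    if not path.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    ino = root
    for name in (part for part in path.split("/") if part):
        inode = inodes[ino]
        if inode["type"] != "dir":
            raise NotADirectoryError(path)
        try:
            ino = inode["entries"][name]   # one directory lookup per component
        except KeyError:
            raise FileNotFoundError(path) from None
    return ino


if __name__ == "__main__":
    print(resolve("/home/nilesh/notes.txt"))  # 13
```

Each loop iteration stands for an inode read plus a directory lookup, which is the work the kernel's caches exist to avoid.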

A flat list is fine for a directory with ten entries and slow for one with a hundred thousand, where every lookup is a linear scan. ext4 stores large directories as an HTree, a hashed B-tree-like structure keyed on the filename hash, so a lookup is a few block reads rather than a full scan. XFS uses B+trees for the same reason. The structure is invisible to programs — readdir still returns a list — but it is the difference between a directory of a million files being usable or not.

Because the name and the inode are separate, several names can point at the same inode. That is a hard link: two directory entries, possibly in different directories, holding the same inode number. The inode's link count records how many entries refer to it. Remove one name and the count drops by one; the file's data stays on disk until the count reaches zero and no process holds the file open. This is why unlink on a file that some process is still reading does not break the reader: the name is gone from the directory, but the inode survives because the open file descriptor keeps a reference. The blocks are reclaimed only when the last reference, name or descriptor, goes away.
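The link-count behaviour can be observed directly. A sketch for POSIX systems (the file names are arbitrary, and it assumes a local filesystem that supports hard links): it creates a second name, removes both names while a reader holds the file open, and still reads the data.

```python
import os
import tempfile


def hard_link_demo() -> dict:
    """Show that names and the inode have separate lifetimes."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("payload")

        os.link(a, b)  # second directory entry for the same inode
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        links_with_two_names = os.stat(a).st_nlink

        reader = open(b)  # an open descriptor also keeps the inode alive
        os.unlink(a)
        os.unlink(b)      # no names left, but the reader still works
        names_left = os.listdir(d)
        data = reader.read()
        reader.close()    # last reference gone: blocks can now be reclaimed

    return {
        "same_inode": same_inode,
        "links_with_two_names": links_with_two_names,
        "names_left": names_left,
        "data_read_after_unlink": data,
    }


if __name__ == "__main__":
    print(hard_link_demo())
```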

Hard links versus symbolic links

A hard link and a symbolic link solve different problems and behave differently when things change. A hard link is a second true name for the same inode, indistinguishable from the first — there is no "original." Both entries have equal standing, both point at the same data, and deleting either one just decrements the link count. The catch is that a hard link cannot cross filesystem boundaries (inode numbers are only meaningful within one filesystem) and cannot point at a directory (that would let you build cycles the kernel cannot safely walk).

A symbolic link is a different animal. It is its own tiny file whose data is a path string. When you open a symlink, the kernel reads that string and restarts the lookup from there. Because it stores a path, not an inode number, a symlink can cross filesystems and point at directories. The price is that it breaks if the target moves or is deleted — you are left with a dangling link pointing at a path that no longer resolves. A hard link can never dangle, because it shares the inode; a symlink can, because it only knows a name.

Two names for one inode versus a stored path that the kernel follows again.
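The dangling case is just as easy to reproduce. A sketch for POSIX systems, with arbitrary file names: the symlink has its own inode, stores the target path as its data, and keeps existing after the target is removed.

```python
import os
import tempfile


def symlink_demo() -> dict:
    """Create a symlink, delete its target, and report what is left."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target.txt")
        link = os.path.join(d, "link.txt")
        with open(target, "w") as f:
            f.write("payload")

        os.symlink(target, link)
        stored_path_matches = os.readlink(link) == target              # data is a path string
        own_inode = os.lstat(link).st_ino != os.stat(target).st_ino    # lstat does not follow

        os.unlink(target)
        still_a_link = os.path.islink(link)   # the link itself survives
        resolves = os.path.exists(link)       # exists() follows the link, so this is False

    return {
        "stored_path_matches": stored_path_matches,
        "own_inode": own_inode,
        "still_a_link": still_a_link,
        "resolves": resolves,
    }


if __name__ == "__main__":
    print(symlink_demo())
```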

This separation is also what makes rename() on the same filesystem atomic and cheap. Renaming a file does not touch its data or even its inode; it removes one directory entry and adds another, both under a lock that makes the change all-or-nothing from a reader's point of view. That is why the standard safe-save pattern is "write to a temp file, fsync it, then rename it over the target": the rename is the atomic switch, so a reader sees either the old complete file or the new complete file, never a half-written one.
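The safe-save pattern fits in a few lines. This is a sketch rather than a hardened library routine: atomic_write is a hypothetical helper name, the temp file is created in the target's own directory so the rename stays on one filesystem, and the final directory fsync is an extra step not described above, added on the assumption that you also want the rename itself to survive a crash on a POSIX system.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old file or the new one, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer into the kernel
            os.fsync(f.fileno())   # push the kernel's dirty pages to the device
        os.replace(tmp, path)      # the atomic switch (rename on the same filesystem)
    except BaseException:
        try:
            os.unlink(tmp)         # do not leave a stray temp file behind
        except FileNotFoundError:
            pass
        raise
    # Assumption beyond the pattern described above: fsync the directory so the
    # new directory entry is durable too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp creates the file with owner-only permissions, so a real implementation would usually copy the original file's mode before the rename.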

Block allocation, extents, and fragmentation

The filesystem has to decide which physical blocks a file's bytes go into, and that decision shapes how fast the file reads later. The naive scheme tracks free space with a bitmap, one bit per block, and grabs the first free block whenever a file grows. It works, but if you let it, a file's blocks scatter across the device as free space gets carved up by other files. On a spinning disk that means the read head jumps around; an SSD has no seek penalty, but scattered blocks still bloat metadata and hurt sequential throughput.

The answer is extents. Instead of recording every block a file owns, the inode records ranges: "blocks 9,000 through 9,255, contiguous." That one extent describes 256 blocks (1 MB at a 4 KB block size) in a single small record. A large file that was allocated contiguously might need only a handful of extents, where the old indirect-pointer scheme needed thousands of pointers. ext4, XFS, and btrfs all use extents. They make large-file metadata tiny and large-file reads fast, because contiguous ranges read in long sequential bursts.
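The saving is easy to see by collapsing a block list into ranges. A minimal sketch; to_extents is an illustrative name, and the (start, length) tuples stand in for real on-disk extent records, which also carry the logical offset within the file:

```python
def to_extents(blocks: list[int]) -> list[tuple[int, int]]:
    """Collapse physical block numbers, given in file order, into (start, length) extents."""
    extents: list[tuple[int, int]] = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # block continues the current run
        else:
            extents.append((block, 1))          # gap or jump: start a new extent
    return extents


if __name__ == "__main__":
    contiguous = list(range(9000, 9256))
    print(to_extents(contiguous))               # [(9000, 256)]
    print(to_extents([5, 6, 9, 10, 11, 3]))     # [(5, 2), (9, 3), (3, 1)]
```

The contiguous file needs one record for 256 blocks; the fragmented one needs a record per run, which is the metadata cost fragmentation imposes.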

Getting contiguous ranges requires knowing how big a write will be before committing to block numbers, and that is what delayed allocation buys. When a program writes, the kernel keeps the data in dirty pages and does not pick physical blocks yet. By the time it actually flushes, it often knows the full size of the write and can carve out one large contiguous extent rather than dribbling out blocks as each write call arrives. The cost shows up after a crash: data that was written but never allocated and never flushed is simply gone, which is a real source of "my file is empty after a power cut" reports when applications skip fsync.

Contiguous layout collapses to one extent record and reads sequentially; fragmentation needs many records and many seeks.

Copy-on-write filesystems fragment by design, which is the trade-off for never overwriting in place — more on that below. For in-place filesystems, allocation policy fights fragmentation actively: ext4 keeps a file's data near its inode and groups related files, XFS spreads load across allocation groups so independent regions of the disk can be allocated in parallel without lock contention. Both expose online defragmentation for the cases where a long-lived, repeatedly-appended file (a database, a log) has fragmented anyway.

The VFS layer: one API, many filesystems

All of this on-disk machinery is hidden behind a single kernel interface, the Virtual File System. Every filesystem-relevant syscall — open, read, write, stat, unlink, rename, fsync — passes through VFS before reaching the filesystem that owns the file. VFS defines operation tables (super_operations, inode_operations, file_operations, dentry_operations) that each filesystem fills in with its own functions. The kernel dispatches through those function pointers, so the application never knows whether bytes are coming from ext4 on a local disk, XFS on a SAN, NFS over the network, or a FUSE process running in userspace.
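The dispatch idea can be modelled in a few lines. This is a toy, not the kernel's structures: FileOperations here holds a single read callback, the mount table and the per-filesystem functions are invented for illustration, and the real VFS resolves mounts during the path walk rather than by string prefix.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FileOperations:
    """Stand-in for an operation table: one function pointer per operation."""
    read: Callable[[str], bytes]


def ext4_read(path: str) -> bytes:
    return b"ext4:" + path.encode()


def xfs_read(path: str) -> bytes:
    return b"xfs:" + path.encode()


def fuse_read(path: str) -> bytes:
    return b"fuse:" + path.encode()


# Hypothetical mount table: mount point -> the owning filesystem's operations
MOUNTS = {
    "/": FileOperations(read=ext4_read),
    "/home": FileOperations(read=xfs_read),
    "/mnt/s3": FileOperations(read=fuse_read),
}


def vfs_read(path: str) -> bytes:
    """Pick the longest matching mount point, then call through its table."""
    candidates = [
        m for m in MOUNTS
        if path == m or path.startswith(m.rstrip("/") + "/")
    ]
    mount = max(candidates, key=len)
    return MOUNTS[mount].read(path)


if __name__ == "__main__":
    print(vfs_read("/home/nilesh/notes.txt"))  # b'xfs:/home/nilesh/notes.txt'
    print(vfs_read("/etc/hosts"))              # b'ext4:/etc/hosts'
```

The caller only ever sees vfs_read; which implementation runs is decided by where the path lives.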

This is what lets you mix filesystems on one machine: / on ext4, /home on XFS, /mnt/data on btrfs, /mnt/s3 on a FUSE mount of an object store, all behind the same open(). VFS also owns the in-memory caches that sit in front of every filesystem: the inode cache and the dentry cache. The first ls -l of a cold directory with a hundred thousand files is slow because each stat may need a disk read to fetch the inode; the second is fast because the inodes and directory entries are now cached. /proc/sys/vm/vfs_cache_pressure tunes how aggressively the kernel reclaims that metadata cache versus the page cache, and raising it can be exactly wrong for a metadata-heavy workload.

The cost of one uniform API is that VFS imposes Unix semantics on everyone. Delete-while-open, atomic rename, the inode-and-link-count model — these are baked into the interface, and filesystems built around other assumptions (a few Windows behaviours, some object-store semantics) do not map onto it cleanly. The I/O path below VFS — how a read turns into a device request, blocking versus non-blocking, io_uring — is a topic of its own, covered on the I/O models page.

The page cache and write-back

Sitting above every filesystem is the page cache, and it is the single biggest reason filesystem performance surprises people. Reads are served from cached pages whenever the data is already in memory, so a re-read of a recently-read file never touches the device. Writes are absorbed into dirty pages in RAM and the syscall returns immediately, before anything reaches the disk. Kernel write-back threads flush those dirty pages later; with default Linux settings, a page becomes eligible for write-back once it has been dirty for about 30 seconds. A write that has not been flushed can sit in RAM for roughly that long, and a crash inside that window loses it.

This is why a microbenchmark of write(); write(); write() looks impossibly fast: it is measuring the speed of copying bytes into RAM, not the speed of the disk. The moment durability enters the picture the numbers change by orders of magnitude. A loop of write(); fsync() waits for each flush to physically reach stable storage, and that wait is the actual cost of a durable write. fsync is also what forces the kernel to issue a cache-flush command to the device, because the drive itself has a volatile write cache that will happily acknowledge a write before the bits are on the platter or in NAND. Without that flush, "the kernel wrote it" still does not mean "the device persisted it."

fsync vs sync vs fdatasync. POSIX only requires sync() to schedule all dirty pages for write-back, and allows it to return before the writes finish; on Linux it waits for them to complete. fsync(fd) blocks until that file's data and metadata are on stable storage, including a device cache flush. fdatasync(fd) blocks until just the data and the minimum metadata needed to read it back are on storage, skipping the timestamp update where it safely can, which is a real latency win. Database write-ahead log implementations commonly use fdatasync for log appends, and an fsync of the containing directory to make the rename that finishes a commit durable. Getting these wrong is the most common root cause behind "I lost data after a power cut."
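A log append with a durability barrier might look like the following. This is a sketch, not any particular database's code: durable_append is a hypothetical helper, and it falls back to fsync on platforms where Python does not expose os.fdatasync.

```python
import os


def durable_append(path: str, record: bytes, data_only: bool = True) -> None:
    """Append `record` to `path` and return only after it has been flushed to storage."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(record)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        if data_only:
            # fdatasync skips metadata that is not needed to read the data back
            sync = getattr(os, "fdatasync", os.fsync)
        else:
            sync = os.fsync
        sync(fd)                          # the call that actually waits for the device
    finally:
        os.close(fd)
```

When the file is newly created, its directory entry is a separate piece of metadata; making that durable takes an fsync on the directory as well, which this sketch leaves out.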

What actually happens on a write

Trace a single write(fd, buf, 4096) to an existing file and the layers come into focus. The syscall enters VFS, which dispatches to the filesystem's write handler. The bytes are copied into a page in the page cache, the page is marked dirty, the file's in-memory inode gets its size and mtime updated, and the call returns. Nothing has touched the device yet. The program believes the write succeeded, and for the purposes of any later read by the same machine it has, because reads come from the cache.

Later — either when the 30-second window elapses, when memory pressure forces a flush, or when the program calls fsync — write-back kicks in. If this is an append that needed new space, delayed allocation now picks physical blocks, ideally one contiguous extent, and updates the inode's extent list. The data pages are written to those blocks. The metadata (the inode, the block bitmap or extent tree, the directory entry if the file is new) is written too. On a journaling filesystem these metadata changes go through the journal first, which is the crux of crash consistency and the subject of the next section. Only after all of that, and a device cache flush, is the data durable.

The gap between "the syscall returned" and "the bytes are durable" is where almost every data-loss story lives. An application that writes a config file and exits without fsync is trusting the page cache to flush before anything goes wrong. Most of the time it does. The time it does not — a kernel panic, a power cut, a yanked cable — the file is empty or truncated, and the application looks buggy when the real issue is a missing durability barrier.

Journaling and crash consistency

A metadata update is rarely a single block write. Creating a file touches the directory entry, the inode, the inode bitmap, and the block bitmap; growing a file touches the extent tree and the free-space map. If the machine dies after some of those writes but not others, the filesystem is left inconsistent: a directory entry pointing at an inode that was never initialised, or blocks marked used that no file owns. The old fix was fsck, a full scan at boot that could take hours on a large disk. Journaling exists to make that scan unnecessary.

A journaling filesystem (ext4, XFS, NTFS, HFS+) keeps a small circular log. Before applying a metadata change to its real location, it writes a description of the whole change to the journal and marks it committed. Then it writes the change to the actual structures. Then it marks the journal entry done. If the machine crashes, the recovery step at mount replays any committed-but-not-finished journal entries, bringing the real structures to a consistent state in seconds rather than scanning the whole disk. The cost is write amplification: metadata gets written twice, once to the journal and once to its home.

Once the commit record is durable, recovery can finish the change; before it, recovery discards a partial entry.
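The protocol can be simulated with two dictionaries. This toy model is an illustration only: JournaledStore, its fields, and the crash_after switch are invented here, the "journal" is a Python list rather than a circular on-disk log, and real journals batch changes into transactions and checksum their records.

```python
from typing import Optional


class JournaledStore:
    """Toy write-ahead journal protecting multi-block metadata updates."""

    def __init__(self) -> None:
        self.home: dict = {}      # the "real" on-disk structures
        self.journal: list = []   # the log of intended changes

    def update(self, writes: dict, crash_after: Optional[str] = None) -> None:
        entry = {"writes": dict(writes), "committed": False, "done": False}
        self.journal.append(entry)     # 1. describe the whole change in the journal
        if crash_after == "log":
            return                     # crash before the commit record
        entry["committed"] = True      # 2. commit record is durable
        if crash_after == "commit":
            return
        items = list(writes.items())
        if crash_after == "partial":
            items = items[:1]          # only part of the change reaches its home
        for key, value in items:
            self.home[key] = value     # 3. apply to the real structures
        if crash_after == "partial":
            return
        entry["done"] = True           # 4. mark the journal entry finished

    def recover(self) -> None:
        """Mount-time replay: finish committed entries, discard uncommitted ones."""
        for entry in self.journal:
            if entry["committed"] and not entry["done"]:
                self.home.update(entry["writes"])   # replaying twice is harmless
        self.journal.clear()


if __name__ == "__main__":
    fs = JournaledStore()
    fs.update({"dirent:app.log": 7, "inode:7": "initialised"}, crash_after="partial")
    print(fs.home)   # {'dirent:app.log': 7}: an entry pointing at an uninitialised inode
    fs.recover()
    print(fs.home)   # both writes present after replay
```

Recovery cost is proportional to the size of the journal, not the size of the disk, which is the whole point.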

ext4 offers three journaling modes that trade safety for speed. data=journal logs both data and metadata, the safest and slowest, since every byte is written twice. data=ordered — the default — journals only metadata but guarantees the data blocks are flushed before the metadata that references them commits, so you can never end up with an inode pointing at blocks that hold someone else's old contents. data=writeback journals metadata with no ordering, the fastest, with the risk that after a crash a file's metadata says it has data while the blocks still hold stale bytes. Ordered mode is the sane default: it keeps metadata consistent and prevents the worst data-exposure failure without doubling the data writes.

Copy-on-write filesystems

btrfs and ZFS take a different route to crash consistency: they never overwrite a live block. Modifying data writes a new version to free space, then updates the metadata that points at it — and since that metadata is itself a block, updating it also writes a new copy, and so on up to the root of the filesystem tree. Changing the root pointer is the single atomic act that makes the whole new state visible. This is copy-on-write, and it removes the need for a separate journal: every write already lands in a fresh location, so a crash simply leaves the previous consistent tree root intact, and recovery means using it.

The same mechanism makes snapshots almost free. A snapshot is just an extra reference to an existing tree root. Because nothing is overwritten, the old blocks the snapshot points at stay valid even as the live filesystem moves on, and the snapshot consumes space only as the live data diverges from it. btrfs builds on this with subvolumes, transparent compression, and send/receive for incremental backups; ZFS adds end-to-end checksums on every block, so it can detect silent corruption and, when there is redundancy, repair it automatically from a good copy. ZFS also folds in volume management, a tiered read cache (the in-RAM ARC and an optional flash L2ARC), and an intent log for synchronous writes (the ZIL, which can live on a dedicated log device), which is why it asks for more memory than ext4.
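Both properties, crash safety through a root-pointer swap and snapshots as saved roots, show up even in a very small model. This sketch is a two-level toy with invented names (CowStore, crash_before_root_swap): the root block maps file names to data blocks, whereas real filesystems copy every tree node on the path from the changed block up to the root, and they eventually free blocks that no root references.

```python
from typing import Optional


class CowStore:
    """Toy copy-on-write store: blocks are never overwritten once written."""

    def __init__(self) -> None:
        self.blocks: dict = {}      # block id -> content, append-only
        self.next_id = 0
        self.root = self._alloc({})  # root block: file name -> data block id
        self.snapshots: dict = {}    # label -> saved root block id

    def _alloc(self, content) -> int:
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = content
        return block_id

    def write(self, name: str, data: bytes, crash_before_root_swap: bool = False) -> None:
        data_id = self._alloc(data)                 # new data goes to a fresh block
        new_root = dict(self.blocks[self.root])     # copy the metadata, do not edit it
        new_root[name] = data_id
        new_root_id = self._alloc(new_root)
        if crash_before_root_swap:
            return                                  # old tree is untouched and still valid
        self.root = new_root_id                     # the single atomic switch

    def read(self, name: str, root: Optional[int] = None) -> bytes:
        tree = self.blocks[self.root if root is None else root]
        return self.blocks[tree[name]]

    def snapshot(self, label: str) -> None:
        self.snapshots[label] = self.root           # just one more reference to a root


if __name__ == "__main__":
    fs = CowStore()
    fs.write("notes.txt", b"v1")
    fs.snapshot("monday")
    fs.write("notes.txt", b"v2")
    print(fs.read("notes.txt"))                              # b'v2'
    print(fs.read("notes.txt", root=fs.snapshots["monday"])) # b'v1'
```

Taking the snapshot copied nothing; it costs space only because the v1 block can no longer be freed.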

The cost of copy-on-write is fragmentation, the same trade-off seen from the other side. Rewriting parts of a large file in place is impossible by definition, so a database file or a busy log that gets random in-place updates scatters across the device and accumulates dead blocks that only garbage collection reclaims. That is the reason the standard advice is to keep database storage on ext4 or XFS, or to disable copy-on-write for such files on btrfs (chattr +C on the file or its directory, or the nodatacow mount option): copy-on-write is excellent for snapshots and integrity and poor for high-rate in-place rewrites.

Choosing one

The four filesystems that matter on Linux differ in how they make the choices above, not in the abstraction they present. ext4 is the in-place, journaling default — boring, well-understood, and the right answer when you are not sure. XFS is also in-place and journaling but built for parallelism, with allocation groups and B+trees that scale to many cores and very large files. btrfs and ZFS are copy-on-write, trading in-place efficiency for cheap snapshots and, in ZFS's case, end-to-end integrity.

Workload	Pick	Why
General-purpose Linux server	ext4	Default everywhere, extents, ordered journaling. Boring and good.
Very large files, high concurrency	XFS	Allocation groups and B+trees scale across cores; RHEL default.
Data integrity, snapshots, RAID	ZFS	End-to-end checksums, mature RAID-Z, send/receive, self-healing.
Snapshot-heavy, modest scale, Linux-native	btrfs	Copy-on-write snapshots and subvolumes; avoid RAID 5/6.
Cloud object store mounted as files	FUSE (s3fs, mountpoint-s3)	Pay the latency cost for the file abstraction over an object store.
Database storage volume	ext4 or XFS	In-place writes; copy-on-write fragments under heavy in-place updates.

Whichever you pick, the durability rules are the same: a write is not safe until it has been flushed past the page cache and the device cache, and the way you force that is fsync or fdatasync at the right moments. The filesystem decides how the bytes are laid out and how it survives a crash; your application decides when it actually waits for that survival to happen. Understanding the page cache and crash consistency also pays off one layer up, in how databases reason about durability — see write-ahead logging — and one layer down, in how the kernel turns these writes into device requests, on the I/O models page. The way the page cache shares memory with the rest of the system connects to memory allocation as well.