Filesystem Visualizer: ext4 on one screen.

A tiny ext4-shaped filesystem: 2 block groups, each with a superblock, an inode bitmap, a block bitmap, an inode table, and data blocks. Plus a journal. Create files and watch the inode get allocated, the block bitmap flip, the journal grow, and the commit make it durable. Simulate a crash and watch what replays.

[Status panel: inodes 1/12 in use, blocks 12/24 in use. Controls: size: 2 block(s), journal mode.]

Filesystem mounted. Root directory inode #1.

Disk layout (2 block groups)

Block group 0: GDT, IBM, BBM
Block group 1: IBM, BBM

Legend: superblock / GDT, inode bitmap (IBM), block bitmap (BBM), inode table, data (used), data (free)

Inode table

#	name	mode	size	blocks
#1	/	d	4	—

Journal (jbd2)

— empty —

Log

Filesystem mounted. Root inode #1 created.

Try this

Click create a few times. Watch inodes fill from the start, data blocks fill behind the inode table, journal grow with each transaction, then commit (✓).
Delete a file. Inode and blocks free up — but the bitmap update is a journal transaction too, not a synchronous write to disk.
Switch journal mode to journal. Now every file create writes the data to the journal first and then to its final block. Slower, but file data as well as metadata stays consistent across a crash.
Click "preset: crash + replay". 3 files created, simulated crash drops uncommitted journal entries, mount replays. Some may not survive.

The three journal modes

data=ordered (default). Metadata journaled. Data is written before its metadata commit. Crash → consistent metadata, possibly lost newer data.

data=journal. Data + metadata both journaled. Double the writes. Safest, slowest.

data=writeback. Metadata journaled. Data flushed independently. Crash → consistent metadata that may point at stale or zeroed data blocks.
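The difference between the three modes comes down to where the data goes and whether it is ordered against the metadata commit. Here is a minimal sketch of that, as a simplified model rather than jbd2's actual code path. The function names and step labels are invented for illustration.

```python
# Simplified model of the writes behind one file create, per journal mode.
# Step labels are illustrative only; they are not jbd2 terminology.

METADATA_STEPS = [
    "metadata -> journal",
    "commit record -> journal",
    "metadata -> final location",
]


def write_plan(mode: str) -> dict:
    """Return the ordered steps, plus any step with no ordering guarantee."""
    if mode == "ordered":
        # Data reaches its final block before the metadata commit.
        return {"ordered": ["data -> final location"] + METADATA_STEPS,
                "unordered": []}
    if mode == "journal":
        # Data is journaled alongside the metadata, then written in place.
        return {"ordered": ["data -> journal"] + METADATA_STEPS
                           + ["data -> final location"],
                "unordered": []}
    if mode == "writeback":
        # Data is flushed whenever; nothing ties it to the commit.
        return {"ordered": list(METADATA_STEPS),
                "unordered": ["data -> final location"]}
    raise ValueError(f"unknown journal mode: {mode}")


def data_writes(mode: str) -> int:
    """Count how many times the file data itself is written."""
    plan = write_plan(mode)
    steps = plan["ordered"] + plan["unordered"]
    return sum(step.startswith("data") for step in steps)
```

In this model data_writes("journal") is 2 and the other two modes give 1, which is the "double the writes" cost. The only difference between ordered and writeback is whether the data write sits in the ordered list ahead of the commit record.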


What you're looking at

The grid is a tiny ext4 disk cut into two block groups. The coloured cells are fixed metadata — superblock, the inode bitmap, the block bitmap, the inode table — and the rest are data blocks that turn green as files claim them. Below it, the inode table lists every allocated file, and the journal (jbd2) shows each metadata transaction and whether it has committed (the ✓).

Click create a few times and watch an inode get claimed, data blocks flip to used, and a journal entry appear and then commit a moment later. That commit is what makes the change durable. Switch the journal mode to journal and create again: now the data is written twice, once to the journal and once to its final block, which is the price of crash-consistent data. The moment to watch for is preset: crash + replay. The uncommitted journal entries are dropped, so some files survive the crash and some don't, just as a real journal replay at mount time would leave the disk. The filesystem structure always comes back consistent; whether your newest data does depends on the mode.

Why ext4 cuts the disk into groups

The same trick computer architects use for memory.

Ext4 divides the disk into block groups, 128 MiB each with the default 4 KiB block size. Every block group has its own inode bitmap, block bitmap, inode table, and data blocks. The superblock and group descriptor table live in group 0, with backup copies kept in a subset of the other groups for redundancy. A file's inode and its data blocks live in the same block group whenever possible, with directory entries clustered nearby. Each directory is assigned a "preferred group" at creation, and files in that directory try to land there.
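The 128 MiB figure follows from the block bitmap: it occupies one block, and each bit tracks one block of the group. A short sketch of the arithmetic, assuming the default layout with 4 KiB blocks where block numbering starts at 0 (the function names are made up for this example):

```python
def blocks_per_group(block_size: int) -> int:
    """One bitmap block holds block_size * 8 bits, one bit per block."""
    return block_size * 8


def group_size_bytes(block_size: int) -> int:
    """Bytes covered by one block group."""
    return blocks_per_group(block_size) * block_size


def group_of_block(block_no: int, block_size: int) -> int:
    """Which block group a block number falls in (assumes first block is 0)."""
    return block_no // blocks_per_group(block_size)


if __name__ == "__main__":
    bs = 4096
    print(blocks_per_group(bs))               # blocks in one group
    print(group_size_bytes(bs) // (1 << 20))  # group size in MiB
```

With 4 KiB blocks that is 32,768 blocks per group, or 128 MiB. The visualizer shrinks the same idea to 12 blocks per group so that it fits on one screen.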

The benefit is mechanical sympathy with the storage device. On spinning rust, reading the inode and the data without a long seek is much faster than scattering them across the platter. On an SSD, the seek argument is gone but the locality still helps for read-ahead and for keeping a working set of bitmaps in the page cache. The pattern is the same one CPUs use for cache lines: pack things you'll access together. The cost is that very large files (multi-GB) span groups and lose some of the benefit. For those files ext4 at least keeps the mapping compact: it describes data with extents (a start block plus a length for each contiguous run) instead of the per-block pointer maps of ext2/3.
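To see why extents are cheaper than per-block pointers, here is a minimal sketch that collapses a list of physical block numbers into (start, length) runs. It is an illustration only: to_extents is a hypothetical helper, and the sketch ignores the on-disk extent tree format and the per-extent length limit that real ext4 has.

```python
from typing import List, Tuple


def to_extents(blocks: List[int]) -> List[Tuple[int, int]]:
    """Collapse physical block numbers into (start, length) contiguous runs."""
    extents: List[Tuple[int, int]] = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # Block continues the current run: extend its length by one.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Gap in the numbering: start a new run.
            extents.append((block, 1))
    return extents


if __name__ == "__main__":
    print(to_extents([100, 101, 102, 103, 200, 201]))
```

A file laid out in one contiguous run needs a single descriptor however long it is, where a pointer map needs one entry per block.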

A file's identity isn't its name

The directory entry maps the name to a number. The number is the file.

An inode is a fixed-size structure (256 bytes on modern ext4) holding everything about a file except its name: mode bits, owner, size, timestamps, link count, and the extent tree pointing at where the data lives. The filename is in the directory entry of some parent directory, not in the inode. This is why renaming a file within a filesystem is atomic and cheap (rewrite one dirent), why hard links work (multiple dirents point at the same inode), and why unlinking a file whose fd is still open doesn't actually delete the data (the inode and its blocks are freed only when the link count is zero and the last open fd is closed).
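You can observe all three behaviours from user space. The sketch below assumes a POSIX system whose temp directory supports hard links; the file names and the inode_demo function are made up for the example.

```python
import os
import tempfile


def inode_demo() -> dict:
    """Show that names are separate from the inode they point at."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        a, b, c = (os.path.join(d, n) for n in ("a.txt", "b.txt", "c.txt"))
        with open(a, "w") as f:
            f.write("hello")
        ino = os.stat(a).st_ino

        os.link(a, b)  # hard link: a second name for the same inode
        results["same_inode_after_link"] = os.stat(b).st_ino == ino
        results["nlink_after_link"] = os.stat(a).st_nlink

        os.rename(a, c)  # rename: only the directory entry changes
        results["same_inode_after_rename"] = os.stat(c).st_ino == ino

        fd = os.open(c, os.O_RDONLY)
        os.unlink(b)
        os.unlink(c)  # no names remain, but the fd keeps the inode alive
        results["nlink_with_no_names"] = os.fstat(fd).st_nlink
        results["data_via_open_fd"] = os.read(fd, 5)
        os.close(fd)  # last reference gone: now the inode can be freed
    return results


if __name__ == "__main__":
    for key, value in inode_demo().items():
        print(key, value)
```

The inode number stays the same across the link and the rename, and the data is still readable through the open descriptor after both names are gone.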

This is also why a write can fail with "No space left on device" when df shows only 30% of the data blocks used: the inode table is fixed at filesystem-creation time, and you can run out of inodes (think: a Maildir with a million tiny messages) long before you run out of data blocks. df -i shows inode usage. The fix is mkfs.ext4 -N to allocate more inodes up front. XFS allocates inodes dynamically, so the problem rarely appears there.

The journal makes the filesystem survivable

The same idea databases use for crash recovery, applied one layer down.

Without a journal, any modification that touches multiple metadata blocks (creating a file touches the inode bitmap, the inode table, the parent directory entry, and the block bitmap) risks leaving the filesystem in an inconsistent state if the machine crashes mid-modification. The classic ext2 era of "fsck takes 45 minutes on a 1 TB disk" was a direct consequence — fsck had to scan everything looking for inconsistencies because nothing recorded what had been in flight.

Ext4's journal (managed by the JBD2 layer) records every metadata transaction before it updates the in-place blocks. A transaction is opened; the modified metadata blocks are written to the journal; a commit record is written to the journal; only then are the blocks written to their final locations (checkpointing). On crash, the kernel mounts the FS, reads the journal, and replays any committed-but-not-yet-applied transactions, discarding any started-but-not-committed ones. The filesystem is consistent the moment the mount finishes. No global scan needed.
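The replay rule is small enough to model in a few lines. The sketch below is a toy write-ahead journal, not jbd2: the class, its method names, and the block labels are all invented. It assumes the order journal write, then commit record, then in-place write.

```python
class TinyJournalFS:
    """Toy model: metadata blocks in a dict, transactions in a list."""

    def __init__(self) -> None:
        self.disk: dict = {}      # in-place metadata blocks
        self.journal: list = []   # pending transactions

    def begin(self, updates: dict) -> dict:
        """Log the new block contents in the journal (not yet durable)."""
        txn = {"updates": dict(updates), "committed": False, "applied": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn: dict) -> None:
        """Write the commit record: from here on the change survives a crash."""
        txn["committed"] = True

    def checkpoint(self, txn: dict) -> None:
        """Write the blocks to their final location."""
        if not txn["committed"]:
            raise RuntimeError("cannot checkpoint an uncommitted transaction")
        self.disk.update(txn["updates"])
        txn["applied"] = True

    def replay(self) -> int:
        """Mount-time recovery: apply committed work, drop the rest."""
        replayed = 0
        for txn in self.journal:
            if txn["committed"] and not txn["applied"]:
                self.disk.update(txn["updates"])
                replayed += 1
        self.journal = []  # uncommitted transactions are simply discarded
        return replayed


def crash_demo() -> tuple:
    """Three creates, then a crash at three different points."""
    fs = TinyJournalFS()
    t1 = fs.begin({"inode_bitmap": "1100", "inode#2": "a.txt"})
    fs.commit(t1)
    fs.checkpoint(t1)  # fully on disk before the crash
    t2 = fs.begin({"inode_bitmap": "1110", "inode#3": "b.txt"})
    fs.commit(t2)      # committed, crash before checkpoint
    fs.begin({"inode_bitmap": "1111", "inode#4": "c.txt"})  # never committed
    return fs, fs.replay()
```

After crash_demo() the second file is recovered by replay and the third is gone, yet the bitmap and the inode entries agree with each other. That is the "some may not survive" behaviour of the crash + replay preset.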

This is why ext4 fsck after a clean unmount takes a second, and after a crash recovery takes only as long as the journal replay (a few seconds, even on multi-TB filesystems). The data may still need recovery, depending on the journal mode, but the filesystem structure itself is always recoverable in time bounded by the journal size, not the disk size.
