How Git stores your code as snapshots, not diffs.
A content-addressed key-value store with three object types and a directed acyclic graph on top. Branches are pointers. Merges are commits with two parents. Everything else is a UI on top of that.
The three objects Git is built from: blob, tree, commit
Files, directories, and snapshots, each keyed by a hash of its content.
Git is a content-addressable filesystem with a version-control UI bolted on top. Linus Torvalds wrote the first version in April 2005, within a few weeks of BitMover withdrawing the free BitKeeper licence the kernel project had relied on. Internally, Git has four object types (blob, tree, commit, and tag), all addressed by SHA-1, with a SHA-256 transition under way. Blob, tree, and commit carry your code and its history; the tag object only backs annotated tags. Every clone is a full copy of the repository.
Three kinds of object do the real work. A blob is a file's bytes. A tree is a directory listing: names, modes, and pointers to blobs or sub-trees. A commit wraps a tree with metadata: who, when, why, parents. Every object is keyed by the SHA-1 of its content, prefixed with a short header naming its type and size; two objects with identical bytes resolve to the same hash, no matter who created them or when. The result is a content-addressed hash table on disk.
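A minimal sketch of that keying rule, assuming SHA-1 object IDs. The helper name `git_object_id` is made up for illustration; the header layout (type, a space, the byte length, a NUL byte) is what Git hashes ahead of the content.

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID for content of the given type (hypothetical helper)."""
    # Git hashes "<type> <size>\0" followed by the raw bytes.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()

empty_blob = git_object_id("blob", b"")
hello_blob = git_object_id("blob", b"hello world\n")
same_again = git_object_id("blob", b"hello world\n")  # identical bytes, identical key

print(empty_blob)
print(hello_blob)
print(hello_blob == same_again)
```

The IDs should match what `git hash-object` reports for the same bytes, which is an easy way to check the sketch against a real installation.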
    tree a1b2c3d
    parent f12abe7
    author Alice <alice@x> 1730000000 +0000
    committer Alice <alice@x> 1730000000 +0000

    Add hello.js
A commit pins a tree to a moment, with zero or more parents. Ordinary commits have one. Merge commits have two (occasionally more). Initial commits have none. The commit hash covers the tree, the parent list, the author and committer lines (timestamps included), and the message; change any of those and the hash changes.
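To see why any change produces a new hash, here is a sketch that serialises a commit the way the example above is laid out and hashes it. The tree and parent IDs are hypothetical, padded to 40 hex digits for illustration, and `commit_id` is an illustrative helper, not a Git API.

```python
import hashlib

def commit_id(tree, parents, author, committer, message):
    """Serialise a commit object and return its SHA-1 ID (illustrative helper)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]      # zero, one, or several parents
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = f"commit {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()

# Hypothetical IDs, padded to full length for the sake of the example.
TREE = "a1b2c3d" + "0" * 33
PARENT = "f12abe7" + "0" * 33
WHO = "Alice <alice@x> 1730000000 +0000"

original = commit_id(TREE, [PARENT], WHO, WHO, "Add hello.js")
reworded = commit_id(TREE, [PARENT], WHO, WHO, "Add hello.js (reworded)")
reparented = commit_id(TREE, [], WHO, WHO, "Add hello.js")  # same tree, no parent

print(original)
print(reworded)
print(reparented)
```

Same tree, same author, same timestamp: changing only the message or only the parent list still yields a different ID.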
Git history is a graph of commits, not a straight line
Each commit points back at its parents, forming a directed acyclic graph.
Each commit points at its parent (or parents). Branches are not first-class; they are just names that point to a single commit. A merge between two branches is a commit whose parent list has two entries; its tree holds the merged content.
This shape gives you everything: git log walks parent edges; git diff a..b diffs the two commits' trees; git merge-base finds the lowest common ancestor (LCA) in the DAG. Every node is named by its hash, the same kind of immutable identifier that distributed IDs give a row. The DAG is the data structure; commands are queries on it.
    # A small history with one merge
    A -- B -- C -- D -- E       (main)
          \              \
           F ---- G ----- M     (M is the merge of E and G)
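The "commands are queries" idea can be sketched with a plain dictionary of parent edges. This models the small history above, assuming for illustration that the side branch forks at B; the function names are hypothetical and the algorithm is a simple set-based one, not Git's own implementation.

```python
# Each commit maps to its list of parents (assumption: F forks from B).
PARENTS = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["B"], "G": ["F"], "M": ["E", "G"],
}

def ancestors(parents, start):
    """All commits reachable from start by following parent edges, start included."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }

log_from_m = ancestors(PARENTS, "M")      # roughly what "git log M" would visit
base_e_g = merge_bases(PARENTS, "E", "G")
base_m_g = merge_bases(PARENTS, "M", "G")
print(sorted(log_from_m), base_e_g, base_m_g)
```

Walking from M reaches every commit, because a merge has two parent edges to follow. The merge base of E and G is the fork point; once M exists, the merge base of M and G is G itself.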
A Git branch is just a file holding one commit hash
No metadata, no history, just forty hex digits pointing at the tip.
A branch is a file under .git/refs/heads/ containing one commit hash. main: 40 characters of hex, no more. git branch new writes another file. git checkout updates .git/HEAD, which is itself a symbolic ref pointing at a branch ref. (Git may later fold many refs into a single .git/packed-refs file; the model is the same.) The ref carries no metadata, no history, no creation time, just the hash of the tip.
Tags are similar but live under .git/refs/tags/. Lightweight tags are just refs. Annotated tags are real tag objects (the fourth object type) with a message and a signer.
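A sketch of how little a branch is, using a throwaway directory laid out like a loose-ref .git folder. The commit hash is hypothetical and `resolve_ref` is an illustrative helper; it ignores packed refs, and a real git checkout also updates the index and working tree.

```python
import tempfile
from pathlib import Path

def resolve_ref(git_dir: Path, name: str = "HEAD") -> str:
    """Follow 'ref: ...' indirections until a commit hash is reached."""
    text = (git_dir / name).read_text().strip()
    while text.startswith("ref: "):
        text = (git_dir / text[len("ref: "):]).read_text().strip()
    return text

TIP = "f12abe7" + "0" * 33  # hypothetical commit hash, padded to 40 hex digits

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    heads = git_dir / "refs" / "heads"
    heads.mkdir(parents=True)

    (heads / "main").write_text(TIP + "\n")               # the branch: one hash
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    head_commit = resolve_ref(git_dir)

    # Roughly "git branch new": one more small file holding the same hash.
    (heads / "new").write_text(head_commit + "\n")
    new_tip = resolve_ref(git_dir, "refs/heads/new")

    # The ref side of "git checkout new": only HEAD's contents change.
    (git_dir / "HEAD").write_text("ref: refs/heads/new\n")
    after_checkout = resolve_ref(git_dir)

print(head_commit, new_tip, after_checkout)
```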
What the Git staging area actually is
The index is the next tree you are about to commit.
The index (.git/index) is Git's staging area — the proposed next tree, behaving a lot like a database write-ahead log for the working copy. git add writes blobs and updates the index. git commit snapshots the index into a tree, wraps it with metadata, and points the current branch at the new commit.
Most commands you think of as "change a working file" really move content between three places: working tree → index → commit. Knowing this collapses ten confusing UI surprises into one mental model. git diff diffs working tree vs index; git diff --cached diffs index vs HEAD's tree; git diff HEAD diffs working tree vs HEAD's tree.
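The three diff variants fall out naturally if the working tree, the index, and HEAD's tree are modelled as three path-to-content maps. This is a toy model under that assumption, with made-up file names and contents; real Git compares blob hashes and handles untracked files separately.

```python
def diff(old: dict, new: dict) -> dict:
    """Map each path whose content differs to an (old, new) pair."""
    paths = sorted(old.keys() | new.keys())
    return {p: (old.get(p), new.get(p)) for p in paths if old.get(p) != new.get(p)}

head = {"hello.js": "v1", "README": "r1"}   # tree of the last commit
index = dict(head)                          # staging area starts equal to HEAD
work = dict(head)                           # working tree starts equal too

work["hello.js"] = "v2"                     # edit two tracked files...
work["README"] = "r2"
index["hello.js"] = work["hello.js"]        # ...but "git add" only one of them

unstaged = diff(index, work)    # like: git diff
staged = diff(head, index)      # like: git diff --cached
everything = diff(head, work)   # like: git diff HEAD

head = dict(index)              # like: git commit (snapshot the index)
staged_after_commit = diff(head, index)

print(unstaged, staged, everything, staged_after_commit)
```

After the commit, the staged diff is empty while the README edit is still waiting in the working tree, which is exactly the situation that surprises people who skip the index in their mental model.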
How Git keeps the repo small with packfiles
Objects start loose, then get delta-compressed into packs.
At first, every commit stores a full new blob for every changed file, as a loose, zlib-compressed object. That works for a small repo. For a large one, git gc packs many objects together as deltas (one full version plus many "this differs from that by X bytes" entries). The packfile zlib-compresses its entries and stores binary deltas between similar blobs to drop redundancy.
This is why git clone is fast even on huge repos: the server sends one packfile, not thousands of objects. It is also why .git directory size is far smaller than the sum of all historical file sizes.
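The delta idea can be illustrated with copy and insert instructions against a base version. This is a toy encoding built on Python's difflib, chosen for brevity; it is not Git's actual pack delta format or its algorithm for choosing delta bases, and the function names are hypothetical.

```python
import zlib
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # reuse bytes already in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))    # bytes that are new in target
        # "delete" emits nothing: those base bytes are simply not copied
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base plus the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

v1 = b"".join(b"line %d\n" % i for i in range(50))
v2 = v1 + b"one more line\n"

delta = make_delta(v1, v2)
rebuilt = apply_delta(v1, delta)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
loose = zlib.compress(v2)   # what a loose object does: compress the whole thing

print(len(v2), len(loose), literal_bytes)
```

The second version costs only its new bytes plus a pointer into the first, which is the effect packing relies on across thousands of nearly identical blobs.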
Merge vs rebase: two ways to combine branches
One records that histories met; the other rewrites them into a line.
Merge creates a new commit with two parents, recording the fact that two histories converged — the closest Git gets to a database transaction. Truthful but messy — many merge commits in a busy repo.
Rebase replays your commits on top of theirs. The history looks linear, but the original commits drop out of the branch, replaced with new ones (different hashes, same patches). Clean history; the author is kept but the committer and dates are rewritten; the pre-rebase state is no longer on any branch unless you saved it.
Pick one per project and stick with it. The flame wars about which is "right" are misplaced — both are valid; consistency matters more than the choice.
- Merge: truth-preserving.
- Keeps every commit. Records that A and B converged at point M. Easy to revert. Branchy log. Good for shared branches.
- Rebase: history-rewriting.
- Replays your commits onto theirs. Hashes change; original commits orphaned. Linear log. Use for your local work; never for shared branches.
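Why rebased commits get new hashes can be shown with a toy commit store in which an ID is a hash over the parent, the author, and the patch. That is a deliberate simplification (real commits hash a tree, not a patch), and all names here are hypothetical.

```python
import hashlib

store = {}  # commit id -> {"parent", "patch", "author"}

def commit(parent, patch, author="alice"):
    """Create a commit whose ID depends on its parent, author, and patch."""
    cid = hashlib.sha1(f"{parent}|{author}|{patch}".encode()).hexdigest()
    store[cid] = {"parent": parent, "patch": patch, "author": author}
    return cid

def rebase(branch_tip, old_base, new_base):
    """Replay the commits after old_base onto new_base; return the new tip."""
    chain, cursor = [], branch_tip
    while cursor != old_base:            # walk back to the fork point
        chain.append(cursor)
        cursor = store[cursor]["parent"]
    tip = new_base
    for old in reversed(chain):          # oldest first
        tip = commit(tip, store[old]["patch"], store[old]["author"])
    return tip

a = commit(None, "init")
b = commit(a, "main work")               # theirs
f = commit(a, "feature 1")               # yours
g = commit(f, "feature 2")

new_tip = rebase(g, old_base=a, new_base=b)
new_first = store[new_tip]["parent"]

print(g != new_tip, store[new_tip]["patch"], store[new_first]["parent"] == b)
```

The patches and author survive, the IDs do not, and the old commits are still sitting in the store: nothing was deleted, they just stopped being the branch.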
Trunk-based, GitHub flow, git-flow: pick by release cadence
A branching strategy is three decisions wearing one name.
The DAG does not care how you branch; your release process does. Behind every named strategy sit three decisions: how often you ship, what a bad change costs to undo, and who is allowed to merge. Name those and the choice usually makes itself.
- Trunk-based
- Everyone commits to main; branches live hours, not days; unfinished work hides behind feature flags. Cadence: continuous — every green commit is shippable. Revert cost: one small commit to back out. Who merges: nobody, really; CI is the gatekeeper. The price of entry is real: fast tests, feature flags, and the discipline to keep half-built work dark. Without those, trunk-based is just everyone breaking main together.
- GitHub flow
- Branch from main, open a pull request, review, merge, deploy. Cadence: per-PR, usually daily or better. Revert cost: revert the merge (or squash) commit — one operation, because each PR is one unit. Who merges: the author, after review. The default for most teams, and a good one: it keeps main always deployable while giving review a natural unit.
- git-flow
- Long-lived develop and main, plus release and hotfix branches. Built for scheduled, versioned releases — installed software, mobile release trains, anything with several supported versions in the field at once. Revert cost: the highest — a bad change may need cherry-picks across every active release branch. Who merges: a release manager, explicit or de facto. Wrong default for a web service that deploys daily; the right shape if you must patch v2.3 while building v3.
The failure mode is copying a strategy from a company with a different release reality. A two-person team running git-flow does ceremony for an audience that does not exist; a team shipping a mobile app "trunk-based" discovers it still needs release branches the first time a store review takes a week. Start from how you ship, not from a diagram.
On rebase-vs-merge etiquette, the honest version: the rule "never rewrite shared history" does all the real work, and most of the remaining argument is aesthetics. A team that rebases feature branches before merging gets a log that reads like a changelog; a team that merges gets a log that records what actually happened. Both are fine. What is not fine: force-pushing a branch a reviewer has already pulled, rebasing mid-review so the new diff can't be compared against the old one, or rewriting anything someone else's work sits on top of. Agree on one convention, set the repo's merge-button policy to match (merge commit, squash, or rebase — pick exactly one), and spend the argument budget elsewhere.
Git is distributed: every clone is a full repository
Your laptop holds the whole history; there is no privileged server.
Git's distributed nature is not a feature on top — it is the model. Your laptop holds the full DAG. push sends new objects to a remote that does not yet have them; fetch downloads what the remote has and you do not. Both operations are content-addressed: the wire format names objects by hash and skips ones the other side already has — effectively a cache negotiation.
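Stripped to its core, the exchange is a set difference over object IDs. The sketch below assumes two in-memory object stores with made-up IDs; the real protocol negotiates with have/want messages over commits rather than comparing complete object lists, so treat this as the idea, not the wire format.

```python
def transfer(source: dict, destination: dict) -> list:
    """Copy objects the destination lacks; return the IDs that were sent."""
    missing = sorted(source.keys() - destination.keys())
    for oid in missing:
        destination[oid] = source[oid]
    return missing

# Hypothetical object stores keyed by (shortened) object ID.
laptop = {"a1": b"blob one", "b2": b"tree", "c3": b"commit 1", "d4": b"commit 2"}
remote = {"a1": b"blob one", "b2": b"tree", "c3": b"commit 1", "e5": b"commit 3"}

fetched = transfer(remote, laptop)        # fetch: remote -> laptop
pushed = transfer(laptop, remote)         # push: laptop -> remote
second_fetch = transfer(remote, laptop)   # nothing left to send

print(fetched, pushed, second_fetch)
```

Because an ID is derived from content, "already have it" needs no comparison of the bytes themselves.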
Forks, mirrors, and the GitHub model all sit on top. There is no central server in Git itself; "origin" is just one remote out of many. The convention of one canonical remote is a workflow choice.
How to recover lost commits with the reflog
Git remembers every move of HEAD, so most lost work is still there.
Most "lost commits" are not lost. Git's reflog records every change to HEAD, for ninety days by default (thirty for entries no longer reachable from the current tip). git reflog shows it; you can git checkout any of those hashes to get back to that state. Even after a rebase that "destroyed" your branch, the original commits are still there: orphaned, but reachable from the reflog until its entries expire and garbage collection runs.
Real loss happens only when objects are unreachable AND old enough for git gc to prune them. The default grace period before pruning is two weeks. Plenty of time to recover, if you know where to look.
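The reflog is itself a plain text file (.git/logs/HEAD): each line holds the old hash, the new hash, who made the move and when, then a tab and a message. The sketch below parses a made-up sample in that layout to find where a branch stood before a rebase; the hashes and the `parse_reflog` helper are hypothetical.

```python
def parse_reflog(text: str) -> list:
    """Split reflog lines into old hash, new hash, and message."""
    entries = []
    for line in text.splitlines():
        meta, _, message = line.partition("\t")
        old, new, _who = meta.split(" ", 2)
        entries.append({"old": old, "new": new, "message": message})
    return entries

# Hypothetical hashes: forty repeated hex digits each.
ZERO, A, B, C, D = ("0" * 40, "a" * 40, "b" * 40, "c" * 40, "d" * 40)
WHO = "Alice <alice@x> 1730000000 +0000"

sample = "\n".join([
    f"{ZERO} {A} {WHO}\tcommit (initial): Start",
    f"{A} {B} {WHO}\tcommit: Add hello.js",
    f"{B} {C} {WHO}\trebase (start): checkout main",
    f"{C} {D} {WHO}\trebase (pick): Add hello.js",
    f"{D} {D} {WHO}\trebase (finish): returning to refs/heads/feature",
])

entries = parse_reflog(sample)
# The "old" side of the first rebase entry is where HEAD stood before the rebase.
pre_rebase = next(e["old"] for e in entries if e["message"].startswith("rebase (start)"))
print(pre_rebase)
```

With that hash in hand, checking it out or pointing a new branch at it brings the pre-rebase commits back onto a name, out of reach of garbage collection.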
Git at scale: what breaks in giant monorepos
How the biggest repos work around the limits of a full clone.
Git was designed for the Linux kernel — a few hundred contributors, ~70k files, ~5GB of history in 2024. Beyond that scale the assumptions strain. The big monorepos:
- Linux kernel
- ~70k files, ~1M commits since 2005. Vanilla git handles it. ~3 GB clone. Linus's reference workload.
- Chromium
- ~340k files, ~1.5M commits, ~25 GB clone. Google ships depot_tools and partial-clone workflows; vanilla git struggles.
- Microsoft Windows · ~3.5M files
- The reason VFS for Git (now Scalar) exists. Files are virtualized — visible in the filesystem but only fetched when accessed. git status on the full 270 GB repo became feasible.
- Google internal · piper
- Not Git at all. ~2 billion lines of code in a monolithic repo, ~86 TB of data. Internal "Piper" with bespoke tooling. Public docs (Potvin & Levenberg, "Why Google Stores Billions of Lines of Code in a Single Repository", CACM 2016) explain why off-the-shelf VCS does not fit.
- Meta · Mercurial
- Started on Git, moved to Mercurial around 2012 specifically because Hg's extensibility made the "shallow + virtual filesystem" pattern easier. Now uses a fork called Sapling (open-sourced 2022).
Modern Git scale features. Partial clone (git clone --filter=blob:none, since 2020) skips downloading file contents until needed. Sparse checkout limits the working tree to a subset. Commit-graph files speed up git log by orders of magnitude. Bundle URIs (since git 2.38) let CDNs serve the bulk of a clone. Together these turn 30-minute Chromium clones into 30-second ones.
How GitHub, GitLab, and Gitea store your repos
Every host is a Git server plus a web UI, auth, and CI on top.
The shared truth: every Git host is a Git server (the protocol) plus a web UI plus auth plus CI plus issue tracking. The differences are operational.
GitHub (~100M users, 2024) runs custom servers on top of the Git protocol. Their storage backend is called Spokes (replicated storage, originally announced under the name DGit), behind a git-http-backend-style frontend. Issue tracking, Actions CI, Copilot, and Codespaces are layered on. Public engineering blog posts document how that replication layer evolved for resilience.
GitLab (~30M users) runs Gitaly as the storage daemon (Go service that wraps git for a shardable, gRPC-accessed backend) plus a Rails monolith for the UI. Self-hostable as the GitLab Community Edition; many large companies run on-prem GitLab. Storage is plain bare repositories on disk.
Gitea / Forgejo are lightweight self-hosted alternatives. Gitea (Go) is <100MB resident; Forgejo is a community-driven Gitea fork after Gitea's 2022 corporate transition. Suitable for teams up to ~1000 users; both ship Issues, PRs, and Actions-compatible CI.
Bitbucket (Atlassian) is the corporate-Git story for many enterprises, deeply integrated with Jira and Confluence. The server product (Bitbucket Data Center) is what self-hosters use; the cloud product is comparable to GitHub feature-wise.
Git is a small system surfaced by a confusing CLI. Once the three objects, the DAG, and the index click, the surface area collapses. Most "Git is hard" complaints are really "the porcelain commands are inconsistent" — and they are. The plumbing underneath is simple.
Further reading on Git internals
Primary sources, in order.
- Pro Git: "Git Internals - Plumbing and Porcelain". The chapter that turns Git from "magic" into "data structure". Required reading once.
- Build it yourself: "Write Yourself a Git". A 500-line Python implementation. Once you've written one, you've understood Git.
- Semicolony guide: "Hash tables". Content-addressed storage is hash tables on disk. The algorithms are the same.
- Semicolony guide: "Write-ahead logging". Different domain, same idea: an append-only log gives you a queryable history.