Volume VII — opening pages

The maths
underneath.

Every system in Semicolony rests on a small library of mathematical and physical primitives. Information theory bounds compression. Probability governs hashing and load. Graph theory is what routing actually is. Queueing sets your latency budget. Physics caps everything.

Information theory.

How much can a channel carry; how much does a symbol cost.

Shannon entropy

Open full page →

What

H(X) = −Σ p(x) log₂ p(x). The expected number of bits needed to encode a sample from X.

Where it lives

Compression bounds, password strength, ML loss functions, Huffman coding.

Key insight

Higher entropy = more uncertainty = more bits to encode. A fair coin: 1 bit. A biased coin (90/10): ~0.47 bits.

Mutual information

Open full page →

What

I(X;Y) = H(X) − H(X∣Y). How much knowing Y reduces uncertainty about X.

Where it lives

Decision-tree splits (information gain), feature selection, channel capacity.

Key insight

A perfectly informative split has I = H(X). A useless feature has I ≈ 0.

Channel capacity (Shannon-Hartley)

Open full page →

What

C = B · log₂(1 + S/N). Bits per second for a noisy channel of bandwidth B and signal-to-noise S/N.

Where it lives

Why a 25 MHz Wi-Fi channel can't carry 10 Gbit. The hard ceiling on every modulation scheme.

Key insight

Linear in bandwidth, logarithmic in SNR. Doubling power gives one extra bit per symbol.

KL divergence

Open full page →

What

D_KL(P‖Q) = Σ p(x) log(p(x)/q(x)). The "extra bits" cost of using Q to encode P.

Where it lives

Cross-entropy loss in ML, variational inference, t-SNE.

Key insight

Asymmetric: D_KL(P‖Q) ≠ D_KL(Q‖P). Zero only when P = Q.

Probability & randomness.

When the dice are honest — and when they aren't.

Birthday paradox

Open full page →

What

In a room of 23 people, two share a birthday with > 50% probability. Generally: collisions in n samples from k buckets become likely around √k.

Where it lives

Hash collisions, UUID collision probability, BREACH attacks, why 128-bit hashes feel "small".

Key insight

Hash output of N bits gives collision-resistance only at ~2^(N/2) operations. SHA-256 → 128 bits of resistance.

Poisson process

Open full page →

What

Events arrive at constant average rate λ; inter-arrival times are exponentially distributed.

Where it lives

Web request arrivals, network packet losses, server failures, cache eviction.

Key insight

Memoryless: P(next arrival in t∣no arrival yet for s) = P(arrival in t). Decisive for queueing math.

Concentration inequalities

Open full page →

What

Chernoff, Hoeffding, Chebyshev — bounds on how far an empirical sum strays from its expectation.

Where it lives

A/B test sample sizes, why "law of large numbers" actually converges, randomised algorithms' bounds.

Key insight

For n samples of a [0,1] variable: P(|mean − μ| > ε) ≤ 2 exp(−2nε²). The 2× factor comes from a two-sided bound.

Bayes theorem

Open full page →

What

P(A∣B) = P(B∣A) · P(A) / P(B). The rule for updating beliefs given evidence.

Where it lives

Spam filters, medical tests, A/B test priors, every Bayesian inference engine.

Key insight

Counterintuitive when P(A) is small: a 99%-accurate test on a 1-in-1000 disease still gives ~9% true-positive rate.

III

Graph theory.

Vertices, edges, and the algorithms that walk them.

Shortest paths

Open full page →

What

Dijkstra: O((V + E) log V) with a binary heap. Bellman-Ford: O(V·E), handles negative edges. A*: Dijkstra with an admissible heuristic.

Where it lives

Routing protocols (OSPF), GPS navigation, project scheduling, network distance in CDNs.

Key insight

A heuristic h(n) is admissible if h(n) ≤ true cost from n to goal. With one, A* expands O(b^d) nodes instead of all reachable.

Min-cut / Max-flow

Open full page →

What

The maximum flow from source to sink equals the minimum cut between them (Ford–Fulkerson).

Where it lives

Image segmentation, VLSI design, bipartite matching, network reliability.

Key insight

Max-flow algorithms range from O(V·E²) (Edmonds-Karp) to O(V²·√E) (Dinic) — pick by graph density.

Spanning trees

Open full page →

What

Connect all vertices with V−1 edges and minimum total weight. Kruskal (sort edges + union-find) or Prim (grow from a vertex).

Where it lives

Network broadcast (one shortest path to each node), clustering, redundancy planning.

Key insight

A graph with V vertices has V^(V−2) labelled spanning trees (Cayley's formula). Unique only for unique edge weights.

Connectivity & cycles

Open full page →

What

DFS in O(V + E) finds bridges, articulation points, strongly connected components, and topological orderings.

Where it lives

Build-system ordering, deadlock detection, code call graphs, social-network influence.

Key insight

A bridge is an edge whose removal disconnects the graph. Tarjan finds them all in one DFS.

Queueing theory.

Where load meets capacity.

Little's law

Open full page →

What

L = λW. Average number in system = arrival rate × average time in system.

Where it lives

Sizing thread pools, queue depth, request-in-flight tuning, every capacity-planning spreadsheet.

Key insight

Holds for any stable queue, regardless of arrival or service distribution. The most-used theorem in operations.

M/M/1 queue

Open full page →

What

Single server, Poisson arrivals (rate λ), exponential service (rate μ). Average wait W = 1/(μ−λ).

Where it lives

Initial latency models for any single-bottleneck system. The clean baseline before reality intrudes.

Key insight

The (μ−λ) denominator: as utilisation approaches 1, wait time → ∞. You cannot run a server at 99% utilisation.

Universal scalability law

Open full page →

What

C(N) = N / (1 + α(N−1) + βN(N−1)). Throughput as a function of concurrency, with contention α and coherence-cost β.

Where it lives

Choosing thread counts, replica counts, partition counts. Where adding more workers makes things worse.

Key insight

Above a critical N, throughput peaks and then drops. β captures coordination costs that grow as O(N²).

Tail latency

Open full page →

What

For fan-out queries, system latency is dominated by the slowest of N parallel requests. p99 of one becomes p50 of 100.

Where it lives

Microservice composition, every search index, MapReduce stragglers.

Key insight

Hedged requests (fire two, take the first) cut tail latency at modest extra cost — Dean & Barroso 2013.

Physics & hardware limits.

The bottlenecks that don't care about your code.

Speed of light

Open full page →

What

c ≈ 3 × 10⁸ m/s in vacuum, ~2 × 10⁸ m/s in fibre (refractive index ~1.5).

Where it lives

Cross-region replication latency, CDN edge placement, the floor under network round-trips.

Key insight

New York to London round-trip: ~56 ms minimum, just from light travel time. No protocol can beat this.

Memory hierarchy

Open full page →

What

L1 ≈ 1 ns, L2 ≈ 4 ns, L3 ≈ 12 ns, RAM ≈ 100 ns, NVMe ≈ 100 µs, network ≈ 1 ms, disk ≈ 10 ms.

Where it lives

Why cache locality matters more than algorithmic complexity for small n. Why DB engines fight for page-cache.

Key insight

Each level is roughly 10× slower than the last. A pointer chase across cache levels can dominate any computation.

Landauer principle

Open full page →

What

Erasing one bit of information dissipates at least kT·ln(2) joules of heat (≈ 3 × 10⁻²¹ J at room temperature).

Where it lives

Theoretical lower bound on computational energy. Why reversible computing is studied.

Key insight

Modern CPUs are ~10⁹ × above this limit. Cooling, not physics, is the immediate constraint — but the wall is real.

Amdahl's law

Open full page →

What

Speedup = 1 / (s + p/N), where s is the serial fraction. Maximum speedup is 1/s no matter how many cores.

Where it lives

Parallelisation budgets. Why "make it parallel" doesn't scale forever.

Key insight

A 5% serial fraction caps speedup at 20×. Twice as many cores can give less than twice the throughput.

Cryptographic foundations.

What makes a primitive secure — and what breaks it.

One-way functions

Open full page →

What

f(x) is easy to compute, f⁻¹(y) is computationally infeasible. The conjectural foundation of all asymmetric crypto.

Where it lives

Hash functions, password verification, every digital signature scheme.

Key insight

No one has proven they exist (P ≠ NP would imply it). The whole field operates on the assumption.

Discrete log

Open full page →

What

Given g and gˣ mod p, find x. Believed hard for primes ~2048-bit; very hard for elliptic curves at 256-bit.

Where it lives

Diffie-Hellman key exchange, ECDSA, Ed25519. The math behind TLS's key exchange.

Key insight

Elliptic curves give 128-bit security at 256-bit keys; classical DH needs ~3072-bit primes for the same strength.

Birthday attack

Open full page →

What

Finding any collision in an n-bit hash takes ~2^(n/2) operations, not 2^n.

Where it lives

Hash function security, why 128-bit MAC tags are the modern minimum.

Key insight

SHA-1 is 160 bits → 80 bits of resistance → broken in 2017. SHA-256 → 128 bits → safe past 2050.

AEAD

Open full page →

What

Authenticated Encryption with Associated Data. One operation provides confidentiality, integrity, and binding to a context.

Where it lives

TLS 1.3 cipher suites, age, signal protocol. Replaces "encrypt then MAC" composition errors.

Key insight

AES-GCM, ChaCha20-Poly1305. Never use raw block ciphers without AEAD wrapping.

VII

Computational complexity.

What is hard, and the structure of hardness.

P vs NP

Open full page →

What

P: problems solvable in polynomial time. NP: problems whose solutions can be verified in polynomial time. The conjecture P ≠ NP is the most consequential open problem in computing.

Where it lives

Underpins all of cryptography, optimisation theory, and the very notion of "this problem is hard".

Key insight

A million-dollar Clay Prize is open for either direction. Most cryptographers bet P ≠ NP, but no proof exists.

NP-completeness

Open full page →

What

A problem is NP-complete if every NP problem reduces to it in polynomial time. Cook-Levin (1971): SAT is NP-complete. Karp (1972): 21 problems are.

Where it lives

TSP, graph colouring, knapsack, vertex cover, scheduling — most "looks like" NP-hard problems are.

Key insight

A polynomial-time algorithm for any NP-complete problem would solve all of them. Reductions are how you prove your problem is hard without designing a new algorithm.

Approximation algorithms

Open full page →

What

When exact solutions are too expensive, find an answer guaranteed within a factor of optimal. PTAS: arbitrarily close. APX: bounded ratio.

Where it lives

Set cover, bin packing, MAX-SAT, scheduling. Production solvers routinely use 1.5×-optimal as "good enough".

Key insight

Some problems have no constant-factor approximation unless P = NP (max independent set). Some have a PTAS (knapsack). The classification matters.

Reductions

Open full page →

What

Show your problem is at least as hard as a known hard problem by mapping instances. The standard tool for "should I keep looking for a fast algorithm?"

Where it lives

Compiler optimisation, scheduling, planning, deadlock detection, register allocation — when you suspect hardness, reduce to 3-SAT or clique.

Key insight

A successful reduction means you can stop hunting for a polynomial algorithm. Failure to reduce doesn't prove the problem is easy — it just means you haven't proved it's hard yet.

VIII

Statistics & inference.

Sampling, signal, and the noise around it.

Central Limit Theorem

Open full page →

What

The sum of many independent random variables, regardless of their distribution, approaches a normal distribution. Convergence is O(1/√n).

Where it lives

Why averages "look normal" even when individual measurements aren't. The basis of A/B testing, confidence intervals, load-test analysis.

Key insight

Heavy-tailed distributions (Pareto, log-normal) approach normality slowly. Use medians or trimmed means for log-normal data; the CLT lies for small n.

Hypothesis testing & p-values

Open full page →

What

Set a null hypothesis, compute the probability of observing your data (or more extreme) under it. Reject if p < α (typically 0.05).

Where it lives

A/B tests, regression analysis for performance hypotheses, root-cause analysis on outage metrics.

Key insight

p-values are not the probability that the null is true. They are easily abused — selection bias, multiple comparisons, and underpowered tests give meaningless "significant" results.

Confidence intervals

Open full page →

What

A range that would contain the true parameter in 95% of repeated samples. The dual of hypothesis testing — and usually more informative.

Where it lives

Latency reports ("p99 is 45ms ± 3ms"), benchmark variance, A/B-test effect estimation.

Key insight

A 95% CI is not "95% probability the true value is here". It is "95% of intervals computed this way contain the true value". Frequentist subtlety.

Power & sample size

Open full page →

What

Power = 1 − β = probability of correctly rejecting the null when the alternative is true. Detecting a small effect requires a large sample.

Where it lives

How long does an A/B test need to run? How many requests does a load test need to bound p99?

Key insight

For a 1% effect at α = 0.05, power = 0.8 on a 10% baseline, you need ~16,000 samples per arm. Most A/B tests are stopped too early.

Linear algebra.

Vectors, matrices, and the rotations underneath ML.

Eigenvalues & eigenvectors

Open full page →

What

For matrix A, vector v with Av = λv. Captures invariant directions of a linear transformation. Power iteration finds the dominant one.

Where it lives

PageRank (the dominant eigenvector of the link matrix), spectral clustering, PCA, stability of dynamical systems.

Key insight

A matrix's eigenvalues encode its long-run behaviour: |λ| > 1 means amplification, |λ| < 1 means decay, complex λ means rotation.

Singular Value Decomposition

Open full page →

What

Any m×n matrix factors as A = UΣVᵀ where U, V are orthogonal and Σ is diagonal. Truncating Σ gives the best rank-k approximation in Frobenius norm.

Where it lives

Latent semantic analysis, recommender systems, image compression, regularised regression, every dimensionality reduction.

Key insight

The largest singular value tells you the matrix's 2-norm; truncating gives the best low-rank approximation. The math behind LSA, PCA, and modern embeddings.

Norms & inner products

Open full page →

What

L1: |x|+|y|. L2: √(x²+y²). L∞: max(|x|, |y|). The geometry you choose determines which solutions are favoured.

Where it lives

Lasso (L1) vs Ridge (L2) regression, cosine similarity in embeddings, gradient-clipping by norm in deep learning.

Key insight

L1 produces sparse solutions (many zeros); L2 produces dense, smooth ones. Choosing a norm is choosing a prior — and it shows up everywhere from optimisation to crypto.

Matrix multiplication complexity

Open full page →

What

Naive: O(n³). Strassen (1969): O(n^2.807). State of the art (Alman-Williams 2024): O(n^2.371). Conjectured exponent ω = 2.

Where it lives

Every neural-net forward pass, every linear regression, every iterative solver. The single most-tuned numerical kernel.

Key insight

GPUs and TPUs achieve their advantage by parallelising matmul. The cuBLAS GEMM kernel hits within 5% of theoretical peak FLOPS on every NVIDIA generation.

Coding theory.

Detecting and correcting errors at the wire.

Hamming codes

Open full page →

What

Add log(n) parity bits to n data bits to detect 2 errors and correct 1. The original 1950 code; first single-error-correcting scheme.

Where it lives

ECC RAM (corrects the bit flips that DRAM gets from cosmic rays), early modems, every CD player.

Key insight

A (7,4) Hamming code adds 3 parity bits to 4 data bits. The Hamming distance — bits that differ between two codewords — must be at least 3 for single-error correction.

Reed-Solomon codes

Open full page →

What

Treat data as polynomial coefficients; transmit polynomial values at extra points. Recover from any t lost values with 2t redundant points.

Where it lives

CDs, DVDs, QR codes, Blu-ray, satellite communication, RAID-6, distributed storage erasure coding.

Key insight

A (255, 223) Reed-Solomon code corrects 16 byte errors per 255-byte block. The math behind every "scratched disc still plays" scenario.

Erasure codes & LDPC

Open full page →

What

Generalise RS to large block sizes with sparse parity equations. Encode/decode in linear time using belief propagation. Approach Shannon's channel-capacity limit.

Where it lives

5G, Wi-Fi 6, deep-space probes, modern distributed storage (S3, GCS).

Key insight

LDPC codes used in 5G achieve rates within 0.5 dB of theoretical capacity. The same family powers every "this satellite link works at all" engineering.

Cyclic redundancy check (CRC)

Open full page →

What

Treat the message as a binary polynomial; compute its remainder modulo a fixed generator polynomial. Detect any single burst of N consecutive bit errors.

Where it lives

Ethernet frames (CRC-32), TCP segments (with checksums on top), USB, ZIP archives, every DNS message.

Key insight

CRC is for detection only — it cannot correct errors. CRC-32 catches almost all random bit flips and any burst of ≤ 32 consecutive bit errors.

Adjacent

From the maths, back to the papers.

Most foundations on this page have a paper or two that established them. The Papers volume catalogs them by topic — read after this one if you want to go deeper into proofs and history. Coming at this from the other direction, before the maths? The apprenticeship's machine task covers the load-bearing slice first — memory, the call stack, and what happens when a program runs.

Open Papers → Or guides

Found this useful?

The mathsunderneath.

Shannon entropy

Mutual information

Channel capacity (Shannon-Hartley)

KL divergence

Birthday paradox

Poisson process

Concentration inequalities

Bayes theorem

Shortest paths

Min-cut / Max-flow

Spanning trees

Connectivity &amp; cycles

Little's law

M/M/1 queue

Universal scalability law

Tail latency

Speed of light

Memory hierarchy

Landauer principle

Amdahl's law

One-way functions

Discrete log

Birthday attack

AEAD

P vs NP

NP-completeness

Approximation algorithms

Reductions

Central Limit Theorem

Hypothesis testing &amp; p-values

Confidence intervals

Power &amp; sample size

Eigenvalues &amp; eigenvectors

Singular Value Decomposition

Norms &amp; inner products

Matrix multiplication complexity

Hamming codes

Reed-Solomon codes

Erasure codes &amp; LDPC

Cyclic redundancy check (CRC)

From the maths, back to the papers.

The maths
underneath.

Connectivity & cycles

Hypothesis testing & p-values

Power & sample size

Eigenvalues & eigenvectors

Norms & inner products

Erasure codes & LDPC