Decision Tree Visualizer: splits made visible.

A decision tree classifies points by asking one feature-threshold question at a time, drawing axis-aligned cuts that carve the plane into labelled regions. Drop points by class and watch the tree pick the split that minimises impurity at each node.

Depth
3
Points
20

Click in the plot to add a point · alternates between blue (0) and orange (1)
Root split
if x < 1 → left subtree (further splits up to depth 2) else → right subtree

What you're looking at

The scatter plot is a 10-by-10 plane of labelled points: blue dots are class 0, orange dots are class 1, and each click drops a new point, alternating the class. From those points the tree searches every vertical and horizontal cut, scores each by Gini impurity, and keeps the one that leaves the two halves purest. The root-split readout under the plot names that first cut as if x < t or if y < t, and the depth buttons set how many further splits the subtrees below it may add.

Drop a tight cluster of blue in one corner and orange in the opposite one, then watch the readout snap to a clean cut between them. The thing to notice is that the split is always x < t or y < t, never a diagonal. A tree can only ask about one feature at a time, so a slanted boundary has to be approximated by a staircase of axis-aligned cuts stacked through the deeper levels. Mix the two classes together instead and the cleanest available cut barely beats a coin flip, which is the moment a single shallow tree stops being enough and ensembles start to earn their keep.


What is a decision tree?

A flowchart that learns itself.

A decision tree is a learned flowchart: each internal node tests one feature against a threshold, each leaf is a prediction. Ross Quinlan's ID3 (1986) and Leo Breiman's CART (1984) are the foundational algorithms. Random forests (Breiman, 2001) and gradient-boosted trees (Friedman, 2001) ensemble them; XGBoost, LightGBM, and CatBoost dominate production tabular ML today.

Imagine you have a thousand emails labelled either “spam” or “not spam”, and you want a rule for predicting the label of new emails. The first instinct — “write down a checklist of suspicious words” — works for a while and then breaks. Different users have different signals; spammers learn to avoid the obvious words; the manual rules drift out of date. You want a rule that looks at the data and writes itself, and that you can still read after it does.

The trick a decision tree uses is brutally simple. Take all the labelled examples and ask one question: “does email contain the word winner?”. Try every possible question — every word, every numerical threshold (“more than 50 capital letters?”), every property — and pick the one that, after splitting the data into two groups, leaves the cleanest pile possible. “Cleanest” means the spam-vs-not-spam mix is most lopsided in each child. Then recurse: on each child, repeat the same trick until either the group is pure (everyone agrees) or the group is too small to keep splitting.

The output is a flowchart. To classify a new email, walk down the tree following the answers — “contains winner? yes. From a free email provider? no. Has more than three exclamation marks? yes” — and report the leaf's label. Trees are trained in seconds, score in microseconds, and produce predictions you can read out loud. That last property is why credit-scoring agencies, medical-decision-support systems, and high-stakes regulatory contexts all use trees long after fancier models existed: a trained tree is a transparent rulebook, and that's a feature regulators can audit.

The measure of “cleanest split” is the only mathematical idea you need to know. Gini impurity of a group is 1 − sum(p²) where p is the fraction of each class in the group. A pure group (all spam) has Gini 0. A 50/50 group has Gini 0.5. The split-search picks the question whose two children have the lowest weighted-average Gini. The simulator above lets you click on a 2D scatter to add labelled points and watch the tree pick splits in real time. Each split is a horizontal or vertical line that carves the input space into rectangles; deeper splits subdivide further.

The downsides come quickly. A tree with no depth limit will split until every training point sits in its own leaf — perfect on training, terrible on new data (textbook overfitting). A tree is greedy at each step, so it can miss globally better splits available later. And a tree predicts only values it has seen — it can't extrapolate beyond the convex hull of its training labels. Modern production systems wrap these flaws by training many shallow trees together as a random forest or gradient-boosted ensemble — Parts 05 and 06 cover both. The starting point is the single tree, and the starting question is “which split is cleanest?”.

GINI SPLIT INTUITION · ONE FEATURE, ONE THRESHOLDBEFORE SPLITGini = 0.506 vs 5 — almost balancedx < 5.2 ?LEFTGini = 0.00RIGHTGini = 0.00GAIN = 0.50 → both children are purePICK THE QUESTION WITH THE LARGEST GAIN, RECURSE

From Quinlan's ID3 to Breiman's CART.

The decision tree as a machine-learning algorithm has two parallel origins. Ross Quinlan's ID3 (Iterative Dichotomiser 3) appeared in his 1986 paper Induction of Decision Trees in the journal Machine Learning. ID3 was an outgrowth of Quinlan's earlier work on the chess endgame database CLS, and it formalised the recursive top-down split-by-information-gain procedure that everyone still uses today. The algorithm picks, at each node, the categorical feature whose split produces the largest reduction in entropy, and recurses on the resulting partitions. ID3 only handled categorical features and had no pruning step, but the framework was complete.

Quinlan's 1993 book C4.5: Programs for Machine Learning (Morgan Kaufmann) extended ID3 with three additions that made the algorithm production-grade: continuous features (find the best threshold by sorting and scanning), missing-value handling (proportionally distribute samples down both branches with weighted contribution), and post-pruning by error-rate reduction on a held-out set. C4.5 became the reference open-source implementation; its successor C5.0 (commercial, 1998) added boosting, multi-threading, and rule extraction. The Quinlan lineage uses information-gain ratio (info-gain divided by the split's own entropy) to discourage splits that produce many tiny pure children.

Independently, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published CART — Classification and Regression Trees — as a Wadsworth book in 1984, the same year Apple shipped the first Macintosh. CART differs from C4.5 in three ways. It uses Gini impurity rather than entropy (cheaper to compute, no log; in practice the splits chosen are nearly indistinguishable). It always produces binary splits, even on categorical features (it tries the 2k-1-1 partitions of a k-valued category and picks the best, with a clever sorted-mean shortcut for continuous targets). And it introduced cost-complexity pruning, also called weakest-link pruning, which trades training error against tree size via a regularisation parameter α: minimise R(T) + α|T|, sweep α, pick the size that wins on cross-validation. Cost-complexity pruning is the textbook approach for selecting a single tree, and it's the algorithm scikit-learn's DecisionTreeClassifier with ccp_alpha implements.

The CART book ran to nearly 360 pages of derivations and case studies; the algorithm is the foundation of nearly every modern tree implementation. Breiman's later work — bagging in 1996, random forests in 2001 — built directly on CART. Quinlan's C4.5 and Breiman's CART converged in practice; modern libraries blur the line, often offering both impurity criteria as a hyperparameter.

The historical lineage matters because trees pre-date most of the field. The earliest "decision tree" structures appear in Hunt, Marin & Stone's Experiments in Induction (Academic Press, 1966) and the AID program (Sonquist & Morgan, 1963) for survey analysis. Belson (1959) is sometimes cited as the first written description of automatic interaction detection by recursive partitioning. By the time Quinlan and Breiman wrote, the idea was already three decades old — what the 1980s contributions added was a rigorous information-theoretic justification for the split criterion and a principled way to size the final tree. Those two contributions made the algorithm reproducible enough to ship as software rather than as a hand-tuned procedure for each dataset.


Splitting criteria — Gini, entropy, picking the best cut

Picking the best cut.

A decision tree is built greedily, one split at a time. At each node the algorithm tries every feature and, for continuous features, every threshold between adjacent sorted values; it picks the (feature, threshold) pair that reduces impurity the most; it recurses on each resulting partition. With n samples and d features, scoring all candidate splits at one node is O(d · n log n) — the sort dominates. Building the full tree is O(d · n log² n) on average, O(d · n²) worst-case for adversarial data. With caching of pre-sorted feature columns (the trick scikit-learn calls presort) you can amortise the sort cost over the entire tree build.

Three impurity measures dominate. Gini impurity is 1 − Σ p&sub2;i; it ranges from 0 (pure node) to 1 − 1/k for k-class problems with maximum mixing. Entropy is −Σ p&sub2;i log p&sub2;i, ranging from 0 to log k. The information gain is the parent's entropy minus the weighted average of the children's entropies; this is what ID3 maximised. MSE for regression is (1/n) Σ (yi − ̅y)²; the split chosen is the one whose two children have minimum weighted sum of squared deviations from their means.

Gini and entropy disagree on fewer than 2% of splits in practice (Raileanu & Stoffel, 2004 — Theoretical Comparison Between the Gini Index and Information Gain Criteria). The choice of criterion matters far less than depth, leaf size, and pruning. CatBoost ships with both options and the developers note in their documentation that the difference is "negligible on real datasets". The relevant choice on real datasets is whether you're classifying or regressing, and how aggressively you're willing to prune.

The split-search itself has been heavily optimised. Modern boosting libraries replace the exact greedy search with histogram-based binning: each continuous feature is bucketed into 256 quantiles up front, and the per-node search considers only those 255 candidate thresholds rather than every distinct value. This drops the per-split cost from O(n) to O(b) with b=256, and the accuracy loss is typically below 0.1% on benchmarks. LightGBM, XGBoost (with tree_method=hist), CatBoost, and HistGradientBoosting in scikit-learn all use this trick. It also makes GPU acceleration practical: histogram aggregation maps cleanly onto SIMT hardware, and CUDA-accelerated training of a thousand-tree boosted model is now a matter of minutes on benchmarks that took hours on CPU a decade ago.

Criterion Formula Used by
Gini impurity1 − Σ p²CART, scikit-learn (default), XGBoost.
Entropy / Info gain−Σ p log pID3, C4.5, C5.0.
Gain ratioinfo-gain / split-infoC4.5; penalises high-cardinality categoricals.
MSE / variance(1/n) Σ (y − ̅y)²CART regression, gradient boosting trees.
SPLIT SEARCH · BEST CUT MAXIMISES GINI IMPURITY DROProot: 60 samplesGini = 0.500if x < 5.2left: 28 samplesGini = 0.122right: 32 samplesGini = 0.211GAIN = 0.500 − (28/60)·0.122 − (32/60)·0.211 = 0.330 (largest among all splits tried)

Overfitting — memorising vs generalising

Memorising vs generalising.

Left to its own devices, a decision tree keeps splitting until every training example sits in its own leaf — perfect training accuracy, awful test performance. The classic bias-variance picture in Hastie, Tibshirani & Friedman's The Elements of Statistical Learning (Springer, 2nd ed. 2009; the canonical free PDF from Stanford) plots training error monotonically falling with depth while test error first falls then rises. The minimum of that U-curve, taken on a held-out validation set or via k-fold cross-validation, is the depth you actually want.

Two regularisation strategies are in play. Pre-pruning stops the tree from growing past certain limits: max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes, min_impurity_decrease. Post-pruning grows a deep tree first and trims it afterwards. Cost-complexity pruning, the canonical post-pruning algorithm from CART, sweeps the regularisation parameter α: at α=0 the tree is fully grown, at large α only the root remains, and somewhere in between is the cross-validation winner. scikit-learn exposes this via cost_complexity_pruning_path, which returns the sequence of impurity-vs-alpha pairs you can scan.

For tabular workloads the U-curve's minimum usually sits at depth 6–10 for a single tree on hundreds of thousands of rows. With more data you can afford more depth before the variance term wins; with fewer features you need less. The hyperparameters you choose for a single tree are nearly the opposite of what you choose for trees inside a forest or boosting ensemble. A random forest wants deep, decorrelated trees (depth 20+, no pruning) because averaging cancels their variance. Gradient boosting wants shallow, fast trees (depth 4–6) because each tree is a small correction.

The other knob, max_features, restricts how many features each split considers. A common choice for classification random forests is sqrt(d); for regression, d/3. The point is decorrelation: forcing each tree to ignore a different random subset makes their errors more independent, which is what averaging needs to be effective. Breiman's 2001 random-forest paper proves a generalisation-error bound in terms of the strength of individual trees and the correlation between them — the bound shrinks as either strength rises or correlation falls, and feature subsampling moves the second.

Practically, the easiest diagnostic for over- or under-fitting is the gap between training and validation accuracy plotted against depth. A flat training curve hugging a flat validation curve at low accuracy means the tree is too shallow — raise depth, lower min_samples_leaf. A training curve climbing toward 100% while the validation curve plateaus or falls is the textbook overfitting signature — cap depth, raise min_samples_split, or sweep ccp_alpha with cross-validation. The diagnostic is the same one Hastie et al draw in figure 7.1 of their book; it's a hundred years old in spirit and the right way to start any tree-tuning session.

The U-curve, in three lines

Train error falls monotonically with depth; validation error first falls then rises. The minimum of the validation curve — not the training one — is the depth you actually want. For tabular data with hundreds of thousands of rows it lands at depth 6–10 for a single tree; for trees inside a random forest, push it past 20.


Random forests and gradient boosting — many trees, one prediction

Many trees, one prediction.

A single decision tree is high-variance — small changes in data flip large parts of the structure. Ensembles fix this by training many trees and aggregating their predictions. Two strategies dominate: bagging (parallel) and boosting (sequential), each captured by a foundational paper.

Breiman's random forests (Machine Learning, 2001) train N independent trees, each on a bootstrap sample of the data and a random feature subset, and predict by majority vote (classification) or average (regression). N is typically 100–1000; the marginal accuracy gain from doubling N flattens around 500. Bootstrapping naturally leaves ~37% of samples out of any one tree (the "out-of-bag" set), which gives a free held-out estimate of generalisation error without a separate validation split. Random forests need almost no tuning, parallelise trivially across cores or machines, and remain a strong default baseline. The original Breiman implementation in Fortran, ported to R as randomForest, set the standard; scikit-learn's RandomForestClassifier is the modern Python descendant.

Friedman's gradient boosting (Greedy Function Approximation: A Gradient Boosting Machine, Annals of Statistics, 2001) takes the opposite tack: train a shallow tree (depth 4–6), compute the residuals it can't explain, train the next tree on those residuals, add. Repeat thousands of times with a small learning rate. The framework generalises to any differentiable loss, and the trees become weak learners stacked into a strong one. XGBoost (Chen & Guestrin, KDD 2016) added second-order gradient information, regularisation on tree complexity, sparsity-aware splitting, and a parallel histogram-based algorithm; it dominated Kaggle from 2014 onwards. LightGBM (Ke et al, NeurIPS 2017) introduced gradient-based one-side sampling and exclusive feature bundling for ~10× faster training; it grows trees leaf-wise rather than level-wise. CatBoost (Prokhorenkova et al, 2018) added ordered boosting to fix target leakage in categorical encoding and ships native categorical-feature handling without manual one-hot expansion.

Rule of thumb: random forest when you want a strong baseline with little tuning; gradient boosting when accuracy matters and you have time to tune learning rate, depth, regularisation, and number of rounds. Both interpret as feature-importance and partial-dependence plots; the SHAP library (Lundberg & Lee, NeurIPS 2017) gives game-theoretic Shapley-value attributions for each prediction, which is the modern explainability standard for tree ensembles. On Kaggle's 2023 State of Data Science survey, gradient-boosted trees remained the most-used non-neural model for tabular data, ahead of random forests and well ahead of linear models.

RANDOM FOREST (PARALLEL, BAGGING) vs GRADIENT BOOSTING (SEQUENTIAL, RESIDUALS)RANDOM FORESTtree 1tree 2tree 3tree Nmajority vote / meandeep, decorrelated, parallelavg variance ↓, bias unchangedGRADIENT BOOSTINGtree 1tree 2tree NF(x) = Σᵢ η · fᵢ(x)each fᵢ fits residuals of Fweighted sum of treesshallow, sequential, η smallbias ↓, variance grows with N

Beyond axis-aligned cuts — oblique trees, cubic kernels

Beyond axis-aligned cuts.

The classical CART tree splits on a single feature at a time — geometrically, it carves the input space with rectangles whose edges parallel the axes. That works when the discriminative boundaries align with features, but a 45-degree boundary in two-feature space requires a tower of axis-aligned cuts to approximate. Oblique trees split on a linear combination of features (3.2 x&sub1; + 1.7 x&sub2; < 4.1); they fit non-axis-aligned boundaries with far fewer nodes but cost more to train (the search at each node is now over a high-dimensional weight vector). Murthy, Kasif & Salzberg's OC1 (1994) was an early implementation; HHCART by Wickramarachchi et al (2016) is a modern one. Oblique trees rarely appear in production because the marginal accuracy gain is small and the interpretability cost is large — one can read "age < 35" from a tree node, but "0.7·age + 0.3·income < 47.2" reads like the linear model decision trees were chosen to avoid.

For unsupervised anomaly detection, Liu, Ting & Zhou's isolation forest (ICDM 2008) flips the algorithm: build many shallow random trees and score points by the average path length to isolation. Outliers are isolated quickly because they sit far from the bulk of the data; normal points need many splits to become a leaf alone. The whole approach is O(n log n) training, constant memory per tree, and competitive with one-class SVMs on synthetic and real benchmarks. scikit-learn's IsolationForest is the default unsupervised anomaly detector in many production fraud and intrusion-detection systems.

Other variants: extremely randomised trees (Geurts, Ernst & Wehenkel, 2006) skip the full split search and pick a random threshold for each candidate feature, then keep the best of those random candidates. Faster to train, slightly less accurate, more decorrelated — a useful base learner for forests. BART (Bayesian Additive Regression Trees, Chipman, George & McCulloch, 2010) puts a prior on small trees and samples the posterior via MCMC; it's the Bayesian counterpart to gradient boosting. Soft trees replace hard splits with sigmoid gates and become differentiable, which is how the line between trees and neural networks blurs in models like Google's TabNet.

One practitioner-favourite, quantile regression forests (Meinshausen, JMLR 2006), tracks the entire conditional distribution of the target variable rather than just the conditional mean. Predicting the 5th and 95th quantiles produces calibrated prediction intervals at zero extra training cost; the distribution comes for free from the training samples falling into each leaf. This matters for inventory forecasting (need an upper bound to size warehouses), for risk modelling (need lower-bound guarantees on capital reserves), and for any problem where downstream decisions depend on tail behaviour. Production fraud-detection systems often combine a CatBoost classifier with a quantile-forest auxiliary model to flag the long tail of rare events the classifier's point prediction blurs.


Where tabular ML actually runs — XGBoost, LightGBM, CatBoost

Where tabular ML actually runs.

Despite a decade of deep-learning hype, tree ensembles still own tabular machine learning in production. Shwartz-Ziv & Armon's 2022 paper Tabular Data: Deep Learning Is Not All You Need compared neural-network methods (TabNet, NODE, FT-Transformer) against XGBoost on eleven benchmark datasets and showed XGBoost won on most, with the few neural-network wins requiring task-specific tuning. Grinsztajn, Oyallon & Varoquaux (NeurIPS 2022) reached the same conclusion with a larger benchmark.

Real production deployments include: Stripe Radar, the credit-card fraud detection system, runs gradient-boosted trees scored at sub-millisecond latency on every transaction (Stripe engineering blog, 2017). Microsoft Bing's ranking model used LambdaMART (a boosted-tree learning-to-rank algorithm) for years before switching to neural rankers; LightGBM came out of Microsoft Research and remains a Bing-internal staple. Yandex Search built CatBoost specifically for ranking, recommendation, and classification on its production data. Uber's ETA prediction (Michelangelo platform) uses XGBoost models retrained daily on hundreds of millions of trips. Netflix's personalisation stack uses gradient-boosted trees alongside neural collaborative filtering. Lyft, DoorDash, Instacart all publish about their tree-based ranking models on their respective engineering blogs.

The deployment numbers are concrete. A typical XGBoost model with a thousand trees of depth six scores in under 100 microseconds per row on a single core; a million features per second per core is achievable with the C++ runtime. Memory footprint is roughly 100 KB per tree, so a 1000-tree model fits in a few hundred megabytes — small enough to ship with the application binary, no GPU required, no inference server needed. ONNX Runtime, Treelite (DMLC) and ONNX-MLIR all compile tree models to native code that scores at SIMD-accelerated speed; this is why fraud detection, ad ranking, and quote pricing all run on tree ensembles rather than neural networks. The latency budget is the deciding factor.

AlgorithmSplit criterionPruneEnsemble?
ID3 (1986)information gainnonesingle
C4.5 / C5.0gain ratiopost (error-based)single (C5.0 boosting)
CARTGini / MSEcost-complexity (α)single, basis for RF/GB
Random Forest (2001)Gini, on bootstrap + feature subsetnone (deep trees OK)bagging, parallel
XGBoost (2016)2nd-order grad + L1/L2shallow + min-child-weightboosting, level-wise
LightGBM (2017)histogram + GOSSmax-leaves, EFBboosting, leaf-wise
# scikit-learn DecisionTreeClassifier — production-shaped tuning
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier(
    criterion="gini",          # or "entropy" / "log_loss"
    max_depth=8,                # cap to control variance
    min_samples_split=20,
    min_samples_leaf=10,
    ccp_alpha=0.001,            # cost-complexity post-pruning
    class_weight="balanced",
    random_state=42,
)
# Sweep ccp_alpha via cost_complexity_pruning_path:
path = clf.cost_complexity_pruning_path(X_train, y_train)
grid = GridSearchCV(clf, &#123;"ccp_alpha": path.ccp_alphas&#125;, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

When NOT to use trees — the places trees don't earn their keep

The places trees don't earn their keep.

Trees and tree ensembles are the right tool when features are heterogeneous (mix of numeric, categorical, ordinal), interactions matter, training data is moderate (103 to 108 rows), and inference latency budget is tight. They are not the right tool for several adjacent problem classes.

For images, audio, and text, convolutional and transformer architectures dominate because the relevant features are spatial or sequential and must be learned from the raw signal. A decision tree applied to raw pixel values produces nonsense; it can only consume engineered features (HOG descriptors, CNN embeddings) which is just routing the heavy lifting elsewhere. ResNet, ViT, and BERT exist because the input modality calls for representation learning, not just classification on top of fixed features.

For truly linear relationships with strong sparsity, regularised linear models (Lasso, Elastic Net, L2-logistic) often beat trees. A genome-wide association study with 100,000 SNPs and 1,000 patients is a place where Lasso shines and trees thrash. The same applies to high-dimensional text classification, where logistic regression on TF-IDF features remains a benchmark that's hard to beat without a transformer.

For extrapolation, trees are catastrophically bad. A tree predicts only the value of the leaf its query lands in, so it cannot predict a value outside the convex hull of the training labels. Linear models extrapolate naturally; Gaussian processes extrapolate with quantified uncertainty. Time-series forecasting with a structural trend is a place to reach for ARIMA or a neural temporal model rather than a tree.

For online learning with concept drift, retraining a tree ensemble from scratch is expensive. Hoeffding trees (Domingos & Hulten, KDD 2000) are an online variant that splits when statistical bounds permit; they ship in MOA and Apache SAMOA but rarely match batch-trained gradient boosting on accuracy. Production teams typically batch-retrain trees daily or hourly rather than learn online. If your stream rate exceeds what hourly retraining can handle, an online linear model with stochastic gradient descent or a neural network with periodic distillation may serve better.


Further reading on decision trees

Primary sources, in order.

Found this useful?