How is early stopping different from regularisation?

Early stopping monitors validation error during training and stops when it starts to rise, even if training error is still falling. Conceptually it limits the 'effective' number of training iterations rather than the model's parameter-space capacity, but empirically it has a very similar effect: it prevents the model from exploiting noise-level features it would discover only with more training steps. It is essentially free (just monitor val error during training), so it is a standard default in deep learning even alongside other regularisation techniques.

How do I tell overfitting from dataset drift?

Overfitting is diagnosed within the training/validation split you have — large train-val gap, unchanging test-set distribution. Dataset drift is diagnosed across time or populations — the training data was collected from one distribution, the deployment environment has a different one. A model can appear to generalise well on a held-out test set from the same distribution as training, then fail in production because the production distribution differs. The fix for overfitting is the techniques in this chapter; the fix for drift is monitoring and periodic retraining on fresh data, not a model change.

Overfitting and Underfitting

When a model is too simple, too complex, or just right — and how to tell which

A data scientist trains a decision tree on 5,000 bank loan applications. Training accuracy: 99.8%. Validation accuracy: 71%. The model has memorised the training set: it will ship with confidence and fail the moment it meets tomorrow's applicants. A week later the same person trains a logistic regression and reports training accuracy 74% and validation accuracy 73%. The first model is overfit; the second is underfit. The roughly 29-point train-val gap on the tree and the flat 73% of the logistic regression are two faces of the same bias-variance trade-off, and the fix for each is completely different.

Learning Objectives

1. Define overfitting and underfitting in terms of the gap between training and validation error: overfitting shows low train error with high val error (memorisation), underfitting shows both high (too simple).
2. Read a train/validation error curve as model complexity grows and identify the 'sweet spot' (the complexity minimising validation error) as distinct from the 'best' training error, which always favours complexity.
3. Diagnose whether a given train/val gap represents overfit, underfit, or a good fit, and choose the corresponding remedy (more data, less capacity, more regularisation, or more capacity) based on the specific failure mode.
4. Explain how L2 (ridge) regularisation and early stopping shrink effective model capacity, and when to use them versus reducing model size directly.

¶ Narrative

The Gap Between Training and Validation

A data scientist trains a decision tree on 5,000 bank loan applications. Training accuracy comes back at 99.8%. They are delighted — until the validation accuracy prints 71%. Something is wrong. The same scientist, a week later, trains a logistic regression on the same data. Training accuracy 74%, validation accuracy 73%. Both models are broken but in opposite ways. The tree memorised the training set — huge train accuracy, poor generalisation. The logistic regression is too simple to capture the pattern — both accuracies are low. This chapter is about recognising which failure you are looking at and what to do about each one.

Three regimes

Every trained model sits in one of three regimes, diagnosed by the gap between training and validation error.

The same 24 noisy data points fitted by polynomials of increasing degree. Left: degree 1 (a straight line) visibly misses all the curvature — underfit. Middle: degree 3 captures the shape cleanly without hugging individual noise spikes — good fit. Right: degree 12 wiggles through every point of noise — overfit. The dashed grey line is the true underlying function; the middle panel tracks it closely while the outer two fail for opposite reasons.

These three regimes correspond to three diagnostic patterns:

Underfit: training error high, validation error high (and roughly equal). The model is too simple to represent the data.
Good fit: training error low, validation error low (and roughly equal). Training and validation error tell the same story.
Overfit: training error low, validation error high. The model has fitted the noise, not the pattern.

The gap val_err − train_err is the direct diagnostic. Small or zero is good. Large is overfitting. If both numbers are large, the model is underfit and the gap may still be small — the failure is absolute, not relative.

The complexity curve — what model size actually buys you

As you increase model capacity (polynomial degree, tree depth, network width), training error falls monotonically — more parameters can always fit more of the training data, right down to zero error if capacity is unbounded. Validation error does not behave this way. It follows a characteristic U-shape: it falls as the model captures real structure, then rises as the model picks up noise-level features that do not generalise.

Training error (accent) vs validation error (amber) as polynomial degree grows from 1 to 14 on the same dataset the playground uses. Training error falls monotonically — every additional degree of capacity reduces it. Validation error is U-shaped: it drops sharply from degree 1 to roughly degree 3 as the model captures the true curvature, then rises as the model starts fitting noise. The sweet spot marked with a dashed vertical line is the degree that minimises validation error — that is the 'right' capacity for this data.
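A sweep like the one in this figure can be sketched in a few lines of NumPy. The data here is a synthetic stand-in (a noisy sine with an assumed noise level), not the playground's dataset, and names such as `true_fn`, `train_mse` and `val_mse` are illustrative only.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth function, chosen for illustration only.
    return np.sin(3.0 * x)

# Small noisy training set, larger validation set from the same distribution.
x_train = rng.uniform(-1.0, 1.0, 30)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_val = rng.uniform(-1.0, 1.0, 500)
y_val = true_fn(x_val) + rng.normal(0.0, 0.2, x_val.size)

degrees = list(range(1, 11))
train_mse, val_mse = [], []
for d in degrees:
    # Ordinary least-squares polynomial fit of degree d.
    poly = Polynomial.fit(x_train, y_train, deg=d, domain=[-1.0, 1.0])
    train_mse.append(float(np.mean((poly(x_train) - y_train) ** 2)))
    val_mse.append(float(np.mean((poly(x_val) - y_val) ** 2)))

best_degree = degrees[int(np.argmin(val_mse))]
for d, tr, va in zip(degrees, train_mse, val_mse):
    marker = "  <- lowest validation MSE" if d == best_degree else ""
    print(f"degree={d:2d}  train MSE={tr:.4f}  val MSE={va:.4f}{marker}")
```

Because each degree's model class contains the previous one, the training column can only go down; the validation column is the one to read when choosing a degree.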

The shape of the validation curve is the bias-variance trade-off made visible.

Bias (high at low degree): the model’s structural assumption is too restrictive. It cannot represent the true pattern regardless of how much data you give it.
Variance (high at high degree): the model’s predictions change dramatically when you resample the training data, because it has enough flexibility to chase whatever noise is present.

Bias falls as capacity rises; variance rises as capacity rises. Expected validation error is roughly squared bias plus variance (plus irreducible noise), so the minimum sits somewhere in the middle. The practical goal of modelling is not to drive bias to zero (which overfits) or variance to zero (which underfits) but to find the capacity that minimises their combined contribution.
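One way to see the two terms separately is to refit the same model on many resampled training sets and measure how the predictions behave. The sketch below assumes a known ground-truth function and noise level (both invented for illustration), which is only possible in simulation; `bias_variance` is a hypothetical helper, not something from the chapter's playground.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
NOISE_SD = 0.3                          # assumed noise level
x_train = np.linspace(-1.0, 1.0, 25)    # fixed inputs; only the noise is resampled
x_eval = np.linspace(-0.9, 0.9, 50)     # points where predictions are compared

def true_fn(x):
    # Assumed ground-truth function, for illustration only.
    return np.sin(3.0 * x)

def bias_variance(degree, n_resamples=300):
    """Estimate squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_resamples, x_eval.size))
    for i in range(n_resamples):
        y_train = true_fn(x_train) + rng.normal(0.0, NOISE_SD, x_train.size)
        poly = Polynomial.fit(x_train, y_train, deg=degree, domain=[-1.0, 1.0])
        preds[i] = poly(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree={d}  bias^2={b:.4f}  variance={v:.4f}  sum={b + v:.4f}")
```

The low degree should show the largest squared bias and the high degree the largest variance, which is the pattern the two bullets above describe.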

Three knobs for fixing overfit models

When you diagnose overfitting — training error low, validation error much higher — you have three main interventions available.

More training data. Adding data makes memorisation progressively harder. A model with fixed capacity facing a larger dataset must abstract — there are too many examples to fit individually. This is often the most effective remedy, and often the most expensive. When you cannot get more real data, sometimes augmentation (slight transformations of existing examples) or semi-supervised learning from unlabelled data can approximate it.

Less capacity. Directly reduce the model’s expressive power — fewer tree splits, shallower network, lower polynomial degree. This is cheap but requires you to know which capacity is too much. A common pattern: start with a model you suspect is too small and increase capacity until training and validation error diverge, then back off slightly.

Regularisation. Add a penalty term to the loss that encourages simpler solutions without changing the nominal model size. The most common form is L2 (ridge) regularisation, which adds λ times the sum of squared coefficients to the loss:

\mathrm{Loss}_{L2} = \underbrace{\mathrm{MSE}(y, \hat{y})}_{\text{fit}} + \underbrace{\lambda \sum_{i} c_{i}^{2}}_{\text{penalty}}

Large coefficients now cost the optimiser more, so it picks the smallest coefficients that still fit the data well. Empirically this produces much smoother fits without removing any model parameters.

The same degree-12 polynomial fit to the same 20 noisy training points, with and without L2 regularisation. Left: no regularisation (λ = 0) — the fit wiggles wildly through every noise spike, a textbook overfit. Right: ridge regularisation with λ = 0.4 — the coefficients shrink, so the curve becomes smooth and tracks the true underlying function (dashed) closely. Same model size, completely different behaviour.
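The shrinkage effect can be reproduced with a small closed-form ridge fit. This is a sketch under stated assumptions: the 20 points are synthetic, the features are raw powers of x (no standardisation), and the intercept is left unpenalised by centring. Because feature scaling changes what a given λ means, the numbers will not match the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1.0, 1.0, 20))
y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, x.size)   # assumed synthetic data

degree = 12
# Columns x^1 ... x^12; centring X and y removes the need for an intercept term.
X = np.vander(x, degree + 1, increasing=True)[:, 1:]
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_fit(lam):
    """Minimise mean((yc - Xc @ c)**2) + lam * sum(c**2).

    Solved as an augmented least-squares problem, which is numerically
    gentler than forming Xc.T @ Xc for a high-degree polynomial basis.
    """
    n, p = Xc.shape
    A = np.vstack([Xc, np.sqrt(n * lam) * np.eye(p)])
    b = np.concatenate([yc, np.zeros(p)])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

for lam in (0.0, 0.4):
    c = ridge_fit(lam)
    train_mse = float(np.mean((Xc @ c - yc) ** 2))
    print(f"lambda={lam:.1f}  ||c||={np.linalg.norm(c):10.3f}  train MSE={train_mse:.4f}")
```

The coefficient norm drops as λ grows while training error rises slightly: the model keeps all twelve terms but can no longer afford the large coefficients that produce wiggles.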

L1 regularisation (lasso) uses |cᵢ| instead of cᵢ². It tends to drive many coefficients to exactly zero, effectively removing features from the model — so it doubles as an automatic feature-selection method. For feature-sparse problems it is sometimes preferable to L2.
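The difference between the two penalties is easiest to see in a simplified special case. Assuming orthonormal features and a loss scaled as ½·Σ(y − ŷ)² (both assumptions made here for convenience, not taken from the chapter), ridge with penalty (λ/2)·Σcᵢ² rescales each least-squares coefficient, while lasso with penalty λ·Σ|cᵢ| soft-thresholds it. The coefficient values below are hypothetical.

```python
def ridge_shrink(c_ols, lam):
    # Ridge rescales every coefficient by the same factor; none reach exactly zero.
    return [c / (1.0 + lam) for c in c_ols]

def lasso_shrink(c_ols, lam):
    # Lasso moves each coefficient towards zero by lam and clips at zero.
    def soft_threshold(c):
        if abs(c) <= lam:
            return 0.0
        return c - lam if c > 0 else c + lam
    return [soft_threshold(c) for c in c_ols]

c_ols = [3.0, -1.5, 0.4, -0.2, 0.05]   # hypothetical least-squares coefficients
lam = 0.5

ridge = ridge_shrink(c_ols, lam)
lasso = lasso_shrink(c_ols, lam)
print("ridge:", [round(c, 3) for c in ridge])
print("lasso:", lasso)
print("exact zeros -> ridge:", sum(c == 0.0 for c in ridge),
      " lasso:", sum(c == 0.0 for c in lasso))
```

With correlated features neither solution has this simple form, but the qualitative behaviour (proportional shrinkage versus exact zeros) carries over.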

💡 Insight

Early stopping is a fourth knob that is effectively free. Monitor validation error during training and stop when it starts to rise, even if training error keeps falling. Conceptually it limits how long the model has to find noise-level features. Empirically it produces a similar effect to L2 regularisation, and it is the standard default in deep learning regardless of other regularisation choices.
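The stopping rule itself is a few lines of bookkeeping. The sketch below is framework-agnostic and works on a sequence of per-epoch validation losses; the `patience` and `min_delta` parameters are common conventions assumed here (stopping at the very first rise corresponds to `patience=1`), and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss seen so far by more than `min_delta`. The weights saved at
    `best_epoch` are the ones to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stop_epoch = -1
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, stop_epoch

# Hypothetical validation losses: improving, then turning upward.
val_losses = [1.00, 0.80, 0.60, 0.55, 0.57, 0.60, 0.70]
best, stopped = early_stopping_epoch(val_losses, patience=2)
print(f"best epoch: {best}, stopped after epoch: {stopped}")
```

A patience greater than one trades a few extra epochs for robustness against a single noisy validation reading.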

Three knobs for fixing underfit models

Underfitting is the opposite problem and needs the opposite interventions.

More capacity. The model’s structural assumption is too restrictive. Give it more parameters to work with: higher polynomial degree, deeper tree, wider network. This is the first thing to try when both train and validation error are high.

Better features. If the available features do not contain the signal, no capacity will help. Go back to feature engineering: add interactions, nonlinear transformations, external data sources. Sometimes underfit is actually an information problem disguised as a capacity problem.

Weaker regularisation. If you inherited a model with heavy regularisation (large λ) and it is underfit, lower λ and the model can use more of its nominal capacity. Sometimes this is done by accident — a default regularisation value appropriate for a small dataset is too strong when you move to a larger one.

Diagnosis	Train err	Val err	Gap	Remedies
Underfit	high	high	small	Increase capacity, add features, reduce regularisation
Good fit	low	low	small	Ship it (ideally evaluate once on the test set first)
Overfit	low	high	large	More training data, less capacity, stronger regularisation, early stopping
Confused (rare)	high	low	inverted	Often a pipeline bug — data leakage, mis-split, scaled target. Investigate before modelling.
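The table can be turned into a small triage helper. The thresholds below (`high_err` and `big_gap`) are hypothetical: what counts as a "high" error or a "large" gap depends on the problem and the metric, so treat this as a sketch of the decision order rather than a rule.

```python
REMEDIES = {
    "underfit": "Increase capacity, add features, reduce regularisation",
    "good fit": "Ship it (ideally evaluate once on the test set first)",
    "overfit": "More training data, less capacity, stronger regularisation, early stopping",
    "confused": "Check the pipeline: data leakage, mis-split, scaled target",
}

def diagnose(train_err, val_err, high_err=0.20, big_gap=0.10):
    """Classify a (train error, validation error) pair using assumed thresholds."""
    gap = val_err - train_err
    if gap < -big_gap:
        return "confused"        # validation much better than training
    if gap > big_gap:
        return "overfit"         # large positive gap
    if train_err > high_err:
        return "underfit"        # small gap, but both errors are high
    return "good fit"

# Error rates from the chapter's opening example (error = 1 - accuracy).
examples = {
    "decision tree": (1 - 0.998, 1 - 0.71),
    "logistic regression": (1 - 0.74, 1 - 0.73),
}
for name, (tr, va) in examples.items():
    label = diagnose(tr, va)
    print(f"{name}: {label} -> {REMEDIES[label]}")
```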

Cross-validation closes the diagnostic loop

The previous chapter introduced cross-validation as a way to get a reliable validation error estimate on small datasets. This chapter is what CV scores feed into: once you have a reliable estimate of validation error, you use it to pick the model capacity that minimises it. That is the standard hyperparameter-tuning workflow.

python

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# X: 1-D training data reshaped to (n, 1); y: target values
# Sweep polynomial degree from 1 to 15 and report CV error

degrees = list(range(1, 16))
scores = []
for d in degrees:
  pipe = Pipeline([
      ('poly',  PolynomialFeatures(degree=d)),
      ('scale', StandardScaler()),
      ('model', Ridge(alpha=0.0)),     # no regularisation here
  ])
  cv = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_squared_error')
  scores.append(-cv.mean())

best_idx = int(np.argmin(scores))
print(f"Best degree: {degrees[best_idx]}, CV MSE: {scores[best_idx]:.4f}")
print(f"All scores:  {np.array(scores).round(3).tolist()}")

# Typical output on the chapter's data:
# Best degree: 3, CV MSE: 0.0512
# Scores decrease sharply from d=1 to d=3, then rise steadily

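# Now sweep ridge alpha at the chosen degree (regularisation tuning).
# After the loop above, pipe still holds the last degree tried (15),
# so reset it to the best degree before sweeping alpha.
pipe.set_params(poly__degree=degrees[best_idx])
alphas = [0, 0.01, 0.1, 0.3, 1.0, 3.0, 10.0]
for alpha in alphas:
  pipe.set_params(model__alpha=alpha)
  cv = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_squared_error')
  print(f"  alpha={alpha:6.2f}  CV MSE={-cv.mean():.4f}")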

The whole workflow — sweep capacity, pick the minimum, verify on test — is what the playground makes interactive. Drag the polynomial degree from 1 to 15 and you will watch the validation MSE trace out the U-shape directly; drag the regularisation slider and you will see a high-degree fit smooth out in real time. The next chapter takes us out of the regression world and into classification, where class imbalance creates its own family of overfit-adjacent failures.
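The "verify on test" step of that workflow can be sketched as follows. The data is a synthetic stand-in for the chapter's dataset, `make_model` is a hypothetical helper, `LinearRegression` is used in place of a ridge with zero penalty, and shuffled folds are an added choice; none of these are prescribed by the chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption, not the chapter's dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.2, size=200)

# Set the test set aside before any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def make_model(degree):
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )

# Choose capacity with cross-validation on the development data only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
degrees = list(range(1, 11))
cv_mse = [
    -cross_val_score(make_model(d), X_dev, y_dev, cv=cv,
                     scoring="neg_mean_squared_error").mean()
    for d in degrees
]
best_degree = degrees[int(np.argmin(cv_mse))]

# Refit on all development data, then touch the test set exactly once.
final_model = make_model(best_degree).fit(X_dev, y_dev)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"best degree: {best_degree}, CV MSE: {min(cv_mse):.4f}, test MSE: {test_mse:.4f}")
```

If the test MSE is far worse than the CV estimate, that is a signal to investigate rather than to go back and retune against the test set.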

In this section

Is overfitting always bad?

For any model intended to generalise to new data, yes. The mechanism is the same everywhere: the model has optimised its parameters specifically to reduce error on the training examples, so its training error is an underestimate of its true error. The larger the gap between training and validation error, the more severe the overfit. The only scenario where heavy overfitting is intentional is memorisation tasks (training set lookup, caching), which are rare in modelling contexts.

Why does more data help with overfitting?

An overfit model has memorised specific training examples because there were few enough of them that memorisation was cheaper than learning the real pattern. Adding more training data makes memorisation progressively harder: at some point the model must abstract to capture the pattern, because the cost of storing every example exceeds the cost of generalising. This is why dataset size and effective capacity (model size plus regularisation) are tightly coupled: reducing capacity can often substitute for collecting more data, and more data lets a model safely use more capacity.
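A quick simulation makes the coupling concrete: hold capacity fixed and grow the training set. The setup below is assumed for illustration (synthetic noisy sine, a fixed degree-9 polynomial, arbitrary sample sizes); `sample` is a hypothetical helper.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def sample(n):
    # Assumed data-generating process, for illustration only.
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_val, y_val = sample(2000)
degree = 9   # capacity is held fixed throughout

gaps = {}
for n in (15, 50, 500, 5000):
    x_tr, y_tr = sample(n)
    poly = Polynomial.fit(x_tr, y_tr, deg=degree, domain=[-1.0, 1.0])
    train_mse = float(np.mean((poly(x_tr) - y_tr) ** 2))
    val_mse = float(np.mean((poly(x_val) - y_val) ** 2))
    gaps[n] = val_mse - train_mse
    print(f"n={n:5d}  train MSE={train_mse:.4f}  val MSE={val_mse:.4f}  gap={gaps[n]:.4f}")
```

With only 15 points the ten-parameter model nearly interpolates the training set and the gap is wide; with thousands of points the same model has no room to memorise and the gap all but disappears.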

What does regularisation actually do mathematically?

L2 regularisation (ridge) adds a penalty of λ times the sum of squared coefficients to the loss function. This encourages the optimiser to pick coefficients that are small unless the data strongly demands otherwise — in practice, small coefficients produce smoother functions that are less capable of fitting noise. L1 regularisation (lasso) penalises the sum of absolute coefficients, which has the useful side effect of driving many coefficients to exactly zero — an automatic feature-selection mechanism. Both shrink effective capacity without reducing the nominal model size.

◎ Intuition

Your colleague trains two models to predict customer lifetime value from 12 features.

Model A reports:
- Training R² = 0.97
- 5-fold CV R² = 0.38

Model B reports:
- Training R² = 0.61
- 5-fold CV R² = 0.58

Before reading the reflection:
- Which model is overfitting, which is underfitting, and what diagnostic number in each report told you so?
- For Model A, name three interventions you would try in order. For each, explain in one sentence why it should help.
- Model B has similar training and validation R², which sounds reassuring, but both numbers are mediocre. What would you try next, and why is "collect more data" probably the wrong answer here?

↺ Reflection

Three Knobs for Model Fit

The two broken models that opened this chapter, the tree with 99.8%/71% accuracies and the logistic regression with 74%/73%, both fall short of the same underlying principle. A model’s usefulness is measured by how well it generalises to unseen data, not by how well it fits the training data; the gap between training and validation error is the direct diagnostic of how far the two have drifted apart. When that gap is large, the model has exploited specifics of the training data that will not transfer. When both errors are high with a small gap, the model is too simple to represent the pattern regardless of how much data you give it. The two failure modes look almost opposite on paper, but they are symmetric: one overshoots the data, the other undershoots it.

The bias-variance trade-off is the way this symmetry is usually stated. Bias is how far a model’s average prediction sits from the true underlying function: a systematic error imposed by restricting the model’s capacity (a straight line through curved data). Variance is how much a model’s predictions change when you resample the training data: a sensitivity to noise that grows as the model is given more capacity. Total expected error decomposes into bias squared plus variance plus irreducible noise, so the minimum sits at the capacity where adding more would raise variance faster than it lowers squared bias. The validation-error curve’s U-shape is this decomposition made visible: the left side is bias-dominated (capacity too low), the right side is variance-dominated (capacity too high), and the bottom is the point where neither dominates. Cross-validation, from the previous chapter, is the method for getting a reliable estimate of that curve; this chapter is what you do with it: pick the capacity at the minimum, then verify once on the test set.
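Written out, the decomposition looks like this. The notation is introduced here for illustration: f is the true function, \hat{f} is the model fitted to a random training set, and the target is assumed to be y = f(x) + ε with zero-mean noise of variance σ² that is independent of the training set. The expectation runs over training sets and noise, at a fixed input x.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on model capacity; the noise term sets a floor that no choice of model can go below.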

When you diagnose overfitting you have three main interventions, and in practice they often complement rather than substitute for each other. More training data is usually the strongest but most expensive; it forces the model to abstract instead of memorising because memorising 50,000 points is harder than memorising 500. Reducing capacity — smaller model, lower polynomial degree, shallower tree — is cheap but requires you to know how much is too much, which is often only visible through the validation-curve sweep. Regularisation adds a penalty to the loss that encourages simpler solutions without reducing nominal model size; ridge (L2) shrinks all coefficients toward zero, lasso (L1) drives many to exactly zero and so doubles as feature selection, and early stopping monitors validation error during training and stops when it turns up. In deep learning early stopping is effectively free and used alongside explicit regularisation; in classical ML it is usually an explicit choice of ridge vs lasso vs both.

Underfitting is the opposite problem with opposite remedies. More capacity — higher polynomial degree, deeper tree, wider network. Better features — more interactions, more external data, nonlinear transformations of existing features. Weaker regularisation if you inherited a heavily regularised model. The trickiest underfit diagnoses are the ones disguised as information problems: the model is structurally fine but the available features simply do not contain the signal, so no amount of capacity will help. That looks identical to genuine underfit on the validation curve but the fix is different — collecting richer features rather than growing the model. The final chapter of this topic shifts to a different failure mode entirely: what happens when the data is heavily imbalanced and 99% accuracy is easy for any model that just predicts the majority class. That is not overfitting or underfitting — it is a failure of the metric to see the thing you actually care about predicting.

Key Points

Overfitting shows as a large gap between low training error and high validation error. The model has memorised the training set rather than learning the pattern, so its performance drops sharply on examples it has not seen.

Underfitting shows as both training and validation error being high (with a small gap between them) — the model is structurally too simple to capture the data's pattern regardless of how much data you give it.

The U-shaped validation-error curve as model complexity grows is the bias-variance trade-off made visible: bias (structural restriction) falls with capacity, variance (sensitivity to noise) rises, and the optimal complexity minimises the sum.

Three knobs fix overfitting — more training data (raises the bar on memorisation), less capacity (direct reduction), and regularisation (penalises large coefficients without reducing model size); L2 ridge is the default, L1 lasso also does feature selection, early stopping is essentially free in iterative training.

Cross-validation from the previous chapter produces the validation error estimate; this chapter turns that estimate into a decision — which capacity minimises it, and how to adjust if the minimum is still unsatisfyingly high (underfit) or the gap is still large (overfit).

✓ Checkpoint

Check Your Understanding

Answer these questions about overfitting and underfitting diagnoses covered in this chapter.

A model reports training MSE = 0.02 and validation MSE = 0.38. Which conclusion is most accurate?

You sweep polynomial degree from 1 to 15 on a fixed training set and find: training MSE falls monotonically from 0.15 at d=1 to 0.003 at d=15; validation MSE falls from 0.15 at d=1 to 0.05 at d=3, then rises to 0.4 at d=15. What degree should you use, and why?

A team has overfitting on their current model (train acc 99%, val acc 71%). Put these remedies in order from the cheapest to try (lowest investment) to the most expensive (highest investment).

1. Collect 10,000 more labelled training examples
2. Add L2 regularisation to the existing model and tune λ via cross-validation
3. Reduce the model's capacity (fewer tree splits, shallower network, lower polynomial degree)
4. Build an entirely new model family with different structural assumptions

A team's degree-12 polynomial fit is overfitting. Applying L2 regularisation with λ = 0.5 effectively reduces the model to 'behave like' a lower-degree polynomial without changing its nominal degree, so they can keep the flexibility of degree 12 but control the wiggliness with λ.