Evaluation · Chapter 1 of 4
Train, Validation, and Test Splits
Three splits, three purposes — and one honest number at the end
A bank's risk team has 10,000 labelled loan applications and is choosing between logistic regression, a decision tree, and a gradient-boosted model. Each model reports 95%+ accuracy on the training data. Which model actually generalises to next month's applicants — and how would you measure that without the luxury of waiting a month?
- 1 Explain why accuracy on the training set is not a meaningful measurement of a model's ability to generalise, and why a held-out test set is required to make that measurement honestly.
- 2 Describe the distinct roles of the training set, validation set, and test set — fitting the model, tuning hyperparameters, and measuring final out-of-sample performance — and explain why each role must happen on a separate subset of the data.
- 3 Identify common data-leakage patterns — scaling before splitting, feature engineering informed by the full dataset, temporal leakage across time-ordered data — and explain why each one contaminates the test-set estimate of performance.
- 4 Choose an appropriate split ratio given dataset size and class imbalance, explaining the trade-off between keeping enough data for reliable training and keeping enough for a reliable held-out evaluation.
Three Splits, Three Jobs
The risk team at a regional bank has 10,000 labelled loan applications. Each application has a few dozen features — income, employment length, debt-to-income ratio, postcode, credit history — and a binary label: did the applicant default within twelve months. The team is evaluating three models: logistic regression, a decision tree, and a gradient-boosted classifier. Every model reports accuracy above 95% on the full 10,000-row dataset. The senior analyst asks which model to deploy.
The correct answer is that none of those numbers measures anything useful. The models were trained on the same data they are now being evaluated on. A sufficiently deep decision tree can drive its training accuracy to exactly 100% on any dataset in which no two identical rows carry different labels, simply by acting as a lookup table from inputs to memorised labels. That does not mean it will work on next month’s applicants. It means it memorised this month’s.
Why Training Accuracy Means Nothing
Training a classifier is an optimisation: the model’s parameters are adjusted until the error on the training data is as low as possible. Every weight, every split threshold, every leaf value exists because it reduced the error on examples the model has now seen many times. Reporting that error as “accuracy” is circular — the model is at a minimum of its own objective on those specific rows, and that minimum depends on the data more than on the underlying pattern.
This circularity is not a subtle theoretical issue. A decision tree with no depth limit on the bank’s 10,000 applications can keep splitting until every leaf is pure: in the extreme case a tree with 10,000 leaves, one per applicant, each labelled with that applicant’s true outcome. Training accuracy: 100%. Accuracy on the next 1,000 new applicants: unknown from that number, typically far lower, and in the worst case no better than guessing from the class proportions. To the extent that happens, the tree did not learn to predict defaults. It memorised the labels.
The only honest way to measure whether a model generalises is to evaluate it on data it has never seen during training. That data has to be set aside — held out — before any fitting begins. Split first, then train. Never the reverse.
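The gap between training and held-out accuracy can be seen in a few lines. This is a minimal sketch, assuming scikit-learn is installed and using synthetic data from `make_classification` as a stand-in for the loan applications; the feature count, label-noise level, and random seeds are illustrative assumptions, not properties of the bank's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data: an imbalanced binary label
# (92/8 class weights before 5% of labels are randomly reassigned).
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=5,
    weights=[0.92, 0.08],
    flip_y=0.05,
    random_state=0,
)

# Split first, then train.
X_train, X_held, y_train, y_held = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

# No depth limit: the tree keeps splitting until every training leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
held_out_acc = tree.score(X_held, y_held)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Held-out accuracy: {held_out_acc:.3f}")
```

The training figure reflects memorisation of rows the tree has already seen; only the held-out figure says anything about new applicants.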
Three Splits, Three Purposes
A single held-out set is enough to measure final performance, but running a real ML project involves many decisions that each need their own held-out measurement: which features to include, which model family to try, what hyperparameters to set, how much regularisation to use. If every one of those decisions is made by looking at the same held-out set, the set loses its independence — after fifty hyperparameter tweaks evaluated on the same 1,500 rows, the model has been implicitly fitted to those 1,500 rows even though it was never trained on them. The held-out score becomes optimistic.
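The optimism from reusing one held-out set can be simulated without any real model. The sketch below is an illustrative assumption rather than the bank's workflow: labels are fair coin flips, every "configuration" is a random guesser with a true accuracy of exactly 50%, and the set sizes and seeds are arbitrary.

```python
import random

N_VAL, N_TEST, N_CONFIGS = 200, 5_000, 100

label_rng = random.Random(0)
# Labels are fair coin flips, so no predictor can truly beat 50% accuracy.
y_val = [label_rng.randint(0, 1) for _ in range(N_VAL)]
y_test = [label_rng.randint(0, 1) for _ in range(N_TEST)]


def random_guesses(seed, n):
    """Predictions from a 'configuration' that has no real skill."""
    guesser = random.Random(seed)
    return [guesser.randint(0, 1) for _ in range(n)]


def accuracy(preds, labels):
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


# Evaluate every configuration on the SAME held-out rows and keep the best.
val_accs = {
    config: accuracy(random_guesses(config, N_VAL), y_val)
    for config in range(N_CONFIGS)
}
best_config = max(val_accs, key=val_accs.get)
best_val_acc = val_accs[best_config]

# Score the winner on rows that played no part in the selection.
test_acc = accuracy(random_guesses(10_000 + best_config, N_TEST), y_test)

print(f"Best held-out accuracy after {N_CONFIGS} tries: {best_val_acc:.3f}")
print(f"Same configuration on untouched rows:      {test_acc:.3f}")
```

The winning score sits above 50% purely because it was the maximum of many noisy measurements; on untouched rows the same guesser falls back to chance.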
The standard solution is to split the data into three parts:
- Training set — the data the model actually learns from. Weights get adjusted, tree splits get chosen, parameters get fitted. The larger this set, the more patterns the model can potentially capture.
- Validation set — a workbench for comparing models and tuning hyperparameters. Look at it as often as you need while making decisions. Different feature sets, different learning rates, different regularisation strengths are all evaluated here.
- Test set — a held-out envelope, sealed until the end. Opened exactly once, after every decision about the model is finalised. The test accuracy is the honest answer to how well the model generalises.
The rule that matters most: every choice that changes the model based on a held-out measurement must happen before the test score is seen. Compare a dozen hyperparameter settings on the validation set, pick the best, retrain on training + validation, evaluate once on the test set, report that number. Going back to try something else after seeing the test score breaks the honesty of the measurement.
A useful mental model: the validation set is the pencil draft, where you work out the solution. The test set is the final-exam answer sheet — visible only after you have committed to what you believe is the right answer. Peeking at the answer sheet and then changing your working is a different activity altogether.
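The compare, pick, retrain, and evaluate-once sequence described above could look like the following. It is a sketch under stated assumptions: scikit-learn and NumPy are available, the data is a synthetic stand-in, and the candidate values for `C` and the helper name `make_pipe` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan applications (illustrative assumption).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.92, 0.08], random_state=0
)

# 70 / 15 / 15 split, stratified on the label.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=0
)


def make_pipe(C):
    """Hypothetical helper: scaler plus logistic regression with strength C."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(C=C, max_iter=1000)),
    ])


# Workbench: compare candidate settings on the validation set only.
candidates = [0.01, 0.1, 1.0, 10.0]
val_scores = {
    C: make_pipe(C).fit(X_train, y_train).score(X_val, y_val)
    for C in candidates
}
best_C = max(val_scores, key=val_scores.get)

# Decisions are final: retrain on training + validation, open the envelope once.
X_dev = np.concatenate([X_train, X_val])
y_dev = np.concatenate([y_train, y_val])
final_model = make_pipe(best_C).fit(X_dev, y_dev)
test_acc = final_model.score(X_test, y_test)

print(f"Chosen C: {best_C}, test accuracy: {test_acc:.3f}")
```

Nothing after the final line feeds back into `candidates` or `best_C`; that one-way flow is what keeps the test number honest.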
Data Leakage: The Silent Failure
The subtlest way to break an honest evaluation is not by explicitly training on test data (most practitioners catch that) but by letting information from the test set influence the training process indirectly. This is called data leakage, and it can turn an honest test accuracy of, say, 85% into an optimistic 92% that dissolves the moment the model sees production data.
The classic case: feature scaling. A StandardScaler computes a mean and standard deviation and uses them to centre and normalise every row. If the scaler is fitted on the full 10,000-row dataset before splitting, those statistics were computed partly from the 1,500 rows now sitting in the test set. The model trains on features whose standardisation the test set helped determine. The test set never appears in the training loss, but its distribution has leaked in through the scaler.
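To see the mechanism in isolation, the sketch below standardises a single column two ways using only the Python standard library. The income values are synthetic and the 7,000 / 3,000 cut is an assumption for illustration; `statistics.pstdev` is used because it is the population standard deviation, the same convention `StandardScaler` follows.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical incomes for 10,000 applicants (synthetic, for illustration only).
incomes = [rng.lognormvariate(10.5, 0.6) for _ in range(10_000)]

# First 7,000 rows are training; the remaining 3,000 are held out.
train, held_out = incomes[:7_000], incomes[7_000:]

# Leaky: statistics computed before the split include the held-out rows.
leaky_mean = statistics.fmean(incomes)
leaky_std = statistics.pstdev(incomes)

# Leakage-free: statistics computed from the training rows only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)


def standardise(values, mean, std):
    return [(v - mean) / std for v in values]


# The same training rows get different feature values under each scaler.
leaky_train = standardise(train, leaky_mean, leaky_std)
clean_train = standardise(train, train_mean, train_std)
max_shift = max(abs(a - b) for a, b in zip(leaky_train, clean_train))

print(f"Mean from all rows:      {leaky_mean:,.0f}")
print(f"Mean from training rows: {train_mean:,.0f}")
print(f"Largest change in a standardised training value: {max_shift:.4f}")
```

The shift is usually small for a simple scaler on a large dataset, which is exactly why this form of leakage goes unnoticed; the habit matters more for steps such as target encoding, where the leaked statistic involves the labels.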
Scaler leakage is the most visible form, but many variants exist:
- Feature engineering using the full dataset. Computing a feature like “customer’s percentile within the dataset’s income distribution” uses every applicant’s income to compute the percentile — including the test applicants. The feature values for training rows now encode information about test rows.
- Target encoding without cross-validation. Replacing a categorical variable with the mean target value per category computes those means from all rows, including the held-out ones. The training model gets to see an aggregate signal from test labels.
- Imputation using the full dataset mean. Filling missing values with the overall dataset mean lets test-row features shift the mean, which then fills missing training-row values with a statistic that already reflects the test set.
- Temporal leakage. For time-ordered data — sales by day, sensor readings, user actions — a random split mixes past and future rows. The model trains on some future rows and is then evaluated on past ones. Production-time performance, which can only use past data, will be worse than the test score suggests.
- Grouped data leakage. If the dataset has multiple rows per patient or multiple frames per video, a random split can put different rows from the same patient in both training and test. The model has now “seen” that patient during training and will predict their test rows more accurately than it could a genuinely new patient.
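The last two patterns in the list are about how rows are assigned to splits rather than how features are computed. A minimal sketch of both, assuming NumPy and scikit-learn are available; the `day` and `patient_id` columns are hypothetical, and `GroupShuffleSplit` is one of several group-aware splitters scikit-learn provides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 1_000

# Hypothetical columns: a day index per row (sorted oldest first)
# and a patient ID per row, with several rows per patient.
day = np.sort(rng.integers(0, 365, size=n_rows))
patient_id = rng.integers(0, 200, size=n_rows)

# Temporal split: train on the earliest 80% of rows, test on the latest 20%.
cutoff = int(0.8 * n_rows)
time_train_idx = np.arange(cutoff)
time_test_idx = np.arange(cutoff, n_rows)

# Grouped split: each patient lands entirely in train or entirely in test.
# test_size here is the fraction of groups (patients), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
X_placeholder = np.zeros((n_rows, 1))  # split() only needs the row count
group_train_idx, group_test_idx = next(
    splitter.split(X_placeholder, groups=patient_id)
)

shared_patients = set(patient_id[group_train_idx]) & set(patient_id[group_test_idx])
print(f"Latest training day: {day[time_train_idx].max()}")
print(f"Earliest test day:   {day[time_test_idx].min()}")
print(f"Patients in both grouped splits: {len(shared_patients)}")
```

Because the rows are sorted by day, no training row is later than any test row, and the grouped split leaves no patient on both sides.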
The defensive habit is to think of every preprocessing step as a model of its own. A scaler is a model: it has parameters (mean and standard deviation) that are fitted to data. An imputer is a model. A feature-encoder is a model. Every fitted-then-applied transformation must be fitted on the training set only and then applied — without refitting — to validation and test. In scikit-learn this is enforced by wrapping the scaler and the classifier together in a Pipeline:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# X, y: 10,000 labelled loan applications
# X shape (10_000, n_features), y shape (10_000,) with 0 = repaid, 1 = defaulted

# Step 1: split BEFORE any preprocessing
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.30,
    random_state=42,
    stratify=y,  # preserves the 8% default rate in both halves
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    random_state=42,
    stratify=y_temp,
)

# Step 2: bundle scaler and model so the scaler learns only from training data
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(C=1.0, max_iter=1000)),
])

# Step 3: fit on training only; the scaler's mean and std are training statistics
pipe.fit(X_train, y_train)

# Step 4: evaluate on validation during development
val_acc = pipe.score(X_val, y_val)

# Step 5: after every decision is final, evaluate once on test
test_acc = pipe.score(X_test, y_test)

print(f"Val accuracy: {val_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The pipeline does more than save typing. It makes leakage impossible at the scaler step: `pipe.fit(X_train, y_train)` cannot accidentally see validation or test rows, because the call was made with only `X_train`.
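The same pattern extends to other fitted steps. As a sketch, assuming scikit-learn and NumPy and using synthetic data with values knocked out at random, adding a `SimpleImputer` to the pipeline means its fill values are also learned from training rows only. The step names and missing-value rate are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data; labels are derived before values go missing.
X = rng.normal(size=(2_000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=2_000) > 1.0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # roughly 5% of cells missing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The fill values the imputer learned are the training-set column means,
# not the means of the full dataset.
learned_fill = pipe.named_steps["imputer"].statistics_
train_only_means = np.nanmean(X_train, axis=0)
full_data_means = np.nanmean(X, axis=0)

print("Learned fill values:", np.round(learned_fill, 4))
print("Training-only means:", np.round(train_only_means, 4))
print("Full-dataset means: ", np.round(full_data_means, 4))
```

When `pipe.score(X_test, y_test)` is called later, the test rows are filled and scaled with those training statistics; nothing is refitted.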
Choosing Split Ratios
Once the three-split principle is in place, the remaining question is how much data to put in each split. The right answer depends on the dataset’s absolute size, not on a fixed percentage.
| Dataset size | Recommended split | Reason |
|---|---|---|
| Millions of rows | 99.0% / 0.5% / 0.5% | Even 0.5% leaves thousands of examples per split — plenty for reliable measurements. More training data improves the model more than more test data improves the estimate. |
| Tens of thousands (like our 10,000-row bank set) | 70% / 15% / 15% or 80% / 10% / 10% | Balances enough training data for the model with enough test data (1,000–1,500 rows) for a stable estimate even at a low base rate. |
| A few thousand | 60% / 20% / 20% | A smaller training fraction costs some model quality, but each held-out set must be large enough in absolute terms to give a stable measurement. Use cross-validation on the non-test portion if the tuning process needs more measurements. |
| Under a few hundred | Use cross-validation instead | A single validation set of 30 or 40 examples gives an estimate whose standard error is comparable to the differences you are trying to measure between models. Cross-validation (next chapter) is the honest answer at small n. |
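The standard-error argument in the table can be made concrete with the binomial approximation sqrt(p(1 − p) / n). The sketch below assumes a true accuracy of 90% purely for illustration; the function name is hypothetical.

```python
import math


def accuracy_standard_error(accuracy, n_rows):
    """Binomial standard error of an accuracy estimate measured on n_rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_rows)


# Assumed true accuracy of 90%, chosen only to illustrate the scaling with n.
assumed_accuracy = 0.90
for n_rows in (30, 300, 1_500, 5_000):
    se = accuracy_standard_error(assumed_accuracy, n_rows)
    print(f"n = {n_rows:>5}: standard error about {100 * se:.1f} percentage points")
```

Under that assumption, 30 held-out rows give a standard error of roughly 5.5 percentage points, while 1,500 rows bring it below 1 point. Two models that differ by two or three points cannot be told apart at n = 30.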
A practical constraint for classification: every split must contain enough examples of the rare class for the metric to be meaningful. At an 8% default rate, a validation set of 100 rows has roughly 8 defaulters, so a single false negative shifts the false-negative rate by 12.5 percentage points (one in eight). When the rare class is what you actually care about predicting, plan the split around that class’s absolute count, not the total size.
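The arithmetic behind that constraint is easy to tabulate. This sketch uses the chapter's 8% default rate; the held-out sizes are example values, and the counts are expectations under stratified splitting rather than guarantees.

```python
DEFAULT_RATE = 0.08

# Held-out set size -> (expected defaulters, FNR step per missed defaulter)
expected = {}
for held_out_rows in (100, 500, 1_500):
    defaulters = round(held_out_rows * DEFAULT_RATE)
    # One extra missed defaulter moves the false-negative rate by 1 / defaulters.
    step_points = 100 / defaulters
    expected[held_out_rows] = (defaulters, step_points)
    print(
        f"{held_out_rows:>5} rows: about {defaulters} defaulters, "
        f"one false negative = {step_points:.2f} percentage points"
    )
```

At 1,500 held-out rows there are about 120 defaulters, so each missed defaulter moves the false-negative rate by less than one percentage point.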
A learning curve plots training and validation error against the number of training rows, and two observations from it tell you something useful about your project. First, if training and validation error converge at your actual dataset size (as they do in this chapter's learning curve at n ≈ 5,000), adding more data will not significantly improve the model. Effort is better spent on features or model architecture. Second, if there is still a large gap at your dataset size, more data is likely to help more than more model capacity; the model is overfitting because its training set is too small for what it needs to learn.
The learning curve is the single most informative chart you can produce when deciding whether to collect more data. It costs almost nothing — compute validation error at n = 100, 500, 1000, 2000, and your full training size — and it answers the question “would more data help?” with specific evidence rather than a guess. Teams that routinely produce one when starting a project spend less time debating data-collection priorities.
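A learning curve of this kind can be computed with a short loop. The sketch below assumes scikit-learn and a synthetic stand-in dataset; the model, the sizes, and the single fixed validation set are illustrative choices. scikit-learn also ships a cross-validated helper, `sklearn.model_selection.learning_curve`, which is not used here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative assumption).
X, y = make_classification(
    n_samples=10_000, n_features=20, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

sizes = [100, 500, 1_000, 2_000, len(X_train)]
curve = []
for n in sizes:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    # Fit on the first n training rows only; the validation set never changes.
    pipe.fit(X_train[:n], y_train[:n])
    train_error = 1.0 - pipe.score(X_train[:n], y_train[:n])
    val_error = 1.0 - pipe.score(X_val, y_val)
    curve.append((n, train_error, val_error))

for n, train_error, val_error in curve:
    print(f"n = {n:>5}: train error {train_error:.3f}, validation error {val_error:.3f}")
```

Plotting the two error columns against `n` gives the chart described above; the size of the gap at the largest `n` is the number to look at.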
Why This Chapter Is the Foundation
Everything in the rest of this topic rests on the three-split rule and the leakage-free pipeline. Cross-validation, covered next, is an extension of the validation split designed for smaller datasets — it repeats the split many times rather than relying on a single one. Overfitting and underfitting, the chapter after that, are defined relative to the gap between training and held-out error — the learning curve above is a first look at that gap. Class imbalance — the chapter after — starts with the problem that split ratios alone are not enough when the rare class is what you actually need to measure.
If you take only one habit away: split the data first, fit every preprocessing step on the training portion only, and open the test envelope exactly once. The rest is mechanics.
Why can't I just evaluate my model on the training data?
A model trained on a dataset has already adjusted its parameters specifically to minimise error on those examples. Reporting its accuracy on the same data measures how well it memorised them, not how well it will perform on applicants it has never seen. A sufficiently deep decision tree can reach 100% training accuracy on almost any dataset by simply storing every example's label, but that does not mean it generalises. The held-out test set gives the only honest answer to the question a stakeholder actually asks: how well will this model work when it goes live?
Why do I need a validation set if I already have a test set?
The validation set is where you compare models, tune hyperparameters, and make design decisions. Every time you look at the test set and then change something, the test set becomes a little less independent — you have now optimised toward it, even indirectly. The validation set lets you iterate freely without contaminating the test measurement. When you are finally happy with your choices, you evaluate once on the test set and report that number. No further iteration after the test score is seen.
What is data leakage and why is it so serious?
Data leakage is when information from the validation or test set sneaks into the training process. The classic case: fitting a StandardScaler on the full dataset before splitting, so the scaler's mean and standard deviation were computed partly from test examples. The model then trains on features that have been standardised using statistics the test set helped determine — the test score is inflated and no longer reflects real-world performance. Leakage is serious because it produces an overly optimistic result that stakeholders trust, and the model then underperforms in production where no leakage exists.
The bank's analyst has 10,000 labelled loan applications and is about to train a decision tree. An 8% default rate means 800 of those applicants defaulted, 9,200 repaid. The analyst is choosing between a 90/5/5 split, a 70/15/15 split, and a 50/25/25 split. Before making the split:
- With a 70/15/15 split, how many of the 800 defaulters end up in the test set on average, and is that a large enough sample to measure a false-negative rate to within a couple of percentage points?
- If a careless engineer fits a `StandardScaler` on the full 10,000 rows and then splits the data, is the test-set accuracy measured afterward likely to be higher or lower than the accuracy the model would actually achieve on next month's applicants? Explain in one sentence why.
- If the dataset were 200 applicants instead of 10,000, what specifically would go wrong with a 70/15/15 split, and is there anything you could do that still gives an honest evaluation without a dedicated held-out set?
One Honest Evaluation
The reason the three-split rule matters is that training accuracy is not a measurement of model quality; it is a measurement of how well an optimiser succeeded at the job it was given. Every weight and every tree split exists because it reduced loss on the training examples. Reporting that loss as a performance number is circular. The only measurement with any claim to honesty is the one made against rows the model has never seen, and the only way to guarantee the model has never seen them is to set them aside before training starts. This split-first discipline is what separates a stakeholder-ready model from a demo that impresses in development and disappoints in production.
The three splits exist because one held-out set is not enough when model development involves many decisions. A single test set used to guide feature selection, hyperparameter choice, and architecture comparison loses its independence — after a few dozen decisions informed by its score, it has been fitted to, even if never explicitly trained on. The validation set absorbs those iterative measurements so the test set remains clean. The practical discipline is simple to state and surprisingly hard to follow: look at the test set exactly once, after every decision about the model has been finalised. Going back to try a new feature or a new hyperparameter after the test score has been seen means the test score no longer measures what it claimed to.
Leakage is the most common cause of models that look strong in development and fail on arrival. The mechanism is always the same: some information about the held-out rows leaks into something the training process depends on. A scaler fitted on the full dataset carries test-row statistics into training. A percentile feature computed over all rows uses test-row values to define the percentile bins. Imputation with the full-dataset mean lets test rows shift the mean that fills training-row gaps. Temporal leakage puts future rows in training and past rows in test, breaking the very direction production will operate in. The defensive habit is to treat every transformation with fitted parameters as a mini-model, fit it only on the training portion, and apply it — without re-fitting — to validation and test. A Pipeline in scikit-learn (or equivalent abstraction elsewhere) makes this automatic at the scaler step, but it does not protect against leakage in feature engineering done by hand before the split. The only durable protection is to assume every preprocessing step is a leakage risk until proven otherwise.
Split ratios are the most-argued and least-interesting knob in this whole area. The right answer is almost never driven by the ratio; it is driven by the absolute count in each split. A validation set of 40 rows will be noisy regardless of whether that is 20% of 200 or 0.04% of 100,000. At small dataset sizes, a single split simply runs out of measurement precision — the variance in the estimate is comparable to the differences between models you are trying to distinguish. The answer at small n is not a different ratio; it is a different technique. Cross-validation, covered next, reuses the data more efficiently by repeating the split many times, producing a more reliable estimate from the same underlying sample. The three-split rule does not go away — a test set is still held out from the entire CV process — but the validation split gets replaced with something that works at scales where a single held-out measurement does not.
Training accuracy measures how well a model memorised its training data, not how well it generalises — a deep decision tree can reach 100% training accuracy by treating the tree as a lookup table, which teaches it nothing about new applicants.
The three splits have three distinct jobs: training fits the model, validation compares configurations during development, and test provides one honest out-of-sample measurement, opened exactly once after every decision is final.
Data leakage happens when information from validation or test sneaks into the training process — fitting a scaler on the full dataset, using full-dataset statistics in feature engineering, random splits of time-ordered data — and it inflates test scores in ways that dissolve the moment the model meets production data.
Split ratios should be chosen against absolute counts rather than fixed percentages: at a million rows even 0.5% is plenty for each held-out set, while at a few hundred rows a 15% test set is too small for any single measurement to be stable.
A leakage-free pipeline wraps every fitted transformation (scaler, imputer, encoder) together with the model, so that fitting the pipeline on the training portion restricts every fitted parameter to training rows and held-out rows cannot contribute to what those steps learn.
Check Your Understanding
Answer these questions about the bank loan split scenario covered in this chapter. Each question tests a different learning objective.
An engineer trains a decision tree on all 10,000 bank loan applications and reports 100% training accuracy. Which statement best describes what that 100% tells you about the model's ability to predict defaults on next month's applicants?
A data scientist evaluates fifteen different hyperparameter combinations on the test set, keeping track of which one scored highest, and reports that top test score as the model's performance. What is the primary problem with this workflow?
Put the following steps in the correct order for a leakage-free workflow that fits a StandardScaler and a LogisticRegression on the bank loan dataset.
- 1. Fit a StandardScaler on the training features and transform training, validation, and test features using the fitted scaler.
- 2. Split the 10,000 applications into training, validation, and test sets.
- 3. Evaluate the final model once on the test set.
- 4. Fit the LogisticRegression on the scaled training features and tune hyperparameters using validation performance.
An analyst with 180 labelled medical records is planning to use a 70/15/15 split, giving 126 training records, 27 validation records, and 27 test records. They conclude this is an acceptable setup because 70/15/15 is the standard default.