Can I use CV to compare two models and pick the better one?

Yes — comparing CV scores across models is the correct way to select a model when you do not have a separate validation set. Each model's CV score estimates its out-of-sample performance without touching the test set. Once you have selected the best model using CV, retrain it on all non-test data and evaluate it once on the test set. Do not go back and try more models after seeing the test score.
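As a minimal sketch of that selection step, the snippet below scores two candidate models on the same stratified folds and picks the one with the higher mean CV accuracy. The synthetic data from make_classification, the two candidate models, and the names candidates, mean_scores and best_name are illustrative assumptions, not part of the bank scenario.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the non-test data (an assumption, not the bank's data)
X, y = make_classification(
    n_samples=680, n_features=10, weights=[0.92, 0.08], random_state=0
)

# Reuse the same folds for every candidate so the comparison is like-for-like
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Mean CV accuracy per candidate; the test set plays no part in this step
mean_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Selected:", best_name)
```

Fixing the splitter's random_state means both models are judged on identical held-out applicants, so a difference in mean score is less likely to come from the folds themselves.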

What is the difference between CV score and test accuracy?

The CV score estimates performance during model selection — it guides decisions about hyperparameters and feature choices. The test accuracy is the final honest measurement after all decisions are made. They are often close but not identical: CV is computed on the training portion of the data (multiple folds), while the test set is an independent held-out set. If the test accuracy is substantially lower than the CV score, there may be overfitting to the CV process itself — for example, from tuning too many hyperparameters using CV scores as the feedback signal.

Evaluation · Chapter 2 of 4

Cross-Validation

Every example is used for validation exactly once

A regional bank branch has 800 labelled loan applications. A single 70/15/15 split gives 91% accuracy. A different random seed gives 86%. Same model. Same data. Which number goes to the credit risk committee?

Learning Objectives

1 Explain how k-fold cross-validation works and why it produces more reliable estimates than a single validation split
2 Choose an appropriate k for a given dataset size and explain the bias-variance trade-off in that choice
3 Explain what stratified k-fold does and when it is necessary
4 State when to use leave-one-out CV and what its computational cost is relative to k-fold

¶ Narrative

Repeating the Split

A regional bank branch has 800 labelled loan applications — smaller than the 10,000-applicant dataset from the previous chapter. The analyst tries a 70/15/15 split: 560 for training, 120 for validation, 120 for test. At an 8% default rate, the 120-applicant validation set contains roughly 10 defaulters. The false-negative rate — the metric the credit risk committee cares about — is being estimated from 10 examples.

The analyst runs the split twice with different random seeds. The first gives 91% accuracy. The second gives 86%. Same model. Same 800 applicants. A 5-percentage-point swing from a different random seed.

Which number goes to the committee?

The Problem With One Split

With 10,000 applicants, a 15% validation set gives 1,500 applicants — enough for reliable estimates. With 800 applicants, 15% is 120 — and at an 8% default rate, that yields roughly 10 defaulters. One misclassification shifts the false-negative rate by 10 percentage points.
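The arithmetic behind that 10-point shift can be checked in a few lines. This is only a restatement of the numbers in the paragraph above; the variable names are illustrative.

```python
n_validation = 120     # applicants in the validation set
default_rate = 0.08    # 8% of applicants default

expected_defaulters = n_validation * default_rate   # 9.6, i.e. "roughly 10"
defaulters = round(expected_defaulters)             # 10

# Each missed defaulter moves the false-negative rate by 1 / (number of defaulters)
fnr_step = 1 / defaulters

print(f"Expected defaulters: {expected_defaulters:.1f}")
print(f"FNR change per misclassified defaulter: {fnr_step:.0%}")
```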

The split problem is not just about defaulters. The entire 120-applicant validation set might, by chance, be drawn from a period when the bank’s lending standards were stricter — making those applicants systematically easier or harder to classify than the training applicants. A different random seed draws a different period. The measured accuracy reflects the luck of the draw, not the quality of the model.

Run the same 800 applicants through the same logistic regression with seed 42: you get 91% validation accuracy. Swap to seed 137: you get 86%. The model has not changed. The training data has not changed. Only which 120 applicants happened to end up in the held-out set changed. That 5-point swing is sampling noise masquerading as model performance. Reporting either number in isolation is misleading.
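The effect is easy to reproduce. The sketch below repeats a 560/120 train/validation split under 20 different seeds and reports the spread of validation accuracy. It runs on synthetic data from make_classification, which is an assumption standing in for the bank's applications, so the exact figures will differ from the 91% and 86% quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 800 applications (an assumption, not the bank's data)
X, y = make_classification(
    n_samples=800, n_features=10, weights=[0.92, 0.08], random_state=0
)

accuracies = []
for seed in range(20):
    # Same model, same data; only the seed that picks the 120 held-out rows changes
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=560, test_size=120, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(model.score(X_val, y_val))

print(f"Lowest validation accuracy:  {min(accuracies):.3f}")
print(f"Highest validation accuracy: {max(accuracies):.3f}")
```

Any gap between the lowest and highest value comes purely from which rows were held out, since nothing else varies between runs.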

How k-Fold Works

Instead of one fixed split, divide the 800 applications into k equal folds. Train the model on k−1 folds, validate on the remaining fold, then repeat k times — each time holding out a different fold. Every application is used for validation exactly once. Average the k accuracy scores.
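The procedure can be written out as an explicit loop. This is a minimal sketch using scikit-learn's KFold on synthetic data (an assumption in place of the real applications). The counter times_validated is a hypothetical helper added only to confirm that each row is held out exactly once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the 800 applications (an assumption)
X, y = make_classification(n_samples=800, n_features=10, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
times_validated = np.zeros(len(y), dtype=int)  # how often each row is held out

for train_idx, val_idx in kfold.split(X):
    # Train on k-1 folds, validate on the remaining fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))
    times_validated[val_idx] += 1

print("Fold accuracies:", np.round(fold_scores, 3))
print("Mean accuracy:  ", round(float(np.mean(fold_scores)), 3))
print("Every row validated exactly once:", bool((times_validated == 1).all()))
```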

Five-fold cross-validation on the 800-application dataset. Each row represents one training–validation iteration. The highlighted fold rotates across all five positions. Every application appears in exactly one validation fold. The test set, shown below with a dashed border, is held out from the entire CV process.

A worked example: 5-fold cross-validation on the 800-application dataset.

Fold held out	Training examples	Validation examples	Val accuracy
Fold 1	640	160 (fold 1)	88.1%
Fold 2	640	160 (fold 2)	86.9%
Fold 3	640	160 (fold 3)	89.4%
Fold 4	640	160 (fold 4)	87.5%
Fold 5	640	160 (fold 5)	88.8%
Mean	—	—	88.1% ± 0.9%

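The ± 0.9% is the standard deviation across the five fold scores, a measure of how much the estimate moves from one held-out subset to the next. A single split gave either 91% or 86% depending on which 120 applicants ended up held out. Five-fold CV gives 88.1% ± 0.9%: a tighter, more trustworthy number averaged over five evaluations of the same model on five disjoint held-out subsets. The evaluations are not fully independent, because their training sets overlap, but averaging still damps the luck of any single split.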
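The summary row can be reproduced directly from the five fold accuracies in the table. The sketch uses NumPy's default population standard deviation (ddof=0); with the sample standard deviation (ddof=1) the spread rounds to about 1.0 instead.

```python
import numpy as np

# Validation accuracy (%) for folds 1 to 5, copied from the table above
fold_accuracy = np.array([88.1, 86.9, 89.4, 87.5, 88.8])

mean_accuracy = fold_accuracy.mean()
std_accuracy = fold_accuracy.std()   # population std (ddof=0), NumPy's default

print(f"{mean_accuracy:.1f}% ± {std_accuracy:.1f}%")   # 88.1% ± 0.9%
```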

The test set is still held out and untouched. Cross-validation replaces the validation split — not the test split. The correct pipeline:

Split off the test set first — 120 applicants, never touched until final evaluation.
Use CV on the remaining 680 applicants to tune and evaluate the model.
Evaluate the final model on the test set once, after all CV-based decisions are final.

The CV score estimates how well the model generalises. The test score measures it honestly, once.

Choosing k and Stratification

The choice of k involves a trade-off between estimate reliability and computational cost.

k	Training size per fold	Validation size	Estimate variance	Compute cost
3	533 (67%)	267	High — few folds to average	Low
5	640 (80%)	160	Moderate	Moderate
10	720 (90%)	80	Low — many folds to average	High
n=800 (LOO)	799 (99.9%)	1	Very low	Very high
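The size columns in the table follow from dividing 800 by k. A short sketch of that arithmetic, with validation sizes rounded to the nearest whole applicant (the dictionary name sizes is illustrative):

```python
n = 800  # applications in the bank dataset

sizes = {}
for k in (3, 5, 10, n):
    val_size = round(n / k)      # applicants held out in a typical fold
    train_size = n - val_size    # applicants left for training
    sizes[k] = (train_size, val_size)
    print(
        f"k={k:>3}: train {train_size} ({train_size / n:.1%}), "
        f"validate {val_size}, {k} training runs"
    )
```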

For the bank dataset at 800 applications, k=5 or k=10 are both reasonable. k=5 is faster — five model training runs instead of ten. k=10 gives slightly lower estimate variance because ten fold scores are averaged instead of five. The practical rule: k=5 for datasets under 5,000 examples, k=10 for larger datasets.

Stratified k-fold matters here because 8% of applicants default — 64 defaulters out of 800. A regular 5-fold split might assign 10 defaulters to one fold and 16 to another by chance. With 10 defaulters in a validation fold, the false-negative rate estimate from that fold is almost meaningless. Stratified k-fold preserves the 8% default rate in every fold, guaranteeing roughly 13 defaulters per fold.

Random versus stratified fold assignment for the 64 defaulters in the 800-application dataset. Random assignment distributes defaulters unevenly — one fold gets 16 while another gets 9. Stratified assignment guarantees 8% defaulters in every fold.

For any imbalanced classification problem, stratified k-fold is the default — not an optional refinement. Using regular k-fold with an 8% minority class introduces fold-to-fold variance in minority class counts that has nothing to do with model quality. Stratification eliminates that source of noise.
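To see the difference, the sketch below counts defaulters per validation fold under plain and stratified 5-fold splitting. The label vector is constructed by hand with 64 defaulters out of 800, the feature matrix is a placeholder, and the helper defaulters_per_fold is hypothetical. The plain KFold counts depend on the seed, so no particular output is claimed for them.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 800 applications, 64 of them defaulters (8%), as in the text
y = np.zeros(800, dtype=int)
y[:64] = 1
X = np.zeros((800, 1))  # placeholder features; the splitters only need the length

def defaulters_per_fold(splitter):
    """Count defaulters in each validation fold produced by the splitter."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

random_counts = defaulters_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0)
)
stratified_counts = defaulters_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("Plain k-fold:     ", random_counts)
print("Stratified k-fold:", stratified_counts)
```

With 64 defaulters spread over five folds, the stratified counts should land on 12 or 13 in every fold, while the plain counts are free to drift.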

Leave-One-Out and When to Use It

Leave-one-out (LOO) is k-fold where k equals n. Each of the 800 applications is held out once. The model trains on the remaining 799 each time, for a total of 800 training runs.

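The advantage: each training set contains 799 of the 800 available examples, so the model sees nearly all the data in every fold and the estimate carries very little pessimistic bias. Averaging 800 fold scores also smooths the estimate, although the 800 training sets overlap almost entirely, so the scores are strongly correlated and the variance does not literally reach zero. The disadvantage is cost: 800 training runs for a dataset of 800 examples. For any model that is slow to train, this is computationally prohibitive.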

Estimate variance and relative compute cost as k increases from 2 to n=800. Variance falls as more folds are averaged. Compute cost rises linearly with k. The k=5 and k=10 reference lines mark the practical range where estimate reliability and cost are both acceptable.

LOO is appropriate when:

The dataset has fewer than 50–100 examples, making any ordinary fold too small to evaluate performance.
The model trains very fast — linear regression, small decision trees — so hundreds of training runs are affordable.
The context demands minimum bias in the estimate — medical trials, rare event datasets, scientific reproducibility.

For the bank’s 800-application dataset with logistic regression, LOO requires 800 training runs. k=10 requires ten. The estimate quality improvement from k=10 to LOO is typically less than 0.5 percentage points for most models. The compute cost multiplies by 80. Use LOO when training is cheap and the dataset is very small.
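The run counts can be confirmed with scikit-learn's own splitters. A small sketch, using a placeholder feature matrix because only the number of rows matters here:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((800, 1))  # placeholder: only the row count matters for counting splits

loo_runs = LeaveOneOut().get_n_splits(X)          # one training run per application
kfold_runs = KFold(n_splits=10).get_n_splits(X)   # one training run per fold

print(loo_runs, kfold_runs, loo_runs // kfold_runs)   # 800 10 80
```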

python

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# All 800 applications: X (features), y (default label, binary)

# Step 1: Separate the test set before any CV — never touched until the end
X_trainval, X_test, y_trainval, y_test = train_test_split(
  X, y,
  test_size=0.15,
  random_state=42,
  stratify=y           # preserves 8% default rate in both halves
)

# Step 2: Build a pipeline so the scaler is fitted inside each fold
# This prevents preprocessing leakage across folds
pipe = Pipeline([
  ('scaler', StandardScaler()),
  ('model', LogisticRegression(C=1.0, max_iter=1000))
])

# Step 3: Stratified 5-fold CV on the 680 non-test applicants
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(
  pipe, X_trainval, y_trainval,
  cv=skf,
  scoring='accuracy'
)

print(f"CV scores:  {cv_scores.round(3)}")
print(f"Mean ± std: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# cross_val_score does NOT touch X_test or y_test — the test set is safe

# Step 4: Retrain on all non-test data, then evaluate once on the test set
pipe.fit(X_trainval, y_trainval)
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

The pipeline wraps the scaler and the model together. cross_val_score refits the entire pipeline in each fold — scaler included — so the fold’s validation applicants are never incorporated into the scaler’s stored statistics. This prevents a subtle form of preprocessing leakage inside the CV loop itself.

In this section

Does cross-validation replace the test set?

No. Cross-validation replaces the validation split — not the test split. The test set must still exist, still be held out from the entire CV process, and still be used only once at the very end. CV tells you how well the model performs given the current training data and hyperparameters. The test set gives the honest out-of-sample estimate after all decisions are final.

Why does k=5 or k=10 get recommended so often?

Empirically, 5-fold and 10-fold CV produce estimates close to the true out-of-sample error across a wide range of model types and dataset sizes. Going higher than k=10 gives diminishing returns in estimate quality while training cost grows linearly. Going below k=5 introduces substantial variance in the estimate. The sweet spot is k=5 for datasets under 5,000 examples and k=10 for larger.

When should I use stratified k-fold instead of regular k-fold?

Any time your target variable is imbalanced — meaning the minority class makes up significantly less than 50% of the data — use stratified k-fold. Without stratification, some folds might contain very few minority class examples by chance, making the performance estimate on those folds unreliable. At an 8% default rate, an unstratified 5-fold split might put only 6 defaulters in one fold and 16 in another. Stratified k-fold guarantees the same 8% ratio in every fold.

◎ Intuition

The bank's regional branch has 800 labelled loan applications. The analyst tries a 70/15/15 split and gets 91% accuracy. A colleague runs the same split with a different random seed and gets 86%. The analyst is about to apply 5-fold cross-validation to the 680 non-test applications. Before seeing the results:

- With 5-fold CV, each fold holds out 136 of the 680 applications for validation. At an 8% default rate, roughly how many defaulters appear in each validation fold, and is that more or less reliable than the 10 defaulters in the original 120-applicant validation set?
- As k increases from 5 to 10, each training set grows larger but more models need to be trained. Do you expect the CV error estimate to become more reliable, less reliable, or stay roughly the same?
- The decision tree model shows much higher CV error at k=2 than at k=10. What might explain why smaller training sets hurt a decision tree more than a logistic regression?

↺ Reflection

The Reliable Estimate

The reason cross-validation exists is that with 800 examples and a 15% validation fraction, the performance estimate depends on which 120 applicants happened to land in the held-out set. Two different random seeds give 91% and 86% on the same model. That 5-point swing is not information about the model; it is noise from the split. Cross-validation reduces this noise by repeating the split k times and averaging the results, giving every applicant in the CV data exactly one turn as a validation example. The mean of five fold scores is more stable than any single fold score because averaging reduces the influence of any one group of applicants on the final estimate.

The k choice does not control how good the model is; it controls how trustworthy the estimate of model quality is. This distinction matters. Low k means a small number of large folds: the validation sets are large, but only a few estimates are averaged. High k means a large number of small folds: each fold's validation set is small, but many estimates are averaged. The variance in the estimate falls as k increases, not because the model improves, but because more repeated sampling of the same dataset smooths out the sampling noise. k=5 or k=10 occupies the practical sweet spot: variance is low enough for reliable model comparisons, and compute cost stays at five or ten training runs rather than the n runs that leave-one-out requires.

Stratification is a default habit, not an edge-case technique. At 8% defaulters, a regular 5-fold split might assign only 8 defaulters to one fold and 18 to another. A fold with 8 defaulters cannot reliably estimate the false-negative rate the bank cares about. Stratified k-fold keeps the default rate at roughly 8% in every fold regardless of random seed, so each fold's estimate rests on the same minority-class representation. In scikit-learn this is a one-line change: pass a StratifiedKFold splitter as the cv argument. The stratify=y parameter belongs to train_test_split, not to StratifiedKFold, which stratifies on the labels it is given when it splits. The cost of not stratifying is fold-to-fold variance in minority class composition that looks like model variance: a tuning signal built on noise.

Cross-validation gives a reliable estimate of how well the current model and hyperparameter choices perform. It does not tell you whether the model could be better. A CV error that stays near 13% regardless of how k changes or which feature subset is used is a signal that the model may be too simple to capture all the predictable structure in the data — or it may have captured everything the features encode. Distinguishing those two situations requires examining how error changes as the model complexity increases, which is the question the next chapter addresses through the lens of overfitting and underfitting.

Key Points

k-fold cross-validation divides data into k folds, trains on k-1 and validates on 1 repeatedly, ensuring every example is used for validation exactly once and averaging k performance estimates for a more reliable result.

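The choice of k trades the reliability of the performance estimate against computational cost: low k averages only a few fold scores from models trained on less of the data, giving noisier and more pessimistic estimates; high k averages more scores from models trained on nearly all the data, at greater computational cost; k=5 or k=10 is the standard choice.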

Stratified k-fold preserves the class ratio in every fold and should be the default for any imbalanced classification problem — with 8% defaulters, an unstratified split can produce folds with as few as 4 defaulters, making per-fold estimates meaningless.

Cross-validation replaces the validation split in the modelling pipeline but does not replace the test split — the test set must remain held out from the entire CV process and used only once at the end.

Leave-one-out CV minimises estimate bias but requires training n separate models, making it impractical for datasets larger than a few hundred examples with non-trivial models.

✓ Checkpoint

Check Your Understanding

Answer these questions about the bank loan cross-validation scenario covered in this chapter. Each question tests a different learning objective.

An analyst runs the same logistic regression on 800 loan applications with two different random seeds for a 70/15/15 split. Seed 42 gives 91% validation accuracy; seed 137 gives 86%. What is the most accurate description of this 5-point difference?

For the 800-application bank dataset with an 8% default rate, which combination of cross-validation settings is most appropriate?

Put the following steps in the correct order for a cross-validation pipeline that maintains an honest test evaluation.

1. Run stratified 5-fold CV on the non-test data to estimate model performance
2. Separate the test set (held out throughout)
3. Retrain the final model on all non-test data and evaluate once on the test set
4. Use CV scores to select the best hyperparameters

An analyst with 60 labelled medical records and a logistic regression model should prefer leave-one-out CV over 5-fold CV because the dataset is very small and LOO minimises the amount of training data excluded per fold.