SeeingML

Data Splits & Evaluation Foundations

Training a model and measuring its accuracy on the same data it learned from tells you nothing about whether it has learned anything generalisable. This topic covers how to split data into training, validation, and test sets, why cross-validation produces more reliable estimates, what overfitting and underfitting look like geometrically, and how to handle datasets where one class is far more common than another.

Prerequisites:
  • Understanding Data (required): evaluation requires understanding what datasets are and what training means
intermediate · 76 min total · 4 chapters · 4 playgrounds
Chapters
01
intermediate 18 min

Train, Validation, and Test Splits

A model's accuracy on the data it learned from is a meaningless number — the model has already seen those examples and can memorise them. To measure whether a model has learned anything generalisable, the data must be split before training: a training set to fit the model, a validation set to compare hyperparameters without touching the test set, and a test set evaluated exactly once at the very end. This chapter explains the three splits, what makes data leakage between them so dangerous, and how to choose split ratios when your dataset is 10,000 rows versus 100.
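As a rough illustration of the three splits described above, the sketch below shuffles a dataset once and cuts it into disjoint training, validation, and test lists. The function name `three_way_split` and the 70/15/15 ratio are assumptions chosen for the example, not ratios prescribed by the chapter.

```python
import random


def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists.

    The default fractions are illustrative assumptions, not a recommendation.
    """
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be strictly between 0 and 1")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)

    test = shuffled[:n_test]                 # evaluated once, at the very end
    val = shuffled[n_test:n_test + n_val]    # used to compare hyperparameters
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test


train, val, test = three_way_split(range(10_000))
print(len(train), len(val), len(test))
```

Because every row lands in exactly one list, no example can leak from the test set into training. Any statistics used for preprocessing (means, vocabularies, scalers) would likewise need to be computed from `train` only.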

5 sections
02
intermediate 16 min

Cross-Validation

When a dataset is small, a single validation split produces performance estimates with high variance — the measured accuracy depends heavily on which examples happened to land in which set. Cross-validation solves this by repeating the split k times, using a different held-out fold each time, and averaging the k accuracy scores. This chapter explains how k-fold works, how to choose k, what stratification does for imbalanced data, and when leave-one-out cross-validation is appropriate.
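The k-fold procedure described above can be sketched in a few lines of plain Python. The helper names `k_fold_indices` and `cross_val_score`, the toy labels, and the majority-class baseline are all hypothetical and exist only to make the loop concrete; this version does not stratify.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        held_out = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, held_out


def cross_val_score(n, k, fit_and_score, seed=0):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(tr, held) for tr, held in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


# Toy data (an assumption for the example): 70 negatives, 30 positives.
labels = [0] * 70 + [1] * 30


def majority_baseline(train_idx, held_idx):
    """'Fit' by taking the majority training label, then score accuracy."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in held_idx) / len(held_idx)


score = cross_val_score(len(labels), 5, majority_baseline)
print(f"mean accuracy over 5 folds: {score:.2f}")
```

Each example is held out exactly once, so the averaged score depends far less on one lucky or unlucky split than a single validation set would.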

5 sections
03
intermediate 21 min

Overfitting and Underfitting

A model that is too simple misses the real structure in the data (underfit). A model that is too complex memorises the training set's noise (overfit). Both produce bad generalisation but in opposite ways. This chapter develops the diagnostic — the gap between training and validation error — and the three main remedies: more data, less capacity, and regularisation. It closes the loop with Cross-Validation from the previous chapter by answering 'what do I do once CV tells me my model is in the wrong regime?'
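One way to make the train/validation gap diagnostic concrete is a small rule of thumb like the sketch below. The function `diagnose` and both thresholds (`target_error`, `max_gap`) are hypothetical values picked for illustration; sensible thresholds depend on the problem.

```python
def diagnose(train_error, val_error, target_error=0.10, max_gap=0.05):
    """Classify the fitting regime from training and validation error.

    target_error and max_gap are illustrative assumptions, not standard values.
    """
    gap = val_error - train_error
    if train_error > target_error:
        # The model cannot even fit the data it was trained on.
        return "underfit"
    if gap > max_gap:
        # Low training error, but it does not carry over to unseen data.
        return "overfit"
    return "ok"


print(diagnose(0.30, 0.32))  # high error on both sets
print(diagnose(0.01, 0.25))  # large gap between the two
print(diagnose(0.06, 0.08))  # low error, small gap
```

An "underfit" result points towards more capacity, while "overfit" points towards the remedies listed above: more data, less capacity, or regularisation.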

5 sections
04
intermediate 21 min

Imbalanced Data

In imbalanced classification problems (loan defaults at 8%, disease screening at 2%, fraud detection below 0.1%), headline accuracy is a misleading default. A trivial classifier that always predicts the majority class already achieves 92%, 98%, and more than 99.9% accuracy respectively while catching zero minority cases. This chapter develops the right metrics for the imbalanced regime: the confusion matrix as the primary diagnostic, precision and recall as the two lenses on positive-class performance, the F1 score and cost-weighted alternatives, and the precision-recall and ROC curves. It then covers three operational techniques for handling imbalance: threshold adjustment, class weighting in the loss, and resampling (random oversampling, SMOTE, or undersampling).
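To see why accuracy misleads here, the sketch below scores an always-negative classifier on a synthetic dataset with 2% positives, matching the disease-screening figure above. The helper names and the convention of returning 0.0 when precision or recall is undefined are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Undefined ratios are reported as 0.0 (a convention assumed here)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Synthetic labels: 20 positives in 1,000 examples (2% prevalence).
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000  # trivial classifier: always predict the majority class

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The accuracy looks excellent, yet recall and F1 are both zero because every one of the 20 positive cases is missed, which is exactly the failure that the confusion matrix exposes.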

5 sections