Why must the scaler be fitted on the training set only?

Fitting a scaler on the full dataset before splitting uses information from the test set to compute scaling parameters (mean, standard deviation, min, max). When those parameters are then used to transform the training data, the model has indirectly seen test-set statistics during training — a form of data leakage. The correct procedure is to fit the scaler on the training set only, then use those same parameters to transform both the training and test sets.

Which scaling method handles outliers better?

Z-score standardisation is generally less sensitive to outliers than min-max normalisation. With min-max, a single extreme value sets the minimum or maximum outright, compressing all other values into a tiny range. With z-score, an outlier shifts the mean and inflates the standard deviation, which also shrinks the other values' z-scores, but its influence is averaged over all observations rather than defining the scale on its own. For heavily skewed data, robust scaling (using median and interquartile range instead of mean and standard deviation) is even more resistant to outliers.

Normalisation and Scaling

When the units of measurement silently corrupt your distance calculations

A Kaggle-winning fraud detection model initially ranked transaction amount (0–50,000 dollars) as the most important feature — until the team scaled their features and discovered that transaction hour (0–23) was equally predictive once the scale imbalance was removed.

Learning Objectives

1 Name two scaling methods (z-score standardisation and min-max normalisation) and describe their output ranges.
2 Explain why features with large raw ranges dominate Euclidean distance calculations in unscaled feature vectors.
3 Identify which features in a described dataset require scaling and apply the correct method for the algorithm and data distribution.
4 Diagnose data leakage in a described preprocessing pipeline where a scaler is fitted on the full dataset before train-test splitting.

¶ Narrative

The Scale Problem

The music recommendation engine’s feature vector combines six measurements: tempo in beats per minute (60–180), energy as a fraction (0–1), danceability as a fraction (0–1), acousticness as a fraction (0–1), loudness in decibels (−40 to 0), and duration in seconds (90–600). Feeding these six numbers into KNN seems straightforward. But KNN computes the Euclidean distance between feature vectors to decide which songs are similar — and the raw numerical magnitudes determine exactly how much each feature contributes to that distance.

Raw feature ranges in the music dataset. Duration (510 s range) and tempo (120 BPM range) dwarf the unit-scale features. Before scaling, a song's duration dominates every KNN distance calculation — not because duration is more informative, but because its numbers are larger.

A jazz song with tempo 115 BPM and energy 0.32 compared to a pop song with tempo 118 BPM and energy 0.78: the tempo difference is 3, the energy difference is 0.46. The Euclidean distance is √(3² + 0.46²) ≈ 3.04. Tempo contributes 3 units; energy contributes 0.46 units. Tempo's difference is about 6.5 times larger, and after squaring it accounts for roughly 98% of the squared distance, not because tempo is more musically relevant than energy, but because beats per minute are larger numbers than fractions. Add duration to the vector and the problem is far worse: a 30-second duration difference contributes 30 units while a full swing in energy contributes only 1 unit.
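
The arithmetic above can be checked with a few lines of standard-library Python. This is a minimal sketch using only the two songs from the example; the variable names are illustrative and not part of any library.

```python
import math

# Raw feature values from the example: (tempo in BPM, energy as a fraction)
jazz = (115.0, 0.32)
pop = (118.0, 0.78)

# Per-feature absolute differences
tempo_diff = abs(jazz[0] - pop[0])
energy_diff = abs(jazz[1] - pop[1])

# Euclidean distance on the raw, unscaled values
distance = math.sqrt(tempo_diff**2 + energy_diff**2)

# Share of the squared distance that comes from tempo alone
tempo_share = tempo_diff**2 / (tempo_diff**2 + energy_diff**2)

print(f"tempo diff = {tempo_diff:.2f}, energy diff = {energy_diff:.2f}")
print(f"distance = {distance:.2f}")
print(f"tempo share of squared distance = {tempo_share:.1%}")
```

Because the differences are squared before they are summed, the imbalance in the squared distance is larger than the imbalance between the raw differences.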

Z-score standardisation

Z-score standardisation transforms each feature by subtracting its mean and dividing by its standard deviation:

z = \frac{x - \mu}{\sigma}

The result has mean 0 and standard deviation 1. A song with tempo 115 BPM, in a dataset where tempo averages 117 BPM with standard deviation 15, receives a tempo z-score of (115 − 117) / 15 ≈ −0.13. A song with energy 0.32, where energy averages 0.55 with standard deviation 0.23, receives an energy z-score of (0.32 − 0.55) / 0.23 ≈ −1.0. Now both features are on the same scale — expressed in standard deviations from their mean.

python

from sklearn.preprocessing import StandardScaler

# Fit on training data only — never on the full dataset
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit_transform

Z-score standardisation does not bound the output: z-scores outside [−3, +3] are possible for outliers. It does not strictly require a normal distribution, but it works best when the feature is roughly bell-shaped. For skewed features with extreme outliers, a single outlier inflates the standard deviation and compresses the z-scores of all other values.
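
To see how one extreme value inflates the standard deviation, the sketch below standardises a small list of tempos by hand, with and without an outlier. The tempo values are hypothetical (they are not taken from the music dataset), and `z_scores` is a helper written for this illustration rather than a library function.

```python
import statistics

def z_scores(values):
    """Standardise a list of numbers using the population mean and standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population SD (ddof=0), as StandardScaler uses
    return [(v - mu) / sigma for v in values]

# Hypothetical tempo values in BPM
tempos = [100, 110, 115, 120, 125, 130]
tempos_with_outlier = tempos + [400]

clean = z_scores(tempos)
skewed = z_scores(tempos_with_outlier)

# Spread of the six ordinary songs, before and after the outlier is added
spread_clean = max(clean) - min(clean)
spread_skewed = max(skewed[:6]) - min(skewed[:6])

print(f"spread of ordinary songs without outlier: {spread_clean:.2f}")
print(f"spread of ordinary songs with outlier:    {spread_skewed:.2f}")
print(f"z-score of the outlier itself:            {skewed[6]:.2f}")
```

The ordinary songs keep their order, but once the outlier is included they occupy a much narrower band of z-scores, so tempo carries less weight in any distance calculation.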

Min-max normalisation

Min-max normalisation maps each feature to a fixed range, typically [0, 1]:

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

A tempo of 115 BPM, in a dataset where tempo ranges from 62 to 195 BPM, maps to (115 − 62) / (195 − 62) ≈ 0.40. The output is always between 0 and 1 for values within the observed training range. A value outside that range maps outside [0, 1]: a 200 BPM song seen at test time would become (200 − 62) / (195 − 62) ≈ 1.04, which can cause problems for algorithms that assume bounded inputs.

python

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

Min-max normalisation is sensitive to outliers. A single song with an extreme tempo sets the maximum, compressing all other tempo values into a narrow range. When outliers are present, z-score standardisation or robust scaling (based on median and interquartile range) is more appropriate.
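
The compression effect is easy to reproduce by hand. The sketch below applies the min-max formula to a hypothetical list of tempos, first as is and then with one extreme value appended; `min_max` is an illustrative helper, not a library call.

```python
def min_max(values):
    """Map values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The worked example: 115 BPM in a dataset ranging from 62 to 195 BPM
scaled_115 = (115 - 62) / (195 - 62)
print(f"115 BPM -> {scaled_115:.2f}")

# Hypothetical tempos; the second list adds one extreme value
tempos = [62, 90, 115, 140, 195]
tempos_with_outlier = tempos + [1000]

scaled = min_max(tempos)
scaled_with_outlier = min_max(tempos_with_outlier)

print([round(v, 2) for v in scaled])
print([round(v, 2) for v in scaled_with_outlier])
```

With the outlier present, the five ordinary songs are squeezed into the bottom of the [0, 1] range, so differences between them almost vanish.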

Comparing the methods

Method	Output range	Bounded output	Outlier sensitivity	Typical use
No scaling (raw)	Original units	No	N/A	Tree-based algorithms only
Z-score (StandardScaler)	Unbounded; most values within ≈ −3 to +3	No	Moderate	Default for most distance-based algorithms
Min-max (MinMaxScaler)	[0, 1]	Yes	High	When bounded range is required (e.g. neural network inputs)
Robust (RobustScaler)	Centred on median	No	Low	Datasets with significant outliers
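
The outlier-sensitivity column can be illustrated by running the three scikit-learn scalers on the same data. This is a rough sketch on a hypothetical single-feature dataset with one extreme value; the absolute spread of the ordinary rows is used as the measure because that is what the feature would contribute to a distance calculation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single-feature dataset: five ordinary tempos and one extreme value
X = np.array([[100.0], [110.0], [115.0], [120.0], [130.0], [400.0]])

scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),  # centres on the median, divides by the IQR
}

spreads = {}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(X).ravel()
    ordinary = scaled[:5]  # the five non-outlier rows
    spreads[name] = ordinary.max() - ordinary.min()
    print(f"{name:15s} spread of ordinary rows = {spreads[name]:.2f}")
```

On this data the robust scaler leaves the ordinary rows with the widest spread, because the median and interquartile range are barely moved by the single extreme value.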

Common Mistake

Fitting the scaler on the full dataset before splitting into training and test sets is a common and consequential mistake. The scaler computes statistics (mean, standard deviation, minimum, maximum) from whichever data it sees. If it sees the full dataset — including the test set — those statistics encode information about the test data. When the model is then trained on the scaled training data, it has indirectly used test-set information during training. This is data leakage. The model’s performance on the test set will look better than it actually is, and the model will underperform in production. Always split first. Then fit the scaler on the training set only. Apply the same fitted scaler (with training-set statistics) to transform both sets.
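
The difference between the leaky and the correct order can be sketched as follows. The data here is randomly generated purely for illustration, and the names `leaky_scaler` and `scaler` are chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 rows, 2 features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(120, 15, 200), rng.uniform(0, 1, 200)])
y = rng.integers(0, 2, 200)

# Leaky: the scaler sees every row, including rows that later become the test set
leaky_scaler = StandardScaler().fit(X)

# Correct: split first, then fit on the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training-set mean and SD

print("mean fitted on full data: ", leaky_scaler.mean_)
print("mean fitted on train only:", scaler.mean_)
```

The two sets of fitted means differ slightly. That difference is exactly the test-set information that the leaky version lets into training.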

Real World

The standard pattern in winning Kaggle competition notebooks is: split → fit scaler on train → transform train and test with that same fitted scaler. Teams that instead apply scaler.fit_transform(X) to the full dataset before splitting, and then wonder why their cross-validation score is optimistic, are experiencing exactly this leakage. Scikit-learn’s Pipeline class enforces the correct order automatically when the whole pipeline is passed to a cross-validation routine such as cross_val_score: the scaler is refitted on the training fold of each split, never on the validation fold.
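
A minimal sketch of the pipeline pattern, assuming synthetic data from `make_classification` as a stand-in for a real dataset and an arbitrary choice of five neighbours:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the dataset discussed in the text)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaler and model are bundled, so they are always fitted together
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# For each of the 5 folds, the whole pipeline (scaler included) is fitted
# on the training portion and only applied to the held-out portion
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy across folds: {scores.mean():.3f}")
```

Passing the pipeline itself to `cross_val_score`, rather than pre-scaled data, is what keeps the validation fold out of the scaler's statistics.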

The scale problem has the same structure as the encoding problem in the previous chapter. Integer encoding of unordered categories introduced false ordering — numbers that implied relationships that did not exist in reality. Unscaled features introduce false magnitude — numbers that dominate distance calculations for reasons unrelated to the feature’s actual information content. Both errors corrupt the geometry of feature space. Both are silent: the model trains without complaint and achieves reasonable accuracy, making the bug invisible until the representation is examined directly.

In this section

What is the difference between normalisation and standardisation?

Normalisation (min-max scaling) maps feature values to a fixed range, typically [0, 1], using the observed minimum and maximum. Standardisation (z-score) subtracts the mean and divides by the standard deviation, producing values with approximately zero mean and unit variance. Normalisation is bounded; standardisation is not. Both are forms of feature scaling and the terms are often used loosely and interchangeably.

Does scaling change the information in the data?

No. Scaling is an increasing linear (shift-and-stretch) transformation applied to each feature independently: it preserves all ordering and relative distances within each feature. A song with higher tempo than another before scaling has higher tempo after scaling. No information is created or destroyed. Only the numerical magnitudes change, and with them the relative weight each feature carries in a distance calculation.
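
A quick check of this claim on hypothetical tempo values: after min-max scaling, the ranking of the songs is unchanged and so is the ratio between any two gaps. The `ranking` helper is written for this illustration.

```python
import math

# Hypothetical tempo values in BPM
tempos = [128.0, 95.0, 140.0, 72.0, 118.0]

lo, hi = min(tempos), max(tempos)
scaled = [(t - lo) / (hi - lo) for t in tempos]  # min-max scaling

def ranking(values):
    """Indices of the values, ordered from smallest to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Ratio of two gaps between songs, before and after scaling
gap_ratio_raw = (tempos[0] - tempos[1]) / (tempos[2] - tempos[3])
gap_ratio_scaled = (scaled[0] - scaled[1]) / (scaled[2] - scaled[3])

print("same ranking:  ", ranking(tempos) == ranking(scaled))
print("same gap ratio:", math.isclose(gap_ratio_raw, gap_ratio_scaled))
```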

Do tree-based algorithms need feature scaling?

No. Decision trees, random forests, and gradient boosted trees split on one feature at a time using threshold comparisons — they never compute distances between feature vectors. The relative ordering of values within a feature is all that matters for splits, and scaling preserves ordering. Tree-based algorithms are completely insensitive to feature scaling.
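
One way to check this empirically is to train the same decision tree on raw and on standardised features and compare predictions. The sketch below uses synthetic data with one feature deliberately inflated by a factor of 1000; the predictions are expected to agree, barring rare floating-point edge cases at split thresholds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with one feature inflated to a much larger scale
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X[:, 0] *= 1000.0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)

# Same tree settings and seed, trained once on raw and once on scaled features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test rows on which the two trees agree
agreement = np.mean(pred_raw == pred_scaled)
print(f"agreement between raw and scaled trees: {agreement:.2%}")
```

Repeating the same experiment with a KNN classifier in place of the tree would generally give different predictions for raw and scaled inputs, which is the contrast this section describes.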

◎ Intuition

The playground is about to show you the same 120 songs plotted in 2D feature space under three conditions: raw values, z-score standardisation, and min-max normalisation. Before you switch: when tempo is on the x-axis (range 60–180 BPM) and energy is on the y-axis (range 0–1), which axis do you think will dominate the spread of songs, and what do you expect to happen to that spread when both features are scaled to the same numerical range?

↺ Reflection

Key Ideas

Feature vectors combine measurements from different physical quantities: tempo in beats per minute, energy as a dimensionless fraction, duration in seconds. When KNN computes Euclidean distance between two feature vectors, the distance is the square root of the sum of squared differences across every feature. A feature with a raw range of 120 BPM contributes up to 120² = 14,400 to the squared sum; a feature with a range of 1 contributes at most 1. The algorithm can therefore give the larger-range feature up to 14,400 times more weight in squared distance, not because it contains more information, but because its units produce larger numbers. This is the false-magnitude problem.

Z-score standardisation resolves false magnitude by expressing every feature in standard deviations from its mean. After transformation, a feature’s contribution to Euclidean distance reflects how unusual a song’s value is relative to the typical spread of that feature in the training data. A tempo 1.5 standard deviations above average contributes as much as an energy level 1.5 standard deviations above average, even though, with the example statistics used earlier (tempo mean 117 BPM, standard deviation 15; energy mean 0.55, standard deviation 0.23), the raw values are roughly 140 BPM and 0.90. No physical unit is privileged over any other.

Min-max normalisation resolves the same problem differently: by mapping the observed minimum and maximum of each feature to 0 and 1, all features share the same [0, 1] range. The advantage over z-score is a bounded, interpretable output. The disadvantage is sensitivity to outliers: a single song with an unusually high tempo sets the maximum and compresses all other songs’ tempo values toward zero. In a dataset with meaningful outliers, this compression can hurt more than the original scale imbalance.

The training-set-only rule for scaler fitting is not a technicality — it is the same data leakage principle that governs every step of the ML pipeline. The test set simulates data the model has never seen. If the scaler uses test-set statistics (mean, standard deviation, minimum, maximum) to transform the training data, the model has indirectly observed the test set during training. Its performance estimate on the test set will be optimistic. The correct procedure: split the data, fit the scaler on the training fold only, apply the same scaler parameters to both training and test folds. Scikit-learn’s Pipeline class enforces this automatically within cross-validation loops.

Scaling is essential for distance-based and magnitude-sensitive algorithms: KNN, SVM with RBF kernel, PCA, neural networks, and linear models with L2 regularisation either compute something proportional to Euclidean distance or rely on feature magnitudes being comparable. Decision trees, random forests, and gradient boosted trees are completely unaffected by scaling: they split on threshold comparisons within a single feature and never compute inter-feature distances. This insensitivity is part of why tree-based methods dominate tabular benchmarks: they are robust to both encoding errors and scale errors that corrupt distance-based methods. The next chapter examines a third class of representation decision: how to measure similarity between items in a feature space, and why the choice of distance metric matters as much as the choice of features.

Key Points

Euclidean distance is sensitive to numerical magnitude: a feature with range 500 can produce differences up to 500× larger than a feature with range 1 (250,000× larger once squared), regardless of how informative either feature actually is.

Z-score standardisation (subtract mean, divide by standard deviation) produces approximately zero mean and unit variance. Min-max normalisation maps to [0, 1] using observed minimum and maximum. Both eliminate false magnitude from distance calculations.

The scaler must be fitted on the training set only. Fitting on the full dataset before splitting leaks test-set statistics into the training process — a form of data leakage that inflates apparent model performance.

Tree-based algorithms are completely insensitive to feature scaling. Distance-based algorithms (KNN, SVM) and other magnitude-sensitive methods (neural networks, linear regression with regularisation) require it.

✓ Checkpoint

Check Your Understanding

Four questions on feature scaling, data leakage, and algorithm sensitivity.

A KNN model predicts loan default using two features: annual income (range $20,000–$200,000) and credit score (range 300–850). The income feature has a range roughly 330× wider than credit score. What is the most likely consequence of using raw unscaled features?

True or false: it is acceptable to fit a StandardScaler on the full dataset (training + test combined) before splitting into training and test sets, as long as no target labels are used during scaling.

A dataset contains transaction amounts ranging from $1 to $850,000, with most transactions under $5,000 but a small number of very large legitimate transactions. Which scaling method is most appropriate?

Order these preprocessing steps from first to last for a supervised learning pipeline that uses KNN with cross-validation:

1. Split data into training and test sets
2. Fit StandardScaler on the training set
3. Transform training set with fitted scaler
4. Transform test set with the same fitted scaler (do not refit)