Principal Component Analysis
Replace many correlated features with a few directions of maximum variance
A data scientist's 10-feature housing dataset has a top-left correlation cluster of sqft/rooms/bedrooms/baths at r > 0.7. Running PCA produces one principal component — PC1 — that captures 57% of the dataset's total variance all by itself, and the first three principal components together capture 87%. The original 10 noisy, correlated columns can be replaced by 3 clean, uncorrelated ones with almost no loss. The downstream linear regression's coefficients are stable for the first time.
- 1 Explain PCA as a rotation of the feature-space axes to align with the directions of maximum data variance, so correlated features are replaced by uncorrelated linear combinations ordered by how much spread they carry.
- 2 Describe how the principal components are the eigenvectors of the data's covariance matrix, with each eigenvalue equal to the variance of the data projected onto that eigenvector, and explain why symmetric covariance matrices guarantee real perpendicular principal components.
- 3 Read a scree plot and cumulative-variance curve to choose the number of components k to keep, using practical retention thresholds (80%, 95%) and the elbow rule where marginal variance contribution sharply decreases.
- 4 Identify common PCA pitfalls — unscaled features letting high-variance columns dominate the components, fitting PCA on the full dataset before the train/test split (leakage), interpreting PC loadings as causal relationships — and explain what to do instead.
Rotation, Projection, Compression
The previous chapter ended with an observation: four housing features — sqft, rooms, bedrooms, and baths — inter-correlate at r > 0.7 because they all measure roughly the same underlying concept, “house size”. The features are redundant but the concept is real. Ideally we would have one feature called size that combines all four, keeps most of what they collectively carry, and makes the downstream regression’s coefficients stable. Principal component analysis — PCA — is the tool that constructs that size feature from the correlation structure alone, with no domain knowledge required.
The geometric picture
PCA is most easily understood as a rotation of the coordinate axes. Take the scatter of sqft vs rooms. Because the two features correlate at r = 0.85, the cloud is stretched along a diagonal. The direction of maximum spread — what we would call the long axis of the cloud — is not aligned with either the sqft axis or the rooms axis. It sits at roughly 45° between them.
If we define two new features — PC1, the position along the diagonal, and PC2, the perpendicular distance to the diagonal — then PC1 carries almost all the spread in the cloud and PC2 carries almost none. Specifically: with r = 0.85, PC1’s variance (its eigenvalue) is 1.85 in standardised units and PC2’s is 0.15. Of the total variance in the dataset (sum of eigenvalues = 2.0), PC1 explains 92.5% and PC2 the remaining 7.5%.
This is the entire pedagogical core of PCA. In this two-feature example it is just a 45° rotation. In ten dimensions it is a rotation in 10-dimensional space to a new basis whose first axis captures the most variance of any possible axis, the second captures the most variance among directions perpendicular to the first, and so on. Every axis in the new basis is uncorrelated with every other: the correlation matrix of the transformed features is the identity, with ones on the diagonal and zeros everywhere else.
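The rotation picture can be checked numerically. The sketch below uses synthetic data drawn with correlation 0.85 as a stand-in for standardised sqft and rooms (it is not the chapter's housing dataset), fits scikit-learn's PCA, and reports the angle of PC1 and the correlation between the resulting scores.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins for standardised sqft and rooms with correlation ~0.85
# (illustrative data only, an assumption for this sketch).
cov = np.array([[1.0, 0.85],
                [0.85, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # coordinates in the rotated PC1/PC2 basis

# Angle of PC1 relative to the first original axis, folded into [0, 180)
# because the sign of a principal direction is arbitrary.
pc1 = pca.components_[0]
angle_deg = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180.0

# Correlation matrix of the PC scores; off-diagonal entries should be ~0.
score_corr = np.corrcoef(scores, rowvar=False)

print("PC1 angle (degrees):", round(float(angle_deg), 1))
print("Share of variance per PC:", pca.explained_variance_ratio_.round(3))
print("Correlation of PC scores:")
print(score_corr.round(3))
```

On a finite sample the angle and variance shares land close to, but not exactly on, 45° and 92.5% / 7.5%, because the sample correlation and variances differ slightly from their population values.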
The algebra
The directions of maximum variance are exactly the eigenvectors of the data’s covariance matrix. For a d-feature dataset, the covariance matrix is d × d: the entry at row i, column j is the covariance between feature i and feature j. The diagonal entries are the variances of each feature alone. If the features have been standardised (mean 0, variance 1), the covariance matrix is identical to the correlation matrix we looked at in the previous chapter.
For the sqft/rooms pair with correlation 0.85, the covariance matrix of the standardised features is [[1, 0.85], [0.85, 1]].
The eigenvectors of this 2×2 matrix turn out to be exactly (1, 1)/√2 and (1, −1)/√2, the 45° and −45° directions, and the eigenvalues are 1 + 0.85 = 1.85 and 1 − 0.85 = 0.15. This is not a coincidence: for any symmetric 2×2 matrix with equal diagonal entries and a non-zero off-diagonal term, the eigenvectors are always the ±45° directions, with eigenvalues equal to the diagonal entry plus and minus the off-diagonal term.
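The eigenvalue arithmetic for the r = 0.85 case can be reproduced with NumPy. This is a minimal sketch that builds the 2×2 covariance matrix for two standardised features and decomposes it with `np.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

r = 0.85
cov = np.array([[1.0, r],
                [r, 1.0]])

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest-variance direction (PC1) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

explained = eigenvalues / eigenvalues.sum()

print("Eigenvalues:", eigenvalues.round(2))
print("Share of variance:", explained.round(3))
print("PC1 direction (sign is arbitrary):", eigenvectors[:, 0].round(3))
print("PC1 . PC2:", round(float(eigenvectors[:, 0] @ eigenvectors[:, 1]), 12))
```

The signs of the returned eigenvectors are not fixed by the mathematics, so (1, 1)/√2 may come back as (−1, −1)/√2; both describe the same axis.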
More generally, the matrices appearing in PCA are always symmetric (covariance matrices are symmetric by construction — Cov(X, Y) = Cov(Y, X)) and positive-semi-definite (variances are never negative). Symmetric positive-semi-definite matrices have real, non-negative eigenvalues and orthogonal eigenvectors, so PCA always produces real perpendicular principal components. This is why PCA is computationally stable and why it exists as a well-defined procedure at all: not every matrix has such nice structure.
In scikit-learn, PCA() typically does not form the covariance matrix explicitly: its classic solvers apply singular value decomposition (SVD) to the centred data matrix directly. (Recent scikit-learn releases also offer a covariance-based solver for datasets with many more rows than features.) SVD is numerically more stable because it avoids squaring the condition number of the data, and randomised SVD variants can be much faster when only a few components are needed from a large dataset. The underlying mathematics is identical: the principal components are the right singular vectors, and the squared singular values divided by n−1 are the eigenvalues. The geometric picture is the same regardless of which algorithm computes it.
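The equivalence between the two routes can be verified directly. The following sketch uses a small synthetic matrix (an assumption for illustration), computes eigenvalues from the covariance matrix and from the SVD of the centred data, and compares them along with the directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data: 200 rows, 4 features (illustrative only).
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)  # PCA operates on centred data
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]        # descending order
eigvecs = eigvecs[:, ::-1]     # columns reordered to match

# Route 2: SVD of the centred data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S ** 2 / (n - 1)    # squared singular values / (n - 1)

# Rows of Vt should match the eigenvector columns up to sign.
directions_match = np.allclose(np.abs(Vt @ eigvecs), np.eye(4), atol=1e-6)

print("Eigenvalues agree:", np.allclose(eigvals, svd_vals))
print("Directions agree up to sign:", directions_match)
```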
Compression: keeping only the top k components
The real operational value of PCA comes from keeping only the first k principal components — the ones with the largest eigenvalues. Projecting the data onto those k directions gives a lower-dimensional representation that retains as much variance as possible. The variance retained is the sum of the first k eigenvalues divided by the sum of all eigenvalues.
Consider a housing dataset with 10 features. After standardising each feature and running PCA, the eigenvalues might come out as 5.70, 2.20, 0.80, 0.55, 0.35, 0.18, 0.12, 0.06, 0.03, 0.01, a total of 10.0 (as expected, since each of the 10 standardised features contributes variance 1).
The scree plot (left panel) visualises the rapid decay of eigenvalues. The cumulative curve (right panel) reads off how much variance you retain for any choice of k. For this dataset:
- k = 3: 87% of variance retained. A good default choice — past the elbow, below the 95% threshold, uses only 30% of the original feature count.
- k = 5: 96% retained. More conservative; appropriate if 4% variance loss is too much for the downstream task.
- k = 7: 99.0% retained. Nearly lossless but wastes most of PCA’s compression benefit.
A common rule: choose k by the variance threshold (80% or 95% depending on task sensitivity), then verify by checking downstream performance. If the model’s validation accuracy is equal at k=3 and k=5, pick the smaller one. If k=3 costs meaningful accuracy, use k=5. The scree plot is the reason to have a short candidate list of k values, not the final word on which one to use.
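The retained-variance figures above follow from a cumulative sum over the eigenvalues. The sketch below reproduces them from the example eigenvalues; `smallest_k` is a hypothetical helper introduced here for illustration, not a library function.

```python
import numpy as np

# Example eigenvalues from the 10-feature housing illustration above.
eigenvalues = np.array([5.70, 2.20, 0.80, 0.55, 0.35,
                        0.18, 0.12, 0.06, 0.03, 0.01])

ratio = eigenvalues / eigenvalues.sum()   # share of variance per component
cumulative = np.cumsum(ratio)             # share retained by the first k


def smallest_k(cumulative, threshold):
    """Smallest k whose cumulative variance share reaches the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)


for k, share in enumerate(cumulative, start=1):
    print(f"k={k:2d}  cumulative variance = {share:.1%}")

print("k for 80% threshold:", smallest_k(cumulative, 0.80))
print("k for 95% threshold:", smallest_k(cumulative, 0.95))
```

The thresholds only produce candidates; the comparison of downstream validation performance at those candidate values of k still decides the final choice.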
A worked example: five houses
Take five houses from a dataset, standardised and described only by sqft and rooms:
| House | sqft (z) | rooms (z) | PC1 score | PC2 score | Price class |
|---|---|---|---|---|---|
| A | −1.40 | −1.20 | −1.84 | −0.14 | low |
| B | −0.60 | −0.30 | −0.64 | −0.21 | low |
| C | 0.20 | 0.40 | 0.42 | −0.14 | mid |
| D | 0.90 | 1.00 | 1.34 | −0.07 | high |
| E | 1.60 | 1.40 | 2.12 | 0.14 | high |
PC1 score is the projection onto PC1: x · PC1 = (sqft_z + rooms_z)/√2. PC2 is (sqft_z − rooms_z)/√2. The PC1 scores neatly range from −1.84 to +2.12, separating houses by size. The PC2 scores range only from −0.21 to +0.14 — a tenth the spread of PC1, reflecting that 92.5% of the variance lives along PC1.
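The scores in the table can be recomputed as a single matrix product. This sketch assumes the fixed ±45° directions from the two-feature example rather than directions fitted to these five rows.

```python
import numpy as np

houses = ["A", "B", "C", "D", "E"]

# Standardised (sqft, rooms) values from the table above.
Z = np.array([
    [-1.40, -1.20],
    [-0.60, -0.30],
    [0.20, 0.40],
    [0.90, 1.00],
    [1.60, 1.40],
])

# Principal directions for two standardised features with positive correlation.
pc1 = np.array([1.0, 1.0]) / np.sqrt(2)
pc2 = np.array([1.0, -1.0]) / np.sqrt(2)
W = np.column_stack([pc1, pc2])  # columns are the new axes

scores = Z @ W  # column 0 = PC1 score, column 1 = PC2 score

for name, (s1, s2) in zip(houses, scores):
    print(f"{name}: PC1 = {s1:+.2f}, PC2 = {s2:+.2f}")
```

Keeping only the first column of `scores` is the projection onto PC1 alone, which is the one-column size index used in the univariate fit.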
The price class visibly tracks PC1 position in both panels. In the original space, a regression on sqft and rooms would have had the correlation problem we saw in the last chapter. After projection onto PC1 alone, the regression becomes a simple univariate fit — price as a function of the size index, with no correlation to worry about.
The preprocessing order matters
PCA fits on training data to learn the rotation, then applies that rotation to validation and test data. The same pipeline discipline from the Data Splits chapter applies: fitting PCA on the full dataset before splitting is a form of data leakage — the rotation has been informed by validation and test rows.
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# X: 10,000 house feature rows with 10 columns (sqft, rooms, bedrooms, baths,
#    lot_size, garages, floors, age, year_built, crime_rate)
# y: house price in GBP
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42,
)

# The pipeline: standardise -> PCA -> regression
# PCA is fitted INSIDE the pipeline, so it only sees X_trainval when calling .fit
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('reg', LinearRegression()),
])
pipe.fit(X_trainval, y_trainval)

# Inspect what PCA learned
pca_step = pipe.named_steps['pca']
print("Variance ratio per PC:", pca_step.explained_variance_ratio_.round(3))
# Output (example): [0.57 0.22 0.08]
print(f"Cumulative at k=3: {pca_step.explained_variance_ratio_.sum():.1%}")
# Output: Cumulative at k=3: 87.0%

# Evaluate on test
print("R2 on test:", pipe.score(X_test, y_test))

# View the PC1 loadings: how the 'size' component weights each original feature
pc1_loadings = pca_step.components_[0]
feature_names = ['sqft', 'rooms', 'bedrooms', 'baths', 'lot_size',
                 'garages', 'floors', 'age', 'year_built', 'crime_rate']
for f, w in sorted(zip(feature_names, pc1_loadings), key=lambda x: -abs(x[1])):
    print(f"  {f:14s} {w:+.3f}")
# Output (example):
#   sqft           +0.420
#   rooms          +0.405
#   bedrooms       +0.385
#   baths          +0.350
#   lot_size       +0.220
#   garages        +0.190
#   floors         +0.150
#   age            -0.130
#   year_built     +0.125
#   crime_rate     -0.050
```

The PC1 loadings (last block) confirm the interpretation: PC1 is a positively-weighted combination of all the size-related features, with smaller coefficients on the less-correlated features. It is a ‘size index’ that summarises what the four redundant size features collectively contained, in one column instead of four.
Common pitfalls
Three failure modes show up repeatedly in practice:
Unscaled features. If one column has variance 10,000 (dollars) and another has variance 1 (z-scored feature), PCA will essentially recover the dollar feature as PC1 and everything else as small corrections. Always standardise before PCA, or use a Pipeline that does it automatically.
Leakage from fitting on the full dataset. Fitting PCA on train + test together means the rotation was informed by test rows. Subtle, but test performance becomes optimistic. The Pipeline above is leakage-free because pipe.fit(X_trainval, ...) never sees X_test.
Over-interpreting loadings. A PC1 with positive weights on every feature is usually a ‘general intensity’ component. A PC2 that contrasts old-construction features against new-construction features might look like an ‘era’ component but is really just a mathematical artifact of the covariance structure. Interpretations are hypotheses to verify with domain knowledge, not causal truths.
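The unscaled-features pitfall is easy to reproduce. The sketch below uses hypothetical data: five columns on a z-score-like scale that share a latent factor, plus one unrelated column with variance near 10,000. It compares the PC1 loadings with and without standardisation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 2000

# Hypothetical data: a shared latent factor drives five small-scale columns;
# the sixth column is independent noise on a much larger scale.
latent = rng.normal(size=n)
small = 0.8 * latent[:, None] + rng.normal(scale=0.6, size=(n, 5))
big = rng.normal(scale=100.0, size=n)  # variance around 10,000
X = np.column_stack([small, big])

# PC1 without scaling: dominated by the large-scale column.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]

# PC1 after standardising: reflects the shared structure of the five columns.
X_scaled = StandardScaler().fit_transform(X)
scaled_pc1 = PCA(n_components=1).fit(X_scaled).components_[0]

print("PC1 loadings, raw:     ", raw_pc1.round(3))
print("PC1 loadings, scaled:  ", scaled_pc1.round(3))
```

In a real workflow the scaler would sit inside a Pipeline and be fitted on training rows only; it is applied to the whole array here purely to isolate the effect of scale.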
PCA is a linear method and only captures linear correlation structure. For data that lies on a curved manifold — a non-linear lower-dimensional structure — PCA will miss the structure. The next chapter compares PCA against non-linear alternatives (t-SNE, UMAP) and the supervised alternative (LDA), developing the judgement for when each is the right tool.
What does 'direction of maximum variance' mean concretely?
Imagine you have a cloud of points in a 2D plane where the cloud is stretched diagonally — long along one direction, thin perpendicular to it. The direction of maximum variance is the long axis: the direction along which the points are most spread out. If you projected every point onto that axis (dropped each one straight onto the line), the resulting 1D values would have the largest possible variance for any single direction. PC1 is that direction. PC2 is the direction of maximum variance among all directions perpendicular to PC1 — which in 2D just means the perpendicular direction itself.
Why must the features be standardised before PCA?
PCA finds directions of maximum variance in the feature space. If one feature is measured in thousands (annual income) and another in units (number of children), the income feature's variance is millions of times larger purely because of its scale, so PC1 will point almost entirely along the income axis regardless of the underlying relationships. Standardising every feature to mean 0 and variance 1 first (StandardScaler) ensures that variance measures genuine spread rather than scale mismatch. Without standardisation, PCA is often useless — the top components just recover the biggest-scale features.
How do I choose how many components to keep?
Three common approaches. First, a variance threshold: keep enough components to reach 80% or 95% of total variance. The 80% rule is aggressive (keeps fewer components); 95% is conservative (more). Second, the elbow rule: plot the variance explained per component and look for the point where the curve levels off; past the elbow, additional components contribute negligibly. Third, downstream task performance: try different values of k and see which gives the best validation accuracy on the model you actually care about. The variance threshold is the fastest, the elbow is usually close to optimal, and the task-specific evaluation is the most rigorous.
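The third approach can be sketched with a cross-validated search over k. The example below is illustrative only: the data are synthetic (10 features generated from 3 latent factors, an assumption made for this sketch), the candidate list for k is arbitrary, and a linear regression stands in for whatever downstream model is actually in use.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 600

# Hypothetical data: 10 observed features built from 3 latent factors plus
# noise; the target depends on the latent factors.
latent = rng.normal(size=(n, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.3 * rng.normal(size=(n, 10))
y = latent @ np.array([3.0, -2.0, 1.0]) + 0.5 * rng.normal(size=n)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("reg", LinearRegression()),
])

# Scaler and PCA are refitted inside every fold, so no fold leaks into another.
candidates = [1, 2, 3, 5, 7]
search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": candidates},
    cv=5,
    scoring="r2",
)
search.fit(X, y)

for k, score in zip(search.cv_results_["param_pca__n_components"],
                    search.cv_results_["mean_test_score"]):
    print(f"k={k}: mean CV R2 = {score:.3f}")
print("Best k by CV:", search.best_params_["pca__n_components"])
```

When several values of k score within noise of each other, the smaller one is usually the better pick.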
A startup is working with customer survey data: 40 features per respondent, many of them scaled on 1-to-5 Likert scales and heavily correlated. They want to reduce the feature set before training a logistic regression. Before running PCA:
- If the first principal component has loadings that are all positive and roughly equal in magnitude (around 0.15 each across 40 features), what does that suggest about the interpretation of PC1, and does that interpretation help or hurt downstream modelling?
- They forgot to standardise the features, and one of the 40 columns is `annual_income` in raw dollars (variance ≈ 10¹⁰) while the others are 1–5 Likert scales (variance < 2). What do you predict PC1 will look like, and is that a useful reduction?
- The scree plot shows the first five eigenvalues as 12.4, 6.1, 3.8, 2.0, 1.2 and everything after that under 0.8. How many components would you keep, and what is your reasoning: elbow, variance threshold, or something else?
What PCA Is and What It Is Not
The correlation-and-redundancy chapter ended by pointing out that four housing features — sqft, rooms, bedrooms, baths — all measure roughly the same underlying concept of ‘house size’. Principal component analysis is the tool that constructs that summary concept from the data alone. Geometrically, PCA rotates the coordinate axes so the first new axis points along the direction of maximum variance in the data cloud. If the cloud is stretched along a diagonal because two features correlate, PC1 is that diagonal, and projecting the data onto it absorbs most of the spread into a single new feature. The remaining principal components are ordered by variance captured and are always perpendicular to each other — an uncorrelated basis, by construction.
The algebraic statement is tighter: the principal components are the eigenvectors of the data’s covariance matrix, and the corresponding eigenvalues are the variances of the data projected onto those directions. Because covariance matrices are symmetric (Cov(X, Y) = Cov(Y, X)) and positive-semi-definite (variances are non-negative), their eigenvectors are always real and perpendicular. This is not an assumption PCA imposes — it is a fact about how covariance matrices are built, and it is why PCA is a structurally well-behaved procedure on any real dataset. Implementations in scikit-learn and elsewhere use singular value decomposition rather than literal eigendecomposition of the covariance for numerical stability, but the principal components computed are mathematically identical.
Choosing the number of components to keep is a judgement call that has three standard approaches: a variance-retention threshold (keep enough components to cover 80% or 95% of total variance), an elbow rule (keep components up to where the scree plot visibly bends), or downstream task performance (try several values and pick the smallest that preserves model quality). In practice the three approaches usually agree within one or two components. The elbow is often near the 85–95% cumulative variance point for typical datasets; a reasonable default is to start at 90% cumulative and adjust based on how the downstream model behaves. Keeping too many components is cheap in terms of accuracy but expensive in terms of compute and interpretability; keeping too few loses signal the model might have used.
PCA has three common failure modes and all of them are about pipeline hygiene rather than mathematics. Unscaled features let the largest-variance column dominate the components — a single raw income feature in dollars will swamp forty Likert-scale features just because of its scale. Fitting PCA on the full dataset before splitting leaks test information into the rotation, which produces an optimistic test score. Over-interpreting loadings attributes causal meaning to mathematical artifacts — a PC2 that happens to contrast old and new construction features is not necessarily an ‘era index’ in any causal sense; it is the second direction of maximum variance, and that might align with era by accident. The fixes are mechanical: standardise inside a Pipeline, fit the pipeline on training only, and use loadings as hypotheses to verify rather than conclusions. PCA is a linear method and will miss non-linear structure — a cloud of points lying on a curved manifold will be poorly summarised by straight principal component axes. The next chapter compares PCA against t-SNE, UMAP, and LDA, which relax the linearity assumption or incorporate label information, and develops the judgement for when each is the right tool.
PCA rotates the coordinate system of the feature space so the new axes — the principal components — point along directions of maximum variance, with PC1 capturing the most, PC2 the most remaining, and so on.
The principal components are mathematically the eigenvectors of the data's covariance matrix; the corresponding eigenvalues measure the variance captured along each direction, and their sum equals the dataset's total variance.
Covariance matrices are always symmetric and positive-semi-definite, which guarantees real perpendicular eigenvectors — this is the structural reason PCA always produces uncorrelated components in a well-defined order.
Choosing k — the number of components to keep — uses a scree plot (elbow rule), a cumulative variance threshold (80% or 95%), or downstream task performance; 80% is aggressive, 95% is conservative, the elbow is usually close to optimal.
Standardise features before PCA (otherwise high-scale features dominate), fit PCA inside a pipeline on training data only (otherwise test information leaks into the rotation), and treat PC loadings as hints about interpretation rather than causal relationships.
Check Your Understanding
Answer these questions about PCA scenarios covered in this chapter.
Two standardised features x1 and x2 have correlation r = 0.85. Their covariance matrix is [[1, 0.85], [0.85, 1]]. What are the eigenvalues, and what fraction of total variance does PC1 capture?
An analyst builds a PCA pipeline on a 10,000-row dataset with 50 features. The scree plot shows cumulative variance of 64% at k=3, 85% at k=7, and 95% at k=15. Which choice of k is most defensible for a downstream linear classifier?
Put the following steps in the correct order for a leakage-free PCA pipeline applied to a train/test split dataset.
- 1. Fit PCA on the scaled training features and transform training, validation, and test features using the fitted rotation.
- 2. Split the dataset into train, validation, and test.
- 3. Fit a StandardScaler on the training features and transform training, validation, and test features using the fitted scaler.
- 4. Fit the downstream classifier on the training PCA scores and evaluate on the test PCA scores.
A data team applies PCA to a 40-feature customer dataset where most features are 1-5 Likert scales but one feature is annual income in raw dollars (values ranging from 20,000 to 200,000). PC1's loadings are 0.99 on income and essentially zero on all other features. They conclude PC1 represents a 'wealth axis' that captures most of the meaningful variation in the dataset.