Dimensionality
Adding more features to a dataset seems like it can only help: more information, more signal. In practice, high-dimensional spaces behave strangely: distances lose meaning, data becomes sparse, and models overfit far more easily. This topic develops intuition for the curse of dimensionality, explains why correlated features are redundant, and covers dimensionality reduction techniques including PCA, which uses the eigenvectors you studied in the matrices topic to find the directions of maximum variance.
- Features & Representations (Required): Dimensionality reduction operates on feature vectors, so you need to understand feature spaces first
- Matrices and Transformations (Required): PCA decomposes the covariance matrix using eigenvectors, and the matrix topic builds the required intuition
- Derivatives and Gradients (Required): Variance maximisation in PCA involves optimising an objective, and derivatives provide the necessary calculus background
The Curse of Dimensionality
Adding more features feels like adding information. In high dimensions the effect is often the opposite: a hypercube's volume concentrates in its corners, data becomes so sparse that nearly all points look equidistant, and nearest-neighbour queries stop being meaningful. This chapter develops geometric intuition for four mechanisms of the curse (volume collapse, shell concentration, distance concentration, and exponential sparsity) and explains which algorithms break first.
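Two of these mechanisms can be checked numerically. The sketch below is illustrative only: the function names `distance_contrast` and `ball_fraction`, the uniform unit-hypercube data, and the sample size are assumptions made for this example, not part of the chapter. It measures how the gap between the nearest and farthest neighbour shrinks relative to the nearest distance, and how little of the cube the inscribed ball occupies as the dimension grows.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Relative contrast (d_max - d_min) / d_min between one query point
    and n_points uniform random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


def ball_fraction(dim):
    """Fraction of the unit hypercube's volume taken up by the inscribed
    ball of radius 0.5. Computed in log space so large dims do not overflow."""
    log_volume = (
        (dim / 2) * math.log(math.pi)
        - math.lgamma(dim / 2 + 1)
        + dim * math.log(0.5)
    )
    return math.exp(log_volume)


# Hypothetical sweep over a few dimensions; the exact numbers depend on the seed.
for dim in (2, 10, 100, 1000):
    print(dim, distance_contrast(dim), ball_fraction(dim))
```

As the dimension increases, the contrast should fall towards zero (nearest and farthest neighbours become hard to tell apart) and the ball fraction should fall towards zero (almost all of the cube's volume sits outside the ball, in the corners).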
Correlation and Redundancy
When two features correlate strongly, each one's variance is mostly explained by the other. A model that receives both effectively sees one feature twice, which wastes model capacity, slows training, and makes linear coefficients numerically unstable. This chapter explains Pearson correlation (whose square gives the fraction of variance two features share), introduces the correlation matrix as a diagnostic tool for spotting redundant feature clusters, and examines multicollinearity, the phenomenon that causes coefficients to swing wildly when correlated features are included together.
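A small NumPy experiment makes the instability concrete. This is a sketch under stated assumptions: the synthetic features `x1`, `x2`, `x3`, the noise levels, and the helper `coef_spread` are all invented for illustration. It builds one near-duplicate feature, inspects the correlation matrix, and then refits a least-squares model many times with fresh noise to see how much the coefficients move.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)              # unrelated feature
X = np.column_stack([x1, x2, x3])

# Pearson correlation matrix; columns are features, so rowvar=False.
corr = np.corrcoef(X, rowvar=False)


def coef_spread(features, n_trials=200):
    """Standard deviation of each least-squares coefficient across refits,
    where only the noise in the target changes between refits."""
    signal = 3.0 * x1 + x3  # assumed true relationship for this demo
    coefs = []
    for _ in range(n_trials):
        y = signal + rng.normal(size=n)
        beta, *_ = np.linalg.lstsq(features, y, rcond=None)
        coefs.append(beta)
    return np.std(coefs, axis=0)


spread_with = coef_spread(X)                # x1, x2, x3 all included
spread_without = coef_spread(X[:, [0, 2]])  # redundant x2 dropped

print("corr(x1, x2):", corr[0, 1])
print("coefficient spread with x2:   ", spread_with)
print("coefficient spread without x2:", spread_without)
```

The entry `corr[0, 1]` should sit very close to 1, flagging the redundant pair, and the coefficient on `x1` should vary far more across refits when `x2` is included than when it is dropped.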
Principal Component Analysis
PCA takes a dataset of potentially correlated features, centres it, and rotates the coordinate system so that the new axes (the principal components) point along the directions of greatest variance. Each principal component is an eigenvector of the data's covariance matrix, and its eigenvalue measures how much of the variance lies along that direction. This chapter develops PCA geometrically (the rotation view), algebraically (the covariance eigendecomposition), and operationally (scree plots, variance-retention thresholds, fit-on-train-only discipline).
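The algebraic and operational views fit in a few lines of NumPy. The following is a minimal sketch, not a reference implementation: the helper names `fit_pca` and `transform`, the 0.95 retention threshold, and the synthetic data are assumptions chosen for the example. The mean and the components are estimated on the training split only and then reused unchanged on the test split.

```python
import numpy as np


def fit_pca(X_train, variance_to_keep=0.95):
    """Fit PCA by eigendecomposition of the training covariance matrix.
    Returns the training mean, the kept components (as columns), and the
    explained-variance ratio of every component."""
    mean = X_train.mean(axis=0)
    centered = X_train - mean
    cov = np.cov(centered, rowvar=False)

    # eigh is for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = eigvals / eigvals.sum()
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(eigvals))
    return mean, eigvecs[:, :k], explained


def transform(X, mean, components):
    """Project data onto the kept components using the training mean."""
    return (X - mean) @ components


# Synthetic data: two features driven by one latent variable, plus small noise.
rng = np.random.default_rng(0)
z = rng.normal(size=300)
X = np.column_stack([z, 2.0 * z, np.zeros(300)]) + 0.1 * rng.normal(size=(300, 3))
X_train, X_test = X[:200], X[200:]

mean, components, explained = fit_pca(X_train)  # fit on the training split only
Z_test = transform(X_test, mean, components)    # reuse training statistics

print("explained variance ratios:", explained)
print("components kept:", components.shape[1])
```

The `explained` array is what a scree plot displays. Because the first two features are nearly proportional, a single component should carry almost all of the variance here.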
Other Reduction Techniques
PCA is linear and unsupervised. When data lies on a curved manifold that no linear projection can unroll (a spiral dataset is a classic failure case), or when labels are available and separation between classes matters more than capturing global variance, different tools are needed. This chapter compares t-SNE (preserves local neighbourhoods, well suited to visualisation), UMAP (typically faster, and often preserves more global structure than t-SNE), and LDA (uses labels to find the directions that best separate the classes). The goal is to develop judgement about when each is appropriate.
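The difference between a variance-seeking and a separation-seeking projection can be seen on a toy problem. The sketch below is an assumption-laden illustration: it uses the two-class Fisher discriminant as a stand-in for LDA, synthetic Gaussian classes, and a hypothetical `separation` score. The classes differ only along a low-variance axis, so the top PCA direction ignores the difference while the Fisher direction finds it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Both classes spread widely along x (std 5) but differ only along y (means -1 and +1).
class0 = rng.normal(loc=[0.0, -1.0], scale=[5.0, 0.5], size=(n, 2))
class1 = rng.normal(loc=[0.0, 1.0], scale=[5.0, 0.5], size=(n, 2))
X = np.vstack([class0, class1])

# PCA direction: top eigenvector of the pooled covariance; labels are not used.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = eigvecs[:, -1]  # eigh sorts ascending, so the last column is the largest

# Two-class Fisher discriminant: solve S_w w = (m1 - m0), then normalise.
within = np.cov(class0, rowvar=False) + np.cov(class1, rowvar=False)
lda_dir = np.linalg.solve(within, class1.mean(axis=0) - class0.mean(axis=0))
lda_dir /= np.linalg.norm(lda_dir)


def separation(direction):
    """Gap between projected class means, in units of pooled standard deviation."""
    p0, p1 = class0 @ direction, class1 @ direction
    pooled_std = np.sqrt(0.5 * (p0.var(ddof=1) + p1.var(ddof=1)))
    return abs(p1.mean() - p0.mean()) / pooled_std


print("PCA direction:", pca_dir, "separation:", separation(pca_dir))
print("LDA direction:", lda_dir, "separation:", separation(lda_dir))
```

The PCA direction should align with the x axis, where the variance is, and the Fisher direction with the y axis, where the classes actually differ. t-SNE and UMAP are not shown because they depend on third-party libraries and are mainly judged visually.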