Dimensionality Chapter 4 of 4 · tap to browse
Other Reduction Techniques
t-SNE, UMAP, LDA — when PCA is not the right tool
A data scientist tries PCA on a catalogue of 10,000 songs described by 128 audio features. The first two principal components show a single elongated blob — no visible genre clusters. They switch to UMAP and the same songs spread into four clearly separated groups corresponding to jazz, pop, classical, and rock. PCA was not wrong; it was finding maximum variance, and the direction of maximum variance ran orthogonal to the genre boundaries. UMAP preserves local neighbourhood structure instead — and that structure is what genre membership actually is.
- 1 Explain the core difference between PCA (linear, variance-maximising, unsupervised) and non-linear manifold methods (t-SNE, UMAP), including why PCA fails on curved data and why non-linear methods can recover structure PCA misses.
- 2 Tune the t-SNE perplexity parameter across its practical range (5 to 50), understanding that very low perplexity over-fragments clusters and very high perplexity merges them — and that inter-cluster distances in a t-SNE plot are generally not reliable regardless of tuning.
- 3 Compare the trade-offs among PCA, t-SNE, UMAP, and LDA — speed versus preservation, linear versus non-linear, unsupervised versus supervised — and select the appropriate method for a given downstream task.
- 4 Describe LDA as a supervised reduction method that maximises the ratio of between-class to within-class variance, explain why it produces better class-separated projections than PCA when labels are available, and recognise its limitations (at most C-1 components for C classes, assumes Gaussian class distributions).
When Linear Is Not Enough
A music platform has a catalogue of 10,000 songs, each described by 128 audio features produced by a neural network. They want a 2D visualisation to check whether the features naturally separate songs by genre. The first attempt uses PCA: the top two principal components are plotted as x and y axes, songs coloured by genre. The result is disappointing — a single elongated cloud with all four genres overlapping. It looks like the features are not good enough.
They switch to UMAP with default parameters. Same songs, same features. Four clearly separated clusters, one per genre, with jazz and classical adjacent and pop and rock adjacent. The features were fine all along. PCA’s linear assumption was the problem: the genre separation lives along a curved manifold in the 128-dimensional space, and no straight line through the origin captures it. UMAP’s non-linear method finds it easily.
This chapter is about when PCA is not enough and what to use instead.
Why PCA can miss structure
PCA is fundamentally a rotation. It finds the best linear subspace of the feature space — the k-dimensional plane that captures the most variance. If the interesting structure in the data is also linear, PCA is essentially optimal. If it is curved, PCA’s projection will average over the curvature and the top components may not reveal it.
A classic failure is the Swiss roll: 3D points arranged on a curled-up 2D sheet. The intrinsic structure is 2D (the uncurled sheet), but the sheet curves through 3D space. PCA on the Swiss roll produces a top-2D projection that flattens the curl by collapsing the two ends of the sheet on top of each other — the unrolled structure is destroyed. A non-linear manifold-learning method can recover the sheet.
Audio features, image embeddings, and word embeddings from neural networks tend to have curved structure for the same reason: the features encode similarity via learned nonlinear transformations. Linear projection often does not respect those similarities.
t-SNE: preserving local neighbourhoods
t-SNE (t-distributed Stochastic Neighbour Embedding) is specifically designed for visualisation. Its optimisation objective: find a 2D layout in which every point’s probability distribution over its nearest neighbours matches the corresponding distribution in the original high-dimensional space. It is explicitly preserving local structure — who is near whom — and not trying to preserve global distances.
The core parameter is perplexity: how many neighbours each point attempts to preserve. It acts like a smoothed version of k in k-nearest-neighbours. At perplexity 5, each point considers about 5 neighbours, which produces tight small clusters. At perplexity 80, each point considers many neighbours, which produces larger merged regions. The practical range is 5 to 50, with a default of 30.
Three properties of t-SNE that you need to know:
- Distances between clusters are not meaningful. t-SNE’s loss does not preserve inter-cluster distances, only local neighbourhoods. Two clusters that are far apart in the original high-dimensional space might end up close in the t-SNE layout, or vice versa. Use t-SNE to answer ‘are there clusters?’, not ‘how are the clusters arranged relative to each other?’
- Results depend on random initialisation. Different random seeds produce different layouts that all have similar clusters but different orientations. Do not build anything downstream that depends on the specific coordinate values.
- t-SNE is slow. On datasets above 10,000 points the runtime becomes inconvenient (tens of minutes to hours). UMAP is usually faster and gives comparable or better visualisations.
UMAP: faster and more global
UMAP (Uniform Manifold Approximation and Projection) was introduced as a near-drop-in replacement for t-SNE that addresses several of its limitations. It is based on a topological theory (simplicial complexes on a fuzzy graph — the mathematical detail is not required to use it) but the practical upshot is:
- Faster. Runtime is roughly O(n log n) instead of O(n²). On 100,000 points, UMAP takes minutes; t-SNE takes hours.
- Preserves more global structure. Inter-cluster distances in a UMAP plot are more trustworthy than in t-SNE — clusters that are genuinely close in the original space tend to be close in the UMAP layout.
- Principled defaults. The main parameters are
n_neighbors(the local neighbourhood size, roughly equivalent to perplexity — default 15) andmin_dist(how tightly points within a cluster pack together — default 0.1). The defaults work well for a wide range of datasets with minimal tuning.
For most current applications where you would have used t-SNE a few years ago, UMAP is now the default choice. The practical workflow is: try UMAP with default parameters. If the visualisation is not informative, increase n_neighbors to 30 or 50 to emphasise global structure, or decrease it to 5 to see more local detail.
t-SNE and UMAP are visualisation tools, not feature-engineering tools. The 2D projection they produce is only meant to be read by eye. Using t-SNE or UMAP output as input features for a downstream classifier is not a standard approach because (1) the transformation is non-invertible and not applicable to new points without retraining, and (2) the optimisation targets visualisation, not downstream task performance. For feature engineering, PCA and variational autoencoders are the appropriate linear and non-linear reductions respectively.
LDA: supervised class separation
All three methods above (PCA, t-SNE, UMAP) are unsupervised — they ignore any labels the data might have. Linear Discriminant Analysis (LDA) does the opposite: it uses labels explicitly to find the directions that best separate classes.
LDA finds directions that maximise the ratio of between-class variance to within-class variance. Along these directions, the class means are as far apart as possible while the within-class spread is as small as possible — exactly what a downstream linear classifier wants. For a C-class problem, LDA produces at most C − 1 components (for 3 classes, at most 2 LDA axes).
LDA’s constraints are:
- Requires labels. Unlike PCA, you cannot apply it without a target variable.
- Linear. It is a linear method and will fail on non-linearly separable data just like PCA.
- Gaussian class assumption. LDA assumes each class has a Gaussian distribution with the same covariance. Violations (skewed classes, very different within-class variances) degrade performance, though often gracefully.
- At most C − 1 components. For binary classification, LDA produces only 1 component. For 10 classes, at most 9. If you need more reduced dimensions than C − 1, use PCA or a non-linear method.
Choosing between methods
| Method | Linearity | Labels | Speed (10k points) | Best for |
|---|---|---|---|---|
| PCA | Linear | Unsupervised | < 1s | Feature engineering, interpretability, downstream ML pipelines |
| t-SNE | Non-linear | Unsupervised | ~1 min | Exploratory visualisation of small-to-medium datasets |
| UMAP | Non-linear | Unsupervised | < 10s | Visualisation, clustering preparation, large datasets |
| LDA | Linear | Supervised | < 1s | Preprocessing for linear classifiers, especially multiclass |
A practical decision tree:
- Do you have labels and need class separation? Use LDA (or try PCA if LDA’s C − 1 component limit is too restrictive).
- Is this for downstream ML (not just visualisation)? Use PCA. It is deterministic, reproducible, invertible, and applies to new data points directly.
- Is this for visualisation of structure in unlabelled data? Use UMAP (default) or t-SNE (older alternative, less preferred for large datasets).
- Is your data large (100k+ rows)? UMAP is essentially the only practical non-linear choice; t-SNE becomes prohibitively slow.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap # pip install umap-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# X: 10,000 × 128 audio feature matrix (already standardised)
# y: 10,000 genre labels (0=jazz, 1=pop, 2=classical, 3=rock)
# PCA — fast, linear, unsupervised
pca_2d = PCA(n_components=2).fit_transform(X)
# t-SNE — slow, non-linear, unsupervised, preserves local neighbourhoods
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
# UMAP — fast, non-linear, unsupervised, preserves local + more global structure
umap_2d = umap.UMAP(n_components=2, n_neighbors=15,
min_dist=0.1, random_state=42).fit_transform(X)
# LDA — fast, linear, SUPERVISED, at most C-1 = 3 components for 4 classes
lda_3d = LinearDiscriminantAnalysis(n_components=3).fit_transform(X, y)Where dimensionality reduction fits overall
After four chapters, the topic’s arc is complete: the curse of dimensionality explained why high-dimensional features are costly, the correlation chapter showed how to identify redundancy, PCA provided the primary tool for replacing correlated features with uncorrelated principal components, and this chapter extended the toolkit for cases where PCA’s linear assumption is too restrictive. In practice, most real dimensionality-reduction work uses PCA (for feature engineering) and UMAP (for visualisation), with LDA for supervised multiclass problems and t-SNE as a legacy tool mostly replaced by UMAP.
The next topic — Data Splits and Evaluation — applies a different discipline to the same concern: instead of reducing features, you measure more honestly. Both are parts of the same underlying goal: build models that generalise, by either simplifying the input space or rigorously testing the output.
When is PCA enough, and when do I need something else?
PCA is the right tool when your data's meaningful structure is linear — if the quantity you care about is approximately a linear function of the features, PCA will capture the directions where that function varies most. It is fast, deterministic, and the resulting components are interpretable as weighted sums of the original features. Switch to non-linear methods when PCA's top components visibly fail to separate the things you care about (classes, clusters), or when the data's intrinsic structure is obviously curved. The rule is: try PCA first because it is cheap and interpretable; move to t-SNE/UMAP only if PCA does not show you what you need.
What does t-SNE's perplexity parameter actually do?
Perplexity controls how many neighbours each point tries to preserve in the low-dimensional layout. Roughly, it is a smoothed version of 'k' in k-nearest-neighbours: the algorithm computes a per-point similarity distribution over k ≈ perplexity nearest neighbours in the original space, then positions points in 2D so that their low-dimensional similarity distributions match. Small perplexity (5-10) forces t-SNE to preserve only very local structure — clusters fragment into tight sub-clusters. Large perplexity (40+) flattens the distinction between clusters — you get one big blob. The recommended default is 30; tune between 5 and 50 and choose the value that gives the clearest visualisation of the structure you know is there.
Why are distances between clusters in a t-SNE plot unreliable?
t-SNE explicitly preserves local neighbourhoods but has no objective that tries to preserve global distances. Two clusters that are far apart in the high-dimensional space might end up close in a t-SNE layout, or vice versa — the algorithm is indifferent to inter-cluster distances. This is why t-SNE is best used as a visualisation tool for 'are there clusters?' rather than 'how are the clusters arranged?' UMAP, by contrast, explicitly preserves more global structure, so inter-cluster distances in a UMAP plot are more trustworthy — still not exact but usually more meaningful than t-SNE's.
A research team has a dataset of 50,000 cell images described by 512-dimensional features produced by a vision neural network. They want to visualise whether cells of different types naturally cluster in the feature space. They are choosing between PCA, t-SNE, UMAP, and LDA. Before picking a method: - If the team does NOT have cell-type labels, which of the four methods are not viable and why? Of the remaining methods, which would give the most informative visualisation and which would run the fastest on 50,000 points? - The cells cluster in the feature space but the boundaries between clusters are non-linear (clusters curve around each other). Which method would best preserve this structure — and why would PCA struggle even if the team had plenty of compute? - If the team DOES have cell-type labels for a subset of the data and their actual downstream goal is classification (not just visualisation), what is the most defensible method choice, and how does the number of cell types affect that choice?
Choosing the Right Method
PCA’s linearity is a feature in most contexts and a bug in a few. As a preprocessing step for downstream machine learning, PCA is almost always the right starting point: it is fast, deterministic, reproducible, interpretable through the component loadings, invertible, and directly applicable to new data points without retraining. The failure modes are specific. When the meaningful structure in the data is curved — as it commonly is for features produced by neural networks — PCA’s linear subspace can cut through the curvature in ways that destroy the structure. When labels are available and the downstream goal is class separation rather than variance retention, PCA’s unsupervised optimisation can miss the directions along which the classes actually differ because those directions do not carry the most variance. Both failures are diagnosable: plot the top two principal components, colour by the downstream signal (class, cluster, target), and check whether the signal is visible in the projection. If it is, PCA is enough. If it is not, you need something else.
t-SNE and UMAP are both non-linear manifold-learning methods designed for visualisation. t-SNE preserves local neighbourhoods — each point’s distribution over its nearest neighbours in the original space is matched as closely as possible in the 2D layout. Its key operational quirks: inter-cluster distances are not meaningful (so do not read a t-SNE plot as ‘cluster A is far from cluster B’ — t-SNE makes no such claim), results depend on random initialisation (different seeds give different-looking layouts), and the O(n²) runtime is prohibitive past a few tens of thousands of points. UMAP addresses each of these limitations: it preserves more global structure, runs in O(n log n), and has principled defaults for its main parameters (n_neighbors and min_dist). For most current use cases UMAP has simply replaced t-SNE. The exception is when reproducing older literature that used t-SNE specifically — then using t-SNE preserves comparability.
LDA sits in a different corner of the matrix from the others: linear, but supervised. It finds the directions that maximise the ratio of between-class variance to within-class variance, so the class means are as far apart as possible relative to the within-class spread along each direction. For a linear classifier downstream this is often the best preprocessing you can do — the projection is explicitly optimised for class distinction rather than for variance retention. The constraints are that LDA produces at most C − 1 components for C classes (so a 3-class problem gives at most 2 LDA axes, a binary problem just 1), that it is linear (so the same curvature limitation as PCA applies), and that it formally assumes Gaussian class distributions with equal covariance (so performance degrades gracefully on skewed classes but the optimisation is only exactly right on Gaussians).
The practical decision tree is short. If you have labels and your downstream task is classification, try LDA first — it is fast, uses the supervision signal, and often gives a strong baseline. If you have labels but want more components than C − 1, or if your data is non-linearly separable, PCA or a non-linear method is better. If you have unlabelled data and the goal is feature engineering for any downstream model, PCA is the safe default. If you have unlabelled data and the goal is visualisation, UMAP is the default, with t-SNE as an alternative for smaller datasets or literature comparability. The four-chapter arc of this topic closes here: the curse of dimensionality explained why high-dimensional features are costly, the correlation chapter identified redundancy, PCA provided the primary compression tool, and this chapter extended the toolkit for cases where PCA is not enough. The next topic — Data Splits and Evaluation — pivots from reducing the feature space to measuring what the model actually learned; both serve the same goal of building models that generalise.
PCA is linear and unsupervised — it maximises variance and works well when the meaningful structure in the data is approximately linear, but fails when the structure is curved (a Swiss roll, a learned embedding space) and misses class separation when variance and class labels are not aligned.
t-SNE preserves local neighbourhoods — who is near whom — at the cost of unreliable global distances and O(n²) runtime; it is a visualisation tool for exploring whether clusters exist in moderate-sized datasets, not a feature-engineering method.
UMAP is faster than t-SNE (O(n log n)), preserves more global structure so inter-cluster distances are more trustworthy, and has principled defaults that usually work without tuning — it has largely replaced t-SNE as the default non-linear visualiser for datasets above a few thousand points.
LDA is supervised: it uses class labels to find the direction that maximises the ratio of between-class variance to within-class variance, producing C − 1 components for C classes and typically outperforming PCA as preprocessing for linear classifiers when labels are available.
Practical decision tree: labels + class separation → LDA; downstream ML pipeline → PCA; visualisation of unlabelled structure → UMAP; very large datasets or comparing against older work → UMAP over t-SNE on speed alone.
Check Your Understanding
Answer these questions about the reduction-method scenarios covered in this chapter.
A research team visualises cell embeddings using PCA and sees a single elongated cloud with no visible clusters. They try t-SNE on the same embeddings and see four clean well-separated clusters. What best explains the difference?
A data scientist runs t-SNE with perplexity=5 on a 2000-point dataset. The resulting layout shows 30+ small tight clusters where domain knowledge says there should be about 6 groups. What is the most likely issue?
Match these four reduction scenarios to the most appropriate method. Put the scenarios in order from the one best served by LDA to the one best served by UMAP.
- 1.Unlabelled exploration of 100,000 customer profiles for natural clustering, with a visualisation as deliverable.
- 2.Preprocessing a 20,000-row customer dataset with known churn labels for a downstream linear classifier targeting binary churn prediction.
- 3.Preprocessing a 100,000-row image dataset with a known class label across 8 classes for a linear classifier.
- 4.Interactive exploration of a 3000-point gene expression dataset without labels, looking for subtypes.
A team applying LDA to a 10-class classification problem should expect LDA to reduce the data to up to 9 dimensions, regardless of how many features the original dataset has, and the reduced representation should typically work better for a downstream linear classifier than a PCA reduction to the same number of dimensions.