What is LDA and when should I use it?

Linear Discriminant Analysis is a supervised reduction method. Unlike PCA, it uses the class labels: it finds the direction along which the class means are most separated relative to the within-class variance. For a C-class problem it produces at most C − 1 components (a 3-class problem produces at most 2 LDA axes), so it is not appropriate when you need more reduced dimensions than classes. Use LDA when you have labels and your goal is class separation rather than faithful representation of the data. It often outperforms PCA as a preprocessing step for linear classifiers because it is explicitly optimised for class distinction, but the assumption of Gaussian class distributions means it can fail on non-Gaussian or non-linearly separable data.

Why is UMAP so much faster than t-SNE on large datasets?

t-SNE's core optimisation is O(n²) in the number of points because every pair of points contributes to the loss. UMAP uses a sparse local neighbourhood graph and a more efficient optimisation, reducing the dependence on n to something closer to O(n log n). On 100,000 points t-SNE takes hours; UMAP takes minutes. UMAP also has principled defaults for almost every parameter, whereas t-SNE requires tuning perplexity for each new dataset. For most current applications UMAP has replaced t-SNE as the default non-linear visualisation method — the speed advantage is decisive and the visualisations are usually comparable or better.

Dimensionality Chapter 4 of 4 · tap to browse

01 Curse of Dimensionality 02 Correlation & Redundancy 03 PCA 04 Other Methods

Other Reduction Techniques

Q: When is PCA enough, and when do I need something else?

PCA is the right tool when your data's meaningful structure is linear — if the quantity you care about is approximately a linear function of the features, PCA will capture the directions where that function varies most. It is fast, deterministic, and the resulting components are interpretable as weighted sums of the original features. Switch to non-linear methods when PCA's top components visibly fail to separate the things you care about (classes, clusters), or when the data's intrinsic structure is obviously curved. The rule is: try PCA first because it is cheap and interpretable; move to t-SNE/UMAP only if PCA does not show you what you need.

Q: What does t-SNE's perplexity parameter actually do?

Perplexity controls how many neighbours each point tries to preserve in the low-dimensional layout. Roughly, it is a smoothed version of 'k' in k-nearest-neighbours: the algorithm computes a per-point similarity distribution over k ≈ perplexity nearest neighbours in the original space, then positions points in 2D so that their low-dimensional similarity distributions match. Small perplexity (5-10) forces t-SNE to preserve only very local structure — clusters fragment into tight sub-clusters. Large perplexity (40+) flattens the distinction between clusters — you get one big blob. The recommended default is 30; tune between 5 and 50 and choose the value that gives the clearest visualisation of the structure you know is there.

Q: Why are distances between clusters in a t-SNE plot unreliable?

t-SNE explicitly preserves local neighbourhoods but has no objective that tries to preserve global distances. Two clusters that are far apart in the high-dimensional space might end up close in a t-SNE layout, or vice versa — the algorithm is indifferent to inter-cluster distances. This is why t-SNE is best used as a visualisation tool for 'are there clusters?' rather than 'how are the clusters arranged?' UMAP, by contrast, explicitly preserves more global structure, so inter-cluster distances in a UMAP plot are more trustworthy — still not exact but usually more meaningful than t-SNE's.

t-SNE, UMAP, LDA — when PCA is not the right tool

A data scientist tries PCA on a catalogue of 10,000 songs described by 128 audio features. The first two principal components show a single elongated blob — no visible genre clusters. They switch to UMAP and the same songs spread into four clearly separated groups corresponding to jazz, pop, classical, and rock. PCA was not wrong; it was finding maximum variance, and the direction of maximum variance ran orthogonal to the genre boundaries. UMAP preserves local neighbourhood structure instead — and that structure is what genre membership actually is.

Learning Objectives

1 Explain the core difference between PCA (linear, variance-maximising, unsupervised) and non-linear manifold methods (t-SNE, UMAP), including why PCA fails on curved data and why non-linear methods can recover structure PCA misses.
2 Tune the t-SNE perplexity parameter across its practical range (5 to 50), understanding that very low perplexity over-fragments clusters and very high perplexity merges them — and that inter-cluster distances in a t-SNE plot are generally not reliable regardless of tuning.
3 Compare the trade-offs among PCA, t-SNE, UMAP, and LDA — speed versus preservation, linear versus non-linear, unsupervised versus supervised — and select the appropriate method for a given downstream task.
4 Describe LDA as a supervised reduction method that maximises the ratio of between-class to within-class variance, explain why it produces better class-separated projections than PCA when labels are available, and recognise its limitations (at most C-1 components for C classes, assumes Gaussian class distributions).

¶ Narrative

When Linear Is Not Enough

A music platform has a catalogue of 10,000 songs, each described by 128 audio features produced by a neural network. They want a 2D visualisation to check whether the features naturally separate songs by genre. The first attempt uses PCA: the top two principal components are plotted as x and y axes, songs coloured by genre. The result is disappointing — a single elongated cloud with all four genres overlapping. It looks like the features are not good enough.

They switch to UMAP with default parameters. Same songs, same features. Four clearly separated clusters, one per genre, with jazz and classical adjacent and pop and rock adjacent. The features were fine all along. PCA’s linear assumption was the problem: the genre separation lives along a curved manifold in the 128-dimensional space, and no straight line through the origin captures it. UMAP’s non-linear method finds it easily.

This chapter is about when PCA is not enough and what to use instead.

Why PCA can miss structure

PCA is fundamentally a rotation. It finds the best linear subspace of the feature space — the k-dimensional plane that captures the most variance. If the interesting structure in the data is also linear, PCA is essentially optimal. If it is curved, PCA’s projection will average over the curvature and the top components may not reveal it.

A classic failure is the Swiss roll: 3D points arranged on a curled-up 2D sheet. The intrinsic structure is 2D (the uncurled sheet), but the sheet curves through 3D space. PCA on the Swiss roll produces a top-2D projection that flattens the curl by collapsing the two ends of the sheet on top of each other — the unrolled structure is destroyed. A non-linear manifold-learning method can recover the sheet.

Audio features, image embeddings, and word embeddings from neural networks tend to have curved structure for the same reason: the features encode similarity via learned nonlinear transformations. Linear projection often does not respect those similarities.

The same 120-song catalogue visualised by three methods. PCA (left) captures maximum variance but the genre clusters overlap — the variance is driven by within-genre spread rather than between-genre separation. t-SNE (middle) preserves local neighbourhoods, producing four tight clusters with minimal inter-cluster structure. UMAP (right) preserves local neighbourhoods AND some global structure — clusters appear at positions that reflect their audio-feature similarity: jazz and classical near, pop and rock near.

t-SNE: preserving local neighbourhoods

t-SNE (t-distributed Stochastic Neighbour Embedding) is specifically designed for visualisation. Its optimisation objective: find a 2D layout in which every point’s probability distribution over its nearest neighbours matches the corresponding distribution in the original high-dimensional space. It is explicitly preserving local structure — who is near whom — and not trying to preserve global distances.

The core parameter is perplexity: how many neighbours each point attempts to preserve. It acts like a smoothed version of k in k-nearest-neighbours. At perplexity 5, each point considers about 5 neighbours, which produces tight small clusters. At perplexity 80, each point considers many neighbours, which produces larger merged regions. The practical range is 5 to 50, with a default of 30.

t-SNE on the same 120-song dataset at three perplexity values. At perplexity = 5 each genre fragments into sub-clusters because the algorithm is only preserving the 5 nearest neighbours — it sees within-genre variation as meaningful structure. At perplexity = 30 (the default, highlighted) the four genre clusters appear cleanly. At perplexity = 80 the layout over-smooths — classes start to merge into a single blob.

Three properties of t-SNE that you need to know:

Distances between clusters are not meaningful. t-SNE’s loss does not preserve inter-cluster distances, only local neighbourhoods. Two clusters that are far apart in the original high-dimensional space might end up close in the t-SNE layout, or vice versa. Use t-SNE to answer ‘are there clusters?’, not ‘how are the clusters arranged relative to each other?’
Results depend on random initialisation. Different random seeds produce different layouts that all have similar clusters but different orientations. Do not build anything downstream that depends on the specific coordinate values.
t-SNE is slow. On datasets above 10,000 points the runtime becomes inconvenient (tens of minutes to hours). UMAP is usually faster and gives comparable or better visualisations.

UMAP: faster and more global

UMAP (Uniform Manifold Approximation and Projection) was introduced as a near-drop-in replacement for t-SNE that addresses several of its limitations. It is based on a topological theory (simplicial complexes on a fuzzy graph — the mathematical detail is not required to use it) but the practical upshot is:

Faster. Runtime is roughly O(n log n) instead of O(n²). On 100,000 points, UMAP takes minutes; t-SNE takes hours.
Preserves more global structure. Inter-cluster distances in a UMAP plot are more trustworthy than in t-SNE — clusters that are genuinely close in the original space tend to be close in the UMAP layout.
Principled defaults. The main parameters are n_neighbors (the local neighbourhood size, roughly equivalent to perplexity — default 15) and min_dist (how tightly points within a cluster pack together — default 0.1). The defaults work well for a wide range of datasets with minimal tuning.

For most current applications where you would have used t-SNE a few years ago, UMAP is now the default choice. The practical workflow is: try UMAP with default parameters. If the visualisation is not informative, increase n_neighbors to 30 or 50 to emphasise global structure, or decrease it to 5 to see more local detail.

💡 Insight

t-SNE and UMAP are visualisation tools, not feature-engineering tools. The 2D projection they produce is only meant to be read by eye. Using t-SNE or UMAP output as input features for a downstream classifier is not a standard approach because (1) the transformation is non-invertible and not applicable to new points without retraining, and (2) the optimisation targets visualisation, not downstream task performance. For feature engineering, PCA and variational autoencoders are the appropriate linear and non-linear reductions respectively.

LDA: supervised class separation

All three methods above (PCA, t-SNE, UMAP) are unsupervised — they ignore any labels the data might have. Linear Discriminant Analysis (LDA) does the opposite: it uses labels explicitly to find the directions that best separate classes.

LDA finds directions that maximise the ratio of between-class variance to within-class variance. Along these directions, the class means are as far apart as possible while the within-class spread is as small as possible — exactly what a downstream linear classifier wants. For a C-class problem, LDA produces at most C − 1 components (for 3 classes, at most 2 LDA axes).

Three classes projected by PCA (left, unsupervised) vs LDA (right, supervised). PCA selects the direction of maximum variance, which in this dataset is dominated by within-class spread — the three classes overlap heavily after projection. LDA selects the direction that maximises between-class separation — the three classes form clean distinct bands. When labels are available and class separation is the downstream task, LDA typically outperforms PCA as a preprocessing step.

LDA’s constraints are:

Requires labels. Unlike PCA, you cannot apply it without a target variable.
Linear. It is a linear method and will fail on non-linearly separable data just like PCA.
Gaussian class assumption. LDA assumes each class has a Gaussian distribution with the same covariance. Violations (skewed classes, very different within-class variances) degrade performance, though often gracefully.
At most C − 1 components. For binary classification, LDA produces only 1 component. For 10 classes, at most 9. If you need more reduced dimensions than C − 1, use PCA or a non-linear method.

Choosing between methods

Method	Linearity	Labels	Speed (10k points)	Best for
PCA	Linear	Unsupervised	< 1s	Feature engineering, interpretability, downstream ML pipelines
t-SNE	Non-linear	Unsupervised	~1 min	Exploratory visualisation of small-to-medium datasets
UMAP	Non-linear	Unsupervised	< 10s	Visualisation, clustering preparation, large datasets
LDA	Linear	Supervised	< 1s	Preprocessing for linear classifiers, especially multiclass

A practical decision tree:

Do you have labels and need class separation? Use LDA (or try PCA if LDA’s C − 1 component limit is too restrictive).
Is this for downstream ML (not just visualisation)? Use PCA. It is deterministic, reproducible, invertible, and applies to new data points directly.
Is this for visualisation of structure in unlabelled data? Use UMAP (default) or t-SNE (older alternative, less preferred for large datasets).
Is your data large (100k+ rows)? UMAP is essentially the only practical non-linear choice; t-SNE becomes prohibitively slow.

python

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # pip install umap-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: 10,000 × 128 audio feature matrix (already standardised)
# y: 10,000 genre labels (0=jazz, 1=pop, 2=classical, 3=rock)

# PCA — fast, linear, unsupervised
pca_2d = PCA(n_components=2).fit_transform(X)

# t-SNE — slow, non-linear, unsupervised, preserves local neighbourhoods
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# UMAP — fast, non-linear, unsupervised, preserves local + more global structure
umap_2d = umap.UMAP(n_components=2, n_neighbors=15,
                   min_dist=0.1, random_state=42).fit_transform(X)

# LDA — fast, linear, SUPERVISED, at most C-1 = 3 components for 4 classes
lda_3d = LinearDiscriminantAnalysis(n_components=3).fit_transform(X, y)

Where dimensionality reduction fits overall

After four chapters, the topic’s arc is complete: the curse of dimensionality explained why high-dimensional features are costly, the correlation chapter showed how to identify redundancy, PCA provided the primary tool for replacing correlated features with uncorrelated principal components, and this chapter extended the toolkit for cases where PCA’s linear assumption is too restrictive. In practice, most real dimensionality-reduction work uses PCA (for feature engineering) and UMAP (for visualisation), with LDA for supervised multiclass problems and t-SNE as a legacy tool mostly replaced by UMAP.

The next topic — Data Splits and Evaluation — applies a different discipline to the same concern: instead of reducing features, you measure more honestly. Both are parts of the same underlying goal: build models that generalise, by either simplifying the input space or rigorously testing the output.

In this section

When is PCA enough, and when do I need something else?

PCA is the right tool when your data's meaningful structure is linear — if the quantity you care about is approximately a linear function of the features, PCA will capture the directions where that function varies most. It is fast, deterministic, and the resulting components are interpretable as weighted sums of the original features. Switch to non-linear methods when PCA's top components visibly fail to separate the things you care about (classes, clusters), or when the data's intrinsic structure is obviously curved. The rule is: try PCA first because it is cheap and interpretable; move to t-SNE/UMAP only if PCA does not show you what you need.

What does t-SNE's perplexity parameter actually do?

Perplexity controls how many neighbours each point tries to preserve in the low-dimensional layout. Roughly, it is a smoothed version of 'k' in k-nearest-neighbours: the algorithm computes a per-point similarity distribution over k ≈ perplexity nearest neighbours in the original space, then positions points in 2D so that their low-dimensional similarity distributions match. Small perplexity (5-10) forces t-SNE to preserve only very local structure — clusters fragment into tight sub-clusters. Large perplexity (40+) flattens the distinction between clusters — you get one big blob. The recommended default is 30; tune between 5 and 50 and choose the value that gives the clearest visualisation of the structure you know is there.

Why are distances between clusters in a t-SNE plot unreliable?

t-SNE explicitly preserves local neighbourhoods but has no objective that tries to preserve global distances. Two clusters that are far apart in the high-dimensional space might end up close in a t-SNE layout, or vice versa — the algorithm is indifferent to inter-cluster distances. This is why t-SNE is best used as a visualisation tool for 'are there clusters?' rather than 'how are the clusters arranged?' UMAP, by contrast, explicitly preserves more global structure, so inter-cluster distances in a UMAP plot are more trustworthy — still not exact but usually more meaningful than t-SNE's.

◎ Intuition

A research team has a dataset of 50,000 cell images described by 512-dimensional features produced by a vision neural network. They want to visualise whether cells of different types naturally cluster in the feature space. They are choosing between PCA, t-SNE, UMAP, and LDA. Before picking a method: - If the team does NOT have cell-type labels, which of the four methods are not viable and why? Of the remaining methods, which would give the most informative visualisation and which would run the fastest on 50,000 points? - The cells cluster in the feature space but the boundaries between clusters are non-linear (clusters curve around each other). Which method would best preserve this structure — and why would PCA struggle even if the team had plenty of compute? - If the team DOES have cell-type labels for a subset of the data and their actual downstream goal is classification (not just visualisation), what is the most defensible method choice, and how does the number of cell types affect that choice?

↺ Reflection

Choosing the Right Method

PCA’s linearity is a feature in most contexts and a bug in a few. As a preprocessing step for downstream machine learning, PCA is almost always the right starting point: it is fast, deterministic, reproducible, interpretable through the component loadings, invertible, and directly applicable to new data points without retraining. The failure modes are specific. When the meaningful structure in the data is curved — as it commonly is for features produced by neural networks — PCA’s linear subspace can cut through the curvature in ways that destroy the structure. When labels are available and the downstream goal is class separation rather than variance retention, PCA’s unsupervised optimisation can miss the directions along which the classes actually differ because those directions do not carry the most variance. Both failures are diagnosable: plot the top two principal components, colour by the downstream signal (class, cluster, target), and check whether the signal is visible in the projection. If it is, PCA is enough. If it is not, you need something else.

t-SNE and UMAP are both non-linear manifold-learning methods designed for visualisation. t-SNE preserves local neighbourhoods — each point’s distribution over its nearest neighbours in the original space is matched as closely as possible in the 2D layout. Its key operational quirks: inter-cluster distances are not meaningful (so do not read a t-SNE plot as ‘cluster A is far from cluster B’ — t-SNE makes no such claim), results depend on random initialisation (different seeds give different-looking layouts), and the O(n²) runtime is prohibitive past a few tens of thousands of points. UMAP addresses each of these limitations: it preserves more global structure, runs in O(n log n), and has principled defaults for its main parameters (n_neighbors and min_dist). For most current use cases UMAP has simply replaced t-SNE. The exception is when reproducing older literature that used t-SNE specifically — then using t-SNE preserves comparability.

LDA sits in a different corner of the matrix from the others: linear, but supervised. It finds the directions that maximise the ratio of between-class variance to within-class variance, so the class means are as far apart as possible relative to the within-class spread along each direction. For a linear classifier downstream this is often the best preprocessing you can do — the projection is explicitly optimised for class distinction rather than for variance retention. The constraints are that LDA produces at most C − 1 components for C classes (so a 3-class problem gives at most 2 LDA axes, a binary problem just 1), that it is linear (so the same curvature limitation as PCA applies), and that it formally assumes Gaussian class distributions with equal covariance (so performance degrades gracefully on skewed classes but the optimisation is only exactly right on Gaussians).

The practical decision tree is short. If you have labels and your downstream task is classification, try LDA first — it is fast, uses the supervision signal, and often gives a strong baseline. If you have labels but want more components than C − 1, or if your data is non-linearly separable, PCA or a non-linear method is better. If you have unlabelled data and the goal is feature engineering for any downstream model, PCA is the safe default. If you have unlabelled data and the goal is visualisation, UMAP is the default, with t-SNE as an alternative for smaller datasets or literature comparability. The four-chapter arc of this topic closes here: the curse of dimensionality explained why high-dimensional features are costly, the correlation chapter identified redundancy, PCA provided the primary compression tool, and this chapter extended the toolkit for cases where PCA is not enough. The next topic — Data Splits and Evaluation — pivots from reducing the feature space to measuring what the model actually learned; both serve the same goal of building models that generalise.

Key Points

PCA is linear and unsupervised — it maximises variance and works well when the meaningful structure in the data is approximately linear, but fails when the structure is curved (a Swiss roll, a learned embedding space) and misses class separation when variance and class labels are not aligned.

t-SNE preserves local neighbourhoods — who is near whom — at the cost of unreliable global distances and O(n²) runtime; it is a visualisation tool for exploring whether clusters exist in moderate-sized datasets, not a feature-engineering method.

UMAP is faster than t-SNE (O(n log n)), preserves more global structure so inter-cluster distances are more trustworthy, and has principled defaults that usually work without tuning — it has largely replaced t-SNE as the default non-linear visualiser for datasets above a few thousand points.

LDA is supervised: it uses class labels to find the direction that maximises the ratio of between-class variance to within-class variance, producing C − 1 components for C classes and typically outperforming PCA as preprocessing for linear classifiers when labels are available.

Practical decision tree: labels + class separation → LDA; downstream ML pipeline → PCA; visualisation of unlabelled structure → UMAP; very large datasets or comparing against older work → UMAP over t-SNE on speed alone.

✓ Checkpoint

Check Your Understanding

Answer these questions about the reduction-method scenarios covered in this chapter.

A research team visualises cell embeddings using PCA and sees a single elongated cloud with no visible clusters. They try t-SNE on the same embeddings and see four clean well-separated clusters. What best explains the difference?

A data scientist runs t-SNE with perplexity=5 on a 2000-point dataset. The resulting layout shows 30+ small tight clusters where domain knowledge says there should be about 6 groups. What is the most likely issue?

Match these four reduction scenarios to the most appropriate method. Put the scenarios in order from the one best served by LDA to the one best served by UMAP.

1.Unlabelled exploration of 100,000 customer profiles for natural clustering, with a visualisation as deliverable.
2.Preprocessing a 20,000-row customer dataset with known churn labels for a downstream linear classifier targeting binary churn prediction.
3.Preprocessing a 100,000-row image dataset with a known class label across 8 classes for a linear classifier.
4.Interactive exploration of a 3000-point gene expression dataset without labels, looking for subtypes.

A team applying LDA to a 10-class classification problem should expect LDA to reduce the data to up to 9 dimensions, regardless of how many features the original dataset has, and the reduced representation should typically work better for a downstream linear classifier than a PCA reduction to the same number of dimensions.