If high dimensions are so bad, why do neural networks work?

Because neural networks do not compute nearest neighbours in the raw feature space. They learn a mapping from the high-dimensional input to a lower-dimensional representation where the things that matter (the classes, the clusters, the generative structure) are well separated. The 'effective dimensionality' of the learned representation is usually much smaller than the input. The curse is about raw data geometry; deep models work by escaping it through learned feature transformations.

Does choosing a different distance metric help?

Partially. Manhattan (L1) distance is slightly less affected by the curse than Euclidean (L2) because summing absolute differences concentrates less than summing squared ones. Fractional Lp distances (e.g. L0.5) can preserve discriminability further. But no metric escapes the fundamental geometry: in truly high dimensions, every reasonable metric has the same eventual problem. The real fix is to reduce dimension, not to pick a different distance.

Dimensionality Chapter 1 of 4 · tap to browse

01 Curse of Dimensionality 02 Correlation & Redundancy 03 PCA 04 Other Methods

The Curse of Dimensionality

What happens to data, distance, and intuition when features multiply

A recommendation engineer adds a hundred new audio features to their k-NN system hoping to refine 'similar songs' queries. Accuracy stays flat, inference slows by 30×, and the nearest 10 songs to almost any query begin to look suspiciously identical across queries. The curse of dimensionality is not a theoretical nuisance — it is the reason that system quietly stopped working.

Learning Objectives

1 Explain why the volume of a d-dimensional unit ball collapses to zero relative to the unit cube as d grows, and what this implies for the distribution of points sampled uniformly in a high-dimensional space.
2 Describe the shell-concentration phenomenon: in high dimensions, nearly all of a hypersphere's volume lies close to its surface, so random samples are rarely found near the centre.
3 Predict how the ratio of nearest-to-farthest neighbour distances changes with dimension and dataset size, and identify when distance-based algorithms (k-NN, k-means, clustering) become unreliable.
4 Recognise exponential sparsity: the number of samples required to maintain density scales as n^d, so a dataset that is 'enough' at d=2 is catastrophically thin at d=10 or d=100.

¶ Narrative

Four Ways Geometry Betrays Intuition

A recommendation engineer adds a hundred new audio features to their nearest-neighbour music system. More information about each track, they reason, must give sharper “similar songs” results. Instead, accuracy plateaus, query latency explodes, and the top 10 neighbours start to look suspiciously generic across very different query songs. The mechanism is not a bug in their code — it is a geometric fact about what happens to data as dimension grows. That fact is the curse of dimensionality, and four distinct geometric failures contribute to it.

Failure one: volume collapses to the corners

In two dimensions, a unit disk inscribed in a unit square fills π⁄4 ≈ 78.5% of it. In three dimensions, a unit ball fills π⁄6 ≈ 52.4%. By ten dimensions, the inscribed ball occupies roughly 0.25% of the cube — the remaining 99.75% is in the corners. By twenty dimensions, the ball is effectively invisible at 10⁻⁸ of the cube’s volume.

Volume of the inscribed hypersphere as a fraction of the unit hypercube, on a log scale from d=1 to d=20. At d=10 the ball occupies 0.25% of the cube; at d=20 it is 10⁻⁸. In high dimensions the interior of the ball — the region we picture when we imagine a 'cluster' — is vanishingly small compared to the corners.

This matters for machine learning because random samples drawn uniformly in a high-dimensional bounded space will not be “near the origin” in the sense our low-dimensional intuition suggests. They live in the corners. A model trained to expect data inside the ball will be working on an empty region.

Failure two: volume concentrates at the surface

Inside a unit ball, what fraction of the volume is within the outer 10% shell — the region with radius greater than 0.9? In one dimension (a line segment), trivially 10%. In two dimensions (a disk), 1 − 0.9² = 19%. In three, 27%. By d=50, the figure is 99.5%: essentially every point in a high-dimensional ball is close to the surface.

Fraction of a d-dimensional ball's volume that lies within the outer 10% shell (r > 0.9). At d=1 the shell carries 10%; at d=5 it carries 41%; at d=20 it passes 88%; at d=50 it is 99.5%. Past d ≈ 30, essentially all of a ball's volume is near its surface — the 'middle' of a high-dimensional cluster is empty.

For a Gaussian cluster in high dimensions, this is counter-intuitive but formally similar: samples live in a thin shell at distance ≈ σ√d from the mean, rather than being concentrated around the mean as the 1D intuition suggests. A “central tendency” as a location is a 1D–2D picture. In high-D, the centre of mass is a mathematical convention that no point is actually near.

Failure three: distances concentrate

The most operationally consequential failure is distance concentration. In low dimensions, some points in a dataset are meaningfully closer to a query than others — the nearest-neighbour ratio (farthest − nearest) / nearest is large. As dimension grows, that ratio shrinks toward zero: nearest and farthest become almost the same number.

For a query point in a uniformly sampled 2-dimensional dataset, the farthest neighbour might be six times further away than the nearest. That is a clear discriminative signal — ‘nearby’ has a meaning. In 50 dimensions, the farthest and the nearest sit within a factor of about 1.8 — not even double. The nearest neighbour might be the 47th sorted neighbour for all you can tell from the numbers.

Distance metric	Low d (d=2)	Medium d (d=10)	High d (d=50)
Euclidean (L2)	nearest ≈ 0.15, farthest ≈ 1.00 — ratio ≈ 6.1	nearest ≈ 0.62, farthest ≈ 2.05 — ratio ≈ 2.3	nearest ≈ 1.80, farthest ≈ 3.35 — ratio ≈ 0.86
Manhattan (L1)	nearest ≈ 0.22, farthest ≈ 1.50 — ratio ≈ 5.8	nearest ≈ 1.55, farthest ≈ 5.10 — ratio ≈ 2.3	nearest ≈ 7.00, farthest ≈ 15.0 — ratio ≈ 1.14

The practical meaning: k-nearest-neighbour classifiers and k-means clustering both rely on the premise that ‘nearest’ is a small and discriminative subset of the data. When every point is roughly the same distance from every other point, these algorithms effectively reduce to guessing.

💡 Insight

Manhattan (L1) distance is slightly less affected by distance concentration than Euclidean (L2). Fractional metrics like L0.5 are even less affected. This is why some high-dimensional nearest-neighbour systems use Manhattan or fractional distances — not because they give “better” neighbours in an absolute sense, but because they retain more contrast before collapsing.

Failure four: exponential sparsity

To maintain the same local density of data points across dimensions, the number of samples you need grows exponentially. If 10 points uniformly spaced on a unit interval give reasonable coverage, then in 2D you need 10² = 100 to get the same per-unit-area density, in 3D you need 10³ = 1,000, and in 10 dimensions you need 10¹⁰ — ten billion. No dataset provides that.

The same 10 points displayed in 1D, 2D, and 3D. On a line, 10 points are dense. On a 2D square, they are already noticeably scattered — you would need 100 to match the 1D density. On a 3D cube, they are almost pinpricks — matching 1D density would require 1,000 points. By d=10 the same density requires 10¹⁰ points, which no real dataset provides.

In practice, ML datasets are not uniform — real data usually lies on a much lower-dimensional manifold embedded in the high-dimensional ambient space. A dataset with 512-dimensional embeddings might have intrinsic dimension closer to 10 or 20 if the semantic structure is limited. This is what makes modern deep learning tractable despite the ambient dimensions. But when the learned representation itself is still high-dimensional, or when raw features are used without reduction, sparsity bites.

Which algorithms break first

Distance-based methods feel the curse most directly. The ordering by sensitivity is roughly:

k-Nearest Neighbours breaks first. Its entire mechanism is “find the nearest k points”; when nearest loses meaning, so does classification.
k-means and other centroid-based clustering breaks next. Centroids still exist mathematically but no longer capture a local neighbourhood.
Kernel methods (SVM with RBF kernel, Gaussian processes) degrade because the kernel’s effective bandwidth depends on a distance that no longer varies.
Linear models can still overfit — a model with more features than examples has perfect memorisation capacity even without distance computations.
Tree-based models are robust because each split uses a single feature, never a distance.
Neural networks are robust because they learn a low-dimensional representation where the classes separate — they escape the ambient dimension rather than fighting it.

python

import numpy as np

def shell_fraction(d: int) -> float:
  """Fraction of a d-dim ball's volume inside the outer 10% shell."""
  return 1 - 0.9 ** d

def relative_contrast_approx(d: int) -> float:
  """Rough (farthest - nearest) / nearest ratio for uniform [0,1]^d, n≈120."""
  # Approximation: nearest ~ sqrt(d) * 0.16, farthest ~ sqrt(d/6) * 2.8
  nearest  = max(0.15, np.sqrt(d) * 0.16)
  farthest = np.sqrt(d / 6) * 2.8 + 0.8
  return (farthest - nearest) / nearest

for d in [2, 5, 10, 25, 50, 100]:
  print(f"d={d:3d}  shell@0.9={shell_fraction(d):6.1%}  "
        f"contrast={relative_contrast_approx(d):4.2f}")

# d=  2  shell@0.9= 19.0%  contrast=4.05
# d=  5  shell@0.9= 41.0%  contrast=2.24
# d= 10  shell@0.9= 65.1%  contrast=1.49
# d= 25  shell@0.9= 92.8%  contrast=0.93
# d= 50  shell@0.9= 99.5%  contrast=0.69
# d=100  shell@0.9=100.0%  contrast=0.49

The practical takeaway from this chapter: if your data lives in dozens or hundreds of dimensions and your model is distance-based, either accept that you will get diminishing returns from adding features, or reduce dimension before the algorithm sees the data. The next three chapters develop the toolkit for doing that second thing — first by identifying redundant features that can be safely dropped, then by finding directions of maximum variance (PCA), and finally by comparing PCA against modern non-linear alternatives.

In this section

Why does adding features often hurt instead of help?

Every additional feature stretches the space a point can live in. With n training examples fixed, each new dimension cuts local density by roughly a constant factor. The model's neighbours, centroids, and local statistics are all computed on increasingly sparse data. At some point the 'information' a new feature adds is less than the sparsity it costs, and performance degrades. The classic diagnostic: hold out a validation set, plot accuracy as features are added one at a time, and watch for the inflection where the curve turns downward.

Does this only affect distance-based methods?

Distance-based methods (k-NN, k-means, kernel SVM, DBSCAN) feel the curse most directly because their mechanism is: find nearby points. When 'nearby' loses meaning — because all points are roughly equidistant — these algorithms effectively randomise. Tree-based models (random forests, gradient boosting) are more robust because they only compare along one feature at a time at each split. Linear models are in between: they do not compute distances, but they can still overfit with more features than examples.

Is there a specific dimension where things break?

There is no single threshold, but some rough reference points help. Below d ≈ 10, distance-based methods usually work. Between 10 and 50, performance becomes highly dataset-dependent: if your data lies on a low-dimensional manifold embedded in high ambient dimension, distances may still carry signal. Above d ≈ 100, raw Euclidean distance between most point pairs is essentially indistinguishable and distance-based reasoning fails without heavy preprocessing.

◎ Intuition

A product team wants to add facial-feature measurements to a recommendation model — 128 features extracted from each user's photo. The current model uses 8 behavioural features and performs well. Before adding the facial features: - If the 8-feature model gives a nearest-neighbour contrast ratio of roughly 3 (farthest neighbour is three times further than the nearest), what do you estimate the 128-feature model's contrast ratio would be — and would k-NN still produce meaningful recommendations at that ratio? - The 8-feature model was trained on 50,000 users. To maintain the same per-dimension data density in the 128-feature space, how many users would the training set need? Is that number reasonable for any consumer product? - Is there anything the team could do to make the 128 facial features useful without inflating the dataset size — for example, by mapping them into a smaller learned representation before the nearest-neighbour lookup?

↺ Reflection

What to Keep From This Chapter

The curse of dimensionality is best understood as four distinct geometric failures that reinforce each other. Volume collapses toward the corners — a unit ball inscribed in a 10-dimensional cube fills less than 0.3% of it, so random high-dimensional samples are rarely near any intuitive ‘centre’. Volume also concentrates at the surface of whatever region data occupies — the interior of a high-dimensional cluster is empty, with nearly all points living on a thin shell. Distances concentrate so tightly that the farthest neighbour of a query point is barely further than the nearest, which is the direct mechanism behind nearest-neighbour algorithms breaking at high dimension. And sparsity grows exponentially so that the same 10 points that gave reasonable coverage in 1D would need to be ten billion points at d=10 to maintain the same density. These phenomena are not independent observations; they are different sides of the same underlying fact about volume in high-dimensional Euclidean space.

The algorithms that break first are exactly those whose mechanism is “find nearby points”. k-nearest-neighbours, k-means, kernel methods with Gaussian kernels, and most clustering algorithms all depend on a meaningful local neighbourhood. When dimension grows and every point is roughly equidistant from every other point, local neighbourhoods dissolve and these algorithms effectively reduce to random guessing. The playground makes this tangible: at d=2 the nearest neighbour is six times closer than the farthest; at d=50 it is barely closer at all. The practical rule is that raw distance-based reasoning starts to degrade around d=10 and is essentially unreliable past d=100 without intervention.

There are two partial mitigations and one real solution. The first mitigation is to use a distance metric less affected by concentration — Manhattan instead of Euclidean, or even fractional Lp distances. This buys a modest amount of effective dimensionality but does not change the fundamental geometry; at truly high dimensions every metric eventually collapses. The second mitigation is to accept diminishing returns from adding features — past some dataset-dependent threshold, more features stop helping because the sparsity they introduce exceeds the information they carry. The real solution is dimensionality reduction: either removing redundant features (the next chapter), projecting onto directions of maximum variance (the PCA chapter), or learning a non-linear low-dimensional embedding (the chapter after that). Deep learning sidesteps the curse not by working in high-dimensional space but by learning its way out of it — mapping raw inputs into a much smaller learned representation before any distance-based reasoning happens.

The message to take into the next three chapters: dimension is a cost, not a gift. Every feature must earn its place in the model, and a model that naively stacks hundreds of raw features is paying the curse’s tax on every inference without getting proportional benefit in return. The rest of this topic is about how to pay less.

Key Points

High-dimensional unit balls are nearly invisible compared to their bounding cubes — at d=10 the ball occupies 0.25% of the cube and at d=20 it is 10⁻⁸, so random samples drawn uniformly in a bounded high-dimensional space live overwhelmingly in the corners rather than near the centre.

Almost all of a high-dimensional ball's volume is concentrated in a thin outer shell — by d=50, 99.5% of the volume lies within 10% of the surface — which means the intuitive picture of data clustered around a central location fails as dimension grows.

Distance concentration is the operationally consequential failure: the ratio of farthest to nearest neighbour falls toward one as dimension rises, so k-NN, k-means, and other distance-based methods lose their discriminative power because 'nearest' stops being meaningfully different from 'generic'.

Sparsity grows exponentially with dimension: the number of samples needed to maintain local density scales as n^d, so a dataset that is 'enough' at d=2 is catastrophically thin at d=10 or d=100 — which is why more features without more data usually hurts rather than helps.

Tree-based models and neural networks resist the curse because trees split on one feature at a time and networks learn a lower-dimensional representation; distance-based methods (k-NN, k-means, kernel SVM) break first and need either dimensionality reduction or a metric switch to remain useful.

✓ Checkpoint

Check Your Understanding

Answer these questions about the curse of dimensionality scenarios covered in this chapter.

A dataset has uniformly-distributed points in the unit cube [0,1]^d. For which statement is the intuition least reliable at d=50?

A data scientist increases the dimensionality of their k-NN music recommender from 8 features to 128. They keep the training catalogue at 50,000 songs. Which outcome should they most expect?

Put the following algorithm classes in order from most-sensitive to the curse of dimensionality (breaks first as d grows) to least-sensitive (robust to high d).

1.Decision tree / random forest
2.k-Nearest Neighbours with Euclidean distance
3.Neural network with a learned low-dimensional embedding
4.k-means clustering
5.Kernel SVM with Gaussian RBF

A team has a dataset with 500 training examples in a 1,024-dimensional BERT embedding. They should expect k-NN classification performance on this dataset to be dramatically better than k-NN on the same 500 examples represented as 10 hand-crafted features — because the 1,024-dimensional space carries much more information.