Why do machine learning models use hundreds of dimensions?

Each dimension in an embedding space corresponds to one independent direction of variation. A single number cannot encode the differences between 'king', 'queen', and 'monarch', but 768 independent numbers can: together they give the model 768 separate axes along which meanings can differ. More independent dimensions mean finer-grained distinctions, up to the point of diminishing returns.

Is the standard basis the only valid basis for 2D space?

No. Any two non-parallel vectors form a valid basis for 2D space. The standard basis e₁=(1,0) and e₂=(0,1) is the most common choice because it makes arithmetic simple, but oblique, rotated, or scaled bases work equally well — they assign different coordinates to the same geometric points.

Vectors · Chapter 3 of 3

Vector Spaces and Basis

Linear combinations, span, and the coordinate systems that power ML embeddings

A language model such as BERT stores every token as a point in a high-dimensional space (768 dimensions for BERT-base). Loosely speaking, each independent direction in that space encodes something the model learned to distinguish: tense, formality, topic. This chapter shows what that means: what a dimension is, why hundreds of them are needed, and why they must all point in independent directions.

Learning Objectives

1 Recognise a linear combination as a sum of vectors each scaled by a scalar coefficient, and compute one given specific vectors and scalars.
2 Explain what the span of a set of vectors means geometrically — the complete set of points reachable by linear combinations of those vectors.
3 Explain why a basis must be linearly independent, using the grid-collapse example to show what happens when two basis vectors become parallel.
4 Connect high-dimensional vector spaces to ML embeddings, explaining why hundreds of independent directions are needed to encode nuanced meaning.

¶ Narrative

Combinations, Span, and Basis

Take any two vectors. You can scale each one by a number, then add the results together. The output is a new vector. Do this for every possible pair of scaling numbers and you sweep out a set of reachable points. That set is what this chapter is about.

Linear combinations

A linear combination of vectors v₁ and v₂ is any sum of the form:

Linear combination

w = c₁v₁ + c₂v₂

where c₁ and c₂ are scalars — any real numbers, positive, negative, or zero. The vector w is the result: you scale v₁ by c₁, scale v₂ by c₂, then add them component-wise.

Every specific choice of (c₁, c₂) lands you at a specific point. Change the scalars and you land somewhere else. The question is: which points can you reach by varying the scalars freely?

For example: with v₁ = (2, 1) and v₂ = (−1, 1), the combination 3v₁ + 2v₂ gives:

Worked example

3(2, 1) + 2(−1, 1) = (6, 3) + (−2, 2) = (4, 5)

Choosing different scalars lands you somewhere else: 1(2, 1) + 0(−1, 1) = (2, 1), or −1(2, 1) + 1(−1, 1) = (−3, 0). Each pair (c₁, c₂) reaches a different point in the plane.

Real World

In a neural network, every neuron in a fully-connected layer starts from a linear combination. Its pre-activation output is c₁x₁ + c₂x₂ + … + cₙxₙ (plus a bias), a weighted sum of its inputs in which the weights cᵢ are the scalars. Seen across the whole layer, the output vector Wx is a linear combination of the columns of the weight matrix W, with the input activations xᵢ as the scalars. Linear combinations are therefore not abstract: they are the core operation every dense layer performs on every forward pass.
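The weighted sum can be read two ways, and a small sketch makes both visible. The weight matrix `W` and input `x` below are hypothetical values chosen for illustration, and the bias and activation function are left out.

```python
import numpy as np

# Hypothetical weights for a dense layer with 3 inputs and 2 outputs.
W = np.array([[1.0, 0.0, 2.0],
              [0.5, -1.0, 1.0]])
x = np.array([2.0, 1.0, -1.0])   # input activations

# Per-neuron view: each output is a weighted sum of the inputs.
y_neurons = np.array([np.sum(W[j] * x) for j in range(W.shape[0])])

# Layer view: the output vector is a linear combination of W's columns,
# with the input activations acting as the scalars.
y_columns = sum(x[i] * W[:, i] for i in range(W.shape[1]))

# Both views agree with the ordinary matrix-vector product.
y_matmul = W @ x
```

All three results are the same vector, (0, −1) for these values; only the way of grouping the arithmetic differs.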

Varying the scalars c₁ and c₂ moves the result **w** = c₁**v₁** + c₂**v₂** to different points in the plane.

```python
import numpy as np

v1 = np.array([2, 1])
v2 = np.array([-1, 1])

# Scale each vector by its scalar, then add component-wise.
c1, c2 = 3, 2
w = c1 * v1 + c2 * v2   # array([4, 5])
```

Span: the set of all reachable points

The span of a set of vectors is exactly that: the complete collection of points reachable through all possible linear combinations.

💡 Insight

The span is not just the two vectors themselves — it is every point you can reach by combining them. Think of the scalars as dials you can turn to any value. The span is the full sweep of where those dials can take you.

Whether the span fills a line, a plane, or a higher-dimensional space depends entirely on the vectors you start with. Two vectors that point in the same direction (or opposite directions) only reach points along that one direction — their span is a line. Two vectors that genuinely point in different directions reach every point in the plane — their span is the whole 2D space.
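One way to check which case applies, as a rough sketch: stack the vectors as columns of a matrix and ask NumPy for its rank, which equals the dimension of the span. The helper name `span_dimension` is made up for this example.

```python
import numpy as np

def span_dimension(*vectors):
    """Return the dimension of the span of the given vectors."""
    # Stack the vectors as columns; the matrix rank equals the span's dimension.
    return int(np.linalg.matrix_rank(np.column_stack(vectors)))

same_direction = span_dimension([1, 2], [3, 6])     # (3, 6) = 3 * (1, 2)
opposite = span_dimension([1, 2], [-1, -2])         # flipped copy of (1, 2)
different = span_dimension([2, 1], [-1, 1])         # genuinely different directions
```

A result of 1 means the vectors only reach a line; a result of 2 means they reach the whole plane.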

In a neural network, the columns of a layer's weight matrix must span the full output space for the layer's linear map to reach every possible output. If those columns span only a lower-dimensional subspace, which happens when too few of them are linearly independent, some outputs are unreachable and the layer has less expressive power than its size suggests.

Two non-parallel vectors span the entire 2D plane. When one vector is rotated until it is parallel to the other, the span collapses to a single line.

Linear independence: each vector adds a new direction

Two vectors are linearly independent when neither can be expressed as a scalar multiple of the other — each one genuinely points in a new direction.

Plain English: two vectors are independent when you cannot get from one to the other by stretching, shrinking, or flipping. Each must point somewhere genuinely new.

For example, (1, 2) and (2, 4) are dependent: (2, 4) = 2 × (1, 2). Any linear combination c₁(1, 2) + c₂(2, 4) equals (c₁ + 2c₂)(1, 2), so it always collapses to a single scalar multiple of (1, 2). The span is the line through the origin in direction (1, 2), not the full plane.

When independence holds, combining the two vectors can reach points in a full 2D plane. When independence fails — when one vector is just a stretched or flipped version of the other — the two vectors only span a line. There is only one direction available, no matter what the scalars are.

Common Mistake

A common misconception is that more vectors always means a larger span. Adding a third vector to a linearly independent pair in 2D does not expand the span beyond the plane — the third vector is already reachable as a combination of the first two. Adding a redundant vector adds no new directions, only more parameters to describe the same set of points.
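To make the redundancy concrete, here is a minimal sketch that reuses v₁ = (2, 1) and v₂ = (−1, 1) from the worked example and takes (4, 5) as an arbitrary third vector. It solves for the scalars that rebuild the third vector from the first two and confirms that the rank does not grow.

```python
import numpy as np

v1 = np.array([2.0, 1.0])
v2 = np.array([-1.0, 1.0])
v3 = np.array([4.0, 5.0])       # candidate third vector (illustrative choice)

# Solve [v1 v2] @ c = v3 for the coefficients c = (c1, c2).
B = np.column_stack([v1, v2])
c = np.linalg.solve(B, v3)

# The span's dimension before and after adding the third vector.
rank_pair = int(np.linalg.matrix_rank(B))
rank_triple = int(np.linalg.matrix_rank(np.column_stack([v1, v2, v3])))
```

Here `c` comes out as (3, 2), matching 3v₁ + 2v₂ = (4, 5), and both ranks are 2: the third vector adds no new direction.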

The condition for independence generalises to any number of vectors: v₁, v₂, …, vₙ are linearly independent when the only way to combine them to get the zero vector is to set all the scalars to zero. If you can combine them to get zero with at least one non-zero scalar, they are dependent — one of them can be written as a combination of the others.
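The zero-vector condition can be tested numerically. The sketch below is one possible approach, not the only one: it compares the matrix rank with the number of vectors, and for a dependent set it reads a non-zero set of scalars off the singular value decomposition. Both function names are hypothetical.

```python
import numpy as np

def is_independent(vectors):
    """True when only the all-zero scalars combine the vectors to zero."""
    A = np.column_stack(vectors)
    # Independent exactly when the rank equals the number of vectors.
    return bool(np.linalg.matrix_rank(A) == A.shape[1])

def dependency_coefficients(vectors):
    """For a dependent set, return unit-length scalars c with sum(c_i * v_i) ~ 0."""
    A = np.column_stack(vectors)
    # The last right-singular vector lies in the null space when one exists.
    # (For an independent set it is NOT a true dependency.)
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

dependent = [np.array([1.0, 2.0]), np.array([2.0, 4.0])]
independent = [np.array([2.0, 1.0]), np.array([-1.0, 1.0])]

c = dependency_coefficients(dependent)
residual = np.column_stack(dependent) @ c   # should be numerically zero
```

For the dependent pair, `c` is proportional to (2, −1), since 2(1, 2) − 1(2, 4) = (0, 0).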

Basis: the minimal spanning set

A basis is a set of vectors that is simultaneously:

Linearly independent — no vector is redundant
Spanning — every point in the space can be reached

A basis is the minimal complete description of a space. Remove any vector from a basis and some points become unreachable. Add any vector and it is automatically a combination of the ones already there — redundant.

💡 Insight

A basis is a coordinate system. Once you choose a basis, every point in the space has exactly one address: the pair of scalars (c₁, c₂) that produces it as a linear combination of the basis vectors. Change the basis and the addresses change — but the points do not.

The most familiar basis in 2D is the standard basis: e₁ = (1, 0) and e₂ = (0, 1). The address of any point (x, y) in the standard basis is just x units of e₁ plus y units of e₂ — which is why coordinates look natural. But this is not the only valid basis. Any two non-parallel vectors form a valid basis for 2D space. They give a different coordinate system to the same geometric plane.

The same point has different coordinates in different bases. Left: the standard basis gives (3, 2). Right: the rotated and scaled basis **b₁** = (1, 1), **b₂** = (1, −1) gives (2.5, 0.5), because 2.5(1, 1) + 0.5(1, −1) = (3, 2). Both addresses are equally valid.
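Finding a point's address in a non-standard basis amounts to solving a small linear system. As a sketch, using the point (3, 2) and the second basis from the figure, (1, 1) and (1, −1), here named `b1` and `b2` for the example:

```python
import numpy as np

b1 = np.array([1.0, 1.0])
b2 = np.array([1.0, -1.0])
point = np.array([3.0, 2.0])         # coordinates in the standard basis

# Columns of B are the basis vectors; solve B @ coords = point.
B = np.column_stack([b1, b2])
coords = np.linalg.solve(B, point)   # (c1, c2) with c1*b1 + c2*b2 = point

# Rebuilding the point from its new coordinates recovers the original.
reconstructed = coords[0] * b1 + coords[1] * b2
```

The solve succeeds because `b1` and `b2` are linearly independent; with parallel vectors `np.linalg.solve` would raise a singular-matrix error instead.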

💡 Insight

Coming up: Principal Component Analysis. PCA finds a new basis for data by choosing basis vectors that align with the directions of greatest variance. The first basis vector points along the axis of maximum spread; each later one is orthogonal to all the earlier ones and points along the direction of greatest remaining variance. Expressing data in this new basis, that is, changing coordinates, is the core operation PCA performs.
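As a preview, the change of basis that PCA performs can be sketched in a few lines of NumPy. This is a bare-bones illustration on synthetic data (an assumption made for the example), using an eigendecomposition of the covariance matrix rather than any particular library's PCA routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data stretched along the (1, 1) direction (illustrative only).
t = rng.normal(size=500)
noise = rng.normal(scale=0.2, size=500)
data = np.column_stack([t + noise, t - noise])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder so max variance is first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]
components = eigvecs[:, order]       # columns are the new basis vectors

# Change of coordinates: express each centered point in the new basis.
projected = centered @ components
```

The columns of `components` form an orthonormal basis, and the first column lines up (up to sign) with the direction in which the data is most spread out.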

Dimension and the necessity of independence

The dimension of a vector space is the number of vectors in any basis for it. This is always well-defined: any two bases for the same space have the same number of vectors.

In 2D, every basis has exactly 2 vectors. In 3D, every basis has exactly 3. In the vector spaces that ML models use — typically 128, 256, 768, or 4096 dimensions — every basis has exactly that many vectors.

Real World

Word embeddings (Word2Vec, GloVe, BERT) map each word or token to a vector in a high-dimensional space; 768 dimensions is a typical size for BERT. Those 768 dimensions give the model 768 independent directions to work with. Loosely, you can picture the learned directions as encoding properties like formality, tense, topic, or emotional register, although in practice a single coordinate rarely maps cleanly onto one human-readable concept. To distinguish between “king”, “queen”, “duke”, “empress”, and thousands of other words with similar meanings but distinct roles, you need enough independent directions. 768 independent axes provide far more resolution than 2 or 3 could.

Image feature vectors work the same way. A CNN’s penultimate layer often produces a vector of 2048 or 4096 numbers. Each dimension corresponds, roughly, to a learned visual feature such as curves, textures, or object parts. The vector space these features live in has that many independent directions, each available to encode something the model learned to recognise.

The key word is independent: if two of those 768 directions were parallel, one would be redundant and you would need only 767. Training tends to push the representations toward using each direction for something distinct.
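The "only 767" claim can be checked with a quick sketch. The 768 directions here are random vectors standing in for learned ones (an assumption made purely for illustration); making one of them parallel to another drops the rank by exactly one.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768

# Hypothetical stand-in for 768 learned directions: random vectors as columns.
directions = rng.normal(size=(dim, dim))
rank_independent = int(np.linalg.matrix_rank(directions))

# Make the last direction parallel to the first one.
redundant = directions.copy()
redundant[:, -1] = 2.0 * redundant[:, 0]
rank_redundant = int(np.linalg.matrix_rank(redundant))
```

The first rank is 768 and the second is 767: the parallel direction still occupies a slot but contributes no new degree of freedom.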

In this section

What is a basis for a vector space?

A basis is a minimal set of linearly independent vectors that spans the entire space. Minimal means no vector in the set is redundant — removing any one of them leaves some points unreachable. In 2D, any two non-parallel vectors form a valid basis. In n-dimensional space, you need exactly n linearly independent vectors.

What does span mean in linear algebra?

The span of a set of vectors is the complete collection of points you can reach by forming linear combinations — adding scaled versions of those vectors together. Two non-parallel vectors in 2D span the entire plane. Two parallel vectors span only a single line through the origin.

What does it mean for vectors to be linearly independent?

Vectors are linearly independent when none of them can be expressed as a linear combination of the others — each one contributes a genuinely new direction. In 2D, two vectors are linearly independent if and only if they are not parallel. When independence fails, the span collapses: two parallel vectors span only a line, not a plane.

Key Terms

Basis Vector
Linear Combination
Span
Linear Independence
Basis

◎ Intuition

Imagine two streets in a city. Street A runs east. Street B runs north. Any address can be reached by walking some number of blocks along Street A and some number along Street B. The two street directions are your basis — your coordinate system for the city. Now imagine rotating Street B slowly toward Street A until both streets run east. What happens to the set of addresses you can reach? You can still move east as far as you like — but you cannot reach any address that is north of your starting point. Whole neighbourhoods become unreachable. The two streets together only let you travel in one direction. That collapse — from a full city plane to a single line — is what happens to a vector space when its basis vectors become linearly dependent. Two independent directions span a full 2D space. One direction, no matter how many vectors you add pointing the same way, spans only a line.

↺ Reflection

Basis and the Geometry of Space

That is the core geometry of a basis. A basis is not the space itself; it is a way of describing the space. Change the basis and the coordinates change. Keep the basis fixed and every point has exactly one address — the unique pair of scalars (c₁, c₂) such that c₁e₁ + c₂e₂ = point. This uniqueness is guaranteed precisely by linear independence.

As e₁ rotates toward e₂, the deformed parallelogram grid narrows, but as long as the two vectors are not exactly parallel they still span the full plane. At the moment they become parallel, when the determinant of the 2×2 matrix with columns e₁ and e₂ is exactly zero, the grid collapses to a single line and the plane-wide span vanishes. No combination of scalars can reach any point off that line. Two vectors that differ only by a scalar multiple contribute only one independent direction, not two. The span loses a dimension.
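The collapse can be traced numerically. In this sketch, e₂ is fixed at (1, 0) and e₁ is a unit vector at a chosen angle from it (both choices are assumptions made for the example); the function reports the determinant and the rank of the matrix whose columns are e₁ and e₂.

```python
import numpy as np

e2 = np.array([1.0, 0.0])

def basis_report(theta_deg):
    """Return (determinant, rank) for {e1, e2} with e1 at angle theta from e2."""
    theta = np.radians(theta_deg)
    e1 = np.array([np.cos(theta), np.sin(theta)])
    B = np.column_stack([e1, e2])
    return float(np.linalg.det(B)), int(np.linalg.matrix_rank(B))

# Rotate e1 toward e2: 90 degrees apart down to exactly parallel.
reports = {angle: basis_report(angle) for angle in (90, 45, 10, 1, 0)}
```

The determinant's magnitude shrinks steadily as the angle closes, while the rank stays at 2 for every non-zero angle and drops to 1 only at 0 degrees, when the vectors are exactly parallel.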

Linear independence is therefore not a technicality. It is the condition that prevents this collapse. A basis must be independent because any redundant vector can be expressed as a combination of the others — it contributes no new direction, only a new label for a point already reachable.

This is why BERT represents each token as a vector in a 768-dimensional space with 768 independent directions, not 2. To distinguish “bank” (financial institution) from “bank” (river bank) from “bank” (to tilt in a turn), and to do so for the roughly 30 000 tokens in its vocabulary, the model needs enough independent axes of meaning. The directions it ends up using capture whatever training found useful; formality, tense, topic, emotional register, and syntactic role are the kinds of properties involved. If two of those 768 dimensions were parallel, one would be redundant: the model would be carrying a direction it had already described. 768 independent axes provide 768 genuinely distinct ways to vary; a dependent set would give fewer actual degrees of freedom.

Key Points

A linear combination c₁v₁ + c₂v₂ produces a new vector by scaling and adding. Varying c₁ and c₂ over all real numbers sweeps out the span — the complete set of reachable points.

Two vectors span a full 2D plane if and only if they are linearly independent — neither is a scalar multiple of the other. When they are parallel, their span collapses to a single line through the origin.

A basis is a minimal spanning set: linearly independent and spanning. Every point has exactly one address in a given basis; change the basis and the addresses change, but the underlying geometry does not.

High-dimensional ML embeddings require many independent directions because each dimension encodes a distinct axis of meaning the model learned. Dependent dimensions would be redundant — they could be described by the others.

✓ Checkpoint

Check Your Understanding

Four questions on vector spaces, span, and basis.

Vectors u = (2, 4) and v = (1, 2) are parallel (v is a scalar multiple of u). What is the span of {u, v}?

True or false: the standard basis e₁ = (1, 0) and e₂ = (0, 1) is the only valid basis for 2D space.

A language model represents each token as a vector in a 768-dimensional space. Why does the model need 768 dimensions rather than, say, 3?

Given v₁ = (2, 1) and v₂ = (−1, 1), which of the following is a linear combination of v₁ and v₂ that equals (3, −2)?