Why are weight matrices in neural networks called linear layers?

A linear layer applies the matrix multiplication y = Wx to its input vector x. This is a linear transformation: it satisfies W(u + v) = Wu + Wv and W(cu) = c(Wu) for any vectors u, v and any scalar c. No nonlinear operations are applied. The word 'linear' refers to this algebraic property. Activation functions such as ReLU are added after the linear layer specifically to introduce nonlinearity.
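The two linearity properties can be checked numerically. The following is a minimal sketch in plain Python; the matrix W, the vectors u and v, and the scalar c are illustrative values chosen here, not taken from any real network. It also shows that ReLU fails the additivity test, which is why it counts as a nonlinearity.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(u, v):
    """Component-wise vector sum."""
    return [a + b for a, b in zip(u, v)]

def scale(c, u):
    """Multiply every component of u by the scalar c."""
    return [c * a for a in u]

def relu(x):
    """ReLU applied to each component."""
    return [max(0, a) for a in x]

# Illustrative values (assumptions for this sketch)
W = [[2, 1], [0, 3]]
u = [1, 2]
v = [4, -1]
c = 5

# W(u + v) == Wu + Wv
additivity_holds = matvec(W, add(u, v)) == add(matvec(W, u), matvec(W, v))
# W(cu) == c(Wu)
homogeneity_holds = matvec(W, scale(c, u)) == scale(c, matvec(W, u))
# ReLU does not satisfy additivity for these vectors
relu_is_additive = relu(add(u, v)) == add(relu(u), relu(v))

print(additivity_holds, homogeneity_holds, relu_is_additive)  # True True False
```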

Matrices · Chapter 1 of 3

What Does a Matrix Do?

A matrix is a transformation — it takes a vector and produces a new vector

Every linear layer in a neural network applies a weight matrix to transform one embedding vector into another — learning a recipe that routes useful information to the next layer.

Learning Objectives

1. Recognise matrix-vector multiplication as a row-by-column dot product that produces one output component per row of the matrix.
2. Identify the transformation a 2×2 matrix performs by reading its columns: the first column shows where e₁ = (1, 0) lands, the second shows where e₂ = (0, 1) lands.
3. Explain what a weight matrix in a linear neural network layer does: it applies a linear transformation that rotates and stretches the input vector into a new vector space.

¶ Narrative

Matrices as Transformations

A matrix is not a spreadsheet of values. It is a linear transformation: a function that takes a vector as input and produces a new vector as output. The numbers inside the matrix are the parameters of that transformation. For a 2×2 matrix, everything it does geometrically (rotating, stretching, shearing, reflecting) is encoded in its four entries.

Matrix-vector multiplication

A 2×2 matrix M multiplied by a 2D vector v gives a new 2D vector. The rule is row times column: the first output component is the dot product of the first row of M with v; the second is the dot product of the second row with v:

Matrix-vector multiplication

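$$
M v = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} a v_x + b v_y \\ c v_x + d v_y \end{pmatrix}
$$

$$
\begin{pmatrix} 2 & 1 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 \cdot 1 + 1 \cdot 2 \\ 0 \cdot 1 + 3 \cdot 2 \end{pmatrix} = \begin{pmatrix} 4 \\ 6 \end{pmatrix}
$$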

The input vector (1, 2) has been transformed into the output vector (4, 6). Both input components contributed to both output components — the output is a genuine mixture of the inputs, not a simple component-by-component scaling.
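The row-times-column rule translates directly into code. This is a small sketch in plain Python (the function name matvec_rows is chosen here for illustration) that reproduces the worked example with the matrix [[2, 1], [0, 3]] and the input (1, 2).

```python
def matvec_rows(M, v):
    """Matrix-vector product computed one row at a time.

    Each output component is the dot product of one row of M with v.
    """
    out = []
    for row in M:
        dot = sum(m * x for m, x in zip(row, v))
        out.append(dot)
    return out

M = [[2, 1],
     [0, 3]]
v = [1, 2]

result = matvec_rows(M, v)
print(result)  # [4, 6]
```

Row 1 contributes 2·1 + 1·2 = 4 and row 2 contributes 0·1 + 3·2 = 6, matching the hand calculation.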

Applying [[2,1],[0,3]] to the input vector (1,2). Each output component is computed by dotting one row of the matrix with the full input vector. Row 1 produces 4; row 2 produces 6. The input arrow becomes the output arrow — the vector has been transformed.

This is exactly what happens inside a neural network’s linear layer. Each output neuron computes one row-dot-vector product: it combines all input values using learned weights. A layer with 3072 output neurons computes 3072 such dot products — one per row of the weight matrix — producing a 3072-dimensional output from a 768-dimensional input in a single matrix multiplication.

Every output component is a weighted sum of all the input components. The first output mixes v_x and v_y according to weights a and b. The second mixes them according to c and d. This is why the transformation is called linear — it scales and adds, never multiplies inputs together or applies any other nonlinear operation.

The column interpretation

There is a more revealing way to read the same formula. Factor out v_x and v_y:

Column interpretation

$$
M v = v_x \begin{pmatrix} a \\ c \end{pmatrix} + v_y \begin{pmatrix} b \\ d \end{pmatrix}
$$

The output is a linear combination of the two columns of M. The first column — (a, c) — is the vector that e₁ = (1, 0) maps to under M. Substitute v_x = 1, v_y = 0 and the formula gives exactly (a, c). The second column — (b, d) — is where e₂ = (0, 1) lands: substitute v_x = 0, v_y = 1 and you get (b, d).
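The column reading can be sketched as an alternative way to compute the same product: scale each column by the matching input component and add the results. The helper names below (columns, matvec_columns) are illustrative, and the matrix is the one from the earlier worked example.

```python
def columns(M):
    """Return the columns of M (a list of rows) as lists."""
    return [list(col) for col in zip(*M)]

def matvec_columns(M, v):
    """Matrix-vector product as a linear combination of the columns of M."""
    out = [0] * len(M)
    for coeff, col in zip(v, columns(M)):
        for i, entry in enumerate(col):
            out[i] += coeff * entry
    return out

M = [[2, 1],
     [0, 3]]

e1_image = matvec_columns(M, [1, 0])  # first column of M
e2_image = matvec_columns(M, [0, 1])  # second column of M
combined = matvec_columns(M, [1, 2])  # 1 * (2, 0) + 2 * (1, 3)

print(e1_image, e2_image, combined)  # [2, 0] [1, 3] [4, 6]
```

The result for (1, 2) agrees with the row-by-row calculation, as it must: both are the same formula grouped differently.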

💡 Insight

To understand what a matrix does, read its columns. The first column tells you where the x-axis goes. The second column tells you where the y-axis goes. Knowing these two destinations completely determines the transformation — any other vector’s destination follows by linear combination.

In a neural network, reading the columns of a weight matrix tells you what each input dimension contributes to the output space. Each column is a direction in the output space that one input feature gets mapped to. Training adjusts these columns — and therefore these directions — until the mapping routes information in whatever way minimises the loss.

The unit square transforms according to where the basis vectors go. Column 1 of the matrix is exactly where e₁ = (1,0) lands. Column 2 is exactly where e₂ = (0,1) lands. The unit square deforms into the parallelogram whose sides are the two column vectors.

The four fundamental transformations

The effect of a 2×2 matrix can be understood through four basic transformation types, shown here with one example of each:

Type	Matrix [[a, b], [c, d]]	What it does geometrically
Rotation 90° CCW	[[0, −1], [1, 0]]	Rotates every vector 90° counterclockwise. Lengths and angles preserved. Determinant = 1.
Uniform scale ×2	[[2, 0], [0, 2]]	Doubles all lengths in both axes. Area quadruples. Determinant = 4.
Horizontal shear	[[1, 1], [0, 1]]	Slides each point rightward by its height. Squares become parallelograms. Determinant = 1.
Reflection (x-axis)	[[1, 0], [0, −1]]	Flips every point across the x-axis: (x, y) → (x, −y). Determinant = −1.
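To see the table in action, the sketch below applies each of the four matrices to the corners of the unit square. It is plain Python; the dictionary keys and helper names are labels chosen here for illustration.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# The four example matrices from the table
TRANSFORMS = {
    "rotation_90_ccw": [[0, -1], [1, 0]],
    "uniform_scale_2": [[2, 0], [0, 2]],
    "horizontal_shear": [[1, 1], [0, 1]],
    "reflection_x_axis": [[1, 0], [0, -1]],
}

# Corners of the unit square, listed counterclockwise
UNIT_SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]

def transform_square(M):
    """Return the images of the unit-square corners under M."""
    return [tuple(matvec(M, corner)) for corner in UNIT_SQUARE]

images = {name: transform_square(M) for name, M in TRANSFORMS.items()}

for name, corners in images.items():
    print(name, corners)
```

In every case the second and fourth corners of the output are the two columns of the matrix, because those corners are the images of e₁ and e₂.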

Area and the determinant

Each row of the table has a determinant value. This number — computed as ad − bc for a 2×2 matrix [[a, b], [c, d]] — measures how much the transformation scales area.

$$
\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = a d - b c
$$

A determinant of 1 means area is preserved: the parallelogram has the same area as the original unit square. Rotation and shear both have determinant 1; they reorient or reshape the square without changing its area. A determinant of 4 means area quadruples: a uniform 2× scale multiplies each dimension by 2, so area scales by 2² = 4. A negative determinant means the transformation flips orientation, like looking at the image in a mirror; its absolute value still gives the area scale factor. A determinant of zero means the transformation collapses the plane onto a line: the two columns are parallel, area drops to zero, and information is permanently lost.
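The link between the determinant and area can be checked directly. The sketch below computes ad − bc and compares it with the signed area of the transformed unit square, obtained with the shoelace formula. The shoelace check and the helper names are additions made here for illustration; the matrix [[1, 2], [2, 4]] is an assumed example of a collapsed transformation.

```python
def det2(M):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = M
    return a * d - b * c

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def signed_area(polygon):
    """Shoelace formula: positive if the corners run counterclockwise."""
    total = 0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2

UNIT_SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]

def image_area(M):
    """Signed area of the unit square after applying M."""
    return signed_area([matvec(M, c) for c in UNIT_SQUARE])

examples = {
    "rotation": [[0, -1], [1, 0]],
    "scale": [[2, 0], [0, 2]],
    "shear": [[1, 1], [0, 1]],
    "reflection": [[1, 0], [0, -1]],
    "collapsed": [[1, 2], [2, 4]],  # assumed example with parallel columns
}

for name, M in examples.items():
    print(name, det2(M), image_area(M))
```

For each matrix the two numbers agree: the sign records whether orientation flipped, and the magnitude records the area scale factor.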

💡 Insight

A matrix with determinant zero is called singular. It destroys information: multiple input vectors map to the same output, and the transformation cannot be reversed. In neural networks, a collapsed weight matrix would map all inputs into a lower-dimensional subspace, so inputs that differ only along the lost directions become indistinguishable to the network.
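A short sketch makes the information loss concrete. The singular matrix [[1, 2], [2, 4]] and the two input vectors are assumed examples chosen here; the point is only that two different inputs produce the same output, so no inverse could tell them apart.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def det2(M):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = M
    return a * d - b * c

# Assumed example: the second column is twice the first, so they are parallel
M = [[1, 2],
     [2, 4]]

x1 = [1, 1]
x2 = [3, 0]  # x1 shifted by (2, -1), a direction that M sends to zero

y1 = matvec(M, x1)
y2 = matvec(M, x2)

print(det2(M))  # 0
print(y1, y2)   # [3, 6] [3, 6]
```

Both outputs lie on the line spanned by the column (1, 2), and from the output alone there is no way to decide which input produced it.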

The determinant will appear again in Chapter 3 of this topic — eigenvectors — where it determines whether a transformation stretches or flips space.

To see the column interpretation applied to a specific transformation: the 90° rotation matrix [[0, −1], [1, 0]] has columns (0, 1) and (−1, 0). Column 1 tells you where e₁ = (1, 0) lands: at (0, 1) — the unit vector that was pointing right now points up. Column 2 tells you where e₂ = (0, 1) lands: at (−1, 0) — the unit vector that was pointing up now points left. Together, the entire plane has rotated 90° counterclockwise. The unit square stays the same size and shape — only its orientation changes, consistent with determinant = 1.

Four canonical matrix transformations. Each panel shows the unit square before (ghost outline) and after (filled parallelogram) the transformation. Column arrows show where e₁ and e₂ went — the column interpretation applied to each case.

In neural networks

Real World

Every linear layer in a neural network is matrix multiplication. Given an input vector x of dimension n, a linear layer applies a weight matrix W of shape (m × n) to produce an output y = Wx of dimension m.

The weight matrix W is not hand-crafted — it is learned during training. Each column of W encodes what direction in the output space a particular input dimension maps to. When a transformer model converts a 768-dimensional token embedding into a 3072-dimensional hidden state in a feed-forward sublayer, it is applying a (3072 × 768) weight matrix — a linear transformation from one vector space into a higher-dimensional one. The learned columns point toward the directions the network found useful to distinguish tokens.
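The shapes involved can be illustrated with a minimal sketch in plain Python. The weights here are random placeholders standing in for learned values, and the function name linear_layer, the seed, and the index j are assumptions made for this example; a real model would use an optimized library rather than nested lists.

```python
import random

def linear_layer(W, x):
    """Compute y = Wx: one dot product per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_in, n_out = 768, 3072  # dimensions quoted in the text

# Placeholder weights and input (random, not learned)
W = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]

y = linear_layer(W, x)
print(len(W), len(W[0]))  # 3072 768
print(len(y))             # 3072

# Column j of W is the output produced by the j-th basis vector
j = 5
e_j = [0.0] * n_in
e_j[j] = 1.0
column_j = [row[j] for row in W]
image_of_e_j = linear_layer(W, e_j)
```

Feeding in a basis vector e_j returns column j of W, which is the higher-dimensional version of reading the columns of a 2×2 matrix.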

Common Mistake

Matrix multiplication is not element-wise. For a 2×2 matrix M and vector v, the first output component is a·v_x + b·v_y, which mixes both input components; it is not simply a·v_x. Element-wise multiplication of two same-shaped matrices (the Hadamard product) is a separate operation with its own notation. When you see Mv in a neural network context, it means the full row-column dot product defined above.
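The difference between the two operations is easy to see side by side. This sketch uses two illustrative 2×2 matrices A and B (values chosen here) and compares their Hadamard product with their ordinary matrix product.

```python
def hadamard(A, B):
    """Element-wise product of two matrices with the same shape."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]

def matmul(A, B):
    """Matrix product: each entry is a row of A dotted with a column of B."""
    cols_b = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_b]
            for row in A]

# Illustrative values (assumptions for this sketch)
A = [[2, 1],
     [0, 3]]
B = [[1, 2],
     [3, 4]]

elementwise = hadamard(A, B)
product = matmul(A, B)

print(elementwise)  # [[2, 2], [0, 12]]
print(product)      # [[5, 8], [9, 12]]
```

Only the matrix product mixes entries across rows and columns, which is the behaviour a linear layer relies on.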

The same four transformations are also used in data augmentation during training. A standard image preprocessing pipeline might randomly rotate each training image by ±15°, scale by 0.9–1.1×, apply a small shear, and flip horizontally — each of these is a matrix applied to every pixel’s coordinates. This teaches the model to recognise objects regardless of their geometric orientation, without collecting new training data.
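As a rough sketch of the idea, the code below applies a rotation, a uniform scale, a shear, and a horizontal flip in turn to one pixel coordinate. All parameter values and function names are assumptions made for this example. Production augmentation code typically also recentres the coordinates on the image and resamples pixel values, which this sketch leaves out.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rotation(angle_deg):
    """Counterclockwise rotation by angle_deg degrees."""
    t = math.radians(angle_deg)
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t), math.cos(t)]]

def uniform_scale(s):
    return [[s, 0.0], [0.0, s]]

def horizontal_shear(k):
    return [[1.0, k], [0.0, 1.0]]

# Horizontal flip: (x, y) -> (-x, y), a reflection across the y-axis
HORIZONTAL_FLIP = [[-1.0, 0.0], [0.0, 1.0]]

def augment_point(point, angle_deg, scale, shear, flip):
    """Apply rotation, scale, shear, and an optional flip, in that order."""
    p = matvec(rotation(angle_deg), point)
    p = matvec(uniform_scale(scale), p)
    p = matvec(horizontal_shear(shear), p)
    if flip:
        p = matvec(HORIZONTAL_FLIP, p)
    return p

# Hypothetical pixel coordinate and augmentation parameters
corner = [10.0, 5.0]
moved = augment_point(corner, angle_deg=15.0, scale=1.1, shear=0.05, flip=True)
print(moved)
```

In a training pipeline the parameters would be drawn at random for each image; they are fixed here so the steps are easy to follow.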

In this section

What does matrix-vector multiplication do geometrically?

Multiplying a 2×2 matrix M by a vector v rotates, scales, shears, or reflects v to produce a new vector. The specific transformation depends on the entries of M. Geometrically, the matrix acts on the entire plane: every point gets mapped to a new point according to the same linear rule.

What do the columns of a matrix represent?

The first column of a 2×2 matrix is where the basis vector e₁ = (1, 0) lands after the transformation. The second column is where e₂ = (0, 1) lands. Reading the columns tells you the complete geometry of what the matrix does to space — you can reconstruct the full transformation from the two column vectors alone.

How is a rotation matrix different from a scaling matrix?

A rotation matrix preserves the length of every vector and the angle between any two vectors; it only changes direction. Its determinant is exactly 1, so area is unchanged. A scaling matrix stretches or compresses vectors along one or both axes. It changes lengths, and it preserves angles only when the scale factors are equal. Its determinant equals the product of the scale factors, which measures how much area changes.
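These claims about lengths and angles can be tested numerically. The sketch below compares a 30° rotation, a uniform 2× scale, and a non-uniform scale; the angle, the scale factors, and the test vectors are illustrative choices made here.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def length(v):
    return math.hypot(v[0], v[1])

def angle_deg(u, w):
    """Angle between two nonzero 2D vectors, in degrees."""
    dot = u[0] * w[0] + u[1] * w[1]
    return math.degrees(math.acos(dot / (length(u) * length(w))))

def rotation(deg):
    t = math.radians(deg)
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t), math.cos(t)]]

R = rotation(30)                          # rotation by 30 degrees
S_uniform = [[2.0, 0.0], [0.0, 2.0]]      # equal scale factors
S_nonuniform = [[3.0, 0.0], [0.0, 1.0]]   # unequal scale factors

v = [3.0, 4.0]                  # length 5
u, w = [1.0, 0.0], [1.0, 1.0]   # 45 degrees apart

for name, M in [("rotation", R), ("uniform", S_uniform), ("non-uniform", S_nonuniform)]:
    new_length = length(matvec(M, v))
    new_angle = angle_deg(matvec(M, u), matvec(M, w))
    print(name, round(new_length, 4), round(new_angle, 4))
```

The rotation leaves both the length and the angle as they were, the uniform scale doubles the length but keeps the angle, and the non-uniform scale changes both.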

Key Terms

Matrix · Linear Transformation · Determinant

◎ Intuition

A 2×2 matrix [[a, b], [c, d]] starts as the identity: a=1, b=0, c=0, d=1. This is the transformation that changes nothing — every vector stays where it is. Now imagine slowly increasing b from 0 to 1 while keeping a, c, and d fixed. Entry b is the x-component of the second column — it controls where the basis vector e₂ = (0, 1) lands horizontally. At b=0, e₂ points straight up. At b=1, e₂ tilts to the right. Before interacting, predict: as b increases from 0 to 1, what happens to the unit square? Does it rotate? Does it stretch? Does one edge move while another stays fixed? Which edge moves and which stays put?

↺ Reflection

Columns and the Geometry of Transformation

The rotation matrix [[0, −1], [1, 0]] has columns (0, 1) and (−1, 0). Column 1 says e₁ = (1, 0) lands at (0, 1) — pointing up. Column 2 says e₂ = (0, 1) lands at (−1, 0) — pointing left. Every vector rotates 90° counterclockwise. The shape of the unit square is preserved exactly — it becomes a congruent square, just rotated. The shear matrix [[1, 1], [0, 1]] has columns (1, 0) and (1, 1). Column 1 says e₁ stays at (1, 0). Column 2 says e₂ moves to (1, 1). The square tilts: its top edge slides rightward while the bottom stays fixed. Area is unchanged, but shape is not.
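The shear can also be traced step by step as the entry b grows. This sketch (the helper names and the sampled values of b are chosen here for illustration) lists the unit-square corners under [[1, b], [0, 1]] for a few values of b, along with the determinant.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def det2(M):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = M
    return a * d - b * c

def shear(b):
    """Identity matrix with the top-right entry replaced by b."""
    return [[1, b], [0, 1]]

# Corners in order: bottom-left, bottom-right, top-right, top-left
UNIT_SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]

def sheared_square(b):
    """Images of the unit-square corners under the shear with parameter b."""
    return [tuple(matvec(shear(b), corner)) for corner in UNIT_SQUARE]

for b in (0.0, 0.5, 1.0):
    print(b, sheared_square(b), det2(shear(b)))
```

The two bottom corners never move, the two top corners slide right by exactly b, and the determinant stays at 1 throughout.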

The determinant ad − bc measures how much the transformation scales area. For the rotation matrix [[0, −1], [1, 0]], det = 0·0 − (−1)·1 = 1 — area is multiplied by 1, unchanged. For the uniform scale matrix [[2, 0], [0, 2]], det = 2·2 − 0·0 = 4 — a unit square of area 1 becomes a square of area 4. For the shear matrix [[1, 1], [0, 1]], det = 1·1 − 1·0 = 1 — a parallelogram with exactly the same area as the original square.

When the determinant reaches zero (that is, when ad = bc), the columns become parallel. The transformation collapses the entire 2D plane onto a single line through the origin (or, if every entry is zero, onto the origin itself). Every input vector, regardless of direction, maps to a point on that one line, losing a full dimension of information. A matrix with determinant zero is singular: it cannot be inverted, because you cannot recover the original 2D position from a 1D output.

In a neural network, a linear layer with weight matrix W applies the transformation y = Wx. The columns of W are vectors in the output space: each column encodes where one input dimension gets sent. During training, gradient descent adjusts these columns so that directions the network finds informative are preserved or amplified. A (3072 × 768) weight matrix in a transformer’s feed-forward layer has 768 columns — each is a direction in the 3072-dimensional output space that a particular input dimension of the 768-dimensional token embedding maps to. The transformation is learned geometry.

Key Points

The first column [a, c] of a 2×2 matrix shows where e₁ = (1, 0) lands. The second column [b, d] shows where e₂ = (0, 1) lands. Reading the columns is the most direct way to understand what a matrix does to space.

The determinant ad − bc measures the factor by which the transformation scales area. A rotation matrix has determinant 1 — area is preserved. A 2× uniform scale has determinant 4 — area quadruples. A shear matrix has determinant 1 — area is preserved, but shape changes.

When the determinant equals zero, the two columns are parallel. The transformation collapses the entire plane onto a single line through the origin, losing a full dimension of information. A matrix with determinant zero is called singular.

A linear layer in a neural network applies a weight matrix to rotate and stretch the input embedding. Each column of the weight matrix encodes the direction in the output space that a particular input dimension maps to — the learned geometry of what the network found useful to preserve.

✓ Checkpoint

Check Your Understanding

Four questions on matrix transformations. Select an answer, then reveal to see the explanation.

A 2×2 matrix has columns (3, 0) and (0, 2). What does this matrix do to the unit square?

True or false: matrix-vector multiplication is commutative, so Mv = vM for any 2D vector v and 2×2 matrix M.

Put the following steps in the correct order to apply the 2×2 matrix M = [[a, b], [c, d]] to the 2D vector v = (vₓ, vᵧ):

1. Write the result as the output vector: the two computed components assembled
2. Compute the second output component: dot the second row (c, d) with the input (vₓ, vᵧ) to get c·vₓ + d·vᵧ
3. Identify the matrix entries a, b, c, d and the input vector components vₓ and vᵧ
4. Compute the first output component: dot the first row (a, b) with the input (vₓ, vᵧ) to get a·vₓ + b·vᵧ

A linear layer in a neural network has a weight matrix W of shape (512 × 768). The input is a 768-dimensional vector. What is the dimension of the output?