Why do deep networks need nonlinear activations?

The composition of any two linear transformations is itself a linear transformation. If W₂ and W₁ are both matrices, then W₂(W₁x) = (W₂W₁)x, where W₂W₁ is a single matrix. No matter how many linear layers you stack, the entire network collapses to a single matrix multiplication — equivalent to a network with one layer. Nonlinear activations such as ReLU break this collapse by making each layer's output a nonlinear function of its input.

Matrix Multiplication as Composition

Multiplying two matrices chains their transformations — AB means apply B first, then A

Every forward pass through a multi-layer neural network chains matrix multiplications, one per layer. Nonlinear activations are required between layers because the composition of linear transformations is always a single linear transformation: depth without nonlinearity adds no expressive power.

Learning Objectives

1 Compute the product of two 2×2 matrices using the row-by-column rule.
2 Explain why AB ≠ BA in general by giving a geometric example where the order of rotation and shear matters.
3 Explain why stacking linear layers in a neural network without nonlinear activations collapses to a single linear transformation.

¶ Narrative

Chaining Transformations

A single matrix transforms space once. Two matrices in sequence transform it twice. The question is how to combine the two operations into one — and the answer is matrix multiplication.

If B maps vectors to intermediate positions and A then maps those intermediate positions to final positions, the combined effect is written AB. Applied to a vector v, the expression ABv means B acts first: compute Bv, then apply A to the result. The product AB is a single matrix that does both operations in sequence without storing the intermediate result.
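The claim that ABv equals A(Bv) can be checked numerically. The following is a minimal pure-Python sketch, assuming 2×2 matrices stored as nested lists; the helper names `matvec` and `matmul` and the sample matrices are illustrative choices, not part of the lesson.

```python
def matvec(M, v):
    """Multiply a 2x2 matrix (nested lists) by a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def matmul(A, B):
    """Row-by-column product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Illustrative matrices (assumed for this sketch).
A = [[0, -1], [1, 0]]   # 90 degree counterclockwise rotation
B = [[2, 0], [0, 1]]    # horizontal stretch by 2
v = [1, 1]

two_steps = matvec(A, matvec(B, v))   # B acts first, then A
one_step = matvec(matmul(A, B), v)    # the single matrix AB does both

print(two_steps, one_step)
```

Both paths give the same vector, which is the sense in which AB packages the two operations into one.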

The row-by-column rule

Computing AB amounts to asking: what does A do to each intermediate vector that B can produce? The formula answers this question one entry at a time, using the dot product of each row of A with each column of B.

Matrix product (2×2)

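AB = \begin{pmatrix} a_{1} & b_{1} \\ c_{1} & d_{1} \end{pmatrix} \begin{pmatrix} a_{2} & b_{2} \\ c_{2} & d_{2} \end{pmatrix} = \begin{pmatrix} a_{1} a_{2} + b_{1} c_{2} & a_{1} b_{2} + b_{1} d_{2} \\ c_{1} a_{2} + d_{1} c_{2} & c_{1} b_{2} + d_{1} d_{2} \end{pmatrix}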
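The row-by-column rule translates almost directly into code. The sketch below assumes A has rows (a₁, b₁) and (c₁, d₁) and B has rows (a₂, b₂) and (c₂, d₂); the function names and the sample numbers are illustrative only.

```python
def matmul_2x2(A, B):
    """2x2 product written out entry by entry (row of A dot column of B)."""
    (a1, b1), (c1, d1) = A
    (a2, b2), (c2, d2) = B
    return [[a1 * a2 + b1 * c2, a1 * b2 + b1 * d2],
            [c1 * a2 + d1 * c2, c1 * b2 + d1 * d2]]

def matmul(A, B):
    """Same rule for any compatible shapes: entry (i, j) is row i of A dot column j of B."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Sample values chosen only for illustration.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_2x2(A, B))  # [[19, 22], [43, 50]]
print(matmul(A, B))      # same result from the general version
```

The general version is the one worth keeping; the written-out 2×2 version is there only to mirror the formula term by term.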

Matrix multiplication as sequential transformation. The unit square is first sheared by B, producing an intermediate parallelogram. Then A scales the result vertically. The final shape is identical to applying AB directly — no intermediate storage required.

As a concrete example, let A be a 2× vertical scale and B be a horizontal shear:

Setup: scale and shear

A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}

AB: shear first, then scale

AB = \begin{pmatrix} 1 \cdot 1 + 0 \cdot 0 & 1 \cdot 1 + 0 \cdot 1 \\ 0 \cdot 1 + 2 \cdot 0 & 0 \cdot 1 + 2 \cdot 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}

BA: scale first, then shear

BA = \begin{pmatrix} 1 \cdot 1 + 1 \cdot 0 & 1 \cdot 0 + 1 \cdot 2 \\ 0 \cdot 1 + 1 \cdot 0 & 0 \cdot 0 + 1 \cdot 2 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 0 & 2 \end{pmatrix}

AB ≠ BA. Shearing first and then scaling produces a different transformation than scaling first and then shearing.
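The scale-and-shear example can be reproduced in a few lines. This sketch reads the setup as A = [[1, 0], [0, 2]] (2× vertical scale) and B = [[1, 1], [0, 1]] (horizontal shear), and applies each product to the corners of the unit square; the helper names `matmul` and `transform` are assumptions made for illustration.

```python
def matmul(A, B):
    """Row-by-column product for nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transform(M, points):
    """Apply a 2x2 matrix to a list of (x, y) points."""
    return [(M[0][0] * x + M[0][1] * y, M[1][0] * x + M[1][1] * y)
            for x, y in points]

A = [[1, 0], [0, 2]]   # vertical scale by 2
B = [[1, 1], [0, 1]]   # horizontal shear

AB = matmul(A, B)      # shear first, then scale
BA = matmul(B, A)      # scale first, then shear

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(AB, transform(AB, square))
print(BA, transform(BA, square))
```

The bottom edge of the square stays put under both products, while the top edge lands at x from 1 to 2 under AB and x from 2 to 3 under BA, because in BA the shear acts on a square that is already twice as tall.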

Real World

Transformer attention weights each value vector by a learned function of query–key similarity: y = softmax(QKᵀ/√d)V. The product QKᵀ and the subsequent multiplication by V are chained in a fixed order; swapping them would produce an entirely different computation. Any neural network operation built from several matrix products is sensitive to their order.
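To make the order of operations in the attention formula concrete, here is a small pure-Python sketch of softmax(QKᵀ/√d)V for a single head. It is an illustration under simplifying assumptions (no masking, no batching, no learned projections), and the toy values of Q, K and V are invented for the example.

```python
import math

def matmul(A, B):
    """Row-by-column product for nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    """Softmax applied independently to each row."""
    out = []
    for row in M:
        m = max(row)                          # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    d = len(Q[0])                             # key/query dimension
    scores = matmul(Q, transpose(K))          # first product: Q K^T
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)            # each row sums to 1
    return matmul(weights, V)                 # second product: weights times V

# Toy inputs (assumed): 2 tokens, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Y = attention(Q, K, V)
print(Y)
```

Each output row is a weighted average of the rows of V, with the weights determined by the first product; the second product cannot be computed until the first is finished.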

Common Mistake

AB means B is applied first. Reading left to right, A appears before B — but B acts on the input first. The convention comes from matrix-vector notation: ABv = A(Bv). B is closest to the vector, so it acts first. When you say “apply A after B,” you write AB.

Order matters geometrically

The difference between AB and BA is not just algebraic — it is geometric. When you apply B first, A acts on an already-transformed shape. When you apply A first, B acts on a differently-transformed shape. The two paths through transformation space lead to different destinations, as the figure below shows directly.

AB vs BA using A = 90° rotation and B = horizontal shear. The left panel shows the unit square sheared then rotated (AB); the right shows it rotated then sheared (BA). The two results land in different regions of the plane — AB ≠ BA is geometric, not just algebraic.

Product	AB	BA
Verbal meaning	Apply B, then A	Apply A, then B
Acts on v as	A(Bv)	B(Av)
Equal to the other product in general?	No	No

Two matrices A and B commute (AB = BA) only in special cases, for example when both are pure scalings, or when one is the inverse of the other. In general, the order of matrix multiplication encodes the order of geometric operations.
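One of these special cases is easy to verify. The sketch below multiplies two pure scalings in both orders and then repeats the check with a scaling and a shear; the matrices and the helper name `matmul` are illustrative assumptions.

```python
def matmul(A, B):
    """Row-by-column product for nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

S1 = [[2, 0], [0, 3]]      # pure scaling
S2 = [[5, 0], [0, 7]]      # another pure scaling
shear = [[1, 1], [0, 1]]   # horizontal shear

scalings_commute = matmul(S1, S2) == matmul(S2, S1)
scale_shear_commute = matmul(S1, shear) == matmul(shear, S1)

print(scalings_commute)      # True: diagonal matrices multiply entry by entry
print(scale_shear_commute)   # False: [[2, 2], [0, 3]] vs [[2, 3], [0, 3]]
```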

The identity matrix

One matrix leaves everything unchanged: the identity matrix I, which has 1s on the main diagonal and 0s everywhere else.

Identity matrix

I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad MI = IM = M \text{ for any } 2 \times 2 \text{ matrix } M

The identity matrix is the starting point for understanding matrix inverses: if M⁻¹ exists, then M⁻¹M = MM⁻¹ = I. Geometrically, I is the transformation that does nothing — every vector maps to itself.
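Both properties can be checked with exact arithmetic. This sketch uses the standard closed-form inverse of a 2×2 matrix, which the lesson has not introduced and is brought in here as an assumption; `inverse_2x2`, `IDENTITY` and the sample matrix M are illustrative names and values.

```python
from fractions import Fraction

def matmul(A, B):
    """Row-by-column product for nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def inverse_2x2(M):
    """Closed-form inverse of [[a, b], [c, d]]: (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

IDENTITY = [[1, 0], [0, 1]]
M = [[1, 2], [3, 4]]           # sample matrix with determinant -2
M_inv = inverse_2x2(M)

print(matmul(M, IDENTITY) == M, matmul(IDENTITY, M) == M)   # MI = IM = M
print(matmul(M_inv, M) == IDENTITY, matmul(M, M_inv) == IDENTITY)
```

`Fraction` keeps the entries exact, so the comparison with the identity does not need a floating-point tolerance.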

The identity plays a structural role in deep learning beyond its algebraic definition. Residual networks (ResNets) and transformer blocks compute h = f(x) + x — the added skip connection is the identity applied to the input. Even when the learned function f collapses or saturates, the identity path preserves the gradient signal and the original representation. Batch normalisation initialises its learnable scale γ to 1 and shift β to 0, making the initial transform a near-identity — training then departs from that neutral starting point rather than from an arbitrary initialisation.

Real World

A transformer’s feed-forward block applies two linear layers with a nonlinear activation in between: y = W₂σ(W₁x). If σ were the identity function, then W₂(W₁x) = (W₂W₁)x, and the two layers would collapse into the single matrix W₂W₁. The network would be no more expressive than a single linear layer, regardless of how wide W₁ and W₂ are. The activation σ (ReLU, GELU, etc.) is what lets depth add capacity.

Without a nonlinearity, two linear layers are equivalent to one. The two-layer network x → W₁ → h → W₂ → y computes exactly the same function as the single-layer network x → W₂W₁ → y. Depth without nonlinearity adds no representational power.
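The collapse can be demonstrated with small integer weights. In this sketch the shapes (2 inputs, 3 hidden units, 2 outputs) and all weight values are made up for illustration; biases are omitted, as in the text.

```python
def matmul(A, B):
    """Row-by-column product for nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W1 = [[1, -2], [0, 3], [2, 1]]   # 2 inputs -> 3 hidden units
W2 = [[1, 0, -1], [2, 1, 0]]     # 3 hidden units -> 2 outputs

W = matmul(W2, W1)               # collapsed single layer, shape 2x2

x = [5, -4]
h = matvec(W1, x)                # hidden activations, no nonlinearity applied
y_two_layer = matvec(W2, h)
y_one_layer = matvec(W, x)

print(W)                         # [[-1, -3], [2, -1]]
print(y_two_layer, y_one_layer)  # identical outputs
```

The three-unit hidden layer leaves no trace in W: the collapsed network has only four weights, however wide the hidden layer was.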

In this section

Why does the order of matrix multiplication matter?

Matrix multiplication represents chained transformations, and the order in which you apply two geometric operations generally changes the result. Rotating a sheared shape produces a different configuration than shearing a rotated shape. Algebraically, the row-by-column computation mixes entries asymmetrically: entry (i, j) of AB depends on row i of A and column j of B, so swapping the matrices changes which rows and columns are dotted together.

What does AB mean geometrically?

AB means apply transformation B first, then apply transformation A to the result. The right-hand matrix acts first because matrix-vector multiplication is written Mv, so ABv = A(Bv) — B acts on v first, then A acts on the result. Reading left-to-right gives the reverse of the application order, which is the most common source of confusion with composition.

What is the identity matrix and why does it matter?

The identity matrix I has 1s on the diagonal and 0s elsewhere. Multiplying any vector by I leaves it unchanged. Multiplying any matrix M by I gives back M: MI = IM = M. The identity matrix is the multiplicative neutral element for matrices — the matrix equivalent of multiplying a number by 1. It also serves as the starting point for understanding matrix inverses.

Key Terms

Identity Matrix
Composition

◎ Intuition

Let A be a 90° counterclockwise rotation and B be a horizontal shear (the one that tilts the top edge rightward while leaving the bottom fixed). Before interacting with the playground, predict what AB looks like: shear first, then rotate. Now predict what BA looks like: rotate first, then shear. Sketch both in your head. Do you expect them to be the same shape? The same orientation? Once you have a prediction, use the playground to check.

Now consider a neural network with two linear layers and no activation between them: y = W₂(W₁x). What happens as you add a third linear layer W₃? Does the network become strictly more powerful? Think about what W₃(W₂W₁)x simplifies to before computing anything.

↺ Reflection

Order, Identity, and Collapse

Matrix multiplication encodes the composition of two linear transformations. The product AB represents B acting first, then A — because in the expression ABv, the vector v is closest to B, so B multiplies it first, and then A multiplies the intermediate result. This is the most common source of confusion with matrix products: the left-to-right reading order is the reverse of the geometric application order.

The non-commutativity of matrix multiplication has a direct geometric interpretation. Take A to be the 90° CCW rotation matrix [[0, −1], [1, 0]] and B to be the horizontal shear [[1, 1], [0, 1]]. Applying B first and then A (the product AB) shears the unit square into a parallelogram and then rotates that parallelogram. Applying A first and then B (the product BA) rotates the unit square and then shears the rotated square. The resulting shapes occupy different regions of the plane: AB = [[0, −1], [1, 1]] and BA = [[1, −1], [1, 0]] are different matrices with different columns. Their determinants, however, are equal: det(AB) = det(A)·det(B) = det(BA) = 1, so both parallelograms have the same area as the unit square.
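The rotation-and-shear products can be recomputed directly from the two matrices given above. This is a minimal sketch; the helper names `matmul` and `det` are illustrative. It also evaluates both determinants, which come out equal because det(AB) = det(A)·det(B) holds in either order.

```python
def matmul(A, B):
    """Row-by-column product for nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def det(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

A = [[0, -1], [1, 0]]   # 90 degree counterclockwise rotation
B = [[1, 1], [0, 1]]    # horizontal shear

AB = matmul(A, B)       # shear, then rotate
BA = matmul(B, A)       # rotate, then shear

print(AB)               # [[0, -1], [1, 1]]
print(BA)               # [[1, -1], [1, 0]]
print(det(AB), det(BA)) # 1 1
```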

The identity matrix I = [[1, 0], [0, 1]] is the geometric no-op: it maps every vector to itself — Iv = v for all v. Multiplying any matrix M by I on either side returns M unchanged: MI = IM = M. This makes I the multiplicative neutral element for matrix multiplication, analogous to the number 1 for scalar multiplication. The identity serves as the reference point for matrix inverses: a matrix M is invertible if and only if there exists M⁻¹ such that M⁻¹M = MM⁻¹ = I.

The practical consequence for neural networks is stark. Two linear layers with weight matrices W₁ and W₂ compute y = W₂(W₁x) = (W₂W₁)x. The parenthesized product W₂W₁ is itself a matrix — a single linear transformation. No matter how many linear layers you stack, the composition is always expressible as one matrix multiplication. The entire depth of the network collapses. The only escape is a nonlinear function between layers: once σ is nonlinear, y = W₂σ(W₁x) cannot be written as a single matrix-vector product, and depth genuinely increases what the network can represent.
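A tiny example shows how a nonlinearity escapes the collapse. Every linear map g satisfies g(u + v) = g(u) + g(v), so a single counterexample to that identity proves a function is not a matrix-vector product. The sketch below uses a hypothetical one-input network with two hidden units, with weights chosen so that the ReLU version computes |x|.

```python
def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    """Elementwise max(0, x)."""
    return [max(0, x) for x in v]

# Hypothetical weights: 1 input -> 2 hidden units -> 1 output.
W1 = [[1], [-1]]
W2 = [[1, 1]]

def linear_net(x):
    return matvec(W2, matvec(W1, x))        # collapses to (W2 W1) x = 0 * x

def relu_net(x):
    return matvec(W2, relu(matvec(W1, x)))  # relu(x) + relu(-x) = |x|

u, v = [1], [-1]
u_plus_v = [u[0] + v[0]]

print(linear_net(u), linear_net(v), linear_net(u_plus_v))  # [0] [0] [0]
print(relu_net(u), relu_net(v), relu_net(u_plus_v))        # [1] [1] [0]
```

Here relu_net(u) + relu_net(v) is 2 while relu_net(u + v) is 0, so no single matrix can reproduce the ReLU network, whereas the same weights without the activation reduce to the zero map.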

Key Points

AB means apply B first, then A. The right-hand matrix acts first because ABv = A(Bv). Reading the product left-to-right gives the reverse of the application order.

Matrix multiplication is not commutative in general. A concrete geometric case: let A be a 90° CCW rotation and B be a horizontal shear. AB shears first and then rotates; BA rotates first and then shears. The two transformed squares land in different positions, so AB ≠ BA.

The identity matrix I leaves every vector unchanged: Iv = v. It is the multiplicative neutral element: MI = IM = M for any matrix M. It plays the same role as 1 in ordinary multiplication.

Stacking linear layers in a neural network without nonlinear activations collapses to a single linear layer. W₂(W₁x) = (W₂W₁)x, and W₂W₁ is just another matrix. Depth without nonlinearity adds no expressive power.

✓ Checkpoint

Check Your Understanding

Four questions on matrix multiplication and composition.

To apply transformation B to a vector v and then apply transformation A to the result, which expression is correct?

True or false: matrix multiplication is commutative, that is, AB = BA for any two 2×2 matrices A and B.

A neural network has two linear layers with weight matrices W₁ and W₂, and no activation function between them. Why does this have the same expressive power as a single linear layer?

Let A = [[2, 0], [1, 3]] and B = [[1, 4], [2, 1]]. What is the entry in row 2, column 1 of the product AB?