Why does the backward pass run in reverse order?

Because each local derivative is evaluated at a value produced during the forward pass, and the gradient of the output has to be carried from the output back toward the input. To differentiate the composition y = f(g(h(x))), write u = h(x) and v = g(u). You compute dy/dv first (from f), then multiply by dv/du (from g), then by du/dx (from h). That is the reverse of the order x → h → g → f → y in which the values were computed. Backpropagation exploits this structure by storing intermediate activations during the forward pass, then consuming them in reverse to compute local derivatives.

Derivatives · Chapter 3 of 3

The Chain Rule — How Backpropagation Works

Differentiating compositions — the mechanism behind every training update in neural networks

Every time you call .backward() in PyTorch or tape.gradient() in TensorFlow, the framework executes the chain rule through a computation graph. The partial derivative that flows back to each weight is a product of many local derivatives, one per layer. When those derivatives are repeatedly small numbers, their product can shrink to nearly zero, leaving early weights with no useful training signal. That is the vanishing gradient problem, and it is a major reason transformers and residual networks were designed the way they were.

Learning Objectives

1 Apply the chain rule to a two-function composition y = f(g(x)) by computing the inner derivative du/dx, the outer derivative dy/du, and multiplying them to obtain dy/dx.
2 Trace the chain rule through a two-layer neural network, identifying which partial derivative is computed at each layer during the backward pass and explaining why the order reverses relative to the forward pass.
3 Explain why vanishing gradients occur in deep networks: when many local derivatives are less than 1, their product shrinks exponentially, leaving early weights with an effectively zero training signal.

¶ Narrative

Composition, the Chain Rule, and Backpropagation

Functions inside functions

Most functions in machine learning are compositions. A sigmoid activation takes a linear transformation as its input. A loss function takes a network output as its input. A network output is itself the result of ten or twenty composed transformations. To train any of these systems, you need to differentiate through the composition — that is exactly what the chain rule does.

Think of it this way: if a tiny nudge Δx at the input gets scaled by a factor of 3 through the inner function, and then that intermediate result gets scaled by a factor of 2 through the outer function, the total effect on the output is 3 × 2 = 6 times the original nudge. The chain rule formalises this: the total sensitivity of the output to the input is the product of the sensitivities at each stage. Compose two functions, multiply their individual derivatives.

A nudge entering the composition gets scaled by the inner stage's derivative, then scaled again by the outer stage's derivative. The total scaling is their product — that is the chain rule, made visible.

Before writing the general rule symbolically, let us see the same pattern with concrete numbers — the formula will then just summarise what we already observed.

A worked example

Consider a single neuron: the input x is multiplied by weight w, then passed through the sigmoid activation. The output is y = σ(wx).

This is a composition. The inner function is g(x) = wx. The outer function is f(u) = σ(u). Applying the chain rule:

\frac{dy}{dx} = \underbrace{\sigma(wx)\,\bigl(1 - \sigma(wx)\bigr)}_{\text{outer: } f'(g(x))} \cdot \underbrace{w}_{\text{inner: } g'(x)}

The inner derivative g′(x) = w — the weight itself. The outer derivative f′(u) = σ(u)(1 − σ(u)) — the sigmoid’s own derivative, evaluated at the inner function’s output.

With specific numbers: let w = 2, x = 1. Then wx = 2, σ(2) ≈ 0.880, and σ(2)(1 − σ(2)) ≈ 0.105. The chain rule gives dy/dx ≈ 0.105 × 2 = 0.210.
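The arithmetic above can be checked numerically. The following is a minimal sketch in plain Python; the names `sigmoid` and `neuron_derivative` are assumptions introduced for this illustration and do not come from any library. It computes the chain-rule product and compares it with a central finite difference.

```python
import math

def sigmoid(u):
    """Logistic sigmoid: 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def neuron_derivative(w, x):
    """dy/dx for y = sigmoid(w * x), computed as outer * inner."""
    u = w * x                                  # inner function output g(x)
    outer = sigmoid(u) * (1.0 - sigmoid(u))    # f'(u), evaluated at u = w * x
    inner = w                                  # g'(x)
    return outer * inner

w, x = 2.0, 1.0
analytic = neuron_derivative(w, x)

# Independent check: central finite difference of y = sigmoid(w * x).
eps = 1e-6
numeric = (sigmoid(w * (x + eps)) - sigmoid(w * (x - eps))) / (2 * eps)

print(round(analytic, 3))               # 0.21
print(abs(analytic - numeric) < 1e-6)   # True
```

The finite difference knows nothing about the chain rule, so agreement between the two values is evidence that the outer-times-inner product is the right derivative.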

💡 Insight

Notice that the sigmoid derivative σ(u)(1 − σ(u)) is at most 0.25: it is maximised at u = 0 and shrinks toward zero for large |u|. This means every neuron using sigmoid activation contributes at most a factor of 0.25 to the chain rule product. In a deep network with ten sigmoid layers, the activation derivatives alone scale the gradient at the first layer by at most 0.25¹⁰ ≈ 10⁻⁶. This is the vanishing gradient problem, covered in full under "Vanishing gradients" later in this chapter.
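Both figures in this insight are easy to confirm. The sketch below, in plain Python, scans a grid of inputs for the largest sigmoid derivative and then raises 0.25 to the tenth power. The grid range and step size are arbitrary choices made for this illustration.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def sigmoid_grad(u):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(u)
    return s * (1.0 - s)

# Scan inputs from -10 to 10 in steps of 0.01 (an arbitrary grid).
grid = [i / 100.0 for i in range(-1000, 1001)]
peak_u = max(grid, key=sigmoid_grad)
peak_value = sigmoid_grad(peak_u)

print(peak_u, peak_value)   # 0.0 0.25
print(0.25 ** 10)           # about 9.5e-07
```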

The chain rule, stated formally

With the worked neuron in hand, the general rule is a tidy summary. If $y = f (g (x))$ , then

\frac{d y}{d x} = \frac{d y}{d u} \cdot \frac{d u}{d x}, where u = g (x)

The outer derivative $d y / d u$ is evaluated at $u = g (x)$ , not at $x$ . The inner derivative $d u / d x$ is evaluated at $x$ . Their product gives the total rate of change of $y$ with respect to $x$ — exactly what the sigmoid-neuron calculation just produced (outer × inner = 0.105 × 2 = 0.210).

For completeness: you can verify the chain rule on pure-math functions the same way. For y = sin(x²), the inner derivative is 2x and the outer derivative is cos(x²), giving dy/dx = cos(x²) · 2x. The mechanics are identical — only the functions differ.

Neural networks are long compositions

A two-layer network computes:

\hat{y} = f_{2} (W_{2} f_{1} (W_{1} x))

This is $f_{2} \circ \text{linear}_{2} \circ f_{1} \circ \text{linear}_{1}$ applied to $x$. To adjust $W_{1}$, the first layer’s weights, you need $\partial L / \partial W_{1}$, the derivative of the loss all the way back to the first layer. The chain rule provides it:

\frac{\partial L}{\partial W_{1}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_{1}}

where $h = f_{1} (W_{1} x)$ is the hidden layer output. Each factor is the local derivative at one stage of the forward computation.

Forward pass: values flow left to right through the network, computing each intermediate result. Backward pass: gradients flow right to left, each multiplied by the local derivative at that stage. The gradient at W₁ is the product of all local derivatives along the backward path.

Backpropagation in plain English

Backpropagation is the chain rule applied systematically:

1. Forward pass: run the input through the network, recording each intermediate value.
2. Start at the loss: compute $\partial L / \partial \hat{y}$, how much the loss changes per unit change in the prediction.
3. Propagate backward: at each layer, multiply the incoming gradient by the local derivative of that layer’s output with respect to its input.
4. Accumulate weight gradients: at each layer, also compute the local derivative with respect to the weights. This gives $\partial L / \partial W_{ℓ}$ for every layer $ℓ$.

The backward pass visits layers in the reverse order of the forward pass — last layer first, first layer last — because the chain rule requires outer derivatives before inner ones.
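The procedure described above can be sketched for a chain of scalar layers. This is an illustrative toy in plain Python, not how any framework implements it. It assumes one weight per layer, a sigmoid after every layer (including the last), and a squared-error loss; the names `forward`, `backward`, and `loss`, and the weight values, are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    """Forward pass: run every layer and record what the backward pass needs."""
    cache = []                 # one (layer_input, pre_activation) pair per layer
    a = x
    for w in weights:
        z = w * a
        cache.append((a, z))
        a = sigmoid(z)
    return a, cache

def loss(x, weights, target):
    y_hat, _ = forward(x, weights)
    return (y_hat - target) ** 2

def backward(y_hat, target, weights, cache):
    """Backward pass: visit layers last to first, chaining local derivatives."""
    grad = 2.0 * (y_hat - target)            # start at the loss: dL/dy_hat
    weight_grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):  # last layer first
        a_in, z = cache[i]
        s = sigmoid(z)
        grad_z = grad * s * (1.0 - s)        # propagate through the activation
        weight_grads[i] = grad_z * a_in      # accumulate dL/dw for this layer
        grad = grad_z * weights[i]           # gradient handed to the layer below
    return weight_grads

weights = [0.5, -1.0, 2.0]    # illustrative values
x, target = 2.0, 1.0
y_hat, cache = forward(x, weights)
grads = backward(y_hat, target, weights, cache)
print(grads)
```

Each layer's gradient reuses the running product `grad` computed for the layers after it, which is why a single reverse sweep yields every weight gradient.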

A concrete backward pass

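Take a scalar version of the two-layer network with a sigmoid hidden unit, a linear output (f₂ is the identity), and squared-error loss. Let x = 2, W₁ = 0.5, W₂ = 1.2, and the target output be 1.0.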

Forward pass:

Inner activation input: W₁ · x = 0.5 × 2 = 1.0
Hidden unit: h = σ(1.0) ≈ 0.731
Network output: ŷ = W₂ · h = 1.2 × 0.731 ≈ 0.877
Loss: L = (ŷ − 1)² = (−0.123)² ≈ 0.015

Backward pass:

∂L/∂ŷ = 2(ŷ − 1) = 2(−0.123) = −0.246
∂ŷ/∂h = W₂ = 1.2
∂h/∂(W₁x) = σ(1)(1 − σ(1)) ≈ 0.731 × 0.269 ≈ 0.197
∂(W₁x)/∂W₁ = x = 2

Chaining the four local derivatives:

\frac{\partial L}{\partial W_{1}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial (W_{1} x)} \cdot \frac{\partial (W_{1} x)}{\partial W_{1}} = (-0.246)(1.2)(0.197)(2) \approx -0.116

The gradient −0.116 tells gradient descent to increase W₁ slightly (step in the negative gradient direction). Every weight in every layer receives an update computed by exactly this chain of multiplications — the chain rule applied from output to input.
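The numbers in this example can be reproduced in a few lines. The sketch below is plain Python and follows the same forward and backward steps. The learning rate `lr` and the helper `loss_at` are assumptions added for illustration; the source does not specify a step size.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w1, w2, target = 2.0, 0.5, 1.2, 1.0

# Forward pass, keeping every intermediate value.
z = w1 * x                      # 1.0
h = sigmoid(z)                  # about 0.731
y_hat = w2 * h                  # about 0.877
loss = (y_hat - target) ** 2    # about 0.015

# Backward pass: one local derivative per forward step, in reverse order.
dL_dy = 2.0 * (y_hat - target)
dy_dh = w2
dh_dz = h * (1.0 - h)
dz_dw1 = x
dL_dw1 = dL_dy * dy_dh * dh_dz * dz_dw1
print(round(dL_dw1, 3))         # -0.116

def loss_at(w1_value):
    """Loss as a function of W1 alone, with everything else held fixed."""
    return (w2 * sigmoid(w1_value * x) - target) ** 2

# One gradient descent step with an assumed learning rate.
lr = 0.1
w1_new = w1 - lr * dL_dw1
print(w1_new > w1, loss_at(w1_new) < loss_at(w1))   # True True
```

Because the gradient is negative, subtracting it raises W₁, and the loss at the updated weight is lower than before.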

Common Mistake

Backpropagation is sometimes presented as a clever trick discovered by researchers. It is not. It is the direct application of the chain rule to a computation graph. Any software library that can represent a neural network as a graph of differentiable operations can implement backpropagation automatically — this is what PyTorch’s autograd and TensorFlow’s gradient tape do.

Vanishing gradients

The chain rule multiplies derivatives together. If each local derivative has magnitude less than one, their product shrinks with each additional layer. If each of ten layers contributes a local derivative of about 0.1, the gradient at the first layer is of order $0.1^{10} = 10^{-10}$, which is effectively zero. The first layers stop receiving any training signal.

⚠️ Warning

The sigmoid function’s derivative peaks at 0.25. In a ten-layer network, the gradient passing through ten sigmoid layers is multiplied by ten factors, each at most 0.25, so the activations alone scale it by at most $0.25^{10} \approx 10^{-6}$. In practice the values are often smaller. Early layers in deep sigmoid networks train extraordinarily slowly or not at all. This was the central obstacle in deep learning until ReLU activations, residual connections, and careful initialisation schemes became standard.

ReLU’s derivative is exactly 1 for positive inputs and 0 for negative — never a fraction, so the multiplicative shrinkage that compounds through sigmoid layers does not occur. Residual connections add a skip path that carries the gradient directly from later layers to earlier ones, bypassing the multiplication chain entirely: even if the intermediate layers have small local derivatives, the skip path delivers the gradient undiminished.

A gradient entering from the right is multiplied by the local derivative at each layer. With sigmoid activations (local derivative ≤ 0.25), the gradient reaches the first layer as a sliver. With ReLU activations (local derivative = 1), it passes through unchanged.
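The comparison can be made concrete with a short calculation. The sketch below, in plain Python, multiplies ten local derivatives for four cases: sigmoid at its best case, ReLU with every unit active, a plain chain with small local derivatives, and the same chain with a skip path. It deliberately ignores weight matrices, and the value 0.01 for a "small" local derivative is an assumption chosen only for illustration.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chain_product(local_grads):
    """Product of local derivatives along one backward path."""
    product = 1.0
    for g in local_grads:
        product *= g
    return product

depth = 10

# Sigmoid at z = 0, where its derivative is at the 0.25 peak.
sigmoid_chain = chain_product([sigmoid_grad(0.0)] * depth)

# ReLU with every unit active (positive input).
relu_chain = chain_product([relu_grad(1.0)] * depth)

# A layer whose local derivative is small (assumed value).
small = 0.01
plain_chain = chain_product([small] * depth)

# Residual block: out = a + f(a), so d(out)/da = 1 + f'(a).
residual_chain = chain_product([1.0 + small] * depth)

print(sigmoid_chain)    # about 9.5e-07
print(relu_chain)       # 1.0
print(plain_chain)      # about 1e-20
print(residual_chain)   # about 1.10
```

The identity term contributed by the skip path keeps each factor near 1, so the product no longer collapses even when the layer's own derivative is tiny.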

📖 History

The chain rule as a rule of calculus is due to Leibniz (1676). Its application to training neural networks was formalised by Rumelhart, Hinton, and Williams in their 1986 paper “Learning representations by back-propagating errors” in Nature. They did not invent backpropagation — Werbos had described the idea in his 1974 PhD thesis — but the 1986 paper brought it to widespread attention and established the vocabulary still in use today.

Real World

Modern deep learning frameworks (PyTorch, JAX, TensorFlow) implement backpropagation via automatic differentiation. You define a computation graph by writing forward-pass code; the framework records each operation and its local derivative. Calling .backward() or grad() traverses the graph in reverse, applying the chain rule at each node. You rarely need to write a backward pass by hand.
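To make the record-then-traverse idea tangible, here is a toy reverse-mode automatic differentiation class in plain Python. It is a teaching sketch, not the implementation used by PyTorch, JAX, or TensorFlow. The class name `Value` and its methods are hypothetical, and only the few operations needed for the scalar example from "A concrete backward pass" are supported.

```python
import math

class Value:
    """A scalar that remembers which operation produced it (toy autodiff node)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents            # inputs of the op that made this value
        self.local_grads = local_grads    # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, (self,), (s * (1.0 - s),))

    def backward(self):
        # Order the graph so each node appears after everything it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk the graph in reverse, applying the chain rule at each node.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local

# Same numbers as the concrete backward pass: x = 2, W1 = 0.5, W2 = 1.2, target 1.
x, w1, w2, target = Value(2.0), Value(0.5), Value(1.2), Value(1.0)
error = w2 * (w1 * x).sigmoid() - target
loss = error * error
loss.backward()
print(round(w1.grad, 3))   # -0.116
```

Only forward-pass code is written by the user; the gradient for `w1` falls out of the recorded local derivatives, matching the hand calculation.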

In this section

What is the chain rule?

If y = f(g(x)), the chain rule says dy/dx = (dy/du) · (du/dx), where u = g(x). In words: to differentiate a composition, multiply the derivative of the outer function (evaluated at the inner function's output) by the derivative of the inner function. The rule extends to any number of composed functions: if y = f(g(h(x))), then dy/dx = (dy/dv) · (dv/du) · (du/dx) where v = g(u) and u = h(x). Each arrow in the chain contributes a multiplicative factor.
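The three-function case can be checked the same way as the two-function one. In the sketch below the specific choices h(x) = x², g(u) = sin(u), and f(v) = eᵛ are assumptions made purely for illustration, as is the test point.

```python
import math

def composed(x):
    """y = f(g(h(x))) with h(x) = x^2, g(u) = sin(u), f(v) = exp(v)."""
    return math.exp(math.sin(x ** 2))

def composed_derivative(x):
    u = x ** 2               # u = h(x)
    v = math.sin(u)          # v = g(u)
    dy_dv = math.exp(v)      # f'(v), evaluated at v
    dv_du = math.cos(u)      # g'(u), evaluated at u
    du_dx = 2 * x            # h'(x), evaluated at x
    return dy_dv * dv_du * du_dx

x0 = 0.7                     # arbitrary test point
eps = 1e-6
numeric = (composed(x0 + eps) - composed(x0 - eps)) / (2 * eps)
print(abs(composed_derivative(x0) - numeric) < 1e-6)   # True
```

Each factor is evaluated at the value produced by the stage before it, which is why the intermediate values u and v must be available when the derivative is computed.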

What is backpropagation?

Backpropagation is the algorithm for computing the gradient of a neural network's loss function with respect to every weight in a single backward pass. It applies the chain rule layer by layer, starting at the loss and working backward to the input. At each layer it computes the local derivative of that layer's output with respect to its input, then multiplies by the gradient flowing in from the layer above. The result is the gradient with respect to that layer's weights — the number that tells gradient descent how to adjust those weights.

What are vanishing gradients?

Vanishing gradients occur when the product of local derivatives along a backpropagation chain becomes extremely small. If each layer's local derivative has magnitude less than 1 (for example, the sigmoid function's derivative peaks at 0.25), then multiplying ten such values gives a gradient of order 10⁻⁶ or smaller at the first layer. That layer's weights effectively receive no training signal and stop learning. Architectures like LSTMs and ResNets were designed specifically to counter this, and Transformers rely on residual connections for the same reason.

Key Terms

Chain Rule · Backpropagation · Vanishing Gradient

◎ Intuition

Before the deep-network case, consider a simpler scenario. One function doubles its input: a nudge of size 1 becomes a nudge of size 2. A second function triples its input. If you compose them, running the input through both, how much does the output change for a nudge of size 1 at the input? Write down your answer before continuing.

Suppose every layer in a ten-layer network has a local derivative of exactly 0.5 at the current point. The gradient that reaches the first layer is the product of ten factors of 0.5: approximately 0.001. Now suppose the local derivatives were 2 instead of 0.5. What happens to the gradient at the first layer? What does that imply for training, and what new problem might it create?

↺ Reflection

Chain Rule, Backprop, and Vanishing Gradients

The chain rule

The chain rule’s multiplicative structure means the total derivative of a composition is only as strong as its weakest link. If any one stage has a near-zero local derivative, the entire product collapses — regardless of what the other stages contribute. This is not a curiosity; it is the central constraint that determined which activation functions became standard in deep learning and which were abandoned.

Backpropagation

Backpropagation executes the chain rule backward through a computation graph. The forward pass stores intermediate activations. The backward pass visits each layer in reverse order, multiplying the incoming gradient by that layer’s local derivative and accumulating the weight gradient. No magic is involved — it is a systematic application of the chain rule.

Vanishing gradients

When local derivatives have magnitude less than 1, they multiply to produce very small numbers. Ten sigmoid layers, each contributing a derivative of at most 0.25, produce a chain of:

0.25 \times 0.25 \times \dots \times 0.25 = 0.25^{10} \approx 10^{-6}

The first layer receives a gradient near zero and its weights do not update. ReLU activations (derivative 1 or 0, never fractional), residual connections (which add a direct gradient path bypassing intermediate layers), and careful weight initialisation all exist specifically to keep gradients flowing through deep networks.

Key Points

The chain rule states dy/dx = (dy/du) · (du/dx) for a composition y = f(g(x)). It extends to any number of composed functions by multiplying the local derivative at each stage.

Backpropagation is the chain rule applied in reverse through a neural network. Starting from the loss, it multiplies accumulated gradients by local derivatives layer by layer, visiting layers in the reverse order of the forward pass.

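Vanishing gradients occur when many local derivatives have magnitude less than 1. In a 10-layer network with sigmoid activations, the activation derivatives alone cap the gradient factor at 0.25¹⁰ ≈ 10⁻⁶, and with local derivatives near 0.1 the gradient at the first layer falls to about 10⁻¹⁰, too small to drive any useful weight update.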

✓ Checkpoint

Check Your Understanding


If $g(x) = x^2$ and $f(u) = e^u$, what is $\frac{d}{dx}f(g(x))$ at $x = 2$?

Backpropagation computes gradients by applying the chain rule in the same forward order as the network's forward pass.

A 10-layer network has a derivative of 0.1 at every layer. Approximately what is the gradient magnitude at the first layer?

Arrange these steps in the correct order for computing the gradient ∂L/∂W₁ via backpropagation through a two-layer network.

1. Compute ∂L/∂ŷ, the derivative of the loss with respect to the network output
2. Run the forward pass and store the intermediate value h at the hidden layer
3. Multiply by ∂h/∂(W₁x), the local derivative at the first layer
4. Multiply by ∂ŷ/∂h, the local derivative at the second layer