Derivatives Chapter 2 of 3 · tap to browse
The Gradient — Derivative in Multiple Dimensions
The derivative extended to many variables — the engine of every training step
Every time you read that a model was 'trained for N GPU-hours', what was happening during those hours was gradient descent: computing the gradient of the loss with respect to every weight, then nudging every weight slightly in the direction the gradient says is downhill. The gradient is computed once per batch via backpropagation; the update is applied once per step. Millions of steps later, the weights have settled near a minimum of the loss surface.
- 1 Explain what a partial derivative ∂f/∂xᵢ measures: the rate of change of f when xᵢ changes and all other variables are held fixed.
- 2 Describe the gradient vector ∇f geometrically: it lives in the input space, points in the direction of steepest ascent of f, and has magnitude equal to the steepness of that ascent.
- 3 Apply the gradient descent update rule w_{t+1} = w_t − η∇L to move one step downhill on a loss surface, identifying the role of the learning rate η and explaining what happens when η is too large or too small.
From Slope to Direction
One variable → many variables
A function of one variable — f(x) = x² — has a single derivative at each point: one number measuring how steeply the output changes as the input increases. A function of two variables — L(w₁, w₂) = w₁² + w₂² — has no single “slope” because there are infinitely many directions you could move from any point.
To handle multiple variables, calculus fixes all but one variable at a time and asks: how does the output change as just this one input changes? The answer is a partial derivative.
Partial derivatives: the slope in one direction
Here L(w₁, w₂) = w₁² + w₂² is a toy loss function that depends on two weights. The partial derivative ∂L/∂w₁ answers: if you nudge weight w₁ alone while leaving w₂ unchanged, how much does the loss change?
The partial derivative ∂L/∂wᵢ measures the rate of change of L when wᵢ changes and all other variables are held fixed — the slope of the function in the wᵢ direction at a given point.Partial DerivativeThe rate of change of a function of several variables with respect to one variable, holding all others fixed. Written ∂f/∂xᵢ — the rounded ∂ signals that other variables are frozen. Geometrically it is the slope of the function in the xᵢ direction at a given point.At the point w = (1, 2), nudging w₁ changes the loss at rate 2 and nudging w₂ changes the loss at rate 4 — the loss is more sensitive to w₂ at this location.
For L(w₁, w₂) = w₁² + w₂², the two partial derivatives are:
The ∂ notation (a rounded d) signals that other variables are frozen. Computing ∂L/∂w₁ means treating w₂ as a constant and differentiating with respect to w₁ exactly as you would for a one-variable function.
The gradient: assembling all the partial derivatives
The gradient collects every partial derivative into a single vector.
The gradient of a scalar function L at a point w is the vector of all partial derivatives: ∇L(w) = (∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂wₙ). It lives in the same space as w — one component per input dimension.GradientThe vector of all partial derivatives of a scalar function: ∇f(x) = (∂f/∂x₁, …, ∂f/∂xₙ). The gradient lives in the input space, points in the direction of steepest ascent of f, and has magnitude equal to the rate of that ascent. Gradient descent moves in the negative gradient direction to minimise f.For a function of n variables — n weights — the gradient has n components, one partial derivative per weight. For our two-weight example:
At w = (1, 2), the gradient is (2, 4). This is the vector that points in the direction of steepest ascent on the loss surface from that position — the direction in which a small step increases the loss most rapidly.
What the gradient points at
The gradient has two geometric properties that make it useful:
- Direction: the gradient points in the direction of steepest ascent of L at that point.
Of all the directions you could step from your current position — adjusting w₁ alone, adjusting w₂ alone, or adjusting both at once in any combination — one direction increases the loss more steeply than any other. The gradient is that direction. Its length tells you how steep the steepest climb is: a large gradient magnitude means a steep surface; a gradient near zero means the surface is nearly flat. To descend — to reduce the loss — you step in the opposite direction.
- Magnitude: ‖∇L(w)‖ equals the steepness of that ascent — the rate at which L increases per unit step in the gradient direction.
Two consequences follow from this:
- The gradient is always perpendicular to the contour lines of L. Contour lines are paths of constant loss — moving along a contour produces no change in L. Since the gradient is the direction of maximum change, it must be perpendicular to the directions of zero change.
- The gradient at the minimum is zero in every direction — the loss surface is flat there. This is why gradient descent converges when ‖∇L‖ ≈ 0.
The gradient descent update
The gradient points uphill. To descend, move in the opposite direction.
The loss surface is the graph of the loss function L over the space of all model weights. Training a neural network is the problem of navigating this surface to find a point where the gradient is near zero — a (local) minimum of L.Loss SurfaceThe graph of a neural network's loss function over the space of all its weights. Training is the problem of navigating this high-dimensional surface to find a region where the gradient is near zero — a local minimum of the loss. Gradient descent traces a path on the loss surface by following the negative gradient at each step.The gradient descent update at step t is:
where η (eta) is the learning rate — a small positive scalar controlling the step size. Each step moves every weight slightly in the direction that reduces the loss.
A concrete step: Suppose the current weights are w = (3, −1), the gradient is ∇L = (4, −2), and the learning rate is η = 0.1.
Weight w₁ decreased from 3 to 2.6 — the gradient was positive there, meaning the loss was rising in that direction, so gradient descent pushed back. Weight w₂ increased from −1 to −0.8 — the gradient was negative there, meaning the loss was falling in that direction, so gradient descent followed it. The loss at the new point is lower than at the old point.
Why the learning rate matters
The learning rate controls how much to trust the gradient at any given point:
- Too small: convergence is correct but slow — many steps to reach the minimum.
- Too large: the step overshoots the minimum. The loss may increase on the next step, causing oscillation or divergence.
- Adaptive: optimisers like Adam automatically scale η per parameter, making the effective step size smaller for parameters with large, consistent gradients and larger for parameters with small or noisy ones.
For concreteness: if η = 0.01 and ‖∇L‖ = 4, the step magnitude is 0.04 — a careful nudge. If η = 1.0 and ‖∇L‖ = 4, the step magnitude is 4.0 — which could move the weights across the entire loss landscape in a single step, almost certainly overshooting any minimum.
A learning rate that is too large does not just slow down training — it can cause the loss to increase. If the step size is larger than the region where the local gradient is a good linear approximation of the loss, the update lands somewhere worse than where it started. This is why learning rate warmup (starting small and increasing gradually) is standard in large-scale training.
Why this scales to millions of dimensions
A large language model may have hundreds of millions to hundreds of billions of weights. Each training step computes the gradient of the loss with respect to every one of those weights — one partial derivative per weight — and applies the update w ← w − η∇L in a single vectorised operation. The mathematics is identical to the two-weight example above; only the dimension changes.
Gradient descent was first formalised by Augustin-Louis Cauchy in 1847, who used it to solve systems of equations numerically. The method sat largely dormant for over a century before Frank Rosenblatt applied a version of it to train the Perceptron in 1958. It became central to machine learning only when backpropagation made computing gradients through multi-layer networks practical — a development formalised by Rumelhart, Hinton, and Williams in 1986.
What is a partial derivative?
A partial derivative ∂f/∂xᵢ measures the rate of change of f when the single variable xᵢ changes and all other variables are held fixed. Geometrically, it is the slope of the function in the xᵢ direction — the derivative you would compute if you stood at a point on the surface and moved only along the xᵢ axis. The notation ∂ (a rounded d) signals that other variables are frozen.
What is the gradient?
The gradient of a scalar function f(x₁, x₂, …, xₙ) at a point x is the vector of all its partial derivatives: ∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ). It lives in the same space as x — if f takes an n-dimensional input, the gradient is an n-dimensional vector. The key geometric property: the gradient points in the direction that increases f most steeply, and its magnitude is the steepness of that ascent.
Why does the gradient point uphill and not toward the minimum?
The gradient is the direction of maximum increase of f — the direction you would walk on the loss surface to climb as steeply as possible. To minimise the loss you want to descend, so gradient descent moves in the negative gradient direction: −∇L. Each step subtracts a small multiple of the gradient from the current weights, moving them slightly downhill. The gradient itself never points toward the minimum; it points away from it.
Imagine you are standing on a hillside shaped like a horse's saddle — the ground curves upward in front of you and behind you, but curves downward to your left and right. You want to find the lowest point. Which direction does the steepest climb point? Now imagine you are standing exactly at the centre of the saddle, where the slopes in all four directions cancel out. What happens to the steepest-ascent arrow at that exact point? Would taking a gradient descent step from there lead you to the bottom of anything? Now a different scenario: you are standing on a very steep hillside and your step size is large — large enough that one step would carry you past the valley floor and up the other side. What do you expect to happen if you keep taking steps of that size? And what would you need to change to actually reach the bottom?
Direction, Magnitude, and the Descent Step
What the gradient encodes
The gradient ∇f(x) is a vector in the input space — the same space as x. Its direction is the direction of steepest ascent; its magnitude is the steepness. At a minimum, the gradient is the zero vector: no direction increases the function, so the loss surface is locally flat.
The gradient lives in the input space, not the output space. For a loss function with a million weights, the gradient has a million components — one measuring how much the loss changes per unit change in each weight.
The descent update
Moving in the negative gradient direction from any point guarantees that you are stepping downhill — at least for a small enough step. The update:
subtracts η times the gradient from every weight simultaneously. If ∂L/∂wᵢ is positive, the loss is rising in the wᵢ direction, so the update decreases wᵢ to reduce the loss. If ∂L/∂wᵢ is negative, the update increases wᵢ. The sign of the partial derivative tells the optimiser which way to push each weight.
What gradient descent cannot guarantee
Reaching ‖∇L(w)‖ ≈ 0 confirms that no single-step gradient adjustment reduces the loss further from this point — but it does not confirm that the point is the global minimum. Loss surfaces for deep networks contain many saddle points where the gradient is zero but the loss could still decrease by moving in a different direction. Modern optimisers use momentum and adaptive step sizes to navigate past these flat regions, but the core operation remains: follow the negative gradient.
The gradient is a first-order tool — it tells you the slope of the loss surface but nothing about its curvature. More advanced optimisation methods account for curvature, enabling larger and more accurate steps, but they require substantially more computation and are rarely used for training large neural networks. For almost all practical deep learning, the gradient is sufficient.
The gradient ∇f(x) is the vector of partial derivatives — it points in the direction of steepest ascent and has magnitude equal to the rate of that ascent. The negative gradient −∇f points downhill.
One gradient descent step is w_{t+1} = w_t − η∇L. The learning rate η controls the step size: too small and convergence is slow; too large and the update overshoots the minimum.
In neural networks, backpropagation computes ∇L with respect to every weight in a single backward pass. Training is the process of taking gradient descent steps until ‖∇L‖ is acceptably small.
Check Your Understanding
Three questions on partial derivatives, the gradient, and the gradient descent update. Select an answer, then reveal to see the explanation.
What does the partial derivative ∂f/∂x₁ measure at a point?
The gradient ∇f(x) always points toward the nearest local minimum of f.
In the gradient descent update w_{t+1} = w_t − η∇L, what happens if the learning rate η is set too large?
The current weights are $w = (2, -3)$, the gradient is $\nabla L = (6, -4)$, and the learning rate is $\eta = 0.05$. What are the updated weights after one gradient descent step?