Why do neural networks need derivatives?

Neural networks are trained by minimising a loss function — a number that measures how wrong the network's predictions are. The loss depends on millions of weights, and the network needs to know how to adjust each weight to reduce the loss. That information comes from the derivative of the loss with respect to each weight. Backpropagation computes these derivatives efficiently using the chain rule. Gradient descent then moves every weight slightly in the direction that reduces the loss — the negative gradient direction.

The Derivative — Rate of Change

The slope of a curve at a single point — and why that slope drives every training algorithm

Every time a neural network takes a training step, it computes the derivative of the loss with respect to every weight in the network. That collection of derivatives — the gradient — tells the optimiser which direction to adjust each weight to reduce the loss. Without derivatives, there is no gradient descent, no backpropagation, and no learning.

Learning Objectives

1 Explain what a derivative measures geometrically: the slope of the tangent line to the function's graph at a specific point, representing the instantaneous rate of change of the output per unit of input.
2 Identify what a zero derivative means at a point on a curve and explain why it does not necessarily indicate a minimum — the point could be a local maximum or a saddle point.
3 Connect the derivative to loss minimisation in gradient descent: gradient descent follows the negative derivative — the direction of steepest descent — at each step until it reaches a point where the derivative is zero in every direction.

¶ Narrative

The Slope at a Point

The slope of a training run

A neural network is training. At epoch 10, the loss is 2.41. At epoch 11, the loss is 2.38. At epoch 20, the loss is 1.12. At epoch 21, the loss is 1.09. The loss dropped by 0.03 in both cases: the same average change per epoch. But two measurements taken an epoch apart say nothing about how the loss behaved in between.

Within either interval, the loss could have fallen evenly, or dropped sharply at first and then levelled off. The number that distinguishes these cases is not “how much did the loss change over this interval” but “how steeply is the loss falling at this exact moment.” That number is the derivative.

The concept extends to any quantity that changes. A straight line has the same slope everywhere. A curve changes its slope at every point — the slope at the vertex of a parabola is different from the slope at its edges. The derivative is the slope of a curve at a single point.

Consider y = x². At x = 0, the curve is flat — the slope there is exactly 0. At x = 1, the curve is rising. At x = 3, it is rising steeply. The derivative tells you the slope at any particular x, precisely.

A secant line connects two points on a curve. As the second point slides toward the first, the secant rotates and converges on the tangent — the line that best approximates the curve's direction at exactly that point.

The limit definition

To find the slope at exactly x = 1 on y = x², measure the slope between x = 1 and a nearby second point, then make that second point closer and closer.

Second point x	Slope of secant
2	(4 − 1) / (2 − 1) = 3.000
1.5	(2.25 − 1) / (1.5 − 1) = 2.500
1.1	(1.21 − 1) / (1.1 − 1) = 2.100
1.01	(1.0201 − 1) / (1.01 − 1) = 2.010
1.001	(1.002001 − 1) / (1.001 − 1) = 2.001

The slopes are converging to 2. The derivative of x² at x = 1 is exactly 2.
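The table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `secant_slope` is an illustrative choice, and the second points are the ones listed in the table.

```python
def f(x):
    return x ** 2

def secant_slope(func, x0, x1):
    """Slope of the straight line through (x0, func(x0)) and (x1, func(x1))."""
    return (func(x1) - func(x0)) / (x1 - x0)

# Second points taken from the table; the first point is fixed at x = 1.
second_points = [2, 1.5, 1.1, 1.01, 1.001]
slopes = [secant_slope(f, 1.0, x1) for x1 in second_points]

for x1, slope in zip(second_points, slopes):
    print(f"second point {x1:<6} secant slope {slope:.3f}")
```

Each printed slope should match the corresponding table row to three decimal places, up to floating-point rounding.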

In the general case, call the gap between the two points h, so the second point is x + h. The slope between x and x + h is (f(x + h) − f(x)) / h. We cannot set h = 0 — that gives 0/0 — so instead we take the limit as h approaches 0:

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

The secant-to-tangent animation above shows this process geometrically — the formula and the figure are describing exactly the same thing.
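As a worked instance of the limit definition, the derivation below applies it to f(x) = x², the function used in the table. It is only algebra on the difference quotient, written out step by step.

```latex
\begin{aligned}
f'(x) &= \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} \\
      &= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} \\
      &= \lim_{h \to 0} \frac{2xh + h^2}{h} \\
      &= \lim_{h \to 0} (2x + h) \\
      &= 2x
\end{aligned}
```

At x = 1 this gives 2. The leftover h in the second-to-last line also explains the table: with a gap of h between the two points, the secant slope at x = 1 is exactly 2 + h.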

Computing derivatives — the power rule and common functions

Computing every derivative from the limit definition is possible but slow. Mathematicians derived shortcut rules by applying the limit definition to general families of functions. The most important is the power rule:

\frac{d}{d x} x^{n} = n \cdot x^{n - 1}

Applied to the example above: d/dx x² = 2x, which at x = 1 gives 2(1) = 2 — confirming the numeric result from the table.

f(x)	f′(x)	In plain English	Where this appears in ML
xⁿ	n · xⁿ⁻¹	Bring down the exponent, reduce it by 1	L2 regularisation penalty: d/dw (w²) = 2w
eˣ	eˣ	The exponential is its own derivative	Softmax, exponential learning rate schedules
ln x	1/x	The slope shrinks as x grows; undefined at x ≤ 0	Cross-entropy loss: d/dw (−ln p) = −1/p · dp/dw
constant c	0	A constant never changes, so its derivative is zero	Any term of the loss that does not depend on a given weight contributes zero to that weight's derivative

You will use the power rule, the exponential, and the logarithm constantly throughout this course — they appear in loss functions, activation functions, and regularisation terms.
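The rules in the table can be sanity-checked numerically. The sketch below compares each claimed derivative with a central-difference estimate at one point. The evaluation point x = 2, the exponent n = 3, the constant 7.0, and the step size h are all arbitrary choices made for illustration.

```python
import math

def central_difference(func, x, h=1e-6):
    """Approximate func'(x) with a symmetric difference quotient."""
    return (func(x + h) - func(x - h)) / (2 * h)

x = 2.0
n = 3  # hypothetical exponent for the power-rule row

# Each entry: (label, function, derivative claimed by the table)
rules = [
    ("x^n", lambda t: t ** n, lambda t: n * t ** (n - 1)),
    ("e^x", math.exp, math.exp),
    ("ln x", math.log, lambda t: 1 / t),
    ("constant", lambda t: 7.0, lambda t: 0.0),
]

errors = {}
for label, func, claimed in rules:
    numeric = central_difference(func, x)
    errors[label] = abs(numeric - claimed(x))
    print(f"{label:<9} numeric {numeric:.6f}  table {claimed(x):.6f}")
```

A numerical check like this does not prove a rule, but a large mismatch is a quick way to catch an algebra mistake.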

Where the derivative is zero — critical points

The derivative is zero wherever the tangent line is horizontal. These locations — called critical points — are the candidates for local minima and maxima: the peaks and valleys of the function’s graph.

Both x² and x³ have a zero derivative at x = 0: the tangent is horizontal at that point. The difference only becomes visible when you look at the neighbourhood: x² curves back up on both sides (a minimum), while x³ keeps rising through the point (a horizontal inflection point, the one-variable analogue of a saddle point, and not a minimum).

A critical point is a location where f′(x) = 0. Critical points include local minima, local maxima, and saddle points.

⚠️ Warning

A zero derivative is necessary for a minimum or maximum, but not sufficient. The function f(x) = x³ has f′(0) = 0, yet x = 0 is neither a minimum nor a maximum — the function is still increasing on both sides. Context (the second derivative, or simply looking at nearby values) is required to distinguish minima, maxima, and saddle points.
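The "look at nearby values" check can be sketched in code. The function below is a rough heuristic, not a rigorous test: it assumes the point is already known to be a critical point, and it samples only one neighbour on each side at a hypothetical distance `step`.

```python
def classify_by_neighbours(func, x0, step=1e-3):
    """Label a critical point by comparing func(x0) with values on either side.

    Assumes x0 is already known to satisfy func'(x0) = 0.
    """
    left, centre, right = func(x0 - step), func(x0), func(x0 + step)
    if left > centre and right > centre:
        return "local minimum"
    if left < centre and right < centre:
        return "local maximum"
    return "neither"

print(classify_by_neighbours(lambda x: x ** 2, 0.0))     # bowl shape
print(classify_by_neighbours(lambda x: -(x ** 2), 0.0))  # inverted bowl
print(classify_by_neighbours(lambda x: x ** 3, 0.0))     # passes straight through
```

For x³ the left neighbour is lower and the right neighbour is higher, so the function reports "neither", matching the warning above.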

The derivative in machine learning

Everything so far has been a single-variable function. Neural networks have millions of variables — millions of weights. But the derivative of a single-variable function is the foundation everything else builds on.

Consider L(w): the loss of a model as a function of a single weight w, with all other weights held fixed. The derivative L′(w) is the slope of the loss at the current value of w, exactly the slope concept developed in the sections above. When L′(w) is large and negative, the loss is falling steeply as w increases, so w should increase. When L′(w) is large and positive, the loss rises as w increases, so w should decrease. When L′(w) is approximately zero, small changes to w barely change the loss: w is at or near a critical point, which is often, but not always, a minimum.

A training loss curve with the derivative annotated at three points. Early in training the slope is steep — a large negative derivative means the loss is falling quickly. As training progresses the slope flattens. Near convergence the derivative is approximately zero — the model has stopped improving.

💡 Insight

Consider a model with a single weight w and a loss function L(w) = (w − 3)². The derivative is L′(w) = 2(w − 3). Suppose the current weight is w = 5. Then L′(5) = 2(5 − 3) = 4. The derivative is positive — the loss increases as w increases, which means w should decrease. Gradient descent takes a step in the negative direction: w_new = 5 − η · L′(5). With learning rate η = 0.1: w_new = 5 − 0.1 × 4 = 4.6. The loss at w = 5 was (5 − 3)² = 4. The loss at w = 4.6 is (4.6 − 3)² = 2.56. It decreased.

w_{\text{new}} = w - \eta \cdot L'(w)
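The worked example translates directly into code. The sketch below uses the same loss L(w) = (w − 3)², starting weight w = 5, and learning rate η = 0.1. The function names and the choice of 50 repeated steps are illustrative assumptions.

```python
def loss(w):
    return (w - 3) ** 2

def loss_derivative(w):
    return 2 * (w - 3)

def gradient_descent_step(w, learning_rate):
    """One update: w_new = w - learning_rate * L'(w)."""
    return w - learning_rate * loss_derivative(w)

w = 5.0
learning_rate = 0.1

first_step = gradient_descent_step(w, learning_rate)
print(f"after 1 step: w = {first_step:.4f}, loss = {loss(first_step):.4f}")

# Repeat the same update; 50 is an arbitrary illustrative count.
history = [w]
for _ in range(50):
    history.append(gradient_descent_step(history[-1], learning_rate))
print(f"after 50 steps: w = {history[-1]:.4f}, loss = {loss(history[-1]):.6f}")
```

The first step reproduces the hand calculation (w moves from 5 to about 4.6). Repeating the update moves w steadily toward 3, where the derivative is zero.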

Real World

Gradient descent works by computing the derivative of the loss with respect to every weight, then nudging each weight in the direction that reduces the loss. Each training step moves the weights slightly in the negative gradient direction. The process continues until the gradient is near zero everywhere — a point where the loss function is locally flat, meaning no single-step adjustment can reduce the loss further.

Common Mistake

Reaching f′(x) = 0 does not guarantee you have found a minimum. Neural network loss landscapes are high-dimensional and contain saddle points — locations where the gradient is zero but the loss can still decrease by moving in a different direction. Optimisers like Adam use momentum and adaptive learning rates partly to escape these flat regions.
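A two-variable toy surface makes the saddle-point idea concrete. The function x² − y² below is a standard textbook saddle, used here as a hypothetical stand-in for a loss surface. It is not taken from any real network.

```python
def loss(x, y):
    # Hypothetical two-weight surface with a saddle at the origin.
    return x ** 2 - y ** 2

def gradient(x, y):
    """Partial derivatives of loss with respect to x and y."""
    return (2 * x, -2 * y)

grad_at_origin = gradient(0.0, 0.0)
print("gradient is zero at origin:", all(g == 0 for g in grad_at_origin))

step = 0.1
at_origin = loss(0.0, 0.0)
along_x = loss(step, 0.0)  # moving along x raises the loss
along_y = loss(0.0, step)  # moving along y lowers the loss
print(f"origin {at_origin:.2f}, along x {along_x:.2f}, along y {along_y:.2f}")
```

The gradient vanishes at the origin, yet a step along y still lowers the loss, so the origin is not a minimum.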

This example has one weight. A real neural network might have hundreds of millions of weights, each contributing to the loss differently. Computing the derivative of the loss with respect to every weight simultaneously — and doing it efficiently — requires extending the derivative to multiple dimensions. That extension is the gradient, and it is the subject of the next chapter.

In this section

What is a derivative?

The derivative of a function f at a point x is the slope of the tangent line to the graph of f at x. It measures how fast the function's output changes per unit increase in its input at that exact location. If you move a tiny step Δx along the x-axis, the output changes by approximately f′(x)·Δx. Formally, the derivative is the limit of the ratio (f(x+h) − f(x))/h as h approaches zero — the slope between two nearby points as the gap between them shrinks.

What does it mean when the derivative is zero?

When f′(x₀) = 0, the tangent line at x₀ is horizontal — the function is momentarily neither increasing nor decreasing. This condition is necessary for a local minimum or maximum, but not sufficient. x³ at x = 0 has f′(0) = 0, yet the function is still increasing on both sides — x = 0 is a saddle point, not a minimum. To confirm which case you have, you need either the second derivative (f′′ > 0 means minimum, f′′ < 0 means maximum) or an inspection of function values in a neighbourhood of x₀.

What is the difference between a derivative and a gradient?

A derivative is defined for functions of a single variable: it returns a scalar measuring the slope of f at a point. A gradient is the multi-variable extension: for a function f(x₁, x₂, …, xₙ), the gradient is the vector of all partial derivatives ∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ). Each partial derivative measures the slope in one coordinate direction while holding the others fixed. The gradient points in the direction of steepest ascent of the function; its negative points downhill.
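To make "one partial derivative per coordinate" concrete, the sketch below estimates a gradient by nudging one input at a time while holding the others fixed. The example function and the step size h are hypothetical choices. In practice, frameworks compute gradients by automatic differentiation rather than finite differences.

```python
def f(x1, x2):
    # Hypothetical example function; its exact gradient is (2*x1, 6*x2).
    return x1 ** 2 + 3 * x2 ** 2

def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative by nudging one coordinate at a time."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((func(*forward) - func(*backward)) / (2 * h))
    return grad

point = [1.0, 2.0]
grad = numerical_gradient(f, point)
print("approximate gradient:", [round(g, 4) for g in grad])
```

At the point (1, 2) the exact gradient is (2, 12), and the estimate should agree closely.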

Key Terms

Derivative · Tangent Line · Critical Point

◎ Intuition

Pick a point on the parabola y = x² — say, somewhere to the right of the vertex. Before you drag the point in the explorer, predict: will the tangent slope be positive, negative, or zero at that location? Now pick a point to the left of the vertex. What sign do you expect the slope to have there? What happens to the slope as you slide the point toward x = 0? Now switch to f(x) = x³. Move to x = 0. What is the slope of the curve at that point — positive, negative, or zero? And here is the harder question: if the slope is zero at x = 0, does that mean x = 0 is the lowest point of x³? Look at the curve on both sides of x = 0 before you answer. A zero slope and a minimum are not the same thing.

↺ Reflection

Slope, Zero, and Descent

What the derivative measures at a point

The derivative f′(x) is the slope of the tangent line to the graph of f at x — the instantaneous rate of change of the output per unit of input at that exact location. If f′(2) = 3, a small increase of 0.01 in x near x = 2 produces an increase of approximately 0.03 in f(x). If f′(2) = −3, the same increase in x produces a decrease of approximately 0.03 in f(x). The sign gives direction; the magnitude gives steepness.
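The "0.01 in, roughly 0.03 out" claim can be checked with any function whose derivative at x = 2 is 3. The sketch below uses f(x) = 0.75x², a hypothetical choice made only because f′(2) = 1.5 · 2 = 3.

```python
def f(x):
    # Hypothetical function chosen so that f'(2) = 3.
    return 0.75 * x ** 2

def f_prime(x):
    return 1.5 * x

x = 2.0
dx = 0.01

actual_change = f(x + dx) - f(x)
predicted_change = f_prime(x) * dx  # linear approximation f'(x) * dx

print(f"predicted change {predicted_change:.6f}")
print(f"actual change    {actual_change:.6f}")
```

The two numbers agree to about four decimal places. The small gap comes from the curvature of f, and it shrinks as dx gets smaller.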

Why f′(x) = 0 is not enough to conclude a minimum

A zero derivative marks a horizontal tangent — the function is momentarily flat. For a local minimum or maximum, f′(x₀) = 0 is necessary: the function must stop increasing and start decreasing, or vice versa, at that point. But the function x³ at x = 0 illustrates that zero derivative alone is insufficient. At x = 0, f′(x) = 3x² = 0 and the tangent is horizontal. Yet the function is still increasing on both sides — it passes through x = 0 without reversing. Zooming in confirms this: f(−0.1) = −0.001 and f(0.1) = 0.001, so the function is increasing through zero. This is a saddle point: the derivative is zero, but it is neither a minimum nor a maximum.

The second derivative provides a more reliable test. If f′(x₀) = 0 and f′′(x₀) > 0, the function is curving upward at x₀ — like a bowl — and x₀ is a local minimum. If f′(x₀) = 0 and f′′(x₀) < 0, the function is curving downward — like an inverted bowl — and x₀ is a local maximum. If f′′(x₀) = 0, the test is inconclusive and x₀ might be a saddle point. This is called the second derivative test, and it will appear again when the course covers optimisation in higher dimensions.
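The second derivative test can be sketched numerically as follows. The finite-difference step h and the `tolerance` used to decide that the curvature is "zero" are arbitrary illustrative values, and the function assumes f′(x₀) = 0 has already been established.

```python
def second_derivative(func, x, h=1e-4):
    """Central-difference estimate of func''(x)."""
    return (func(x + h) - 2 * func(x) + func(x - h)) / (h * h)

def second_derivative_test(func, x0, tolerance=1e-6):
    """Classify x0, assuming func'(x0) = 0 has already been established."""
    curvature = second_derivative(func, x0)
    if curvature > tolerance:
        return "local minimum"
    if curvature < -tolerance:
        return "local maximum"
    return "inconclusive"

print(second_derivative_test(lambda x: x ** 2, 0.0))     # curves upward
print(second_derivative_test(lambda x: -(x ** 2), 0.0))  # curves downward
print(second_derivative_test(lambda x: x ** 3, 0.0))     # zero curvature
print(second_derivative_test(lambda x: x ** 4, 0.0))     # zero curvature, yet a minimum
```

The last case shows why "inconclusive" is the right word: x⁴ has zero second derivative at x = 0 but still has a minimum there, so a zero second derivative does not by itself imply a saddle point.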

The loss minimisation connection

A neural network’s loss function is a surface over a space with millions of dimensions — one for each weight. Gradient descent navigates this surface by, at each step, computing the gradient (the vector of partial derivatives of the loss with respect to every weight) and moving every weight slightly in the negative gradient direction. That step moves the weights to a lower point on the loss surface.
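The same update scales from one weight to many by applying it to every coordinate of the gradient. The sketch below uses a hypothetical two-weight loss with a known minimum at (3, −1). The starting point, learning rate, and step count are illustrative assumptions, and a real network would obtain the gradient from backpropagation rather than a hand-written formula.

```python
def loss(weights):
    # Hypothetical two-weight loss with its minimum at (3, -1).
    w1, w2 = weights
    return (w1 - 3) ** 2 + (w2 + 1) ** 2

def gradient(weights):
    """Vector of partial derivatives of the loss with respect to each weight."""
    w1, w2 = weights
    return [2 * (w1 - 3), 2 * (w2 + 1)]

def step(weights, learning_rate):
    """Move every weight a small distance in the negative gradient direction."""
    grad = gradient(weights)
    return [w - learning_rate * g for w, g in zip(weights, grad)]

weights = [0.0, 0.0]
learning_rate = 0.1  # illustrative value
losses = [loss(weights)]
for _ in range(100):
    weights = step(weights, learning_rate)
    losses.append(loss(weights))

print(f"final weights: [{weights[0]:.4f}, {weights[1]:.4f}]")
print(f"final loss: {losses[-1]:.8f}")
```

On this bowl-shaped surface the weights settle near (3, −1). On a real loss surface the same loop can stall at any point where the gradient is close to zero.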

The process continues until the gradient is zero — or acceptably close to zero — in every direction. At that point no first-order adjustment reduces the loss further. This is why understanding the derivative is foundational for machine learning: the entire training procedure is a structured search for a point where the gradient vanishes.

What gradient descent cannot guarantee is that the zero-gradient point it finds is the global minimum, or even a minimum at all. Just as x³ has a zero derivative at x = 0 without a minimum there, the loss surfaces of many deep networks contain saddle points where the gradient is zero but the loss could still decrease. Modern optimisers combine gradient information with momentum and adaptive step sizes to navigate past these regions, but the core operation, following the negative gradient, remains unchanged.

Key Points

The derivative f′(x) is the slope of the tangent line at x: how fast the function's output is changing at that exact point.

A zero derivative means the function is momentarily flat, but this could be a minimum, a maximum, or a saddle point. The function x³ has f′(0) = 0 but x = 0 is neither a minimum nor a maximum.

Gradient descent searches for a minimum of a loss function by repeatedly stepping against the derivative, which is the downhill direction. Training a neural network is the process of searching for a point where the derivative of the loss is close to zero, and such a point is not guaranteed to be the global minimum.

✓ Checkpoint

Check Your Understanding

Three questions on derivatives, critical points, and their role in machine learning.

What does f′(x₀) measure geometrically?

True or false: if f′(x₀) = 0, then x₀ must be a local minimum of f.

Why does gradient descent require computing derivatives during neural network training?