Derivatives and Gradients

A derivative measures how fast a function's output changes as its input changes. That single idea — the slope of a curve at a point — underlies every training algorithm in machine learning. Gradient descent moves in the direction the derivative points downhill; backpropagation computes derivatives through every layer; second-order optimisers use curvature. This topic builds the geometric intuition from the ground up.

Prerequisites:

Vectors Required — The gradient is a vector of partial derivatives — you need to understand vectors first
Matrices and Transformations — The Jacobian is a matrix of derivatives — useful context for understanding how derivatives extend to vector-valued functions

intermediate

◷ 40 min total ① 3 chapters ⬡ 3 playgrounds

Chapters

intermediate ◷ 12 min

The Derivative — Rate of Change

Most functions curve — they speed up, slow down, or reverse direction. The derivative measures exactly how fast a function's output is changing at any given input: it is the slope of the tangent line at that point. Where the derivative is zero, the function is momentarily flat — a condition that marks every minimum, maximum, and saddle point. Because training a neural network means minimising a loss function, finding where the derivative is zero is the central task of every optimiser.

5 sections Start →

intermediate ◷ 12 min

The Gradient — Derivative in Multiple Dimensions

A neural network's loss depends on millions of weights simultaneously. To minimise it you need to know how the loss changes as each weight changes — but you need all those rates of change at once, packaged as a single object you can act on. That object is the gradient: the vector of partial derivatives, one per dimension. The gradient points in the direction of steepest ascent; its negative points downhill. One subtraction — w ← w − η∇L — is every gradient descent step ever taken.

5 sections Start →

intermediate ◷ 16 min

The Chain Rule — How Backpropagation Works

The chain rule is the single mathematical fact that makes training neural networks possible. It says how to differentiate a composition of functions: multiply the derivatives at each stage. A neural network is a long composition — input through weights and activations, repeated layer by layer. Backpropagation is simply the chain rule applied in reverse from the loss back to every weight, multiplying local gradients at each step. Understanding the chain rule explains both how networks learn and why they can fail to learn when those local gradients become very small.

5 sections Start →