Derivatives and Gradients
A derivative measures how fast a function's output changes as its input changes. That single idea, the slope of a curve at a point, underlies the gradient-based training algorithms used throughout machine learning. Gradient descent steps against the derivative, in the direction that lowers the loss; backpropagation computes derivatives through every layer; second-order optimisers also use curvature, the rate at which the derivative itself changes. This topic builds the geometric intuition from the ground up.
- Vectors (required): The gradient is a vector of partial derivatives, so you need to understand vectors first
- Matrices and Transformations: The Jacobian is a matrix of derivatives, which is useful context for understanding how derivatives extend to vector-valued functions
The Derivative — Rate of Change
Most functions curve: they speed up, slow down, or reverse direction. The derivative measures how fast a function's output is changing at a given input; geometrically, it is the slope of the tangent line at that point. Where the derivative is zero, the function is momentarily flat. For a smooth function, every interior minimum and maximum satisfies this condition, and so does a flat inflection point, the one-dimensional counterpart of a saddle point. A zero derivative is therefore necessary for a minimum but not sufficient. Because training a neural network means minimising a loss function, gradient-based optimisers work by driving the derivative toward zero while lowering the loss.
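The idea of a slope at a point can be checked numerically. The sketch below is illustrative only: the function `f`, the helper `derivative`, and the step size `h` are assumptions chosen for the example, not part of the original material. It approximates the derivative with a central difference and shows the sign of the slope on either side of a minimum.

```python
def f(x):
    # Hypothetical example function: a parabola whose minimum is at x = 2.
    return (x - 2.0) ** 2 + 1.0


def derivative(func, x, h=1e-5):
    # Central difference: slope of the secant line through x - h and x + h.
    # As h shrinks, this approaches the slope of the tangent line at x.
    return (func(x + h) - func(x - h)) / (2.0 * h)


slope_left = derivative(f, 0.0)   # negative: f is falling to the left of the minimum
slope_min = derivative(f, 2.0)    # approximately zero: f is flat at the minimum
slope_right = derivative(f, 3.0)  # positive: f is rising to the right of the minimum

print(slope_left, slope_min, slope_right)
```

The sign of the slope says which way is downhill: a negative slope means increasing x lowers f, and a positive slope means decreasing x does.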
The Gradient — Derivative in Multiple Dimensions
A neural network's loss can depend on millions of weights simultaneously. To minimise it you need to know how the loss changes as each weight changes, and you need all those rates of change at once, packaged as a single object you can act on. That object is the gradient: the vector of partial derivatives, one per weight. The gradient points in the direction of steepest ascent; its negative points in the direction of steepest descent. A single update, w ← w − η∇L, where η is the learning rate (step size), is the basic gradient descent step.
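The update rule w ← w − η∇L can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the two-weight quadratic `loss`, its hand-derived `grad`, the starting point, the learning rate `eta = 0.1`, and the number of steps.

```python
def loss(w):
    # Hypothetical two-weight loss with its minimum at w = (3, -1).
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def grad(w):
    # Vector of partial derivatives, one entry per weight:
    # dL/dw0 = 2 (w0 - 3),  dL/dw1 = 4 (w1 + 1)
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


w = [0.0, 0.0]   # arbitrary starting point
eta = 0.1        # learning rate (assumed value)

for _ in range(200):
    g = grad(w)
    # The gradient descent step: move each weight against its partial derivative.
    w = [wi - eta * gi for wi, gi in zip(w, g)]

print(w, loss(w))  # w ends close to (3, -1) and the loss close to 0
```

A learning rate that is too large for the curvature of the loss makes the steps overshoot and diverge, which is why η is usually treated as a tuning parameter.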
The Chain Rule — How Backpropagation Works
The chain rule is the mathematical fact that makes training neural networks practical. It says how to differentiate a composition of functions: multiply the derivatives of each stage, each evaluated at its own input. A neural network is a long composition: the input passes through weights and activations, repeated layer by layer. Backpropagation is the chain rule applied in reverse, from the loss back to every weight, multiplying local gradients at each step. Understanding the chain rule explains both how networks learn and why they can fail to learn: when many local gradients are very small, their product shrinks toward zero and the early layers receive almost no signal.
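A minimal sketch of the chain rule applied in reverse, using a hypothetical two-weight scalar network (h = tanh(w1·x), y = w2·h, L = (y − t)²). The network shape, the function names, and the input values are assumptions for illustration. The hand-derived gradients are compared against a finite-difference estimate as a sanity check.

```python
import math


def forward(w1, w2, x, t):
    # Composition of simple stages: linear -> tanh -> linear -> squared error.
    z = w1 * x
    h = math.tanh(z)
    y = w2 * h
    return h, y, (y - t) ** 2


def backward(w1, w2, x, t):
    # Chain rule in reverse: start at the loss and multiply local derivatives.
    h, y, _ = forward(w1, w2, x, t)
    dL_dy = 2.0 * (y - t)             # derivative of the squared error
    dL_dw2 = dL_dy * h                # y = w2 * h, so dy/dw2 = h
    dL_dh = dL_dy * w2                # y = w2 * h, so dy/dh = w2
    dL_dz = dL_dh * (1.0 - h ** 2)    # derivative of tanh(z) is 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                # z = w1 * x, so dz/dw1 = x
    return dL_dw1, dL_dw2


def numeric_grads(w1, w2, x, t, eps=1e-5):
    # Central differences on the loss, used only to check the analytic result.
    d1 = (forward(w1 + eps, w2, x, t)[2] - forward(w1 - eps, w2, x, t)[2]) / (2 * eps)
    d2 = (forward(w1, w2 + eps, x, t)[2] - forward(w1, w2 - eps, x, t)[2]) / (2 * eps)
    return d1, d2


analytic = backward(0.5, -1.5, 2.0, 1.0)
numeric = numeric_grads(0.5, -1.5, 2.0, 1.0)
print(analytic, numeric)  # the two pairs should agree to several decimal places
```

The factor `1.0 - h ** 2` is the local gradient of the activation. When tanh saturates (h near ±1) that factor is close to zero, so every gradient upstream of it is scaled down, which is the small-local-gradient failure described above.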