SeeingML

The Learning Rate

One number that changes everything

Before a large training run begins, teams often spend significant effort tuning this single number.

Learning Objectives
  1. Explain how the learning rate controls the size of each weight update and why it appears as a scalar multiplier in the gradient descent update rule.
  2. Select an appropriate learning rate for a given training scenario by reasoning about convergence speed, stability, and the curvature of the loss surface.
  3. Diagnose whether a training run is suffering from a too-small or too-large learning rate by examining the shape of the loss curve and the optimisation path.
¶ Narrative

What Is the Learning Rate?

Training a neural network is an act of repeated correction. The network makes a prediction, measures how wrong it is, and then adjusts its weights in a direction that should make the next prediction slightly less wrong. The learning rate is the dial that controls how large those adjustments are.

Without a learning rate, the update would move the weights by exactly the raw gradient — a derivative whose magnitude can vary enormously across different parameters and different points in training. The learning rate scales that signal down to a manageable step size.

The Update Rule

Every gradient descent step follows the same formula:

Gradient Descent Update

w_{t+1} = w_t − η · ∇_w L(w_t)

Reading it left to right: the new weights w_{t+1} equal the old weights w_t minus the gradient ∇_w L(w_t) multiplied by η (the Greek letter eta, pronounced “ay-ta”).

Each symbol has a precise role:

  • w_t — the current weights at iteration t. These are the parameters being optimised: every number in every layer of the network.
  • ∇_w L(w_t) — the gradient of the loss with respect to the weights. It points in the direction that would increase the loss most steeply. Subtracting it moves the weights downhill.
  • η — the learning rate. A single positive number, typically between 0.0001 and 1.0. It multiplies the gradient before the subtraction, controlling the size of the step.
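As a minimal sketch of the update rule, the snippet below applies one step to a two-weight vector. The loss L(w) = w1² + w2² and the helper names `grad` and `gd_step` are assumptions chosen for illustration; they are not part of the lesson.

```python
def grad(w):
    """Gradient of the assumed loss L(w) = w1^2 + w2^2."""
    return [2.0 * wi for wi in w]


def gd_step(w, eta):
    """One update: w_{t+1} = w_t - eta * grad(w_t)."""
    g = grad(w)
    return [wi - eta * gi for wi, gi in zip(w, g)]


w_t = [3.0, -2.0]
w_next = gd_step(w_t, eta=0.1)
print(w_next)  # each weight has moved a fraction of the way toward 0
```

Doubling `eta` doubles the distance moved in this step; the direction is set by the gradient alone.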

That single scalar η has an outsized effect on everything that follows. Set it too small and training crawls: you might need ten times as many iterations to reach the same loss value, burning compute without making proportional progress. Set it too large and each step overshoots the minimum — the parameters jump past the optimal point and the loss bounces or climbs instead of falling.

Why Tuning It Matters

Real World

Adaptive optimisers like Adam, AdaGrad, and RMSProp were invented specifically because hand-tuning the learning rate is so painful. Each maintains a running estimate of gradient magnitudes and uses it to rescale the update per-parameter automatically, making training far less sensitive to the global η setting. Despite this, even teams using Adam invest significant effort in choosing the initial rate and its decay schedule — particularly for large language models, where a diverged run can waste days of compute. The learning rate remains one of the most consequential hyperparameters in any training run.
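To make the per-parameter rescaling concrete, here is a minimal sketch of a single Adam update for one scalar parameter. The function name `adam_step` is hypothetical, and the default values shown are the commonly quoted ones; treat the whole block as an illustration, not a reference implementation.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are running averages of the gradient and squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Two parameters whose gradients differ by six orders of magnitude.
w_big, _, _ = adam_step(w=0.0, g=1000.0, m=0.0, v=0.0, t=1)
w_small, _, _ = adam_step(w=0.0, g=0.001, m=0.0, v=0.0, t=1)
print(w_big, w_small)  # both first steps have magnitude close to lr
```

Dividing by the running gradient magnitude is what makes the step size roughly independent of the raw gradient scale, but `lr` still sets the overall step length.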

Three Regimes

The consequences of η fall into three broad categories:

| Too Small | Just Right | Too Large |
| Very slow: many iterations to converge | Steady progress: loss falls smoothly each step | Unstable: loss oscillates or grows |
| Wastes compute on redundant fine-grained steps | Efficient use of compute budget | Wastes compute recovering from each overshoot |
| Stable but prone to getting trapped in poor local minima | Stable convergence toward a useful minimum | Unstable: may diverge entirely on non-convex surfaces |
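The three regimes can be reproduced on a toy problem. This sketch assumes the one-dimensional loss L(w) = w² and three hand-picked rates; the function name `run_gd` and the specific values are illustrative only. Because this surface is convex, the sketch does not show the local-minimum behaviour mentioned for small rates.

```python
def run_gd(eta, w0=1.0, steps=20):
    """Run gradient descent on the assumed loss L(w) = w^2 and return the loss history."""
    w = w0
    losses = [w * w]
    for _ in range(steps):
        w = w - eta * 2.0 * w  # gradient of w^2 is 2w
        losses.append(w * w)
    return losses


for label, eta in [("too small", 0.001), ("just right", 0.3), ("too large", 1.1)]:
    losses = run_gd(eta)
    print(f"{label:>10}  eta={eta:<5}  final loss={losses[-1]:.3e}")
```

After the same 20 steps, the smallest rate has barely reduced the loss, the middle rate has driven it close to zero, and the largest rate has increased it.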

A Common Misconception

Common Mistake

It is tempting to think that a smaller learning rate is always the safer choice. It is not. A learning rate that is too small causes updates to become negligibly tiny. The optimiser inches toward the minimum so slowly that it can exhaust the compute budget before arriving, or become permanently trapped in a poor local minimum that a slightly larger step would have escaped. Safety and smallness are not the same thing. The goal is the right size — and that depends on the curvature of the surface being optimised.

In this section

What is the learning rate in gradient descent?

The learning rate (η) is a positive scalar that multiplies the gradient before it is subtracted from the current weights. At each step, the update rule is w_{t+1} = w_t − η · ∇L(w_t). It controls how far the parameters move in the direction of steepest descent. A larger η means bigger steps; a smaller η means smaller steps. The gradient gives direction; the learning rate gives magnitude.

Why does a learning rate that is too large cause divergence?

Near a minimum, how far a step can safely go depends on the curvature of the loss surface. On a locally quadratic surface with curvature λ in some direction, a learning rate above 1/λ makes each step overshoot the minimum, so the next gradient points back across the valley and the parameters oscillate. Above 2/λ the oscillation grows with every step instead of shrinking. In convex problems this shows up as visible oscillation in the loss curve. In non-convex settings, like those typical in deep networks, overshooting often sends the parameters into a region where the loss is dramatically higher, causing the run to diverge entirely.
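For a one-dimensional quadratic the thresholds can be worked out exactly. The loss L(w) = (λ/2) w², with curvature λ > 0, is an assumption chosen for illustration; real loss surfaces match it only locally.

```latex
\begin{aligned}
L(w) &= \tfrac{\lambda}{2} w^2, \qquad \nabla L(w) = \lambda w \\
w_{t+1} &= w_t - \eta \lambda w_t = (1 - \eta\lambda)\, w_t \\
0 < \eta < \tfrac{1}{\lambda} &: \quad 0 < 1 - \eta\lambda < 1 \quad \text{(steady approach, no overshoot)} \\
\tfrac{1}{\lambda} < \eta < \tfrac{2}{\lambda} &: \quad -1 < 1 - \eta\lambda < 0 \quad \text{(overshoot, oscillation decays)} \\
\eta > \tfrac{2}{\lambda} &: \quad 1 - \eta\lambda < -1 \quad \text{(oscillation grows, divergence)}
\end{aligned}
```

The sign of the factor 1 − ηλ decides whether the weight crosses the minimum, and its absolute value decides whether the distance to the minimum shrinks or grows.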

Do I need to tune the learning rate when using Adam or other adaptive optimisers?

Adam and similar adaptive optimisers (AdaGrad, RMSProp) maintain per-parameter estimates of gradient magnitudes and use them to rescale the effective step size automatically. This makes training far less sensitive to the global learning rate. In practice you still provide an initial rate — 1e-3 is the conventional default for Adam — but the range of workable values is much wider than with vanilla SGD. For large models, learning rate schedules (warmup followed by cosine or linear decay) remain important even with adaptive optimisers, because the right rate early in training differs significantly from the right rate late in training.
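A warmup-then-cosine schedule of the kind described above can be sketched in a few lines. The function name `lr_at_step` and every default value (peak rate, warmup length, total steps) are assumptions for illustration; real training setups choose these per model.

```python
import math


def lr_at_step(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at_step(step))
```

The value returned for each step would be passed to the optimiser as its learning rate for that step; the optimiser itself is unchanged.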

◎ Intuition

The visualisation below shows a loss surface — a landscape whose height represents the error of a model at each combination of two weights. Gradient descent starts at the marker in the upper right and takes a sequence of steps, each one moving downhill by following the slope of the surface. The learning rate controls the length of each step. Before you move any sliders, take a moment to read the contour lines: the tightly packed rings near the centre indicate a steep descent toward the minimum, while the widely spaced contours at the edges indicate a gentler slope. Where would a ball placed at the starting point naturally roll?

↺ Reflection

What You Just Saw

The gradient descent update rule is w_{t+1} = w_t − η · ∇L(w_t). The gradient ∇L tells the optimiser which direction to move; the learning rate η determines how far. These two roles are separate and both matter.

At a low learning rate, the path to the minimum is smooth but long. Each step is small relative to the curvature of the surface, so the update always moves toward lower loss. The cost is iteration count: a rate of 0.01 on a standard bowl surface takes roughly ten times as many steps to converge as a rate of 0.1, because for small rates the number of steps needed scales roughly with 1/η.

A learning rate above approximately 0.5 on the convex surface in the visualisation causes the optimisation path to overshoot the minimum on every step; the exact threshold depends on the curvature of the surface. Instead of settling directly, the parameters land on the far side of the valley, where the gradient points back across in the opposite direction. The path oscillates, and if the rate is high enough, the amplitude of each oscillation grows, a condition called divergence. On non-convex surfaces the situation is worse: an overshooting step can carry the parameters out of a useful basin of attraction entirely and into a region of much higher loss.
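The difference between overshooting and diverging is easy to see numerically. This sketch assumes the loss L(w) = w², for which overshoot begins at η = 0.5 and divergence at η = 1.0; other surfaces have other thresholds. The function name `path` is illustrative.

```python
def path(eta, w0=1.0, steps=5):
    """Weights visited by gradient descent on the assumed loss L(w) = w^2."""
    ws = [w0]
    for _ in range(steps):
        ws.append(ws[-1] - eta * 2.0 * ws[-1])  # each step multiplies w by (1 - 2 * eta)
    return ws


for eta in (0.4, 0.8, 1.2):
    print(eta, [round(w, 3) for w in path(eta)])
```

At 0.4 the weight stays on one side of the minimum and shrinks. At 0.8 its sign flips on every step but its magnitude still shrinks. At 1.2 the sign flips and the magnitude grows.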

The saddle point surface demonstrates a separate failure mode: a point where the gradient vanishes while the curvature is negative in one direction and positive in another. Exactly at a saddle point, gradient descent stalls regardless of learning rate, because there is no gradient signal to follow. Close to one, progress is slow because the gradient is small.
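A minimal sketch of the stall, assuming the textbook saddle f(x, y) = x² − y², which is not necessarily the surface used in the visualisation. The names `saddle_grad` and `descend` are illustrative.

```python
def saddle_grad(x, y):
    """Gradient of the assumed saddle f(x, y) = x^2 - y^2."""
    return 2.0 * x, -2.0 * y


def descend(x, y, eta=0.1, steps=50):
    """Run plain gradient descent from (x, y) and return the final point."""
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y


print(descend(0.0, 0.0))    # starts exactly on the saddle and never moves
print(descend(0.0, 1e-6))   # a tiny offset in y grows by a factor of 1.2 per step
```

With a zero gradient the update is zero for any η, so changing the learning rate cannot help at the exact saddle. A small perturbation along the downhill direction is enough to escape, though only slowly at first.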

The instability of large steps is also why learning rate warmup exists in real training runs: starting small and increasing gradually lets the optimiser settle into a promising region before taking larger steps.

Key Points

The gradient gives direction. The learning rate gives magnitude.

A learning rate above ~0.5 on the convex surface shown makes each step overshoot the minimum; a high enough rate makes the oscillation grow until the run diverges.

Adaptive optimisers like Adam make training less sensitive to the global learning rate, but the initial rate and its schedule still need to be chosen.

Checkpoint

Check Your Understanding

Three questions on the learning rate and gradient descent.

1

If the gradient at a point is very large, what does that tell you about the loss surface there?

2

True or false: a smaller learning rate always leads to better final model performance.

3

Put the steps of one gradient descent iteration in the correct order.

  1. Compute the loss on a batch of training data
  2. Calculate the gradient of the loss with respect to the weights
  3. Multiply the gradient by the learning rate
  4. Subtract the scaled gradient from the current weights
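The four steps can be sketched as one function. The model (a line y_hat = w · x + b), the mean squared error loss, the toy batch, and the name `gd_iteration` are all assumptions chosen to keep the example small.

```python
def gd_iteration(w, b, batch, eta):
    """One gradient descent iteration for the assumed model y_hat = w * x + b."""
    n = len(batch)
    # 1. Compute the loss on the batch (mean squared error).
    errors = [(w * x + b) - y for x, y in batch]
    loss = sum(e * e for e in errors) / n
    # 2. Calculate the gradient of the loss with respect to w and b.
    grad_w = sum(2.0 * e * x for e, (x, _) in zip(errors, batch)) / n
    grad_b = sum(2.0 * e for e in errors) / n
    # 3. Multiply the gradient by the learning rate.
    step_w, step_b = eta * grad_w, eta * grad_b
    # 4. Subtract the scaled gradient from the current weights.
    return w - step_w, b - step_b, loss


batch = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on the line y = 2x + 1
w, b = 0.0, 0.0
for _ in range(2000):
    w, b, loss = gd_iteration(w, b, batch, eta=0.05)
print(round(w, 3), round(b, 3))
```

The returned `loss` is the value measured before the update, which is the quantity usually plotted on a loss curve.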