SeeingML
Understanding Data · Chapter 3 of 5

Probability Distributions

The shape of data before any model sees it

A London weather station records daily rainfall — most days produce a few millimetres or nothing, but a handful of days produce heavy downpours, and the distribution of these values explains why forecasting rain is fundamentally different from forecasting temperature.

Learning Objectives
  1. Name the defining parameters of the Normal and Exponential distributions and describe what each parameter controls.
  2. Explain why the exponential distribution is appropriate for modelling measurements like daily rainfall, while the normal distribution suits measurements that vary symmetrically around a mean.
  3. Given a description of a variable and its measurement context, identify which distribution is most appropriate and justify the choice.
  4. Identify when a wrong distributional assumption will cause model predictions to be systematically biased, and describe what histogram shape signals the mismatch.
¶ Narrative

What Is a Probability Distribution?

A weather station on the roof of a building in central London has been recording measurements automatically for forty years. Every twenty-four hours it produces a number: today’s rainfall, today’s high temperature, today’s peak wind speed, today’s total sunshine. Each measurement is different from yesterday’s, and different again from the same day last year. But the measurements are not random in the sense of being uniformly scattered across all possible values. They cluster. They concentrate. They have shapes.

The shape that describes how measurements cluster across their possible values is called a probability distribution. A distribution does not tell you what the next measurement will be. It tells you what values are plausible, how likely extreme values are, and where the bulk of observations will land over many measurements. Two variables with the same mean can have completely different distributions, and those differences determine which mathematical operations are meaningful, which models apply, and what predictions are trustworthy.


The Normal Distribution

Temperature in London varies around a long-run average. Most days are near the mean. Cold snaps and heat waves happen but are unusual, and extremely cold or extremely hot days are rarer still. The further a temperature is from the average, the less likely it becomes — and this decrease in likelihood is approximately symmetric: a day that is 8°C above average is about as rare as a day 8°C below average.

This pattern (symmetric, bell-shaped concentration around a central value) is the normal distribution. It is characterised by two parameters:

  • μ (mean): the centre of the bell, where the distribution peaks
  • σ (standard deviation): the width of the bell — the typical distance from the mean

A concrete way to read σ: about 68% of values fall within μ ± σ, and about 95% fall within μ ± 2σ. If a city’s daily high temperature averages μ = 15°C with σ = 4°C, most days (roughly two-thirds of the year) land between 11°C and 19°C, and nearly all days sit between 7°C and 23°C. Swap σ for a larger value and the same “most days” range widens in lockstep. Swap it for a smaller one and the days cluster tightly around the average.
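The 68% and 95% figures can be checked numerically. The sketch below assumes SciPy is available and reuses the μ = 15°C, σ = 4°C example from the paragraph above; the variable names are illustrative only.

```python
from scipy import stats

mu, sigma = 15.0, 4.0  # daily high temperature example from the text (°C)
temps = stats.norm(loc=mu, scale=sigma)

# Probability of landing within one and two standard deviations of the mean
within_1 = temps.cdf(mu + sigma) - temps.cdf(mu - sigma)
within_2 = temps.cdf(mu + 2 * sigma) - temps.cdf(mu - 2 * sigma)

print(f"P(11 <= T <= 19) = {within_1:.3f}")  # about 0.683
print(f"P(7 <= T <= 23)  = {within_2:.3f}")  # about 0.954
```

Changing `sigma` moves the interval endpoints but leaves both probabilities unchanged, which is why the rule can be stated without reference to any particular σ.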

The bell curve is fully determined by μ and σ — no other information is needed. A smaller σ produces a taller, narrower bell. A larger σ produces a wider, flatter bell. The total area under the curve always equals 1, regardless of the shape.

The normal distribution's shape is controlled by σ. As σ increases from 2 to 9, the bell curve flattens and widens. The area under the curve stays exactly 1 throughout — a taller peak means narrower width, and a flatter peak means wider spread.
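The trade-off between peak height and width can be made concrete with a small sketch. It assumes SciPy, uses three illustrative σ values (2, 4 and 9), and integrates numerically over ±10σ, which is an approximation to the full real line.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu = 15.0
results = {}
for sigma in (2.0, 4.0, 9.0):
    dist = stats.norm(loc=mu, scale=sigma)
    # Height of the bell at its centre; analytically 1 / (sigma * sqrt(2 * pi))
    peak = dist.pdf(mu)
    # Numerical area under the curve over a range wide enough to hold nearly all mass
    area, _ = quad(dist.pdf, mu - 10 * sigma, mu + 10 * sigma)
    results[sigma] = (peak, area)
    print(f"sigma={sigma:>4}: peak height={peak:.4f}, area={area:.6f}")
```

The peak height shrinks in proportion to 1/σ while the area stays at 1, which is the flattening-and-widening behaviour described above.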
📖 History

The normal distribution was formalised by Carl Friedrich Gauss and Pierre-Simon Laplace in the early nineteenth century while studying measurement errors in astronomy. Gauss derived the bell curve as the error law underlying the method of least squares, and Laplace showed that the sum of many small, independent errors converges to the same shape. The central limit theorem, which makes this precise, explains why the normal distribution appears so often in nature: many real-world measurements are the sum of many independent small effects.

Linear regression's standard confidence and prediction intervals assume that the errors, estimated by the residuals between predicted and actual values, follow a normal distribution. When they do, the intervals are correctly calibrated. When residuals are skewed or heavy-tailed, the intervals no longer match their stated coverage: heavy tails make them too narrow for extreme cases, and skew makes them too wide on one side and too narrow on the other. Checking the shape of residuals is one of the most fundamental model diagnostics, and the normal distribution is the reference shape every regression analyst checks against.
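A residual check of this kind might look like the sketch below. The data are simulated, the helper `residuals_for` is hypothetical, and the Shapiro-Wilk test is just one of several normality checks that could be used here; a histogram or Q-Q plot would serve the same purpose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)

def residuals_for(noise):
    """Fit a straight line to y = 2x + 1 + noise and return the residuals."""
    y = 2.0 * x + 1.0 + noise
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Case 1: symmetric, normally distributed noise
normal_res = residuals_for(rng.normal(0.0, 1.0, size=x.size))
# Case 2: right-skewed noise (exponential, shifted to have mean zero)
skewed_res = residuals_for(rng.exponential(1.0, size=x.size) - 1.0)

for name, res in (("normal noise", normal_res), ("skewed noise", skewed_res)):
    skewness = stats.skew(res)
    _, p_value = stats.shapiro(res)  # small p-value: evidence against normality
    print(f"{name}: skewness={skewness:.2f}, Shapiro-Wilk p={p_value:.3g}")
```

Residual skewness near zero is consistent with the normal assumption; a clearly positive skewness together with a tiny p-value signals that the usual intervals should not be trusted.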


The Exponential Distribution

Daily rainfall behaves differently. Most days in London produce a few millimetres of rain or nothing at all. Moderate rain events occur but are less common. Heavy downpours are rare. The probability of observing a given rainfall amount does not peak at some central value — it is highest near zero and decreases continuously as values increase. There is no bell shape, no symmetric spread.

This pattern (highest probability near zero, decaying continuously toward higher values) is the exponential distribution. It is characterised by a single parameter:

  • λ (rate): how steeply the distribution decays. High λ means heavy concentration near zero — most events are small. Low λ means a slower decay — larger values occur with more relative frequency.

A concrete way to read λ: the average value equals 1/λ. If daily rainfall follows an exponential with λ = 0.5 (per millimetre), the mean rainfall is 1/0.5 = 2 mm per day — most days dry or near-dry, with the occasional heavier shower pulling the average up. Double λ and the mean halves: rainfall becomes even more concentrated near zero. Halve λ and the mean doubles: larger amounts become correspondingly more common. Unlike the normal distribution’s two knobs, the exponential has only one — λ alone controls both the shape and the average.
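The relationship between λ and the mean can be confirmed by simulation. One practical detail worth flagging: both NumPy and SciPy parameterise the exponential by `scale`, which equals 1/λ, not by λ itself. The λ values below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_means = {}
for lam in (0.25, 0.5, 1.0):
    scale = 1.0 / lam  # NumPy and SciPy take scale = 1/lambda
    theoretical_mean = stats.expon(scale=scale).mean()
    sample = rng.exponential(scale=scale, size=100_000)
    sample_means[lam] = sample.mean()
    print(f"lambda={lam}: theoretical mean={theoretical_mean:.2f} mm, "
          f"simulated mean={sample_means[lam]:.2f} mm")
```

Passing λ directly as `scale` is a common slip: it silently produces a distribution whose mean is λ instead of 1/λ.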

Daily rainfall data from the London weather station. Dots accumulate into a histogram as each day's measurement is recorded. Most days produce little or no rain. Heavy downpours are rare. The dashed curve shows the exponential distribution fitted to this data — it matches the shape closely.

The exponential distribution arises whenever events happen independently at a constant average rate: it describes the waiting time between them. Examples include the time between customers arriving at a service desk and the gap between earthquakes above a threshold magnitude. Daily rainfall amounts are not waiting times, but empirically they often have a similar shape, so the exponential serves as a simple first approximation. In each case, small values are common, large values are rare, and the decay from small to large is exponential.

Skewness is the name for this asymmetry: a distribution is right-skewed when its tail extends to the right. The exponential distribution is strongly right-skewed, with a skewness of 2 whatever the value of λ: the mean (1/λ) is always larger than the median (ln 2/λ), which is always larger than the mode (zero). Fitting a symmetric model like the normal distribution to skewed data will systematically underestimate the probability of large values while assigning probability to negative values that cannot physically exist.
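The ordering of mean, median and mode can be read directly off a SciPy distribution object. This is a minimal sketch using λ = 0.5 from the rainfall example; the mode of an exponential is zero by definition and is not computed.

```python
import numpy as np
from scipy import stats

lam = 0.5
rain = stats.expon(scale=1.0 / lam)

mean = rain.mean()                         # 1 / lambda
median = rain.median()                     # ln(2) / lambda
skewness = float(rain.stats(moments="s"))  # 2 for every exponential

print(f"mean={mean:.3f}, median={median:.3f}, mode=0, skewness={skewness:.1f}")
```

A mean noticeably above the median is a quick numerical sign of right skew, even before a histogram is drawn.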

In machine learning, exponential and related distributions model time-to-event data: how long until a customer churns, how long until a machine fails, how long between fraudulent transactions. Survival analysis models — designed specifically for right-skewed, non-negative data — use this structure as their foundation.


Two More Shapes

Uniform distributions appear when all values in a range are equally likely. Daily sunshine hours in London range from zero (overcast all day) to roughly sixteen (summer solstice). Within this range, any amount is approximately equally likely across the year — there is no central clustering. The uniform distribution is perfectly flat: the probability density is the same at every point between the minimum and maximum.

Bimodal distributions have two peaks. Extreme rainfall records sometimes show bimodal structure: one cluster of values around ordinary frontal rainfall and a separate cluster at higher values corresponding to convective storm events. A bimodal histogram signals that the data comes from two distinct sub-populations — fitting a single unimodal model to bimodal data will produce a poor fit at both peaks.
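A two-component mixture is one simple way to generate a bimodal shape. The sketch below uses made-up parameters (not taken from any real rainfall record): a frequent component centred at 5, a rarer one centred at 25, and a mixing weight `w` of 0.8.

```python
import numpy as np
from scipy import stats

# Hypothetical mixture parameters, chosen only for illustration
mu1, sigma1 = 5.0, 1.5    # frequent, lower-valued component
mu2, sigma2 = 25.0, 4.0   # rarer, higher-valued component
w = 0.8                   # share of observations from the first component

def mixture_pdf(x):
    """Density of the two-component normal mixture."""
    return w * stats.norm.pdf(x, mu1, sigma1) + (1 - w) * stats.norm.pdf(x, mu2, sigma2)

# Draw a sample: pick a component for each observation, then draw from it
rng = np.random.default_rng(0)
n = 10_000
from_first = rng.random(n) < w
sample = np.where(from_first, rng.normal(mu1, sigma1, n), rng.normal(mu2, sigma2, n))

valley = 13.0  # a point between the two centres
print(f"density at peak 1: {mixture_pdf(mu1):.4f}")
print(f"density at valley: {mixture_pdf(valley):.6f}")
print(f"density at peak 2: {mixture_pdf(mu2):.4f}")
print(f"sample mean: {sample.mean():.2f} (sits between the peaks, where data are scarce)")
```

The sample mean lands in the valley between the two peaks, which illustrates why a single unimodal fit centred on the mean describes neither sub-population well.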

| Distribution | Shape | Parameters | When to use |
| --- | --- | --- | --- |
| Normal | Symmetric bell curve | μ (mean), σ (std dev) | Temperatures, measurement errors, heights: continuous values that vary symmetrically around a central average |
| Exponential | Peaks at zero, decays right | λ (rate) | Rainfall intensity, waiting times, failure rates: non-negative values where small is common and large is rare |
| Uniform | Perfectly flat | min, max | Sunshine hours, random draws from a bounded range: when all values are equally likely |
| Bimodal | Two peaks | μ₁, σ₁, μ₂, σ₂, w | Mixture populations. Two-component mixture: each peak has its own centre (μ) and spread (σ), plus a mixing weight w controlling how much each component contributes |
```python
import numpy as np
import scipy.stats as stats

# Seeded generator so the simulation is reproducible
rng = np.random.default_rng(42)

# Generate 1000 days of simulated London rainfall (exponential, mean 3.5 mm)
rainfall = rng.exponential(scale=3.5, size=1000)

# Fit a normal distribution to this data (returns the sample mean and std dev)
mu, sigma = stats.norm.fit(rainfall)

# The fitted normal assigns probability to negative rainfall
prob_negative = stats.norm.cdf(0, loc=mu, scale=sigma)
print(f"Probability of negative rainfall under normal model: {prob_negative:.1%}")
# Prints roughly 16%: an exponential has mean ≈ standard deviation,
# so zero sits about one σ below the fitted mean (Φ(−1) ≈ 15.9%)
```

Why Distribution Shape Matters for Machine Learning

A model that treats London temperatures as normally distributed can produce well-calibrated forecasts: it correctly assigns low probability to extreme heat and extreme cold. The same model applied to rainfall will assign meaningful probability to negative rainfall, an impossibility. It will also underestimate the probability of heavy rain events, because the normal distribution’s tail decays much faster than an exponential tail.

Temperature and rainfall have fundamentally different distribution shapes. Temperature is symmetric (days above average are as common as days below average) and can be modelled by a normal distribution. Rainfall is right-skewed (most days are dry) and follows an exponential pattern. A normal model applied to rainfall assigns probability to negative values, which cannot physically exist.
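The size of the tail mismatch can be estimated without any simulation. The sketch below assumes rainfall is exactly exponential with a mean of 3.5 mm and compares it with a normal distribution that has the same mean and standard deviation; the 20 mm threshold is an illustrative choice.

```python
from scipy import stats

scale = 3.5  # assumed mean daily rainfall in mm
true_rain = stats.expon(scale=scale)

# Normal model matched to the exponential's mean and standard deviation
normal_model = stats.norm(loc=true_rain.mean(), scale=true_rain.std())

threshold = 20.0  # a heavy-rain day, in mm
p_exp = true_rain.sf(threshold)      # P(rain > 20 mm) under the exponential
p_norm = normal_model.sf(threshold)  # the same probability under the normal model
p_negative = normal_model.cdf(0.0)   # mass the normal model puts below zero

print(f"P(rain > {threshold} mm), exponential: {p_exp:.2e}")
print(f"P(rain > {threshold} mm), normal:      {p_norm:.2e}")
print(f"P(rain < 0 mm), normal:         {p_negative:.1%}")
```

Under these assumptions the normal model rates a 20 mm day as far rarer than the exponential does, while placing a substantial share of its probability on impossible negative values.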

Common Mistake

Assuming your data follows a normal distribution without checking. Many introductory ML courses use normally distributed toy datasets. Real-world data — income, time-to-event, sensor counts, rare occurrences — is frequently right-skewed, bounded at zero, or multimodal. Fitting a normal-distribution model to skewed data produces residuals that are systematically non-random, which violates the model’s assumptions and produces unreliable prediction intervals.

💡 Insight

Checking distribution shape is one of the first steps in exploratory data analysis. A histogram or kernel density estimate takes seconds to produce and reveals: whether the distribution is symmetric or skewed; whether there are multiple modes; whether there are sharp boundaries at zero or at a maximum value; and whether extreme values are more or less common than a normal distribution would predict. These observations directly constrain which models are appropriate.
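Those checks can be bundled into a few lines. The helper below, `shape_summary`, is a hypothetical convenience function and not part of any library; it runs on simulated temperature and rainfall samples, and real data would replace them.

```python
import numpy as np
from scipy import stats

def shape_summary(values, bins=10):
    """Quick shape diagnostics for a 1-D sample (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    counts, _ = np.histogram(values, bins=bins)
    return {
        "min": values.min(),                        # hard boundary at zero?
        "max": values.max(),
        "mean": values.mean(),
        "median": np.median(values),                # mean far from median suggests skew
        "skewness": stats.skew(values),             # about 0 when symmetric
        "excess_kurtosis": stats.kurtosis(values),  # about 0 for normal-like tails
        "histogram_counts": counts,                 # look for one peak, two peaks, or flat
    }

rng = np.random.default_rng(1)
temperature = rng.normal(loc=15.0, scale=4.0, size=5000)  # simulated, symmetric
rainfall = rng.exponential(scale=3.5, size=5000)          # simulated, right-skewed

temp_summary = shape_summary(temperature)
rain_summary = shape_summary(rainfall)
for name, summary in (("temperature", temp_summary), ("rainfall", rain_summary)):
    print(name, {k: np.round(v, 2) for k, v in summary.items() if k != "histogram_counts"})
    print("  histogram counts:", summary["histogram_counts"])
```

The numbers complement a plotted histogram and do not replace it: two peaks, for example, are easy to see in the bin counts but invisible in the skewness.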

Real World

Credit scoring models predict the probability that a loan applicant will default. Defaults are rare events, typically affecting under 5% of applicants. The distribution of risk is heavily right-skewed: most applicants carry very low risk, and a small tail carries very high risk. Models that assume symmetric risk distributions consistently misclassify high-risk applicants by underestimating how extreme the tail is. Specialised approaches, such as extreme value theory and Poisson regression, exist specifically to handle right-skewed rare-event data.

Think about which shape each measurement will produce — and why.

In this section

What is a probability distribution?

A probability distribution describes how likely different values are for a measurement. For continuous data it is represented by a density function — a curve where the area under the curve over any interval gives the probability of observing a value in that interval. For discrete data it is represented by a mass function assigning a probability to each possible value.
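The difference between a density and a mass function shows up clearly in code. The sketch below assumes an exponential rainfall model with mean 2 mm for the continuous case and, purely as an illustration, a Poisson count with mean 3 for the discrete case.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Continuous case: probability is the area under the density over an interval
rain = stats.expon(scale=2.0)
area, _ = quad(rain.pdf, 0.0, 2.0)
cdf_diff = rain.cdf(2.0) - rain.cdf(0.0)
print(f"P(0 <= rain <= 2 mm): area={area:.4f}, cdf difference={cdf_diff:.4f}")

# Discrete case: the mass function gives a probability for each value directly
counts = stats.poisson(mu=3.0)
k = np.arange(0, 51)
total_mass = counts.pmf(k).sum()
print(f"P(count = 2) = {counts.pmf(2):.4f}; masses over 0..50 sum to {total_mass:.6f}")
```

A density value at a single point is not itself a probability and can exceed 1 for narrow distributions; only areas under the curve are probabilities.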

What is the difference between a distribution and a histogram?

A histogram is an empirical approximation built from observed data — it shows how many data points fell into each bin. A distribution is a theoretical model describing the underlying data-generating process. With enough data, a histogram approaches the theoretical distribution shape; with little data it is jagged and may look nothing like the true distribution.
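The convergence of a histogram toward its underlying distribution can be demonstrated by simulation. This sketch assumes an exponential model with mean 2 mm, uses fixed 1 mm bins up to 15 mm, and measures the largest gap between the observed and expected bin heights; the sample sizes are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rain = stats.expon(scale=2.0)

edges = np.linspace(0.0, 15.0, 16)  # fifteen bins, each 1 mm wide
width = edges[1] - edges[0]
# Expected bar height per bin: the bin's probability divided by its width
expected = np.diff(rain.cdf(edges)) / width

errors = {}
for n in (50, 50_000):
    sample = rain.rvs(size=n, random_state=rng)
    counts, _ = np.histogram(sample, bins=edges)
    observed = counts / n / width
    errors[n] = np.max(np.abs(observed - expected))
    print(f"n={n:>6}: largest gap between histogram and distribution = {errors[n]:.4f}")
```

With 50 observations the bars wander well away from the theoretical heights; with 50,000 they sit almost exactly on them.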

Why does the exponential distribution start high and decrease?

The exponential distribution models non-negative measurements where small values are most common and large values become progressively rarer. Daily rainfall follows this pattern — most days have a little precipitation and very heavy downpours are uncommon. The rate parameter λ controls how steeply the probability decays: high λ means the distribution falls off quickly, concentrating probability near zero.

◎ Intuition

London daily rainfall forms a distribution that is tallest near zero and decays to the right — most days are dry or drizzly, and heavy rain events are increasingly rare. The shape reflects the physical process: rain requires a specific set of atmospheric conditions that usually aren't present. Now imagine daily temperatures instead. Would temperatures follow the same pattern — tallest at low values, decaying toward high? Or would you expect a different shape entirely, and if so, why? Think about what physical process generates temperature — and whether that process would tend to produce extreme values as rarely as extreme rainfall does.

↺ Reflection

Key Ideas

A London weather station produces several measurement types, each with its own distribution shape. Daily temperature clusters symmetrically around a long-run average: cold and warm deviations are equally likely, producing the bell-curve shape of the normal distribution. The mean μ controls where the bell is centred; the standard deviation σ controls how wide the bell is. A small σ means most days are close to the mean. A large σ means the weather is highly variable.

Daily rainfall has a fundamentally different structure. Most days record a few millimetres or nothing — the modal value is near zero. Heavy rainfall events occur but become exponentially less probable as amounts increase. The exponential distribution captures this: it peaks at zero and decays to the right. Its single parameter λ (the rate) controls how steeply the decay happens. A high λ concentrates nearly all probability near zero. A low λ allows more probability mass in the higher-value tail.

The distinction between these shapes is not cosmetic. A normal distribution assigns symmetric probability to values equally far above and below the mean. For temperature this is appropriate: a day 10°C above average is as plausible as a day 10°C below average. For rainfall it is not: a normal distribution with a mean of 3 mm and a comparable standard deviation assigns meaningful probability to negative amounts such as −3 mm, which are physically impossible. It also severely underestimates the probability of days with 20 mm or more, because normal distribution tails decay faster than exponential tails.

Sunshine hours distribute approximately uniformly between zero and the seasonal maximum — each amount is roughly as likely as any other, with no central clustering. When a histogram is flat rather than peaked, the uniform distribution is the appropriate model. Extreme rainfall events sometimes show two distinct clusters — ordinary frontal rain and rare convective storms — producing a bimodal histogram. When two peaks appear, fitting a single unimodal model blurs both peaks and misrepresents the data-generating process.

The practical rule: look at the histogram before fitting a model. If the distribution is symmetric and bell-shaped, normal assumptions are reasonable. If it decays from zero, exponential or log-normal models apply. If it is flat, uniform. If it has two peaks, a mixture model is needed. Each shape corresponds to a different generative story about the physical process producing the data — and the generative story constrains which predictions the model can make reliably.

Key Points

The normal distribution is characterised by two parameters — mean μ and standard deviation σ — and is symmetric: values above and below the mean are equally likely at every distance from the centre.

The exponential distribution is characterised by a single rate parameter λ and is right-skewed: probability is highest near zero and decays continuously toward larger values, making small measurements common and large ones rare.

Distribution shape constrains valid modelling choices: fitting a normal distribution to rainfall assigns positive probability to negative values, underestimates the frequency of heavy events, and produces systematically wrong prediction intervals.

The uniform distribution (equal probability across a bounded range) and bimodal distribution (two distinct peaks) appear in measurement data and signal that uniform random draws or mixture populations are involved.

Checkpoint

Check Your Understanding

Four questions on probability distributions, parameter interpretation, and model assumptions.

1

A hospital records the time between successive patient arrivals at an emergency room. Arrivals happen independently at a roughly constant average rate of 8 per hour. Which distribution best describes the time between arrivals?

2

True or false: the mean of a distribution is always the value that appears most often in the data.

3

A data scientist models daily London temperatures as N(11, 5²) — a normal distribution with mean 11°C and standard deviation 5°C. They then try the same approach for daily rainfall, fitting N(3, 4²). What is the fundamental problem with the rainfall model?

4

Order these distributions from most to least symmetric (most symmetric first):

  1. Exponential
  2. Normal
  3. Uniform
  4. Bimodal (equal peaks)