Understanding Data · Chapter 4 of 5
Descriptive Statistics
Numbers that summarise thousands of values — and the ways they can mislead
When a politician claims 'average salary in this city is £45,000', they almost certainly mean the mean — a number pulled upward by a few very high earners that most residents never come close to earning.
- 1 Define mean, median, variance, and standard deviation in plain English.
- 2 Explain why the mean and median diverge in skewed distributions and which is more representative of the typical case.
- 3 Given a described dataset, choose the appropriate measure of centre and justify the choice.
- 4 Identify how a reported statistic could be technically accurate but misleading, and explain what additional statistic would give a more complete picture.
Mean, Median, Spread
The city housing authority has collected salary records for 1,000 residents. A councillor asks: “What is the typical salary in this city?” The analyst pulls up the data and could truthfully answer £47,200 — or £31,400. Both numbers describe the same 1,000 people. Neither is wrong. But they tell completely different stories about whether residents can afford housing, and choosing between them is not a mathematical question. It is a question about what “typical” should mean.
Measures of Centre
The most familiar summary of a dataset is its centre — the value around which others cluster. Two statistics measure this, and they disagree whenever the data is skewed.
The mean is the arithmetic average: sum all values, divide by the count. Its defining property is that every value contributes equally. A single resident earning £2,000,000 adds £2,000 to the mean of a 1,000-person dataset. The mean is mathematically efficient because it uses all the information in the data. But that efficiency is also its weakness: extreme values pull it away from where most data sits.
The median is the middle value when all values are sorted. For a dataset of 1,000 salaries, the median is the average of the 500th and 501st values after sorting. Adding a billionaire to the dataset barely moves it: with 1,001 values the median becomes the 501st value alone, a shift of half a position. The median is robust to extremes because it depends only on rank, not on how extreme the extremes are.
A common mistake is reporting mean income as if it describes what a typical person earns. “Average salary” almost always means the mean, and in any distribution with a few high earners the mean is higher than what most residents earn. A city where nine residents earn £20,000 and one resident earns £200,000 has a “mean salary” of £38,000, which no one in the city actually earns. This is not a calculation error. It is a deliberate or careless choice of which statistic to report.
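The ten-resident city above is small enough to check directly. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the lesson.

```python
import statistics

# Nine residents on £20,000 and one on £200,000, as in the example above.
salaries = [20_000] * 9 + [200_000]

mean_salary = statistics.mean(salaries)      # sum of values / count
median_salary = statistics.median(salaries)  # middle of the sorted values

# How many residents earn less than the reported "average"?
below_mean = sum(1 for s in salaries if s < mean_salary)

print(f"mean   = £{mean_salary:,.0f}")
print(f"median = £{median_salary:,.0f}")
print(f"{below_mean} of {len(salaries)} residents earn less than the mean")
```

Nine of the ten residents sit below the mean, while the median lands exactly on what those nine earn.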
Measures of Spread
The centre of a distribution is only half the story. Two datasets with identical means can look completely different if one is tightly clustered and the other is widely spread. The spread tells you how much values typically deviate from the centre.
The variance is the average squared distance from the mean. For each value in the dataset, calculate how far it is from the mean, square that distance, and average all the squared distances. Squaring serves two purposes: it eliminates negative deviations (a value £5,000 below the mean and one £5,000 above both contribute equally) and it penalises large deviations more than small ones. A salary £30,000 from the mean contributes nine times as much as one £10,000 from the mean.
The standard deviation is the square root of the variance. Taking the square root brings the unit back to the original currency (pounds, not pounds-squared), making it interpretable alongside the mean.
For each salary, calculate how far it is from the mean, square that distance, average all the squared distances, then take the square root. The result is in the same currency as the original salaries. In a salary dataset with mean £31,000 and standard deviation £8,000, salaries within one standard deviation of the mean run from £23,000 to £39,000, which is where typical residents fall when the distribution is roughly symmetric.
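The four steps described above can be sketched in a few lines of Python. The five salaries are hypothetical, picked only so that the mean comes out at £31,000 and the arithmetic is easy to follow by hand.

```python
import math
import statistics

# Hypothetical salaries chosen so the arithmetic is easy to follow.
salaries = [23_000, 27_000, 31_000, 35_000, 39_000]

mean_salary = sum(salaries) / len(salaries)

# Step 1: distance of each salary from the mean.
deviations = [s - mean_salary for s in salaries]
# Step 2: square each distance.
squared = [d ** 2 for d in deviations]
# Step 3: average the squared distances (population variance, divide by N).
variance = sum(squared) / len(squared)
# Step 4: the square root returns the result to pounds.
std_dev = math.sqrt(variance)

print(f"mean = £{mean_salary:,.0f}")
print(f"variance = {variance:,.0f} pounds-squared")
print(f"standard deviation = £{std_dev:,.2f}")

# The standard library's population versions should agree.
assert math.isclose(variance, statistics.pvariance(salaries))
assert math.isclose(std_dev, statistics.pstdev(salaries))
```

Dividing by N matches the "average squared distance" definition used here. `statistics.variance` and `statistics.stdev` divide by N − 1 instead, which is the usual choice when the data is a sample from a larger population.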
Percentiles and the IQR
The mean and standard deviation describe where a distribution is centred and how spread out it is — but both are sensitive to outliers. When a dataset has extreme values, a more robust description uses percentiles.
The 25th percentile (Q1) is the value below which 25% of the data falls. The 75th percentile (Q3) is the value below which 75% falls. The interquartile range, or IQR, is the distance between them: IQR = Q3 − Q1. It describes the middle 50% of the distribution — the range where a typical resident’s salary will land.
A single extreme salary has almost no effect on the IQR. A billionaire among 1,000 values barely shifts the 25th or 75th percentile. This robustness makes the IQR the preferred measure of spread when the data is skewed or contains outliers, which describes most real-world income data.
Use IQR and median when the data is skewed or has outliers. Use mean and standard deviation when the distribution is roughly symmetric and outlier-free. For income, prices, time-to-event data, and website session times, IQR and median almost always give a more honest picture than mean and standard deviation.
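As a rough illustration of that robustness, the sketch below computes quartiles for twelve hypothetical salaries, then swaps the top earner for someone on £2,000,000. It uses `statistics.quantiles` with its default method; the helper name `quartile_summary` is made up for this example.

```python
import statistics

# Hypothetical salaries, already sorted for readability.
salaries = [18_000, 21_000, 23_000, 25_000, 27_000, 30_000,
            32_000, 35_000, 38_000, 42_000, 47_000, 60_000]

def quartile_summary(values):
    """Return (Q1, median, Q3, IQR) using the statistics module's default method."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q1, q2, q3, q3 - q1

# Same residents, except the top earner now makes £2,000,000.
with_outlier = salaries[:-1] + [2_000_000]

q1_a, med_a, q3_a, iqr_a = quartile_summary(salaries)
q1_b, med_b, q3_b, iqr_b = quartile_summary(with_outlier)

print(f"IQR before: £{iqr_a:,.0f}   after: £{iqr_b:,.0f}")
print(f"SD  before: £{statistics.pstdev(salaries):,.0f}   "
      f"after: £{statistics.pstdev(with_outlier):,.0f}")
```

The IQR and median are identical in both runs, while the standard deviation grows many times over. Quartile conventions differ between tools, so Q1 and Q3 from a spreadsheet or another library may differ slightly from these values.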
How Statistics Can Mislead
The same salary dataset produces three simultaneously true statements:
- “Mean salary is £47,200” — pulled upward by a small number of very high earners
- “Median salary is £31,400” — the value a typical resident falls near
- “Standard deviation is £52,000” — enormous spread, dominated by outlier values
All three are correct. None involves rounding errors or fabrication. But only the median tells an honest story about what a resident looking for housing can expect to earn. The mean overstates the median by about 50%. The standard deviation implies that typical salaries range from below zero to nearly £100,000 (£47,200 ± £52,000), technically spanning most of the actual distribution, but giving no intuition about where most people actually are.
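The arithmetic behind those two observations is short enough to write out. This sketch uses only the three reported figures; nothing else about the dataset is assumed.

```python
# The three reported statistics from the salary dataset above.
mean_salary = 47_200
median_salary = 31_400
std_dev = 52_000

# How far the mean sits above the median, as a percentage of the median.
overstatement_pct = (mean_salary - median_salary) / median_salary * 100

# The band one standard deviation either side of the mean.
band_low = mean_salary - std_dev
band_high = mean_salary + std_dev

print(f"mean exceeds median by {overstatement_pct:.1f}%")
print(f"mean ± 1 SD runs from £{band_low:,} to £{band_high:,}")
```

The lower edge of the band is negative, a salary nobody earns, which is itself a sign that a symmetric band around the mean describes this distribution poorly.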
Descriptive statistics are a compression of information. Every compression loses something. The question is always: what did this statistic choose to ignore, and does that matter for the decision being made? A housing authority using mean salary to set affordability thresholds will systematically overestimate what residents can afford.
The UK Office for National Statistics reported in recent years that median household disposable income was around £35,000 while mean household disposable income was around £52,000. Both numbers were correct. The gap, nearly 49%, is explained largely by the top decile of earners pulling the mean upward. Political debates about living standards routinely cite whichever of these two numbers supports the speaker’s argument.
The playground below lets you reshape the city’s salary distribution and watch what happens to the mean, median, and standard deviation as you add high earners, change the spread, or switch to a bimodal two-group distribution. The mean and median lines update live as residents are added — when they are close together, the distribution is roughly symmetric; when they diverge, the mean is no longer telling an honest story.
What is the difference between mean and median?
The mean is the arithmetic average — sum divided by count. The median is the middle value when all values are sorted. For symmetric distributions they are equal. For skewed distributions the mean is pulled toward the long tail while the median stays near the true centre of the data.
What does standard deviation actually measure?
Standard deviation measures how far typical values are from the mean. A small standard deviation means most values are clustered close to the mean. A large standard deviation means values are spread widely. Roughly 68% of values in a normal distribution fall within one standard deviation of the mean.
Why is variance the square of standard deviation?
Variance is defined as the average squared distance from the mean. Squaring eliminates negative distances and penalises large deviations more than small ones. Taking the square root gives standard deviation — which is in the same units as the original data and easier to interpret.
The playground starts with a roughly symmetric salary distribution — the mean and median lines sit close together, both near £32,000. The standard deviation shading shows the typical range around the mean. Before you interact — what do you think will happen to the mean line when you switch to a distribution with a small number of very high earners? Will the median move by the same amount as the mean, more, or less? And if they diverge, which one more honestly describes what a typical resident earns?
Key Ideas
The housing authority’s 1,000 salary records produce a mean of £47,200 and a median of £31,400. The 50% gap between them is not noise or error. It is the mathematical signature of right skew. A small number of high earners pull the mean upward without moving the median, because the mean responds to how extreme values are while the median responds only to their rank. In a dataset of 1,000 sorted values, one person earning £2,000,000 instead of £50,000 moves the mean by £1,950 but does not move the median at all, because that salary sits above the median either way.
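The £1,950 figure can be checked with a synthetic dataset. The sketch below assumes 1,000 evenly spaced salaries, which is not the housing authority's real data, and replaces the top earner on £50,000 with someone on £2,000,000.

```python
import statistics

# Hypothetical evenly spaced salaries: 1,000 values from £20,030 to £50,000.
salaries = [20_030 + 30 * i for i in range(1000)]

# Replace the top earner (£50,000) with someone earning £2,000,000.
changed = salaries[:-1] + [2_000_000]

mean_shift = statistics.mean(changed) - statistics.mean(salaries)
median_shift = statistics.median(changed) - statistics.median(salaries)

print(f"mean moved by   £{mean_shift:,.0f}")
print(f"median moved by £{median_shift:,.0f}")
```

The mean shift depends only on the size of the change divided by the number of records, so it would be the same for any 1,000-value dataset.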
Standard deviation squared is variance, the average squared distance from the mean. For a salary dataset with mean £31,000 and standard deviation £8,000, roughly 68% of residents earn between £23,000 and £39,000. This interpretation requires the distribution to be roughly normal. When the distribution is right-skewed, the ±1 SD band is still symmetric about the mean, but the data is not: the band can reach below the lowest actual salary on the left while missing much of the long tail on the right. Standard deviation is calculated symmetrically from the mean, so in a skewed distribution it misleads about where most values sit.
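A small simulation can show how the 68% reading depends on the shape of the distribution. This is a sketch under stated assumptions: the symmetric case uses the £31,000 and £8,000 figures from the text, while the skewed case uses a log-normal distribution whose parameters are invented here purely to produce a long right tail.

```python
import math
import random

def share_within_one_sd(values):
    """Return the fraction of values inside mean ± 1 SD, plus the band edges."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    low, high = mean - sd, mean + sd
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values), low, high

rng = random.Random(0)  # fixed seed so the run is repeatable

# Roughly normal salaries: mean £31,000, SD £8,000 (figures from the text).
symmetric = [rng.gauss(31_000, 8_000) for _ in range(100_000)]

# Right-skewed salaries: a log-normal with assumed parameters,
# chosen only to give a long right tail.
skewed = [rng.lognormvariate(10.3, 1.0) for _ in range(100_000)]

sym_share, sym_low, sym_high = share_within_one_sd(symmetric)
skw_share, skw_low, skw_high = share_within_one_sd(skewed)

print(f"symmetric: {sym_share:.1%} within £{sym_low:,.0f} to £{sym_high:,.0f}")
print(f"skewed:    {skw_share:.1%} within £{skw_low:,.0f} to £{skw_high:,.0f}")
```

In the symmetric case the share should land close to 68%. In the skewed case the band's lower edge falls below zero and the share inside it is well above 68%, so the usual rule of thumb no longer applies.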
The interquartile range (IQR = Q3 − Q1) avoids this problem. It describes the middle 50% of the actual data regardless of what happens in the tails. In the housing authority dataset, Q1 might be £18,000 and Q3 £42,000, giving an IQR of £24,000 and placing the middle half of residents inside that range no matter how extreme the high-earner outliers are. IQR and median together give an outlier-resistant picture of the distribution.
A bimodal salary distribution — one cluster of low-income residents and a separate cluster of high-income residents — exposes the deepest failure mode of summary statistics. The mean lands between the two peaks, in a region where few residents actually earn. The median lands near the lower peak (which contains more residents), which is better but still describes only one of the two groups. Both statistics describe the combined population accurately but describe neither sub-population well. When a histogram shows two peaks, the honest analysis is to separate the two groups and describe each with its own mean and standard deviation.
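The bimodal failure mode is easy to reproduce with made-up numbers. The sketch below assumes a city of 600 lower-income and 400 higher-income residents; the group sizes and salary ranges are hypothetical.

```python
import statistics

# Hypothetical two-group city: 600 lower-income and 400 higher-income residents.
low_group = [20_000 + 10 * i for i in range(600)]    # £20,000 to £25,990
high_group = [85_000 + 25 * i for i in range(400)]   # £85,000 to £94,975
everyone = low_group + high_group

mean_all = statistics.mean(everyone)
median_all = statistics.median(everyone)

# Count residents earning within £5,000 of the combined mean.
near_mean = sum(1 for s in everyone if abs(s - mean_all) <= 5_000)

print(f"combined mean   = £{mean_all:,.0f}  ({near_mean} residents within £5,000)")
print(f"combined median = £{median_all:,.0f}")

# Describing each group separately is more informative.
for name, group in (("low", low_group), ("high", high_group)):
    print(f"{name:>4}: mean £{statistics.mean(group):,.0f}, "
          f"SD £{statistics.pstdev(group):,.0f}")
```

In this constructed example nobody earns within £5,000 of the combined mean, and the median sits inside the larger, lower-income group.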
For machine learning, right-skewed features cause problems for algorithms that assume symmetric feature distributions or that use Euclidean distance. A feature with mean £47,200 and standard deviation £52,000 has a coefficient of variation above 1.0, which is extreme spread relative to the mean. Distance-based algorithms including KNN, K-means, and SVMs operate on raw salary values, so a handful of extreme salaries dominates the distance calculations. Log transformation compresses the right tail, brings the distribution closer to symmetric, and makes distance calculations meaningful. This is covered in Topic 2 of this domain, Features and Representations.
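The effect of a log transformation can be sketched on a handful of hypothetical salaries. The example compares the mean-to-median ratio, used here as a crude symmetry check, and the relative distances between residents, before and after taking the natural log.

```python
import math
import statistics

# Hypothetical right-skewed salaries with one very high earner.
salaries = [20_000, 22_000, 25_000, 28_000, 31_000,
            35_000, 40_000, 60_000, 150_000, 2_000_000]

log_salaries = [math.log(s) for s in salaries]  # natural log; salaries must be > 0

def mean_to_median(values):
    """Ratio of mean to median: close to 1.0 for a roughly symmetric distribution."""
    return statistics.fmean(values) / statistics.median(values)

# Distance from a mid-range earner (£31,000) to two others, raw vs log scale.
raw_gap_near = abs(35_000 - 31_000)
raw_gap_far = abs(2_000_000 - 31_000)
log_gap_near = abs(math.log(35_000) - math.log(31_000))
log_gap_far = abs(math.log(2_000_000) - math.log(31_000))

print(f"mean/median raw: {mean_to_median(salaries):.2f}")
print(f"mean/median log: {mean_to_median(log_salaries):.2f}")
print(f"far/near distance ratio raw: {raw_gap_far / raw_gap_near:.0f}")
print(f"far/near distance ratio log: {log_gap_far / log_gap_near:.0f}")
```

On the raw scale the single high earner is hundreds of times further away than a near neighbour. On the log scale the gap shrinks to a few dozen times, so one extreme value no longer swamps every distance calculation.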
The mean is the arithmetic average — efficient and accurate for symmetric distributions, misleading for skewed ones where it is pulled toward the long tail by a small number of extreme values.
The median is the middle value when sorted: robust to outliers and skew, and close to the typical case in any single-peaked distribution regardless of how extreme the extremes are.
Standard deviation measures how far typical values are from the mean — large SD signals high spread, but when SD is large relative to the mean, a single unimodal distribution may not be the right model.
Choosing the wrong summary statistic produces technically correct but misleading descriptions — reporting mean salary in a right-skewed income distribution overstates what most residents earn.
Check Your Understanding
Four questions on mean, median, spread, and how summary statistics can mislead.
A dataset of 1,000 house prices has a mean of £380,000 and a median of £245,000. What does the gap between these two values most likely indicate?
True or false: standard deviation and interquartile range both measure spread, so they can always be used interchangeably.
A data analyst reports that the 'average' time users spend on a webpage is 4 minutes 20 seconds. A colleague argues the median would be more informative. When would the colleague be right?
Order these statistics from most to least robust against the effect of extreme outliers (most robust first):
- 1. Mean
- 2. Standard deviation
- 3. IQR
- 4. Median