SeeingML
Understanding Data · Chapter 5 of 5

Data Quality & Outliers

Garbage in, garbage out — and the garbage is usually invisible

A faulty blood pressure sensor in one hospital ward recorded readings of 0 mmHg for six months. The ML model trained on that data learned that patients with blood pressure 0 had excellent outcomes — because they were all discharged quickly after the error was discovered.

Learning Objectives
  1. Name four types of data quality problems and give one example of each.
  2. Explain the difference between an outlier that should be removed and one that should be kept.
  3. Given a described dataset with quality issues, identify which type of problem each issue represents and propose an appropriate handling strategy.
  4. Explain how a specific data quality issue would corrupt a specific type of ML model's predictions.
¶ Narrative

When Data Lies

The data scientist opens the hospital’s patient records dataset. 47,000 rows. 23 columns. Before writing a single line of model code, they run a quality audit. What they find is typical of real healthcare data: 8% of rows have at least one missing value. Twelve patients have a recorded age of 0. Three have blood pressure readings of 0 mmHg. One patient appears 14 times with slightly different name spellings. A cluster of patients from one ward have systematically lower temperature readings than clinically plausible — a miscalibrated thermometer, later confirmed. None of these issues crashed the data pipeline. All of them would have silently corrupted any model trained on the raw data.

Types of Data Quality Problems

Every ML pipeline encounters the same four categories of data quality problems. Each has a different cause and requires a different response.

Missing values. A cell in the dataset has no recorded value. Equipment failed, the field was optional, the patient refused to answer, or the data entry system had a bug. Missing values are visible — a null check finds them instantly — but handling them correctly requires understanding why they are missing. A sensor that randomly malfunctioned is different from a patient who deliberately left the pain severity field blank.
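The null check mentioned above can be sketched in a few lines. This is a minimal illustration only: the records, the column names, and the helper functions `completeness_report` and `rows_with_any_missing` are hypothetical and not part of the hospital dataset described in this chapter.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 54, "systolic_bp": 128, "pain_score": 3},
    {"age": 61, "systolic_bp": None, "pain_score": 7},
    {"age": None, "systolic_bp": 117, "pain_score": None},
    {"age": 47, "systolic_bp": 135, "pain_score": None},
]


def completeness_report(rows):
    """Return the fraction of missing values for each column."""
    columns = rows[0].keys()
    return {
        col: sum(1 for row in rows if row[col] is None) / len(rows)
        for col in columns
    }


def rows_with_any_missing(rows):
    """Return the fraction of rows that have at least one missing value."""
    incomplete = sum(1 for row in rows if any(v is None for v in row.values()))
    return incomplete / len(rows)


report = completeness_report(records)
row_fraction = rows_with_any_missing(records)
print(report)        # per-column missing fraction
print(row_fraction)  # share of rows with any gap
```

A report like this says how much is missing and where. It says nothing about why, which is the question that decides the handling strategy.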

Outliers. Values far from the rest of the distribution. A 95-year-old patient is unusual in a hospital dataset, but real. A patient with recorded age −3 is impossible. A blood pressure reading of 0 mmHg is a sensor error or data entry mistake. The challenge is distinguishing genuine extreme cases from errors, and the distinction matters enormously because the correct responses are opposite. Outliers that are real must be kept. Outliers that are errors must be removed.

Duplicate records. The same entity appearing multiple times. In healthcare, the same patient under different name spellings or hospital ID numbers. In e-commerce, the same transaction recorded twice during a system synchronisation. Duplicates inflate counts and bias any model toward patterns that appear in the repeated records, which may be entirely accidental.
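One possible way to surface the name-spelling duplicates described above is fuzzy string matching. The sketch below uses Python's standard `difflib.SequenceMatcher`; the patient records, the `normalise` and `likely_duplicates` helpers, and the 0.8 similarity threshold are all assumptions chosen for illustration, not a recommended entity-resolution setup.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical patient records: (record_id, name, date_of_birth).
patients = [
    (1, "Katherine O'Neill", "1958-03-14"),
    (2, "Catherine ONeill", "1958-03-14"),
    (3, "Kathrine O'Neil", "1958-03-14"),
    (4, "James Okafor", "1971-11-02"),
]


def normalise(name):
    """Lowercase and keep letters only, so punctuation and spacing are ignored."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def likely_duplicates(rows, threshold=0.8):
    """Return pairs of record ids with the same birth date and similar names."""
    pairs = []
    for (id_a, name_a, dob_a), (id_b, name_b, dob_b) in combinations(rows, 2):
        if dob_a != dob_b:
            continue  # different birth dates: treat as different people
        similarity = SequenceMatcher(
            None, normalise(name_a), normalise(name_b)
        ).ratio()
        if similarity >= threshold:
            pairs.append((id_a, id_b))
    return pairs


candidate_pairs = likely_duplicates(patients)
print(candidate_pairs)
```

The pairs returned are candidates for a human or a stricter rule to confirm. Two different people can share a birth date and a similar name, so merging automatically on this evidence alone would be risky.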

Systematic errors. A consistent, directional bias introduced by faulty equipment or a flawed data collection protocol. The miscalibrated thermometer records every reading 1.4°C too low. A scale reads 2 kg too heavy. These errors are the most dangerous because they are consistent — the model learns the wrong pattern with high confidence, not with uncertainty. There is no statistical test that can detect a systematic error from within the dataset alone; you need external reference data or domain knowledge.

Four data quality problem types in a single patient dataset. The amber cluster sits above where it should be — shifted by a consistent calibration error. Red dots are impossible values. Hollow circles are missing. The main cluster is what any algorithm should see, but rarely does.
| Problem type | Detectable by | Typical cause | Handling approach |
| --- | --- | --- | --- |
| Missing values | Null checks, completeness reports | Optional fields, equipment failure, patient refusal | Imputation, row removal, or algorithm that handles missingness natively |
| Outliers | IQR rule, z-score > 2.5–3 | Measurement error or genuine extreme case | Domain-knowledge review: remove errors, keep genuine extremes |
| Duplicate records | Exact or fuzzy matching across records | Data entry errors, system merges, sync failures | Deduplication with entity resolution |
| Systematic errors | Cross-source validation, domain expertise | Faulty equipment, flawed protocol | Recalibration or removal of the affected time window |

Outlier Detection

Two statistical methods flag values for human review: the z-score and the IQR rule. The z-score measures how many standard deviations a value sits from the mean; a value whose z-score exceeds 3 in absolute value is unusual. The IQR rule defines outliers as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1 is the interquartile range. Both methods answer the same question, which values are statistically unusual, although they can disagree on individual values. Neither method decides what to do with them.
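Both rules fit in a few lines of standard-library Python. The readings below are invented for illustration, and the function names `zscore_flags` and `iqr_flags` are hypothetical. Quartiles are computed with `statistics.quantiles` using its default method; other quartile conventions give slightly different fences.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical systolic blood pressure readings in mmHg.
readings = [118, 122, 125, 119, 131, 127, 124, 120, 129, 123, 126, 121, 0, 195]


def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_flags(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


z_flagged = zscore_flags(readings)
iqr_flagged = iqr_flags(readings)
print(z_flagged)
print(iqr_flagged)
```

In this small sample the z-score rule flags only the 0 reading, because the two extreme values inflate the standard deviation they are measured against, while the IQR rule flags both 0 and 195. Either way, the output is a list of candidates for review.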

Statistical outlier detection identifies candidates for review, not candidates for deletion.

⚠️ Warning

A z-score of 3.5 does not mean a value is wrong. It means a value is unusual. A 95-year-old patient has an unusual age for a hospital dataset — but they are real and their data is valid. Statistical outlier detection identifies candidates for review, not candidates for deletion. The decision requires domain knowledge.

Common Mistake

Automatically removing all statistical outliers. This discards genuine extreme cases, which are often the most important data points in medical and fraud detection datasets. A fraud detection model that removes unusual transactions during training will be blind to fraud in production — because fraud is, by definition, an unusual transaction. Outlier removal without domain knowledge is a reliable way to destroy the information content of a dataset.
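A flag-then-review workflow can be sketched as a second pass over the statistically flagged values. The plausibility ranges in `PLAUSIBLE_RANGES` are placeholders chosen for this illustration, not clinical guidance, and the `triage` helper is hypothetical; in practice the limits would come from domain experts.

```python
# Assumed plausibility limits, for illustration only (not clinical guidance).
PLAUSIBLE_RANGES = {
    "age": (0, 120),
    "systolic_bp": (40, 300),
}


def triage(column, value):
    """Classify a statistically flagged value using domain limits.

    Values outside the plausible range are treated as errors.
    Values inside it are kept and passed to a human reviewer.
    """
    low, high = PLAUSIBLE_RANGES[column]
    if value < low or value > high:
        return "error"
    return "review"


# Values a z-score or IQR rule might have flagged.
flagged = [("age", -3), ("age", 95), ("systolic_bp", 0), ("systolic_bp", 195)]
decisions = [triage(column, value) for column, value in flagged]
print(list(zip(flagged, decisions)))
```

The statistical step and the domain step stay separate: the first proposes, the second decides. The 95-year-old and the 195 mmHg reading survive, while the impossible age and the zero reading are marked as errors.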

Missing Data Strategies

Three strategies handle missing values, each with different assumptions and trade-offs.

Deletion removes every row that has a missing value. It is simple and introduces no imputed values. It is only safe when missingness is completely random and the deleted rows are representative of the rest of the dataset. When sicker patients are more likely to have missing measurements — which is common in healthcare — deletion produces a dataset that systematically underrepresents the cases that matter most.

Simple imputation replaces missing values with the mean, median, or mode of the non-missing values for that column. It is fast and produces a complete dataset. It assumes missing values look like present values — an assumption that is explicitly false whenever missingness is related to the value itself.

Model-based imputation predicts the missing value from other features in the same row. A patient with missing blood pressure but known age, weight, and diagnosis may have their missing BP predicted reasonably well. This is more accurate than mean imputation but more complex, and it can amplify errors if the predictor features are themselves noisy.
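The three strategies can be compared side by side on a toy example. The rows below are invented, and the model-based step uses a single-feature least-squares line (blood pressure predicted from age) purely as a stand-in for whatever predictive model a real pipeline would use.

```python
from statistics import mean

# Hypothetical (age, systolic_bp) rows; None marks a missing reading.
rows = [(30, 115.0), (40, 120.0), (50, 125.0), (60, 130.0), (70, None), (35, None)]

observed = [(age, bp) for age, bp in rows if bp is not None]

# Strategy 1: deletion. Keep only complete rows.
deleted = observed

# Strategy 2: simple imputation. Fill gaps with the observed mean.
bp_mean = mean(bp for _, bp in observed)
mean_imputed = [(age, bp if bp is not None else bp_mean) for age, bp in rows]

# Strategy 3: model-based imputation. Fit bp = intercept + slope * age
# on the complete rows, then predict the missing readings.
age_mean = mean(age for age, _ in observed)
slope = sum((age - age_mean) * (bp - bp_mean) for age, bp in observed) / sum(
    (age - age_mean) ** 2 for age, _ in observed
)
intercept = bp_mean - slope * age_mean
model_imputed = [
    (age, bp if bp is not None else intercept + slope * age) for age, bp in rows
]

print(len(deleted), "rows remain after deletion")
print(mean_imputed[4], mean_imputed[5])    # both gaps get the same value
print(model_imputed[4], model_imputed[5])  # gaps follow the age trend
```

Mean imputation gives the 70-year-old and the 35-year-old the same blood pressure, while the fitted line gives them different values that follow the trend in the complete rows. Neither result is the true reading; each encodes a different assumption about what the missing value looked like.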

Missing data whose missingness is related to the missing value itself is called missing not at random (MNAR). In MNAR situations, the missingness pattern is itself informative: it tells you something about what the missing value probably was. Imputing with the mean destroys this signal.
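A small deterministic example makes the MNAR effect concrete. The pain scores are invented, and the rule that every patient with a score of 8 or more leaves the field blank is an assumption made to keep the example simple. The missingness indicator at the end is one common way to preserve the signal; it is shown as an option, not as a complete fix.

```python
from statistics import mean

# Hypothetical true pain scores on a 0 to 10 scale.
true_pain = [1, 2, 2, 3, 3, 4, 5, 8, 9, 10]

# Assumed MNAR mechanism: patients in severe pain (score >= 8) leave it blank.
recorded = [score if score < 8 else None for score in true_pain]

observed = [score for score in recorded if score is not None]
true_mean = mean(true_pain)
observed_mean = mean(observed)

# Mean imputation fills every gap with a mild-looking value.
imputed = [score if score is not None else observed_mean for score in recorded]

# Keeping an explicit "was missing" flag preserves the missingness signal.
features = [
    (score if score is not None else observed_mean, int(score is None))
    for score in recorded
]

print(true_mean, observed_mean)
print(features[-3:])
```

The observed mean is well below the true mean, so the patients in the most pain are assigned a low score. The indicator column does not recover the true values, but it lets a model learn that a blank field behaves differently from a recorded one.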

Imputation always introduces bias. The question is whether the bias introduced by imputation is smaller than the bias introduced by leaving values missing or deleting rows.

📖 History

The framework for classifying missing data was formalised by Donald Rubin in 1976 in his foundational paper “Inference and Missing Data.” Rubin distinguished three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). They differ in whether the probability of missingness depends on nothing at all, on the observed data, or on the missing value itself. This classification is still in standard use today and is the reason data scientists think carefully about why data is missing, not just how much is missing.

Data Leakage

Data leakage occurs when information that would not be available at prediction time contaminates the training data. The hospital example: including the number of post-discharge follow-up appointments as a feature when building a model to predict whether a patient will be readmitted. At the time the prediction needs to be made, no post-discharge appointments have taken place yet, so the feature cannot exist for a new patient. The model trains on a feature that is only measurable after the prediction window has closed. It learns a spuriously perfect pattern that does not exist at inference time.
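One way to guard against this kind of leak is to record, for every feature, the point in the patient journey at which its value becomes known, and to filter on that before training. The feature names, stage labels, and the `usable_features` helper below are hypothetical; this is a sketch of the idea rather than an established tool.

```python
# Hypothetical feature catalogue: when does each value become known?
FEATURE_AVAILABILITY = {
    "age": "admission",
    "diagnosis_code": "admission",
    "length_of_stay_days": "discharge",
    "follow_up_appointments": "post_discharge",
}

# Stages in chronological order.
STAGE_ORDER = ["admission", "discharge", "post_discharge"]


def usable_features(prediction_stage):
    """Return features whose values exist by the time the prediction is made."""
    cutoff = STAGE_ORDER.index(prediction_stage)
    return [
        name
        for name, stage in FEATURE_AVAILABILITY.items()
        if STAGE_ORDER.index(stage) <= cutoff
    ]


features_at_discharge = usable_features("discharge")
print(features_at_discharge)
```

For a readmission model that predicts at discharge, the follow-up appointment count is excluded because it only exists afterwards. The check is only as good as the catalogue, which has to be filled in by someone who knows how the data is collected.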

💡 Insight

Data leakage is the most seductive data quality problem because it makes your model look good. A leakage-affected model can achieve 99% accuracy in testing and 50% in production. The test results are not wrong — they just measure the wrong thing. High test accuracy is not always a good sign: it is sometimes a sign that the training and test data have both been contaminated with future information.

Data leakage is particularly common when features are engineered from the full dataset before the train/test split. Any aggregation that includes test examples, such as computing mean age across all patients before splitting, leaks information from the test set into the training process.
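The difference between the leaky and the safe ordering is easiest to see with a single statistic. The ages and the fixed 7/3 split below are invented for illustration; a real pipeline would typically shuffle or split by time first.

```python
from statistics import mean, pstdev

# Hypothetical patient ages.
ages = [34, 45, 52, 61, 29, 73, 48, 90, 95, 88]

# Split first. The slicing here is a simplification of a real split.
train, test = ages[:7], ages[7:]

# Leaky: the mean is computed over all rows, including the test rows.
leaky_mean = mean(ages)

# Safe: statistics come from the training rows only.
train_mean = mean(train)
train_std = pstdev(train)


def standardise(values):
    """Scale values using statistics learned from the training set only."""
    return [(v - train_mean) / train_std for v in values]


train_scaled = standardise(train)
test_scaled = standardise(test)  # test rows reuse the training statistics

print(leaky_mean, train_mean)
```

The two means differ because the test rows contain older patients. Any feature built with `leaky_mean` would carry that information about the test set into training.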

Two extreme outliers flatten a regression line that would otherwise correctly recover a positive trend. The left panel shows what the algorithm should learn. The right panel shows what it actually learns when outlier contamination is present.
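The flattening effect can be reproduced numerically with an ordinary least-squares fit. The ten clean points below follow an assumed trend of 0.5 mmHg per year of age, and the two contaminating readings are invented; the exact slope depends entirely on where the outliers sit.

```python
from statistics import mean


def fit_slope(points):
    """Return the ordinary least-squares slope of y on x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in points)
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical clean data: blood pressure rises 0.5 mmHg per year of age.
clean = [(age, 100 + 0.5 * age) for age in range(25, 75, 5)]

# Two extreme readings: a very high value and an impossible zero.
outliers = [(44, 220.0), (51, 0.0)]

clean_slope = fit_slope(clean)
contaminated_slope = fit_slope(clean + outliers)

print(clean_slope, contaminated_slope)
```

With only two bad points out of twelve, the fitted slope drops from 0.5 to roughly a quarter of that. Outliers placed further from the mean age have even more leverage and can reverse the sign of the trend.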

Bridge to the Playground

The playground presents the hospital patient dataset as a 2D scatter plot of age versus blood pressure. Starting from a clean baseline — a recognisable cluster with a slight upward trend — you can introduce each type of quality problem and watch in real time what any pattern-finding algorithm would see. Noise spreads the cluster. Injected errors place impossible measurements at extreme positions. Missing values hollow out the dataset. Systematic bias shifts an entire ward. Each problem has a distinct visual signature — and a distinct failure mode for any model trained on the corrupted data.

In this section

Should outliers always be removed?

No. Outliers fall into two categories: errors (a blood pressure reading of 0 is impossible and should be removed) and genuine extremes (a patient who spent 180 days in hospital is unusual but real and important). Removing genuine extremes discards valuable information. The decision requires domain knowledge — not just statistical thresholds.

What is the difference between missing at random and missing not at random?

Missing at random means that, once the observed data is taken into account, the probability of a value being missing is unrelated to the missing value itself; the simplest case is a sensor that fails at random. Missing not at random means missingness is related to the value: patients with very high pain levels may leave the pain field blank because they find the question distressing. The second type is much harder to handle because the missingness itself is informative.

What is imputation?

Replacing missing values with estimated values. Simple imputation replaces with the mean or median. More sophisticated methods use other features to predict the missing value. Imputation introduces its own biases — it assumes missing values look like non-missing values, which may not be true.

◎ Intuition

The playground starts with a clean patient dataset: age and blood pressure measurements cluster in a sensible region, with a slight upward trend that reflects real physiology. Before you add any noise, look at the overall shape of the cluster. If a doctor told you a reading of age = 0 and blood pressure = 220 appeared in this dataset, would you treat it as a genuine patient or as a data error? What would you look for to decide, and what would it take to convince you either way?

↺ Reflection

Key Ideas

A hospital dataset with 47,000 patient records and eight percent missing values will train a model without a single error message. The algorithm will silently exclude or impute the missing rows, fit to the remaining data, and report a performance metric. Whether that metric reflects anything real depends entirely on whether the data quality problems were caught and handled before training began.

Missing values become dangerous when missingness is correlated with the outcome. In healthcare, the sickest patients are most likely to have incomplete records — they may be unable to complete questionnaires, their condition may prevent certain measurements, or their care may be more chaotic. A model that imputes their missing values with the population mean assigns healthy-looking inputs to the unhealthiest patients. The model then learns that patients with healthy-looking imputed inputs tend to have bad outcomes — the opposite of what the underlying data says about healthy patients. Deletion makes this worse: removing rows with missing values drops the most informative patients from the training set entirely.

Statistical outlier detection identifies patients whose age or blood pressure falls more than 2.5 standard deviations from the cluster centre. A blood pressure of 0 mmHg and a blood pressure of 195 mmHg are both flagged. The first is a sensor failure — every patient with a recorded blood pressure of 0 was discharged quickly after the error was discovered, which means the model learned that BP = 0 predicts good outcomes. The second is a hypertensive crisis — a real and clinically important value that should be kept, studied, and given appropriate weight in the model. Automated deletion of both, based on the same statistical threshold, discards valid clinical information while it tries to remove noise. The z-score is a flag, not a verdict.

Systematic errors do not look like errors at all. A blood pressure cuff in one ward that reads 18 mmHg too high produces a cluster of patients with elevated readings, and an ML model that confidently learns that patients from that ward have higher blood pressure. The model is not wrong given its training data. The training data is wrong given reality. Statistical tests cannot detect this from within the dataset alone because the error is perfectly consistent: the ward’s readings are wrong, but wrong by the same amount every time. Only cross-checking against another device, or domain knowledge that the readings are clinically implausible, reveals the problem.
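Cross-checking against a second device can be sketched as a paired comparison. The readings below are invented: the same six patients are assumed to have been measured once with the ward cuff and once with a trusted reference cuff. A large average difference with a small spread points to a consistent offset rather than random noise.

```python
from statistics import mean, stdev

# Hypothetical paired systolic readings in mmHg for the same six patients.
ward_cuff = [146, 139, 152, 143, 157, 148]
reference_cuff = [128, 121, 134, 126, 138, 130]

differences = [w - r for w, r in zip(ward_cuff, reference_cuff)]
offset = mean(differences)   # average disagreement between devices
spread = stdev(differences)  # how much the disagreement varies

# If the offset is trusted, the ward readings can be recalibrated.
corrected = [w - offset for w in ward_cuff]

print(offset, round(spread, 2))
print(corrected)
```

The external reference is what makes the error visible. Looking at the ward readings alone, nothing distinguishes a miscalibrated cuff from a ward that happens to treat patients with higher blood pressure.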

Data leakage produces the most misleading test results. A model predicting whether a patient will be readmitted within 30 days, trained on features that include the number of specialist consultations scheduled after discharge, will appear to achieve very high accuracy. At training time, patients who were readmitted have no scheduled specialist consultations — those consultations were never booked because the patients returned to hospital. At inference time, this signal does not exist for new patients whose future is not yet known. The model has learned a pattern that perfectly describes the training data and perfectly fails to generalise to new patients.

Every subsequent topic in this domain — Features and Representations, Dimensionality, Data Splits and Evaluation — assumes the data has been through a quality audit. Data quality is not a preliminary step that can be deferred until the modelling work begins. Models that are built on dirty data and then evaluated on clean data will look good in testing and fail in production. Models built on dirty data and evaluated on equally dirty data will look good in testing and fail in production in a different way. The only safe path is to address data quality before any algorithm sees the data.

Key Points

Data quality problems — missing values, outliers, duplicates, and systematic errors — are the norm in real datasets, not the exception. ML models process dirty data without raising errors, learning wrong patterns silently.

Outliers require domain knowledge to handle correctly: an impossible age of −3 is an error to remove, while a genuine 95-year-old patient is real data to keep. Statistical detection flags candidates; it does not make the decision.

Systematic errors are the most dangerous quality problem because they produce confident wrong models. A miscalibrated sensor creates a pattern the algorithm learns with high certainty — a certainty that is entirely misplaced.

Data leakage — when future information contaminates training data — produces models that appear excellent in testing and fail completely in production. High test accuracy is not always a good sign.

Checkpoint

Check Your Understanding

Four questions on missing values, outlier detection, systematic errors, and data leakage.

1

A fraud detection dataset has 0.1% fraudulent transactions. One feature — merchant category code — is blank on 4% of all transactions. A data engineer proposes replacing every blank merchant code with the most common code (mode imputation). What is the main problem with this approach?

2

True or false: a model that achieves 98% accuracy on a held-out test set is guaranteed to perform well in production.

3

A hospital dataset contains blood pressure readings. One ward's readings are all exactly 15 mmHg higher than clinically expected — a miscalibrated cuff, later confirmed. The rest of the dataset has random measurement noise of ±5 mmHg. Which problem is more dangerous for a model trained on this data, and why?

4

Order these data quality problems from easiest to hardest to detect automatically, without domain knowledge (easiest first):

  1. Duplicate records
  2. Impossible values (e.g. age = −3, blood pressure = 0)
  3. Missing values
  4. Systematic calibration errors