SeeingML
Evaluation · Chapter 4 of 4

Imbalanced Data

When 99% accuracy is meaningless — and what to measure instead

A fraud team trains a random forest on 100,000 transactions. Training accuracy: 99.8%. Validation accuracy: 99.7%. Delighted, they deploy. Three weeks later they discover the model has flagged zero fraudulent transactions in production — exactly as it did in training. The model was correct 99.7% of the time by predicting 'not fraud' for every single transaction, because 99.7% of the transactions really were not fraud. The 0.3% of genuine fraud cost the business £2.3M. Every mechanism in this chapter exists to catch this failure before the model ships.

Learning Objectives
  1. Explain why accuracy is a misleading metric for imbalanced classification: a trivial classifier predicting the majority class achieves accuracy equal to the majority's proportion, regardless of model quality.
  2. Read a confusion matrix and compute precision (TP / (TP + FP)), recall (TP / (TP + FN)), and F1 (the harmonic mean of the two), and explain what each measures about a classifier's performance on the positive class.
  3. Compare precision-recall curves against ROC curves for imbalanced data, explain why PR curves are more honest at extreme imbalance (ROC is inflated by the large negative class), and identify which curve's AUC to optimise for a given problem.
  4. Select an imbalance remedy (threshold adjustment, class weighting, oversampling / SMOTE, undersampling) based on the cost structure of false positives versus false negatives and the training-compute budget available.
¶ Narrative

The Metric That Lies

A fraud team at a payments company trains a random forest on 100,000 labelled transactions. Training accuracy comes back at 99.8%, validation accuracy at 99.7%. They congratulate each other, ship the model, and move on. Three weeks into production they notice something strange: the model has flagged zero fraudulent transactions. On inspection it is correctly predicting “not fraud” for every single transaction it sees — and getting 99.7% accuracy by doing so, because 99.7% of transactions really are not fraud. The real fraud, 0.3% of traffic, has cost the company £2.3M in three weeks, and the model has done nothing to catch it.

The model was not buggy. It optimised exactly the quantity it was told to optimise — accuracy — and it chose the rational strategy for that objective on a heavily imbalanced dataset: always predict the majority class. This chapter is about recognising that accuracy is the wrong metric for imbalanced classification and knowing what to measure instead.


The accuracy paradox

Once the majority class exceeds roughly 80% of the data, accuracy on its own becomes uninformative. On a loan dataset with an 8% default rate, predicting “never defaults” for every applicant gets 92% accuracy. On a 1% fraud dataset, predicting “not fraud” gets 99%. These numbers look like model quality but they measure only how imbalanced the data is.

Two loan-screening models on the same 200 applicants (8% default rate). Model A predicts 'everyone repays': no machine learning, no model, just a constant output. It achieves 92% accuracy because 184 of 200 applicants actually repay. Model B is a real classifier: 87.5% accuracy (175 of 200 correct), and it correctly flags 11 of 16 defaulters while approving 164 of 184 legitimate applicants. On headline accuracy Model A wins by 4.5 points; on every metric the bank actually cares about (avoided losses, recall, F1), Model B is dramatically better.

The lesson is not that accuracy is useless — it is that accuracy conflates two entirely different questions: ‘did the model get the majority class right?’ (almost always yes if there is an imbalance) and ‘did the model get the minority class right?’ (usually what you actually care about). For imbalanced problems the right metrics separate these two questions.
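To make the paradox concrete, here is a minimal sketch that reproduces Model A from the loan example with scikit-learn. The label arrays are built by hand from the counts in the text (16 defaulters among 200 applicants), and the variable names are illustrative rather than part of the chapter's materials.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 200 applicants, 8% default rate: 16 defaulters (1) and 184 repayers (0).
y_true = np.array([1] * 16 + [0] * 184)

# "Model A": a constant output that predicts 'repays' for everyone.
y_pred_constant = np.zeros_like(y_true)

acc_constant = accuracy_score(y_true, y_pred_constant)
rec_constant = recall_score(y_true, y_pred_constant)  # share of defaulters caught

print(f"accuracy={acc_constant:.2f}  recall={rec_constant:.2f}")
# accuracy=0.92  recall=0.00
```

The accuracy figure is simply the majority-class proportion; recall on the defaulters exposes that nothing was learned.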


The confusion matrix is the primary diagnostic

Every classification metric starts from the confusion matrix — a 2×2 table counting how many examples fell into each combination of (actual, predicted) for a binary problem.

The four quadrants of a confusion matrix for the loan scenario at threshold 0.50, with rows for the actual class and columns for the predicted class. The top-left cell (true positives) counts defaulters the model caught. The bottom-right cell (true negatives) counts applicants correctly approved. The top-right cell (false negatives) counts defaulters the model missed, which are the expensive mistakes. The bottom-left cell (false positives) counts false alarms: applicants the model rejected who would have repaid.
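If you build this matrix with scikit-learn, be aware that `confusion_matrix` orders rows and columns by sorted label by default, which puts true negatives in the top-left. The sketch below is an illustration under assumptions: it rebuilds label arrays from the loan example's counts (11 caught defaulters, 5 missed, 20 false alarms, 164 correct approvals) and passes `labels=[1, 0]` to match the layout described above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild label arrays from the loan example's counts (1 = default, 0 = repaid).
y_true = np.array([1] * 11 + [1] * 5 + [0] * 20 + [0] * 164)
y_pred = np.array([1] * 11 + [0] * 5 + [1] * 20 + [0] * 164)

# labels=[1, 0] puts the positive class first: rows = actual, columns = predicted.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm

print(cm)
# [[ 11   5]
#  [ 20 164]]
```

Without the `labels` argument the same call returns the matrix as [[TN, FP], [FN, TP]], so it is worth checking the orientation before reading off any cell.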

From the four counts you derive every relevant metric:

  • Accuracy = (TP + TN) / total. For imbalanced data this is dominated by the large TN count, so it says little.
  • Precision = TP / (TP + FP). ‘Of the alerts I raised, how many were real?’ At 11 TP and 20 FP: 35%. Only a third of flagged applicants were actual defaulters.
  • Recall = TP / (TP + FN). ‘Of the real defaulters, how many did I catch?’ At 11 TP and 5 FN: 69%. The model catches about two-thirds of actual defaulters.
  • F1 = 2 · Precision · Recall / (Precision + Recall). The harmonic mean of precision and recall. From the counts above, F1 = 22 / 47 ≈ 0.47 (plugging in the rounded 0.35 and 0.69 gives 0.46). Useful as a single summary, but it hides the trade-off between the two.
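The four formulas above can be checked directly from the counts. This is a small sketch with a hypothetical helper, `metrics_from_counts`; returning 0.0 when a denominator is zero is a convention chosen here, not something the chapter prescribes.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}

# Loan example: 11 TP, 20 FP, 5 FN, 164 TN.
loan_metrics = metrics_from_counts(tp=11, fp=20, fn=5, tn=164)
for name, value in loan_metrics.items():
    print(f"{name:9s} {value:.3f}")
# accuracy  0.875
# precision 0.355
# recall    0.688
# f1        0.468
```

F1 computed from the unrounded counts is 22/47 ≈ 0.468; plugging in the rounded 0.35 and 0.69 instead gives about 0.46.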

The precision/recall pair is the core metric pair for imbalanced classification. They often conflict: a more aggressive classifier (more things flagged as positive) catches more real positives but includes more false alarms, so recall goes up and precision goes down.

| Domain | Cost of false positive | Cost of false negative | Preferred metric |
|---|---|---|---|
| Loan defaults | Forgone interest on a rejected safe borrower (~£500) | Unrecovered principal on a defaulter (~£15,000) | Recall (missing defaults is catastrophic) |
| Email spam | Legitimate email routed to spam folder (user frustration) | Spam in inbox (minor user annoyance) | Precision (few false-flagged emails) |
| Disease screening | Unnecessary follow-up test (stress, cost) | Missed diagnosis (potentially fatal) | Recall (never miss a true positive) |
| Content moderation | Legitimate post removed (free speech, user backlash) | Harmful content stays up (harm, reputation) | Balance; often a weighted F_β with β>1 if harm dominates |
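The cost columns in the table can be turned into a rough error-cost comparison. The sketch below assumes the approximate loan figures from the table (£500 per false positive, £15,000 per false negative) and the error counts of the two loan-screening models discussed earlier; the function names are illustrative. It also evaluates F_β with β = 2, one common way of weighting recall more heavily than precision.

```python
def error_cost(fp: int, fn: int, cost_fp: float = 500.0, cost_fn: float = 15_000.0) -> float:
    """Total cost of a model's mistakes under fixed per-error costs (assumed values)."""
    return fp * cost_fp + fn * cost_fn

def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F_beta from counts; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

cost_model_a = error_cost(fp=0, fn=16)   # constant 'everyone repays' model
cost_model_b = error_cost(fp=20, fn=5)   # real classifier
f2_model_b = f_beta(tp=11, fp=20, fn=5, beta=2.0)

print(f"Model A: £{cost_model_a:,.0f}   Model B: £{cost_model_b:,.0f}")
print(f"Model B F2 = {f2_model_b:.2f}")
# Model A: £240,000   Model B: £85,000
# Model B F2 = 0.58
```

Under these assumed costs the lower-accuracy model is far cheaper to run, which is the practical meaning of preferring recall for loan defaults.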

PR curves vs ROC curves

As you sweep the classifier’s decision threshold from 0 to 1, both precision and recall change — and so does another pair, true positive rate (= recall) and false positive rate (FP / (FP + TN)). These produce two different curves when plotted:

Left: precision-recall curve for the loan scenario. Precision falls toward the 8% base rate (dashed) as recall rises, which is the natural trade-off. PR-AUC = 0.42, a meaningful summary. Right: ROC curve for the same classifier. It looks dramatically better because the large negative class makes the FPR denominator huge, so even a substantial number of false positives barely pulls the curve away from the top-left corner. ROC-AUC = 0.91 sounds impressive but says little about how usable the alerts are at this degree of imbalance. The dashed diagonal on the ROC panel shows the random-guess baseline.

The mechanism: in imbalanced data the negative class is much bigger than the positive class, so the FPR denominator (TN + FP) stays large even when FP is substantial. Even a large absolute number of false positives moves FPR only slightly, which keeps the ROC curve close to the top-left corner. PR curves use precision, whose denominator (TP + FP) counts only predicted positives, so the same false positives pull precision down sharply. On a problem where imbalance exceeds roughly 10:1, prefer PR-AUC over ROC-AUC as your single headline metric.
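The gap between the two summaries is easy to reproduce. The following sketch uses purely synthetic scores (an assumption: about 2% positives, with positive scores shifted up by 1.5 standard deviations) and scikit-learn's `roc_auc_score` and `average_precision_score`, the latter being a common estimate of the area under the PR curve. It is not the chapter's loan classifier, so the numbers will not match the figure.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data (assumption): 20,000 examples, about 2% positives.
n = 20_000
y_true = (rng.random(n) < 0.02).astype(int)

# Scores: positives are shifted up by 1.5 standard deviations, so classes overlap.
scores = rng.normal(loc=1.5 * y_true, scale=1.0)

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)  # summary of the PR curve
base_rate = y_true.mean()                         # PR baseline for random scores

print(f"base rate={base_rate:.3f}  ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")
```

On data like this the ROC-AUC looks comfortable while the PR summary stays much closer to the base rate, which is the effect the paragraph above describes.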


Three knobs for rebalancing

Once you have the right metrics, the question becomes ‘how do I train a model that scores well on them?’ There are three operational approaches.

Threshold adjustment. Classifiers typically output a probability; the usual 0.5 decision threshold is arbitrary and rarely optimal for imbalanced data. Lowering the threshold (say, to 0.3) catches more positives at the cost of more false positives — higher recall, lower precision. Raising it does the opposite. This is the cheapest intervention and often sufficient: just pick the threshold that maximises the metric you care about on the validation set. The playground in this chapter lets you drag the threshold and watch the confusion matrix recompute live.
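A threshold sweep of this kind might look like the sketch below. It assumes synthetic stand-in data from `make_classification` (about 8% positives), a plain logistic regression, and F1 as the metric to maximise; swap in your own validation set and whichever metric matches your cost structure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 5,000 examples, about 8% positives.
X, y = make_classification(n_samples=5_000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_probs = model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds on the validation set and keep the best F1.
thresholds = np.linspace(0.05, 0.95, 19)
f1_by_threshold = [
    f1_score(y_val, (val_probs >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_threshold = thresholds[int(np.argmax(f1_by_threshold))]
f1_at_default = f1_score(y_val, (val_probs >= 0.5).astype(int), zero_division=0)

print(f"best threshold={best_threshold:.2f}  "
      f"F1={max(f1_by_threshold):.2f}  (F1 at 0.5: {f1_at_default:.2f})")
```

Because the threshold is chosen on validation data, the reported score should be confirmed on a held-out test set before it is quoted.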

Class weighting in the loss. Most training algorithms accept a per-class weight. Setting the minority-class weight to, say, 12× the majority weight tells the optimiser to treat each minority example as if it appeared 12 times. This shifts the model’s learned decision boundary toward catching more of the minority class. In scikit-learn this is the class_weight='balanced' parameter (auto-scales by inverse frequency); in neural networks it appears in the cross-entropy loss as explicit weight factors.
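To see where a figure like 12× comes from, the sketch below computes the 'balanced' weights for the loan example's class counts (184 repayers, 16 defaulters) with scikit-learn's `compute_class_weight`. The label array is constructed by hand for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Loan example: 184 repayers (0) and 16 defaulters (1), an 8% positive rate.
y = np.array([0] * 184 + [1] * 16)

# 'balanced' assigns each class n_samples / (n_classes * count_of_that_class).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
ratio = weights[1] / weights[0]

print(f"majority={weights[0]:.3f}  minority={weights[1]:.2f}  ratio={ratio:.1f}")
# majority=0.543  minority=6.25  ratio=11.5
```

At an 8% positive rate the minority class ends up weighted about 11.5 times the majority, close to the 12× used as an example above.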

Resampling (oversample minority, undersample majority, or SMOTE). Rather than re-weighting the loss, physically change the training-data distribution before fitting. Random oversampling duplicates minority examples; random undersampling discards majority examples; SMOTE (Synthetic Minority Over-sampling TEchnique) generates new minority examples by interpolating between real ones and their k-nearest minority neighbours. Resampling is popular because it works with any algorithm, including those that do not support class weights natively, but it has risks: duplicated examples encourage overfitting, and SMOTE can generate synthetic examples that do not resemble real data. Resample only the training split, so that validation and test sets keep the real class balance.
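The interpolation step behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation: the function name `smote_like_samples`, the toy minority data, and the default `k = 5` are all assumptions made for the example.

```python
import numpy as np

def smote_like_samples(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create synthetic minority points by interpolating toward nearby minority points."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), size=n_new)             # random base points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per base
    lam = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

X_minority = np.random.default_rng(1).normal(size=(16, 2))  # toy minority class
X_synth = smote_like_samples(X_minority, n_new=168)         # 16 + 168 = 184
print(X_synth.shape)
# (168, 2)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE can produce implausible examples when the minority class is scattered or overlaps the majority.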

python
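from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# X: 10,000 × 50 loan features; y: binary default label (8% positive rate).
# Synthetic stand-in (an assumption) so the script runs end to end; swap in
# real data. Printed numbers will differ from the illustrative output at the end.
X, y = make_classification(
  n_samples=10_000, n_features=50, weights=[0.92, 0.08], random_state=42,
)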
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42, stratify=y,
)

# Option A: class weighting — no data changes
model_a = LogisticRegression(class_weight='balanced', max_iter=1000)
model_a.fit(X_train, y_train)

# Option B: SMOTE oversampling — inflate minority to match majority
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
model_b = LogisticRegression(max_iter=1000)
model_b.fit(X_train_bal, y_train_bal)

# Option C: threshold adjustment on a plain model
model_c = LogisticRegression(max_iter=1000)
model_c.fit(X_train, y_train)

def evaluate(model, label, threshold=0.5):
  probs = model.predict_proba(X_test)[:, 1]
  preds = (probs >= threshold).astype(int)
  tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
  prec, rec, f1, _ = precision_recall_fscore_support(
      y_test, preds, average='binary', zero_division=0,
  )
  print(f"{label:22s} prec={prec:.2f}  rec={rec:.2f}  F1={f1:.2f}  (tp={tp}, fp={fp}, fn={fn})")

evaluate(model_a, 'class-weighted',       threshold=0.5)
evaluate(model_b, 'SMOTE-oversampled',    threshold=0.5)
evaluate(model_c, 'plain, thr=0.3',       threshold=0.3)
evaluate(model_c, 'plain, thr=0.5 default', threshold=0.5)

# Typical output: all three remedies give up precision for much higher recall and F1:
# class-weighted         prec=0.31  rec=0.68  F1=0.42  (tp=109, fp=245, fn=51)
# SMOTE-oversampled      prec=0.33  rec=0.70  F1=0.45  (tp=112, fp=230, fn=48)
# plain, thr=0.3         prec=0.30  rec=0.72  F1=0.42  (tp=115, fp=268, fn=45)
# plain, thr=0.5 default prec=0.73  rec=0.14  F1=0.23  (tp=22, fp=8, fn=138)

The final row in that output is the uncorrected baseline: 73% precision but 14% recall means the model is nearly useless for catching real defaulters. The three remedies above all push recall above 65%, at the price of precision falling to around 30%. In practice you would run this comparison on your own data and pick the approach that scores best on your chosen metric (F1 if you are neutral, recall if false negatives are expensive, precision if false alarms are).


Where this chapter sits

This is the last chapter of the Data Splits and Evaluation topic. Everything so far — train/val/test splits, cross-validation, overfitting diagnostics — assumed that accuracy on the validation set was the quantity to optimise. For imbalanced data that assumption breaks, so this chapter replaces ‘accuracy’ with a small family of metrics (precision, recall, F1, PR-AUC) appropriate for rare-positive prediction. With those in place, everything from the earlier chapters still applies: you still split before training, you still cross-validate if the data is small, you still diagnose overfitting by the gap between training and validation performance. The only change is that ‘performance’ is now a precision/recall pair rather than a single accuracy number.

Domain 2 closes with this chapter. The natural next step is Classical ML — the first domain where learners actually see algorithms in action: linear regression, logistic regression, decision trees, random forests, k-NN. Each one builds on the data-preparation foundation laid across Domain 2, and each has its own interaction with the imbalance and overfitting issues discussed in this topic.

In this section

Why is accuracy misleading when my classes are imbalanced?

Accuracy weights every example equally. If 99% of examples are class 0 and 1% are class 1, a classifier that predicts 'class 0' for everything hits 99% accuracy while catching zero class-1 examples. That 99% looks impressive but carries no information about the model's ability to identify the minority class — which is usually the class you actually care about (fraud, disease, default). The rule: if class imbalance exceeds roughly 80/20, accuracy is no longer a reliable metric on its own.

What is the difference between precision and recall?

Precision answers 'of the examples I predicted as positive, what fraction actually are?' — TP / (TP + FP). It penalises false alarms. Recall answers 'of the examples that are actually positive, what fraction did I catch?' — TP / (TP + FN). It penalises missed positives. They often trade off: a classifier that flags more examples as positive will catch more real positives (higher recall) but include more false alarms (lower precision). The right balance depends on the relative cost of each error type.

When do I prefer precision and when do I prefer recall?

Prefer recall when missing a positive example is expensive — disease screening (never miss a cancer), fraud detection (every missed fraud costs real money), safety-critical alerts. Prefer precision when false alarms are expensive — spam filtering (avoid flagging legitimate email), automated content moderation (do not silence real users), medical diagnosis that triggers invasive procedures. F1 is a reasonable default when neither cost clearly dominates, but it hides the trade-off — always look at precision and recall separately before reporting a single number.

◎ Intuition

A hospital screening program is evaluating cancer-detection models on 10,000 patients. The base rate of cancer in this population is 0.5%. Before reading further:

  • Model X reports 99.8% accuracy. What is the single most important number to check next, and why is 99.8% alone insufficient to decide whether the model works?
  • Model Y reports 99.2% accuracy but 94% recall. Model Z reports 99.7% accuracy but 38% recall. Which would you deploy for cancer screening, and what value would you look for in one more metric before deciding?
  • The radiology team asks which threshold to use. How would you choose between threshold = 0.3 (catches more cancers, more follow-up biopsies) and threshold = 0.6 (fewer false alarms, more missed cases), and what cost information would you need from the clinicians to make that choice rigorously?

↺ Reflection

Pick the Right Lens, Then the Right Fix

The fraud model that shipped with 99.7% accuracy and caught zero fraud is the clean version of a recurring real-world failure. The model did not malfunction; it optimised exactly what the team told it to. The problem was the metric — accuracy weights every example equally, and when a dataset is 99.7% negatives, predicting ‘negative’ for everything is the rational strategy. On any imbalanced problem where the minority class is what you actually care about predicting, accuracy collapses into a useless report of the class imbalance itself. The rule that falls out is surprisingly simple: if one class is less than 20% of the data, accuracy is no longer a reliable headline metric.

The confusion matrix is the diagnostic you should look at first. Its four counts — true positives, false positives, true negatives, false negatives — contain every bit of information about binary classification performance, and every other metric is a ratio derived from them. Precision answers ‘of the alerts I raised, what fraction were real?’ and is penalised by false positives. Recall answers ‘of the real positives, what fraction did I catch?’ and is penalised by false negatives. These two usually trade off — a classifier that flags more aggressively catches more real positives (higher recall) at the cost of more false alarms (lower precision), and vice versa. F1 is a useful single-number summary (harmonic mean of precision and recall) but it hides the trade-off, and the trade-off is often the most important thing to know. Look at precision and recall separately before reporting any single number.

Which of precision and recall to prioritise is not a modelling decision — it is a business-context decision. In loan-default prediction, missing a defaulter costs the bank the unrecovered principal (£10k–£20k per case) while a false alarm costs forgone interest (£200–£500 per case). So the rational objective is to prioritise recall at the expense of precision. In spam filtering, a false positive routes a legitimate email into the spam folder, which is a user experience failure; a false negative puts spam in the inbox, which is a minor annoyance. So spam filters prioritise precision. In disease screening, missing a cancer can be fatal; an unnecessary follow-up test is unpleasant but recoverable. So screening systems prioritise recall. There is no universally correct answer — the model optimises what you tell it to, so the first step is to decide explicitly which failure mode costs more and weight your objective accordingly.

For metric curves across thresholds, prefer precision-recall over ROC when imbalance exceeds roughly 10:1. ROC’s FPR denominator (TN + FP) is huge on imbalanced data, which makes almost any classifier look good: a ROC-AUC of 0.91 on heavily imbalanced data can coexist with alerts that are mostly false alarms. PR-AUC keeps the problem’s difficulty visible: the PR curve of a random classifier on 8% positive data sits flat at precision 0.08, and any meaningful improvement over that baseline is explicit.

The three operational remedies (threshold adjustment, class weighting, and resampling) are closely related ways to let the model know that the minority class matters more than its raw frequency suggests. Threshold adjustment is the cheapest (it is applied at inference and needs no retraining) and is usually tried first. Class weighting happens during training and is available as an option in most libraries (class_weight='balanced' in scikit-learn, which is not the default setting). Resampling (random oversampling, random undersampling, SMOTE) is popular for algorithms that do not natively support class weights, with the caveat that synthetic samples may not resemble real ones. All three exist to steer the optimisation toward the rare class; choosing between them is a matter of training convenience and modest empirical differences.

The topic closes here. The next natural step is Classical ML, where learners meet the actual algorithms (linear regression, decision trees, k-NN) that all the data-preparation and evaluation machinery from Domain 2 was building up to.

Key Points

Accuracy is a misleading default metric for imbalanced classification: a trivial classifier that always predicts the majority class scores accuracy equal to the majority's proportion (92%/98%/99.9% on 8%/2%/0.1% positive data) while catching zero positives.

The confusion matrix is the primary diagnostic — every other useful metric (precision, recall, F1, specificity) is derived from its four counts, so read the matrix first and the summary numbers second.

Precision answers 'of the flags, how many were real?' and recall answers 'of the real positives, how many did I catch?' — they usually trade against each other, and choosing which to prioritise is a business-context decision, not a modelling one.

PR curves are more honest than ROC curves for heavily imbalanced data: ROC is inflated by the large true-negative count, while PR depends only on quantities that directly matter for rare-positive prediction.

Three operational remedies exist for imbalance — threshold adjustment (cheapest, applied at inference), class weighting in the loss function (during training), and resampling (oversample minority, undersample majority, SMOTE) — chosen based on the cost structure of false positives versus false negatives.

Checkpoint

Check Your Understanding

Answer these questions about imbalanced classification scenarios covered in this chapter.

1

A fraud detection model reports 99.5% accuracy on a dataset where 0.3% of transactions are fraudulent. Which statement is most defensible?

2

A classifier's confusion matrix is: TP = 40, FP = 60, FN = 10, TN = 890. What are the precision, recall, and F1 scores?

3

A classifier's true positive rate is 0.9 and false positive rate is 0.05. The positive class is 0.5% of the dataset (5,000 positives out of 1,000,000 examples). Put these four metrics in order from the most optimistic (highest) to the most honest (most reflective of actual performance).

  1. Precision = TP / (TP + FP) = 4,500 / (4,500 + 49,750) ≈ 0.08
  2. Accuracy = (TP + TN) / total = (4,500 + 945,250) / 1,000,000 ≈ 95.0%
  3. Recall (= TPR) = 0.9
  4. ROC-AUC, bounded above by 1 and likely close to it, since TPR is high and FPR is low

4

A medical imaging team has a cancer-screening model with precision 0.4 and recall 0.85 at threshold 0.5. They believe missing a cancer is roughly 30× more costly than a false positive (unnecessary biopsy). Lowering the threshold to 0.3 — even though it will reduce precision further and inflate false positives — is a defensible choice because the cost asymmetry prioritises catching every real positive.