SeeingML
Dimensionality · Chapter 2 of 4

Correlation and Redundancy

Two features that rise and fall together carry almost the same information, and keeping both can quietly destabilise your model

A property analyst trains a linear regression on ten housing features and reports a coefficient of −£12,400 per room. A colleague retrains the same model on a 90% resample of the data and reports a room coefficient of +£18,100. Same data, same model, same software — but the sign has flipped. The problem is not a bug; it is multicollinearity: sqft and rooms are correlated at r = 0.91 and the model cannot tell which of them matters. The diagnosis and the fix both come from looking at the correlation matrix.

Learning Objectives
  1. Explain what the Pearson correlation coefficient r measures, how r² represents the share of one feature's variance explained by another, and how to read strong, moderate, and weak correlations off a scatter plot.
  2. Read a correlation matrix heatmap to identify clusters of redundant features, and choose which features to keep or drop based on correlation strength and domain relevance.
  3. Explain multicollinearity as the mechanism that destabilises linear-model coefficients, and connect coefficient instability to the variance-inflation factor as a diagnostic.
  4. Apply practical thresholds for feature correlation (under 0.7 is usually safe, above 0.85 is a strong signal to drop or combine features) and adjust the threshold based on model type and sample size.
¶ Narrative

When Features Carry the Same Signal

A property analyst has a dataset of 8,000 houses with ten features per row: square footage, number of rooms, number of bedrooms, number of bathrooms, lot size, garages, floors, age, year built, and neighbourhood crime rate. They fit a linear regression to predict price. The coefficient on rooms comes back as −£12,400 — each additional room reduces the predicted price by twelve thousand pounds. They retrain on a 90% resample of the same data and the coefficient flips to +£18,100. Same data, same model. The problem is not stochastic; it is structural. The dataset contains multiple features that are near-copies of each other — sqft and rooms correlate at r = 0.91, rooms and bedrooms at r = 0.89 — and the regression has no way to decide which of them the price actually depends on.

This chapter is about that structural problem: how to measure shared information between features, how to read a correlation matrix to spot it, and how to act on what you find.


Pearson correlation, one number with a lot of meaning

The Pearson correlation coefficient r measures how linearly two features move together. It sits between −1 and +1:

  • r = +1: perfect positive linear relationship — when feature A goes up, feature B goes up by a proportional amount.
  • r = 0: no linear relationship. The scatter is a cloud with no diagonal tilt.
  • r = −1: perfect negative linear relationship — when A goes up, B goes down proportionally.
The same 80 Gaussian samples drawn with four different correlation levels. At r = 0 the points form a round cloud — no relationship. At r = 0.5 a diagonal tilt is visible but the cloud is still wide. At r = 0.85 the cloud is a narrow tilted ellipse and predicting one feature from the other becomes reliable. At r = 0.99 the two features are near-duplicates — any prediction you can make with one, you can make with the other.
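A minimal sketch of how such a figure could be generated, assuming NumPy and an arbitrary seed. It reuses one set of 80 random draws and mixes them with different weights to reach each target correlation; the variable names are illustrative and not taken from the chapter's own plotting code.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, chosen only for repeatability
n = 80
z1 = rng.standard_normal(n)      # feature A
z2 = rng.standard_normal(n)      # independent noise, reused at every correlation level

sample_r = {}
for rho in (0.0, 0.5, 0.85, 0.99):
    # Mixing z1 and z2 with these weights gives feature B unit variance
    # and a population correlation of rho with feature A.
    feature_b = rho * z1 + np.sqrt(1.0 - rho**2) * z2
    sample_r[rho] = np.corrcoef(z1, feature_b)[0, 1]
    print(f"target rho = {rho:.2f}   sample r = {sample_r[rho]:+.2f}")
```

With only 80 points the sample r will not land exactly on the target, which is itself a useful reminder that a measured correlation is an estimate.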

The squared correlation has a direct interpretation: it is the fraction of variance in one feature that is explained by the other. If r = 0.91 then r² = 0.83: 83% of the variation in rooms can be predicted from sqft alone. The remaining 17% is what rooms adds independently. For most modelling purposes, 17% is not worth the cost of carrying a second near-duplicate feature through the pipeline.
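The "fraction of variance explained" reading can be checked numerically. The sketch below uses synthetic stand-ins for sqft and rooms (not the chapter's data) built to correlate at about 0.91, fits a straight line from one to the other, and compares the regression R² with the squared Pearson r.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
n = 8000

# Hypothetical standardised features with a target correlation of 0.91.
sqft = rng.standard_normal(n)
rooms = 0.91 * sqft + np.sqrt(1 - 0.91**2) * rng.standard_normal(n)

r = np.corrcoef(sqft, rooms)[0, 1]

# Fit rooms = slope * sqft + intercept and measure the explained share of variance.
slope, intercept = np.polyfit(sqft, rooms, deg=1)
residuals = rooms - (slope * sqft + intercept)
r_squared_fit = 1.0 - residuals.var() / rooms.var()

print(f"r = {r:.3f}   r^2 = {r**2:.3f}   regression R^2 = {r_squared_fit:.3f}")
```

For a one-feature least-squares fit with an intercept the two quantities agree up to floating-point error, which is why r² can be read as shared variance.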

A worked example. Suppose you have these five rows for sqft vs rooms (standardised to z-scores):

Row   sqft (z)   rooms (z)
1     −1.20      −1.08
2     −0.60      −0.70
3      0.00       0.05
4      0.80       0.75
5      1.00       0.98

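The Pearson formula is r = cov(x, y) / (σ_x · σ_y). When both columns have unit standard deviation over the rows being averaged, σ_x = σ_y = 1 and r is just the average of x·y across rows. For the five rows that average is ((−1.20)(−1.08) + (−0.60)(−0.70) + (0.00)(0.05) + (0.80)(0.75) + (1.00)(0.98)) / 5 = (1.296 + 0.420 + 0 + 0.600 + 0.980) / 5 = 3.296 / 5 = 0.659. These five rows do not have unit standard deviation on their own, however: the values are about 0.829 and 0.798, presumably because the z-scores were computed over the full dataset. Dividing by them gives the five-row correlation, 0.659 / (0.829 × 0.798) ≈ 0.996. Across all 8,000 houses the same formula gives r = 0.91. A five-row excerpt is only a noisy estimate of that value; a larger sample makes the estimate more precise, it does not make the relationship itself tighter.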
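The arithmetic on the five rows can be verified in a few lines. This sketch, assuming NumPy, computes the plain mean of products and then the correlation as `np.corrcoef` reports it. The mean of products equals r only when each column has unit standard deviation over the rows being averaged; `np.corrcoef` applies that rescaling itself, so for a short excerpt the two numbers can differ.

```python
import numpy as np

# The five rows from the table above.
sqft_z = np.array([-1.20, -0.60, 0.00, 0.80, 1.00])
rooms_z = np.array([-1.08, -0.70, 0.05, 0.75, 0.98])

mean_product = np.mean(sqft_z * rooms_z)        # the hand calculation
sample_r = np.corrcoef(sqft_z, rooms_z)[0, 1]   # divides by the in-sample std devs

print(f"mean of products   = {mean_product:.3f}")                         # 0.659
print(f"in-sample std devs = {sqft_z.std():.3f}, {rooms_z.std():.3f}")    # 0.829, 0.798
print(f"np.corrcoef        = {sample_r:.3f}")                             # 0.996
```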


The correlation matrix as a diagnostic tool

A single r tells you about two features. A correlation matrix tells you about all pairs at once. For a dataset with ten features, the correlation matrix is a 10×10 table where cell (i, j) holds the correlation between feature i and feature j. The diagonal is always 1 (a feature with itself). The matrix is symmetric: cell (i, j) equals cell (j, i).
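The two structural properties (unit diagonal, symmetry) are easy to confirm on any frame. The sketch below builds a small synthetic DataFrame with hypothetical column values, not the chapter's housing data, and checks both properties on the output of `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)   # arbitrary seed
n = 1000
size = rng.standard_normal(n)    # hypothetical latent "house size" factor

# Illustrative stand-in columns only.
df = pd.DataFrame({
    "sqft": size + 0.3 * rng.standard_normal(n),
    "rooms": size + 0.4 * rng.standard_normal(n),
    "crime_rate": rng.standard_normal(n),
})

corr = df.corr()                 # Pearson correlation by default
print(corr.round(2))

diagonal_is_one = np.allclose(np.diag(corr.to_numpy()), 1.0)
is_symmetric = np.allclose(corr.to_numpy(), corr.to_numpy().T)
print("diagonal all 1:", diagonal_is_one, "| symmetric:", is_symmetric)
```

Because of the symmetry, only the upper (or lower) triangle carries information, which is why pair-scanning loops usually visit each pair once.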

Pairwise Pearson correlations for the 10 housing features. The top-left block — sqft, rooms, bedrooms, baths — is heavily inter-correlated at r > 0.7 (cells outlined in white). These four features effectively measure the same thing: 'house size'. The age/year_built cells in the middle are near-perfectly negatively correlated by definition. Crime rate correlates weakly with everything — an independent feature carrying unique information.

Reading this matrix is a two-minute exercise that every data scientist does before training a linear model. The patterns to look for:

  1. Cluster of highly correlated features. sqft, rooms, bedrooms, baths all correlate at r > 0.7 with each other. They are measuring the same underlying concept — “house size” — from different angles. A linear model given all four will struggle; give it one of them, or give it a principal component that combines them.
  2. Near-perfectly negative correlations. age and year_built are at r ≈ −0.98 because they are literally redundant: year_built = current_year − age. Including both is equivalent to including the same feature twice.
  3. Isolated independent features. crime_rate has weak correlations with everything else, meaning it carries information none of the other features supply. That makes it a good candidate to keep regardless of how you reduce the size of the feature set.

Cluster 1 is the subtle case. Cluster 2 (age/year_built) and the isolated feature (crime_rate) are easy. The hard judgement is at cluster 1: which of sqft, rooms, bedrooms, baths to keep? In practice, domain knowledge wins: keep sqft (most granular measurement), or keep rooms (most interpretable for stakeholders), and drop the rest — or combine all four into one principal component and use that.
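Spotting clusters by eye works for ten features; for larger matrices the same idea can be automated. The sketch below is one possible approach, not the chapter's method: it treats every pair with |r| above a threshold as linked and groups features that are connected directly or through a chain. The function name `correlation_clusters` is hypothetical, and the matrix mixes values quoted in the chapter with illustrative ones (marked in the comments).

```python
import numpy as np
import pandas as pd

def correlation_clusters(corr, threshold=0.7):
    """Group columns linked, directly or through a chain, by |r| > threshold."""
    cols = list(corr.columns)
    strong = corr.abs().to_numpy() > threshold
    unvisited = set(range(len(cols)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        members = []
        while stack:
            i = stack.pop()
            members.append(cols[i])
            neighbours = [j for j in list(unvisited) if strong[i, j]]
            for j in neighbours:
                unvisited.remove(j)
                stack.append(j)
        clusters.append(sorted(members))
    # Largest clusters first, ties broken alphabetically.
    return sorted(clusters, key=lambda c: (-len(c), c))

cols = ["sqft", "rooms", "bedrooms", "baths", "age", "year_built", "crime_rate"]
corr = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
pairs = {
    ("sqft", "rooms"): 0.91, ("rooms", "bedrooms"): 0.89,    # quoted in the chapter
    ("age", "year_built"): -0.98,                             # quoted in the chapter
    ("sqft", "crime_rate"): 0.08,                             # quoted in the chapter
    ("sqft", "bedrooms"): 0.80, ("sqft", "baths"): 0.75,      # illustrative values
    ("rooms", "baths"): 0.74, ("bedrooms", "baths"): 0.72,    # illustrative values
}
for (a, b), r in pairs.items():
    corr.loc[a, b] = corr.loc[b, a] = r

clusters = correlation_clusters(corr)
print(clusters)
# [['baths', 'bedrooms', 'rooms', 'sqft'], ['age', 'year_built'], ['crime_rate']]
```

The grouping only flags candidates; choosing which member of a cluster to keep still rests on domain knowledge.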


Multicollinearity and coefficient instability

The reason the property analyst’s rooms coefficient flipped between +£18,100 and −£12,400 is a phenomenon called multicollinearity. When a regression model has multiple features that are nearly linear combinations of each other, the optimisation has no way to uniquely attribute the predicted target to any single feature. Many coefficient combinations give the same predictions; small changes in the data pick different ones.
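The instability can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the analyst's actual experiment: two hypothetical features with a chosen correlation, a target that depends on both, and ordinary least squares refitted on bootstrap resamples (rows drawn with replacement, rather than the 90% resample in the story). It reports how much the first coefficient moves across refits.

```python
import numpy as np

def coef_spread(rho, n=500, n_resamples=200, seed=0):
    """Std dev of the first OLS coefficient across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    y = 1.0 * x1 + 1.0 * x2 + rng.standard_normal(n)   # hypothetical true effects
    X = np.column_stack([np.ones(n), x1, x2])          # intercept, x1, x2
    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)               # sample rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs.append(beta[1])                          # coefficient on x1
    return float(np.std(coefs))

spread_independent = coef_spread(rho=0.0)
spread_collinear = coef_spread(rho=0.91)
print(f"coefficient std dev at rho = 0.00: {spread_independent:.3f}")
print(f"coefficient std dev at rho = 0.91: {spread_collinear:.3f}")
```

The predictions of the refitted models barely change between resamples; it is the split of credit between x1 and x2 that moves.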

Two scatter panels contrasting a redundant and an independent feature pair. Left: sqft vs rooms at r = 0.91 — a tight diagonal cloud where 83% of each feature's variance is shared. Right: sqft vs crime_rate at r = 0.08 — a broad scatter with no linear pattern, so crime_rate contributes 99% unique information. The bars below each panel show the shared (accent) and unique (muted) share of variance in a single-feature-can-predict-the-other sense.

The formal diagnostic is the variance inflation factor (VIF). For a feature i, VIF_i = 1 / (1 − R²_i), where R²_i is the r² you get from regressing feature i against all the other features combined. VIF below 5 is generally fine. Between 5 and 10 is a warning. Above 10 is evidence that the feature is nearly a linear combination of the others — the coefficient on it will be very unstable.
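The definition translates directly into code: regress each feature on all the others, take the R², and invert. The sketch below uses only NumPy and pandas; the helper name `vif_table` and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF_i = 1 / (1 - R^2_i), with R^2_i from regressing column i on all the others."""
    X = df.to_numpy(dtype=float)
    n = X.shape[0]
    vifs = {}
    for i, name in enumerate(df.columns):
        target = X[:, i]
        # Design matrix: intercept plus every column except the target.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r2 = 1.0 - residuals.var() / target.var()
        vifs[name] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

rng = np.random.default_rng(5)   # arbitrary seed
n = 1000
sqft = rng.standard_normal(n)    # hypothetical standardised features
rooms = 0.91 * sqft + np.sqrt(1 - 0.91**2) * rng.standard_normal(n)
crime_rate = rng.standard_normal(n)

vif = vif_table(pd.DataFrame({"sqft": sqft, "rooms": rooms, "crime_rate": crime_rate}))
print(vif.round(2))
```

On data like this the two correlated columns should land in the warning band while the independent column stays close to 1.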

r (with another feature)   r²     VIF     Severity
0.30                       0.09    1.10   Negligible: no issue
0.70                       0.49    1.96   Mild: fine for most models
0.85                       0.72    3.60   Moderate: watch the coefficient
0.90                       0.81    5.26   High: drop or combine
0.95                       0.90   10.26   Severe: coefficient unusable
0.99                       0.98   50.25   Catastrophic: features are near-duplicates

The reason the thresholds look steep is that VIF grows non-linearly. Going from r = 0.70 to r = 0.85 only doubles the VIF. Going from 0.90 to 0.95 doubles it again. Going from 0.95 to 0.99 multiplies it by five. Correlations above 0.9 are rare in deliberate feature engineering and common in accidental feature engineering (for example, creating both a raw income column and an income-percentile column, or both month-as-string and month-as-integer).
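The non-linear growth is easiest to see by tabulating the pairwise formula. This short sketch evaluates VIF = 1 / (1 − r²) from unrounded r², so its figures can differ slightly from values computed with r² rounded to two decimals. The function name `pairwise_vif` is illustrative.

```python
def pairwise_vif(r):
    """VIF when a feature's only correlated partner has Pearson correlation r."""
    return 1.0 / (1.0 - r**2)

levels = (0.30, 0.70, 0.85, 0.90, 0.95, 0.99)
for r in levels:
    print(f"r = {r:.2f}   r^2 = {r**2:.4f}   VIF = {pairwise_vif(r):6.2f}")

# Growth factors quoted in the paragraph above.
print(f"0.70 -> 0.85: x{pairwise_vif(0.85) / pairwise_vif(0.70):.2f}")
print(f"0.90 -> 0.95: x{pairwise_vif(0.95) / pairwise_vif(0.90):.2f}")
print(f"0.95 -> 0.99: x{pairwise_vif(0.99) / pairwise_vif(0.95):.2f}")
```

The three ratios come out at roughly 1.8, 1.9 and 4.9, matching the "doubles, doubles, times five" description.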

💡 Insight

In trees and gradient-boosted models, multicollinearity causes a different symptom: unstable feature importances rather than unstable coefficients. If sqft and rooms correlate at 0.91, both are roughly equally good splitters, and the training process will split credit between them in ways that change under small perturbations of the data. Two models fit on resamples might report radically different “most important features”. This is why post-hoc feature-importance analysis on trees should always check for correlated features before drawing conclusions.
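The split-credit effect shows up even at the first split of a single tree. The sketch below is a simplified stand-in for a real tree learner: it scores one threshold split per feature by squared-error reduction and records which of two correlated features wins the root. Each trial draws a fresh dataset from the same hypothetical process, standing in for a small perturbation of the data; all names and noise levels are assumptions.

```python
import numpy as np

def best_split_gain(x, y):
    """Largest reduction in squared error from a single threshold split on x."""
    y_sorted = y[np.argsort(x)]
    n = len(y_sorted)
    left_sum = np.cumsum(y_sorted)[:-1]      # sum of the first k targets, k = 1..n-1
    left_n = np.arange(1, n)
    right_sum = y_sorted.sum() - left_sum
    right_n = n - left_n
    # Between-group sum of squares; maximising it minimises within-group error.
    gain = left_sum**2 / left_n + right_sum**2 / right_n - y_sorted.sum()**2 / n
    return gain.max()

rng = np.random.default_rng(3)   # arbitrary seed
n, n_trials = 400, 100
wins = {"sqft": 0, "rooms": 0}
for _ in range(n_trials):
    size = rng.standard_normal(n)                 # hypothetical latent house size
    sqft = size + 0.3 * rng.standard_normal(n)    # the two features correlate at about 0.9
    rooms = size + 0.3 * rng.standard_normal(n)
    price = size + 0.5 * rng.standard_normal(n)
    winner = "sqft" if best_split_gain(sqft, price) >= best_split_gain(rooms, price) else "rooms"
    wins[winner] += 1

print(wins)
```

Neither feature is truly more informative here, so the root split goes to whichever happens to be marginally better in a given sample, and any importance ranking built on those splits inherits that arbitrariness.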


Practical rules of thumb

Not every correlation is a problem. A few operational guidelines:

  • Below r = 0.7: usually safe for any model. VIF under 2. Do not spend time removing features at this level unless you have a specific reason.
  • r between 0.7 and 0.85: watch linear-model coefficients. If you see large coefficient swings between resamples or very wide confidence intervals, drop one feature and see if performance holds.
  • r above 0.85: strong signal to act. Drop one feature, combine them into a principal component, or use a regularised model (ridge regression) that tolerates collinearity.
  • r above 0.95: do not include both. If domain knowledge insists both matter, there is almost certainly a derivation that combines them into one feature (ratio, difference, sum).
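The ridge option in the list above can be illustrated with the closed-form solution. This is a sketch on synthetic data, with an arbitrary penalty of 50 chosen for illustration rather than tuned; it compares how much one coefficient moves across bootstrap resamples with and without the penalty.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Closed-form ridge on centred data: solve (X'X + alpha * I) beta = X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(4)   # arbitrary seed
n = 500
x1 = rng.standard_normal(n)      # hypothetical features correlated at about 0.95
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.standard_normal(n)

def spread(alpha, n_resamples=200):
    """Std dev of the first coefficient across bootstrap resamples."""
    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # sample rows with replacement
        coefs.append(ridge_coefs(X[idx], y[idx], alpha)[0])
    return float(np.std(coefs))

spread_ols = spread(alpha=0.0)     # alpha = 0 reduces to ordinary least squares
spread_ridge = spread(alpha=50.0)
print(f"OLS   coefficient std dev: {spread_ols:.3f}")
print(f"ridge coefficient std dev: {spread_ridge:.3f}")
```

The stability is bought with some bias: the penalty shrinks coefficients toward zero, so ridge values are steadier but should not be read as unbiased effect sizes.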
```python
import pandas as pd

# Build a correlation matrix for a dataset df (assumed to be an already-loaded DataFrame).
# Pearson r is scale-invariant, so the features do not need standardising first.
correlations = df.corr(numeric_only=True)

# Find all pairs with |r| > 0.85 (i < j visits each pair once and skips the diagonal)
high_corr = []
for i, col_i in enumerate(correlations.columns):
    for j, col_j in enumerate(correlations.columns):
        if i < j and abs(correlations.iloc[i, j]) > 0.85:
            high_corr.append((col_i, col_j, correlations.iloc[i, j]))

# Strongest correlations first
for a, b, r in sorted(high_corr, key=lambda x: -abs(x[2])):
    print(f"{a:20s} vs {b:20s}  r = {r:+.2f}  r² = {r**2:.1%}")

# Output on the housing dataset:
# age                  vs year_built            r = -0.98  r² = 96.0%
# sqft                 vs rooms                 r = +0.91  r² = 82.8%
# rooms                vs bedrooms              r = +0.89  r² = 79.2%
```

The script above is roughly twenty lines and runs in seconds on a dataset with hundreds of columns. It is the first thing to produce when you inherit a dataset whose columns you did not choose. The output tells you which features are worth keeping separately, which are near-duplicates, and — by what is not in the list — which features each bring something independent to the model.
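The rules of thumb from this section can be attached to each reported pair. The helper below is a sketch: the function name `action_for` and the handling of values that fall exactly on a boundary are assumptions, while the bands themselves follow the guidelines listed earlier.

```python
def action_for(r):
    """Map a pairwise correlation to the chapter's rule-of-thumb action band."""
    strength = abs(r)
    if strength > 0.95:
        return "do not include both"
    if strength > 0.85:
        return "drop one, combine, or regularise"
    if strength >= 0.7:
        return "watch linear-model coefficients"
    return "usually safe"

# Pairs quoted in the chapter, plus one weak pair for contrast.
examples = [("age", "year_built", -0.98), ("sqft", "rooms", 0.91),
            ("rooms", "bedrooms", 0.89), ("sqft", "crime_rate", 0.08)]
for a, b, r in examples:
    print(f"{a:12s} vs {b:12s} r = {r:+.2f}  ->  {action_for(r)}")
```

The labels are a starting point for review, not a decision; model type and sample size can justify moving the cut-offs.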

The next chapter develops the tool that combines correlated features systematically: principal component analysis, which finds the directions along which the data varies most and expresses the dataset as a small number of those directions rather than a large number of correlated raw features.

In this section

What exactly does a Pearson correlation of 0.85 mean?

r = 0.85 means two things simultaneously. First, the two features move together strongly in the linear sense — scatter them and you see a tight diagonal cloud. Second, r² = 0.72 means 72% of the variance in one feature is explained by the other. The remaining 28% is what the second feature would add beyond the first. For many modelling purposes this extra 28% is not worth the cost of including a second highly correlated column.

Why does high correlation destabilise linear regression coefficients?

When two features move almost identically, a linear model has no way to attribute credit: any combination of the two coefficients that sums to the same effect on the target is almost equally good. Small changes in the training data can push that sum toward one coefficient or the other. Statistically, the variance-inflation factor (VIF) for a feature is 1/(1 − R²), where R² comes from regressing that feature on all the others; when a single other feature accounts for the overlap, this is 1/(1 − r²) for their pairwise correlation r. At r = 0.9 the VIF is 5.3, and at r = 0.99 it is about 50. The VIF multiplies the coefficient's sampling variance, so the standard error grows with √VIF and the coefficient's sign and magnitude can change dramatically between resamples.
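The link between VIF and the standard error can be written out explicitly. The expression below is the standard textbook result for ordinary least squares with an intercept and constant error variance; the symbols σ² (error variance), s_j² (sample variance of feature j) and n (number of rows) are notation introduced here, not taken from the chapter.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{(n-1)\, s_j^2} \cdot \frac{1}{1 - R_j^2}
  = \frac{\sigma^2}{(n-1)\, s_j^2} \cdot \mathrm{VIF}_j ,
\qquad
\operatorname{SE}(\hat{\beta}_j) \propto \sqrt{\mathrm{VIF}_j}
```

At VIF = 5.3 the standard error is about 2.3 times what it would be with uncorrelated features; at VIF = 50 it is about 7 times.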

Does correlation hurt tree-based models too?

Less than linear models. A tree splits on one feature at a time, so highly correlated features compete to be the split — whichever one produces a marginally better split wins, and the other is effectively ignored in that branch. Random forests and gradient boosting tolerate correlated features well in terms of predictive accuracy, but feature-importance scores become unreliable: the importance of two correlated features gets split between them somewhat arbitrarily. Interpreting 'which feature matters most' is harder with correlated inputs regardless of model family.

◎ Intuition

A colleague hands you a banking dataset with 40 features. The features include `account_balance_usd`, `account_balance_eur` (at the same reporting date, at a fixed exchange rate), `account_balance_log`, `income_percentile`, `income_raw`, `age_in_years`, and `age_in_months`. Before touching the model:

  • Of those seven features, how many genuinely independent pieces of information do you expect? What simple operation reveals the near-copies?
  • If you kept all seven and fitted a linear regression predicting default, would the coefficients on `account_balance_usd` and `account_balance_eur` be (roughly) equal in magnitude, opposite in sign but similar magnitude, or unstable and hard to interpret? Explain your reasoning in one sentence.
  • `income_percentile` and `income_raw` carry related but not identical information. Would you expect their correlation to be closer to 0.5, 0.8, or 0.99, and what would that correlation depend on?

↺ Reflection

Shared Variance, Wasted Capacity

The property analyst’s coefficient on rooms flipped between −£12,400 and +£18,100 across two resamples of the same dataset not because of randomness in the data or a bug in the software, but because the dataset contained multiple features — sqft, rooms, bedrooms, baths — that are near-copies of a single underlying concept: house size. A linear regression cannot uniquely attribute price to any one of these features when they move together. The coefficients are mathematically determined, but they are determined by a slightly different balance of evidence every time the training data changes, so their individual values carry no meaning. The sum of their contributions is stable; the attribution to each one is not.

Pearson correlation is the one-number measure of this problem. It captures linear co-movement on a scale that is easy to read: 0 is no linear relationship, +1 and −1 are perfect positive and negative agreement, and the squared version r² tells you the share of variance that is already explained. The practical insight is that r² shows how much of a pair's information is duplicated. At r = 0.91 you are not getting two features' worth of information from the pair; you are getting one feature plus roughly 17% of independent variation from the second. For many modelling purposes that 17% is not worth the cost: unstable coefficients for linear models, split-credit noise for tree-based models, extra multiplication for distance-based methods, and every form of sparsity the previous chapter documented.

The correlation matrix is the natural diagnostic. A 10×10 heatmap shows every pairwise correlation at once and reveals both the obvious problems (age and year_built at r = −0.98 are literally the same feature) and the subtle ones (sqft, rooms, bedrooms, baths all inter-correlating at r > 0.7 form a redundancy cluster even though no individual pair is a near-duplicate). The matrix is the starting point for feature-selection judgement calls. It does not answer ‘which feature to keep’ — that requires domain knowledge or a principled method like regularisation or PCA — but it answers ‘which features deserve attention’, which is usually the harder question.

Multicollinearity has quantitative diagnostics (variance inflation factor) and qualitative symptoms (coefficient sign flips, importance-score noise, unstable predictions near the decision boundary). The qualitative symptoms are usually what alerts you; the quantitative diagnostics tell you whether to drop, combine, or regularise. The thresholds are guidelines, not laws — below r = 0.7 the VIF is under 2 and there is rarely anything to do, between 0.7 and 0.85 is where watchful waiting pays off, and above 0.85 is where action is usually warranted. Above 0.95 the features are close enough to being the same column that combining them into one is the honest description of what is happening. The next chapter develops the tool for doing that combination systematically: principal component analysis identifies the directions along which the data varies most and expresses the whole dataset as a small number of those directions — turning a correlation problem into a compression problem with a known geometric solution.

Key Points

Pearson correlation r measures linear co-movement of two features on a scale from -1 to +1; r² is the share of one feature's variance explained by the other, so r=0.91 means 83% of one feature's variation is already captured by the other.

A correlation matrix reveals redundant feature clusters at a glance — the sqft/rooms/bedrooms/baths block in the housing dataset correlates within itself at r > 0.7, meaning those four features effectively carry one concept (house size) in four columns.

Multicollinearity destabilises linear-model coefficients because many coefficient combinations produce the same predictions on correlated features; small data perturbations flip the sign of coefficients and make individual-feature interpretation unreliable.

The variance inflation factor, VIF = 1/(1 − r²) for a single correlated pair, quantifies the damage: below 5 is generally fine, 5 to 10 is a warning, and above 10 the coefficient is very unstable. It grows non-linearly, so the gap between r = 0.95 and r = 0.99 is a 5× worsening of coefficient variance.

Practical action thresholds: below r=0.7 do nothing, from 0.7 to 0.85 monitor linear coefficients under resampling, above 0.85 drop one feature or combine them into a principal component, above 0.95 treat the features as duplicates and keep only one.

Checkpoint

Check Your Understanding

Answer these questions about the housing-dataset correlation scenarios covered in this chapter.

1

Two features have Pearson correlation r = 0.85. Which single statement most accurately describes what this number means?

2

A data analyst examines the 10×10 housing correlation matrix and finds that `sqft`, `rooms`, `bedrooms`, and `baths` all inter-correlate at r > 0.7. They need to pick ONE course of action before training a linear regression on house price. Which is most defensible?

3

Put these features in order from highest Variance Inflation Factor to lowest (most unstable coefficient to most stable), given their correlation with the rest of the features in the housing dataset.

  1. garages (r ≈ 0.42 with other features)
  2. crime_rate (r ≈ 0.20 with other features)
  3. year_built (r ≈ −0.98 with age)
  4. rooms (r ≈ 0.91 with sqft)
4

A data scientist reports that their random forest model's top three most-important features are sqft (importance 0.22), rooms (importance 0.19), and bedrooms (importance 0.15), each with confidence intervals of roughly ±0.05. Since these three features correlate at r > 0.7 with each other, the importance scores should be interpreted as precise rankings of which feature matters most for price.