Feature Selection
Not every measurement earns its place.
A hospital readmission model trained on 40 carefully selected features consistently outperformed one trained on all 200 available measurements — not because data was discarded, but because noise was removed.
- 1 Explain what feature selection is and why including irrelevant or redundant features can reduce model performance.
- 2 Distinguish between filter, wrapper, and embedded selection methods by their computational cost and relationship to the learning algorithm.
- 3 Identify which features carry predictive signal by observing class separation on a two-dimensional scatter plot.
- 4 Recognise redundancy between correlated features and explain why removing one loses little information when the other is retained.
- 5 Judge a feature's value using its correlation score with the target and its linear relationship with other selected features.
What Is Feature Selection?
The Problem with More Features
A hospital informatics team spent months building a readmission prediction model. They had access to 200 fields per patient: lab values, diagnostic codes, vital sign histories, medication records, scheduling metadata. They fed all 200 into a gradient-boosted classifier and achieved reasonable training accuracy. Then they tried a second version, trained on just 40 features selected by a domain expert. The 40-feature model was more accurate on held-out patients, faster to run, and far easier for clinical staff to audit.
The result confused the team at first. More data is almost always better — that much is well established. But more features is not the same thing as more data. The number of patients was the same in both experiments. What changed was the dimensionality of the input space.
Irrelevant features introduce noise. A feature with no statistical relationship to readmission risk does not contribute zero to the model — it contributes random patterns. On any finite training set, an irrelevant feature will appear to correlate with the outcome by chance in some minor way. The model fits that spurious correlation. When the model encounters new patients, the spurious pattern vanishes and accuracy drops. This is one pathway to overfitting that does not require high model complexity: it only requires noisy inputs.
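The chance-correlation effect described above can be reproduced with a small simulation. This is a minimal sketch under assumed conditions: the sample size, feature count, and the helper name `abs_corr_with_target` are illustrative and do not come from the hospital study. Every feature and the outcome are pure random noise, so any correlation found is spurious by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_noise = 100, 200  # illustrative sizes, not the hospital data

# Random features and a random binary outcome: no real signal exists.
X_noise = rng.normal(size=(n_patients, n_noise))
y = rng.integers(0, 2, size=n_patients)

def abs_corr_with_target(X, y):
    """Absolute Pearson correlation of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

train_scores = abs_corr_with_target(X_noise, y)
best = int(np.argmax(train_scores))  # the noise column that looks most predictive

# Fresh patients drawn from the same signal-free process.
X_new = rng.normal(size=(n_patients, n_noise))
y_new = rng.integers(0, 2, size=n_patients)
new_score = abs_corr_with_target(X_new, y_new)[best]

print(f"largest |r| on training data: {train_scores[best]:.2f}")
print(f"same feature on new patients: {new_score:.2f}")
```

With many candidate columns and few rows, the best-looking noise column shows a visibly non-zero correlation on the training sample. The score on fresh data is just another random draw, which is why a model that relied on the pattern loses accuracy.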
Redundant features create a different problem. Two features that track the same underlying variable both appear informative when scored individually. But once the model has learned from one, the other adds relatively little new signal. Including it anyway bloats the model, destabilises coefficient estimates, and can cause the learning algorithm to split its attention between two descriptions of the same phenomenon.
Feature selection vs. dimensionality reduction. Feature selection keeps a subset of the original features unchanged — the retained features are still the original measurements and can be named, explained, and audited. Dimensionality reduction techniques like Principal Component Analysis construct new features by combining the originals; the new features compress more information but may no longer correspond to any single interpretable measurement. This topic covers feature selection. Topic 3 covers dimensionality reduction.
Irrelevance and Redundancy — Two Distinct Problems
The distinction between irrelevance and redundancy matters because the solutions differ. Irrelevance is about the absence of signal. Redundancy is about the duplication of signal that already exists elsewhere in the feature set.
Irrelevant features have no statistical relationship to the target. In the hospital dataset, the sum of the digits in a patient’s administrative ID number is an example: it derives from an arbitrary numbering system with no connection to clinical outcomes. Any pattern a model discovers involving this feature in the training data is an artefact of the specific ID assignments in that dataset. Those patterns will not generalise to patients registered in a different year or transferred from another hospital system.
Redundant features do carry genuine signal. In the hospital dataset, the number of prior admissions and the number of medications on the discharge prescription both predict readmission — patients with complex chronic diseases accumulate both prior episodes and extensive medication regimens. Both features earn high correlation scores with the outcome. But they also correlate strongly with each other. Once a model has learned the pattern from prior admissions, medications adds less new information than its individual score suggests. In a linear model trained on both, the coefficient estimates can become unstable: small changes in the training sample shift credit back and forth between the two correlated predictors, even though the model’s overall prediction does not change much. The practical effect is an unreliable model that is hard to audit.
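The coefficient instability can be made concrete with a bootstrap experiment. The sketch below uses synthetic data: the latent `severity` variable, the noise levels, and the column names are assumptions chosen to mimic the prior-admissions and medications example, not real patient records. It refits an ordinary least-squares model on resampled data and compares how much each coefficient moves with how much their sum moves.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# One hidden driver produces two nearly interchangeable features (synthetic).
severity = rng.normal(size=n)
admissions = severity + 0.1 * rng.normal(size=n)
medications = severity + 0.1 * rng.normal(size=n)
risk = severity + 0.5 * rng.normal(size=n)

X = np.column_stack([admissions, medications])

# Refit least squares on 200 bootstrap resamples of the same patients.
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    w, *_ = np.linalg.lstsq(X[idx], risk[idx], rcond=None)
    coefs.append(w)
coefs = np.array(coefs)

pair_corr = np.corrcoef(admissions, medications)[0, 1]
print("corr(admissions, medications):", round(float(pair_corr), 2))
print("std of each coefficient:", coefs.std(axis=0).round(2))
print("std of their sum:       ", round(float(coefs.sum(axis=1).std()), 2))
```

The individual coefficients vary far more across resamples than their sum does. The model is confident about the combined effect but not about how to divide credit between the two columns, which is the auditing problem described above.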
Filter Methods — Score First, Model Later
Filter methods rank features by their statistical relationship to the target. No model is trained. The scores are computed directly from the data, then features below a threshold are removed before any learning begins.
Pearson correlation measures the linear relationship between a continuous feature and the target. For a feature $x$ and target $y$ across $n$ samples:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
The result lies in $[-1, 1]$. For feature selection, the absolute value is used: features are ranked by $|r|$. A score near 1 means the feature moves in lockstep with the target. A score near 0 means they are linearly unrelated.
Pearson correlation is fast and easy to interpret, but it only detects linear relationships. A feature with a strong U-shaped relationship to the outcome will receive a low Pearson score and risk being discarded.
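A short numerical check makes the linearity limitation visible. The function `pearson_r` below is a direct, illustrative implementation of the standard Pearson formula, and the two synthetic targets are assumptions chosen for the demonstration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: centred dot product over the product of norms."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=1000)
noise = 0.05 * rng.normal(size=1000)

y_linear = 2 * x + noise   # target depends on x linearly
y_ushape = x**2 + noise    # target depends on x just as strongly, but not linearly

r_linear = abs(pearson_r(x, y_linear))
r_ushape = abs(pearson_r(x, y_ushape))
print(f"|r| for the linear target:   {r_linear:.2f}")
print(f"|r| for the U-shaped target: {r_ushape:.2f}")
```

Both targets are almost fully determined by `x`, yet only the linear one earns a high score. A threshold on $|r|$ alone would discard the U-shaped feature.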
Mutual information removes the linearity constraint. It measures any statistical dependency between the feature and the target, regardless of its shape. For a discrete feature $X$ and target $Y$:
$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
The intuition: if knowing $X$ tells you nothing about $Y$, the joint probability $p(x, y)$ equals the product of the marginals $p(x)\,p(y)$ everywhere, making every log term zero. Any statistical dependency, linear or not, produces a positive score. Mutual information can detect non-linear patterns that Pearson misses, but estimating the joint distribution requires more data and is noisier on small samples.
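```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# X: feature matrix of shape (n_samples, n_features); y: class labels
scores = mutual_info_classif(X, y, random_state=42)  # one MI estimate per column
ranked_features = np.argsort(scores)[::-1]  # column indices, highest score first
```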
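To see mutual information pick up a pattern that Pearson misses, the following sketch scores a U-shaped feature and a pure-noise feature both ways. The data is synthetic and the feature names are illustrative. Because the target here is continuous, the sketch uses scikit-learn's `mutual_info_regression` rather than the `mutual_info_classif` call shown above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
n = 1000
x_ushape = rng.uniform(-1, 1, size=n)  # related to y, but not linearly
x_noise = rng.uniform(-1, 1, size=n)   # unrelated to y
y = x_ushape**2 + 0.05 * rng.normal(size=n)

X = np.column_stack([x_ushape, x_noise])
pearson = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
mi = mutual_info_regression(X, y, random_state=0)  # nearest-neighbour MI estimate

for name, r, m in zip(["u_shaped", "noise"], pearson, mi):
    print(f"{name:9s} |r| = {r:.2f}   MI = {m:.2f}")
```

Both features receive a low Pearson score, but only the U-shaped one receives a clearly positive mutual information estimate.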
The interaction blind spot. Filter methods evaluate each feature in isolation. Two features can each have near-zero correlation with the target individually while perfectly predicting it together. The classic case is XOR: knowing feature A or feature B alone gives no information, but knowing both reveals the class exactly. A filter method would discard both features before any model ever sees them. Wrapper and embedded methods evaluate features in the context of what else is present — they can detect these interactions.
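The XOR case is easy to verify numerically. In this minimal sketch, two synthetic binary features are drawn independently and the class is defined as their exclusive-or. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.integers(0, 2, size=2000)
b = rng.integers(0, 2, size=2000)
y = a ^ b  # class is 1 exactly when the two features differ

# Filter view: each feature scored on its own.
r_a = abs(np.corrcoef(a, y)[0, 1])
r_b = abs(np.corrcoef(b, y)[0, 1])

# Joint view: a rule that looks at both features at once.
joint_accuracy = float(np.mean((a != b).astype(int) == y))

print(f"|r(a, y)| = {r_a:.3f}, |r(b, y)| = {r_b:.3f}")
print(f"accuracy using both features: {joint_accuracy:.2f}")
```

Each feature looks useless in isolation, so a per-feature threshold would remove both, even though the pair determines the class completely.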
Wrapper Methods — Let the Model Decide
Wrapper methods use a model’s validation performance as the selection criterion. Rather than scoring features statistically, they ask: does adding this feature to the current selected set improve the model on held-out data?
Forward selection starts with an empty set and adds features greedily. At each step, every remaining candidate is tested by training the model with the current selected set plus that one feature. The candidate that produces the best validation score is added permanently. The process continues until adding any remaining feature fails to improve performance.
Backward elimination reverses the process: start with the full feature set and remove the feature whose loss causes the smallest drop in performance, repeating until further removal hurts.
Both approaches require roughly $O(p^2)$ model training runs for $p$ features. That is expensive for large feature sets, but powerful because the model itself evaluates whether each feature adds genuine signal given what is already included. The results are also model-specific: switching from logistic regression to a random forest may produce an entirely different selected set, because the two models extract different structure from the same features.
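Forward selection as described above can be sketched in a few lines. This is an illustrative implementation, not a reference one: the function name `forward_selection`, the `min_gain` stopping threshold, the choice of logistic regression, and the synthetic data are all assumptions. Five-fold cross-validated accuracy stands in for validation performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 6))
X[:, 0] += 2.0 * y  # strong signal (synthetic)
X[:, 1] += 1.0 * y  # weaker signal; columns 2 to 5 stay pure noise

def forward_selection(X, y, min_gain=0.005):
    """Greedily add the feature that most improves cross-validated accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Train once per candidate: current selected set plus that one feature.
        trial_scores = {
            j: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        best_j = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_j] - best_score < min_gain:
            break  # no remaining candidate improves the score enough
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = trial_scores[best_j]
    return selected, best_score

selected, score = forward_selection(X, y)
print("selected columns:", selected, "cv accuracy:", round(float(score), 3))
```

Each pass through the loop trains one model per remaining candidate, which is where the quadratic number of training runs comes from. scikit-learn also provides `SequentialFeatureSelector` for the same idea.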
Embedded Methods — Selection During Training
Embedded methods weave feature selection into the training process itself. The model learns which features are useful and which are not simultaneously with learning the prediction task.
Lasso regularisation adds an L1 penalty to the training loss:
$$\mathcal{L}(\mathbf{w}) = \mathcal{L}_{\text{pred}}(\mathbf{w}) + \lambda \sum_{j=1}^{p} |w_j|$$
The term $\lambda \sum_{j} |w_j|$ is proportional to the sum of the absolute values of the weights. The key property of the L1 norm is geometric: its feasible region (the set of weight vectors the penalty allows) has corners on the coordinate axes. When the optimiser minimises the sum of the prediction loss and this penalty, the solution tends to land on a corner, meaning some weights are driven to exactly zero, not merely small. Features with zeroed weights are eliminated entirely. Increasing $\lambda$ tightens the penalty, zeroing out more weights and producing a sparser model. Lasso is computationally efficient: one training run selects features and fits the model simultaneously.
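The effect of the regularisation strength can be observed directly. The sketch below fits scikit-learn's `Lasso` on synthetic data at three penalty values. scikit-learn names the strength `alpha` rather than λ and scales the squared-error term by 1/(2n), so its values are not directly comparable with other formulations. The data, in which only three of twenty columns matter, is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 300, 20
X = rng.normal(size=(n, p))
# Only the first three columns influence y; the other 17 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n)

surviving = {}
for alpha in [0.01, 0.1, 0.5]:
    model = Lasso(alpha=alpha).fit(X, y)
    surviving[alpha] = np.flatnonzero(model.coef_)  # indices with non-zero weight
    print(f"alpha={alpha:<5} non-zero weights: {len(surviving[alpha])}")
```

At the smallest penalty most noise columns keep small non-zero weights. As the penalty grows they are set to exactly zero and only the informative columns remain.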
Decision tree feature importance takes a different approach. Trees split on whichever feature reduces impurity (Gini or entropy) most at each node. After training, each feature’s importance is the total weighted reduction in impurity it contributes across all its splits in all trees. Features that drive large, clean splits early in many trees score highest. Features that are never selected, or that only appear in noisy shallow splits on small data subsets, score near zero. Ensemble methods like random forests and gradient boosting expose this score directly.
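Impurity-based importance is available on fitted scikit-learn ensembles through the `feature_importances_` attribute. The following sketch trains a random forest on synthetic data. The feature names and signal strengths are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
n = 600
y = rng.integers(0, 2, size=n)
names = ["strong", "weak", "noise_1", "noise_2", "noise_3"]
X = rng.normal(size=(n, len(names)))
X[:, 0] += 2.0 * y  # strong class separation
X[:, 1] += 0.7 * y  # weak class separation; remaining columns are noise

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalised to sum to 1 across features.
for name, score in sorted(zip(names, forest.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:8s} {score:.3f}")
```

In fully grown trees the noise columns usually receive small but non-zero scores, because deep splits on small subsets still reduce training impurity. The ranking is generally more informative than the absolute values.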
Choosing a Method
No single method is always best. The practical choice depends on the number of features, the available compute, and how much model-specific guidance is needed.
When a dataset has more than 100 candidate features and the relationship between features and target is unknown, filter methods are the right starting point: they are cheap, model-agnostic, and eliminate the most obviously uninformative features quickly. This cuts the search space before any expensive model training begins.
When the feature count is manageable (roughly 10–50) and the model type is already decided, embedded methods like Lasso or tree importance give a principled answer in a single training run. They account for some redundancy: Lasso penalises having two highly correlated features with non-zero weights; tree importance assigns low scores to features made redundant by another that splits earlier.
Wrapper methods are reserved for smaller, performance-critical problems where the cost of many training runs is acceptable and detecting feature interactions is important.
In production, the most common pattern chains the approaches: a filter pass removes obviously irrelevant features, then an embedded or wrapper method handles redundancy among the survivors.
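One way to express such a chain is a scikit-learn `Pipeline`. The sketch below is an assumption-laden example rather than a recommended configuration: it uses the ANOVA F-score (`f_classif`) as the filter, which for a binary target ranks features in the same order as absolute Pearson correlation, followed by an L1-penalised logistic regression as the embedded step. The step names, `k=10`, `C=0.05`, and the synthetic data are all illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
n, p = 500, 50
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 0] += 2.0 * y  # three informative columns (synthetic)
X[:, 1] += 1.5 * y
X[:, 2] += 1.0 * y

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("filter", SelectKBest(f_classif, k=10)),  # cheap per-feature pass
    ("embedded", SelectFromModel(              # model-aware pass on the survivors
        LogisticRegression(penalty="l1", solver="liblinear", C=0.05))),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Map the embedded step's mask back to original column indices.
filter_cols = np.flatnonzero(pipe.named_steps["filter"].get_support())
final_cols = filter_cols[pipe.named_steps["embedded"].get_support()]
print("after filter:  ", filter_cols.tolist())
print("after embedded:", final_cols.tolist())
```

Putting both selection steps inside the pipeline means they are refit on each training fold during cross-validation, which avoids selecting features using information from the validation data.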
A high individual filter score does not guarantee that two features are non-redundant with each other. If two features both receive a Pearson correlation of 0.70 with the target but also correlate at 0.90 with each other, they are largely capturing the same signal. Together they contribute much less than 2 × 0.70 worth of independent information. Always check inter-feature correlation after filter scoring — a feature that scores well but closely tracks an already-selected feature should be treated as a redundancy candidate.
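The inter-feature check can be automated with a greedy pass over the filter ranking. The sketch below is one possible approach: the function name `select_non_redundant`, both thresholds, and the synthetic columns are assumptions made for illustration. A feature is kept only if it scores well against the target and does not closely track a feature already kept.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2000
severity = rng.normal(size=n)
names = ["admissions", "medications", "age", "id_digit_sum"]
X = np.column_stack([
    severity + 0.3 * rng.normal(size=n),  # admissions
    severity + 0.3 * rng.normal(size=n),  # medications: tracks admissions closely
    rng.normal(size=n),                   # age: independent, moderately predictive
    rng.normal(size=n),                   # id_digit_sum: irrelevant
])
y = severity + 0.5 * X[:, 2] + 0.5 * rng.normal(size=n)

def select_non_redundant(X, y, names, min_score=0.1, max_pairwise=0.8):
    """Rank by |r| with the target, then skip features that duplicate a kept one."""
    target_r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    pairwise = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in np.argsort(target_r)[::-1]:
        if target_r[j] < min_score:
            break  # everything after this point is below the filter threshold
        if all(pairwise[j, k] <= max_pairwise for k in kept):
            kept.append(int(j))
    return [names[j] for j in kept], target_r

kept, target_r = select_non_redundant(X, y, names)
print("filter scores:", dict(zip(names, target_r.round(2))))
print("kept:", kept)
```

Only one of the two closely correlated columns survives, even though both score highly on their own. Which one is kept depends on small differences in their scores, so the choice between them is better made on domain grounds.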
- ✓ Feature selection removes irrelevant features (no signal) and redundant features (duplicated signal) before or during model training.
- ✓ Filter methods score each feature independently using statistics like Pearson correlation or mutual information: fast, model-agnostic, and blind to feature interactions.
- ✓ Wrapper methods evaluate candidate subsets using model validation performance: interaction-aware but computationally expensive at O(p²) training runs.
- ✓ Lasso drives feature weights to exactly zero through the geometric property of the L1 penalty, so selection and training happen in a single pass.
- ✓ Decision tree importance scores features by their weighted impurity reduction across all splits, embedded in ensemble training with no extra cost.
- ✓ In practice, filter and embedded methods are often chained: filter first to cut the search space cheaply, then embedded or wrapper to handle redundancy among survivors.
What is feature selection in machine learning?
Feature selection is the process of choosing a subset of input variables from a larger candidate set to use when training a model. The goal is to remove features that carry no signal (irrelevant features) or that duplicate information already captured by another feature (redundant features). A smaller, cleaner feature set often produces a more accurate, faster, and more interpretable model.
Why does adding more features sometimes hurt a model?
Features with no relationship to the target introduce random correlations into the training data that a model may fit instead of genuine patterns — a form of overfitting. More features also expand the input space, requiring proportionally more training data to cover it reliably. This is sometimes called the curse of dimensionality, which is covered in the next topic.
What is the difference between feature selection and dimensionality reduction?
Feature selection keeps a subset of the original features unchanged. Dimensionality reduction techniques like PCA construct new features that are mathematical combinations of the originals. Feature selection preserves interpretability — you can still name and explain the retained features. Dimensionality reduction can compress more information into fewer dimensions but produces features that may be harder to explain.
Two hundred hospital patients. Six measurements each. Only some predict who gets readmitted within 30 days. Before touching any controls: which single axis — horizontal or vertical — do you think will do the most to separate the two groups? And which measurement do you predict will be completely useless?
What Feature Selection Actually Does
Irrelevance and redundancy are different problems that require different thinking. An irrelevant feature has no statistical relationship to the target at all — including it introduces a dimension of pure noise into the input space. A redundant feature does carry signal, but that signal is already present in another feature the model can see. The difference matters because the corrective action differs: irrelevant features should be removed because they contribute nothing; redundant features should be removed because they contribute the same thing twice, at additional cost.
The harm from each type is also distinct. Irrelevant features cause a model to overfit noise — to learn correlations that are specific to the particular dataset and will not generalise to new observations from the same process. Redundant features harm linear models more subtly: when two correlated features are both present, their coefficients become unstable. A small change in the training data shifts credit between the two predictors even when the overall prediction barely changes. This makes models harder to interpret and less reliable when the input distribution shifts slightly.
Filter methods are the fastest path to removing irrelevant features at scale. Pearson correlation scores the linear alignment between a feature and the binary or continuous target — it is fast and interpretable but misses non-linear relationships. A feature with a strong U-shaped relationship to the outcome will receive a low Pearson score despite being genuinely informative. Mutual information captures any statistical dependency, linear or not, but requires estimating the joint distribution from samples and becomes unreliable with small datasets. Both methods share the same blind spot: they evaluate features one at a time and cannot detect features that are individually uninformative but jointly predictive. Two features with near-zero individual correlation can perfectly predict the target through an XOR-type interaction; filter methods would discard both before any model ever sees them.
Lasso addresses this partly by evaluating features in the context of the model’s full parameter space during training. The L1 penalty added to the loss function has a geometric property that causes weight vectors to settle at corners of the feasible region: points where some weights are exactly zero rather than merely small. This is unlike the L2 (ridge) penalty, which shrinks all weights toward zero but never zeros them out. Increasing the regularisation strength $\lambda$ is effectively a control on how many features survive: at low $\lambda$, most features keep non-zero weights; at high $\lambda$, only features with strong enough predictive contribution to outweigh the penalty cost retain non-zero coefficients. Decision tree feature importance works differently. It rewards features that produce large, clean splits across many training examples, regardless of their correlation with the target. The two approaches can produce different selected sets from the same data, which is useful as a cross-check.
The practical conclusion is that fewer, better features often outperform more, noisier ones when both are available. Every feature retained is a dimension the model must navigate in the input space. Every irrelevant dimension added is a dimension of noise the model must learn to discount during training, and it may fail to discount that noise fully, especially when training data is limited. Chaining a fast filter pass with an embedded or wrapper method captures the strengths of both: the filter removes the clearly uninformative features cheaply, and the second method handles the redundancy and interactions among whatever survives.
Feature selection solves two distinct problems: irrelevance (a feature has no statistical relationship to the target) and redundancy (a feature's information is already captured by another feature in the set). Both types should typically be removed, but for different reasons.
Filter methods score each feature independently using statistics like Pearson correlation or mutual information. They require no model training and scale to large feature sets, but they cannot detect features that are individually weak but jointly informative.
Wrapper methods (forward selection, backward elimination) train a model on candidate subsets and use validation performance as the selection criterion. They account for feature interactions but require O(p²) or more training runs and produce results specific to the model used.
Lasso regularisation drives feature weights to exactly zero during training through the geometric property of the L1 penalty. Increasing λ removes more features — at high enough λ, only the strongest predictors survive with non-zero coefficients.
Decision tree feature importance scores each feature by the total weighted reduction in impurity it contributes across all splits. Features never used, or used only in noisy shallow splits, score near zero.
In practice, filter methods and embedded or wrapper methods are often chained: a fast filter pass removes obviously irrelevant features, then a model-aware method handles redundancy among the remaining candidates.
Check Your Understanding
These questions test the concepts covered in the chapter.
A dataset includes both 'number of prior hospital admissions' and 'number of chronic conditions on record'. Both correlate with the readmission outcome. The two features also correlate strongly with each other. What type of feature problem does this represent?
A team runs Pearson correlation between each of 500 candidate features and a binary classification target, discards the bottom 400, and trains a model on the top 100. Which statement accurately describes a limitation of this approach?
True or false: wrapper methods are computationally faster than filter methods because they only need to train the model once on the full feature set.
A Lasso regression is trained on a dataset with 20 features. The regularisation parameter $\lambda$ is increased from 0.01 to 0.5. What is the most likely effect on the trained model?
Arrange the steps of forward selection in the correct order from first to last.
- 1. Evaluate each remaining feature by training a model with the current selected set plus that one feature
- 2. Add the feature that produced the best validation performance to the selected set
- 3. Begin with an empty selected feature set
- 4. Stop when no remaining feature improves validation performance above the threshold
Two features both receive a Pearson correlation score of 0.71 against the target. After selecting one of them, why might adding the second provide only marginal improvement?