Feature Engineering
Raw inputs rarely arrive in the form a model needs. Feature engineering is the craft of reshaping data so that its structure becomes legible to algorithms.
Automated valuation models such as Zillow's Zestimate and Rightmove's automated valuations do not simply treat raw square footage, raw year built, and raw bedroom count as independent inputs. Models of this kind typically use log-transformed price as the target variable, era-coded build periods as categorical features, and computed ratios such as square footage per bedroom and bathroom-to-bedroom ratio as engineered inputs, because raw features structurally mismatch what linear regression and gradient-boosted trees need.
- 1 State what log transformation, binning, and interaction terms each do to a feature.
- 2 Explain why right-skewed distributions benefit from log transformation before linear regression.
- 3 Given a set of raw property features, identify which transformation is most appropriate for each and justify the choice.
- 4 Given two feature sets — raw and engineered — explain which would produce better model performance and why.
Reshaping Raw Data
The raw data problem
A property platform wants to predict house prices in London. The data team has a clean dataset: 12,000 sold properties, each with a price, a build year, a bedroom count, a bathroom count, and a square footage. Everything is numerical. A linear regression model should work.
It does not work well. The predictions are systematically wrong for inexpensive flats, wildly wrong for expensive detached houses, and the model never learns that a three-bedroom house with one bathroom is a different product from a three-bedroom house with three bathrooms.
The features are real. The data is clean. The problem is that the features are in the wrong form for the algorithm to read them.
Feature engineering is the step between raw data and a trained model. It does not change what data was collected; it changes how that data is expressed. The same 12,000 sold properties, expressed differently, produce a model that predicts within 8% of the true price. Expressed as raw values, the same algorithm predicts within 28%.
Log transformation
House prices in London follow a right-skewed distribution. Most properties sell between £200,000 and £600,000. A smaller number sell between £600,000 and £1.5 million. A tail of unusual properties sells above that. The distribution is not symmetric — the right side extends much further than the left.
Linear regression assumes that the residuals (the errors between predicted and actual values) are roughly normally distributed and constant in size. With a right-skewed target such as price, both assumptions tend to fail. Predictions for expensive properties will have much larger absolute errors than predictions for cheap properties, not because the model is worse at expensive houses, but because a proportional error (say, 10%) is a much larger number at £1.5M than at £250k.
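A quick numerical check of that last point. The three prices below are made up for illustration, and the 10% error is an assumed figure rather than a measured one:

```python
import numpy as np

# Hypothetical true prices (pounds) and an assumed 10% proportional error.
true_prices = np.array([250_000, 500_000, 1_500_000], dtype=float)
proportional_error = 0.10

# The same proportional miss produces very different absolute misses.
absolute_errors = true_prices * proportional_error

for price, err in zip(true_prices, absolute_errors):
    print(f"true price GBP {price:,.0f} -> 10% error = GBP {err:,.0f}")

# Ratio between the largest and smallest absolute error.
error_ratio = absolute_errors[-1] / absolute_errors[0]
```

The proportional error is identical in every row, yet the absolute error at the top of the range is six times the error at the bottom, which is what a model trained on raw pounds has to contend with.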
The log transformation fixes this by changing the scale. Instead of predicting price in pounds, the model predicts log₁₀(price). A 10% price difference anywhere on the scale corresponds to the same change in log price: log₁₀(1.1) ≈ 0.041 regardless of the base price. Proportional differences become additive differences in log space, so residuals that scale with price in pounds become roughly constant in size across the price range.
import numpy as np
import pandas as pd
df = pd.read_csv('london_properties.csv')
# Raw price: right-skewed, range £100k – £3M
df['price_log'] = np.log10(df['price'])
# log10(200000) ≈ 5.30, log10(1500000) ≈ 6.18
# The distribution is now much closer to symmetric (check with a histogram)
# Apply the same transform to any right-skewed input features
df['sqft_log'] = np.log10(df['square_footage'])
Log transformation requires all values to be strictly positive. House prices satisfy this naturally. Features that could be zero (e.g., number of previous renovations) require a shifted log: log(x + 1), which equals zero when x = 0 rather than being undefined.
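A minimal sketch of the shifted log, using a made-up `renovations` array. Note that `np.log1p` uses the natural log rather than base 10; for a base-10 version, `np.log10(x + 1)` behaves the same way at zero.

```python
import numpy as np

# Hypothetical count feature that is often zero.
renovations = np.array([0, 1, 3, 10], dtype=float)

# np.log1p(x) computes ln(1 + x), which is defined (and equal to 0) at x = 0.
renovations_log = np.log1p(renovations)

# np.expm1(y) computes exp(y) - 1, the exact inverse of log1p.
recovered = np.expm1(renovations_log)

print(renovations_log)
print(recovered)
```

The choice of base only rescales the feature by a constant factor, so it does not change what a linear model can learn from it.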
After training, predictions in log space must be inverted: a model that outputs 5.72 is predicting 10^5.72 ≈ £525,000.
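The forward and inverse transforms can be sketched as a pair of helpers. The function names `to_log_price` and `from_log_price` are illustrative, not part of any library:

```python
import numpy as np

def to_log_price(price):
    """Forward transform: pounds -> log10(pounds)."""
    return np.log10(price)

def from_log_price(log_price):
    """Inverse transform: log10(pounds) -> pounds."""
    return np.power(10.0, log_price)

# A 10% price gap has the same size in log space at any base price.
gap_low = to_log_price(275_000) - to_log_price(250_000)
gap_high = to_log_price(1_650_000) - to_log_price(1_500_000)

# Round trip: transforming and then inverting recovers the original price.
round_trip = from_log_price(to_log_price(500_000.0))

print(gap_low, gap_high, round_trip)
```

Any metric reported to users (for example, error in pounds) should be computed after applying the inverse, not on the log values themselves.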
Binning
The dataset includes the build year of each property, ranging from 1850 to 2022. Build year is continuous — but its relationship with price is not a smooth, monotonic function of the year.
A property built in 1895 is a Victorian terrace: high ceilings, solid brick, conservation-area regulations, expensive to maintain, culturally desirable in certain markets. A property built in 1955 is a post-war semi: lower ceilings, concrete block construction, different market segment. A property built in 1998 is a modern conversion with double glazing and a boiler less than thirty years old. A property built in 2020 is a new build with developer warranties.
These are categorically different products. The year 1895 is not “slightly less valuable” than 1905 — they are in the same pre-war era. The year 1939 is not meaningfully different from 1940 — but 1938 and 1941 straddle the war and land in different regulatory and construction-quality eras.
import pandas as pd

def assign_era(year):
    if year < 1939:
        return 'pre_war'
    elif year < 1979:
        return 'post_war'
    elif year < 2000:
        return 'modern'
    else:
        return 'new_build'

df['property_era'] = df['year_built'].apply(assign_era)

# One-hot encode the era for a linear model
era_dummies = pd.get_dummies(df['property_era'], prefix='era')
df = pd.concat([df, era_dummies], axis=1)
Estate agents use informal era categories every day: “a Victorian conversion,” “a 1970s semi,” “a new-build flat.” These categories encode domain knowledge about construction standards, maintenance costs, and buyer preferences that would take a neural network thousands of examples to discover from a raw year feature alone. Encoding domain knowledge directly is almost always more efficient than hoping the model learns it.
The choice of bin boundaries is a design decision. The thresholds used here — 1939, 1979, 2000 — correspond to genuine discontinuities in construction regulations, materials, and market expectations. Arbitrary quantile-based binning (splitting into four equal-size groups by year) would lose this structure.
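To make the contrast concrete, here is a small sketch comparing domain-driven bins with quantile bins. The ten build years are invented for illustration. `pd.cut` with `right=False` is one possible vectorised equivalent of threshold comparisons such as `year < 1939`.

```python
import numpy as np
import pandas as pd

# Hypothetical build years (illustrative only).
years = pd.Series([1895, 1905, 1938, 1941, 1955, 1978, 1979, 1998, 2000, 2020])

# Domain-driven bins. right=False makes each interval [lower, upper),
# so 1939 falls in 'post_war' and 2000 falls in 'new_build'.
era = pd.cut(
    years,
    bins=[-np.inf, 1939, 1979, 2000, np.inf],
    labels=["pre_war", "post_war", "modern", "new_build"],
    right=False,
)

# Quantile bins: four groups of roughly equal size, with edges set by the data.
quartile = pd.qcut(years, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"year": years, "era": era, "quartile": quartile}))
```

In the printed table, at least one quartile groups together years that the era scheme keeps apart, because quantile edges follow how many properties fall in each range rather than where construction practice changed.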
Avoid creating too many bins. Ten bins for build year would partially recover the signal but produce sparse categories for rare years and make the feature harder to interpret. The goal of binning is to capture genuine structural discontinuities in the relationship between the feature and the target — not to preserve every detail of the continuous value.
Interaction terms
The dataset has bedroom count and bathroom count as separate features. A linear model with these two features learns a coefficient for each: “each additional bedroom adds £X to the price; each additional bathroom adds £Y.”
This is wrong. A five-bedroom house with one bathroom is a budget family home. A five-bedroom house with four bathrooms is a luxury residence. The price difference is not simply the sum of four extra bathroom coefficients — it is a fundamentally different property type. The effect of bathrooms depends on the number of bedrooms. The effect of bedrooms depends on the number of bathrooms. These features interact.
# Ratio interaction: bathrooms per bedroom
# (assumes bedrooms > 0; a studio recorded with 0 bedrooms would give inf)
df['bath_bed_ratio'] = df['bathrooms'] / df['bedrooms']
# 5 beds, 1 bath → ratio = 0.2 (budget family home)
# 5 beds, 4 baths → ratio = 0.8 (luxury residence)
# Sum, not an interaction: total room count as a rough size proxy
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
# Size-per-bedroom: does the house feel spacious?
df['sqft_per_bedroom'] = df['square_footage'] / df['bedrooms']
The bathroom-to-bedroom ratio is not a feature that was collected. It was derived from two collected features. Its relationship with price is stronger than either constituent feature alone because it captures a market signal — how well-appointed a property is for its size — that neither feature captures independently.
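The effect of adding a ratio feature can be sketched on synthetic data. Everything below is made up: the grid of properties, the price formula, and the helper `fit_rmse` are assumptions chosen so that price truly depends on the ratio, which real data will only approximate.

```python
import numpy as np

# Synthetic grid of properties (not real sales data).
beds, baths = np.meshgrid(np.arange(1, 6), np.arange(1, 5))
beds = beds.ravel().astype(float)
baths = baths.ravel().astype(float)

# Assumed ground truth: price depends on the bathrooms-per-bedroom ratio.
ratio = baths / beds
price = 200_000 + 50_000 * beds + 300_000 * ratio

def fit_rmse(features, target):
    """Ordinary least squares with an intercept; returns in-sample RMSE."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))

rmse_raw = fit_rmse([beds, baths], price)
rmse_with_ratio = fit_rmse([beds, baths, ratio], price)

print(f"RMSE, beds + baths only:   GBP {rmse_raw:,.0f}")
print(f"RMSE, with bath_bed_ratio: GBP {rmse_with_ratio:,.2f}")
```

With only the two raw counts, the linear fit cannot reproduce the ratio term and leaves a large residual. Once the ratio is supplied as a feature, the same linear model fits this synthetic target almost exactly.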
Interaction terms multiply the number of features quickly. With 10 base features, there are 45 possible pairwise interactions, and most will not improve the model. Choose interaction terms based on domain knowledge (features that genuinely modify each other’s effect) rather than by computing all possible interactions and keeping those most correlated with the target. Screening that many candidates invites overfitting to chance correlations, and it leaks information if the screening is run on data that includes the test set.
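The count of pairwise interactions can be checked directly. The feature names `x1` to `x10` are placeholders:

```python
from itertools import combinations
from math import comb

# Ten placeholder base features.
base_features = [f"x{i}" for i in range(1, 11)]

# Every unordered pair of distinct features is one candidate interaction.
pairs = list(combinations(base_features, 2))
print(len(pairs))

# How the candidate count grows with the number of base features: n * (n - 1) / 2.
pair_counts = {n: comb(n, 2) for n in (5, 10, 20, 50)}
print(pair_counts)
```

The quadratic growth is the reason exhaustive screening becomes risky: with 50 base features there are already more than a thousand candidates to test against the same target.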
Putting it together
The three transformations address three different structural mismatches between raw data and model assumptions:
Log transformation fixes a distributional mismatch. The raw price distribution is right-skewed; linear regression assumes roughly symmetric residuals. Log price satisfies this assumption.
Binning fixes a relationship mismatch. Build year has a step-function relationship with price, not a smooth linear one. Categorical era features capture the steps directly.
Interaction terms fix a feature-independence mismatch. Bedroom count and bathroom count jointly determine a market segment that neither determines alone. Their ratio encodes the joint effect as a single feature.
None of these transformations require more data. They require understanding what the algorithm needs and reshaping the available data accordingly.
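As a rough sketch, the three transformations can be gathered into one function. The function name `engineer_features`, the constants, and the three sample rows are illustrative assumptions; the column names follow the snippets earlier in the chapter.

```python
import numpy as np
import pandas as pd

ERA_EDGES = [-np.inf, 1939, 1979, 2000, np.inf]
ERA_LABELS = ["pre_war", "post_war", "modern", "new_build"]

def engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Return log, era, and ratio features derived from raw property columns."""
    out = pd.DataFrame(index=raw.index)
    # Distributional mismatch: log-transform the skewed columns.
    out["price_log"] = np.log10(raw["price"])
    out["sqft_log"] = np.log10(raw["square_footage"])
    # Relationship mismatch: bin build year into eras, then one-hot encode.
    era = pd.cut(raw["year_built"], bins=ERA_EDGES, labels=ERA_LABELS, right=False)
    out = out.join(pd.get_dummies(era, prefix="era"))
    # Feature-independence mismatch: bathrooms per bedroom (assumes bedrooms > 0).
    out["bath_bed_ratio"] = raw["bathrooms"] / raw["bedrooms"]
    return out

# Three made-up rows, for illustration only.
sample = pd.DataFrame({
    "price": [250_000, 500_000, 1_500_000],
    "square_footage": [600, 1_100, 2_800],
    "year_built": [1895, 1955, 2020],
    "bathrooms": [1, 1, 4],
    "bedrooms": [2, 3, 5],
})

features = engineer_features(sample)
print(features.T)
```

Here `price_log` would serve as the training target and the remaining columns as inputs. Keeping the transformations in one function makes it easier to apply exactly the same steps to new properties at prediction time.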
When should I apply log transformation versus square-root transformation?
Log transformation is appropriate when values span several orders of magnitude and the distribution has a long right tail — house prices, income, population counts. The log compresses large values aggressively and is the natural transformation when percentage differences are more meaningful than absolute ones. Square-root transformation is gentler — useful when the distribution is moderately skewed but not extreme, or when zero values must be preserved (log is undefined at zero, but √0 = 0). A rough guide: if the ratio of maximum to minimum is more than 100:1, use log. If it's less than 10:1, square root may suffice.
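One way to compare the two transformations is to measure skewness before and after. The sample below is synthetic (drawn from a lognormal distribution with assumed parameters), so the numbers only illustrate the ordering, not any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed sample standing in for prices or incomes.
values = pd.Series(rng.lognormal(mean=12.5, sigma=1.0, size=10_000))

skew_raw = values.skew()
skew_sqrt = np.sqrt(values).skew()   # gentler compression
skew_log = np.log10(values).skew()   # stronger compression

# Ratio of maximum to minimum, the quantity used in the rough guide above.
range_ratio = values.max() / values.min()

print(f"skew raw={skew_raw:.2f} sqrt={skew_sqrt:.2f} log={skew_log:.2f}")
print(f"max/min ratio = {range_ratio:,.0f}")
```

For a sample like this, whose range spans well over 100:1, the square root reduces the skew but leaves a clear right tail, while the log brings the skew close to zero.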
Does feature engineering introduce data leakage?
It can, if done incorrectly. Log transformation on a feature computed from input values only is safe. But some transformations require statistics computed from the dataset — binning thresholds derived from quantiles, scaling parameters, interaction terms that include the target variable. If those statistics are computed on the full dataset (including the test set) before splitting, information from the test set leaks into training. The correct approach is to compute all transformation parameters on the training set only, then apply those same parameters to transform the test set.
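A minimal sketch of the train-only pattern for quantile bin edges. The data is synthetic, and a simple positional 80/20 split is assumed in place of whatever splitting strategy a real project would use.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic feature values (illustrative only).
sqft = pd.Series(rng.lognormal(mean=6.8, sigma=0.4, size=1_000))

# Split first, so the test rows play no part in choosing thresholds.
train, test = sqft.iloc[:800], sqft.iloc[800:]

# Learn quartile edges from the training rows only.
inner_edges = train.quantile([0.25, 0.5, 0.75]).to_numpy()
edges = np.concatenate(([-np.inf], inner_edges, [np.inf]))

# Apply the same frozen edges to both splits.
train_bins = pd.cut(train, bins=edges, labels=False)
test_bins = pd.cut(test, bins=edges, labels=False)

print(train_bins.value_counts().sort_index())
print(test_bins.value_counts().sort_index())
```

The outer edges are set to infinity so that a test value outside the training range still lands in a bin instead of becoming missing. The test split will not divide into exactly equal groups, and that is expected: its bins reflect thresholds learned elsewhere.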
How do I choose which interaction terms to create?
Domain knowledge is the most reliable guide. In property pricing, the bathroom-to-bedroom ratio is a known market signal that estate agents use explicitly. In e-commerce, the product of session duration and pages viewed captures engagement better than either alone. Without domain knowledge, tree-based models (gradient boosting, random forests) discover interactions automatically: you can inspect feature importances and split patterns to identify which pairs interact strongly, then engineer those explicitly if using a linear model.
The playground is about to let you apply log transformation, a safe-for-zero variant, binning by era, and squaring to the raw house price distribution. Before you interact — a property selling for £500,000 is exactly twice the price of one selling for £250,000. After log transformation, what happens to that gap? Does it get larger, smaller, or stay the same? And if a linear model is trained to predict log(price), what extra step does it need to produce a price in pounds?
Key Ideas
Feature engineering is not about adding complexity — it is about removing the mismatch between raw measurements and model assumptions.
A linear regression model makes two assumptions about the target variable: that prediction errors are roughly the same size across the full range, and that the relationship between each feature and the target is linear. Raw house prices violate the first assumption: a 10% prediction error at £1.5M is six times larger in absolute terms than the same proportional error at £250k. Log transformation converts proportional differences into additive ones, so a 10% price gap becomes the same numerical gap in log space regardless of the base price. The residuals become comparable in size across the full range.
Binning is the recognition that some continuous features have a step-function relationship with the target. Build years 1938 and 1941 straddle a genuine discontinuity in construction era, regulation, and market segment, while 1910 and 1919 sit inside the same era. A linear coefficient on raw year would make the 1938-to-1941 change (3 years) one-third the size of the 1910-to-1919 change (9 years), which is backwards. Categorical era features, chosen by domain knowledge, capture the discontinuities directly.
Interaction terms are the recognition that some features do not have independent effects. The number of bedrooms does not have a fixed contribution to price — it has a contribution that depends on how many bathrooms are in the same property. A ratio captures this dependency as a single feature. The model no longer needs to discover the interaction from data; it is encoded directly in the feature set.
The common thread is that each transformation makes implicit structure explicit. Log price makes proportional relationships explicit. Era categories make construction-era discontinuities explicit. Bathroom-to-bedroom ratio makes the market-segment signal explicit. An algorithm that receives these engineered features starts with a structural advantage over one that must infer the same structure from raw inputs.
Log transformation compresses right-skewed distributions into approximately symmetric ones. In pounds, the gap between £500k and £1M is twice the gap between £250k and £500k; in log space the two gaps are identical, because each is a doubling and proportional differences become additive differences after log.
Binning converts a continuous feature into ordered categories, encoding domain knowledge about genuine structural discontinuities — pre-war construction is categorically different from post-war regardless of the exact year.
Interaction terms capture the joint effect of two features that modify each other — the bathroom-to-bedroom ratio encodes market segment information that neither feature carries independently.
Feature engineering does not require more data. It requires understanding the structural mismatch between what raw features express and what the algorithm assumes, then reshaping the features to close that gap.
Check Your Understanding
Four questions on log transformation, binning, and interaction terms.
A linear regression model is trained to predict log₁₀(house price). On the test set, a property with a true price of £500,000 gets a predicted log₁₀ value of 5.65. What is the model's actual price prediction in pounds?
True or false: binning a continuous feature always reduces model performance because it discards information by replacing exact values with group labels.
A dataset has two features: income (£20k–£200k) and loan amount requested (£10k–£50k). A logistic regression model predicts whether a loan will be approved. Which engineered feature is most likely to improve predictive performance?
The chapter describes three structural mismatches between raw data and model assumptions, each addressed by a different transformation. Which pairing is correct?