Features & Representations

An algorithm cannot learn from raw text, images, or categorical labels — it can only learn from numbers. This topic covers how to convert arbitrary observations into feature vectors: encoding categories numerically, normalising scales so no feature dominates, measuring similarity and distance in feature space, and engineering new features that expose structure the original measurements hide. Good feature representation is often the difference between a model that works and one that fails.

Prerequisites:

Understanding Data (required): features are the columns of a dataset, so you need to understand datasets first
Vectors (required): feature vectors are vectors in a high-dimensional space, so geometric intuition is essential

foundational

◷ 124 min total ① 6 chapters ⬡ 6 playgrounds

Chapters

foundational ◷ 20 min

Feature Vectors

A feature vector is a list of numbers that describes one entity — one song, one patient, one email, one house. Each number is one measurable property: tempo, age, word count, square footage. Together they locate that entity as a point in feature space, where similar entities sit close together and different entities sit far apart. The choice of which features to include determines everything an algorithm can learn — features it cannot see are patterns it cannot find.
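As a minimal sketch of the idea above, the snippet below turns three made-up songs into feature vectors and checks that the two similar songs sit closer together than the dissimilar pair. The feature names, the song values, and the `to_vector` helper are all illustrative assumptions, not part of the course material.

```python
import math

# Hypothetical feature order; every vector must list features in this same order.
FEATURES = ["tempo_bpm", "energy", "duration_s"]

def to_vector(record):
    """Turn a dict of measurements into a list of floats in FEATURES order."""
    return [float(record[name]) for name in FEATURES]

# Made-up example entities, one dict of measurements per song.
songs = {
    "ballad": {"tempo_bpm": 70, "energy": 0.30, "duration_s": 240},
    "slow_jam": {"tempo_bpm": 74, "energy": 0.35, "duration_s": 236},
    "dance_track": {"tempo_bpm": 128, "energy": 0.90, "duration_s": 200},
}
vectors = {name: to_vector(record) for name, record in songs.items()}

# Straight-line distance between points in feature space.
near = math.dist(vectors["ballad"], vectors["slow_jam"])
far = math.dist(vectors["ballad"], vectors["dance_track"])
print(f"ballad vs slow_jam: {near:.2f}")
print(f"ballad vs dance_track: {far:.2f}")
```

Anything left out of `FEATURES` (lyrics, key, genre) is invisible to whatever consumes these vectors, which is the point the paragraph makes about feature choice.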

5 sections

foundational ◷ 20 min

Categorical Encoding

Categorical features (genre, country, blood type, device type) cannot enter a feature vector as text. They must be encoded as numbers, and the encoding choice matters enormously. Integer encoding assigns each category an arbitrary number, which for unordered categories implies an ordering and a magnitude that do not exist. One-hot encoding avoids this by creating a binary column per category. Target encoding and embeddings offer more sophisticated approaches for high-cardinality features, where one column per category becomes impractical. The wrong encoding produces models that find patterns that do not exist and miss patterns that do.
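The difference between integer and one-hot encoding can be sketched in a few lines. The helper names below (`fit_categories`, `integer_encode`, `one_hot_encode`) and the genre list are hypothetical, chosen only for illustration; a real project would more likely use a library encoder.

```python
import math

def fit_categories(values):
    """Collect the distinct categories in a fixed (alphabetical) order."""
    return sorted(set(values))

def integer_encode(value, categories):
    """Map a category to its position in the category list."""
    return categories.index(value)

def one_hot_encode(value, categories):
    """Map a category to a binary vector with a single 1."""
    return [1 if value == c else 0 for c in categories]

genres = ["rock", "jazz", "pop", "jazz"]
categories = fit_categories(genres)

int_codes = [integer_encode(g, categories) for g in genres]
one_hot = [one_hot_encode(g, categories) for g in genres]

# Integer codes make some genres look "further apart" than others.
gap_int_jazz_pop = abs(integer_encode("jazz", categories) - integer_encode("pop", categories))
gap_int_jazz_rock = abs(integer_encode("jazz", categories) - integer_encode("rock", categories))

# One-hot vectors keep every pair of distinct genres equally far apart.
gap_hot_jazz_pop = math.dist(one_hot_encode("jazz", categories), one_hot_encode("pop", categories))
gap_hot_jazz_rock = math.dist(one_hot_encode("jazz", categories), one_hot_encode("rock", categories))

print(int_codes, gap_int_jazz_pop, gap_int_jazz_rock)
print(one_hot, gap_hot_jazz_pop, gap_hot_jazz_rock)
```

With integer codes, jazz appears twice as far from rock as from pop purely because of alphabetical order. With one-hot vectors, both gaps are identical.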

5 sections

foundational ◷ 20 min

Normalisation and Scaling

Feature vectors combine measurements from entirely different scales: tempo in beats per minute, energy as a fraction, duration in seconds. When a distance-based model such as KNN or an SVM computes distances between feature vectors, the raw numerical magnitudes determine how much each feature contributes. A feature with a range of 500 units can contribute hundreds of times more to a distance than a feature with a range of 1 unit, and up to 250,000 times more once differences are squared, as they are in Euclidean distance. This happens regardless of how informative each feature actually is. Scaling resolves this by transforming all features to a comparable numerical scale before any distance is computed.
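To make the effect concrete, here is a small sketch using min-max scaling, which is only one of several possible scaling methods. The three rows (tempo, energy, duration) are invented, and `min_max_scale` and `squared_share` are hypothetical helpers written for this example.

```python
def min_max_scale(rows):
    """Rescale each column to the range [0, 1] using that column's min and max."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def squared_share(a, b, index):
    """Fraction of the squared Euclidean distance between a and b due to one feature."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    return squares[index] / sum(squares)

# Columns: tempo (bpm), energy (0 to 1), duration (seconds). Values are made up.
rows = [
    [60.0, 0.10, 120.0],
    [180.0, 0.90, 620.0],
    [62.0, 0.85, 130.0],
]
scaled = min_max_scale(rows)

# Rows 0 and 2 differ mostly in energy; index 1 is the energy column.
raw_share = squared_share(rows[0], rows[2], 1)
scaled_share = squared_share(scaled[0], scaled[2], 1)
print(f"energy share of squared distance, raw: {raw_share:.4f}")
print(f"energy share of squared distance, scaled: {scaled_share:.4f}")
```

Before scaling, the large energy difference between rows 0 and 2 accounts for under 1% of their squared distance. After scaling, it accounts for almost all of it. In practice the minimum and maximum would be computed on training data only and reused for new data.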

5 sections

intermediate ◷ 22 min

Similarity & Distance

Distance metrics are the mathematical definition of similarity. Euclidean distance measures the straight-line distance between feature vectors; because it squares each difference, a large gap in any one dimension weighs heavily. Manhattan distance sums absolute differences, which makes it less sensitive to a single large difference and interpretable as city-block distance. Cosine similarity measures the cosine of the angle between vectors, so it ignores magnitude and captures directional similarity. Each metric embeds a different assumption about what similarity means. The choice determines which songs are recommended, which points are clustered together, and which examples are used as nearest neighbours.
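The three metrics can disagree about which vector is "most similar". The sketch below implements each one directly from its definition; the vectors `a`, `b`, and `c` are arbitrary illustrative values.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude
c = [3.0, 2.0, 1.0]  # same magnitude as a, different direction

for name, other in (("b", b), ("c", c)):
    print(
        name,
        round(euclidean(a, other), 3),
        round(manhattan(a, other), 3),
        round(cosine_similarity(a, other), 3),
    )
```

Euclidean and Manhattan distance both rank `c` as closer to `a`, while cosine similarity ranks `b` as the better match because it points in exactly the same direction. Note that cosine similarity is undefined for an all-zero vector, which this sketch does not guard against.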

5 sections

intermediate ◷ 21 min

Feature Engineering

Feature engineering transforms raw measurements into representations that models can learn from more effectively. Log transformation compresses right-skewed distributions, turning a house price axis spanning hundreds of thousands of pounds into something much closer to symmetric, so that a few very expensive houses stop dominating every distance and regression coefficient. Binning converts continuous values into ordered categories, capturing domain knowledge that a neural network would take thousands of examples to learn: a 1920s pre-war flat behaves differently on the market from a 1990s modern conversion, regardless of where exactly in those windows it was built. Interaction terms combine two features into a single composite, such as a product or a ratio, that captures their joint effect. A five-bedroom house with one bathroom is a different product from a five-bedroom house with three bathrooms, and the bedrooms-to-bathrooms ratio can predict price better than bedrooms or bathrooms alone.
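The three transformations described above can each be sketched as a one-line function. The function names, the decade-wide bins, and the example prices are assumptions made for illustration; real bin edges would come from domain knowledge.

```python
import math

def log_price(price):
    """Natural log of a positive price; compresses the long right tail."""
    return math.log(price)

def decade_bin(year_built):
    """Bin a construction year into its decade, e.g. 1923 -> '1920s'."""
    return f"{(year_built // 10) * 10}s"

def bedrooms_per_bathroom(bedrooms, bathrooms):
    """Ratio feature combining two raw counts (assumes bathrooms > 0)."""
    return bedrooms / bathrooms

# Made-up prices in pounds: two ordinary houses and one very expensive one.
prices = [150_000, 300_000, 3_000_000]
logs = [log_price(p) for p in prices]

raw_gaps = (prices[1] - prices[0], prices[2] - prices[1])
log_gaps = (logs[1] - logs[0], logs[2] - logs[1])
print("raw gaps:", raw_gaps)
print("log gaps:", tuple(round(g, 3) for g in log_gaps))

print(decade_bin(1923), decade_bin(1928), decade_bin(1994))
print(bedrooms_per_bathroom(5, 1), round(bedrooms_per_bathroom(5, 3), 3))
```

On the raw scale the expensive house is 18 times further from its neighbour than the two ordinary houses are from each other. On the log scale the gaps become ln 2 and ln 10, a ratio of roughly 3.3, because the log turns multiplicative differences into additive ones.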

5 sections

intermediate ◷ 21 min

Feature Selection

Real datasets arrive with far more candidate features than any model needs. Most are redundant, irrelevant, or pure noise. Feature selection is the principled process of deciding which features to keep — and, crucially, why. This chapter covers the three families of selection methods, the distinction between irrelevance and redundancy, and how to match the right method to the right situation.
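The distinction between irrelevance and redundancy can be illustrated with a simple correlation check. This is a hedged sketch of one possible screening approach, not the chapter's own method: the dataset is invented, the thresholds 0.3 and 0.95 are arbitrary, and Pearson correlation only detects linear relationships.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up housing data: one column per candidate feature.
features = {
    "area_m2": [50, 70, 90, 110, 130, 150],
    "area_sqft": [538, 753, 969, 1184, 1399, 1615],  # same quantity, other units
    "door_number": [4, 1, 6, 2, 5, 3],               # unrelated to price
}
price = [150, 200, 260, 310, 370, 420]  # target, in thousands

# Irrelevant: barely correlated with the target (illustrative threshold).
irrelevant = [
    name for name, column in features.items()
    if abs(pearson(column, price)) < 0.3
]

# Redundant: two features that carry nearly the same information.
redundant = [
    (a, b) for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) > 0.95
]

print("irrelevant:", irrelevant)
print("redundant pairs:", redundant)
```

Here `door_number` is flagged as irrelevant because it tells the model almost nothing about price, while `area_m2` and `area_sqft` are each highly informative but redundant with one another, so keeping both adds no new information.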

5 sections