SeeingML
Features · Chapter 2 of 6

Categorical Encoding

Turning labels into numbers — without lying to your algorithm

If a service like Netflix encoded film genre as Action=1, Comedy=2, Drama=3, Romance=4, it would be telling its recommendation algorithm that Romance is four times Action and that Comedy sits exactly midway between Action and Drama. These relationships say nothing about what films people actually enjoy.

Learning Objectives
  1. Name three categorical encoding methods and describe when each is appropriate.
  2. Explain why integer encoding of unordered categories creates false mathematical relationships.
  3. Choose the appropriate encoding method for a described categorical feature and implement it.
  4. Given an encoding choice in a described ML pipeline, identify what false relationships it introduces and which algorithm types are most affected.
¶ Narrative

From Labels to Numbers

The music streaming platform’s feature vector currently has five numerical measurements per song: tempo, energy, danceability, acousticness, loudness. The recommendation engine uses these five numbers to compute distances in feature space and return the nearest songs. The engineering team wants to improve results by adding genre. A song is jazz, pop, classical, or rock — but these are text labels. They have no natural numerical representation. Before genre can enter the feature vector, it must be converted to numbers. The way that conversion is done determines whether the algorithm learns something real about musical relationships or something invented.

Integer Encoding — and Why It Fails

The simplest approach is to assign each genre a number. Jazz gets 1, pop gets 2, classical gets 3, rock gets 4. This is integer encoding, also called label encoding when the categories are unordered, or ordinal encoding when they have a genuine order.

For unordered categories, integer encoding is always wrong. The reason is mathematical: assigning jazz=1 and rock=4 tells every distance-based algorithm that rock is three times as far from jazz as pop is, that classical (3) sits exactly midway between pop (2) and rock (4), and that jazz and pop are exactly as close as pop and classical, or classical and rock. None of these relationships exist in musical reality. They are artifacts of the numbering.

A KNN recommendation engine given this encoding will compute that a jazz song (genre=1) is far from a rock song (genre=4) and close to a pop song (genre=2) — not because pop sounds more like jazz than rock does, but because 2 is closer to 1 than 4 is on the number line. The algorithm is following false information the encoding invented.
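The false distances can be made concrete in a few lines of Python. This is a minimal sketch, not part of the playground: `encoding_a` is the chapter's example assignment, `encoding_b` is an arbitrary reshuffle invented for illustration, and `genre_gap` is a hypothetical helper name.

```python
# Two arbitrary integer assignments for the same four unordered genres.
# encoding_b is a made-up reshuffle, used only for illustration.
encoding_a = {"jazz": 1, "pop": 2, "classical": 3, "rock": 4}
encoding_b = {"rock": 1, "jazz": 2, "pop": 3, "classical": 4}


def genre_gap(encoding, first, second):
    """Distance that the genre column alone contributes between two genres."""
    return abs(encoding[first] - encoding[second])


for name, encoding in [("A", encoding_a), ("B", encoding_b)]:
    jazz_pop = genre_gap(encoding, "jazz", "pop")
    jazz_rock = genre_gap(encoding, "jazz", "rock")
    print(f"encoding {name}: jazz-pop = {jazz_pop}, jazz-rock = {jazz_rock}")
```

Under the first assignment rock is three times as far from jazz as pop is; under the reshuffle the two gaps are equal. Nothing about the music changed, only the numbering.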

Integer encoding places genres on a number line. The gaps between positions are equal — implying equal musical distance between adjacent genres. The assignment is completely arbitrary: any permutation produces a different set of false distances. The algorithm cannot know the assignment is meaningless.
Common Mistake

Using integer encoding for unordered categorical features is the most common categorical encoding mistake. It is invisible at training time: the code runs, the model trains, and accuracy metrics look reasonable. The bug lives in what the model learned: fake musical relationships that can partially correlate with real ones (an arbitrary numbering will sometimes place genuinely similar genres next to each other, so the encoding accidentally preserves some signal). The partial correlation makes the bug hard to detect. The model is not obviously wrong, just subtly corrupted.


One-Hot Encoding

The correct approach for unordered categories is to create one binary column per category. A song is either jazz or not jazz — a clean binary question with a 0 or 1 answer. Repeat the question for every genre. Four genres produce four binary columns.

One-hot encoding. Every genre gets its own column. Every song gets exactly one 1, in the column for its genre, and 0 everywhere else. No genre is numerically closer to or farther from any other — jazz and rock are as different from each other as jazz and pop.

This encoding makes no mathematical claim about relationships between genres. In feature space, every genre is exactly distance √2 from every other genre: the one-hot vectors are mutually orthogonal and equidistant. No genre is numerically adjacent to any other. No false proximity, no false ordering. The encoding is honest about what it knows: these categories are different from each other, and there is no principled reason to treat any pair as closer than any other.
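The √2 figure is easy to check directly. The sketch below assumes the four genres from the running example and uses a hypothetical helper, `one_hot`, written here for illustration; it computes the Euclidean distance for every pair of distinct genres.

```python
import math
from itertools import combinations

genres = ["jazz", "pop", "classical", "rock"]


def one_hot(genre):
    """Return a list with 1 in the genre's own position and 0 elsewhere."""
    return [1 if g == genre else 0 for g in genres]


# Euclidean distance between every pair of distinct genres.
pair_distances = {
    (a, b): math.dist(one_hot(a), one_hot(b))
    for a, b in combinations(genres, 2)
}

for (a, b), d in pair_distances.items():
    print(f"{a:>9} vs {b:<9} distance = {d:.4f}")
```

Any two distinct one-hot vectors differ in exactly two positions, each by 1, so every distance is √(1² + 1²) = √2 regardless of which pair is chosen.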

💡 Insight

One-hot encoding is not just a technical choice — it is a statement about what you believe is true. By using it, you are telling the algorithm: these categories have no inherent ordering or proximity. The algorithm should not infer any mathematical relationship between them beyond same or different. Integer encoding makes the opposite claim: that the categories lie on a line with meaningful distances. If that claim is false, the algorithm will learn the false distances as if they were real.

The Cardinality Problem

One-hot encoding works well for low-cardinality features: those with a small number of distinct values, typically 2 to 20. For high-cardinality features, it creates a different problem. Country of origin has 195 possible values: 195 binary columns, almost all zero for any given song. User ID for a streaming platform has millions of values. Product SKU in an e-commerce catalogue has hundreds of thousands. One-hot encoding these features produces feature matrices that are enormous, mostly zero, and computationally expensive to process.

High-dimensional sparse representations also cause distance measures to behave poorly. When most features are zero for all points, distances become dominated by the few non-zero entries — a different form of the same false-distance problem.
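A rough sense of scale helps here. The sketch below counts cells in a dense one-hot matrix; the catalogue size of 10,000 songs is an assumption chosen only for illustration, and `one_hot_footprint` is a hypothetical helper name.

```python
def one_hot_footprint(n_rows, n_categories):
    """Return (total cells, non-zero cells, fraction of cells that are zero)."""
    total_cells = n_rows * n_categories
    nonzero_cells = n_rows  # one-hot rows contain exactly one 1
    zero_fraction = 1 - nonzero_cells / total_cells
    return total_cells, nonzero_cells, zero_fraction


# 10,000 songs is an assumed catalogue size, not a figure from the chapter.
n_songs = 10_000
for feature, n_categories in [("genre", 4), ("country", 195)]:
    total, nonzero, zero_fraction = one_hot_footprint(n_songs, n_categories)
    print(f"{feature}: {total:,} cells, {zero_fraction:.1%} of them zero")
```

The number of informative cells stays fixed at one per song, while the matrix grows linearly with the number of categories, so the zero fraction approaches 100% as cardinality rises.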

For high-cardinality features, two alternatives are more appropriate: target encoding and embeddings.

| Method | Preserves order | Handles high cardinality | Risk | Best for |
| --- | --- | --- | --- | --- |
| Integer encoding | Yes | Yes | Implies false ordering for unordered categories | Ordinal features only |
| One-hot encoding | No | No (sparse at high cardinality) | Feature explosion for many categories | Low-cardinality unordered categories |
| Target encoding | No | Yes | Target leakage if not cross-validated carefully | High-cardinality with strong target relationship |
| Embeddings | No (learned) | Yes | Requires sufficient training data | Very high cardinality: words, users, items |

Which Algorithms Care Most

The impact of encoding choice is not uniform across algorithm types. Tree-based algorithms (decision trees, random forests, gradient boosted trees) split on one feature at a time. With native categorical support or one-hot columns, they can ask “is genre == jazz?” as a binary split without assigning a number to jazz at all; with integer codes, a few threshold splits can still isolate any single category. These algorithms are largely insensitive to categorical encoding choice.

Distance-based algorithms compute distances between full feature vectors. KNN and support vector machines depend directly on distances being meaningful, and neural networks, which multiply each input by a weight, are similarly sensitive to the magnitude of an encoded value. If genre is encoded as an integer and the genre dimension contributes distance 3 between jazz and rock songs, that contribution flows directly into every similarity calculation the algorithm makes. Wrong encoding corrupts every distance calculation, and therefore every prediction.

Real World

Gradient boosted tree libraries — XGBoost, LightGBM, CatBoost — now support native categorical features, handling encoding internally. CatBoost in particular was designed specifically to handle high-cardinality categorical variables correctly using ordered target encoding. This is part of why these libraries dominate Kaggle competitions involving tabular data: the algorithm handles categorical variables correctly without manual encoding decisions from the practitioner.

Bridge to the Playground

The playground shows the same 120 songs from Chapter 1, but now adds genre as an encoded feature on the vertical axis. A select control switches between integer encoding (jazz=1, pop=2, classical=3, rock=4) and two one-hot proxy views (using energy or acousticness as the y-axis to show genre-based clustering without the artificial integer ladder). Switching between them shows directly what integer encoding tells the algorithm — four genre bands at equal vertical intervals, implying equal distance — compared to what one-hot encoding represents. A toggle shows centroid distance lines labelled with their lengths. A reference genre control highlights one song and draws nearest-neighbour lines to its three closest songs under the current encoding.

In this section

When is integer (ordinal) encoding appropriate?

Integer encoding is only appropriate when the categories have a genuine natural order with meaningful gaps — satisfaction ratings (poor=1, fair=2, good=3, excellent=4), education level, or pain scales. For unordered categories like genre, country, or blood type, integer encoding is always wrong because it implies ordering that does not exist.
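For a genuinely ordered feature, the mapping can be written out explicitly. The following is a minimal sketch using the satisfaction scale above; the `responses` list is made up for illustration.

```python
# Explicit, ordered mapping for a feature whose categories really are ranked.
satisfaction_order = ["poor", "fair", "good", "excellent"]
satisfaction_code = {
    label: rank for rank, label in enumerate(satisfaction_order, start=1)
}

# Hypothetical survey responses, invented for illustration.
responses = ["good", "poor", "excellent", "fair", "good"]
encoded = [satisfaction_code[r] for r in responses]
print(encoded)
```

Spelling the order out by hand matters: an encoder that numbers labels alphabetically would rank "excellent" lowest and "poor" highest, which reintroduces a false ordering into a feature that has a real one.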

What is the curse of one-hot encoding?

One-hot encoding creates one binary column per category. A feature with 1,000 unique values becomes 1,000 columns — almost all zeros. This high-dimensional sparse representation is computationally expensive and causes distance measures to behave poorly. For high-cardinality features, target encoding or embeddings are more appropriate.

What is target encoding?

Target encoding replaces each category with the mean value of the target variable for that category. Genre encoded by average user rating: jazz → 4.1, pop → 3.8, classical → 4.3, rock → 3.6. This captures a real relationship between category and target, but leaks target information into features, which can cause overfitting if not done carefully with cross-validation.
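One common way to limit the leakage is to compute each row's category mean from other rows only. The sketch below is one possible out-of-fold scheme, not a definitive implementation: the function name, the interleaved fold assignment (row index modulo `n_folds`), and the ratings are all assumptions made for illustration. Real pipelines would usually shuffle rows before assigning folds.

```python
from collections import defaultdict


def target_encode_out_of_fold(categories, targets, n_folds=2):
    """Encode each row with the category mean computed from the OTHER folds."""
    global_mean = sum(targets) / len(targets)
    encoded = [0.0] * len(categories)
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        # Accumulate target statistics from every row outside this fold.
        for i, (cat, y) in enumerate(zip(categories, targets)):
            if i % n_folds != fold:
                sums[cat] += y
                counts[cat] += 1
        # Encode the held-out rows; unseen categories fall back to the global mean.
        for i, cat in enumerate(categories):
            if i % n_folds == fold:
                encoded[i] = sums[cat] / counts[cat] if counts[cat] else global_mean
    return encoded


# Hypothetical ratings, invented for illustration only.
genres = ["jazz", "jazz", "pop", "pop", "jazz", "pop"]
ratings = [4.0, 5.0, 3.0, 4.0, 3.0, 5.0]
print(target_encode_out_of_fold(genres, ratings))
```

Because a row's own rating never contributes to its own encoded value, the model cannot simply read the target back out of the feature.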

◎ Intuition

The playground is about to show you what happens when genre is encoded two different ways and added to the feature space alongside an audio feature. Before you switch encodings — if jazz is assigned 1 and rock is assigned 4, how far apart do you think the algorithm will treat them compared to jazz and pop (jazz=1, pop=2)? Is that difference real, or is it a product of the numbering?

↺ Reflection

Key Ideas

Genre is a categorical feature: it takes one of a fixed set of text labels — jazz, pop, classical, rock. Algorithms cannot use text in distance calculations, so genre must be converted to numbers before entering a feature vector. The conversion choice is consequential. Assigning jazz=1, pop=2, classical=3, rock=4 encodes a specific mathematical claim: that rock (4) is four times as large as jazz (1), that classical (3) is exactly halfway between pop (2) and rock (4), and that the distance from jazz to pop equals the distance from pop to classical and from classical to rock. Not one of these relationships has any basis in musical reality. They are artifacts of the numbering.

One-hot encoding avoids these false claims by replacing the single genre column with four binary columns — one per genre. Each song has a 1 in exactly one column and 0 in all others. A jazz song is described as [1, 0, 0, 0]. A rock song is [0, 0, 0, 1]. The Euclidean distance between any two genres in this encoding is always √2 — all genres are equidistant from each other. The encoding makes no claim about which genres are more similar. It asserts only that these categories are different from each other, which is the only thing that is actually known.

The choice matters most for distance-based algorithms. A KNN recommendation engine computes the Euclidean distance between every song’s feature vector and the query song’s vector. If genre is encoded as an integer, the distance between a jazz song (genre=1) and a pop song (genre=2) contains a genre contribution of 1. The distance between a jazz song and a rock song (genre=4) contains a genre contribution of 3. The algorithm will systematically treat jazz and pop as more similar than jazz and rock — not because they sound more similar, but because 2 is closer to 1 than 4 is. A model trained on this data will produce subtly wrong recommendations that are difficult to debug, because the error is in the representation, not the algorithm.
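The effect on a nearest-neighbour lookup can be sketched with two candidate songs and a single audio feature. Everything in this example is hypothetical: the songs, their energy values, and the helper names are invented for illustration, and only energy plus genre are used rather than the full five-feature vector.

```python
import math

GENRES = ["jazz", "pop", "classical", "rock"]
INTEGER_CODE = {"jazz": 1, "pop": 2, "classical": 3, "rock": 4}


def integer_vector(song):
    """Feature vector: energy plus a single integer genre column."""
    return [song["energy"], INTEGER_CODE[song["genre"]]]


def one_hot_vector(song):
    """Feature vector: energy plus one binary column per genre."""
    return [song["energy"]] + [1 if g == song["genre"] else 0 for g in GENRES]


def nearest(query, candidates, to_vector):
    """Return the candidate with the smallest Euclidean distance to the query."""
    return min(candidates, key=lambda s: math.dist(to_vector(query), to_vector(s)))


# Hypothetical songs; the energy values are invented for illustration.
query = {"title": "query (jazz)", "genre": "jazz", "energy": 0.30}
candidates = [
    {"title": "pop track", "genre": "pop", "energy": 0.90},
    {"title": "rock track", "genre": "rock", "energy": 0.35},
]

print("integer:", nearest(query, candidates, integer_vector)["title"])
print("one-hot:", nearest(query, candidates, one_hot_vector)["title"])
```

With integer codes the pop track wins, because its genre gap of 1 outweighs its large energy difference while the rock track carries a genre gap of 3. With one-hot columns both candidates sit the same √2 away in the genre dimensions, so the energy feature decides and the rock track is returned.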

Tree-based algorithms — decision trees, random forests, gradient boosted trees — are largely unaffected. They split on one feature at a time and can represent “genre is jazz” as a binary condition without ever computing a distance between genre values. This insensitivity to encoding is part of why tree-based methods dominate tabular ML benchmarks: they are robust to the kinds of representation errors that corrupt distance-based methods.

For features with hundreds or thousands of distinct values — country (195), user ID (millions), product SKU (hundreds of thousands) — one-hot encoding produces matrices that are enormous and mostly zero. High-dimensional sparse representations increase memory usage and cause distance measures to behave poorly, because the few non-zero dimensions dominate all distance calculations. Target encoding (replace each category with the mean target value for that category) and learned embeddings (dense numerical representations trained alongside the model) are more appropriate for these cases.

The same principle applies to the next chapter’s topic. Normalisation and scaling address a related distortion: when a feature like tempo (60–200 BPM) sits alongside a feature like acousticness (0–1), the tempo dimension contributes much larger raw numbers to every distance calculation, regardless of how much information it actually carries. The fix — rescaling all features to a common range — resolves a false-distance problem with the same structure as integer encoding: arbitrary numerical magnitude distorting the geometry of feature space.

Key Points

Integer encoding of unordered categories (jazz=1, pop=2, classical=3, rock=4) implies ordering and magnitude that do not exist. It tells the algorithm that rock is four times jazz and that pop sits exactly midway between jazz and classical.

One-hot encoding creates one binary column per category — every song has a 1 in exactly one column and 0 in all others. All genres are equidistant from each other. The encoding makes no mathematical claim beyond same or different.

Distance-based algorithms (KNN, neural networks, SVMs) are maximally sensitive to encoding choice — false distances from wrong encoding corrupt every distance calculation. Tree-based algorithms are largely insensitive.

High-cardinality features (thousands of categories) make one-hot encoding impractical — target encoding and embeddings are more appropriate tools for these cases.

Checkpoint

Check Your Understanding

Four questions on categorical encoding, integer encoding, and one-hot encoding.

1

An e-commerce platform encodes product category as: Electronics=1, Clothing=2, Books=3, Food=4, Toys=5. A KNN recommendation engine uses this encoding. What is the most likely consequence?

2

True or false: one-hot encoding always produces better model performance than integer encoding for categorical features.

3

A dataset has a 'country' feature with 180 unique values. A data scientist applies one-hot encoding. What problem does this create?

4

Order these encoding methods from most to least appropriate for encoding 'star rating' (1 star, 2 stars, 3 stars, 4 stars, 5 stars) as a feature in a linear regression model:

  1. Integer encoding (1–5)
  2. One-hot encoding
  3. Target encoding
  4. Embeddings