Is cosine similarity a distance metric?

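Technically no. A distance metric must return 0 for identical points, be symmetric, and satisfy the triangle inequality. Cosine similarity returns 1 for identical direction, 0 for perpendicular, and -1 for opposite directions, so larger values mean closer rather than farther. It can be converted to cosine distance (1 minus cosine similarity), which returns 0 for identical directions but can still violate the triangle inequality, so it is not a true metric either; angular distance (the angle between the two vectors) does satisfy it. In practice, the distinction rarely matters for ranking: cosine similarity is used directly to rank nearest neighbours.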
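Whether 1 minus cosine similarity behaves like a metric can be checked on concrete vectors. The sketch below tests the triangle inequality on three hypothetical 2-D vectors, chosen only for illustration, and runs the same test for angular distance (the angle in radians between the vectors).

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

def angular_distance(a, b):
    # Angle in radians; clip guards against rounding just outside [-1, 1].
    return np.arccos(np.clip(cosine_similarity(a, b), -1.0, 1.0))

# Hypothetical vectors: b sits halfway (in angle) between a and c.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

# Triangle inequality asks: d(a, c) <= d(a, b) + d(b, c)
direct = cosine_distance(a, c)
via_b = cosine_distance(a, b) + cosine_distance(b, c)
print("cosine distance: ", direct, "vs", via_b)

ang_direct = angular_distance(a, c)
ang_via_b = angular_distance(a, b) + angular_distance(b, c)
print("angular distance:", ang_direct, "vs", ang_via_b)
```

For these vectors the direct cosine distance is larger than the detour through b, which a true metric never allows, while the angular distances add up exactly.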

What is the dot product's relationship to cosine similarity?

Cosine similarity is the dot product of two vectors divided by the product of their magnitudes. When vectors are normalised to unit length, cosine similarity equals the dot product directly. This is why normalised embeddings in neural networks can use fast dot product computation as a proxy for cosine similarity — the magnitudes are all 1, so the dot product and cosine similarity are identical.
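The equivalence can be confirmed numerically. This is a minimal sketch with two arbitrary vectors; `normalise` is a helper name introduced here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def normalise(v):
    # Scale to unit length (assumes v is not the zero vector).
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

cos_ab = cosine_similarity(a, b)
dot_unit = np.dot(normalise(a), normalise(b))
print(cos_ab, dot_unit)  # the two values agree up to floating-point rounding
```

Normalising once, up front, means every later comparison needs only a dot product.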

Similarity & Distance

Every recommendation, every classifier, every clustering algorithm has an implicit answer to: what does 'similar' mean?

Spotify's recommendation system uses cosine similarity in its embedding space — two songs are similar if their feature vectors point in the same direction, regardless of whether one has uniformly higher values than the other. A quiet acoustic song and a loud acoustic song can be very similar by cosine, very different by Euclidean.

Learning Objectives

1 State the formula for Euclidean distance, Manhattan distance, and cosine similarity.
2 Explain what geometric property each metric measures and when each is appropriate.
3 Given two feature vectors, identify which metric would rate them as most similar and justify why.
4 Given a described recommendation or classification system, identify which distance metric is most appropriate and explain what the alternatives would get wrong.

¶ Narrative

Defining Similar

Two songs are playing on a streaming platform. Song A: jazz piano, tempo 95 BPM, energy 0.31, acousticness 0.88. Song B: acoustic folk, tempo 102 BPM, energy 0.28, acousticness 0.82. Song C: electronic pop, tempo 128 BPM, energy 0.91, acousticness 0.05. Which song is most similar to Song A? Intuitively Song B — both are acoustic, low energy, unhurried. But “intuitively similar” cannot run on a server. It must be converted to a number. That number is a distance metric — a function that takes two feature vectors and returns a single value representing how different they are. The choice of metric encodes an assumption about what similarity means, and different assumptions produce different answers.

Euclidean distance

Imagine the straight-line distance between two points on a map. Euclidean distance extends this idea to any number of dimensions — it is the shortest possible path between two points in feature space.

d(a, b) = \sqrt{\sum_{i=1}^{n} (a_{i} - b_{i})^{2}}

Square the difference in each feature, sum the squares, and take the square root. Using energy and acousticness alone, Song A and Song B are 0.07 apart (close), while Song A and Song C are 1.02 apart (far). Song B is the correct recommendation.
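The quoted distances can be reproduced in a few lines. The sketch assumes the calculation uses only energy and acousticness (tempo left out), which is the setup that matches the figures of 0.07 and 1.02.

```python
import numpy as np

def euclidean(a, b):
    # Square each difference, sum, then take the square root.
    return np.sqrt(np.sum((a - b) ** 2))

# [energy, acousticness] for the three songs in the narrative
song_a = np.array([0.31, 0.88])  # jazz piano
song_b = np.array([0.28, 0.82])  # acoustic folk
song_c = np.array([0.91, 0.05])  # electronic pop

d_ab = euclidean(song_a, song_b)
d_ac = euclidean(song_a, song_c)
print(round(float(d_ab), 2), round(float(d_ac), 2))
```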

Euclidean distance is the straight-line distance between two points. Song B sits close to Song A — similar energy and acousticness. Song C is far away — high energy, low acousticness, the opposite musical character.

Squaring the differences means that one large gap between songs hurts more than many small ones. A song that differs from the reference by a lot in just one feature — say, tempo — gets pushed far away even if energy, acousticness, and danceability all match closely.

Manhattan distance

Imagine navigating a city grid — you can only walk along streets, never cut diagonally through a block. Manhattan distance measures that grid path: the total distance travelled when you move one feature at a time.

d(a, b) = \sum_{i=1}^{n} |a_{i} - b_{i}|

Take the absolute difference in each feature and sum them, with no squaring. For Song A and Song B, on energy and acousticness: |0.31 − 0.28| + |0.88 − 0.82| = 0.03 + 0.06 = 0.09.

Manhattan distance follows the grid — horizontal then vertical. The dashed line shows the shorter Euclidean shortcut. Manhattan treats each dimension separately.

Because Manhattan does not square differences, a song that is very different in one feature but nearly identical in all others stays closer under Manhattan than under Euclidean. It is more forgiving of a single outlier dimension.
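The difference shows up when one candidate has a single large gap and another has several small ones. The feature values below are hypothetical, picked so that the two metrics rank the candidates in opposite orders.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

# Hypothetical four-feature songs on a 0-1 scale.
reference = np.array([0.1, 0.5, 0.5, 0.5])
one_big_gap = np.array([0.9, 0.5, 0.5, 0.5])      # differs by 0.8 in one feature
many_small_gaps = np.array([0.4, 0.8, 0.8, 0.8])  # differs by 0.3 in every feature

for name, song in [("one_big_gap", one_big_gap), ("many_small_gaps", many_small_gaps)]:
    print(name, "euclidean:", euclidean(reference, song),
          "manhattan:", manhattan(reference, song))
```

Manhattan places `one_big_gap` closer to the reference (0.8 against 1.2), while Euclidean places `many_small_gaps` closer (0.6 against 0.8).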

Property	Euclidean	Manhattan
Path shape	Straight line (diagonal)	Grid path (horizontal + vertical steps)
Formula	√(Σ differences²)	Σ |differences|
Effect of large single difference	Amplified (squared)	Linear — no amplification
Sensitive to outlier dimensions	More sensitive	Less sensitive
Typical use	Low dimensions, spherical clusters, general default	High dimensions, robust to outlier features, sparse data

Cosine similarity

Two songs can sound identical but one was recorded quietly and one loudly. Cosine similarity ignores overall volume — it only asks: do these two songs point in the same direction in feature space? A song with the same ratios of energy to acousticness to tempo is maximally similar by cosine even if its raw values are twice as large.

\cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}

Compute the dot product of the two feature vectors, then divide by the product of their lengths. The result is between −1 and 1: songs pointing in the same direction score near 1, songs pointing in perpendicular directions score near 0, and songs pointing in opposite directions score near −1.

Cosine similarity measures direction, not magnitude. Two songs pointing the same way are maximally similar by cosine even if one is louder overall.
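A short sketch of that invariance, using a hypothetical feature vector and a copy scaled by 2 to stand in for the louder recording:

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

quiet = np.array([0.2, 0.1, 0.4])  # hypothetical feature values
loud = 2.0 * quiet                 # same ratios, twice the magnitude

print("cosine, quiet vs loud:    ", cosine_similarity(quiet, loud))
print("euclidean, quiet vs loud: ", euclidean(quiet, loud))
print("cosine, quiet vs reversed:", cosine_similarity(quiet, -quiet))
```

Scaling leaves the cosine score at 1 (up to rounding) while the Euclidean distance grows with the scale factor; flipping the sign of every feature gives the other extreme, −1.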

Real World

Word embeddings in NLP use cosine similarity to measure semantic similarity between words and documents. The words “king” and “queen” have feature vectors pointing in similar directions in embedding space. Their magnitudes may differ, but cosine ignores magnitude. This is why cosine similarity is the standard metric for text retrieval, document similarity, and embedding-based recommendation: in all these domains, direction carries meaning but magnitude is an artifact.

python

import numpy as np

def euclidean(a, b):
  return np.sqrt(((a - b) ** 2).sum())

def manhattan(a, b):
  return np.abs(a - b).sum()

def cosine_similarity(a, b):
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Features: [tempo_scaled, energy, acousticness]; tempo is BPM / 100, the other two are already on a 0-1 scale
song_a = np.array([0.95, 0.31, 0.88])  # jazz
song_b = np.array([1.02, 0.28, 0.82])  # acoustic folk — similar
song_c = np.array([1.28, 0.91, 0.05])  # electronic pop — different

print(euclidean(song_a, song_b))         # small — songs are close
print(euclidean(song_a, song_c))         # large — songs are far
print(cosine_similarity(song_a, song_b)) # near 1 — same direction
print(cosine_similarity(song_a, song_c)) # lower — different direction

The three metrics are not interchangeable. Choosing Euclidean when cosine is appropriate, or Manhattan when Euclidean matches the data geometry, changes which songs are recommended — and the error is silent. The algorithm returns k results regardless of whether those results are the genuinely most similar songs or just the ones that happen to be numerically close by the wrong definition of close. The playground makes this concrete: switch metrics and watch which songs appear as nearest neighbours.
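The silent change in results can be sketched with a nearest-neighbour lookup that takes the metric as a parameter. The catalogue here is hypothetical (four songs, with `scaled_jazz` constructed as an exact multiple of the reference), and `nearest` is an illustrative helper, not the playground's implementation.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity, so smaller means more similar for all three metrics.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

METRICS = {"euclidean": euclidean, "manhattan": manhattan, "cosine": cosine_distance}

def nearest(reference, catalogue, metric, k=2):
    """Return the names of the k catalogue entries closest to reference."""
    distance = METRICS[metric]
    ranked = sorted(catalogue, key=lambda name: distance(reference, catalogue[name]))
    return ranked[:k]

# Hypothetical catalogue: [tempo / 100, energy, acousticness]
reference = np.array([0.95, 0.31, 0.88])
catalogue = {
    "acoustic_folk": np.array([1.02, 0.28, 0.82]),
    "electronic_pop": np.array([1.28, 0.91, 0.05]),
    "scaled_jazz": 1.6 * reference,  # same ratios, larger values
    "slow_ballad": np.array([0.70, 0.35, 0.80]),
}

for metric in METRICS:
    print(metric, nearest(reference, catalogue, metric))
```

Every call returns exactly k names, so nothing signals a poor choice of metric: Euclidean and Manhattan return `acoustic_folk` and `slow_ballad`, while cosine puts `scaled_jazz` first.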

In this section

When should I use cosine similarity instead of Euclidean distance?

Use cosine similarity when magnitude differences are not meaningful and directional similarity is what matters. In text analysis, a short document and a long document on the same topic should be considered similar — Euclidean distance would rate them as different because the long document has higher word counts across all dimensions. Cosine similarity captures their shared direction. In music, a quiet acoustic song and a loud acoustic song share the same musical character — cosine captures this, Euclidean penalises the loudness difference.
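The document case can be sketched with word-count vectors. The four-word vocabulary and the counts are invented for illustration; the long article is modelled, as a simplifying assumption, as the short one with every count multiplied by ten.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical word counts for the vocabulary [climate, carbon, goal, match]
short_climate = np.array([20, 10, 1, 0], dtype=float)
long_climate = 10 * short_climate  # same topic, ten times the length
short_football = np.array([0, 1, 15, 20], dtype=float)

print("euclidean, short vs long climate:", euclidean(short_climate, long_climate))
print("euclidean, climate vs football:  ", euclidean(short_climate, short_football))
print("cosine, short vs long climate:   ", cosine_similarity(short_climate, long_climate))
print("cosine, climate vs football:     ", cosine_similarity(short_climate, short_football))
```

By Euclidean distance the football article is the closer match to the short climate article, simply because the two have similar lengths; cosine similarity pairs the two climate articles.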

Does normalisation affect which metric to use?

Yes, significantly. After z-score normalisation, every feature contributes to Euclidean and Manhattan distance on a similar scale, so no single large-valued feature (such as tempo in BPM) dominates the result. Cosine similarity is unaffected by rescaling a whole vector, because it ignores magnitude, but z-score normalisation shifts and rescales each feature separately, which changes the vectors' directions and can therefore change cosine rankings as well. If you have normalised features and are choosing between Euclidean and Manhattan, Euclidean penalises large differences more because it squares them, while Manhattan treats all differences linearly.
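The effect of scale on Euclidean distance can be seen in a small sketch. The five-song catalogue is hypothetical, with tempo left in raw BPM so that it dwarfs the 0-1 energy feature; z-scoring is done per column with plain NumPy.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Hypothetical catalogue: columns are [tempo in BPM, energy on a 0-1 scale]
songs = np.array([
    [100.0, 0.3],  # reference
    [104.0, 0.9],  # close in tempo, very different energy
    [115.0, 0.3],  # further in tempo, identical energy
    [80.0, 0.1],
    [140.0, 0.8],
])

# z-score each column: subtract its mean, divide by its standard deviation
z = (songs - songs.mean(axis=0)) / songs.std(axis=0)

raw_to_1, raw_to_2 = euclidean(songs[0], songs[1]), euclidean(songs[0], songs[2])
z_to_1, z_to_2 = euclidean(z[0], z[1]), euclidean(z[0], z[2])
print("raw:     ", raw_to_1, raw_to_2)
print("z-scored:", z_to_1, z_to_2)
```

On raw values the second song looks nearest to the reference because tempo dominates and the large energy gap barely registers. After z-scoring, the energy gap counts fully and the third song becomes the nearer neighbour.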

What is the curse of dimensionality and how does it affect distance metrics?

In high dimensions, points tend to become approximately equidistant from each other. For many data distributions, the ratio of maximum to minimum pairwise distance approaches 1 as the number of dimensions grows. This makes distance-based algorithms unreliable in very high dimensions: the idea of a nearest neighbour loses meaning when everything is almost equally far away. Dimensionality reduction techniques address this by projecting data into a lower-dimensional space where distances are more meaningful.
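A small simulation illustrates the trend. It rests on stated assumptions: points drawn uniformly at random from the unit hypercube, Euclidean distance, and distances measured from one random query point rather than over all pairs. Real data with structure can behave less severely.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def max_min_ratio(n_dims, n_points=500):
    """Ratio of farthest to nearest Euclidean distance from one random query."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.max() / distances.min()

ratios = {d: max_min_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>4} dimensions: max/min distance ratio = {ratio:.2f}")
```

In two dimensions the farthest point is many times further away than the nearest; by a thousand dimensions the ratio sits close to 1, so the nearest and farthest neighbours are almost indistinguishable.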

◎ Intuition

The playground is about to show you a reference jazz song and its nearest neighbours under three different distance metrics. Before you switch — a quiet acoustic jazz song and a loud acoustic jazz song have similar musical character but different overall energy levels. Do you think Euclidean distance would rate them as similar or different? What about cosine similarity? Why might these two metrics disagree on the same pair of songs?

↺ Reflection

Key Ideas

Every algorithm that asks “which songs are most similar?” needs a definition of similar. That definition is the distance metric. It is not a neutral mathematical tool — it is a decision about what properties of a song matter for similarity. The metric you choose changes which songs get recommended.

Euclidean distance is sensitive to large differences in any single feature. A song that differs from the reference by a lot in just tempo will be pushed far away — even if energy, acousticness, danceability, and loudness all match closely. Squaring the differences amplifies any single large gap. Use Euclidean when you want songs that are close across every measured feature simultaneously.

Manhattan distance treats all differences proportionally: no squaring, no amplification. A song that is very different in one feature but nearly identical in four others stays closer under Manhattan than Euclidean.

Cosine similarity ignores overall loudness and volume entirely. It only asks whether two songs have the same ratios of energy to acousticness to tempo. A quiet jazz song and a loud jazz song are maximally similar by cosine. They may be far apart by Euclidean. When magnitude is an artifact, such as recording volume or document length, cosine is the honest metric.

The same reference jazz song, the same 120-song catalogue, three metrics: they produce partially overlapping but distinct lists of nearest neighbours. Songs that are genuinely acoustically similar appear in all three lists. Songs that are similar in character but differ in loudness appear only in the cosine list. None of the lists is wrong; they answer different questions. The engineering decision is which question your system should be answering.

Key Points

Euclidean distance is sensitive to large differences in any single feature — squaring amplifies any one large gap and can push songs far apart even when most features match.

Manhattan distance treats all differences proportionally — a song that is very different in one feature but similar in four others stays closer under Manhattan than under Euclidean.

Cosine similarity ignores overall magnitude. Two songs with the same musical character recorded at different volumes have cosine similarity 1.0 when their feature vectors are exact multiples of each other, because magnitude is an artifact here, not meaningful content.

Metric choice is a design decision about what similarity means for the specific problem. The same data, three different metrics, three partially different lists of nearest neighbours — all correct, each answering a different question.

✓ Checkpoint

Check Your Understanding

Four questions on Euclidean, Manhattan, and cosine distance, plus metric selection.

A document similarity system compares news articles. Article A is 500 words about climate change. Article B is 5,000 words about climate change. Article C is 500 words about football. Which metric best captures that A and B are about the same topic?

True or false: after z-score normalisation, Euclidean distance and cosine similarity will always produce the same ranking of nearest neighbours.

A fraud detection system uses 200 binary features — each transaction either has (1) or does not have (0) each of 200 transaction characteristics. Which distance metric is most natural and interpretable for this data?

Order these recommendation scenarios from most to least appropriate use case for cosine similarity (most appropriate first):

1. Finding similar news articles, where article length varies widely but topic matters
2. Finding similar houses, where square footage, bedrooms, and price all have meaningful magnitudes
3. Finding similar songs in a catalogue, where recording volume varies but musical character matters
4. Finding similar patients, where age, blood pressure, and test results are all meaningful absolute values