
Vector Operations

Q: Can cosine similarity be negative?

Yes. Cosine similarity equals the cosine of the angle between two vectors, so it ranges from −1 (opposite directions) through 0 (perpendicular) to +1 (identical directions). In NLP embeddings, a negative cosine similarity means the two vectors point in opposing directions; this can signal contrasting or unrelated content, but it does not reliably mean the concepts are semantic opposites.

Addition, scaling, and the dot product — three operations that power all of ML

When you search for 'machine learning basics' in a semantic search engine, the system computes the cosine similarity between your query vector and millions of document vectors, then returns the most aligned results. This chapter shows exactly what that computation means and why it works.

Learning Objectives

1 Explain what the dot product measures geometrically and how to compute it algebraically from two vectors.
2 Calculate vector addition, scalar multiplication, and cosine similarity for given 2D vectors.
3 Distinguish between dot product and cosine similarity, and explain why cosine similarity is magnitude-invariant.

¶ Narrative

Operations on Vectors

Knowing what a vector is — a directed arrow with magnitude and direction — gets you started. Knowing how to operate on vectors is what lets you do ML.

Three operations cover most of what you will ever need: addition, scalar multiplication, and the dot product.

Vector Addition

Two vectors can be added together by placing the tail of the second at the tip of the first. The result is the single arrow that goes from the start of the first vector to the tip of the second.

Tip-to-tail vector addition. Vector A draws first. Then B is placed with its tail at A's tip. The resultant — A + B — runs from A's original tail to B's tip.

Algebraically, addition is component-wise:

Vector Addition

a + b = (a_{1} + b_{1}, a_{2} + b_{2})

The order does not matter — vector addition is commutative: a + b = b + a.

For example: a hiker walks (2, 3) — two kilometres east and three north — then turns and walks (1, −1) — one east and one south. The net displacement is (2 + 1, 3 + (−1)) = (3, 2). The single arrow from start to finish is the sum — regardless of which leg of the journey you take first.
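The hiker example can be checked in a few lines. This is a minimal sketch using NumPy; the names `leg_1`, `leg_2`, `net`, and `reverse_order` are chosen here for illustration.

```python
import numpy as np

# The two legs of the hike, as (east, north) displacements in kilometres.
leg_1 = np.array([2, 3])
leg_2 = np.array([1, -1])

# NumPy adds arrays component by component, which is exactly vector addition.
net = leg_1 + leg_2
reverse_order = leg_2 + leg_1

print(net)                                  # [3 2]
print(np.array_equal(net, reverse_order))   # True: addition is commutative
```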

This operation appears in every residual network and in the standard Transformer architecture. A residual connection adds a layer’s input directly to its output: y = F(x) + x. That addition, which is plain vector addition applied at every layer, is one of the most important architectural decisions in modern deep learning. It gives gradients a direct path through very deep networks, which mitigates the vanishing-gradient problem and is a large part of why models like ResNet and GPT can be trained effectively at such depth.
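A residual connection can be sketched with plain NumPy. The function `layer` below is a hypothetical stand-in for F(x) (a linear map followed by ReLU), and the sizes are arbitrary; real architectures use far more elaborate layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    """Hypothetical stand-in for F(x): a linear map followed by ReLU."""
    return np.maximum(W @ x, 0.0)

x = rng.normal(size=4)
W = rng.normal(size=(4, 4))  # square, so F(x) has the same shape as x

# Residual connection: component-wise addition of the layer output and its input.
y = layer(x, W) + x

print(y.shape)  # (4,)
```

The addition only works because F(x) and x have the same number of components, which is why the weight matrix in this sketch is square.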

Scalar Multiplication

A scalar is a plain number with no direction. Multiplying a vector by a scalar scales its length without changing its direction (unless the scalar is negative, which reverses the direction).

Scalar Multiplication

c v = (c \cdot v_{1}, c \cdot v_{2})

Multiplying v = (2, 1) by the scalar 3 gives (6, 3) — the same direction, three times the length. Multiplying by −1 flips the vector to point the opposite way.

Scalar multiplication appears directly in the gradient descent update rule: w ← w − η∇L. The term η∇L scales the gradient vector by the learning rate η, and the minus sign makes the step point opposite to the gradient, downhill on the loss. A small η produces a short step; a large η produces a long step along the same line. The direction of the update is determined by the gradient; the size of the step is determined by η together with the gradient’s own magnitude.
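The effect of the learning rate can be seen on a toy problem. The loss L(w) = w · w and the two learning rates below are assumptions chosen only to keep the arithmetic simple; they are not taken from any particular model.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w . w, which is 2w (illustrative only).
    return 2 * w

w = np.array([1.0, -2.0])
g = grad(w)  # [ 2., -4.]

# Same gradient, two learning rates: the update is the gradient scaled by eta,
# subtracted from the current weights.
small_step = w - 0.01 * g
large_step = w - 0.1 * g

print(small_step)  # [ 0.98 -1.96]
print(large_step)  # [ 0.8 -1.6]
```

Both updates move along the same line; the second simply moves ten times as far.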

The vector v = (2, 1) scaled by different factors. Positive scalars stretch or shrink the arrow without changing direction. A negative scalar reverses direction. All positive-scaled versions point the same way — only the length changes.

The Dot Product

The dot product takes two vectors and returns a single number — a scalar — that captures how aligned they are. The sign of the result tells you the relationship between the directions.

When two vectors point in the same general direction — angle less than 90° — the dot product is positive. When they are perpendicular — exactly 90° apart — the dot product is zero. When they point in opposing directions — angle greater than 90° — the dot product is negative.

Computing the dot product in two dimensions: multiply matching components and add. For vectors (3, 1) and (1, 3):

a \cdot b = 3 \times 1 + 1 \times 3 = 3 + 3 = 6

Positive — the vectors point in the same general direction. For (3, 0) and (0, 3):

a \cdot b = 3 \times 0 + 0 \times 3 = 0

Zero — the vectors are perpendicular. For (3, 0) and (−3, 0):

a \cdot b = 3 \times (- 3) + 0 \times 0 = - 9

Negative — the vectors point in opposite directions.

In 2D: a · b = a₁b₁ + a₂b₂. In higher dimensions, add more terms: a₁b₁ + a₂b₂ + a₃b₃ + … one term per dimension. In a 768-dimensional embedding space, the dot product has 768 terms — but the meaning is identical: multiply matching components and sum.
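The "multiply matching components and sum" rule translates directly into code. The helper `dot` below is written out by hand for illustration; in practice `np.dot` does the same job, and the final lines compare the two on random 768-dimensional vectors.

```python
import numpy as np

def dot(a, b):
    """Multiply matching components and sum the products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return sum(x * y for x, y in zip(a, b))

print(dot([3, 1], [1, 3]))    # 6
print(dot([3, 0], [0, 3]))    # 0
print(dot([3, 0], [-3, 0]))   # -9

# The same rule in 768 dimensions: 768 products, one sum.
rng = np.random.default_rng(0)
u = rng.normal(size=768)
v = rng.normal(size=768)
print(np.isclose(dot(u, v), np.dot(u, v)))  # True
```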

The geometric formula connects the algebraic calculation to the angle θ between the two vectors:

Dot Product — geometric

a \cdot b = ∥ a ∥ ∥ b ∥ cos θ

Both formulas give the same number. The geometric form makes the sign rule transparent: cos θ is positive for acute angles, zero at 90°, and negative for obtuse angles.
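One way to convince yourself that the two formulas agree is to compute both for the same pair of vectors. In this sketch the angle θ is measured independently with `np.arctan2` rather than derived from the dot product, so the comparison is not circular.

```python
import numpy as np

a = np.array([3.0, 1.0])
b = np.array([1.0, 3.0])

# Algebraic form: multiply matching components and sum.
algebraic = np.dot(a, b)

# Geometric form: measure each vector's direction from the x-axis,
# take the difference as the angle between them, then apply the formula.
theta = np.arctan2(b[1], b[0]) - np.arctan2(a[1], a[0])
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

print(algebraic)                       # 6.0
print(np.isclose(algebraic, geometric))  # True
```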

Vector A is fixed. As vector B rotates, the dot product changes sign. When the angle is acute the dot product is positive. At 90° it is exactly zero. When obtuse it is negative.

Common Mistake

The dot product is not the same as the magnitude. Taking v · v gives ‖v‖², not ‖v‖. The dot product of a vector with itself equals the sum of its squared components, which is the squared length; take the square root to recover the length itself.
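The identity is easy to verify numerically. The vector (3, 4) is chosen here only because its length comes out to a whole number.

```python
import numpy as np

v = np.array([3.0, 4.0])

self_dot = np.dot(v, v)       # sum of squared components: 9 + 16
length = np.linalg.norm(v)    # Euclidean length

print(self_dot)               # 25.0
print(length)                 # 5.0
print(np.isclose(np.sqrt(self_dot), length))  # True
```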

Each neuron in a fully connected layer computes a dot product: it takes the dot product of its weight vector with the input vector, adds a bias, and passes the result through an activation function. A linear layer with 512 output neurons computes 512 dot products simultaneously, one per output neuron. The dot product is the core arithmetic operation of a forward pass, repeated at enormous scale in every neural network in production.
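The "one dot product per neuron" picture can be sketched as follows. The input size of 8, the random weights, and the choice of ReLU as the activation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=8)           # input vector (8 features, chosen arbitrarily)
W = rng.normal(size=(512, 8))    # one weight vector (row) per output neuron
bias = rng.normal(size=512)

# One neuron at a time: dot product, plus bias, through ReLU.
one_by_one = np.array(
    [max(np.dot(W[i], x) + bias[i], 0.0) for i in range(512)]
)

# All 512 neurons at once: a matrix-vector product is 512 dot products.
all_at_once = np.maximum(W @ x + bias, 0.0)

print(np.allclose(one_by_one, all_at_once))  # True
```

Writing the layer as `W @ x` is the same computation as the loop; libraries prefer it because the 512 dot products can be batched into a single optimized call.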

Cosine Similarity

If you want to know whether two vectors point the same way — without caring how long they are — divide the dot product by both magnitudes. The result is always between −1 and +1 and depends only on the angle, not the scale.

Rearranging the geometric dot product formula gives:

Cosine Similarity

cos θ = \frac{a \cdot b}{∥ a ∥ ∥ b ∥}

This quantity is called cosine similarity. It ranges from −1 (opposite directions) through 0 (perpendicular) to +1 (identical directions), and it is magnitude-invariant: scaling either vector up or down does not change it.

python

import numpy as np

a = np.array([3, 1])
b = np.array([1, 3])

dot = np.dot(a, b)                                         # 6
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # ≈ 0.6

Vector A grows from length 1 to length 5 while keeping its direction fixed. The dot product with B increases proportionally. The cosine similarity stays identical — it measures only the angle, which has not changed.

That invariance is exactly why ML systems prefer cosine similarity over the raw dot product when comparing embeddings.
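The invariance can be reproduced numerically. This sketch stretches a vector along the direction of (3, 1) from length 1 to length 5 and records both quantities against b = (1, 3); the helper name `cosine_similarity` is defined here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

b = np.array([1.0, 3.0])
direction = np.array([3.0, 1.0])
direction = direction / np.linalg.norm(direction)  # unit vector along (3, 1)

rows = []
for length in range(1, 6):
    a = length * direction  # same direction, length 1 through 5
    rows.append((length, np.dot(a, b), cosine_similarity(a, b)))

for length, dot, cos_sim in rows:
    print(f"length={length}  dot={dot:.3f}  cos_sim={cos_sim:.3f}")
```

The dot product column grows in proportion to the length, while the cosine similarity column stays at 0.600 on every row.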

Real World

Semantic search encodes each document and query as a high-dimensional vector. Comparing them with cosine similarity means that a short tweet and a long article about the same topic can score nearly 1.0 — their scale differs but their direction matches.

Recommendation systems represent users and items as vectors. A cosine similarity close to +1 between a user vector and a film vector means the film is likely to be well-received regardless of how long the user’s watch history is.

Word embeddings like Word2Vec or GloVe place semantically similar words at similar angles. The classic demonstration: among all words in the vocabulary, the one whose vector has the highest cosine similarity to “king − man + woman” is typically “queen”, showing that the embedding geometry captures real-world relationships.

In this section

What does a negative dot product mean?

A negative dot product means the two vectors point in generally opposite directions — the angle between them is greater than 90°. The more directly they oppose each other, the more negative the value. A dot product of zero means the vectors are exactly perpendicular.

Why do ML models use cosine similarity instead of Euclidean distance?

Cosine similarity measures only the angle between two vectors, ignoring their magnitudes. This matters in ML because two embeddings can represent the same semantic concept at very different scales — one document may have many words and another few. Euclidean distance would penalise the difference in scale; cosine similarity does not.
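A toy comparison makes the difference concrete. The two vectors below are made-up stand-ins for a short and a long document on the same topic: same direction, very different scale. They are not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short_doc = np.array([1.0, 2.0])     # hypothetical small-magnitude vector
long_doc = np.array([10.0, 20.0])    # same direction, ten times the scale

cos_sim = cosine_similarity(short_doc, long_doc)
euclidean = np.linalg.norm(short_doc - long_doc)

print(round(cos_sim, 3))     # 1.0: identical direction
print(round(euclidean, 3))   # 20.125: far apart by distance
```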

Is the dot product the same as multiplying two vectors?

No. Multiplying two vectors component-by-component gives a new vector (the Hadamard product). The dot product sums those component products into a single number — a scalar. The dot product collapses two vectors into one number that measures their alignment.
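In NumPy the two operations are written differently, which makes the distinction easy to see. A short sketch with the vectors used earlier in the chapter:

```python
import numpy as np

a = np.array([3, 1])
b = np.array([1, 3])

hadamard = a * b        # component-by-component product: still a vector
dot = np.dot(a, b)      # the same products, summed into one number

print(hadamard)         # [3 3]
print(dot)              # 6
print(hadamard.sum() == dot)  # True
```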

Key Terms

Dot Product
Cosine Similarity

◎ Intuition

Two vectors can point in the same direction, in completely opposite directions, or anywhere in between. What single number captures exactly how aligned they are — without caring how long either one is?

↺ Reflection

What the Operations Reveal

Vector addition combines two displacements into one. Placing (1, 2) and (3, −1) tip-to-tail produces (4, 1) — the single arrow you would need to take instead of taking both. In neural networks, adding vectors this way underlies residual connections, where a layer’s output is added to its input to let gradients flow more easily during training.

The dot product goes further. When a = (3, 0) and b = (0, 3), the dot product is 3·0 + 0·3 = 0 — the vectors are perpendicular and share no component in each other’s direction. When a = (3, 0) and b = (3, 0), the dot product is 9 — they are perfectly aligned. When a = (3, 0) and b = (−3, 0), the dot product is −9 — they directly oppose each other.

A positive dot product indicates the angle between the vectors is less than 90° — they point in the same general direction and partially reinforce each other. A dot product of exactly zero means the vectors are perpendicular — they share no directional component whatsoever. A negative dot product means the angle exceeds 90° — the vectors oppose each other, and one has a component pointing directly against the other.

But the magnitude of the dot product depends on the lengths of both vectors, not just the angle. Scaling a from (1, 0) to (100, 0) multiplies every dot product involving a by 100. This is often undesirable when the goal is comparison rather than combination.

Cosine similarity removes this scale dependence by dividing by both magnitudes. For a = (3, 1) and b = (1, 3), the dot product is 6 and each magnitude is √10, so the cosine similarity is 6 / 10 = 0.6: the vectors point in similar but not identical directions. Scaling a to (300, 100) multiplies both the dot product and ‖a‖ by 100, so the factor cancels and the cosine similarity stays at 0.6.

This is why semantic embedding models are evaluated with cosine similarity rather than the raw dot product. A model trained on Wikipedia generates embedding vectors of varying norms for different words and documents. The angle between those embeddings encodes meaning; their magnitudes are an artefact of training dynamics. Cosine similarity discards the artefact and keeps the signal.

Key Points

The dot product a·b = a₁b₁ + a₂b₂ + ... collapses two vectors into a scalar. It is positive when the vectors point in the same general direction, zero when perpendicular, and negative when they point in opposing directions.

Cosine similarity divides the dot product by both magnitudes, making it scale-invariant. The vectors (3, 0) and (6, 0) have different dot products with any vector B that is not perpendicular to them, but identical cosine similarity with B.

ML embedding models rely on cosine similarity precisely because of this invariance. A short document and a long one on the same topic can score near 1.0 — their lengths differ but their directions match.

✓ Checkpoint

Check Your Understanding

Four questions on vector operations. Select an answer, then reveal to see the explanation.

Vectors a = (2, 0) and b = (0, 5) are perpendicular. What is their dot product?

True or false: the cosine similarity between two vectors stays the same if you double the length of one of them.

To compute the cosine similarity between a = (3, 4) and b = (1, 2), put these steps in the correct order:

1. Divide the dot product by the product of the magnitudes
2. Compute the dot product: a₁b₁ + a₂b₂
3. Compute ‖a‖ = √(3² + 4²) and ‖b‖ = √(1² + 2²)

What is the result of the vector addition (2, 3) + (−1, 4)?