Why does data type matter for machine learning?

Algorithms make mathematical assumptions about their inputs. Distance-based algorithms like KNN assume equal gaps between values. Tree-based algorithms handle categorical data differently from continuous. Using the wrong type causes the algorithm to find patterns that do not exist or miss patterns that do.

What is binary data?

Binary data is a special case of categorical data with exactly two possible values, such as smoker/non-smoker, fraud/not-fraud, or pass/fail. It gets special treatment because it can be encoded as a single 0/1 number without implying any ordering or magnitude.

Understanding Data · Chapter 2 of 5

Types of Data

Not all measurements are equal — and algorithms care deeply about the difference

A hospital patient intake form contains every data type simultaneously — age is continuous, blood type is categorical, pain level is ordinal, and smoker status is binary — and each field must be handled differently by any ML system.

Learning Objectives

1 Name the four main data types and give one example of each
2 Explain why ordinal data cannot be treated as continuous
3 Identify the correct data type for each field in a given dataset
4 Identify which data type encoding is wrong in a described ML pipeline and explain what bug it would cause

¶ Narrative

The hospital intake form

A patient arrives at a hospital. Before any treatment begins, a nurse fills out an intake form. Every field is a measurement — but not every measurement is the same kind of thing.

The form collects: age (47 years), body temperature (37.8°C), blood type (AB), pain level (7 out of 10), ward assignment (cardiology), smoker status (yes), and number of previous admissions (2). Seven fields. Four main types of data, plus a count field that needs its own care.

An ML system that treats these seven fields identically will make serious errors. The algorithm does not know that 7 on a pain scale is not seven times worse than 1, or that blood type AB is not numerically between A and O. It only sees the numbers it is given — and if those numbers carry false implications, the algorithm will act on them.

python

import pandas as pd

df = pd.read_csv('hospital_intake.csv')
print(df.dtypes)
# age              float64    ← continuous
# temperature      float64    ← continuous
# blood_type       object     ← categorical (stored as text)
# pain_level       int64      ← ordinal (looks like integer!)
# ward             object     ← categorical (stored as text)
# smoker           bool       ← binary
# prior_admissions int64      ← count

A pandas dtype reports how data is stored, not what kind of data it is statistically. A pain score stored as int64 is not continuous just because pandas treats it as a number, and treating it as one leads to the encoding bugs covered at the end of this chapter.
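One way to make the statistical type explicit is to declare it in pandas rather than rely on the storage dtype. The sketch below is a minimal illustration, assuming a small hand-built DataFrame in place of hospital_intake.csv (the rows and values are hypothetical).

```python
import pandas as pd

# Hypothetical rows standing in for hospital_intake.csv.
df = pd.DataFrame({
    "age": [47.0, 63.5, 29.0],
    "blood_type": ["AB", "O", "A"],
    "pain_level": [7, 3, 9],
    "smoker": [True, False, False],
})

# Unordered categories: only equality comparisons make sense.
df["blood_type"] = pd.Categorical(
    df["blood_type"], categories=["A", "B", "AB", "O"], ordered=False
)

# Ordered categories: ranking is allowed, arithmetic is not.
df["pain_level"] = pd.Categorical(
    df["pain_level"], categories=list(range(1, 11)), ordered=True
)

print(df.dtypes)

# Element-wise rank comparison works because the categories are ordered.
print(df["pain_level"] > 5)
```

Declaring pain_level as an ordered categorical keeps comparisons available while making accidental arithmetic on the column much harder to write.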

Continuous data

Age and temperature are continuous measurements. A continuous variable can take any value within a range, including all the decimal fractions in between. A patient can be 47.3 years old. A temperature can be 37.83°C. The number line for these measurements is smooth and unbroken — there are no gaps between valid values.

Patient ages plotted on a number line. Any value is valid — 47.3 years is just as real as 47 or 48. The distribution is smooth and continuous: no gaps, no isolated groups, just a density that rises and falls.

Continuous data is a quantitative measurement that can take any value within a range, including all fractional values. The gaps between any two values are equal and meaningful: the difference between 37.0°C and 38.0°C is the same as the difference between 38.0°C and 39.0°C. Age, weight, temperature, income, and distance are all continuous.

Because the gaps are equal and meaningful, continuous data supports all arithmetic operations. You can meaningfully average two temperatures, subtract one age from another, or compute the distance between two patients in feature space. Algorithms that rely on distances or weighted sums (KNN, SVMs with RBF kernels, neural networks) can consume continuous data without any encoding step, although the features usually need to be scaled first so that no single column dominates.
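The distance between two patients in feature space can be sketched as follows. This is an illustration only: the three patients are hypothetical, and standardizing each column before measuring distance is an assumption made here so that age does not dominate purely because its range is wider.

```python
import numpy as np

# Hypothetical patients: columns are age (years) and temperature (°C).
patients = np.array([
    [47.0, 37.8],
    [52.0, 36.9],
    [23.0, 39.4],
])

# Standardize each column to zero mean and unit variance.
mean = patients.mean(axis=0)
std = patients.std(axis=0)
scaled = (patients - mean) / std


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


d01 = euclidean(scaled[0], scaled[1])  # patient 0 vs patient 1
d02 = euclidean(scaled[0], scaled[2])  # patient 0 vs patient 2
print(d01, d02)
```

With these values, patient 0 ends up closer to patient 1 than to patient 2, which matches the intuition that a 47-year-old and a 52-year-old with similar temperatures are more alike than a 47-year-old and a feverish 23-year-old.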

Categorical data

Blood type and ward assignment are categorical measurements. A patient’s blood type is one of four labels: A, B, AB, or O. These labels have no natural order. O is not “more” than A. AB is not between A and O. Neurology is not halfway between cardiology and oncology. The only relationship between categories is same versus different.

Continuous and categorical data have visually distinct signatures. Age produces a smooth curve — any value between 0 and 120 is valid, and nearby values are genuinely similar. Blood type produces four isolated bars — there is no value between A and B, and no ordering.

Categorical data (also called nominal data) consists of labels that fall into distinct, unordered groups. The only valid relationship between categories is equality: two values either belong to the same category or they do not. No arithmetic is defined on categorical values — you cannot average blood types or subtract one ward from another.

Common Mistake

Encoding blood type as A=1, B=2, AB=3, O=4 and feeding those integers directly into a distance-based algorithm implies that O is four times A, that the distance between A and B equals the distance between B and AB, and that O is further from A than B is. None of those relationships exist in blood type biology — they are invented by the encoding. One-hot encoding (a separate binary column for each category) avoids this entirely.
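A minimal sketch of one-hot encoding with pandas, assuming a short hypothetical blood_type column. The fixed category list is a choice made here so that all four columns appear even when a sample happens to lack one blood type.

```python
import pandas as pd

blood = pd.Series(["A", "B", "AB", "O", "A"], name="blood_type")

# Declare the full set of categories up front.
blood = blood.astype(pd.CategoricalDtype(categories=["A", "B", "AB", "O"]))

# One binary column per category; each row has exactly one 1.
one_hot = pd.get_dummies(blood, prefix="blood", dtype=int)
print(one_hot)
```

Each row now differs from any row of another blood type in exactly two columns, so no pair of categories is closer than any other.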

For categorical features with hundreds or thousands of unique values — zip codes, product IDs, user accounts — one-hot encoding creates extremely wide, sparse vectors. Modern ML systems often learn compact dense embeddings instead, training a short vector for each category value alongside the main model. This approach is covered in the Categorical Encoding chapter of the Features & Representations topic.

Ordinal data

Pain level is ordinal. The values run from 1 to 10 and they have a natural order — a patient reporting 8 is in more pain than one reporting 4. But the gaps between values are not equal or meaningful. The jump from level 3 to level 4 may not represent the same increase in pain as the jump from level 8 to level 9. The numbers encode rank, not magnitude.

Property	Ordinal	Continuous
Has natural order	Yes	Yes
Gaps between values are equal	No	Yes
Subtraction is meaningful	No	Yes
Computing a mean is valid	Debated	Yes
Example	Pain 1–10, satisfaction rating	Temperature, age, income

⚠️ Warning

Many practitioners compute the mean of a pain score — “the average reported pain was 6.4”. This is mathematically possible but statistically questionable. It assumes that the gap between level 6 and level 7 is the same as the gap between level 3 and level 4, which is not guaranteed. Ordinal scales are not rulers. Whether averaging ordinal data is acceptable depends on the research context and is actively debated in statistics.
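The difference between rank-based and gap-based summaries can be illustrated with a small experiment. The scores and the alternative labelling below are hypothetical; the point is only that an order-preserving relabelling of the scale leaves the median attached to the same report while the mean shifts.

```python
from statistics import mean, median

# Hypothetical pain scores reported by seven patients.
pain = [2, 3, 3, 4, 8, 9, 10]

# The median relies only on rank order.
print(median(pain))

# The mean additionally assumes every step on the scale is the same size.
print(mean(pain))

# A different order-preserving labelling of the same ten levels (hypothetical):
# ranks are unchanged, but the spacing between levels is not.
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 20, 9: 30, 10: 40}
relabelled = [stretched[p] for p in pain]

# The median is still the same patient's report; the mean has moved.
print(median(relabelled), mean(relabelled))
```

If the true spacing of the scale is unknown, the median is the summary that does not depend on guessing it.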

Ordinal data has a natural order but unequal gaps. You can rank ordinal values (greater, less) but cannot subtract one from another in a meaningful way. Pain scales, satisfaction ratings, education level, and survey responses (Strongly Disagree to Strongly Agree) are all ordinal.

Ordinal data sits in an awkward position between categorical and continuous. Tree-based algorithms (decision trees, random forests, gradient boosting) handle ordinal data naturally — they split at thresholds without assuming equal gaps. Distance-based algorithms treat ordinal values as continuous unless told otherwise, which can introduce errors.
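Why threshold splits tolerate unequal gaps can be sketched in a few lines of plain Python. The scores and the relabelled scale are hypothetical, and split is a toy stand-in for a single decision-tree node, not any library's implementation.

```python
# Hypothetical pain scores and an order-preserving relabelling of the same scale.
pain = [1, 3, 4, 7, 8, 10]
relabel = {1: 1, 3: 5, 4: 6, 7: 50, 8: 60, 10: 100}
pain_alt = [relabel[p] for p in pain]


def split(values, threshold):
    """Return which samples go left (value <= threshold), as a tree node would."""
    return [v <= threshold for v in values]


# A threshold between levels 4 and 7 on the original scale ...
left_original = split(pain, 5.5)
# ... corresponds to a threshold between 6 and 50 on the relabelled scale.
left_alt = split(pain_alt, 28)

# Both thresholds send the same patients left and right.
print(left_original == left_alt)

# Distances, by contrast, depend on the spacing between labels.
gap_original = abs(pain[3] - pain[2])
gap_alt = abs(pain_alt[3] - pain_alt[2])
print(gap_original, gap_alt)
```

The partition is identical under both labellings, while the gap between the same two patients changes from 3 to 44, which is the kind of spacing a distance-based method would take at face value.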

Binary data

Smoker status is binary: yes or no, with no other options. Binary data is a special case of categorical data with exactly two values. It gets its own category because it can be encoded as a single 0/1 column without implying any false mathematical relationship. 1 is not “greater” than 0 in a meaningful sense — 1 simply means “smoker”.

Binary data is categorical data with exactly two possible values. One value is encoded as 0 and the other as 1. Unlike encoding three or more categories as integers (which creates false ordering), binary encoding with 0/1 only implies difference — not magnitude, distance, or ranking.
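A minimal sketch of binary encoding with pandas, assuming the raw column stores the hypothetical labels "yes" and "no". The explicit mapping documents which value becomes 1.

```python
import pandas as pd

smoker = pd.Series(["yes", "no", "no", "yes"], name="smoker")

# Explicit mapping, so the meaning of 1 is recorded in the code.
smoker_encoded = smoker.map({"no": 0, "yes": 1})

# Labels missing from the mapping become NaN under .map, so fail fast if any appear.
assert smoker_encoded.notna().all(), "unexpected smoker label"

smoker_encoded = smoker_encoded.astype(int)
print(smoker_encoded.tolist())
```

The check before the integer cast matters in practice: a stray label such as "unknown" would otherwise turn into a missing value without any error.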

Real World

Fraud detection datasets are dominated by binary labels: each transaction is either fraudulent or legitimate. The entire problem is predicting one binary value (0/1) from thousands of other measurements. This framing — binary outcome, many input features — is one of the most common structures in commercial ML applications, from credit card fraud to spam filtering to medical diagnosis.

Count data

The number of previous admissions is a count variable: a non-negative integer (0, 1, 2, 3…). Count data resembles continuous data but has important constraints — it cannot be negative, it has no decimal values, and its distribution is often skewed right. Poisson regression and related methods treat count data as its own type.
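The constraints on count data can be checked directly. The sketch below assumes ten hypothetical prior-admission counts and simply verifies the properties described above: whole numbers, no negatives, and a right-skewed shape.

```python
import pandas as pd

# Hypothetical prior-admission counts for ten patients.
admissions = pd.Series([0, 0, 1, 0, 2, 1, 0, 7, 0, 3], name="prior_admissions")

# Counts must be whole numbers and cannot be negative.
is_valid_count = bool(((admissions >= 0) & (admissions % 1 == 0)).all())

# Right skew shows up as a mean pulled above the median by a few large values.
mean_count = admissions.mean()
median_count = admissions.median()
skewness = admissions.skew()

print(is_valid_count, mean_count, median_count, skewness)
```

Here one patient with seven admissions pulls the mean well above the median, which is the typical signature of a right-skewed count.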

Why this matters for ML

Every algorithm that uses distance — KNN, SVMs, neural networks, k-means clustering — assumes that numbers close together mean similar things. If you encode blood type as O=4 and AB=3, the algorithm will treat those two types as more similar to each other than A=1 and O=4, because |3−4|=1 while |1−4|=3. This relationship is not biological reality — it is an artefact of the encoding choice.

Feature encoding is the process of converting raw feature values into numerical representations that carry the correct mathematical properties for the learning algorithm. The encoding must preserve the actual relationships in the data and avoid creating false ones.

Integer encoding assigns false distances. Encoding blood types as A=1, B=2, AB=3, O=4 tells the algorithm that O is three times further from A than B is — a relationship that does not exist in reality. One-hot encoding avoids this by treating each category as an independent binary column with no numerical relationship between them.
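The false distances can be made concrete by computing every pairwise distance under both encodings. This is a small illustration in plain Python using the A=1, B=2, AB=3, O=4 mapping from the text; the variable names are chosen here for the example.

```python
from itertools import combinations
from math import dist

types = ["A", "B", "AB", "O"]

# Integer encoding from the text: A=1, B=2, AB=3, O=4.
integer = {t: (float(i),) for i, t in enumerate(types, start=1)}

# One-hot encoding: one binary position per category.
one_hot = {t: tuple(1.0 if t == u else 0.0 for u in types) for t in types}

# Print both distances for every pair of blood types.
for a, b in combinations(types, 2):
    d_int = dist(integer[a], integer[b])
    d_hot = dist(one_hot[a], one_hot[b])
    print(f"{a:>2} vs {b:<2}  integer={d_int:.1f}  one-hot={d_hot:.3f}")
```

Under the integer encoding the distances range from 1 to 3 depending on the pair. Under one-hot encoding every pair is the same distance apart (the square root of 2), so the encoding expresses only "same" versus "different".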

💡 Insight

The most dangerous ML bugs are the ones that do not crash your code. A wrongly encoded categorical variable will train without errors, produce numerical predictions, and report a training loss that decreases — all while finding patterns that do not exist. The model learns the structure of the encoding, not the structure of the data.

The data type determines the encoding. The encoding determines what the algorithm sees. Getting this wrong is invisible — no error is raised, the model trains, predictions are produced, and the silent bug propagates into every downstream decision.

The shape of data types

Each data type has a visual signature when sampled at scale. Continuous measurements form smooth, bell-shaped or skewed curves — age clusters between 30 and 70, temperature hovers near 37°C, with no empty gaps in the distribution. Categorical fields form discrete, isolated bars — four bars for blood type with no values in between. Binary fields produce exactly two bars. Ordinal fields look like discrete bars with a meaningful left-to-right ordering.

In this section

What is the difference between ordinal and continuous data?

Ordinal data has a natural order but the gaps between values are not equal or meaningful. Pain level 4 is not necessarily twice as bad as pain level 2, and the gap between 7 and 8 is not necessarily the same as the gap between 1 and 2. Continuous data has equal, meaningful gaps: the difference between 37.0°C and 38.0°C is the same as between 38.0°C and 39.0°C.

Can you use a number to represent a categorical variable?

You can, but only with proper encoding. Assigning blood type A=1, B=2, AB=3, O=4 implies that O is four times A and that AB is between A and O — neither is true. One-hot encoding avoids this by creating a separate binary column for each category.

Is age continuous or ordinal?

It depends on how it is recorded. Raw age in years with decimals is continuous. Age recorded as 'child/teenager/adult/senior' is ordinal. Age recorded as exact years (18, 19, 20...) is technically discrete but usually treated as continuous because the gaps are equal and meaningful.
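Turning continuous age into an ordinal field can be sketched with pandas. The bin edges and group labels below are hypothetical and would depend on the application.

```python
import pandas as pd

ages = pd.Series([4, 15, 34, 47, 71], name="age")

# Hypothetical bin edges and labels.
bins = [0, 13, 18, 65, 120]
labels = ["child", "teenager", "adult", "senior"]

# right=False makes each bin include its lower edge: [0, 13), [13, 18), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_group.tolist())
print(age_group.cat.ordered)  # pd.cut returns ordered categories by default
```

The result keeps the ordering (child before teenager before adult before senior) but discards the equal spacing that raw age had, so it should be treated as ordinal from that point on.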

◎ Intuition

Patient ages form a smooth, hill-shaped distribution when plotted: many values in the middle, fewer at the extremes, and every value in between is valid. A 47-year-old and a 48-year-old are genuinely close — the number reflects a real proximity. Now imagine plotting blood types instead. Would you expect the same smooth hill? What shape would you see, and what does that tell you about how age and blood type are fundamentally different kinds of measurement?

↺ Reflection

Key Ideas

A hospital patient intake form contains every data type simultaneously. Age in years is continuous: any value within a biological range is possible, the gap between 47.0 and 48.0 is exactly the same as the gap between 23.0 and 24.0, and arithmetic operations on ages are fully meaningful. Body temperature is also continuous, clustered narrowly around 37°C with equal gaps at every decimal place.

Blood type is categorical: one of four labels (A, B, AB, O) with no natural ordering. The label O carries no implication of being “more” or “greater” than A. There is no midpoint between B and AB. The only valid question about two blood type values is whether they are the same category or different categories. Ward assignment — cardiology, neurology, oncology, general — is categorical for the same reasons.

Pain level on a 1–10 scale is ordinal. The values have a natural ordering: a patient reporting 8 is in more pain than one reporting 4. But the gap between 4 and 5 is not guaranteed to equal the gap between 8 and 9. The numbers represent ranks, not equally-spaced positions on a ruler. This places ordinal data in a grey zone. Ranking operations (greater-than, less-than) are valid. Arithmetic operations — subtraction, averaging — require the assumption of equal gaps, which ordinal scales do not formally provide. Whether that assumption is acceptable depends on the measurement context and is actively contested in statistics.

Smoker status is binary: yes or no, with no other values. Binary data is a special case of categorical data that receives separate treatment because encoding it as 0 and 1 does not introduce false mathematical structure. Unlike encoding three or more categories as consecutive integers (which implies ordering and equal spacing), a 0/1 encoding for two categories only implies difference.

The practical consequence of misidentifying a data type is invisible at training time. If blood type is encoded as A=1, B=2, AB=3, O=4 and fed into a KNN classifier, the algorithm computes distances between patients using those numbers. Blood type O (encoded 4) becomes numerically far from blood type A (encoded 1), and blood type B (encoded 2) becomes close to blood type A (encoded 1). These numerical relationships do not correspond to any biological or clinical reality — they are artefacts of the encoding. The training loss will decrease, predictions will be produced, and no error will be raised. The bug lives silently in the assumption.

The correct strategy depends on the data type. Continuous data can often be used directly after scaling. Categorical data requires encoding that eliminates false ordering — one-hot encoding creates a separate binary column for each category so that no category is numerically closer to any other. Ordinal data occupies a grey zone where tree-based methods handle it well without arithmetic assumptions, while distance-based methods may require careful treatment. Binary data is encoded as 0 and 1.

The encoding strategies for categorical data (one-hot encoding, ordinal encoding, target encoding, and embeddings) are covered in the Features & Representations topic (Chapter 2: Categorical Encoding), which follows the Understanding Data topic.

Key Points

Continuous data has equal, meaningful gaps between values — age and temperature can be averaged, compared by distance, and used directly in most algorithms without transformation.

Categorical data has no natural order — encoding blood type as 1/2/3/4 creates false mathematical relationships that corrupt any algorithm using distance or arithmetic.

Ordinal data has order but unequal gaps — pain level 8 is higher than pain level 4 but not necessarily twice as severe, placing it in a grey zone between categorical and continuous.

The distribution shape is the visual signature of data type — smooth curves for continuous, isolated bars for categorical, whole-number spikes for counts, exactly two bars for binary.

✓ Checkpoint

Check Your Understanding

Four questions on data types, encoding, and what happens when you get them wrong.

A dataset contains a 'satisfaction rating' field where customers rate their experience as Poor / Fair / Good / Very Good / Excellent. What data type is this?

A developer encodes the 'department' field in an HR dataset as: Engineering=1, Marketing=2, Sales=3, HR=4, Finance=5. What problem does this create for a KNN classifier?

True or false: binary data is a special case of categorical data and can always be safely encoded as 0 and 1 without implying any false mathematical relationship.

Order these data types from most to least restrictive in terms of what mathematical operations are valid on them (most restrictive first):

1. Continuous: any real value is valid; arithmetic operations are meaningful
2. Ordinal: only ordering comparisons are valid; gaps between ranks are not meaningful
3. Categorical: only equality comparisons are valid; no ordering exists