Understanding Data · Chapter 2 of 5
Types of Data
Not all measurements are equal — and algorithms care deeply about the difference
A hospital patient intake form contains every data type simultaneously — age is continuous, blood type is categorical, pain level is ordinal, and smoker status is binary — and each field must be handled differently by any ML system.
1. Name the four main data types and give one example of each
2. Explain why ordinal data cannot be treated as continuous
3. Identify the correct data type for each field in a given dataset
4. Identify which data type encoding is wrong in a described ML pipeline and explain what bug it would cause
The hospital intake form
A patient arrives at a hospital. Before any treatment begins, a nurse fills out an intake form. Every field is a measurement — but not every measurement is the same kind of thing.
The form collects: age (47 years), body temperature (37.8°C), blood type (AB), pain level (7 out of 10), ward assignment (cardiology), smoker status (yes), and number of previous admissions (2). Seven fields. Four main types of data (continuous, categorical, ordinal, and binary), plus counts, which behave differently enough to be treated as a type of their own.
An ML system that treats these seven fields identically will make serious errors. The algorithm does not know that 7 on a pain scale is not seven times worse than 1, or that blood type AB is not numerically between A and O. It only sees the numbers it is given — and if those numbers carry false implications, the algorithm will act on them.
```python
import pandas as pd

df = pd.read_csv('hospital_intake.csv')
print(df.dtypes)
# age                 float64   ← continuous
# temperature         float64   ← continuous
# blood_type          object    ← categorical (stored as text)
# pain_level          int64     ← ordinal (looks like an integer!)
# smoker              bool      ← binary
# prior_admissions    int64     ← count
```
A pandas dtype reports how data is stored, not what kind of data it is statistically. A pain score stored as int64 is not continuous just because pandas treats it as a number, and treating it as one leads to the encoding bugs covered at the end of this chapter.
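One way to keep the statistical type from getting lost is to declare it explicitly after loading. The sketch below is only an illustration: it builds a tiny in-memory frame with invented values (standing in for `hospital_intake.csv`) and uses pandas categorical dtypes to mark `blood_type` as unordered and `pain_level` as ordered.

```python
import pandas as pd

# Invented rows for illustration; they stand in for hospital_intake.csv.
df = pd.DataFrame({
    "age": [47.0, 63.5, 29.2],
    "blood_type": ["AB", "O", "A"],
    "pain_level": [7, 3, 9],
    "smoker": [True, False, False],
})

# Unordered categorical: only same-vs-different comparisons make sense.
df["blood_type"] = df["blood_type"].astype("category")

# Ordered categorical: ranking is meaningful, but the column is no longer numeric.
pain_scale = pd.CategoricalDtype(categories=list(range(1, 11)), ordered=True)
df["pain_level"] = df["pain_level"].astype(pain_scale)

print(df.dtypes)
print(df["pain_level"].max())   # highest reported rank
print(df["pain_level"] > 5)     # order comparisons are allowed
```

With the types declared this way, the stored dtype and the statistical type agree, so later code is less likely to do arithmetic on a rank by accident.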
Continuous data
Age and temperature are continuous measurements. A continuous variable can take any value within a range, including all the decimal fractions in between. A patient can be 47.3 years old. A temperature can be 37.83°C. The number line for these measurements is smooth and unbroken — there are no gaps between valid values.
Because a one-unit difference means the same thing anywhere on the scale, continuous data supports ordinary arithmetic. You can meaningfully average two temperatures, subtract one age from another, or compute the distance between two patients in feature space. Algorithms that rely on distance (KNN, SVMs with RBF kernels, neural networks) can consume continuous data without re-encoding, although features measured in different units usually need scaling first.
Categorical data
Blood type and ward assignment are categorical measurements. A patient’s blood type is one of four labels: A, B, AB, or O. These labels have no natural order. O is not “more” than A. AB is not between A and O. Neurology is not halfway between cardiology and oncology. The only relationship between categories is same versus different.
Encoding blood type as A=1, B=2, AB=3, O=4 and feeding those integers directly into a distance-based algorithm implies that O is four times A, that the distance between A and B equals the distance between B and AB, and that O is further from A than B is. None of those relationships exist in blood type biology — they are invented by the encoding. One-hot encoding (a separate binary column for each category) avoids this entirely.
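The effect of the two encodings on distance can be checked directly. The following is a minimal sketch, assuming one row per blood type and plain Euclidean distance; the helper `pairwise_distances` is written here for illustration and is not a library function.

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "B", "AB", "O"], name="blood_type")

# Integer labels: the problematic encoding described above.
int_codes = blood.map({"A": 1, "B": 2, "AB": 3, "O": 4}).to_numpy(dtype=float)

# One-hot: one 0/1 column per category.
one_hot = pd.get_dummies(blood, dtype=float)

def pairwise_distances(rows):
    """Euclidean distance between every pair of rows (illustrative helper)."""
    rows = np.asarray(rows, dtype=float)
    if rows.ndim == 1:
        rows = rows[:, None]
    diff = rows[:, None, :] - rows[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

int_dist = pairwise_distances(int_codes)
hot_dist = pairwise_distances(one_hot.to_numpy())

print(int_dist)   # distances vary with the arbitrary label numbers
print(hot_dist)   # every pair of different types is equally far apart
```

Under the integer labels, A and O end up three units apart while A and B are one unit apart. Under one-hot encoding, every pair of distinct types sits at the same distance, which matches the "same versus different" relationship.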
For categorical features with hundreds or thousands of unique values — zip codes, product IDs, user accounts — one-hot encoding creates extremely wide, sparse vectors. Modern ML systems often learn compact dense embeddings instead, training a short vector for each category value alongside the main model. This approach is covered in the Categorical Encoding chapter of the Features & Representations topic.
Ordinal data
Pain level is ordinal. The values run from 1 to 10 and they have a natural order — a patient reporting 8 is in more pain than one reporting 4. But the gaps between values are not equal or meaningful. The jump from level 3 to level 4 may not represent the same increase in pain as the jump from level 8 to level 9. The numbers encode rank, not magnitude.
| Property | Ordinal | Continuous |
|---|---|---|
| Has natural order | Yes | Yes |
| Gaps between values are equal | No | Yes |
| Subtraction is meaningful | No | Yes |
| Computing a mean is valid | Debated | Yes |
| Example | Pain 1–10, satisfaction rating | Temperature, age, income |
Many practitioners compute the mean of a pain score — “the average reported pain was 6.4”. This is mathematically possible but statistically questionable. It assumes that the gap between level 6 and level 7 is the same as the gap between level 3 and level 4, which is not guaranteed. Ordinal scales are not rulers. Whether averaging ordinal data is acceptable depends on the research context and is actively debated in statistics.
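A small experiment makes the concern concrete. The sketch below uses two invented groups of pain scores and a made-up relabelling (`stretched`) that keeps the order of the ten levels but widens the gaps near the top. Both the groups and the relabelling are assumptions chosen for illustration.

```python
from statistics import mean, median

# Hypothetical pain scores for two small groups of patients.
group_x = [5, 5, 5]
group_y = [1, 1, 10]

# Made-up relabelling: same order, but the top of the scale is stretched,
# as if the highest levels were much worse than the numbers suggest.
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 8, 8: 12, 9: 18, 10: 30}

def relabel(scores):
    return [stretched[s] for s in scores]

for name, stat in (("mean", mean), ("median", median)):
    before = stat(group_x) > stat(group_y)
    after = stat(relabel(group_x)) > stat(relabel(group_y))
    print(f"{name}: X above Y before relabelling={before}, after={after}")
```

The comparison based on the mean flips when the spacing changes, while the comparison based on the median does not, because the median depends only on order. This is one reason rank-based summaries are often suggested for ordinal scales.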
Ordinal data sits in an awkward position between categorical and continuous. Tree-based algorithms (decision trees, random forests, gradient boosting) handle ordinal data naturally — they split at thresholds without assuming equal gaps. Distance-based algorithms treat ordinal values as continuous unless told otherwise, which can introduce errors.
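The contrast between threshold splits and distances can be sketched with three hypothetical patients. The `stretched` values below are an invented alternative spacing with the same order, and `nearest_to` is a helper written for this example.

```python
import numpy as np

pain = np.array([4, 7, 9])         # hypothetical patients
stretched = np.array([4, 8, 18])   # same order, made-up unequal spacing

# A tree-style split only asks which side of a threshold each value falls on.
split_original = pain >= 7
split_stretched = stretched >= 8   # the same cut point on the stretched scale

def nearest_to(values, i):
    """Index of the value closest to values[i], ignoring i itself."""
    d = np.abs(values - values[i]).astype(float)
    d[i] = np.inf
    return int(np.argmin(d))

print(split_original, split_stretched)                  # identical partitions
print(nearest_to(pain, 1), nearest_to(stretched, 1))    # different neighbours
```

The split assigns the same patients to each side under both spacings, whereas the nearest neighbour of the middle patient changes. A distance-based method therefore depends on a spacing that the ordinal scale never promised.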
Binary data
Smoker status is binary: yes or no, with no other options. Binary data is a special case of categorical data with exactly two values. It gets its own category because it can be encoded as a single 0/1 column without implying any false mathematical relationship. 1 is not “greater” than 0 in a meaningful sense — 1 simply means “smoker”.
Binary data is categorical data with exactly two possible values. One value is encoded as 0 and the other as 1. Unlike encoding three or more categories as integers (which creates false ordering), binary encoding with 0/1 only implies difference, not magnitude, distance, or ranking.
Fraud detection datasets are dominated by binary labels: each transaction is either fraudulent or legitimate. The entire problem is predicting one binary value (0/1) from thousands of other measurements. This framing (binary outcome, many input features) is one of the most common structures in commercial ML applications, from credit card fraud to spam filtering to medical diagnosis.
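Returning to the encoding claim in this section, a quick check shows why 0/1 carries no hidden structure: swapping which value is called 1 leaves every pairwise gap unchanged. The patient values and the helper `pairwise_gaps` below are hypothetical.

```python
smoker = ["yes", "no", "yes", "no"]   # hypothetical patients

code_a = {"yes": 1, "no": 0}          # 1 means smoker
code_b = {"yes": 0, "no": 1}          # flipped: 1 means non-smoker

def pairwise_gaps(values, code):
    """Absolute difference between every pair of encoded values."""
    nums = [code[v] for v in values]
    return [[abs(a - b) for b in nums] for a in nums]

gaps_a = pairwise_gaps(smoker, code_a)
gaps_b = pairwise_gaps(smoker, code_b)

print(gaps_a)
print(gaps_a == gaps_b)   # the choice of which value is 1 does not matter
```

Each gap is 0 for two patients with the same status and 1 for two patients with different statuses, which is exactly the "same versus different" relationship and nothing more.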
Count data
The number of previous admissions is a count variable: a non-negative integer (0, 1, 2, 3…). Count data resembles continuous data but has important constraints — it cannot be negative, it has no decimal values, and its distribution is often skewed right. Poisson regression and related methods treat count data as its own type.
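The constraints on count data are easy to check in code. This is a rough sketch with invented admission counts; `looks_like_counts` is a hypothetical helper, and comparing the mean with the median is only a quick hint of right skew, not a formal test.

```python
from statistics import mean, median

prior_admissions = [0, 0, 0, 1, 1, 2, 2, 3, 7]   # hypothetical values

def looks_like_counts(values):
    """True if every value is a non-negative whole number."""
    return all(
        isinstance(v, int) and not isinstance(v, bool) and v >= 0
        for v in values
    )

print(looks_like_counts(prior_admissions))
# A mean above the median is a rough sign of a long right tail.
print(mean(prior_admissions), median(prior_admissions))
```

Here a single patient with seven admissions pulls the mean above the median, the kind of long right tail that often shows up in count data.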
Why this matters for ML
Every algorithm that uses distance — KNN, SVMs, neural networks, k-means clustering — assumes that numbers close together mean similar things. If you encode blood type as O=4 and AB=3, the algorithm will treat those two types as more similar to each other than A=1 and O=4, because |3−4|=1 while |1−4|=3. This relationship is not biological reality — it is an artefact of the encoding choice.
Feature encoding is the process of converting raw feature values into numerical representations that carry the correct mathematical properties for the learning algorithm. The encoding must preserve the actual relationships in the data and avoid creating false ones.
The most dangerous ML bugs are the ones that do not crash your code. A wrongly encoded categorical variable will train without errors, produce numerical predictions, and report a training loss that decreases, all while finding patterns that do not exist. The model learns the structure of the encoding, not the structure of the data.
The data type determines the encoding. The encoding determines what the algorithm sees. Getting this wrong is invisible — no error is raised, the model trains, predictions are produced, and the silent bug propagates into every downstream decision.
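The silent failure can be reproduced in a few lines. The sketch below is an assumption-laden toy: the target (1 only for blood type B) is invented, and an ordinary least-squares fit from NumPy stands in for whatever model a real pipeline would use.

```python
import numpy as np

blood = ["A", "B", "AB", "O"] * 2
# Hypothetical target: 1 only for type B, so the true pattern is "same vs different".
y = np.array([1.0 if b == "B" else 0.0 for b in blood])

def fit_sse(X, y):
    """Least-squares fit; returns coefficients and sum of squared errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, float(((X @ coef - y) ** 2).sum())

# Integer encoding plus an intercept column.
codes = np.array([{"A": 1, "B": 2, "AB": 3, "O": 4}[b] for b in blood], dtype=float)
X_int = np.column_stack([codes, np.ones_like(codes)])

# One-hot encoding: one 0/1 column per blood type.
labels = ["A", "B", "AB", "O"]
X_hot = np.array([[1.0 if b == lab else 0.0 for lab in labels] for b in blood])

coef_int, sse_int = fit_sse(X_int, y)
coef_hot, sse_hot = fit_sse(X_hot, y)

print(coef_int, sse_int)   # a slope across blood types that has no real meaning
print(coef_hot, sse_hot)   # the one-hot fit recovers the pattern
```

Both fits run without raising anything. The integer-encoded model reports a slope, as if the outcome drifted steadily from A through O, while the one-hot model fits the pattern it was actually given.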
The shape of data types
Each data type has a visual signature when sampled at scale. Continuous measurements form smooth, bell-shaped or skewed curves — age clusters between 30 and 70, temperature hovers near 37°C, with no empty gaps in the distribution. Categorical fields form discrete, isolated bars — four bars for blood type with no values in between. Binary fields produce exactly two bars. Ordinal fields look like discrete bars with a meaningful left-to-right ordering.
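These signatures can be previewed without a plotting library. The sketch below uses invented samples and a hypothetical `text_bars` helper that prints one bar per distinct value.

```python
from collections import Counter

blood_types = ["A", "O", "O", "B", "A", "AB", "O", "A"]      # hypothetical
ages = [47.3, 52.1, 38.9, 61.4, 47.8, 29.5, 55.0, 44.2]      # hypothetical

def text_bars(values):
    """Print one bar per distinct value, most common first."""
    for value, n in Counter(values).most_common():
        print(f"{value!s:>6} | {'#' * n}")

text_bars(blood_types)                           # a few isolated bars
print(len(set(blood_types)), len(set(ages)))     # distinct values: few vs many
```

The categorical field collapses into a handful of bars, while almost every continuous value is distinct, which is why continuous fields are usually binned into a histogram before their shape becomes visible.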
What is the difference between ordinal and continuous data?
Ordinal data has a natural order but the gaps between values are not equal or meaningful. Pain level 8 is not necessarily twice as bad as pain level 4, and the gap between 7 and 8 is not necessarily the same as the gap between 1 and 2. Continuous data has equal, meaningful gaps: the difference between 37.0°C and 38.0°C is the same as between 38.0°C and 39.0°C.
Can you use a number to represent a categorical variable?
You can, but only with proper encoding. Assigning blood type A=1, B=2, AB=3, O=4 implies that O is four times A and that AB is between A and O — neither is true. One-hot encoding avoids this by creating a separate binary column for each category.
Is age continuous or ordinal?
It depends on how it is recorded. Raw age in years with decimals is continuous. Age recorded as 'child/teenager/adult/senior' is ordinal. Age recorded as exact years (18, 19, 20...) is technically discrete but usually treated as continuous because the gaps are equal and meaningful.
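The same underlying quantity can be moved from one type to another by how it is recorded. As a sketch, `pandas.cut` can turn exact ages into ordered bands. The bin edges below are an assumption for illustration, not clinical definitions.

```python
import pandas as pd

ages = pd.Series([5, 15, 47, 80])   # hypothetical exact ages in years

# Bin edges are assumed for illustration; real cut-offs depend on the study.
bands = pd.cut(
    ages,
    bins=[0, 13, 20, 65, 120],
    labels=["child", "teenager", "adult", "senior"],
    right=False,                    # intervals are [lower, upper)
)

print(bands.tolist())
print(bands.cat.ordered)            # the bands keep their order
```

After binning, the order survives (child comes before teenager) but the equal spacing does not, so the banded column should be treated as ordinal rather than continuous.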
Patient ages form a smooth, hill-shaped distribution when plotted: many values in the middle, fewer at the extremes, and every value in between is valid. A 47-year-old and a 48-year-old are genuinely close — the number reflects a real proximity. Now imagine plotting blood types instead. Would you expect the same smooth hill? What shape would you see, and what does that tell you about how age and blood type are fundamentally different kinds of measurement?
Key Ideas
A hospital patient intake form contains every data type simultaneously. Age in years is continuous: any value within a biological range is possible, the gap between 47.0 and 48.0 is exactly the same as the gap between 23.0 and 24.0, and arithmetic operations on ages are fully meaningful. Body temperature is also continuous: it clusters narrowly around 37°C, and a difference of 0.1°C means the same thing anywhere on the scale.
Blood type is categorical: one of four labels (A, B, AB, O) with no natural ordering. The label O carries no implication of being “more” or “greater” than A. There is no midpoint between B and AB. The only valid question about two blood type values is whether they are the same category or different categories. Ward assignment — cardiology, neurology, oncology, general — is categorical for the same reasons.
Pain level on a 1–10 scale is ordinal. The values have a natural ordering: a patient reporting 8 is in more pain than one reporting 4. But the gap between 4 and 5 is not guaranteed to equal the gap between 8 and 9. The numbers represent ranks, not equally-spaced positions on a ruler. This places ordinal data in a grey zone. Ranking operations (greater-than, less-than) are valid. Arithmetic operations — subtraction, averaging — require the assumption of equal gaps, which ordinal scales do not formally provide. Whether that assumption is acceptable depends on the measurement context and is actively contested in statistics.
Smoker status is binary: yes or no, with no other values. Binary data is a special case of categorical data that receives separate treatment because encoding it as 0 and 1 does not introduce false mathematical structure. Unlike encoding three or more categories as consecutive integers (which implies ordering and equal spacing), a 0/1 encoding for two categories only implies difference.
The practical consequence of misidentifying a data type is invisible at training time. If blood type is encoded as A=1, B=2, AB=3, O=4 and fed into a KNN classifier, the algorithm computes distances between patients using those numbers. Blood type O (encoded 4) becomes numerically far from blood type A (encoded 1), and blood type B (encoded 2) becomes close to blood type A (encoded 1). These numerical relationships do not correspond to any biological or clinical reality — they are artefacts of the encoding. The training loss will decrease, predictions will be produced, and no error will be raised. The bug lives silently in the assumption.
The correct strategy depends on the data type. Continuous data can often be used directly after scaling. Categorical data requires encoding that eliminates false ordering — one-hot encoding creates a separate binary column for each category so that no category is numerically closer to any other. Ordinal data occupies a grey zone where tree-based methods handle it well without arithmetic assumptions, while distance-based methods may require careful treatment. Binary data is encoded as 0 and 1.
The encoding strategies for categorical data (one-hot encoding, ordinal encoding, target encoding, and embeddings) are covered in the Features & Representations topic (Chapter 2: Categorical Encoding), which follows the Understanding Data topic.
Continuous data has equal, meaningful gaps between values: age and temperature can be averaged, compared by distance, and used in most algorithms with no transformation beyond scaling.
Categorical data has no natural order — encoding blood type as 1/2/3/4 creates false mathematical relationships that corrupt any algorithm using distance or arithmetic.
Ordinal data has order but unequal gaps — pain level 8 is higher than pain level 4 but not necessarily twice as severe, placing it in a grey zone between categorical and continuous.
The distribution shape is the visual signature of data type — smooth curves for continuous, isolated bars for categorical, whole-number spikes for counts, exactly two bars for binary.
Check Your Understanding
Four questions on data types, encoding, and what happens when you get them wrong.
A dataset contains a 'satisfaction rating' field where customers rate their experience as Poor / Fair / Good / Very Good / Excellent. What data type is this?
A developer encodes the 'department' field in an HR dataset as: Engineering=1, Marketing=2, Sales=3, HR=4, Finance=5. What problem does this create for a KNN classifier?
True or false: binary data is a special case of categorical data and can always be safely encoded as 0 and 1 without implying any false mathematical relationship.
Order these data types from most to least restrictive in terms of what mathematical operations are valid on them (most restrictive first):
- Continuous: any real value is valid; arithmetic operations are meaningful
- Ordinal: only ordering comparisons are valid; gaps between ranks are not meaningful
- Categorical: only equality comparisons are valid; no ordering exists