How many examples does a dataset need to be useful?

There is no universal minimum. Simple problems with clear patterns can be learned from dozens of examples; complex tasks like image recognition typically require thousands to millions. The more features a dataset has, the more examples are generally needed to estimate their relationships reliably. A dataset is sufficient when adding more examples stops improving the model's performance on data it has not yet seen.
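The "stops improving" test can be sketched as a small learning curve. Everything below is an assumption made for illustration: the data is synthetic, the training sizes are arbitrary, and a 1-nearest-neighbour rule stands in for whatever model you would actually train.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_point():
    """Draw one labelled instance from two overlapping 2-D blobs (synthetic)."""
    label = rng.choice([0, 1])
    centre = 0.0 if label == 0 else 2.0
    return (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label

def accuracy_1nn(train, test):
    """Share of held-out instances whose nearest training neighbour has the same label."""
    hits = 0
    for x, y in test:
        _, predicted = min(train, key=lambda item: math.dist(item[0], x))
        hits += predicted == y
    return hits / len(test)

pool = [make_point() for _ in range(400)]      # examples available for training
held_out = [make_point() for _ in range(200)]  # data the model has not seen

# Train on progressively larger slices and score each on the same held-out set.
curve = [(n, accuracy_1nn(pool[:n], held_out)) for n in (5, 20, 80, 400)]
for n, acc in curve:
    print(n, round(acc, 2))
```

If the held-out score changes little between the last two sizes, more rows of the same kind are unlikely to help much; if it is still climbing, the dataset is probably too small for the task.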

What is the difference between supervised and unsupervised learning in terms of datasets?

A supervised dataset has a label column — each example comes with the output the model should produce. An unsupervised dataset has no label column — the algorithm explores structure in the inputs without guidance about what the correct output should be. Clustering algorithms, for example, group similar examples without being told in advance which group each belongs to.

Understanding Data: Chapter 1 of 5

What Is a Dataset?

Every ML model starts here — with a table of observations

When a bank decides whether to approve a loan, it has a spreadsheet of past applicants — each row one person, each column one measurement, one column the outcome. That spreadsheet is a dataset.

Learning Objectives

1 Define the terms instance, feature, and label as components of a tabular dataset
2 Explain why the presence or absence of a label column distinguishes supervised from unsupervised learning
3 Describe what a feature space is geometrically — each instance maps to a point, each feature is an axis
4 Identify the features, instances, and label in an unfamiliar tabular dataset

¶ Narrative

Instances, Features, Labels, and Feature Space

The dataset as a table

Machine learning is a discipline of learning from examples. Before any algorithm can learn, those examples must be organised. A dataset provides that organisation.

A dataset is a collection of observations arranged as a table. Each row is one observation — one example of the phenomenon you are studying. Each column is one property that was measured for every observation.

The diagram below shows a small dataset. Each row records one loan applicant: a person who applied to a bank and received either an approval or a rejection.

A loan dataset with five applicants. Each row is one instance — one applicant. The first three columns are features: measurements the model will use as input. The final column is the label: what the model is trying to predict.

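python

import pandas as pd

# Load the loan table. This assumes loan_data.csv exists and
# contains the four columns used below.
df = pd.read_csv('loan_data.csv')

# Feature columns: the inputs the model will use.
features = df[['income', 'credit_score', 'employment_years']]
# Label column: the outcome the model is trained to predict.
labels = df['approved']

# features.shape is (n_instances, 3); a file with 5000 applicants gives (5000, 3)
print(features.shape)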

Each row in a dataset is called an instance (also: example, sample, or observation). It represents one thing you have measured: one person, one transaction, one photograph, one day of weather readings.

Each column (other than the label) is a feature: one measurable property of the instances. Features are sometimes called attributes or input variables. They are the evidence the model will use to make predictions.

The label is the column recording the outcome you want the model to predict. It is also called the target variable or dependent variable. A dataset without a label column is used for unsupervised learning; a dataset with one is used for supervised learning.

Not all ML data starts as a table. Images are grids of pixels, text is sequences of words, audio is waveforms. Before a model can learn from them, however, these inputs are converted into numerical vectors, called feature vectors. The geometric view of data as points in a feature space therefore applies to all data types, not just spreadsheets.

Real World

When a bank decides whether to approve a loan, it has a record of past applicants — each row one person, each column one measurement, one column the decision. The algorithm studies that record and learns a mapping from the measurements to the outcome. When a new applicant arrives, the model applies that learned mapping to produce a prediction.

Common Mistake

Confusing features with the label is common. A useful test: ask “is this column something the model knows at prediction time, or something it is trying to find out?” If you know it when you make the prediction, it is a feature. If you are trying to discover it, it is the label.

Supervised vs unsupervised data

The presence or absence of a label column defines two fundamentally different kinds of learning.

In supervised learning, every instance comes with a label. The algorithm learns a function that maps features to the label by studying the association between them across many labelled examples. When it encounters a new, unlabelled instance, it applies that function to produce a prediction.

In unsupervised learning, there is no label column. The algorithm cannot be told which instances are “correct” — it must find structure on its own. Clustering algorithms group similar instances together; dimensionality reduction algorithms find compact representations. The algorithm discovers, but does not predict.
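The contrast can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended implementation: the six instances are invented, a 1-nearest-neighbour rule stands in for a supervised learner, and a bare-bones two-centroid clustering loop stands in for an unsupervised one.

```python
import math

# Hypothetical 2-feature instances (values invented for illustration).
X = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
y = ["rejected", "rejected", "rejected", "approved", "approved", "approved"]

def predict_1nn(X_train, y_train, query):
    """Supervised: return the label of the closest labelled instance."""
    nearest = min(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))
    return y_train[nearest]

def two_means(points, iterations=10):
    """Unsupervised: split points into two groups without using any labels."""
    centroids = [points[0], points[-1]]  # simple, deterministic starting guess
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearest = min((0, 1), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[k]
            for k, g in enumerate(groups)
        ]
    return groups

prediction = predict_1nn(X, y, (3.9, 4.1))  # needs the labels in y
clusters = two_means(X)                     # never looks at y
print(prediction)
print(clusters)
```

The supervised function returns a label for a new instance; the unsupervised one returns only groups, and it is up to you to decide what those groups mean.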

Common Mistake

Treating row count as the main measure of dataset quality is a mistake. A million instances with poorly chosen features may be far less useful than ten thousand instances with informative ones. In many practical situations, the information content of the features matters more than the number of rows.

Feature space: the geometry of data

Once a dataset has numerical features, each instance can be represented as a point in a multidimensional space — one dimension per feature. This geometric view is called the feature space.

The feature space is the coordinate space defined by a dataset’s features. Each axis corresponds to one feature; each instance maps to exactly one point. Instances that are numerically similar in their features will be geometrically close in feature space.

Each row of the table becomes one point in feature space. The Income and Credit Score values become x and y coordinates. The label (Approved / Rejected) determines the point's colour. The scatter plot and the table are two views of exactly the same data.

In two dimensions, you can draw the feature space on paper, and a 3-feature dataset can still be pictured as a cloud of points in 3D space, since each feature adds one axis. Real datasets often have hundreds or thousands of features, so the space is high-dimensional and cannot be drawn. The definitions nevertheless extend unchanged: distance, clustering, and boundary-drawing all work in any number of dimensions. A 768-dimensional language model embedding lives in a 768-dimensional feature space: the same geometry, just more axes than you can see.
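To make the "same geometry, more axes" point concrete, here is a minimal sketch of Euclidean distance that does not care how many features there are. The points are invented, and the features are assumed to be already scaled to comparable ranges; the 768-value vectors are random numbers, not real embeddings.

```python
import math
import random

def euclidean(p, q):
    """Straight-line distance between two points with the same number of features."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two features: points you could plot on paper (values are made up).
a, b, c = (0.52, 0.71), (0.50, 0.70), (0.20, 0.55)
print(euclidean(a, b) < euclidean(a, c))  # True: a is closer to b than to c

# 768 features: the same function applies, there is just nothing to draw.
rng = random.Random(0)
u = [rng.random() for _ in range(768)]
v = [rng.random() for _ in range(768)]
d_high = euclidean(u, v)
print(d_high)
```

Euclidean distance is only one choice of distance; it is used here because it matches the ruler-on-paper intuition of the scatter plot.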

This geometric view explains why feature choice matters so much. If the features you include do not capture the differences between classes, no algorithm can draw a useful boundary between them — the relevant information simply is not in the data.

In this section

What is the difference between a dataset and a database?

A database is an organised system for storing and querying data. A dataset is a specific collection of examples used for a learning task, often a table extracted from such a system. Datasets are typically loaded into memory for training, whereas databases are queried on demand. In practice, you often extract a dataset from a database before training a model on it.

What makes a label different from a feature?

A label is what the model is trying to predict. A feature is what the model uses as evidence to make that prediction. In a loan approval dataset, the loan outcome (approved or rejected) is the label; the applicant's income, credit score, and years of employment are features. Whether a column is a feature or a label depends entirely on what the model is trying to learn, not on any property of the column itself.

Can the same column be a feature in one dataset and a label in another?

Yes. Income might be the label if you are building a model to predict income from years of education and occupation. It might be a feature if you are using it as an input to predict loan approval. The distinction is functional, not intrinsic to the column.
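A minimal sketch of that role switch, using a made-up table: the column names, the values, and the helper `split` are all hypothetical and exist only for this illustration.

```python
# Hypothetical table; column names and values are invented for illustration.
rows = [
    {"education_years": 12, "income": 31000, "approved": 0},
    {"education_years": 16, "income": 52000, "approved": 1},
    {"education_years": 18, "income": 78000, "approved": 1},
]

def split(rows, feature_names, label_name):
    """Return (X, y): feature vectors and labels for the chosen task."""
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[label_name] for row in rows]
    return X, y

# Task 1: predict loan approval; income is a feature.
X_loan, y_loan = split(rows, ["education_years", "income"], "approved")

# Task 2: predict income; the same column is now the label.
X_inc, y_inc = split(rows, ["education_years"], "income")

print(X_loan, y_loan)
print(X_inc, y_inc)
```

Nothing about the income column changed between the two calls; only the question being asked of the data did.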

◎ Intuition

Before you explore the playground — if you had to describe a person to a machine using only numbers, which measurements would you choose, and what would you leave out?

↺ Reflection

Key Ideas

A dataset is a table: rows are instances, columns are features, one column (when present) is the label. That structure — compact but precise — is the prerequisite for every learning algorithm.

The geometric interpretation of a dataset is the feature space. Each instance becomes a point; each feature becomes an axis. Two instances that are numerically similar in their feature values sit near each other in feature space. Two instances that differ substantially in feature values sit far apart. Distance in feature space carries meaning — it is the foundation on which every distance-based and boundary-based learning algorithm is built.

Measurement quality affects whether classes can be separated. When features are informative and measured precisely, instances from different classes tend to cluster in distinct regions of feature space, and a boundary placed between those regions will classify new instances reliably. When measurements are imprecise, the same feature values appear in both classes and the clusters overlap. In the overlapping region, no algorithm can separate the classes perfectly, because identical inputs carry different labels. This is a data quality problem, not an algorithm problem; the information the algorithm needs is simply absent from the features.

Adding more outcome classes increases the number of boundaries needed to partition the feature space. With two classes, one boundary suffices. With three classes, a boundary may be needed between every pair, and the number of pairs grows as k(k - 1)/2 for k classes. This structural difference, the number of decision surfaces required, is one reason multiclass problems are often handled differently from binary ones.
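Counting class pairs shows how quickly the number of potential boundaries grows. This sketch assumes one boundary per pair of classes, as in a one-vs-one scheme; other schemes, such as one-vs-rest, use a different count. The class name "referred" is a hypothetical third outcome.

```python
from itertools import combinations

def pairwise_boundaries(classes):
    """List every unordered pair of classes, one potential boundary per pair."""
    return list(combinations(classes, 2))

print(len(pairwise_boundaries(["approved", "rejected"])))              # 1
print(len(pairwise_boundaries(["approved", "rejected", "referred"])))  # 3
print(len(pairwise_boundaries(range(10))))                             # 45
```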

The deeper principle: including uninformative features does not add information — it adds noise. A feature whose distribution is identical across all classes provides no signal and may actively harm learning by increasing the dimensionality the algorithm must navigate. What you leave out of a dataset is as important as what you include.
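One rough way to spot a feature that carries no signal is to compare its values across classes. The sketch below uses invented data in which a hypothetical `shoe_size` column has exactly the same values in both classes. Comparing class means is a crude check chosen for brevity: equal means do not prove that two distributions are identical.

```python
from statistics import mean

# Hypothetical instances: (credit_score, shoe_size, label). Values are invented.
rows = [
    (720, 42, "approved"), (700, 39, "approved"), (740, 44, "approved"),
    (560, 42, "rejected"), (590, 39, "rejected"), (540, 44, "rejected"),
]

def class_means(rows, column):
    """Mean of one feature column, computed separately for each class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[-1], []).append(row[column])
    return {label: mean(values) for label, values in by_class.items()}

credit_means = class_means(rows, 0)  # means differ widely between classes
shoe_means = class_means(rows, 1)    # means are identical: no signal here
print(credit_means)
print(shoe_means)
```

A feature like the second one adds an axis to the feature space without moving the classes any further apart.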

Key Points

A dataset is a table: each row is one instance, each column is one feature. The label is the column recording the output the model is trained to predict. Its presence distinguishes supervised from unsupervised learning.

The feature space is the geometric interpretation of a dataset. Each instance becomes a point; each feature becomes an axis. Two instances that share similar feature values will be geometrically close in that space.

When measurements are precise, class clusters are compact and well separated — a model can draw a clean boundary with high accuracy. When measurements are imprecise, clusters overlap and no algorithm can achieve perfect separation, because the same feature values appear in both classes.

Feature choice determines what is learnable. A model trained on features that do not distinguish the target classes cannot improve beyond chance, regardless of its architecture or training time.

✓ Checkpoint

Check Your Understanding

Four questions on datasets, feature spaces, and supervised learning. There is no score.

A dataset has columns: Age, Height, Weight, and Salary. A model is trained to predict Salary from the other three columns. Which column is the label?

True or false: a dataset without a label column can be used to train a supervised learning model.

Order these steps from first to last in preparing a supervised dataset for training:

1. Identify which column is the label
2. Collect raw observations
3. Verify that all instances have a label value
4. Organise observations into a table with one row per instance

A dataset has two features: petal length and petal width. Each flower is plotted as a dot where petal length is the x-coordinate and petal width is the y-coordinate. What is this 2D plot an example of?