Understanding Data
Every ML system starts with data — but data as a concept is surprisingly slippery. A dataset is a table of observations: rows are examples, columns are measurements, one column (when present) records the outcome the model should learn. This topic builds a precise vocabulary for datasets, examines how data distributions behave, explores data quality and outliers, and develops the statistical intuition you need before any learning algorithm makes sense.
- Prerequisite: Vectors (required). Feature vectors are vectors, so understanding vector geometry makes feature space intuitive.
What Is a Dataset?
A dataset is a structured collection of observations. Each row is one example — an instance. Each column is one measurable property — a feature. When one column records what we are trying to predict, it becomes the label. The entire discipline of supervised machine learning is the study of how to learn a mapping from features to labels.
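As a minimal sketch of this vocabulary, the snippet below splits a toy table into features and labels. The housing rows, the column names (`area_m2`, `rooms`, `price`), and the variable names `X` and `y` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy dataset: each dict is one instance (one row).
rows = [
    {"area_m2": 50.0, "rooms": 2, "price": 150_000},
    {"area_m2": 80.0, "rooms": 3, "price": 240_000},
    {"area_m2": 120.0, "rooms": 4, "price": 330_000},
]

feature_names = ["area_m2", "rooms"]  # measurable properties (columns)
label_name = "price"                  # the column we are trying to predict

# Split every row into a feature vector and a label.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[label_name] for row in rows]

print(X)  # one feature vector per instance
print(y)  # one label per instance
```

A supervised learner would then try to find a mapping from each entry of `X` to the matching entry of `y`.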
Types of Data
Data comes in fundamentally different types: continuous measurements that can take any value within a range, categorical labels with no natural order, ordinal rankings whose order is meaningful but whose gaps are not necessarily equal, and binary yes/no flags. Each type has different statistical properties, different visualisation approaches, and different encoding requirements before an algorithm can use it. Choosing the wrong encoding for a data type is one of the most common sources of silent bugs in ML pipelines.
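The sketch below shows one plausible encoding per type. The helper names `one_hot` and `ordinal_rank` and the example categories are assumptions made for illustration, not part of any particular library.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector with no implied order."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def ordinal_rank(value, ordered_levels):
    """Encode an ordinal value as its position in a known ordering."""
    return ordered_levels.index(value)


COLOURS = ["red", "green", "blue"]    # categorical: no natural order
SIZES = ["small", "medium", "large"]  # ordinal: ordered, gaps not necessarily equal

colour_vec = one_hot("green", COLOURS)    # three 0/1 columns, one per colour
size_rank = ordinal_rank("large", SIZES)  # a single integer that preserves order
is_member = int(True)                     # binary flag stored as 0 or 1

print(colour_vec, size_rank, is_member)
```

Encoding colours as 0, 1, 2 instead would run without error, but it would tell the algorithm that blue is "greater than" red. That is the kind of silent bug the paragraph above describes.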
Probability Distributions
A dataset is not just numbers; it is numbers drawn from some underlying process. The probability distribution that describes this process determines which mathematical operations are valid, which models apply, and what the data can and cannot tell you. This chapter introduces four distribution shapes that appear often in machine learning: Normal, Exponential, Uniform, and Bimodal. The first three are named families; bimodal describes any distribution with two peaks, which often arises when two distinct groups are mixed.
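To make the four shapes concrete, here is a small sketch that draws samples with Python's standard `random` module. The parameters (mean 0 and standard deviation 1, rate 1, range 0 to 1, peaks at -3 and +3) and the helper `share_near` are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 10_000

normal = [rng.gauss(0.0, 1.0) for _ in range(N)]
exponential = [rng.expovariate(1.0) for _ in range(N)]
uniform = [rng.uniform(0.0, 1.0) for _ in range(N)]
# Bimodal: mix two normal groups centred at -3 and +3.
bimodal = [rng.gauss(-3.0 if rng.random() < 0.5 else 3.0, 1.0) for _ in range(N)]


def share_near(values, centre, width=1.0):
    """Fraction of values that lie within `width` of `centre`."""
    return sum(abs(v - centre) < width for v in values) / len(values)


samples = {
    "normal": normal,
    "exponential": exponential,
    "uniform": uniform,
    "bimodal": bimodal,
}
for name, sample in samples.items():
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    print(f"{name:12s} mean={mean:6.2f} median={median:6.2f} "
          f"share near mean={share_near(sample, mean):.2f}")
```

In the bimodal sample the mean sits near zero, yet very few values lie near it. The same summary number can therefore mean very different things depending on the shape of the distribution.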
Descriptive Statistics
Descriptive statistics compress an entire dataset into a handful of numbers: where is the centre, how spread out are the values, how symmetric is the shape. Mean, median, variance, and standard deviation are the core tools. But each statistic makes assumptions about the data, and choosing the wrong one produces summaries that are technically correct but deeply misleading. Understanding what each statistic measures — and what it ignores — is fundamental to any honest data analysis.
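A short sketch of how a single extreme value pulls the mean away from the median, using Python's standard `statistics` module. The income figures are invented for illustration.

```python
import statistics

# Hypothetical yearly incomes in thousands; the last value is an extreme one.
incomes = [30, 32, 35, 38, 40, 1000]

mean = statistics.mean(incomes)          # sensitive to extreme values
median = statistics.median(incomes)      # middle value, robust to extremes
variance = statistics.variance(incomes)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(incomes)      # square root of the sample variance

print(f"mean={mean:.1f} median={median:.1f} std_dev={std_dev:.1f}")

# The same summaries without the extreme value.
typical = incomes[:-1]
print(f"mean={statistics.mean(typical):.1f} median={statistics.median(typical):.1f}")
```

Both summaries of the full list are technically correct, but only the median describes what a typical person in this toy dataset earns.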
Data Quality & Outliers
Real-world datasets are rarely clean. They contain missing values, measurement errors, duplicate records, and outliers: values so far from the rest that they either reveal something genuinely important or corrupt every analysis that includes them. Data quality issues are among the most common reasons ML models fail in production. They are also hard to catch, because most algorithms process dirty data without raising any errors; they simply learn the wrong patterns.
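One simple way to surface such problems is to count missing entries and flag values outside the interquartile-range fences. The sketch below assumes the common convention of 1.5 times the IQR, which is one choice among several and is not prescribed by this text. The readings are invented, with `None` standing in for a missing value.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [10.0, 12.0, None, 11.0, 13.0, 12.0, 11.0, 95.0]

present = [v for v in readings if v is not None]
n_missing = len(readings) - len(present)

# Quartiles of the observed values (statistics.quantiles needs Python 3.8+).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1

# Assumed convention: flag values more than 1.5 * IQR outside the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in present if v < low_fence or v > high_fence]

print(f"missing={n_missing} fences=({low_fence}, {high_fence}) outliers={outliers}")
```

Flagging a value is not the same as deleting it. Whether 95.0 is a sensor fault or a genuinely important event is a question about the data source, not something the fence can decide.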