Explore classification and regression — the two pillars of supervised learning — with interactive
visualizations of KNN, SVM, linear and logistic regression.
📘 4.1.1 — Introduction to Supervised Learning
Supervised Learning is the most common form of ML. The algorithm learns from
a labeled dataset — each training example has an input (features) paired
with a known output (label/target).
The goal is to learn a mapping function f: X → Y that can predict the output
Y for new, unseen inputs X.
Common algorithms: Linear Regression, Polynomial Regression, Ridge, and Lasso (regression); KNN, SVM, and Logistic Regression (classification)
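As a minimal sketch of what "labeled data" means in code (the numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical labeled dataset: each row of X holds one example's features,
# and the matching entry of y is its known output (label/target).
X = np.array([[1.0, 2.0],    # features of example 1
              [2.0, 1.0],    # features of example 2
              [3.0, 3.5]])   # features of example 3
y = np.array([0, 0, 1])      # known labels for each example

# Supervised learning looks for a function f such that f(X[i]) ≈ y[i]
# and, ideally, f also predicts well on new, unseen inputs.
```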
🏷️ 4.1.2 — Classification Model
A classification model assigns input data points to one of several predefined
categories. The model learns decision boundaries that separate
different classes in the feature space.
Types of Classification
Binary: two possible classes. Examples: Spam / Not Spam, Pass / Fail.
Multi-class: more than two classes (one-vs-all). Examples: digit recognition (0-9), animal classification.
Multi-label: multiple labels per instance. Example: movie genres (action + comedy + romance).
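To make the distinction concrete, here is a hypothetical sketch of how the targets are commonly encoded for each type (all values invented for illustration):

```python
import numpy as np

# Binary: one label per example, two possible values
y_binary = np.array([0, 1, 1, 0])          # e.g. Not Spam / Spam

# Multi-class: one label per example, more than two possible values
y_multiclass = np.array([3, 7, 0, 9])      # e.g. digit recognition (0-9)

# Multi-label: several labels may apply to one example, often stored as a
# binary indicator matrix (rows = examples; columns = action, comedy, romance)
y_multilabel = np.array([[1, 1, 0],        # action + comedy
                         [0, 0, 1],        # romance only
                         [1, 0, 1]])       # action + romance
```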
📝 4.1.3 — Learning Steps
Every supervised learning task follows these core steps:
Collect Data: Gather a labeled dataset with features (X) and target (y)
Split Data: Divide into training set (~80%) and test set (~20%)
Choose Algorithm: Select an appropriate model based on problem type and
data characteristics
Train Model: Fit the algorithm to the training data — it learns
patterns and relationships
Evaluate: Test the model on unseen test data using appropriate metrics
Tune: Adjust hyperparameters to optimize performance
Deploy: Use the final model for predictions on new data
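These steps map directly onto a typical scikit-learn workflow. A minimal sketch, assuming scikit-learn is installed and using its bundled Iris dataset purely for illustration (the model choice and parameter grid are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Collect data: features X and labels y
X, y = load_iris(return_X_y=True)

# 2. Split data: ~80% train, ~20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 3. Choose an algorithm (KNN here; introduced in the next section)
model = KNeighborsClassifier()

# 4 & 6. Tune the hyperparameter K with cross-validation, then train on the training set
search = GridSearchCV(model, param_grid={"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluate on unseen test data
y_pred = search.predict(X_test)
print("Best K:", search.best_params_["n_neighbors"])
print("Test accuracy:", accuracy_score(y_test, y_pred))

# 7. Deploy: reuse search.best_estimator_ for predictions on new data
```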
📍 4.2.1 — K-Nearest Neighbor (KNN)
KNN is a simple, intuitive classification algorithm. It classifies a new
data point based on the majority vote of its K nearest neighbors in the
feature space.
How KNN Works
Choose K — the number of neighbors to consider
Calculate distance between the new point and all training points
Select the K nearest training points
Majority vote — assign the most common class among the K neighbors
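A from-scratch sketch of these four steps in NumPy, using Euclidean distance and a tiny invented dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny made-up dataset: two features, two classes
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, x_new=np.array([2, 2]), k=3))  # near class 0
print(knn_predict(X_train, y_train, x_new=np.array([6, 7]), k=3))  # near class 1
```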
Distance Metrics
KNN relies on measuring how "close" data points are. Common distance formulas:
Euclidean: d = √(Σ(x_i − y_i)²) — straight-line distance; the most commonly used.
Manhattan: d = Σ|x_i − y_i| — sum of absolute differences; works like walking city blocks.
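For example, computed with NumPy on two made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # √((1-4)² + (2-6)² + 0²) = 5.0
manhattan = np.sum(np.abs(a - b))           # |1-4| + |2-6| + 0      = 7.0
print(euclidean, manhattan)
```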
Small K (e.g., K=1): Sensitive to noise, overfitting — decision
boundary is very complex.
Large K: Smoother decision boundary, but may underfit and is slower.
Odd K: Use odd values to avoid tied votes in binary classification.
Rule of thumb: K ≈ √n (where n = number of training samples).
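In practice, K is often chosen by cross-validation rather than by the rule of thumb alone. A sketch assuming scikit-learn, with the Iris dataset standing in for real data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score several odd values of K with 5-fold cross-validation
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean accuracy = {scores.mean():.3f}")
```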
Pros and Cons
✅ Advantages
Simple and intuitive — easy to understand
No training phase — it's a "lazy learner"
Works for multi-class classification
Non-parametric — no assumptions about data distribution
❌ Disadvantages
Computationally expensive at prediction time
Sensitive to irrelevant features and scaling
Performs poorly with high-dimensional data
Requires feature scaling (normalization)
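Because of the last two points, a common pattern is to normalize features before the distance computation. A minimal sketch assuming scikit-learn, chaining a scaler and the classifier so the same scaling learned on the training data is applied at prediction time:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# StandardScaler rescales each feature to zero mean and unit variance,
# so no single feature dominates the distance calculation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# knn.fit(X_train, y_train); knn.predict(X_new)
```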
🎮 Interactive: KNN Visualizer
Click on the canvas to place a new point (⭐). Adjust K to see how the classification changes based on nearest neighbors.
⚔️ 4.2.2 — Support Vector Machine (SVM)
SVM is a powerful classification algorithm that finds the optimal
hyperplane — the decision boundary that maximizes the margin
between two classes.
Key Concepts
Hyperplane: A decision boundary that separates data into classes. In 2D
it's a line; in 3D a plane; in higher dimensions a hyperplane.
Support Vectors: The data points closest to the hyperplane — they
"support" the margin. These are the most important points for defining the boundary.
Margin: The distance between the hyperplane and the nearest support
vectors on either side. SVM maximizes this margin.
Maximum Margin Classifier: SVM finds the hyperplane with the
largest possible margin — this gives the best generalization.
SVM Margin Visualization: the hyperplane lies between Class A and Class B, with the margin extending from the hyperplane to the nearest points on each side, the support vectors (◯).
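As a concrete illustration (assuming scikit-learn and a tiny made-up, linearly separable dataset), fitting a linear SVM exposes exactly which training points act as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny invented dataset: two well-separated clusters
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The points closest to the hyperplane; only these define the boundary
print("Support vectors:\n", clf.support_vectors_)
print("Their indices in the training set:", clf.support_)
```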
Kernel Trick
When data is not linearly separable, SVM uses kernel functions to map data
into a higher-dimensional space where a linear separator exists.
Linear: K(x,y) = x·y (linearly separable data)
Polynomial: K(x,y) = (x·y + c)^d (non-linear boundaries)
RBF (Gaussian): K(x,y) = exp(−γ||x−y||²) (most common; works well for many problems)
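A sketch of the kernel trick in action, assuming scikit-learn: the make_circles toy dataset is not linearly separable, so a linear kernel struggles while the RBF kernel handles it easily.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line in 2D
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))

# The RBF kernel typically scores far higher here, since it implicitly maps
# the points into a space where the two circles become linearly separable.
```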
📈 4.3.1 — Simple Linear Regression
Simple Linear Regression models the relationship between a single independent
variable (x) and a dependent variable (y) using a straight
line.
y = mx + b (or y = β₀ + β₁x)
Where:
y = predicted value (dependent variable)
x = input feature (independent variable)
m (β₁) = slope — how much y changes per unit change in x
b (β₀) = intercept — value of y when x = 0
Finding the Best Line — Least Squares Method
The algorithm finds the line that minimizes the sum of squared errors (SSE)
— the total squared vertical distance between each actual point and the predicted line.
Evaluation Metrics
MSE = (1/n) Σ(y_i − ŷ_i)²: average squared error (penalizes large errors more)
RMSE = √MSE: root of MSE (back to original units)
R² Score = 1 − (SS_res / SS_total): fraction of variance explained (0 to 1, higher is better)
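A minimal NumPy sketch, using invented data points, that fits the least-squares line and computes these metrics:

```python
import numpy as np

# Hypothetical data: y is roughly 2x + 1 with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares fit of y = m*x + b (np.polyfit minimizes the sum of squared errors)
m, b = np.polyfit(x, y, deg=1)
y_pred = m * x + b

mse = np.mean((y - y_pred) ** 2)
rmse = np.sqrt(mse)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"slope={m:.2f}, intercept={b:.2f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
```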
🔄 4.3.2 — Logistic Regression
Despite its name, Logistic Regression is a classification algorithm, not a
regression algorithm. It predicts the probability that an input belongs to
a particular class.
The Sigmoid Function
Logistic Regression uses the sigmoid (logistic) function to map any real
number to a probability between 0 and 1:
σ(z) = 1 / (1 + e^(−z)) where z = β₀ + β₁x₁ + β₂x₂ + ...
How It Works
Compute the linear combination: z = β₀ + β₁x₁ + β₂x₂ + ...
Pass z through the sigmoid function to get probability p
If p ≥ 0.5, predict class 1; otherwise predict class 0
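Those three steps as a minimal NumPy sketch; the coefficients below are made up for illustration rather than learned from data:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned coefficients: β₀ (intercept) and β₁, β₂
beta0 = -1.0
beta = np.array([2.0, -0.5])

x_new = np.array([1.5, 2.0])     # one new input with two features
z = beta0 + beta @ x_new         # step 1: linear combination
p = sigmoid(z)                   # step 2: probability of class 1
label = int(p >= 0.5)            # step 3: threshold at 0.5
print(f"z={z:.2f}, p={p:.3f}, predicted class={label}")
```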
Sigmoid Curve Visualization
The S-shaped curve maps any input to a probability between 0 and 1. The threshold
(dashed line) at 0.5 determines the classification boundary.
When to Use Logistic Regression
Binary classification problems (yes/no, 0/1)
When you need probability estimates, not just class labels
When features are roughly linearly separable
As a baseline model before trying more complex algorithms
💡 Key Difference: Linear Regression outputs a continuous value (any number). Logistic Regression outputs a probability (0 to 1) and uses a threshold to make a classification decision.
🃏 Quick Revision — Flashcards
How does KNN classify a new data point?
KNN finds the K nearest training points (using a
distance metric like Euclidean distance) and assigns the class that appears most
frequently among those K neighbors (majority vote). It's a "lazy learner" — no
training phase.
What are Support Vectors in SVM?
Support Vectors are the training data points that are
closest to the decision boundary (hyperplane). They define the
margin and are the most critical points for the classification. Removing other
points doesn't change the hyperplane, but removing a support vector would.
What is the Kernel Trick in SVM?
When data isn't linearly separable, the kernel trick maps data
into a higher-dimensional space where a linear separator exists.
Common kernels: Linear, Polynomial, RBF (Gaussian). It avoids the computational cost
of actually computing the transformation.
What does the Sigmoid function do in Logistic Regression?
The sigmoid function σ(z) = 1/(1+e^(-z)) maps any real-valued
number to a probability between 0 and 1. If the output ≥ 0.5,
predict class 1; otherwise class 0. This makes logistic regression a classifier
despite its name.
What is R² (R-squared) in regression?
R² measures the proportion of variance in the dependent variable
that is explained by the model. R² = 1 − (SS_res / SS_total). Values range from 0
(no explanatory power) to 1 (perfect fit). It answers: "How well does my model
explain the data?"