Explore classification and regression — the two pillars of supervised learning — with interactive
visualizations of KNN, SVM, linear and logistic regression.
📘 4.1.1 — Introduction to Supervised Learning
Supervised Learning is the most common form of ML. The algorithm learns from
a labeled dataset — each training example has an input (features) paired
with a known output (label/target).
The goal is to learn a mapping function f: X → Y that can predict the output
Y for new, unseen inputs X.
Common algorithms: Linear Regression, Polynomial Regression, Ridge, and Lasso (regression); KNN, SVM, and Logistic Regression (classification)
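As a minimal sketch of what "labeled data" means in code (the numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical labeled dataset: each row of X holds one example's features,
# and the matching entry of y is its known output (label/target).
X = np.array([[1.0, 2.0],    # features of example 1
              [2.0, 1.0],    # features of example 2
              [3.0, 3.5]])   # features of example 3
y = np.array([0, 0, 1])      # known labels for each example

# Supervised learning looks for a function f such that f(X[i]) ≈ y[i]
# and, ideally, f also predicts well on new, unseen inputs.
```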
🏷️ 4.1.2 — Classification Model
A classification model assigns input data points to one of several predefined
categories. The model learns decision boundaries that separate
different classes in the feature space.
Types of Classification
Binary: two possible classes. Examples: Spam / Not Spam, Pass / Fail.
Multi-class: more than two classes (one-vs-all). Examples: digit recognition (0-9), animal classification.
Multi-label: multiple labels per instance. Example: movie genres (action + comedy + romance).
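To make the distinction concrete, here is a hypothetical sketch of how the targets are commonly encoded for each type (all values invented for illustration):

```python
import numpy as np

# Binary: one label per example, two possible values
y_binary = np.array([0, 1, 1, 0])          # e.g. Not Spam / Spam

# Multi-class: one label per example, more than two possible values
y_multiclass = np.array([3, 7, 0, 9])      # e.g. digit recognition (0-9)

# Multi-label: several labels may apply to one example, often stored as a
# binary indicator matrix (rows = examples; columns = action, comedy, romance)
y_multilabel = np.array([[1, 1, 0],        # action + comedy
                         [0, 0, 1],        # romance only
                         [1, 0, 1]])       # action + romance
```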
📝 4.1.3 — Learning Steps
Every supervised learning task follows these core steps:
Collect Data: Gather a labeled dataset with features (X) and target (y)
Split Data: Divide into training set (~80%) and test set (~20%)
Choose Algorithm: Select an appropriate model based on problem type and
data characteristics
Train Model: Fit the algorithm to the training data — it learns
patterns and relationships
Evaluate: Test the model on unseen test data using appropriate metrics
Tune: Adjust hyperparameters to optimize performance
Deploy: Use the final model for predictions on new data
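These steps map directly onto a typical scikit-learn workflow. A minimal sketch, assuming scikit-learn is installed and using its bundled Iris dataset purely for illustration (the model choice and parameter grid are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Collect data: features X and labels y
X, y = load_iris(return_X_y=True)

# 2. Split data: ~80% train, ~20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 3. Choose an algorithm (KNN here; introduced in the next section)
model = KNeighborsClassifier()

# 4 & 6. Tune the hyperparameter K with cross-validation, then train on the training set
search = GridSearchCV(model, param_grid={"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluate on unseen test data
y_pred = search.predict(X_test)
print("Best K:", search.best_params_["n_neighbors"])
print("Test accuracy:", accuracy_score(y_test, y_pred))

# 7. Deploy: reuse search.best_estimator_ for predictions on new data
```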
📍 4.2.1 — K-Nearest Neighbor (KNN)
KNN is a simple, intuitive classification algorithm. It classifies a new
data point based on the majority vote of its K nearest neighbors in the
feature space.
How KNN Works
Choose K — the number of neighbors to consider
Calculate distance between the new point and all training points
Select the K nearest training points
Majority vote — assign the most common class among the K neighbors
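A from-scratch sketch of these four steps in NumPy, using Euclidean distance and a tiny invented dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny made-up dataset: two features, two classes
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, x_new=np.array([2, 2]), k=3))  # near class 0
print(knn_predict(X_train, y_train, x_new=np.array([6, 7]), k=3))  # near class 1
```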
Distance Metrics
KNN relies on measuring how "close" data points are. Common distance formulas:
Euclidean: d = √(Σ(x_i − y_i)²) — straight-line distance; the most commonly used.
Manhattan: d = Σ|x_i − y_i| — sum of absolute differences; works like walking city blocks.
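For example, computed with NumPy on two made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # √((1-4)² + (2-6)² + 0²) = 5.0
manhattan = np.sum(np.abs(a - b))           # |1-4| + |2-6| + 0      = 7.0
print(euclidean, manhattan)
```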
Small K (e.g., K=1): Sensitive to noise, overfitting — decision
boundary is very complex.
Large K: Smoother decision boundary, but may underfit and is slower.
Odd K: Use odd values to avoid tied votes in binary classification.
Rule of thumb: K ≈ √n (where n = number of training samples).
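In practice, K is often chosen by cross-validation rather than by the rule of thumb alone. A sketch assuming scikit-learn, with the Iris dataset standing in for real data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score several odd values of K with 5-fold cross-validation
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean accuracy = {scores.mean():.3f}")
```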
Pros and Cons
✅ Advantages
Simple and intuitive — easy to understand
No training phase — it's a "lazy learner"
Works for multi-class classification
Non-parametric — no assumptions about data distribution
❌ Disadvantages
Computationally expensive at prediction time
Sensitive to irrelevant features and scaling
Performs poorly with high-dimensional data
Requires feature scaling (normalization)
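Because of the last two points, a common pattern is to normalize features before the distance computation. A minimal sketch assuming scikit-learn, chaining a scaler and the classifier so the same scaling learned on the training data is applied at prediction time:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# StandardScaler rescales each feature to zero mean and unit variance,
# so no single feature dominates the distance calculation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# knn.fit(X_train, y_train); knn.predict(X_new)
```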
🎮 Interactive: KNN Visualizer
Click on the canvas to place a new point (⭐). Adjust K to see how the classification changes based on nearest neighbors.
⚔️ 4.2.2 — Support Vector Machine (SVM)
SVM is a powerful classification algorithm that finds the optimal
hyperplane — the decision boundary that maximizes the margin
between two classes.
Key Concepts
Hyperplane: A decision boundary that separates data into classes. In 2D
it's a line; in 3D a plane; in higher dimensions a hyperplane.
Support Vectors: The data points closest to the hyperplane — they
"support" the margin. These are the most important points for defining the boundary.
Margin: The distance between the hyperplane and the nearest support
vectors on either side. SVM maximizes this margin.
Maximum Margin Classifier: SVM finds the hyperplane with the
largest possible margin — this gives the best generalization.
SVM Margin Visualization: the hyperplane lies between Class A and Class B, with the margin extending from the hyperplane to the nearest points on each side, the support vectors (◯).
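As a concrete illustration (assuming scikit-learn and a tiny made-up, linearly separable dataset), fitting a linear SVM exposes exactly which training points act as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny invented dataset: two well-separated clusters
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The points closest to the hyperplane; only these define the boundary
print("Support vectors:\n", clf.support_vectors_)
print("Their indices in the training set:", clf.support_)
```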
Kernel Trick
When data is not linearly separable, SVM uses kernel functions to map data
into a higher-dimensional space where a linear separator exists.
Linear: K(x,y) = x·y (linearly separable data)
Polynomial: K(x,y) = (x·y + c)^d (non-linear boundaries)
RBF (Gaussian): K(x,y) = exp(−γ||x−y||²) (most common; works well for many problems)
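A sketch of the kernel trick in action, assuming scikit-learn: the make_circles toy dataset is not linearly separable, so a linear kernel struggles while the RBF kernel handles it easily.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line in 2D
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))

# The RBF kernel typically scores far higher here, since it implicitly maps
# the points into a space where the two circles become linearly separable.
```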
📈 4.3.1 — Simple Linear Regression
Simple Linear Regression models the relationship between a single independent
variable (x) and a dependent variable (y) using a straight
line.
y = mx + b (or y = β₀ + β₁x)
Where:
y = predicted value (dependent variable)
x = input feature (independent variable)
m (β₁) = slope — how much y changes per unit change in x
b (β₀) = intercept — value of y when x = 0
Finding the Best Line — Least Squares Method
The algorithm finds the line that minimizes the sum of squared errors (SSE)
— the total squared vertical distance between each actual point and the predicted line.
Evaluation Metrics
MSE = (1/n) Σ(y_i − ŷ_i)²: average squared error (penalizes large errors more)
RMSE = √MSE: root of MSE (back to original units)
R² Score = 1 − (SS_res / SS_total): fraction of variance explained (0 to 1, higher is better)
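A minimal NumPy sketch, using invented data points, that fits the least-squares line and computes these metrics:

```python
import numpy as np

# Hypothetical data: y is roughly 2x + 1 with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares fit of y = m*x + b (np.polyfit minimizes the sum of squared errors)
m, b = np.polyfit(x, y, deg=1)
y_pred = m * x + b

mse = np.mean((y - y_pred) ** 2)
rmse = np.sqrt(mse)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"slope={m:.2f}, intercept={b:.2f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
```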
🔄 4.3.2 — Logistic Regression
Despite its name, Logistic Regression is a classification algorithm, not a
regression algorithm. It predicts the probability that an input belongs to
a particular class.
The Sigmoid Function
Logistic Regression uses the sigmoid (logistic) function to map any real
number to a probability between 0 and 1:
σ(z) = 1 / (1 + e^(−z)) where z = β₀ + β₁x₁ + β₂x₂ + ...
How It Works
Compute the linear combination: z = β₀ + β₁x₁ + β₂x₂ + ...
Pass z through the sigmoid function to get probability p
If p ≥ 0.5, predict class 1; otherwise predict class 0
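Those three steps as a minimal NumPy sketch; the coefficients below are made up for illustration rather than learned from data:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned coefficients: β₀ (intercept) and β₁, β₂
beta0 = -1.0
beta = np.array([2.0, -0.5])

x_new = np.array([1.5, 2.0])     # one new input with two features
z = beta0 + beta @ x_new         # step 1: linear combination
p = sigmoid(z)                   # step 2: probability of class 1
label = int(p >= 0.5)            # step 3: threshold at 0.5
print(f"z={z:.2f}, p={p:.3f}, predicted class={label}")
```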
Sigmoid Curve Visualization
The S-shaped curve maps any input to a probability between 0 and 1. The threshold
(dashed line) at 0.5 determines the classification boundary.
When to Use Logistic Regression
Binary classification problems (yes/no, 0/1)
When you need probability estimates, not just class labels
When features are roughly linearly separable
As a baseline model before trying more complex algorithms
💡 Key Difference: Linear Regression outputs a continuous value (any number). Logistic Regression outputs a probability (0 to 1) and uses a threshold to make a classification decision.
🃏 Quick Revision — Flashcards
How does KNN classify a new data point?
KNN finds the K nearest training points (using a
distance metric like Euclidean distance) and assigns the class that appears most
frequently among those K neighbors (majority vote). It's a "lazy learner" — no
training phase.
What are Support Vectors in SVM?
Support Vectors are the training data points that are
closest to the decision boundary (hyperplane). They define the
margin and are the most critical points for the classification. Removing other
points doesn't change the hyperplane, but removing a support vector would.
What is the Kernel Trick in SVM?
When data isn't linearly separable, the kernel trick maps data
into a higher-dimensional space where a linear separator exists.
Common kernels: Linear, Polynomial, RBF (Gaussian). It avoids the computational cost
of actually computing the transformation.
What does the Sigmoid function do in Logistic Regression?
The sigmoid function σ(z) = 1/(1+e^(-z)) maps any real-valued
number to a probability between 0 and 1. If the output ≥ 0.5,
predict class 1; otherwise class 0. This makes logistic regression a classifier
despite its name.
What is R² (R-squared) in regression?
R² measures the proportion of variance in the dependent variable
that is explained by the model. R² = 1 − (SS_res / SS_total). Values range from 0
(no explanatory power) to 1 (perfect fit). It answers: "How well does my model
explain the data?"