Unit IV

Supervised Learning

Explore classification and regression — the two pillars of supervised learning — with interactive visualizations of KNN, SVM, linear and logistic regression.

📘 4.1.1 — Introduction to Supervised Learning

Supervised Learning is the most common form of ML. The algorithm learns from a labeled dataset — each training example has an input (features) paired with a known output (label/target).

The goal is to learn a mapping function f: X → Y that can predict the output Y for new, unseen inputs X.

Supervised Learning Pipeline

📊 Labeled Dataset → ⚙️ Training Algorithm → 🧠 Trained Model → 🎯 Prediction on New Data

🏷️ Classification

Output is a discrete category/class

  • Email → Spam / Not Spam
  • Image → Cat / Dog / Bird
  • Patient → Diabetic / Not Diabetic

Algorithms: KNN, SVM, Decision Tree, Logistic Regression, Naive Bayes

📈 Regression

Output is a continuous numerical value

  • House features → Price ($350,000)
  • Study hours → Exam score (82%)
  • Temperature → Electricity demand (450 MW)

Algorithms: Linear Regression, Polynomial Regression, Ridge, Lasso

🏷️ 4.1.2 — Classification Model

A classification model assigns input data points to one of several predefined categories. The model learns decision boundaries that separate different classes in the feature space.

Types of Classification

  • Binary: two possible classes. Example: Spam / Not Spam, Pass / Fail
  • Multi-class: more than two mutually exclusive classes, often handled with a one-vs-all strategy. Example: digit recognition (0-9), animal classification
  • Multi-label: a single instance can carry several labels at once. Example: movie genres (action + comedy + romance)

📝 4.1.3 — Learning Steps

Every supervised learning task follows these core steps:

  1. Collect Data: Gather a labeled dataset with features (X) and target (y)
  2. Split Data: Divide into training set (~80%) and test set (~20%)
  3. Choose Algorithm: Select an appropriate model based on problem type and data characteristics
  4. Train Model: Fit the algorithm to the training data — it learns patterns and relationships
  5. Evaluate: Test the model on unseen test data using appropriate metrics
  6. Tune: Adjust hyperparameters to optimize performance
  7. Deploy: Use the final model for predictions on new data
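
To make the pipeline concrete, here is a minimal sketch of steps 2 through 6 using scikit-learn; the Iris dataset, the KNN classifier, and the exact split are illustrative choices, not requirements.

```python
# Minimal supervised-learning pipeline sketch (illustrative choices throughout).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)             # 1. labeled data (features X, target y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)     # 2. ~80/20 train/test split

model = KNeighborsClassifier(n_neighbors=5)   # 3. choose an algorithm
model.fit(X_train, y_train)                   # 4. train on the training set

y_pred = model.predict(X_test)                # 5. evaluate on unseen data
print("Test accuracy:", accuracy_score(y_test, y_pred))
# 6. Tune: try other n_neighbors values and keep the best-performing one
```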

📍 4.2.1 — K-Nearest Neighbor (KNN)

KNN is a simple, intuitive classification algorithm. It classifies a new data point based on the majority vote of its K nearest neighbors in the feature space.

How KNN Works

  1. Choose K — the number of neighbors to consider
  2. Calculate distance between the new point and all training points
  3. Select the K nearest training points
  4. Majority vote — assign the most common class among the K neighbors
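
The four steps translate almost line for line into NumPy. Below is a minimal from-scratch sketch; the four training points and their labels are made up purely for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])   # toy data
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> "A"
```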

Distance Metrics

KNN relies on measuring how "close" data points are. Common distance formulas:

Euclidean: d = √(Σ(xᵢ − yᵢ)²)

Straight-line distance — most commonly used.

Manhattan: d = Σ|xᵢ − yᵢ|

Sum of absolute differences — works like walking city blocks.

Minkowski: d = (Σ|xᵢ − yᵢ|^p)^(1/p)

Generalized — p=1 gives Manhattan, p=2 gives Euclidean.
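
A small sketch of that generalization: the same Minkowski function with p=1 reproduces Manhattan distance and with p=2 reproduces Euclidean distance (the example points are arbitrary).

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return (np.abs(x - y) ** p).sum() ** (1 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # 7.0  (Manhattan: |3| + |4|)
print(minkowski(a, b, p=2))  # 5.0  (Euclidean: sqrt(3² + 4²))
```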

Choosing K

  • Small K (e.g., K=1): Sensitive to noise, overfitting — decision boundary is very complex.
  • Large K: Smoother decision boundary, but may underfit and is slower.
  • Odd K: Use odd values to avoid tied votes in binary classification.
  • Rule of thumb: K ≈ √n (where n = number of training samples).
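
Both ideas can be checked empirically. The sketch below prints the √n starting point and then compares a few candidate K values by cross-validated accuracy on scikit-learn's Iris data; the candidate list is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
print("Rule-of-thumb K ~", int(np.sqrt(len(X))))   # sqrt(n) starting point

# In practice, pick K by cross-validated accuracy
for k in (1, 3, 5, 7, 11, 15):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K={k:2d}  mean CV accuracy={score:.3f}")
```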

Pros and Cons

✅ Advantages

  • Simple and intuitive — easy to understand
  • No training phase — it's a "lazy learner"
  • Works for multi-class classification
  • Non-parametric — no assumptions about data distribution

❌ Disadvantages

  • Computationally expensive at prediction time
  • Sensitive to irrelevant features and scaling
  • Performs poorly with high-dimensional data
  • Requires feature scaling (normalization)

🎮 Interactive: KNN Visualizer

Click on the canvas to place a new point (⭐). Adjust K to see how the classification changes based on nearest neighbors.

⚔️ 4.2.2 — Support Vector Machine (SVM)

SVM is a powerful classification algorithm that finds the optimal hyperplane — the decision boundary that maximizes the margin between two classes.

Key Concepts

  • Hyperplane: A decision boundary that separates data into classes. In 2D it's a line; in 3D a plane; in higher dimensions a hyperplane.
  • Support Vectors: The data points closest to the hyperplane — they "support" the margin. These are the most important points for defining the boundary.
  • Margin: The distance between the hyperplane and the nearest support vectors on either side. SVM maximizes this margin.
  • Maximum Margin Classifier: SVM finds the hyperplane with the largest possible margin — this gives the best generalization.
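
A minimal scikit-learn sketch of these concepts on a made-up two-cluster dataset: after fitting a linear SVM you can read off the support vectors, the hyperplane coefficients, and the margin width 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Two small linearly separable clusters (toy data for illustration only)
X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)    # the boundary-defining points
w, b = clf.coef_[0], clf.intercept_[0]
print("Hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("Margin width:", 2 / np.linalg.norm(w))        # gap between the two margins
```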

SVM Margin Visualization

Points from Class A (●) and Class B (●) lie on opposite sides of the hyperplane; the circled points (◯) nearest the boundary are the support vectors, and the margin is the gap between them and the hyperplane.

Kernel Trick

When data is not linearly separable, SVM uses kernel functions to map data into a higher-dimensional space where a linear separator exists.

  • Linear: K(x, y) = x·y. For linearly separable data.
  • Polynomial: K(x, y) = (x·y + c)^d. Captures non-linear boundaries.
  • RBF (Gaussian): K(x, y) = exp(−γ||x − y||²). The most common choice; works well for many problems.
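
To see the kernel trick pay off, the sketch below compares a linear and an RBF kernel on scikit-learn's make_circles data, which no straight line in 2D can separate; the noise level and sample size are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line in 2D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))
# The RBF kernel implicitly lifts the data into a space where a linear
# separator exists, so it scores far higher on this dataset.
```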

📈 4.3.1 — Simple Linear Regression

Simple Linear Regression models the relationship between a single independent variable (x) and a dependent variable (y) using a straight line.

y = mx + b    (or y = β₀ + β₁x)

Where:

  • y = predicted value (dependent variable)
  • x = input feature (independent variable)
  • m (β₁) = slope — how much y changes per unit change in x
  • b (β₀) = intercept — value of y when x = 0

Finding the Best Line — Least Squares Method

The algorithm finds the line that minimizes the sum of squared errors (SSE) — the total squared vertical distance between each actual point and the predicted line.

SSE = Σ(y_actual − y_predicted)² = Σ(yᵢ − (mxᵢ + b))²
β₁ (slope) = [nΣxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / [nΣxᵢ² − (Σxᵢ)²]
β₀ (intercept) = ȳ − β₁x̄
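
The closed-form formulas drop straight into NumPy. The study-hours/exam-score numbers below are made up just to exercise them.

```python
import numpy as np

# Toy data: study hours -> exam score (made-up numbers for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 67.0, 71.0, 80.0])

n = len(x)
# Slope and intercept from the least-squares formulas above
beta1 = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_pred = beta0 + beta1 * x
sse = ((y - y_pred) ** 2).sum()
print(f"slope={beta1:.2f}, intercept={beta0:.2f}, SSE={sse:.2f}")
```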

📐 Interactive: Linear Regression Line

Adjust slope and intercept to see how the regression line changes. Try to fit the data points!

📊 Multiple Linear Regression

Extends simple linear regression to multiple input features:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

Each coefficient βᵢ represents how much y changes for a one-unit increase in xᵢ, holding all other variables constant.

Example: House Price Prediction

Price = β₀ + β₁(sqft) + β₂(bedrooms) + β₃(age) + β₄(location_score)
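
A hedged sketch of the same model with scikit-learn's LinearRegression; the house rows, prices, and the query row are fabricated purely to show the interface.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fabricated rows: [sqft, bedrooms, age, location_score] -> price
X = np.array([
    [1400, 3, 20, 7],
    [2000, 4, 5, 8],
    [850, 2, 35, 5],
    [1700, 3, 12, 9],
    [2400, 4, 2, 6],
])
y = np.array([310_000, 452_000, 178_000, 405_000, 470_000])

model = LinearRegression().fit(X, y)
print("Intercept (β0):", model.intercept_)
print("Coefficients (β1..β4):", model.coef_)   # effect of each feature, others held constant
print("Predicted price:", model.predict([[1600, 3, 10, 7]])[0])
```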

Evaluation Metrics for Regression

  • MAE = (1/n)Σ|yᵢ − ŷᵢ|: average absolute error, in the same units as y
  • MSE = (1/n)Σ(yᵢ − ŷᵢ)²: average squared error; penalizes large errors more
  • RMSE = √MSE: square root of MSE, back in the original units of y
  • R² = 1 − (SS_res / SS_total): fraction of variance explained (0 to 1, higher is better)
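
All four metrics are one-liners with NumPy and scikit-learn; the actual and predicted values below are invented just to exercise them.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])     # invented actual values
y_pred = np.array([2.5, 5.5, 7.0, 11.0])     # invented predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R²={r2:.3f}")
```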

🔄 Logistic Regression

Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm. It predicts the probability that an input belongs to a particular class.

The Sigmoid Function

Logistic Regression uses the sigmoid (logistic) function to map any real number to a probability between 0 and 1:

σ(z) = 1 / (1 + e^(−z))    where z = β₀ + β₁x₁ + β₂x₂ + ...

How It Works

  1. Compute the linear combination: z = β₀ + β₁x₁ + β₂x₂ + ...
  2. Pass z through the sigmoid function to get probability p
  3. If p ≥ 0.5, predict class 1; otherwise predict class 0
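
Those three steps in NumPy, with made-up coefficients β₀ and β₁ for a single input feature:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up coefficients for a one-feature model: z = β0 + β1*x
beta0, beta1 = -4.0, 1.5
x = np.array([0.0, 2.0, 3.0, 5.0])

z = beta0 + beta1 * x          # 1. linear combination
p = sigmoid(z)                 # 2. squash to a probability
pred = (p >= 0.5).astype(int)  # 3. threshold at 0.5
print(np.round(p, 3), pred)    # [0.018 0.269 0.622 0.971] [0 0 1 1]
```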

Sigmoid Curve Visualization

The S-shaped curve maps any input to a probability between 0 and 1. The threshold (dashed line) at 0.5 determines the classification boundary.

When to Use Logistic Regression

  • Binary classification problems (yes/no, 0/1)
  • When you need probability estimates, not just class labels
  • When the classes are roughly linearly separable in the feature space
  • As a baseline model before trying more complex algorithms
💡 Key Difference: Linear Regression outputs a continuous value (any real number). Logistic Regression outputs a probability (0 to 1) and uses a threshold to make a classification decision.

🃏 Quick Revision — Flashcards

Q: How does KNN classify a new data point?
A: KNN finds the K nearest training points (using a distance metric like Euclidean distance) and assigns the class that appears most frequently among those K neighbors (majority vote). It is a "lazy learner": there is no explicit training phase.

Q: What are Support Vectors in SVM?
A: Support Vectors are the training data points that are closest to the decision boundary (hyperplane). They define the margin and are the most critical points for the classification. Removing other points doesn't change the hyperplane, but removing a support vector would.

Q: What is the Kernel Trick in SVM?
A: When data isn't linearly separable, the kernel trick maps data into a higher-dimensional space where a linear separator exists. Common kernels: Linear, Polynomial, RBF (Gaussian). It avoids the computational cost of actually computing the transformation.

Q: What does the Sigmoid function do in Logistic Regression?
A: The sigmoid function σ(z) = 1/(1 + e^(−z)) maps any real-valued number to a probability between 0 and 1. If the output ≥ 0.5, predict class 1; otherwise class 0. This makes logistic regression a classifier despite its name.

Q: What is R² (R-squared) in regression?
A: R² measures the proportion of variance in the dependent variable that is explained by the model. R² = 1 − (SS_res / SS_total). Values range from 0 (no explanatory power) to 1 (perfect fit). It answers: "How well does my model explain the data?"

🧠 Unit 4 Quiz