Unit III

Modeling & Evaluation

Learn how to select the right model, train it using proper techniques, evaluate its performance with key metrics, and improve accuracy.

🎯 3.1.1 — Selecting a Model

The first step in modeling is choosing the right type of model for your problem. Models are broadly categorized as Predictive or Descriptive.

🔮 Predictive Models

Use supervised learning — learn from labeled data to predict outcomes for new, unseen data.

  • Has a target variable (what we're predicting)
  • Classification: Predict a category (spam/not spam, disease/healthy)
  • Regression: Predict a continuous value (price, temperature)
  • Evaluated using accuracy, precision, recall, RMSE

šŸ” Descriptive Models

Use unsupervised learning — discover hidden patterns and structures in data without any target.

  • No target variable — just input features
  • Clustering: Group similar data points (customer segments)
  • Association: Find co-occurrence patterns (market basket)
  • Evaluated using silhouette score, inertia, lift
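
To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and a synthetic toy dataset, neither of which is specified by this unit) that fits one predictive and one descriptive model on the same features:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=42)  # toy data with 3 groups

# Predictive (supervised): learns from the labels y, then predicts for new points
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class for first point:", clf.predict(X[:1]))

# Descriptive (unsupervised): never sees y, just groups similar points together
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster assignments:", km.labels_[:5])
```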

🧩 Interactive: Model Selection Flow

🚀 3.2.1 — Training a Model (Supervised Learning)

Training means feeding data to an algorithm so it can learn patterns. The critical question is: how do we split data for training and testing?

Holdout Method

The simplest approach: split the dataset into two parts.

  • Training set (70–80%): Used to train (fit) the model
  • Test set (20–30%): Used to evaluate model performance on unseen data

Holdout Split Example (80/20): Training (80%) | Test (20%)

⚠️ Limitation: Results depend heavily on which data ends up in training vs test. A single unlucky split can give misleading results — high variance in performance estimates.
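
A minimal holdout sketch, assuming scikit-learn and the Iris toy dataset as stand-ins for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # toy data standing in for your own
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)              # 80% train / 20% test

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # fit on training data only
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```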

K-Fold Cross-Validation

A more robust technique that uses all data for both training and testing across multiple rounds:

  1. Divide the dataset into K equal-sized folds (subsets)
  2. For each fold i (from 1 to K):
    • Use fold i as the test set
    • Use the remaining K-1 folds as the training set
    • Train the model and record the test score
  3. The final performance = average of all K test scores

Final Score = (1/K) × Σ Score_i   (for i = 1 to K)

Aspect           | Holdout                              | K-Fold Cross-Validation
Number of splits | 1                                    | K
Data usage       | Part of data never used for training | All data used for both training and testing
Variance         | High (depends on the split)          | Low (averaged over K runs)
Computation      | Fast                                 | K times slower
Typical K values | N/A                                  | 5 or 10
Best for         | Large datasets, quick prototyping    | Small datasets, robust evaluation
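
The same procedure as a sketch, assuming scikit-learn; cross_val_score runs the K train/test rounds and returns one score per fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # K = 5 folds
print("Fold scores:", scores)
print("Final score:", scores.mean())   # (1/K) × Σ Score_i
```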

🔬 K-Fold Cross-Validation Visualizer

🔎 3.3.1 — Model Representation & Interpretability

Interpretability refers to how easily humans can understand and explain a model's predictions. This is crucial in high-stakes domains like healthcare, finance, and criminal justice.

🟢 Interpretable (White-Box) Models

  • Decisions are transparent and explainable
  • Easy to understand why a prediction was made
  • Examples: Decision Trees, Linear Regression, Logistic Regression
  • Preferred in regulated industries

🔴 Non-Interpretable (Black-Box) Models

  • Internal workings are opaque and complex
  • Difficult to explain individual predictions
  • Examples: Neural Networks, Random Forest, SVM (non-linear kernels)
  • Often more accurate on complex tasks

⚖️ Accuracy vs Interpretability Trade-off: More complex models (neural networks) often achieve higher accuracy but are harder to interpret. Simpler models (linear regression) are easily explainable but may miss complex patterns. Choose based on your domain requirements.
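
To illustrate what "white-box" means in practice, here is a sketch (assuming scikit-learn and the Iris toy dataset) that prints a shallow decision tree's learned rules as plain if/else statements:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))  # human-readable if/else rules
```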

📊 3.3.2 — Evaluating Performance: Confusion Matrix

A Confusion Matrix is a table that summarizes the performance of a classification model by comparing actual vs predicted labels.

                | Predicted Positive           | Predicted Negative
Actual Positive | TP (True Positive)           | FN (False Negative)
                | Correctly predicted positive | Missed — Type II Error
Actual Negative | FP (False Positive)          | TN (True Negative)
                | False alarm — Type I Error   | Correctly predicted negative
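
A small sketch, assuming scikit-learn and invented labels, showing how the four cells come out of a list of actual vs predicted values:

```python
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = positive, 0 = negative (made-up labels)
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()  # cells of the 2x2 table
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```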

Key Performance Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Overall correctness — what fraction of all predictions were right.

Precision = TP / (TP + FP)

Of all predicted positives, how many were actually positive? High precision = low false alarm rate.

Recall (Sensitivity) = TP / (TP + FN)

Of all actual positives, how many did the model catch? High recall = missing few positives.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of precision and recall — a balanced measure when you need both.
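
A worked example of the four formulas, using invented counts purely for illustration:

```python
# Invented counts, chosen only to exercise the formulas above
TP, FN, FP, TN = 80, 20, 10, 90

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 170/200 = 0.85
precision = TP / (TP + FP)                                  # 80/90  ≈ 0.889
recall    = TP / (TP + FN)                                  # 80/100 = 0.80
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.842
print(accuracy, precision, recall, f1)
```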

🎯 When to prioritize which metric?
• Precision when false positives are costly (spam filter — don't want to block real emails).
• Recall when false negatives are costly (disease detection — don't want to miss a sick patient).
• F1 when you need a balance between the two.

🧮 Interactive: Confusion Matrix Calculator

Enter your confusion matrix values and see all metrics computed in real-time:

                | Predicted Positive | Predicted Negative
Actual Positive | TP                 | FN
Actual Negative | FP                 | TN

📈 3.3.3 — Improving Model Performance

If your model doesn't perform well enough, here are proven strategies to improve it:

Understanding Overfitting vs Underfitting

📈 Overfitting (High Variance)

  • Model memorizes training data (including noise)
  • High training accuracy, low test accuracy
  • Too complex for the data
  • Solution: simplify model, add regularization, get more data

📉 Underfitting (High Bias)

  • Model is too simple to capture patterns
  • Low training accuracy, low test accuracy
  • Not enough capacity
  • Solution: more complex model, add features, train longer
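
A sketch of the symptom, assuming scikit-learn and the breast-cancer toy dataset: sweeping tree depth shows the train/test gap that separates underfitting from overfitting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, None):   # too simple → moderate → unconstrained (free to memorize)
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={m.score(X_tr, y_tr):.2f}  test={m.score(X_te, y_te):.2f}")
```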

Improvement Techniques

🔧 Feature Engineering

Create informative features, combine existing ones, or apply domain knowledge to extract better inputs for the model.
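
For instance, a tiny sketch (assuming pandas; the column names are hypothetical) deriving a ratio that is often more informative than its raw parts:

```python
import pandas as pd

df = pd.DataFrame({"price": [300_000, 450_000], "area_sqm": [100, 120]})
df["price_per_sqm"] = df["price"] / df["area_sqm"]   # engineered feature from two raw columns
print(df)
```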

⚔ Hyperparameter Tuning

Adjust model settings (learning rate, tree depth, K value) using Grid Search or Random Search.
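
A Grid Search sketch, assuming scikit-learn and an illustrative parameter grid:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8]},   # settings to try
                      cv=5)                                  # each setting scored by 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)
```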

šŸ›”ļø Regularization

L1 (Lasso) or L2 (Ridge) regularization penalizes overly complex models to prevent overfitting.
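
A sketch, assuming scikit-learn, where alpha sets the penalty strength for L2 (Ridge) and L1 (Lasso):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)
print("L2 (Ridge) coefficients:", Ridge(alpha=1.0).fit(X, y).coef_[:3])
print("L1 (Lasso) coefficients:", Lasso(alpha=1.0).fit(X, y).coef_[:3])  # L1 can shrink weak features to 0
```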

🔄 Cross-Validation

Use K-fold cross-validation to get a reliable estimate and avoid misleading results from a single split.

āš–ļø Handle Imbalanced Data

Use SMOTE (oversampling), undersampling, or class weights to address class imbalance.
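
A sketch of the class-weight option, assuming scikit-learn; SMOTE itself comes from the separate imbalanced-learn package and is not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)  # 95/5 imbalance
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)  # up-weights the rare class
print("Minority-class predictions:", int(clf.predict(X).sum()))
```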

📊 More Data

More training data helps the model learn more diverse patterns and improve generalization.

šŸŽ›ļø Interactive: Overfitting Simulator

Adjust model complexity and dataset size to see how they affect training vs testing accuracy.

šŸƒ Quick Revision — Flashcards

What is the difference between Predictive and Descriptive models?
Predictive: Supervised learning — uses labeled data, has a target variable, predicts new outcomes (classification/regression).
Descriptive: Unsupervised learning — no target variable, discovers hidden patterns (clustering/association).

Why is K-Fold Cross-Validation better than a single Holdout split?
K-Fold uses all data for both training and testing (across K iterations). This gives a more reliable, lower-variance performance estimate. A single holdout can be misleading due to an unlucky split.

What does the F1 Score measure?
F1 = 2 × (Precision × Recall) / (Precision + Recall). It's the harmonic mean of precision and recall, useful when you need a balance between both — especially with imbalanced classes.

What is the difference between overfitting and underfitting?
Overfitting: Model is too complex, memorizes training data. High train accuracy, low test accuracy.
Underfitting: Model is too simple, can't capture patterns. Low accuracy on both train and test data.

🧠 Unit 3 Quiz