Unit II

Preparing to Model

Before building any ML model, you need quality data. Learn about ML activities, data types, preprocessing techniques, and dimensionality reduction.

🔄 2.1.1 — Machine Learning Activities

Building an ML system is not just about choosing an algorithm; it follows a structured lifecycle of activities that ensure the model is effective and reliable.

The ML Lifecycle

1. Problem Definition → 2. Data Collection → 3. Data Preparation → 4. Model Training → 5. Evaluation → 6. Deployment

Key ML Activities Explained

Activity | Description | Key Task
Problem Definition | Clearly define what you want the model to achieve | Is this classification, regression, or clustering?
Data Collection | Gather relevant data from various sources | Databases, APIs, web scraping, surveys
Data Preparation | Clean, transform, and format data | Handle missing values, encoding, normalization
Exploratory Data Analysis | Understand data patterns and distributions | Visualizations, statistics, correlation analysis
Feature Engineering | Create or select the most informative features | Feature creation, selection, transformation
Model Training | Train algorithms on prepared data | Choose algorithm, set hyperparameters
Evaluation | Measure model performance | Accuracy, precision, recall, F1-score
Deployment | Put model into production | APIs, integration, monitoring
⚠️ Data-Centric AI: In practice, ~80% of ML project time is spent on data preparation and only ~20% on actual modeling. Clean, high-quality data matters more than fancy algorithms.
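
To make the lifecycle concrete, here is a minimal Python sketch of the collect → split → train → evaluate stages using scikit-learn (assumed installed). The toy Iris dataset, the logistic-regression model, and the 80/20 split are illustrative choices, not prescribed by these notes.

    # Minimal lifecycle sketch: collect -> prepare/split -> train -> evaluate
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)                      # data collection (toy stand-in)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)              # data preparation: hold out a test set
    model = LogisticRegression(max_iter=200)               # model training
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))   # evaluation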

📊 2.1.2 — Types of Data in Machine Learning

Understanding data types is fundamental because different ML algorithms require or perform better with specific types of data.

๐Ÿ“ Quantitative (Numerical) Data

Data that represents measurable quantities and can be expressed as numbers.

  • Discrete: Countable values (e.g., number of students: 25, 30, 42)
  • Continuous: Any value in a range (e.g., temperature: 36.7°C, height: 175.3 cm)

๐Ÿท๏ธ Qualitative (Categorical) Data

Data that represents categories or labels; arithmetic operations cannot be meaningfully performed on it.

  • Nominal: No natural order (e.g., color: red, blue, green)
  • Ordinal: Has natural order (e.g., rating: low, medium, high)
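
As a quick illustration, the pandas snippet below shows how all four kinds of data can live in one DataFrame, with an explicit category order for the ordinal column (the column names and values are made up for the example):

    import pandas as pd

    df = pd.DataFrame({
        "students":  [25, 30, 42],             # quantitative, discrete
        "height_cm": [175.3, 162.0, 180.5],    # quantitative, continuous
        "color":     ["red", "blue", "green"], # qualitative, nominal
        "rating":    ["low", "high", "medium"] # qualitative, ordinal
    })
    # Ordinal data carries an order, so encode it explicitly
    df["rating"] = pd.Categorical(
        df["rating"], categories=["low", "medium", "high"], ordered=True)
    print(df.dtypes)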

Scales of Measurement

Scale | Properties | Examples | Operations
Nominal | Categories, no order | Gender, Country, Color | =, ≠
Ordinal | Categories with order | Education level, Customer satisfaction | =, ≠, <, >
Interval | Equal intervals, no true zero | Temperature (°C), Calendar dates | =, ≠, <, >, +, −
Ratio | Equal intervals, true zero | Height, Weight, Income | =, ≠, <, >, +, −, ×, ÷

๐Ÿ—ƒ๏ธ 2.1.3 โ€” Structures of Data

📋 Structured Data

  • Organized in rows and columns (tabular)
  • Stored in relational databases, spreadsheets
  • Easy to search, query, and analyze
  • Examples: SQL databases, CSV files, Excel sheets

🌊 Unstructured Data

  • No predefined format or schema
  • Includes images, audio, video, text
  • Requires special techniques (NLP, CV) to process
  • ~80–90% of the world's data is unstructured

🔀 Semi-Structured Data

A hybrid between structured and unstructured. Has some organizational properties (tags, markers) but doesn't fit a strict table format. Examples: JSON, XML, HTML, Email.
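
For instance, a JSON record (a made-up example) has tags and nesting that give it some structure, yet it does not fit a strict rows-and-columns table:

    import json

    record = '{"name": "Asha", "emails": ["a@example.com", "b@example.com"], "address": {"city": "Pune"}}'
    parsed = json.loads(record)
    # Tags (keys) provide structure, but nested and variable-length fields break a flat table
    print(parsed["address"]["city"])   # -> Pune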

✅ 2.1.4 — Data Quality and Remediation

The quality of your data directly determines the quality of your ML model. "Garbage in, garbage out" is a fundamental principle.

Common Data Quality Issues

โŒ Missing Values

Empty cells due to collection errors, sensor failures, or optional fields. Can bias model training.

⚡ Noisy Data

Random errors or variance in data: typos, measurement errors, or outliers that don't represent true patterns.

📋 Duplicate Data

Repeated records that can skew model training and inflate performance metrics artificially.

📊 Inconsistent Data

Same information stored differently: "NY" vs "New York", mismatched date formats, unit discrepancies.

Remediation Techniques

Issue | Technique | Description
Missing Values | Imputation | Replace with mean, median, mode, or predicted values
Missing Values | Deletion | Remove rows/columns with missing data (if a small %)
Noisy Data | Smoothing | Binning, regression-based smoothing, clustering
Outliers | Detection & Removal | IQR method, Z-score, or domain-specific thresholds
Duplicates | Deduplication | Identify and remove duplicate records
Inconsistency | Standardization | Uniform formats, consistent naming conventions
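
A small pandas sketch of these remediation steps on a toy DataFrame (the values and thresholds are illustrative): standardize inconsistent labels, deduplicate, impute a missing value with the median, and drop outliers with the IQR rule.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":  [25, np.nan, 32, 25, 200],   # one missing value, one outlier
        "city": ["NY", "New York", "Boston", "NY", "Boston"],
    })
    df["city"] = df["city"].replace({"NY": "New York"})   # standardization
    df = df.drop_duplicates()                             # deduplication
    df["age"] = df["age"].fillna(df["age"].median())      # imputation
    q1, q3 = df["age"].quantile([0.25, 0.75])             # IQR outlier rule
    iqr = q3 - q1
    df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    print(df)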

โš™๏ธ 2.1.5 โ€” Data Pre-Processing

Data preprocessing transforms raw data into a clean, usable format suitable for ML algorithms. It is typically the most time-consuming step in the lifecycle, and also one of the most critical.

Key Preprocessing Steps

1. Data Cleaning

Handling missing values, removing duplicates, correcting inconsistencies, and filtering noise.

2. Data Transformation

  • Normalization (Min-Max Scaling): Scales features to a fixed range, typically [0, 1]
    X_normalized = (X − X_min) / (X_max − X_min)
  • Standardization (Z-score Normalization): Transforms data to have mean = 0 and standard deviation = 1
    X_standardized = (X − μ) / σ
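
Both formulas are implemented directly in scikit-learn; a minimal sketch with illustrative values:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0], [5.0], [10.0]])
    print(MinMaxScaler().fit_transform(X).ravel())    # scaled into [0, 1]
    print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1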

3. Data Encoding

  • Label Encoding: Assigns integer values to categories (e.g., Red=0, Blue=1, Green=2)
  • One-Hot Encoding: Creates binary columns for each category (e.g., is_Red=[1,0,0], is_Blue=[0,1,0])
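
A minimal sketch of both encodings using the color example above (scikit-learn and pandas assumed installed):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    colors = ["Red", "Blue", "Green", "Blue"]
    # Label encoding: one integer per category (assigned alphabetically here)
    print(LabelEncoder().fit_transform(colors))        # -> [2 0 1 0]
    # One-hot encoding: one binary column per category
    print(pd.get_dummies(pd.Series(colors), prefix="is"))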

4. Feature Scaling

Ensures all features are on comparable scales. Especially important for distance-based algorithms (KNN, SVM) and gradient descent optimization.

5. Data Splitting

Dividing data into a training set (typically 70–80%) and a test set (20–30%) to evaluate model generalization.
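
A minimal sketch with train_test_split (the 70/30 ratio and the toy arrays are illustrative choices):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)
    y = np.array([0, 1] * 5)
    # stratify=y keeps the class ratio roughly equal in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)
    print(X_train.shape, X_test.shape)   # -> (7, 2) (3, 2)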

📉 Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features (dimensions) in a dataset while preserving as much information as possible.

Why Reduce Dimensions?

  • Curse of Dimensionality: As dimensions increase, data becomes sparse, making patterns harder to find and models less reliable.
  • Reduced Computation: Fewer features mean faster training and prediction times.
  • Noise Reduction: Removing irrelevant features can improve model accuracy.
  • Visualization: Reducing to 2D or 3D enables visual exploration of high-dimensional data.

Approaches

🔄 Feature Extraction

Creates new features by transforming/combining original ones.

  • PCA (Principal Component Analysis) — linear projection onto directions of maximum variance
  • t-SNE — non-linear technique, mainly used for visualization
  • LDA (Linear Discriminant Analysis) — supervised projection that maximizes class separability

โœ‚๏ธ Feature Selection

Selects a subset of original features without transformation.

  • Filter methods — correlation, chi-square test
  • Wrapper methods — forward/backward selection
  • Embedded methods — L1 regularization (Lasso)

PCA — Principal Component Analysis

PCA is the most widely used dimensionality reduction technique. It works as follows (a from-scratch sketch of these steps appears after the tip below):

  1. Standardize the data (mean=0, variance=1)
  2. Compute the covariance matrix
  3. Find the eigenvectors (principal components) and eigenvalues
  4. Sort eigenvectors by eigenvalues in descending order
  5. Select top-k eigenvectors as the new feature space
  6. Project data onto the new k-dimensional space
💡 The first principal component captures the maximum variance in the data. Each subsequent component captures the maximum remaining variance while being orthogonal to all previous components.
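
Here is a from-scratch NumPy sketch of the six steps (the random data and k = 2 are illustrative; in practice sklearn.decomposition.PCA does the same job):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))

    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # 1. standardize
    cov = np.cov(Xs, rowvar=False)                # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # 3. eigenvectors/eigenvalues
    order = np.argsort(eigvals)[::-1]             # 4. sort by eigenvalue, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = 2
    W = eigvecs[:, :k]                            # 5. keep top-k eigenvectors
    X_reduced = Xs @ W                            # 6. project onto the k-dimensional space
    print(X_reduced.shape)                        # -> (100, 2)
    print(eigvals / eigvals.sum())                # variance explained per component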

🎯 Feature Subset Selection

Feature subset selection aims to identify the most relevant features for your ML model, removing redundant or irrelevant ones.

Methods of Feature Selection

Method | Approach | Pros | Cons
Filter Methods | Rank features using statistical tests, independent of any ML model | Fast, scalable, model-independent | Ignores feature interactions
Wrapper Methods | Use an ML model to evaluate subsets of features | Considers feature interactions | Computationally expensive
Embedded Methods | Feature selection is built into the model training process | Balance of speed and accuracy | Model-specific

Common Filter Techniques

  • Correlation Coefficient: Remove features highly correlated with each other (multicollinearity)
  • Chi-Square Test: Measures dependence between categorical features and target variable
  • Information Gain: Measures how much a feature reduces uncertainty about the target
  • Variance Threshold: Remove features with very low variance (near-constant values)
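
Two of these filters are available directly in scikit-learn; a minimal sketch on the Iris data (the threshold and k values are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

    X, y = load_iris(return_X_y=True)
    X_var = VarianceThreshold(threshold=0.2).fit_transform(X)  # drop near-constant features
    selector = SelectKBest(chi2, k=2).fit(X, y)                # chi-square needs non-negative X
    print(X_var.shape)           # features surviving the variance filter
    print(selector.scores_)      # higher score = stronger dependence on the target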

Forward Selection vs Backward Elimination

➕ Forward Selection

  • Start with no features
  • Add one feature at a time
  • Keep the one that improves performance most
  • Stop when adding features no longer helps

➖ Backward Elimination

  • Start with all features
  • Remove one feature at a time
  • Remove the one whose removal least affects performance
  • Stop when removal degrades performance
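
scikit-learn wraps both strategies in SequentialFeatureSelector; a minimal sketch (the estimator and n_features_to_select are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    est = LogisticRegression(max_iter=500)
    forward = SequentialFeatureSelector(
        est, n_features_to_select=2, direction="forward").fit(X, y)
    backward = SequentialFeatureSelector(
        est, n_features_to_select=2, direction="backward").fit(X, y)
    print(forward.get_support())    # boolean mask of features kept by forward selection
    print(backward.get_support())   # mask of features kept by backward elimination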

๐Ÿƒ Quick Revision โ€” Flashcards

Q: What percentage of an ML project is typically spent on data preparation?
A: Approximately 80% of an ML project's time is spent on data collection, cleaning, and preprocessing; only about 20% is spent on actual modeling.

Q: What is the difference between Normalization and Standardization?
A: Normalization (Min-Max) scales data to a fixed range [0, 1]: X' = (X − X_min) / (X_max − X_min). Standardization (Z-score) centers data around mean = 0 with std = 1: X' = (X − μ) / σ. Use normalization when you need bounded values; use standardization when data has outliers or follows a Gaussian distribution.

Q: What is the Curse of Dimensionality?
A: As the number of features (dimensions) increases, the data becomes increasingly sparse. Distance metrics become less meaningful, models need exponentially more data to generalize, and the risk of overfitting increases. This is why dimensionality reduction is important.

Q: What is PCA and what does it do?
A: Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated features into a set of uncorrelated principal components, each capturing the maximum remaining variance. It is a feature extraction method (it creates new features, unlike feature selection).

Q: Name the three types of feature selection methods.
A: 1. Filter Methods — use statistical tests (correlation, chi-square) independent of any model. 2. Wrapper Methods — use an ML model to evaluate feature subsets (forward selection, backward elimination). 3. Embedded Methods — selection is built into training (L1/Lasso regularization, tree-based importance).
