Unit II

Preparing to Model

Before building any ML model, you need quality data. Learn about ML activities, data types, preprocessing techniques, and dimensionality reduction.

🔄 2.1.1 — Machine Learning Activities

Building an ML system is not just about choosing an algorithm; it follows a structured lifecycle of activities that ensure the model is effective and reliable.

The ML Lifecycle

1. Problem Definition → 2. Data Collection → 3. Data Preparation → 4. Model Training → 5. Evaluation → 6. Deployment

Key ML Activities Explained

Activity | Description | Key Task
Problem Definition | Clearly define what you want the model to achieve | Is this classification, regression, or clustering?
Data Collection | Gather relevant data from various sources | Databases, APIs, web scraping, surveys
Data Preparation | Clean, transform, and format data | Handle missing values, encoding, normalization
Exploratory Data Analysis | Understand data patterns and distributions | Visualizations, statistics, correlation analysis
Feature Engineering | Create or select the most informative features | Feature creation, selection, transformation
Model Training | Train algorithms on prepared data | Choose algorithm, set hyperparameters
Evaluation | Measure model performance | Accuracy, precision, recall, F1-score
Deployment | Put model into production | APIs, integration, monitoring
⚠️ Data-Centric AI: In practice, ~80% of ML project time is spent on data preparation and only ~20% on actual modeling. Clean, high-quality data matters more than fancy algorithms.
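
To make the lifecycle concrete, here is a minimal Python sketch of the collect → split → train → evaluate stages using scikit-learn (assumed installed). The toy Iris dataset, the logistic-regression model, and the 80/20 split are illustrative choices, not prescribed by these notes.

    # Minimal lifecycle sketch: collect -> prepare/split -> train -> evaluate
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)                      # data collection (toy stand-in)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)              # data preparation: hold out a test set
    model = LogisticRegression(max_iter=200)               # model training
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))   # evaluation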

📊 2.1.2 — Types of Data in Machine Learning

Understanding data types is fundamental because different ML algorithms require or perform better with specific types of data.

๐Ÿ“ Quantitative (Numerical) Data

Data that represents measurable quantities and can be expressed as numbers.

  • Discrete: Countable values (e.g., number of students: 25, 30, 42)
  • Continuous: Any value in a range (e.g., temperature: 36.7°C, height: 175.3 cm)

๐Ÿท๏ธ Qualitative (Categorical) Data

Data that represents categories or labels; arithmetic operations cannot be meaningfully performed on it.

  • Nominal: No natural order (e.g., color: red, blue, green)
  • Ordinal: Has natural order (e.g., rating: low, medium, high)
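
As a quick illustration, the pandas snippet below shows how all four kinds of data can live in one DataFrame, with an explicit category order for the ordinal column (the column names and values are made up for the example):

    import pandas as pd

    df = pd.DataFrame({
        "students":  [25, 30, 42],             # quantitative, discrete
        "height_cm": [175.3, 162.0, 180.5],    # quantitative, continuous
        "color":     ["red", "blue", "green"], # qualitative, nominal
        "rating":    ["low", "high", "medium"] # qualitative, ordinal
    })
    # Ordinal data carries an order, so encode it explicitly
    df["rating"] = pd.Categorical(
        df["rating"], categories=["low", "medium", "high"], ordered=True)
    print(df.dtypes)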

Scales of Measurement

Scale | Properties | Examples | Operations
Nominal | Categories, no order | Gender, Country, Color | =, ≠
Ordinal | Categories with order | Education level, Customer satisfaction | =, ≠, <, >
Interval | Equal intervals, no true zero | Temperature (°C), Calendar dates | =, ≠, <, >, +, −
Ratio | Equal intervals, true zero | Height, Weight, Income | =, ≠, <, >, +, −, ×, ÷

๐Ÿ—ƒ๏ธ 2.1.3 โ€” Structures of Data

📋 Structured Data

  • Organized in rows and columns (tabular)
  • Stored in relational databases, spreadsheets
  • Easy to search, query, and analyze
  • Examples: SQL databases, CSV files, Excel sheets

🌊 Unstructured Data

  • No predefined format or schema
  • Includes images, audio, video, text
  • Requires special techniques (NLP, CV) to process
  • ~80–90% of the world's data is unstructured

🔀 Semi-Structured Data

A hybrid between structured and unstructured. Has some organizational properties (tags, markers) but doesn't fit a strict table format. Examples: JSON, XML, HTML, Email.
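
For instance, a JSON record (a made-up example) has tags and nesting that give it some structure, yet it does not fit a strict rows-and-columns table:

    import json

    record = '{"name": "Asha", "emails": ["a@example.com", "b@example.com"], "address": {"city": "Pune"}}'
    parsed = json.loads(record)
    # Tags (keys) provide structure, but nested and variable-length fields break a flat table
    print(parsed["address"]["city"])   # -> Pune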

✅ 2.1.4 — Data Quality and Remediation

The quality of your data directly determines the quality of your ML model. "Garbage in, garbage out" is a fundamental principle.

Common Data Quality Issues

โŒ Missing Values

Empty cells due to collection errors, sensor failures, or optional fields. Can bias model training.

⚡ Noisy Data

Random errors or variance in data: typos, measurement errors, or outliers that don't represent true patterns.

📋 Duplicate Data

Repeated records that can skew model training and inflate performance metrics artificially.

📊 Inconsistent Data

Same information stored differently: "NY" vs "New York", mismatched date formats, unit discrepancies.

Remediation Techniques

Issue | Technique | Description
Missing Values | Imputation | Replace with mean, median, mode, or predicted values
Missing Values | Deletion | Remove rows/columns with missing data (if a small %)
Noisy Data | Smoothing | Binning, regression-based smoothing, clustering
Outliers | Detection & Removal | IQR method, Z-score, or domain-specific thresholds
Duplicates | Deduplication | Identify and remove duplicate records
Inconsistency | Standardization | Uniform formats, consistent naming conventions
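
A small pandas sketch of these remediation steps on a toy DataFrame (the values and thresholds are illustrative): standardize inconsistent labels, deduplicate, impute a missing value with the median, and drop outliers with the IQR rule.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":  [25, np.nan, 32, 25, 200],   # one missing value, one outlier
        "city": ["NY", "New York", "Boston", "NY", "Boston"],
    })
    df["city"] = df["city"].replace({"NY": "New York"})   # standardization
    df = df.drop_duplicates()                             # deduplication
    df["age"] = df["age"].fillna(df["age"].median())      # imputation
    q1, q3 = df["age"].quantile([0.25, 0.75])             # IQR outlier rule
    iqr = q3 - q1
    df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    print(df)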

โš™๏ธ 2.1.5 โ€” Data Pre-Processing

Data preprocessing transforms raw data into a clean, usable format suitable for ML algorithms. It is typically the most time-consuming step in the lifecycle, and also one of the most critical.

Key Preprocessing Steps

1. Data Cleaning

Handling missing values, removing duplicates, correcting inconsistencies, and filtering noise.

2. Data Transformation

  • Normalization (Min-Max Scaling): Scales features to a fixed range, typically [0, 1]
    X_normalized = (X − X_min) / (X_max − X_min)
  • Standardization (Z-score Normalization): Transforms data to have mean = 0 and standard deviation = 1
    X_standardized = (X − μ) / σ
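
Both formulas are implemented directly in scikit-learn; a minimal sketch with illustrative values:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0], [5.0], [10.0]])
    print(MinMaxScaler().fit_transform(X).ravel())    # scaled into [0, 1]
    print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1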

3. Data Encoding

  • Label Encoding: Assigns integer values to categories (e.g., Red=0, Blue=1, Green=2)
  • One-Hot Encoding: Creates binary columns for each category (e.g., is_Red=[1,0,0], is_Blue=[0,1,0])
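
A minimal sketch of both encodings using the color example above (scikit-learn and pandas assumed installed):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    colors = ["Red", "Blue", "Green", "Blue"]
    # Label encoding: one integer per category (assigned alphabetically here)
    print(LabelEncoder().fit_transform(colors))        # -> [2 0 1 0]
    # One-hot encoding: one binary column per category
    print(pd.get_dummies(pd.Series(colors), prefix="is"))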

4. Feature Scaling

Ensures all features are on comparable scales. Especially important for distance-based algorithms (KNN, SVM) and gradient descent optimization.

5. Data Splitting

Dividing data into a training set (typically 70–80%) and a test set (20–30%) to evaluate model generalization.
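
A minimal sketch with train_test_split (the 70/30 ratio and the toy arrays are illustrative choices):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)
    y = np.array([0, 1] * 5)
    # stratify=y keeps the class ratio roughly equal in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)
    print(X_train.shape, X_test.shape)   # -> (7, 2) (3, 2)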

📉 Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features (dimensions) in a dataset while preserving as much information as possible.

Why Reduce Dimensions?

  • Curse of Dimensionality: As dimensions increase, data becomes sparse, making patterns harder to find and models less reliable.
  • Reduced Computation: Fewer features mean faster training and prediction times.
  • Noise Reduction: Removing irrelevant features can improve model accuracy.
  • Visualization: Reducing to 2D or 3D enables visual exploration of high-dimensional data.

Approaches

🔄 Feature Extraction

Creates new features by transforming/combining original ones.

  • PCA (Principal Component Analysis) — linear projection onto directions of maximum variance
  • t-SNE — non-linear technique, mainly used for visualization
  • LDA (Linear Discriminant Analysis) — supervised projection that maximizes class separability

โœ‚๏ธ Feature Selection

Selects a subset of original features without transformation.

  • Filter methods — correlation, chi-square test
  • Wrapper methods — forward/backward selection
  • Embedded methods — L1 regularization (Lasso)

PCA — Principal Component Analysis

PCA is the most widely used dimensionality reduction technique. It works as follows (a from-scratch sketch of these steps appears after the tip below):

  1. Standardize the data (mean=0, variance=1)
  2. Compute the covariance matrix
  3. Find the eigenvectors (principal components) and eigenvalues
  4. Sort eigenvectors by eigenvalues in descending order
  5. Select top-k eigenvectors as the new feature space
  6. Project data onto the new k-dimensional space
💡 The first principal component captures the maximum variance in the data. Each subsequent component captures the maximum remaining variance while being orthogonal to all previous components.
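
Here is a from-scratch NumPy sketch of the six steps (the random data and k = 2 are illustrative; in practice sklearn.decomposition.PCA does the same job):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))

    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # 1. standardize
    cov = np.cov(Xs, rowvar=False)                # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # 3. eigenvectors/eigenvalues
    order = np.argsort(eigvals)[::-1]             # 4. sort by eigenvalue, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = 2
    W = eigvecs[:, :k]                            # 5. keep top-k eigenvectors
    X_reduced = Xs @ W                            # 6. project onto the k-dimensional space
    print(X_reduced.shape)                        # -> (100, 2)
    print(eigvals / eigvals.sum())                # variance explained per component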

🎯 Feature Subset Selection

Feature subset selection aims to identify the most relevant features for your ML model, removing redundant or irrelevant ones.

Methods of Feature Selection

Method | Approach | Pros | Cons
Filter Methods | Rank features using statistical tests, independent of any ML model | Fast, scalable, model-independent | Ignores feature interactions
Wrapper Methods | Use an ML model to evaluate subsets of features | Considers feature interactions | Computationally expensive
Embedded Methods | Feature selection is built into the model training process | Balance of speed and accuracy | Model-specific

Common Filter Techniques

  • Correlation Coefficient: Remove features highly correlated with each other (multicollinearity)
  • Chi-Square Test: Measures dependence between categorical features and target variable
  • Information Gain: Measures how much a feature reduces uncertainty about the target
  • Variance Threshold: Remove features with very low variance (near-constant values)
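
Two of these filters are available directly in scikit-learn; a minimal sketch on the Iris data (the threshold and k values are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

    X, y = load_iris(return_X_y=True)
    X_var = VarianceThreshold(threshold=0.2).fit_transform(X)  # drop near-constant features
    selector = SelectKBest(chi2, k=2).fit(X, y)                # chi-square needs non-negative X
    print(X_var.shape)           # features surviving the variance filter
    print(selector.scores_)      # higher score = stronger dependence on the target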

Forward Selection vs Backward Elimination

➕ Forward Selection

  • Start with no features
  • Add one feature at a time
  • Keep the one that improves performance most
  • Stop when adding features no longer helps

➖ Backward Elimination

  • Start with all features
  • Remove one feature at a time
  • Remove the one whose removal least affects performance
  • Stop when removal degrades performance
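
scikit-learn wraps both strategies in SequentialFeatureSelector; a minimal sketch (the estimator and n_features_to_select are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    est = LogisticRegression(max_iter=500)
    forward = SequentialFeatureSelector(
        est, n_features_to_select=2, direction="forward").fit(X, y)
    backward = SequentialFeatureSelector(
        est, n_features_to_select=2, direction="backward").fit(X, y)
    print(forward.get_support())    # boolean mask of features kept by forward selection
    print(backward.get_support())   # mask of features kept by backward elimination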

๐Ÿƒ Quick Revision โ€” Flashcards

Q: What percentage of an ML project is typically spent on data preparation?
A: Approximately 80% of an ML project's time is spent on data collection, cleaning, and preprocessing; only about 20% is spent on actual modeling.

Q: What is the difference between Normalization and Standardization?
A: Normalization (Min-Max) scales data to a fixed range [0, 1]: X' = (X − X_min) / (X_max − X_min). Standardization (Z-score) centers data around mean = 0 with std = 1: X' = (X − μ) / σ. Use normalization when you need bounded values; use standardization when data has outliers or follows a Gaussian distribution.

Q: What is the Curse of Dimensionality?
A: As the number of features (dimensions) increases, the data becomes increasingly sparse. Distance metrics become less meaningful, models need exponentially more data to generalize, and the risk of overfitting increases. This is why dimensionality reduction is important.

Q: What is PCA and what does it do?
A: Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated features into a set of uncorrelated principal components, each capturing the maximum remaining variance. It is a feature extraction method (it creates new features, unlike feature selection).

Q: Name the three types of feature selection methods.
A: 1. Filter Methods — use statistical tests (correlation, chi-square) independent of any model. 2. Wrapper Methods — use an ML model to evaluate feature subsets (forward selection, backward elimination). 3. Embedded Methods — selection is built into training (L1/Lasso regularization, tree-based importance).
