Before building any ML model, you need quality data. Learn about ML activities, data types, preprocessing techniques, and dimensionality reduction.
Building an ML system is not just about choosing an algorithm: it follows a structured lifecycle of activities that ensure the model is effective and reliable.
| Activity | Description | Key Task |
|---|---|---|
| Problem Definition | Clearly define what you want the model to achieve | Is this classification, regression, or clustering? |
| Data Collection | Gather relevant data from various sources | Databases, APIs, web scraping, surveys |
| Data Preparation | Clean, transform, and format data | Handle missing values, encoding, normalization |
| Exploratory Data Analysis | Understand data patterns and distributions | Visualizations, statistics, correlation analysis |
| Feature Engineering | Create or select the most informative features | Feature creation, selection, transformation |
| Model Training | Train algorithms on prepared data | Choose algorithm, set hyperparameters |
| Evaluation | Measure model performance | Accuracy, precision, recall, F1-score |
| Deployment | Put model into production | APIs, integration, monitoring |
Data-Centric AI: In practice, ~80% of ML project time is spent on data preparation and only ~20% on actual modeling. Clean, high-quality data matters more than fancy algorithms.
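To make the lifecycle concrete, here is a minimal sketch in Python that walks through preparation, splitting, training, and evaluation on a toy dataset. The columns, model choice, and metrics shown are illustrative assumptions, not a prescribed pipeline.

```python
# A minimal sketch of the ML lifecycle on a toy classification problem.
# The dataset, column names, and model choice are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Data collection (here: a tiny hard-coded frame standing in for a real source)
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 23, 38, 60, 29],
    "income": [40_000, 52_000, 88_000, 91_000, 30_000, 61_000, 99_000, 45_000],
    "bought": [0, 0, 1, 1, 0, 1, 1, 0],  # target: a classification problem
})

# Data preparation: split first, then scale using training statistics only
X, y = df[["age", "income"]], df["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # never fit the scaler on test data

# Model training and evaluation
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print(accuracy_score(y_test, pred), f1_score(y_test, pred))
```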
Understanding data types is fundamental because different ML algorithms require or perform better with specific types of data.
Quantitative (numerical) data represents measurable quantities and can be expressed as numbers.
Qualitative (categorical) data represents categories or labels; arithmetic operations cannot be performed on it.
| Scale | Properties | Examples | Operations |
|---|---|---|---|
| Nominal | Categories, no order | Gender, Country, Color | =, ≠ |
| Ordinal | Categories with order | Education level, Customer satisfaction | =, ≠, <, > |
| Interval | Equal intervals, no true zero | Temperature (°C), Calendar dates | =, ≠, <, >, +, − |
| Ratio | Equal intervals, true zero | Height, Weight, Income | =, ≠, <, >, +, −, ×, ÷ |
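How a feature should be encoded follows from its scale. A minimal pandas sketch, assuming a toy frame with one nominal, one ordinal, and one ratio column:

```python
# Encoding by measurement scale (toy data; column names are assumptions).
import pandas as pd

df = pd.DataFrame({
    "country":   ["US", "FR", "US", "DE"],          # nominal: no order
    "education": ["HS", "BSc", "MSc", "BSc"],       # ordinal: HS < BSc < MSc
    "income":    [40_000, 52_000, 88_000, 61_000],  # ratio: true zero
})

# Nominal -> one-hot encode, since any ordering would be meaningless
df = pd.get_dummies(df, columns=["country"])

# Ordinal -> map to integers that preserve the order
order = {"HS": 0, "BSc": 1, "MSc": 2}
df["education"] = df["education"].map(order)

print(df)
```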
Semi-structured data is a hybrid between structured and unstructured data. It has some organizational properties (tags, markers) but doesn't fit a strict table format. Examples: JSON, XML, HTML, email.
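Semi-structured data usually needs flattening before it fits the tabular shape most ML libraries expect. A small sketch using pandas' `json_normalize`; the record layout here is an assumption:

```python
import pandas as pd

# Nested JSON-like records, as an API might return them (illustrative shape)
records = [
    {"id": 1, "user": {"name": "Ann", "country": "US"}, "score": 0.9},
    {"id": 2, "user": {"name": "Bo",  "country": "FR"}, "score": 0.4},
]

# Flatten the nested "user" object into columns: user.name, user.country
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'score', 'user.name', 'user.country']
```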
The quality of your data directly determines the quality of your ML model. "Garbage in, garbage out" is a fundamental principle.
Missing values: empty cells due to collection errors, sensor failures, or optional fields; they can bias model training.
Noisy data: random errors or variance, such as typos, measurement errors, or outliers that don't represent true patterns.
Duplicates: repeated records that can skew model training and artificially inflate performance metrics.
Inconsistent data: the same information stored differently, such as "NY" vs "New York", mixed date formats, or unit discrepancies.
| Issue | Technique | Description |
|---|---|---|
| Missing Values | Imputation | Replace with mean, median, mode, or predicted values |
| Missing Values | Deletion | Remove rows/columns with missing data (if small %) |
| Noisy Data | Smoothing | Binning, regression-based smoothing, clustering |
| Outliers | Detection & Removal | IQR method, Z-score, or domain-specific thresholds |
| Duplicates | Deduplication | Identify and remove duplicate records |
| Inconsistency | Standardization | Uniform formats, consistent naming conventions |
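The sketch below applies several techniques from the table to a toy frame; the columns, thresholds, and imputation choice are assumptions, and the right choices depend on the domain:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["NY", "New York", "Boston", "Boston", None],
    "price": [300, 310, 250, 250, 10_000],  # 10_000 looks like an outlier
})

# Inconsistency -> standardization: map variants to one canonical form
df["city"] = df["city"].replace({"NY": "New York"})

# Missing values -> imputation with the mode (a simple baseline choice)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Duplicates -> deduplication
df = df.drop_duplicates()

# Outliers -> IQR method: keep values within 1.5 * IQR of the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)
```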
Data preprocessing transforms raw data into a clean, usable format suitable for ML algorithms. This is the most time-consuming but critical step.
Data cleaning: handling missing values, removing duplicates, correcting inconsistencies, and filtering noise.
Feature scaling: ensures all features are on comparable scales. Especially important for distance-based algorithms (KNN, SVM) and gradient descent optimization.
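For instance, a short scikit-learn sketch contrasting standardization with min-max normalization (the feature values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Standardization: zero mean, unit variance per feature
print(StandardScaler().fit_transform(X))

# Min-max normalization: rescale each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))
```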
Data splitting: dividing data into a training set (typically 70–80%) and a test set (20–30%) to evaluate how well the model generalizes.
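A minimal split sketch; the 80/20 ratio and fixed random seed are conventional choices rather than requirements:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.array([0, 1] * 5)

# Hold out 20% for testing; fix the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```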
Dimensionality reduction is the process of reducing the number of features (dimensions) in a dataset while preserving as much information as possible.
Feature extraction creates new features by transforming or combining the original ones.
Feature selection keeps a subset of the original features without transformation.
Principal Component Analysis (PCA) is the most widely used dimensionality reduction technique. It works by projecting the data onto a new set of orthogonal axes (the principal components), ordered by the amount of variance each one captures.
The first principal component captures the maximum variance in the data. Each subsequent component captures the maximum remaining variance while being orthogonal to all previous components.
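A short scikit-learn sketch of PCA on synthetic correlated data, showing the variance captured per component (the data and component count are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples in 3 dimensions; the first two are strongly correlated
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               0.8 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

pca = PCA(n_components=2)          # keep the top two components
X_reduced = pca.fit_transform(X)   # shape: (100, 2)

# Fraction of total variance each component captures, in decreasing order
print(pca.explained_variance_ratio_)
```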
Feature subset selection aims to identify the most relevant features for your ML model, removing redundant or irrelevant ones.
| Method | Approach | Pros | Cons |
|---|---|---|---|
| Filter Methods | Rank features using statistical tests independent of any ML model | Fast, scalable, model-independent | Ignores feature interactions |
| Wrapper Methods | Use an ML model to evaluate subsets of features | Considers feature interactions | Computationally expensive |
| Embedded Methods | Feature selection is built into the model training process | Balance of speed and accuracy | Model-specific |
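One sketch per family, using scikit-learn on a synthetic classification set; the specific estimators and the choice of k = 4 are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Filter: rank features by an ANOVA F-test, independent of any model
filt = SelectKBest(f_classif, k=4).fit(X, y)

# Wrapper: recursively eliminate features using a model's performance
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)

# Embedded: selection falls out of training (tree importances here)
emb = RandomForestClassifier(random_state=0).fit(X, y)

print(filt.get_support())        # boolean mask of features kept by the filter
print(wrap.support_)             # mask chosen by the wrapper
print(emb.feature_importances_)  # importances from the embedded model
```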