Discover hidden patterns without labels — clustering with K-Means and hierarchical methods, and
association rule mining with the Apriori algorithm.
🔍 5.1.1 — Introduction to Unsupervised Learning
Unsupervised learning works with unlabeled data — the
algorithm finds hidden patterns, structures, or groupings without being told what to look
for.
📊 Clustering
Group
similar data points together based on feature similarity.
Customer segmentation
Image compression
Anomaly detection
Document grouping
🔗 Association
Find
relationships and patterns between items in large datasets.
Market basket analysis
"Customers who bought X also bought Y"
Recommendation systems
Cross-selling strategies
💡
Key difference from Supervised Learning: No target variable, no "correct
answers." The algorithm discovers structure on its own. Evaluation is harder — no
ground-truth labels to check against.
🔵 5.1.2 — Clustering Concepts
Clustering aims to partition data into groups (clusters) such that:
Points within the same cluster are as similar as possible (high
intra-cluster similarity)
Points in different clusters are as different as possible (high
inter-cluster distance)
Types of Clustering
Partitional: divides data into non-overlapping clusters. Examples: K-Means, K-Medoids
Hierarchical: creates a tree of nested clusters (dendrogram). Examples: Agglomerative, Divisive
Density-Based: clusters based on dense regions of data. Example: DBSCAN
Evaluation Metrics
Silhouette Score: Measures how well each point fits its cluster vs
other clusters. Range: [-1, 1]. Higher = better.
Inertia (WCSS): Sum of squared distances from each point to its cluster
centroid. Lower = tighter clusters.
Elbow Method: Plot inertia vs K and look for the "elbow" where adding
more clusters stops giving significant improvement.
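A minimal sketch of both metrics using scikit-learn, assuming a synthetic make_blobs dataset and an illustrative K range of 2 to 8:

```python
# Sketch: elbow method (inertia/WCSS) and silhouette score with scikit-learn.
# The make_blobs toy data and the K range 2-8 are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

ks = range(2, 9)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                         # WCSS: lower = tighter clusters
    silhouettes.append(silhouette_score(X, km.labels_))  # in [-1, 1]: higher = better

print("best K by silhouette:", ks[silhouettes.index(max(silhouettes))])

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia (WCSS)")
plt.title("Elbow method: look for the bend")
plt.show()
```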
🎯 5.2.1 — K-Means Clustering
K-Means is the most popular clustering algorithm. It partitions data into
exactly K clusters by iteratively assigning points to the nearest centroid
and updating centroids.
Algorithm Steps
Choose K — the number of clusters
Initialize K centroids randomly
Assignment step: Assign each data point to the nearest centroid
Update step: Recalculate each centroid as the mean of all points in its
cluster
Repeat steps 3-4 until centroids no longer change (convergence)
Objective: Minimize WCSS = Σᵢ Σⱼ∈Cᵢ ||xⱼ - μᵢ||²
Where μᵢ is the centroid of cluster Cᵢ,
and xⱼ is a data point.
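A bare-bones NumPy sketch of these steps (random initialization, a fixed iteration cap, and no empty-cluster handling are simplifying assumptions):

```python
# Minimal K-Means sketch in NumPy, mirroring the steps above.
# Illustrative only: random initialization, no k-means++ or empty-cluster handling.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random init
    for _ in range(n_iters):
        # Step 3 (assignment): distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4 (update): each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):            # convergence check
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()              # objective being minimized
    return labels, centroids, wcss

if __name__ == "__main__":
    # Three illustrative Gaussian blobs
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2))
                   for c in ([0, 0], [4, 4], [0, 4])])
    labels, centroids, wcss = kmeans(X, k=3)
    print("WCSS:", round(float(wcss), 2))
```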
Pros and Cons
✅ Advantages
Simple and easy to implement
Fast and scalable for large datasets
Guaranteed to converge (though possibly only to a local optimum)
Works well with spherical clusters
❌ Disadvantages
Must specify K in advance
Sensitive to initial centroid placement
Assumes spherical, equal-sized clusters
Sensitive to outliers
🎮 Interactive: K-Means Clustering Demo
Watch K-Means
iterate! Adjust K and click "Run K-Means" to see the algorithm in action.
🌳 5.2.2 — Hierarchical Clustering
Hierarchical clustering builds a tree of clusters (dendrogram) rather than
producing a flat partition. No need to specify K in advance.
Two Approaches
⬆️ Agglomerative (Bottom-Up)
Most
common approach.
Start: each point is its own cluster
Find the two closest clusters
Merge them into one cluster
Repeat until only one cluster remains
⬇️ Divisive (Top-Down)
Less
common, more complex.
Start: all points in one large cluster
Split the cluster into two sub-clusters
Recursively split each sub-cluster
Repeat until each point is its own cluster
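As a quick sketch, scikit-learn's AgglomerativeClustering runs the bottom-up approach directly when the desired number of clusters is known; the toy dataset and n_clusters=3 are illustrative assumptions:

```python
# Sketch: bottom-up (agglomerative) clustering with scikit-learn.
# The blob dataset and n_clusters=3 are illustrative choices.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# linkage can be "ward", "complete", "average", or "single" (see the next section)
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)   # cluster label for each of the 150 points
```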
Linkage Methods
How do we measure distance between clusters (not just individual points)?
Single Linkage: minimum distance between any pair of points across the two clusters; can produce elongated "chaining" clusters
Complete Linkage: maximum distance between any pair of points across the two clusters; produces compact, spherical clusters
Average Linkage: average distance between all pairs of points across the two clusters; a balanced approach
Ward's Method: merges the pair of clusters that gives the smallest increase in total within-cluster variance; tends to create equal-sized clusters
Reading a Dendrogram
Height (y-axis)
represents the distance at which clusters were merged. Cut the dendrogram at a chosen
height to get a desired number of clusters.
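A minimal sketch with SciPy's hierarchy module, assuming a toy blob dataset; Ward linkage and the three-cluster cut are illustrative choices:

```python
# Sketch: agglomerative clustering and dendrogram with SciPy.
# Ward linkage and the 3-cluster cut are illustrative assumptions.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

Z = linkage(X, method="ward")    # also: "single", "complete", "average"
dendrogram(Z)                    # y-axis height = distance at which clusters merge
plt.title("Dendrogram (Ward linkage)")
plt.show()

# "Cut" the tree to get a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```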
🔗 5.3 — Association Rule Mining
Association rule mining finds interesting relationships (rules) between
items in transactional datasets — "If a customer buys X, they are also likely to buy Y."
Key Metrics
Support(A→B) = P(A ∩ B) = (Transactions containing both A and B) / (Total transactions): how frequently A and B appear together in the data.
Confidence(A→B) = Support(A ∩ B) / Support(A): given that A is purchased, how likely is B also purchased?
Lift(A→B) = Confidence(A→B) / Support(B): how much more likely are A and B bought together than if they were independent?
Lift > 1: Positive association, items are bought together more than expected
Lift = 1: Independent, no association
Lift < 1: Negative association, buying one makes the other less likely
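A small, self-contained sketch of all three metrics on toy transactions (the item names and the rule {bread} → {butter} are purely illustrative):

```python
# Sketch: support, confidence, and lift computed on toy transactions.
# Item names and the rule {bread} -> {butter} are purely illustrative.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

A, B = {"bread"}, {"butter"}
supp_ab = support(A | B)        # P(A and B)
conf = supp_ab / support(A)     # P(B | A)
lift = conf / support(B)        # >1 positive, =1 independent, <1 negative

# Expected: support=0.60, confidence=0.75, lift=1.25 (positive association)
print(f"support={supp_ab:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```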
⛏️ 5.3.1 — Apriori Algorithm
The Apriori algorithm is the classic approach for mining frequent itemsets
and generating association rules.
Key Principle
📐
Apriori Property: If an itemset is infrequent, all its supersets must
also be infrequent. This allows us to prune the search space efficiently.
Algorithm Steps
Set minimum support threshold
Generate C₁: Count support of each individual item
Prune to L₁: Keep only items meeting minimum support
Generate C₂: Create all 2-item combinations from L₁
Prune to L₂: Keep only pairs meeting minimum support
Continue generating larger itemsets until no more frequent sets found
Generate rules from frequent itemsets that meet minimum confidence
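A compact, pure-Python sketch of this generate-and-prune loop, assuming the same toy transactions and an illustrative minimum support of 0.4 (rule generation is omitted for brevity):

```python
# Sketch: Apriori frequent-itemset mining in pure Python.
# Toy transactions and min_support=0.4 are illustrative; rule generation omitted.
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
]
min_support = 0.4
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# L1: frequent single items
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

k = 2
while frequent[-1]:
    # Generate C_k by joining frequent (k-1)-itemsets
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    # Apriori property: keep a candidate only if all its (k-1)-subsets are frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k - 1))}
    # Prune to L_k by minimum support
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent, start=1):
    for s in sets:
        print(level, set(s), round(support(s), 2))
```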
🛒 Interactive: Market Basket Analysis
Enter
transactions (comma-separated items per row). Adjust minimum support and click "Analyze"
to find frequent itemsets and association rules.
🃏 Quick Revision — Flashcards
What are the two main steps in K-Means that repeat?
1. Assignment Step: Assign each point to the
nearest centroid. 2. Update Step: Recalculate centroids as the
mean of all points in each cluster. Repeat until convergence (centroids stop
moving).
What is the Elbow Method?
A technique to find the optimal K in K-Means. Plot WCSS
(inertia) vs K. The "elbow" point — where the rate of decrease sharply
slows — suggests the best K value.
What is the Apriori Property?
If an itemset is infrequent, all of its
supersets must also be infrequent. This allows the algorithm to prune large portions
of the search space, making it computationally feasible.
What does Lift measure in association rules?
Lift(A→B) = Confidence(A→B) / Support(B). It
measures how much more likely items are bought together vs independently. Lift >
1 = positive association, Lift = 1 = independent, Lift < 1 = negative
association.
Name the two types of hierarchical clustering.
Agglomerative (Bottom-Up): Start with individual
points, merge closest pairs until one cluster remains. Divisive
(Top-Down): Start with one big cluster, recursively split until each
point is its own cluster.