Clustering
Clustering groups unlabeled data points so that points within a cluster are more similar to each other than to points in other clusters. It is the most fundamental unsupervised learning task — used in customer segmentation, image compression, anomaly detection, topic discovery, and gene expression analysis.
Why Clustering Matters
Supervised learning requires labels. Labels are expensive. In most organizations, 95%+ of data is unlabeled. Clustering finds hidden structure without human annotation:
- Customer segmentation: Group customers by purchasing behavior for targeted marketing
- Image compression: Reduce color palette by clustering pixel colors
- Document organization: Group similar documents without predefined categories
- Anomaly detection: Points that don't belong to any cluster are anomalies
- Pre-labeling: Generate pseudo-labels for semi-supervised pipelines
Intuition: What Makes a Good Cluster?
A good clustering maximizes intra-cluster similarity and minimizes inter-cluster similarity. The challenge: "similarity" depends on the distance metric, and the "right" number of clusters is rarely known in advance.
K-Means: Lloyd's Algorithm
The Idea
Partition
- Assign each point to the nearest centroid
- Update each centroid to the mean of its assigned points
Mathematical Formulation
Given data
where
Worked Example — K-Means WCSS and One Iteration
Mini Dataset (K=2):
| Point | x1 | x2 |
|---|---|---|
| A | 1 | 1 |
| B | 2 | 1 |
| C | 1 | 2 |
| D | 8 | 8 |
| E | 9 | 8 |
| F | 8 | 9 |
Initialize centroids: mu1 = [1, 1] (point A), mu2 = [9, 8] (point E)
Step 1: Assignment — assign each point to nearest centroid A: d(A, mu1)=0.00, d(A, mu2)=sqrt(64+49)=10.63 -> Cluster 1 B: d(B, mu1)=1.00, d(B, mu2)=sqrt(49+49)=9.90 -> Cluster 1 C: d(C, mu1)=1.00, d(C, mu2)=sqrt(64+36)=10.00 -> Cluster 1 D: d(D, mu1)=sqrt(49+49)=9.90, d(D, mu2)=1.00 -> Cluster 2 E: d(E, mu1)=10.63, d(E, mu2)=0.00 -> Cluster 2 F: d(F, mu1)=10.00, d(F, mu2)=sqrt(1+1)=1.41 -> Cluster 2
Cluster 1: {A, B, C}, Cluster 2:
Step 2: Update — recompute centroids mu1 = [(1+2+1)/3, (1+1+2)/3] = [1.33, 1.33] mu2 = [(8+9+8)/3, (8+8+9)/3] = [8.33, 8.33]
Step 3: Compute WCSS Cluster 1: (1-1.33)^2+(1-1.33)^2 + (2-1.33)^2+(1-1.33)^2 + (1-1.33)^2+(2-1.33)^2 = 0.11+0.11 + 0.44+0.11 + 0.11+0.44 = 1.33 Cluster 2: (8-8.33)^2+(8-8.33)^2 + (9-8.33)^2+(8-8.33)^2 + (8-8.33)^2+(9-8.33)^2 = 0.11+0.11 + 0.44+0.11 + 0.11+0.44 = 1.33 WCSS = 1.33 + 1.33 = 2.67
Interpret: "K-Means found two natural clusters in one iteration. The WCSS of 2.67 is low, indicating tight clusters. Each point is within ~0.67 units of its centroid."
Lloyd's Algorithm Step-by-Step
- Initialize
centroids (randomly or via K-Means++) - Assignment step: For each point
, assign to nearest centroid:
- Update step: Recompute each centroid:
- Repeat until centroids converge (movement
) or max iterations reached
Convergence guarantee: WCSS decreases monotonically at each step. Since there are finitely many partitions, the algorithm always converges — but possibly to a local minimum.
K-Means++ Initialization
Random initialization can lead to poor convergence. K-Means++ selects initial centroids that are spread out:
- Choose first centroid uniformly at random from data
- For each remaining centroid, choose point
with probability proportional to , where is the distance to the nearest existing centroid - Repeat until
centroids are chosen
This gives an
From-Scratch NumPy Implementation
import numpy as np
class KMeansFromScratch:
"""K-Means clustering with K-Means++ initialization."""
def __init__(self, n_clusters=3, max_iter=300, tol=1e-4, random_state=42):
self.n_clusters = n_clusters
self.max_iter = max_iter
self.tol = tol
self.rng = np.random.RandomState(random_state)
def _init_centroids_plus_plus(self, X):
"""K-Means++ initialization."""
n_samples = X.shape[0]
centroids = [X[self.rng.randint(n_samples)]]
for _ in range(1, self.n_clusters):
# Squared distances to nearest centroid
dists = np.min(
[np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0
)
# Probability proportional to D(x)^2
probs = dists / dists.sum()
idx = self.rng.choice(n_samples, p=probs)
centroids.append(X[idx])
return np.array(centroids)
def fit(self, X):
"""Fit K-Means to data X of shape (n_samples, n_features)."""
self.centroids_ = self._init_centroids_plus_plus(X)
self.inertia_history_ = []
for iteration in range(self.max_iter):
# Assignment step: compute distances to all centroids
distances = np.array([
np.sum((X - c) ** 2, axis=1) for c in self.centroids_
]).T # shape: (n_samples, n_clusters)
self.labels_ = np.argmin(distances, axis=1)
# Compute inertia (WCSS)
inertia = sum(
np.sum((X[self.labels_ == k] - self.centroids_[k]) ** 2)
for k in range(self.n_clusters)
)
self.inertia_history_.append(inertia)
# Update step: recompute centroids
new_centroids = np.array([
X[self.labels_ == k].mean(axis=0)
if np.any(self.labels_ == k)
else self.centroids_[k] # keep old if empty cluster
for k in range(self.n_clusters)
])
# Check convergence
shift = np.sum((new_centroids - self.centroids_) ** 2)
self.centroids_ = new_centroids
if shift < self.tol:
break
self.n_iter_ = iteration + 1
self.inertia_ = self.inertia_history_[-1]
return self
def predict(self, X):
"""Assign new points to nearest centroid."""
distances = np.array([
np.sum((X - c) ** 2, axis=1) for c in self.centroids_
]).T
return np.argmin(distances, axis=1)Scikit-learn K-Means
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate synthetic data
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8,
random_state=42)
# Always scale before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-Means
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10,
max_iter=300, random_state=42)
kmeans.fit(X_scaled)
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
print(f"Iterations: {kmeans.n_iter_}")
print(f"Cluster sizes: {np.bincount(kmeans.labels_)}")Choosing K: Elbow Method and Silhouette Score
Elbow Method
Plot WCSS (inertia) vs.
inertias = []
K_range = range(2, 11)
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X_scaled)
inertias.append(km.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.grid(True, alpha=0.3)
plt.show()Limitation: The elbow is often ambiguous. Use silhouette score for a more quantitative answer.
Silhouette Score
For each point
where:
= mean distance from to all other points in the same cluster (cohesion) = mean distance from to all points in the nearest other cluster (separation)
The overall silhouette score is
Worked Example — Silhouette Score
Using the K-Means result above. Compute silhouette for point A=[1,1]:
Cluster 1: {A=[1,1], B=[2,1], C=[1,2]} Cluster 2:
Step 1: a(A) = mean distance to same-cluster points d(A,B) = sqrt((1-2)^2+(1-1)^2) = 1.000 d(A,C) = sqrt((1-1)^2+(1-2)^2) = 1.000 a(A) = (1.000 + 1.000) / 2 = 1.000
Step 2: b(A) = mean distance to nearest other cluster d(A,D) = sqrt(49+49) = 9.899 d(A,E) = sqrt(64+49) = 10.630 d(A,F) = sqrt(49+64) = 10.630 b(A) = (9.899 + 10.630 + 10.630) / 3 = 10.386
Step 3: Compute silhouette s(A) = (10.386 - 1.000) / max(1.000, 10.386) = 9.386/10.386 = 0.904
Interpret: "s(A) = 0.904, very close to 1.0, meaning point A is well-clustered (much closer to its own cluster than to the other). Values near 0 indicate borderline points, negative values suggest wrong cluster assignment."
: Point is well-clustered : Point is on the border between clusters : Point is likely in the wrong cluster
from sklearn.metrics import silhouette_score, silhouette_samples
sil_scores = []
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
sil_scores.append(score)
print(f"K={k}: Silhouette = {score:.3f}")
best_k = K_range[np.argmax(sil_scores)]
print(f"\nBest K by silhouette: {best_k}")DBSCAN: Density-Based Clustering
The Idea
Unlike K-Means, DBSCAN doesn't require specifying
Key Parameters
- eps (
): Maximum distance between two points to be considered neighbors - min_samples: Minimum number of points in the
-neighborhood to form a core point
Point Types
- Core point: Has at least
min_samplespoints within radius - Border point: Within
of a core point but not a core point itself - Noise point: Neither core nor border — labeled as -1
Algorithm
- For each unvisited point
: - Find all points within
distance (the -neighborhood) - If
min_samples,is a core point — start a new cluster - Expand the cluster by recursively adding all density-reachable points
- Find all points within
- Points not reachable from any core point are noise
Choosing eps with the k-Distance Graph
from sklearn.neighbors import NearestNeighbors
# Use k = min_samples - 1 (convention)
k = 4
nn = NearestNeighbors(n_neighbors=k + 1)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
# Sort k-th nearest neighbor distances
k_distances = np.sort(distances[:, k])
plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('k-Distance Graph for Choosing eps')
plt.grid(True, alpha=0.3)
plt.show()
# Look for the "elbow" — that distance is a good epsScikit-learn DBSCAN
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Silhouette (excl. noise): "
f"{silhouette_score(X_scaled[labels != -1], labels[labels != -1]):.3f}")DBSCAN vs K-Means
| Feature | K-Means | DBSCAN |
|---|---|---|
| Requires K | Yes | No |
| Cluster shape | Spherical | Arbitrary |
| Handles noise | No | Yes (labels as -1) |
| Handles varying density | No | Poorly (single eps) |
| Complexity | ||
| Deterministic | No (init-dependent) | Yes |
Hierarchical Agglomerative Clustering
The Idea
Start with each point as its own cluster. Repeatedly merge the two closest clusters until only one remains. The result is a dendrogram — a tree showing the merge history.
Linkage Criteria
The "distance between clusters" depends on the linkage method:
- Single linkage:
— prone to chaining - Complete linkage:
— produces compact clusters - Average linkage:
- Ward's method: Minimizes the increase in total within-cluster variance — most commonly used
Ward's merge criterion: Merge clusters
Worked Example — Ward's Merge Criterion
3 candidate cluster merges:
- Merge X: A(3 points, mu=[1,1]) with B(3 points, mu=[2,2])
- Merge Y: C(2 points, mu=[1,1]) with D(4 points, mu=[5,5])
- Merge Z: E(5 points, mu=[3,3]) with F(5 points, mu=[3.5, 3.5])
Step 1: Delta for merge X Delta_X = (3*3)/(3+3) * ||(1-2)^2 + (1-2)^2|| = 9/6 * (1+1) = 1.5 * 2 = 3.0
Step 2: Delta for merge Y Delta_Y = (2*4)/(2+4) * ||(1-5)^2 + (1-5)^2|| = 8/6 * (16+16) = 1.333 * 32 = 42.67
Step 3: Delta for merge Z Delta_Z = (5*5)/(5+5) * ||(3-3.5)^2 + (3-3.5)^2|| = 25/10 * (0.25+0.25) = 2.5 * 0.5 = 1.25
Step 4: Ward's method merges the pair with smallest Delta Delta_Z = 1.25 < Delta_X = 3.0 < Delta_Y = 42.67 -> Merge E and F first (they're closest and similar-sized)
Interpret: "Ward's method prefers merging clusters that are close together AND similar in size. Merge Y has a huge Delta (42.67) because the centroids are far apart. Merge Z wins because the clusters are very close (0.5 units apart) even though they're larger."
Scikit-learn + Dendrogram
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Compute linkage matrix for dendrogram
Z = linkage(X_scaled, method='ward')
plt.figure(figsize=(12, 5))
dendrogram(Z, truncate_mode='lastp', p=20, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()
# Cut the dendrogram at desired number of clusters
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
agg_labels = agg.fit_predict(X_scaled)
print(f"Silhouette: {silhouette_score(X_scaled, agg_labels):.3f}")Gaussian Mixture Models (GMM)
The Idea
K-Means assigns each point to exactly one cluster (hard assignment). GMMs model data as a mixture of
Generative Model
The data is generated by:
- Choose cluster
with probability (mixing coefficient) - Draw
from
The probability density:
where:
The EM Algorithm
Since we can't directly maximize the log-likelihood (it involves a sum inside the log), we use Expectation-Maximization:
E-step — Compute responsibilities (posterior probability that point
Worked Example — GMM E-Step (Responsibilities)
1D data, K=2 Gaussians. Parameters:
- Cluster 1: pi1=0.5, mu1=2, sigma1=1
- Cluster 2: pi2=0.5, mu2=6, sigma2=1
Compute responsibilities for x=3.5:
Step 1: N(3.5 | mu1=2, sigma=1) = (1/sqrt(2*pi)) * exp(-(3.5-2)^2/2) = 0.3989 * exp(-1.125) = 0.3989 * 0.3247 = 0.1295
Step 2: N(3.5 | mu2=6, sigma=1) = 0.3989 * exp(-(3.5-6)^2/2) = 0.3989 * exp(-3.125) = 0.3989 * 0.0439 = 0.0175
Step 3: Compute responsibilities gamma_1 = (0.5 * 0.1295) / (0.5 * 0.1295 + 0.5 * 0.0175) = 0.0648 / (0.0648 + 0.0088) = 0.0648 / 0.0735 = 0.881 gamma_2 = 0.0088 / 0.0735 = 0.119
Step 4: Verify: 0.881 + 0.119 = 1.000
Interpret: "Point x=3.5 has 88.1% responsibility to cluster 1 (centered at 2) and 11.9% to cluster 2 (centered at 6). Unlike K-Means which would assign it 100% to cluster 1, GMM gives soft assignments reflecting uncertainty."
M-step — Update parameters using responsibilities:
Repeat until log-likelihood converges.
Scikit-learn GMM
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, covariance_type='full',
n_init=5, random_state=42)
gmm.fit(X_scaled)
# Soft assignments (probabilities)
probs = gmm.predict_proba(X_scaled)
labels = gmm.predict(X_scaled)
print(f"Log-likelihood: {gmm.score(X_scaled):.3f}")
print(f"BIC: {gmm.bic(X_scaled):.1f}")
print(f"AIC: {gmm.aic(X_scaled):.1f}")
# Use BIC to select number of components
bics = []
for k in range(2, 10):
g = GaussianMixture(n_components=k, random_state=42)
g.fit(X_scaled)
bics.append(g.bic(X_scaled))
best_k = range(2, 10)[np.argmin(bics)]
print(f"Best K by BIC: {best_k}")Covariance Types
| Type | Parameters per component | Shape |
|---|---|---|
'full' | Arbitrary ellipsoid | |
'tied' | Same ellipsoid, different centers | |
'diag' | Axis-aligned ellipsoid | |
'spherical' | Sphere (like K-Means) |
Real Dataset: Mall Customers Segmentation
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
# Load Mall Customers dataset
# Columns: CustomerID, Gender, Age, Annual Income (k$), Spending Score (1-100)
url = "https://raw.githubusercontent.com/dsrscientist/dataset1/master/mall_customers.csv"
df = pd.read_csv(url)
print(df.head())
print(f"\nShape: {df.shape}")
print(df.describe())
# Use Annual Income and Spending Score for 2D visualization
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# --- Elbow Method ---
inertias = []
sil_scores = []
K_range = range(2, 11)
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X_scaled)
inertias.append(km.inertia_)
sil_scores.append(silhouette_score(X_scaled, km.labels_))
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_title('Elbow Method')
axes[0].set_xlabel('K')
axes[0].set_ylabel('Inertia')
axes[1].plot(K_range, sil_scores, 'ro-')
axes[1].set_title('Silhouette Score')
axes[1].set_xlabel('K')
axes[1].set_ylabel('Score')
plt.tight_layout()
plt.show()
# --- K-Means with best K ---
best_k = 5 # Typical result for this dataset
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42)
km_labels = kmeans.fit_predict(X_scaled)
# --- Compare all methods ---
methods = {
'K-Means (K=5)': km_labels,
'DBSCAN (eps=0.5)': DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled),
'Agglomerative (K=5)': AgglomerativeClustering(n_clusters=5).fit_predict(X_scaled),
'GMM (K=5)': GaussianMixture(n_components=5, random_state=42).fit_predict(X_scaled),
}
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for ax, (name, labels) in zip(axes, methods.items()):
scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30, alpha=0.7)
ax.set_title(name)
ax.set_xlabel('Annual Income (k$)')
ax.set_ylabel('Spending Score')
# Silhouette (skip noise-only results)
valid = labels[labels != -1]
if len(set(valid)) > 1:
sil = silhouette_score(X_scaled[labels != -1], valid)
ax.text(0.05, 0.95, f'Sil: {sil:.3f}', transform=ax.transAxes, va='top')
plt.tight_layout()
plt.show()
# --- Segment profiling ---
df['Cluster'] = km_labels
profile = df.groupby('Cluster').agg({
'Age': 'mean',
'Annual Income (k$)': 'mean',
'Spending Score (1-100)': 'mean',
'CustomerID': 'count'
}).rename(columns={'CustomerID': 'Count'}).round(1)
print("\nCluster Profiles:")
print(profile)Hyperparameters Reference
K-Means
| Parameter | Default | Guidance |
|---|---|---|
n_clusters | 8 | Use elbow + silhouette to choose |
init | 'k-means++' | Always use K-Means++ |
n_init | 10 | More runs = better chance of global optimum |
max_iter | 300 | Increase if not converging |
tol | 1e-4 | Convergence threshold for centroid movement |
DBSCAN
| Parameter | Default | Guidance |
|---|---|---|
eps | 0.5 | Use k-distance graph to choose |
min_samples | 5 | Rule of thumb: |
metric | 'euclidean' | Try 'cosine' for text data |
Agglomerative
| Parameter | Default | Guidance |
|---|---|---|
n_clusters | 2 | Use dendrogram to pick cutoff |
linkage | 'ward' | Ward for compact clusters, average for varied |
metric | 'euclidean' | Ward requires euclidean |
GMM
| Parameter | Default | Guidance |
|---|---|---|
n_components | 1 | Use BIC/AIC to select |
covariance_type | 'full' | Start full, try diag if overfitting |
n_init | 1 | Use 5-10 for stability |
reg_covar | 1e-6 | Increase if singular covariance errors |
Edge Cases and Pitfalls
Scaling Is Mandatory
K-Means and DBSCAN use distance metrics. If features have different scales, the feature with the largest range dominates. Always standardize with StandardScaler or MinMaxScaler.
Empty Clusters in K-Means
If a centroid has no assigned points, the cluster dies. Scikit-learn handles this by reinitializing the centroid, but from-scratch implementations must handle it explicitly.
DBSCAN with Varying Density
A single eps cannot capture clusters with different densities. Use HDBSCAN (hierarchical DBSCAN) instead:
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
labels = clusterer.fit_predict(X_scaled)High-Dimensional Data
Distance metrics become less meaningful in high dimensions (curse of dimensionality). Apply dimensionality reduction before clustering.
Non-Convex Clusters
K-Means assumes convex, spherical clusters. For non-convex shapes, use DBSCAN, spectral clustering, or GMM.
Categorical Features
K-Means uses Euclidean distance and cannot handle categorical features directly. Use K-Modes or K-Prototypes for mixed data, or encode categoricals properly.
Comparison Summary
| Algorithm | Best For | Weaknesses | Complexity |
|---|---|---|---|
| K-Means | Large datasets, spherical clusters | Requires K, sensitive to init | |
| DBSCAN | Arbitrary shapes, noise detection | Struggles with varying density | |
| Agglomerative | Small-medium data, dendrogram | ||
| GMM | Soft assignments, elliptical clusters | Can overfit, slow with many features | |
| HDBSCAN | Varying density, robust | More complex to tune |
Cross-References
- Dimensionality Reduction — Reduce dimensions before clustering high-dimensional data
- Anomaly Detection — Clustering-based anomaly detection
- Evaluation Metrics — Silhouette, adjusted Rand index, normalized mutual information
- Feature Engineering — Feature scaling and transformation for clustering
- Algorithm Selection Guide — When to choose clustering vs other methods