Support Vector Machines (SVM)

Support Vector Machines find the hyperplane that maximizes the margin between classes. The math is elegant, the kernel trick is powerful, and SVMs remain competitive for small-to-medium datasets — especially in high dimensions.

Maximum Margin Classification

The Idea

Given linearly separable data, infinitely many hyperplanes can separate the classes. SVM finds the one that maximizes the margin — the distance between the hyperplane and the nearest data points from each class.

The decision function is:

f (x) = w^{T} x + b

Points with $f (x) > 0$ belong to class +1, points with $f (x) < 0$ belong to class -1.

The Margin

The distance from a point $x_{i}$ to the hyperplane is:

distance = \frac{| w^{T} x_{i} + b |}{∥ w ∥}

For the nearest points (support vectors), we normalize so that $| w^{T} x_{i} + b | = 1$ . The margin becomes:

margin = \frac{2}{∥ w ∥}

Worked Example — SVM Margin Calculation

2D dataset with w = [2, 1], b = -5:

Step 1: Compute ||w|| ||w|| = sqrt(2^2 + 1^2) = sqrt(4 + 1) = sqrt(5) = 2.236

Step 2: Compute margin margin = 2 / ||w|| = 2 / 2.236 = 0.894

Step 3: Compute distance of point x = [3, 2] from the hyperplane distance = |w^T x + b| / ||w|| = |2(3) + 1(2) - 5| / 2.236 = |6 + 2 - 5| / 2.236 = |3| / 2.236 = 1.342

Step 4: Classify the point f(x) = 2(3) + 1(2) - 5 = 3 > 0 -> class +1

Interpret: "The margin is 0.894 units wide. Point [3,2] is 1.342 units from the boundary (well outside the margin of 0.447 on each side), so it's confidently classified as +1."

The Optimization Problem

Maximize the margin = minimize $∥ w ∥^{2}$ :

Primal problem:

min_{w, b} \frac{1}{2} ∥ w ∥^{2}

subject to:

y_{i} (w^{T} x_{i} + b) \geq 1, \forall i = 1, \dots, n

This is a convex quadratic program with linear constraints.

Lagrangian Dual Formulation

Setting Up the Lagrangian

Introduce Lagrange multipliers $α_{i} \geq 0$ for each constraint:

L (w, b, α) = \frac{1}{2} ∥ w ∥^{2} - \sum_{i = 1}^{n} α_{i} [y_{i} (w^{T} x_{i} + b) - 1]

Deriving the Dual

Take partial derivatives and set to zero:

\frac{\partial L}{\partial w} = w - \sum_{i = 1}^{n} α_{i} y_{i} x_{i} = 0 ⟹ w = \sum_{i = 1}^{n} α_{i} y_{i} x_{i}

\frac{\partial L}{\partial b} = - \sum_{i = 1}^{n} α_{i} y_{i} = 0 ⟹ \sum_{i = 1}^{n} α_{i} y_{i} = 0

Substituting back into the Lagrangian:

Dual problem:

max_{α} \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} y_{i} y_{j} x_{i}^{T} x_{j}

subject to:

α_{i} \geq 0, \sum_{i = 1}^{n} α_{i} y_{i} = 0

KKT Conditions

The Karush-Kuhn-Tucker conditions give us:

α_{i} [y_{i} (w^{T} x_{i} + b) - 1] = 0

This means either $α_{i} = 0$ (the point is not a support vector) or $y_{i} (w^{T} x_{i} + b) = 1$ (the point is on the margin — a support vector).

Key insight: Only support vectors have $α_{i} > 0$ . The solution depends only on a small subset of training points.

python

# svm_math.py — Visualizing margin and support vectors
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Create linearly separable data
X, y = make_blobs(n_samples=100, centers=2, random_state=42, cluster_std=1.2)
y = 2 * y - 1  # Convert to {-1, +1}

# Train linear SVM
svm = SVC(kernel='linear', C=100)  # large C ≈ hard margin
svm.fit(X, y)

# Get decision function parameters
w = svm.coef_[0]
b = svm.intercept_[0]

print(f"Weight vector w: {w}")
print(f"Bias b: {b:.4f}")
print(f"||w|| = {np.linalg.norm(w):.4f}")
print(f"Margin = 2/||w|| = {2/np.linalg.norm(w):.4f}")
print(f"Number of support vectors: {svm.n_support_}")

# Plot
fig, ax = plt.subplots(1, 1, figsize=(10, 8))

# Data points
ax.scatter(X[y==1, 0], X[y==1, 1], c='blue', marker='o', s=50, label='Class +1')
ax.scatter(X[y==-1, 0], X[y==-1, 1], c='red', marker='s', s=50, label='Class -1')

# Support vectors
sv = svm.support_vectors_
ax.scatter(sv[:, 0], sv[:, 1], s=200, facecolors='none', edgecolors='black',
           linewidths=2, label='Support Vectors')

# Decision boundary and margins
xx = np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 200)
yy = -(w[0] * xx + b) / w[1]
yy_up = -(w[0] * xx + b - 1) / w[1]
yy_down = -(w[0] * xx + b + 1) / w[1]

ax.plot(xx, yy, 'k-', linewidth=2, label='Decision boundary')
ax.plot(xx, yy_up, 'k--', linewidth=1, label='Margin')
ax.plot(xx, yy_down, 'k--', linewidth=1)

ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('SVM: Maximum Margin Classification')
ax.legend()
ax.set_xlim(X[:, 0].min()-1, X[:, 0].max()+1)
ax.set_ylim(X[:, 1].min()-1, X[:, 1].max()+1)
plt.tight_layout()
plt.savefig('svm_margin.png', dpi=150)
plt.show()

Soft Margin SVM

Real data is rarely linearly separable. Soft margin SVM allows some points to violate the margin using slack variables $ξ_{i} \geq 0$ :

min_{w, b, ξ} \frac{1}{2} ∥ w ∥^{2} + C \sum_{i = 1}^{n} ξ_{i}

subject to:

y_{i} (w^{T} x_{i} + b) \geq 1 - ξ_{i}, ξ_{i} \geq 0

The parameter $C$ controls the trade-off:

Large $C$ : Small margin, few violations (close to hard margin)
Small $C$ : Large margin, more violations allowed

python

# soft_margin.py — Effect of C parameter
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42, n_clusters_per_class=1)
scaler = StandardScaler()
X = scaler.fit_transform(X)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, C in zip(axes, [0.01, 1.0, 100.0]):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X, y)

    # Decision boundary
    xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, colors=['red', 'blue'])
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='black', linewidths=[1, 2, 1],
               linestyles=['--', '-', '--'])
    ax.scatter(X[y==0, 0], X[y==0, 1], c='red', marker='o', s=30)
    ax.scatter(X[y==1, 0], X[y==1, 1], c='blue', marker='s', s=30)
    sv = svm.support_vectors_
    ax.scatter(sv[:, 0], sv[:, 1], s=100, facecolors='none', edgecolors='black')
    ax.set_title(f'C={C} ({len(sv)} support vectors)')

plt.tight_layout()
plt.savefig('soft_margin.png', dpi=150)
plt.show()

The Kernel Trick

Why Kernels?

When data is not linearly separable, we can map it to a higher-dimensional space where it is. A kernel function $K (x_{i}, x_{j})$ computes the dot product in the transformed space without explicitly computing the transformation:

K (x_{i}, x_{j}) = ϕ (x_{i})^{T} ϕ (x_{j})

This is computationally brilliant — the dual problem only needs dot products between data points, so we replace $x_{i}^{T} x_{j}$ with $K (x_{i}, x_{j})$ .

Common Kernels

Kernel	Formula	Parameters
Linear	$K (x, z) = x^{T} z$	None
Polynomial	$K (x, z) = (γ x^{T} z + r)^{d}$	$γ$ , $r$ , degree $d$
RBF (Gaussian)	$K (x, z) = \exp (- γ \| x - z \|^{2})$	$γ$
Sigmoid	$K (x, z) = \tanh (γ x^{T} z + r)$	$γ$ , $r$

Worked Example — Kernel Computations

Two points: x = [1, 2], z = [3, 1], gamma = 0.5:

Step 1: Linear kernel K_linear = x^T z = 1(3) + 2(1) = 3 + 2 = 5

Step 2: Polynomial kernel (degree=2, r=1, gamma=1) K_poly = (1 * x^T z + 1)^2 = (1*5 + 1)^2 = 6^2 = 36

Step 3: RBF kernel (gamma=0.5) ||x - z||^2 = (1-3)^2 + (2-1)^2 = 4 + 1 = 5 K_rbf = exp(-0.5 * 5) = exp(-2.5) = 0.082

Step 4: RBF with same point (x = z) ||x - x||^2 = 0 K_rbf = exp(-0.5 * 0) = exp(0) = 1.0

Interpret: "The RBF kernel gives 1.0 for identical points and decays to 0 as points move apart. With gamma=0.5, points 5 squared-distance units apart have similarity 0.082 (very dissimilar). Higher gamma makes the kernel more local."

RBF Kernel Deep Dive

The RBF kernel maps data to an infinite-dimensional space. The $γ$ parameter controls the influence radius:

Small $γ$ : Large radius, smooth boundary, can underfit
Large $γ$ : Small radius, tight boundary, can overfit

python

# kernel_demo.py — Different kernels on non-linear data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_circles, make_moons

datasets = [
    ("Circles", make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=42)),
    ("Moons", make_moons(n_samples=300, noise=0.15, random_state=42)),
]

kernels = ['linear', 'poly', 'rbf']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

for row, (name, (X, y)) in enumerate(datasets):
    for col, kernel in enumerate(kernels):
        ax = axes[row, col]
        svm = SVC(kernel=kernel, C=1.0, gamma='scale', degree=3)
        svm.fit(X, y)

        xx, yy = np.meshgrid(
            np.linspace(X[:, 0].min()-0.5, X[:, 0].max()+0.5, 200),
            np.linspace(X[:, 1].min()-0.5, X[:, 1].max()+0.5, 200)
        )
        Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

        ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
        ax.scatter(X[y==0, 0], X[y==0, 1], c='red', s=20)
        ax.scatter(X[y==1, 0], X[y==1, 1], c='blue', s=20)
        acc = svm.score(X, y)
        ax.set_title(f'{name} — {kernel} (acc={acc:.2f})')

plt.tight_layout()
plt.savefig('kernel_comparison.png', dpi=150)
plt.show()

Effect of Gamma on RBF

python

# gamma_effect.py — How gamma affects RBF SVM
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

fig, axes = plt.subplots(1, 4, figsize=(20, 4))

for ax, gamma in zip(axes, [0.1, 1.0, 10.0, 100.0]):
    svm = SVC(kernel='rbf', gamma=gamma, C=1.0)
    svm.fit(X, y)

    xx, yy = np.meshgrid(
        np.linspace(-1.5, 2.5, 200),
        np.linspace(-1, 1.5, 200)
    )
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.7)
    ax.scatter(X[y==0, 0], X[y==0, 1], c='red', s=20, edgecolors='black', linewidths=0.5)
    ax.scatter(X[y==1, 0], X[y==1, 1], c='blue', s=20, edgecolors='black', linewidths=0.5)
    n_sv = svm.n_support_.sum()
    ax.set_title(f'gamma={gamma} ({n_sv} SVs)')

plt.tight_layout()
plt.savefig('rbf_gamma.png', dpi=150)
plt.show()

From-Scratch Linear SVM

python

# svm_scratch.py — Linear SVM using gradient descent on hinge loss
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

class LinearSVMScratch:
    """Linear SVM using sub-gradient descent on hinge loss.

    Minimizes: (1/n) Σ max(0, 1 - yi(w·xi + b)) + λ||w||²
    """

    def __init__(self, learning_rate=0.01, lambda_param=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.lambda_param = lambda_param
        self.n_iterations = n_iterations
        self.w = None
        self.b = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Convert labels to {-1, +1}
        y_ = np.where(y <= 0, -1, 1)

        self.w = np.zeros(n_features)
        self.b = 0.0

        for epoch in range(self.n_iterations):
            for i in range(n_samples):
                # Check if point satisfies margin condition
                condition = y_[i] * (np.dot(X[i], self.w) + self.b) >= 1

                if condition:
                    # Correctly classified with margin — only regularization gradient
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    # Misclassified or within margin — hinge loss gradient
                    self.w -= self.lr * (2 * self.lambda_param * self.w - y_[i] * X[i])
                    self.b -= self.lr * (-y_[i])

        return self

    def predict(self, X):
        decision = np.dot(X, self.w) + self.b
        return np.where(decision >= 0, 1, 0)

    def decision_function(self, X):
        return np.dot(X, self.w) + self.b


# Test
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

svm = LinearSVMScratch(learning_rate=0.001, lambda_param=0.01, n_iterations=500)
svm.fit(X_train_s, y_train)
y_pred = svm.predict(X_test_s)
print(f"From-scratch accuracy: {accuracy_score(y_test, y_pred):.4f}")

from sklearn.svm import SVC
sk_svm = SVC(kernel='linear', C=1.0)
sk_svm.fit(X_train_s, y_train)
print(f"sklearn accuracy:      {sk_svm.score(X_test_s, y_test):.4f}")

End-to-End Examples

Iris Dataset

python

# iris_svm.py — SVM on Iris with kernel comparison
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

iris = load_iris()
X, y = iris.data, iris.target

kernels = ['linear', 'poly', 'rbf']
for kernel in kernels:
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{kernel:>8}: {scores.mean():.4f} +/- {scores.std():.4f}")

# Grid search for RBF
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 'auto', 0.01, 0.1, 1],
}

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(f"\nBest: {grid.best_score_:.4f} with {grid.best_params_}")

Digits Dataset

python

# digits_svm.py — SVM on handwritten digits (high dimensional)
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
import time

digits = load_digits()
X, y = digits.data, digits.target
print(f"Shape: {X.shape}")  # (1797, 64) — 8x8 pixel images

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# RBF SVM excels in high dimensions
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10, gamma=0.001))

start = time.time()
pipe.fit(X_train, y_train)
train_time = time.time() - start

y_pred = pipe.predict(X_test)
print(f"Training time: {train_time:.3f}s")
print(f"Accuracy: {pipe.score(X_test, y_test):.4f}")
print(classification_report(y_test, y_pred))

SVM Strengths and Weaknesses

Aspect	Assessment
Strengths
High-dimensional data	Excellent — kernel trick scales well
Small-medium datasets	Very competitive
Margin-based	Robust to outliers (with proper C)
Theoretical foundations	Strong generalization bounds
Weaknesses
Large datasets ( $n > 100 K$ )	$O (n^{2})$ to $O (n^{3})$ training time
Feature scaling required	Always need StandardScaler
Probability estimates	Not native — need Platt scaling
Interpretability	Black box — no feature importance
Multi-class	One-vs-one or one-vs-rest (slower)

Support Vector Machines (SVM) ​

Maximum Margin Classification ​

The Idea ​

The Margin ​

The Optimization Problem ​

Lagrangian Dual Formulation ​

Setting Up the Lagrangian ​

Deriving the Dual ​

KKT Conditions ​

Soft Margin SVM ​

The Kernel Trick ​

Why Kernels? ​

Common Kernels ​

RBF Kernel Deep Dive ​

Effect of Gamma on RBF ​

From-Scratch Linear SVM ​

End-to-End Examples ​

Iris Dataset ​

Digits Dataset ​

SVM Strengths and Weaknesses ​

Further Reading ​

Related Pages

Support Vector Machines (SVM)

Maximum Margin Classification

The Idea

The Margin

The Optimization Problem

Lagrangian Dual Formulation

Setting Up the Lagrangian

Deriving the Dual

KKT Conditions

Soft Margin SVM

The Kernel Trick

Why Kernels?

Common Kernels

RBF Kernel Deep Dive

Effect of Gamma on RBF

From-Scratch Linear SVM

End-to-End Examples

Iris Dataset

Digits Dataset

SVM Strengths and Weaknesses

Further Reading