Math Foundations for Machine Learning

Every machine learning algorithm reduces to mathematical operations. Linear regression is a matrix equation. Gradient descent is calculus. Naive Bayes is probability theory. Understanding the math does not just help you debug models — it lets you invent new ones.

This page covers the four pillars: linear algebra, calculus, probability, and statistics — each with full derivations and runnable Python code.

Linear Algebra

Vectors

A vector is an ordered list of numbers. In ML, a single data point with $d$ features is a vector $x \in R^{d}$ .

python

# vectors.py — Vector operations with NumPy
import numpy as np

# A data point with 3 features
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Addition — element-wise
print(f"x + y = {x + y}")  # [5. 7. 9.]

# Scalar multiplication
print(f"3x = {3 * x}")  # [3. 6. 9.]

# Dot product — measures similarity
dot = np.dot(x, y)  # 1*4 + 2*5 + 3*6 = 32
print(f"x · y = {dot}")

# Magnitude (L2 norm)
norm_x = np.linalg.norm(x)
print(f"||x|| = {norm_x:.4f}")  # sqrt(1 + 4 + 9) = 3.7417

# Unit vector
x_hat = x / norm_x
print(f"Unit vector: {x_hat}")
print(f"||x_hat|| = {np.linalg.norm(x_hat):.4f}")  # 1.0

# Cosine similarity — angle between vectors
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"cos(x, y) = {cos_sim:.4f}")  # 0.9746 — very similar direction

The dot product has deep geometric meaning:

x \cdot y = ∥ x ∥ ∥ y ∥ \cos θ

Worked Example — Dot Product and Cosine Similarity

Two feature vectors: x = [3, 4], y = [4, 3]

Step 1: Dot product x . y = 3(4) + 4(3) = 12 + 12 = 24

Step 2: Magnitudes ||x|| = sqrt(9 + 16) = sqrt(25) = 5 ||y|| = sqrt(16 + 9) = sqrt(25) = 5

Step 3: Cosine similarity cos(theta) = 24 / (5 * 5) = 24/25 = 0.96

Step 4: Angle theta = arccos(0.96) = 16.3 degrees

Interpret: "The vectors are highly similar (cos=0.96, angle=16.3 degrees). They point in nearly the same direction. If these were document vectors, the documents would be about 96% similar in content."

When $θ = 0$ (same direction), $\cos θ = 1$ and the dot product is maximized. When $θ = 90 °$ (perpendicular), the dot product is zero. This is why the dot product measures similarity.

Matrices

A dataset with $n$ samples and $d$ features is a matrix $X \in R^{n \times d}$ . Each row is a data point, each column is a feature.

python

# matrices.py — Matrix operations with NumPy
import numpy as np

# Dataset: 4 samples, 3 features
X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])
print(f"Shape: {X.shape}")  # (4, 3) — 4 samples, 3 features

# Transpose — swap rows and columns
print(f"X^T shape: {X.T.shape}")  # (3, 4)

# Matrix multiplication — X^T X is the Gram matrix
# Used in linear regression normal equation
gram = X.T @ X  # (3,4) @ (4,3) = (3,3)
print(f"X^T X =\n{gram}")

# Inverse — only for square, full-rank matrices
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(f"A^-1 =\n{A_inv}")

# Verify: A @ A^-1 = I
identity = A @ A_inv
print(f"A @ A^-1 =\n{np.round(identity, 10)}")

# Determinant — measures how much A scales volume
det = np.linalg.det(A)
print(f"det(A) = {det:.1f}")  # 10.0 — nonzero means invertible

# Trace — sum of diagonal elements
print(f"tr(A) = {np.trace(A)}")  # 4 + 6 = 10

Matrix Multiplication Rules

For $A \in R^{m \times n}$ and $B \in R^{n \times p}$ , the product $C = AB$ has shape $m \times p$ :

C_{i j} = \sum_{k = 1}^{n} A_{i k} B_{k j}

The inner dimensions must match: $(m \times n) \times (n \times p)$ .

Eigendecomposition

An eigenvector of matrix $A$ is a vector that, when multiplied by $A$ , only gets scaled (not rotated):

A v = λ v

Worked Example — Eigenvalues and Eigenvectors

Matrix A = [[4, 2],[1, 3]]. Find eigenvalues and eigenvectors.

Step 1: Solve det(A - lambdaI) = 0 det([[4-lambda, 2],[1, 3-lambda]]) = (4-lambda)(3-lambda) - 21 = 0 lambda^2 - 7lambda + 12 - 2 = 0 lambda^2 - 7lambda + 10 = 0 (lambda - 5)(lambda - 2) = 0 lambda1 = 5, lambda2 = 2

Step 2: Find eigenvector for lambda1 = 5 (A - 5I)v = 0 -> [[-1, 2],[1, -2]]v = 0 -v1 + 2v2 = 0 -> v1 = 2v2 Eigenvector: v1 = [2, 1] (normalized: [0.894, 0.447])

Step 3: Find eigenvector for lambda2 = 2 (A - 2I)v = 0 -> [[2, 2],[1, 1]]v = 0 2v1 + 2v2 = 0 -> v1 = -v2 Eigenvector: v2 = [-1, 1] (normalized: [-0.707, 0.707])

Step 4: Verify A @ v1 = lambda1 * v1 A @ [2,1] = [4(2)+2(1), 1(2)+3(1)] = [10, 5] = 5 * [2, 1] -> correct!

Interpret: "The matrix A stretches space along [2,1] by factor 5 and along [-1,1] by factor 2. In PCA, the largest eigenvalue (5) corresponds to the direction of most variance — that's the first principal component."

where $λ$ is the eigenvalue. Eigendecomposition is the foundation of PCA (principal component analysis).

python

# eigen.py — Eigendecomposition and its role in PCA
import numpy as np

# Covariance matrix of 2D data
np.random.seed(42)
data = np.random.randn(200, 2) @ np.array([[2, 1], [1, 3]])

cov = np.cov(data.T)
print(f"Covariance matrix:\n{cov}")

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")

# Variance explained ratio — this IS PCA
total_var = eigenvalues.sum()
var_ratio = eigenvalues / total_var
print(f"\nVariance explained: {var_ratio}")
print(f"First component captures {var_ratio[0]:.1%} of variance")

# Project data onto first eigenvector (1D PCA)
projected = data @ eigenvectors[:, 0]
print(f"Original shape: {data.shape}, Projected shape: {projected.shape}")

The Normal Equation Uses All of This

Linear regression's closed-form solution ties it all together:

θ = (X^{T} X)^{- 1} X^{T} y

This requires matrix transpose, matrix multiplication, and matrix inversion — three core linear algebra operations.

Calculus

Derivatives

The derivative of $f (x)$ measures the instantaneous rate of change — the slope of the tangent line:

f^{'} (x) = lim_{h \to 0} \frac{f (x + h) - f (x)}{h}

In ML, we care about derivatives because training = minimizing a loss function, and derivatives tell us which direction to move.

python

# derivatives.py — Derivatives numerically and symbolically
import numpy as np

# Numerical derivative
def numerical_derivative(f, x, h=1e-7):
    """Central difference — more accurate than forward difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = x^2 → f'(x) = 2x
f = lambda x: x**2
x = 3.0
print(f"f'({x}) numerical = {numerical_derivative(f, x):.6f}")  # ≈ 6.0
print(f"f'({x}) analytical = {2 * x}")  # 6.0

# MSE loss derivative
# L(θ) = (1/n) Σ (yi - θxi)^2
# dL/dθ = -(2/n) Σ xi(yi - θxi)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
theta = 1.5

loss = lambda t: np.mean((y - t * X)**2)
grad_numerical = numerical_derivative(loss, theta)
grad_analytical = -2 * np.mean(X * (y - theta * X))

print(f"\nMSE loss at θ={theta}: {loss(theta):.4f}")
print(f"Gradient (numerical):  {grad_numerical:.6f}")
print(f"Gradient (analytical): {grad_analytical:.6f}")

Key Derivative Rules for ML

Rule	Formula	ML Use
Power rule	$\frac{d}{d x} x^{n} = n x^{n - 1}$	MSE loss gradient
Chain rule	$\frac{d}{d x} f (g (x)) = f^{'} (g (x)) \cdot g^{'} (x)$	Backpropagation
Product rule	$\frac{d}{d x} [f (x) g (x)] = f^{'} g + f g^{'}$	Regularized loss
Log derivative	$\frac{d}{d x} \ln x = \frac{1}{x}$	Log-likelihood
Exponential	$\frac{d}{d x} e^{x} = e^{x}$	Sigmoid, softmax

The Chain Rule

The chain rule is the single most important calculus concept for ML. It says: if $y = f (g (x))$ , then:

\frac{d y}{d x} = \frac{d y}{d g} \cdot \frac{d g}{d x}

This is how backpropagation works — it applies the chain rule repeatedly through layers of a neural network.

python

# chain_rule.py — Chain rule in action
import numpy as np

# Sigmoid function: σ(z) = 1 / (1 + e^(-z))
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Derivative of sigmoid using chain rule:
# σ(z) = (1 + e^(-z))^(-1)
# Let u = 1 + e^(-z)
# σ = u^(-1) → dσ/du = -u^(-2)
# du/dz = -e^(-z)
# dσ/dz = (-u^(-2))(-e^(-z)) = e^(-z) / (1 + e^(-z))^2
# Simplifies to: σ(z)(1 - σ(z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Verify with numerical derivative
z = 2.0
analytical = sigmoid_derivative(z)
numerical = (sigmoid(z + 1e-7) - sigmoid(z - 1e-7)) / (2e-7)
print(f"σ'({z}) analytical = {analytical:.8f}")
print(f"σ'({z}) numerical  = {numerical:.8f}")

# Full chain: loss = (y - σ(wx + b))^2
# dloss/dw = dloss/dσ · dσ/dz · dz/dw
# = -2(y - σ(z)) · σ(z)(1-σ(z)) · x
w, b, x, y_true = 0.5, 0.1, 2.0, 1.0
z = w * x + b
y_pred = sigmoid(z)

dloss_dsigma = -2 * (y_true - y_pred)
dsigma_dz = sigmoid_derivative(z)
dz_dw = x

dloss_dw = dloss_dsigma * dsigma_dz * dz_dw
print(f"\n∂loss/∂w = {dloss_dw:.6f}")

Partial Derivatives and Gradients

When a function has multiple inputs, the partial derivative measures how it changes with respect to one input while holding others fixed.

The gradient is the vector of all partial derivatives:

\nabla f (x) = [\begin{matrix} \frac{\partial f}{\partial x_{1}} \\ \frac{\partial f}{\partial x_{2}} \\ ⋮ \\ \frac{\partial f}{\partial x_{d}} \end{matrix}]

The gradient points in the direction of steepest ascent. To minimize a loss, we move in the opposite direction — this is gradient descent.

python

# gradient.py — Gradient of a function with multiple variables
import numpy as np

# f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
# ∂f/∂x1 = 2*x1 + 3*x2
# ∂f/∂x2 = 3*x1 + 2*x2

def f(x):
    return x[0]**2 + 3*x[0]*x[1] + x[1]**2

def gradient_f(x):
    return np.array([2*x[0] + 3*x[1], 3*x[0] + 2*x[1]])

# Numerical gradient for verification
def numerical_gradient(f, x, h=1e-7):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

x = np.array([1.0, 2.0])
print(f"Analytical gradient: {gradient_f(x)}")
print(f"Numerical gradient:  {numerical_gradient(f, x)}")

# Gradient descent on this function
x = np.array([5.0, 5.0])
lr = 0.1
for i in range(20):
    grad = gradient_f(x)
    x = x - lr * grad
    if i % 5 == 0:
        print(f"Step {i:2d}: x = [{x[0]:.4f}, {x[1]:.4f}], f(x) = {f(x):.4f}")

Gradient Descent Derivation

For a loss function $L (θ)$ , gradient descent updates parameters as:

θ_{t + 1} = θ_{t} - η \nabla L (θ_{t})

Worked Example — Gradient Descent Step

Minimize f(x) = (x - 3)^2. Start at x0 = 0, learning rate eta = 0.4.

Step 1: Compute gradient at x0 = 0 f'(x) = 2(x - 3) f'(0) = 2(0 - 3) = -6

Step 2: Update x x1 = 0 - 0.4 * (-6) = 0 + 2.4 = 2.4

Step 3: Next iteration at x1 = 2.4 f'(2.4) = 2(2.4 - 3) = -1.2 x2 = 2.4 - 0.4 * (-1.2) = 2.4 + 0.48 = 2.88

Step 4: Continue f'(2.88) = 2(2.88 - 3) = -0.24 x3 = 2.88 + 0.096 = 2.976

Interpret: "Starting at x=0, gradient descent moves toward the minimum at x=3. Each step reduces the distance: 3.0 -> 0.6 -> 0.12 -> 0.024. The negative gradient points uphill, so subtracting it moves us downhill."

where $η$ is the learning rate. The derivation comes from the first-order Taylor expansion:

L (θ + δ) \approx L (θ) + \nabla L (θ)^{T} δ

To decrease $L$ , choose $δ = - η \nabla L (θ)$ , giving:

L (θ + δ) \approx L (θ) - η ∥ \nabla L (θ) ∥^{2}

Since $∥ \nabla L ∥^{2} \geq 0$ , the loss decreases (for small enough $η$ ).

Probability

Basic Probability Rules

Concept	Formula	Example
Complement	$P (A^{c}) = 1 - P (A)$	$P (not spam) = 1 - P (spam)$
Union	$P (A \cup B) = P (A) + P (B) - P (A \cap B)$	$P (rain or cold)$
Conditional	$P (A ∣ B) = \frac{P (A \cap B)}{P (B)}$	$P (spam ∣ "free")$
Independence	$P (A \cap B) = P (A) P (B)$	Naive Bayes assumption

Bayes' Theorem

Bayes' theorem relates conditional probabilities:

P (A ∣ B) = \frac{P (B ∣ A) \cdot P (A)}{P (B)}

Worked Example — Bayes' Theorem (Medical Test)

A disease affects 1% of the population. A test has 95% sensitivity and 90% specificity.

P(disease) = 0.01
P(positive | disease) = 0.95
P(positive | healthy) = 0.10 (1 - specificity)

Step 1: P(positive) using law of total probability P(+) = P(+|disease)*P(disease) + P(+|healthy)*P(healthy) = 0.95 * 0.01 + 0.10 * 0.99 = 0.0095 + 0.099 = 0.1085

Step 2: P(disease | positive) using Bayes' theorem P(disease|+) = P(+|disease)*P(disease) / P(+) = (0.95 * 0.01) / 0.1085 = 0.0095 / 0.1085 = 0.0876

Step 3: Interpret Only 8.76% of positive tests actually have the disease!

Step 4: Why so low? Of 10,000 people: 100 have disease, 9900 are healthy True positives: 100 * 0.95 = 95 False positives: 9900 * 0.10 = 990 P(disease|+) = 95 / (95 + 990) = 95/1085 = 8.76%

Interpret: "This is the base rate fallacy. Despite a 95% sensitive test, the low prevalence (1%) means most positive results are false positives. You'd need a second independent test to raise confidence."

Derivation: Start from the definition of conditional probability in both directions:

P (A ∣ B) = \frac{P (A \cap B)}{P (B)}, P (B ∣ A) = \frac{P (A \cap B)}{P (A)}

From the second equation: $P (A \cap B) = P (B ∣ A) \cdot P (A)$ . Substitute into the first:

P (A ∣ B) = \frac{P (B ∣ A) \cdot P (A)}{P (B)}

Expanding $P (B)$ using the law of total probability:

P (B) = P (B ∣ A) P (A) + P (B ∣ A^{c}) P (A^{c})

python

# bayes.py — Bayes' theorem: medical test example
import numpy as np

# A disease affects 1% of the population
# Test sensitivity (true positive rate): 95%
# Test specificity (true negative rate): 90%
P_disease = 0.01
P_positive_given_disease = 0.95     # sensitivity
P_positive_given_healthy = 0.10     # 1 - specificity

# P(disease | positive) = ?
# Using Bayes' theorem:
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_healthy * (1 - P_disease))

P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print(f"P(disease)               = {P_disease:.2%}")
print(f"P(positive | disease)    = {P_positive_given_disease:.2%}")
print(f"P(positive | healthy)    = {P_positive_given_healthy:.2%}")
print(f"P(positive)              = {P_positive:.4f}")
print(f"P(disease | positive)    = {P_disease_given_positive:.2%}")
# Only ~8.8% — despite a 95% sensitive test!
# This is the base rate fallacy — low prevalence means
# most positives are false positives

# Simulation to verify
np.random.seed(42)
n = 1_000_000
has_disease = np.random.rand(n) < P_disease
test_positive = np.where(
    has_disease,
    np.random.rand(n) < P_positive_given_disease,
    np.random.rand(n) < P_positive_given_healthy
)

actual = has_disease[test_positive].mean()
print(f"\nSimulated P(disease|positive) = {actual:.2%}")

Probability Distributions

python

# distributions.py — Key distributions for ML
import numpy as np
from scipy import stats

# --- Gaussian (Normal) Distribution ---
# PDF: f(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
mu, sigma = 5.0, 2.0
x = np.linspace(-2, 12, 1000)
pdf = stats.norm.pdf(x, mu, sigma)
print(f"Normal: mean={mu}, std={sigma}")
print(f"P(3 < X < 7) = {stats.norm.cdf(7, mu, sigma) - stats.norm.cdf(3, mu, sigma):.4f}")

# --- Bernoulli Distribution ---
# P(X=1) = p, P(X=0) = 1-p
p = 0.3
samples = stats.bernoulli.rvs(p, size=10000)
print(f"\nBernoulli: p={p}, sample mean={samples.mean():.4f}")

# --- Binomial Distribution ---
# n trials, each with probability p
n_trials, p = 100, 0.3
print(f"\nBinomial: n={n_trials}, p={p}")
print(f"Expected value = np = {n_trials * p}")
print(f"P(X = 30) = {stats.binom.pmf(30, n_trials, p):.4f}")
print(f"P(X <= 25) = {stats.binom.cdf(25, n_trials, p):.4f}")

# --- Multinomial Distribution ---
# Used in Naive Bayes for text classification
probs = [0.5, 0.3, 0.2]  # word probabilities
counts = np.random.multinomial(100, probs)
print(f"\nMultinomial sample (100 words): {counts}")

# --- Poisson Distribution ---
# P(X=k) = λ^k e^(-λ) / k!
lam = 4.0
print(f"\nPoisson: λ={lam}")
print(f"P(X=3) = {stats.poisson.pmf(3, lam):.4f}")
print(f"P(X<=5) = {stats.poisson.cdf(5, lam):.4f}")

Maximum Likelihood Estimation (MLE)

MLE finds the parameters that make the observed data most probable. Given data $x_{1}, \dots, x_{n}$ and a model with parameter $θ$ :

{\hat{θ}}_{M L E} = \arg max_{θ} \prod_{i = 1}^{n} P (x_{i} ∣ θ)

Taking the log (since log is monotonic):

{\hat{θ}}_{M L E} = \arg max_{θ} \sum_{i = 1}^{n} \log P (x_{i} ∣ θ)

Worked Example — MLE for Coin Flip

Observed data: 7 heads, 3 tails in 10 flips. Find MLE for p (probability of heads).

Each flip follows Bernoulli(p): P(H) = p, P(T) = 1-p.

Step 1: Write the likelihood L(p) = p^7 * (1-p)^3

Step 2: Write the log-likelihood log L(p) = 7log(p) + 3log(1-p)

Step 3: Take derivative and set to zero d(log L)/dp = 7/p - 3/(1-p) = 0 7(1-p) = 3p 7 - 7p = 3p 7 = 10p p_MLE = 7/10 = 0.70

Step 4: Verify with different p values p=0.5: log L = 7*(-0.693) + 3*(-0.693) = -6.93 p=0.7: log L = 7*(-0.357) + 3*(-1.204) = -2.499 - 3.612 = -6.11 p=0.9: log L = 7*(-0.105) + 3*(-2.303) = -0.735 - 6.909 = -7.64

p=0.7 gives the highest log-likelihood, confirming our answer.

Interpret: "The MLE for a coin's bias given 7 heads in 10 flips is simply the observed proportion: 7/10 = 0.7. This is the most intuitive estimator, and MLE provides the mathematical justification."

python

# mle.py — MLE for Gaussian parameters
import numpy as np

# Generate data from a Gaussian
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, 1000)

# MLE for μ: take the derivative of log-likelihood, set to 0
# log L = -n/2 log(2π) - n/2 log(σ²) - 1/(2σ²) Σ(xi - μ)²
# ∂log L/∂μ = 1/σ² Σ(xi - μ) = 0
# → μ_MLE = (1/n) Σ xi = sample mean

# MLE for σ²: ∂log L/∂σ² = -n/(2σ²) + 1/(2σ⁴) Σ(xi - μ)² = 0
# → σ²_MLE = (1/n) Σ(xi - μ)²

mu_mle = np.mean(data)
sigma_mle = np.sqrt(np.mean((data - mu_mle)**2))

print(f"True μ = {true_mu}, MLE μ = {mu_mle:.4f}")
print(f"True σ = {true_sigma}, MLE σ = {sigma_mle:.4f}")

Statistics

Expectation and Variance

Expected value (mean):

E [X] = \sum_{x} x \cdot P (X = x) (discrete)

E [X] = \int_{- \infty}^{\infty} x \cdot f (x) d x (continuous)

Variance — measures spread:

Var (X) = E [(X - μ)^{2}] = E [X^{2}] - (E [X])^{2}

python

# expectation_variance.py — Computing and verifying
import numpy as np

# Discrete: fair six-sided die
outcomes = np.arange(1, 7)
probs = np.ones(6) / 6

E_X = np.sum(outcomes * probs)
E_X2 = np.sum(outcomes**2 * probs)
Var_X = E_X2 - E_X**2

print(f"E[X] = {E_X:.4f}")    # 3.5
print(f"Var(X) = {Var_X:.4f}") # 2.9167

# Verify with simulation
rolls = np.random.randint(1, 7, 100000)
print(f"\nSimulated E[X] = {rolls.mean():.4f}")
print(f"Simulated Var(X) = {rolls.var():.4f}")

Bias-Variance Tradeoff

The expected prediction error can be decomposed:

E [(y - \hat{f} (x))^{2}] = Bias [\hat{f} (x)]^{2} + Var [\hat{f} (x)] + σ^{2}

Worked Example — Bias-Variance Tradeoff

True f(x) = 5. Noise sigma = 0.5. Three models trained on different data:

Model Complexity	Predictions over datasets	E[f_hat]	Bias^2	Var
Too simple (const=3)	[3, 3, 3, 3, 3]	3.0	(5-3)^2 = 4.0	0.0
Just right	[4.8, 5.3, 4.7, 5.1, 5.1]	5.0	(5-5)^2 = 0.0	0.04
Too complex	[3.5, 6.5, 4.0, 7.0, 4.0]	5.0	(5-5)^2 = 0.0	2.10

Expected prediction errors (sigma^2 = 0.25): Simple: EPE = 4.0 + 0.0 + 0.25 = 4.25 Right: EPE = 0.0 + 0.04 + 0.25 = 0.29 Complex: EPE = 0.0 + 2.10 + 0.25 = 2.35

Interpret: "The simple model has zero variance but huge bias (always predicts 3 instead of 5). The complex model is unbiased on average but varies wildly. The just-right model achieves the lowest total error (0.29) with a good bias-variance balance. No model can go below 0.25 (the irreducible noise)."

where:

Bias = $E [\hat{f} (x)] - f (x)$ — error from wrong assumptions
Variance = $E [(\hat{f} (x) - E [\hat{f} (x)])^{2}]$ — error from sensitivity to training data
$σ^{2}$ = irreducible noise

Derivation:

Let $y = f (x) + ϵ$ where $E [ϵ] = 0$ and $Var (ϵ) = σ^{2}$ .

E [(y - \hat{f})^{2}] = E [(f + ϵ - \hat{f})^{2}]

= E [(f - \hat{f})^{2}] + E [ϵ^{2}] + 2 E [(f - \hat{f}) ϵ]

Since $ϵ$ is independent of $\hat{f}$ , the cross term vanishes:

= E [(f - \hat{f})^{2}] + σ^{2}

Now expand $(f - \hat{f})^{2}$ by adding and subtracting $E [\hat{f}]$ :

E [(f - \hat{f})^{2}] = (f - E [\hat{f}])^{2} + E [(\hat{f} - E [\hat{f}])^{2}]

= {Bias}^{2} + Variance

python

# bias_variance.py — Demonstrate the tradeoff
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# True function
def true_f(x):
    return np.sin(2 * x) + 0.5 * x

# Generate many datasets to estimate bias and variance
np.random.seed(42)
n_datasets = 200
n_train = 30
x_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_true = true_f(x_test.ravel())

for degree in [1, 4, 15]:
    predictions = np.zeros((n_datasets, len(x_test)))

    for i in range(n_datasets):
        x_train = np.sort(np.random.uniform(0, 5, n_train)).reshape(-1, 1)
        y_train = true_f(x_train.ravel()) + np.random.normal(0, 0.5, n_train)

        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_train, y_train)
        predictions[i] = model.predict(x_test)

    mean_pred = predictions.mean(axis=0)
    bias_sq = np.mean((mean_pred - y_true)**2)
    variance = np.mean(predictions.var(axis=0))
    mse = np.mean((predictions - y_true)**2)

    print(f"Degree {degree:2d}: Bias²={bias_sq:.4f}, Var={variance:.4f}, "
          f"Bias²+Var={bias_sq + variance:.4f}, MSE={mse:.4f}")
# Degree 1 — high bias (underfitting)
# Degree 4 — good balance
# Degree 15 — high variance (overfitting)

Covariance and Correlation

Covariance measures how two variables move together:

Cov (X, Y) = E [(X - μ_{X}) (Y - μ_{Y})]

Pearson correlation normalizes to $[- 1, 1]$ :

ρ_{X, Y} = \frac{Cov (X, Y)}{σ_{X} σ_{Y}}

python

# correlation.py — Covariance, correlation, and their matrix forms
import numpy as np

# Generate correlated data
np.random.seed(42)
n = 500
x1 = np.random.randn(n)
x2 = 0.8 * x1 + 0.6 * np.random.randn(n)  # correlated with x1
x3 = np.random.randn(n)  # independent

data = np.column_stack([x1, x2, x3])

# Covariance matrix — used in PCA, Gaussian distributions
cov_matrix = np.cov(data.T)
print("Covariance matrix:")
print(np.round(cov_matrix, 3))

# Correlation matrix — used in feature selection
corr_matrix = np.corrcoef(data.T)
print("\nCorrelation matrix:")
print(np.round(corr_matrix, 3))

# x1 and x2 are correlated (~0.8), x3 is independent (~0.0)
print(f"\ncorr(x1, x2) = {corr_matrix[0, 1]:.3f}")
print(f"corr(x1, x3) = {corr_matrix[0, 2]:.3f}")

Hypothesis Testing Intuition

While you rarely do formal hypothesis testing in ML, the concepts appear everywhere:

p-value in feature importance tests
Confidence intervals for model performance
A/B testing for model comparison in production

python

# hypothesis.py — Comparing two model performances
import numpy as np
from scipy import stats

# Two models evaluated on 30 test folds
np.random.seed(42)
model_a_scores = np.random.normal(0.82, 0.03, 30)
model_b_scores = np.random.normal(0.85, 0.03, 30)

# Paired t-test — are the models significantly different?
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"Model A mean: {model_a_scores.mean():.4f}")
print(f"Model B mean: {model_b_scores.mean():.4f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")

if p_value < 0.05:
    print("Difference is statistically significant at α=0.05")
else:
    print("No statistically significant difference")

# Bootstrap confidence interval for Model B accuracy
n_bootstrap = 10000
bootstrap_means = np.array([
    np.random.choice(model_b_scores, size=len(model_b_scores), replace=True).mean()
    for _ in range(n_bootstrap)
])
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"\n95% CI for Model B: [{ci_lower:.4f}, {ci_upper:.4f}]")

Information Theory

Entropy

Entropy measures the uncertainty of a random variable:

H (X) = - \sum_{i = 1}^{n} p_{i} \log_{2} p_{i}

Worked Example — Shannon Entropy

Compare entropy of three distributions over {A, B, C}:

Distribution 1: Uniform [1/3, 1/3, 1/3] H = -(1/3)log2(1/3) - (1/3)log2(1/3) - (1/3)log2(1/3) = -3 * (1/3)(-1.585) = 3 * 0.528 = 1.585 bits

Distribution 2: Skewed [0.8, 0.1, 0.1] H = -0.8log2(0.8) - 0.1log2(0.1) - 0.1*log2(0.1) = -0.8(-0.322) - 0.1(-3.322) - 0.1(-3.322) = 0.258 + 0.332 + 0.332 = 0.922 bits

Distribution 3: Certain [1.0, 0.0, 0.0] H = -1.0log2(1.0) = -1.00 = 0 bits

Interpret: "Uniform distribution has maximum entropy (1.585 bits) — maximum uncertainty about which class. The skewed distribution has less entropy (0.922) — we can guess 'A' most of the time. A certain distribution has zero entropy — no uncertainty at all. Decision trees seek splits that minimize child entropy."

Maximum entropy: uniform distribution (maximum uncertainty). Minimum entropy: deterministic (zero uncertainty).

python

# entropy.py — Entropy and its role in decision trees
import numpy as np

def entropy(probs):
    """Shannon entropy in bits."""
    probs = np.array(probs)
    probs = probs[probs > 0]  # avoid log(0)
    return -np.sum(probs * np.log2(probs))

# Fair coin: maximum entropy for binary
print(f"Fair coin entropy:   {entropy([0.5, 0.5]):.4f} bits")  # 1.0

# Biased coin: lower entropy
print(f"Biased (0.9, 0.1):   {entropy([0.9, 0.1]):.4f} bits")  # 0.469

# Certain outcome: zero entropy
print(f"Certain (1.0, 0.0):  {entropy([1.0, 0.0]):.4f} bits")  # 0.0

# 3-class classification: how pure is this split?
print(f"\nPure split [1, 0, 0]:   {entropy([1.0, 0.0, 0.0]):.4f}")
print(f"Mixed [0.33,0.33,0.34]: {entropy([0.33, 0.33, 0.34]):.4f}")
print(f"Skewed [0.8, 0.1, 0.1]: {entropy([0.8, 0.1, 0.1]):.4f}")

KL Divergence

Measures how one probability distribution differs from another:

D_{K L} (P ∥ Q) = \sum_{x} P (x) \log \frac{P (x)}{Q (x)}

Worked Example — KL Divergence and Cross-Entropy

True distribution P = [0, 0, 1] (class 3 is correct). Two model predictions:

Good model: Q1 = [0.05, 0.05, 0.90]
Bad model: Q2 = [0.40, 0.30, 0.30]

Step 1: Cross-entropy H(P, Q1) (good model) H(P, Q1) = -0ln(0.05) - 0ln(0.05) - 1*ln(0.90) = -ln(0.90) = 0.105

Step 2: Cross-entropy H(P, Q2) (bad model) H(P, Q2) = -0ln(0.40) - 0ln(0.30) - 1*ln(0.30) = -ln(0.30) = 1.204

Step 3: KL divergence (since P is one-hot, KL = cross-entropy - entropy of P) H(P) = 0 (deterministic distribution) D_KL(P||Q1) = 0.105 - 0 = 0.105 D_KL(P||Q2) = 1.204 - 0 = 1.204

Interpret: "The good model's cross-entropy (0.105) is much lower than the bad model's (1.204). Minimizing cross-entropy during training is equivalent to minimizing KL divergence — making the model's predicted distribution closer to the true distribution."

KL divergence is not symmetric: $D_{K L} (P ∥ Q) \neq D_{K L} (Q ∥ P)$ .

Cross-entropy (used as loss in classification) is related:

H (P, Q) = - \sum_{x} P (x) \log Q (x) = H (P) + D_{K L} (P ∥ Q)

Since $H (P)$ is constant during training, minimizing cross-entropy is equivalent to minimizing KL divergence.

python

# kl_divergence.py — Cross-entropy loss connection
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy between true distribution p and predicted q."""
    q = np.clip(q, 1e-15, 1)  # avoid log(0)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / np.clip(q[mask], 1e-15, 1)))

# True distribution (one-hot for class 2 in 3-class problem)
p = np.array([0.0, 0.0, 1.0])

# Good prediction
q_good = np.array([0.05, 0.05, 0.90])
# Bad prediction
q_bad = np.array([0.40, 0.30, 0.30])

print(f"Cross-entropy (good): {cross_entropy(p, q_good):.4f}")
print(f"Cross-entropy (bad):  {cross_entropy(p, q_bad):.4f}")
print(f"KL divergence (good): {kl_divergence(p, q_good):.4f}")
print(f"KL divergence (bad):  {kl_divergence(p, q_bad):.4f}")

Putting It All Together: Math in a Real ML Pipeline

python

# full_pipeline_math.py — See all four pillars in action
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data — LINEAR ALGEBRA: data is a matrix
iris = load_iris()
X, y = iris.data, iris.target
print(f"Data matrix shape: {X.shape}")  # (150, 4)

# Train/test split — PROBABILITY: random sampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling — STATISTICS: standardize to zero mean, unit variance
# z = (x - μ) / σ
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Train mean: {X_train_scaled.mean(axis=0).round(10)}")  # ~0
print(f"Train std:  {X_train_scaled.std(axis=0).round(4)}")     # ~1

# Logistic Regression — CALCULUS: gradient descent minimizes cross-entropy
# The model learns: P(y=k|x) = softmax(Wx + b)
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Weights — LINEAR ALGEBRA: weight matrix W ∈ R^(3×4)
print(f"\nWeight matrix shape: {model.coef_.shape}")
print(f"Weights:\n{model.coef_.round(3)}")

# Prediction — LINEAR ALGEBRA: matrix multiply + CALCULUS: softmax
y_pred = model.predict(X_test_scaled)

# Evaluation — STATISTICS: accuracy is an estimator of true performance
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")

# PROBABILITY: predicted probabilities via softmax
probs = model.predict_proba(X_test_scaled)[:3]
print(f"\nPredicted probabilities (first 3):\n{probs.round(4)}")

Quick Reference

Math Concept	ML Application	Key Formula
Dot product	Similarity, linear models	$w \cdot x = \sum w_{i} x_{i}$
Matrix inverse	Normal equation	$θ = (X^{T} X)^{- 1} X^{T} y$
Eigenvalues	PCA	$A v = λ v$
Derivative	Gradient descent	$θ \leftarrow θ - η \frac{\partial L}{\partial θ}$
Chain rule	Backpropagation	$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$
Bayes' theorem	Naive Bayes, Bayesian models	$P (A ∣ B) = \frac{P (B ∣ A) P (A)}{P (B)}$
MLE	Parameter estimation	$\hat{θ} = \arg max \sum \log P (x_{i} ∣ θ)$
Bias-variance	Model selection	$Error = {Bias}^{2} + Var + Noise$
Entropy	Decision trees, cross-entropy loss	$H = - \sum p_{i} \log p_{i}$
KL divergence	Cross-entropy loss	$D_{K L} (P \| Q) = \sum P \log \frac{P}{Q}$

Math Foundations for Machine Learning ​

Linear Algebra ​

Vectors ​

Matrices ​

Matrix Multiplication Rules ​

Eigendecomposition ​

The Normal Equation Uses All of This ​

Calculus ​

Derivatives ​

Key Derivative Rules for ML ​

The Chain Rule ​

Partial Derivatives and Gradients ​

Gradient Descent Derivation ​

Probability ​

Basic Probability Rules ​

Bayes' Theorem ​

Probability Distributions ​

Maximum Likelihood Estimation (MLE) ​

Statistics ​

Expectation and Variance ​

Bias-Variance Tradeoff ​

Covariance and Correlation ​

Hypothesis Testing Intuition ​

Information Theory ​

Entropy ​

KL Divergence ​

Putting It All Together: Math in a Real ML Pipeline ​

Quick Reference ​

Further Reading ​

Related Pages

Math Foundations for Machine Learning

Linear Algebra

Vectors

Matrices

Matrix Multiplication Rules

Eigendecomposition

The Normal Equation Uses All of This

Calculus

Derivatives

Key Derivative Rules for ML

The Chain Rule

Partial Derivatives and Gradients

Gradient Descent Derivation

Probability

Basic Probability Rules

Bayes' Theorem

Probability Distributions

Maximum Likelihood Estimation (MLE)

Statistics

Expectation and Variance

Bias-Variance Tradeoff

Covariance and Correlation

Hypothesis Testing Intuition

Information Theory

Entropy

KL Divergence

Putting It All Together: Math in a Real ML Pipeline

Quick Reference

Further Reading