Skip to content
Unverified — AI-generated content. Help verify this page

Math Foundations for Machine Learning

Every machine learning algorithm reduces to mathematical operations. Linear regression is a matrix equation. Gradient descent is calculus. Naive Bayes is probability theory. Understanding the math does not just help you debug models — it lets you invent new ones.

This page covers the four pillars: linear algebra, calculus, probability, and statistics — each with full derivations and runnable Python code.


Linear Algebra

Vectors

A vector is an ordered list of numbers. In ML, a single data point with d features is a vector xRd.

python
# vectors.py — Vector operations with NumPy
import numpy as np

# A data point with 3 features
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Addition — element-wise
print(f"x + y = {x + y}")  # [5. 7. 9.]

# Scalar multiplication
print(f"3x = {3 * x}")  # [3. 6. 9.]

# Dot product — measures similarity
dot = np.dot(x, y)  # 1*4 + 2*5 + 3*6 = 32
print(f"x · y = {dot}")

# Magnitude (L2 norm)
norm_x = np.linalg.norm(x)
print(f"||x|| = {norm_x:.4f}")  # sqrt(1 + 4 + 9) = 3.7417

# Unit vector
x_hat = x / norm_x
print(f"Unit vector: {x_hat}")
print(f"||x_hat|| = {np.linalg.norm(x_hat):.4f}")  # 1.0

# Cosine similarity — angle between vectors
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"cos(x, y) = {cos_sim:.4f}")  # 0.9746 — very similar direction

The dot product has deep geometric meaning:

xy=xycosθ
Worked Example — Dot Product and Cosine Similarity

Two feature vectors: x = [3, 4], y = [4, 3]

Step 1: Dot product x . y = 3(4) + 4(3) = 12 + 12 = 24

Step 2: Magnitudes ||x|| = sqrt(9 + 16) = sqrt(25) = 5 ||y|| = sqrt(16 + 9) = sqrt(25) = 5

Step 3: Cosine similarity cos(theta) = 24 / (5 * 5) = 24/25 = 0.96

Step 4: Angle theta = arccos(0.96) = 16.3 degrees

Interpret: "The vectors are highly similar (cos=0.96, angle=16.3 degrees). They point in nearly the same direction. If these were document vectors, the documents would be about 96% similar in content."

When θ=0 (same direction), cosθ=1 and the dot product is maximized. When θ=90° (perpendicular), the dot product is zero. This is why the dot product measures similarity.

Matrices

A dataset with n samples and d features is a matrix XRn×d. Each row is a data point, each column is a feature.

python
# matrices.py — Matrix operations with NumPy
import numpy as np

# Dataset: 4 samples, 3 features
X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])
print(f"Shape: {X.shape}")  # (4, 3) — 4 samples, 3 features

# Transpose — swap rows and columns
print(f"X^T shape: {X.T.shape}")  # (3, 4)

# Matrix multiplication — X^T X is the Gram matrix
# Used in linear regression normal equation
gram = X.T @ X  # (3,4) @ (4,3) = (3,3)
print(f"X^T X =\n{gram}")

# Inverse — only for square, full-rank matrices
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(f"A^-1 =\n{A_inv}")

# Verify: A @ A^-1 = I
identity = A @ A_inv
print(f"A @ A^-1 =\n{np.round(identity, 10)}")

# Determinant — measures how much A scales volume
det = np.linalg.det(A)
print(f"det(A) = {det:.1f}")  # 10.0 — nonzero means invertible

# Trace — sum of diagonal elements
print(f"tr(A) = {np.trace(A)}")  # 4 + 6 = 10

Matrix Multiplication Rules

For ARm×n and BRn×p, the product C=AB has shape m×p:

Cij=k=1nAikBkj

The inner dimensions must match: (m×n)×(n×p).

Eigendecomposition

An eigenvector of matrix A is a vector that, when multiplied by A, only gets scaled (not rotated):

Av=λv
Worked Example — Eigenvalues and Eigenvectors

Matrix A = [[4, 2],[1, 3]]. Find eigenvalues and eigenvectors.

Step 1: Solve det(A - lambdaI) = 0 det([[4-lambda, 2],[1, 3-lambda]]) = (4-lambda)(3-lambda) - 21 = 0 lambda^2 - 7lambda + 12 - 2 = 0 lambda^2 - 7lambda + 10 = 0 (lambda - 5)(lambda - 2) = 0 lambda1 = 5, lambda2 = 2

Step 2: Find eigenvector for lambda1 = 5 (A - 5I)v = 0 -> [[-1, 2],[1, -2]]v = 0 -v1 + 2v2 = 0 -> v1 = 2v2 Eigenvector: v1 = [2, 1] (normalized: [0.894, 0.447])

Step 3: Find eigenvector for lambda2 = 2 (A - 2I)v = 0 -> [[2, 2],[1, 1]]v = 0 2v1 + 2v2 = 0 -> v1 = -v2 Eigenvector: v2 = [-1, 1] (normalized: [-0.707, 0.707])

Step 4: Verify A @ v1 = lambda1 * v1 A @ [2,1] = [4(2)+2(1), 1(2)+3(1)] = [10, 5] = 5 * [2, 1] -> correct!

Interpret: "The matrix A stretches space along [2,1] by factor 5 and along [-1,1] by factor 2. In PCA, the largest eigenvalue (5) corresponds to the direction of most variance — that's the first principal component."

where λ is the eigenvalue. Eigendecomposition is the foundation of PCA (principal component analysis).

python
# eigen.py — Eigendecomposition and its role in PCA
import numpy as np

# Covariance matrix of 2D data
np.random.seed(42)
data = np.random.randn(200, 2) @ np.array([[2, 1], [1, 3]])

cov = np.cov(data.T)
print(f"Covariance matrix:\n{cov}")

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")

# Variance explained ratio — this IS PCA
total_var = eigenvalues.sum()
var_ratio = eigenvalues / total_var
print(f"\nVariance explained: {var_ratio}")
print(f"First component captures {var_ratio[0]:.1%} of variance")

# Project data onto first eigenvector (1D PCA)
projected = data @ eigenvectors[:, 0]
print(f"Original shape: {data.shape}, Projected shape: {projected.shape}")

The Normal Equation Uses All of This

Linear regression's closed-form solution ties it all together:

θ=(XTX)1XTy

This requires matrix transpose, matrix multiplication, and matrix inversion — three core linear algebra operations.


Calculus

Derivatives

The derivative of f(x) measures the instantaneous rate of change — the slope of the tangent line:

f(x)=limh0f(x+h)f(x)h

In ML, we care about derivatives because training = minimizing a loss function, and derivatives tell us which direction to move.

python
# derivatives.py — Derivatives numerically and symbolically
import numpy as np

# Numerical derivative
def numerical_derivative(f, x, h=1e-7):
    """Central difference — more accurate than forward difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = x^2 → f'(x) = 2x
f = lambda x: x**2
x = 3.0
print(f"f'({x}) numerical = {numerical_derivative(f, x):.6f}")  # ≈ 6.0
print(f"f'({x}) analytical = {2 * x}")  # 6.0

# MSE loss derivative
# L(θ) = (1/n) Σ (yi - θxi)^2
# dL/dθ = -(2/n) Σ xi(yi - θxi)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
theta = 1.5

loss = lambda t: np.mean((y - t * X)**2)
grad_numerical = numerical_derivative(loss, theta)
grad_analytical = -2 * np.mean(X * (y - theta * X))

print(f"\nMSE loss at θ={theta}: {loss(theta):.4f}")
print(f"Gradient (numerical):  {grad_numerical:.6f}")
print(f"Gradient (analytical): {grad_analytical:.6f}")

Key Derivative Rules for ML

RuleFormulaML Use
Power ruleddxxn=nxn1MSE loss gradient
Chain ruleddxf(g(x))=f(g(x))g(x)Backpropagation
Product ruleddx[f(x)g(x)]=fg+fgRegularized loss
Log derivativeddxlnx=1xLog-likelihood
Exponentialddxex=exSigmoid, softmax

The Chain Rule

The chain rule is the single most important calculus concept for ML. It says: if y=f(g(x)), then:

dydx=dydgdgdx

This is how backpropagation works — it applies the chain rule repeatedly through layers of a neural network.

python
# chain_rule.py — Chain rule in action
import numpy as np

# Sigmoid function: σ(z) = 1 / (1 + e^(-z))
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Derivative of sigmoid using chain rule:
# σ(z) = (1 + e^(-z))^(-1)
# Let u = 1 + e^(-z)
# σ = u^(-1) → dσ/du = -u^(-2)
# du/dz = -e^(-z)
# dσ/dz = (-u^(-2))(-e^(-z)) = e^(-z) / (1 + e^(-z))^2
# Simplifies to: σ(z)(1 - σ(z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Verify with numerical derivative
z = 2.0
analytical = sigmoid_derivative(z)
numerical = (sigmoid(z + 1e-7) - sigmoid(z - 1e-7)) / (2e-7)
print(f"σ'({z}) analytical = {analytical:.8f}")
print(f"σ'({z}) numerical  = {numerical:.8f}")

# Full chain: loss = (y - σ(wx + b))^2
# dloss/dw = dloss/dσ · dσ/dz · dz/dw
# = -2(y - σ(z)) · σ(z)(1-σ(z)) · x
w, b, x, y_true = 0.5, 0.1, 2.0, 1.0
z = w * x + b
y_pred = sigmoid(z)

dloss_dsigma = -2 * (y_true - y_pred)
dsigma_dz = sigmoid_derivative(z)
dz_dw = x

dloss_dw = dloss_dsigma * dsigma_dz * dz_dw
print(f"\n∂loss/∂w = {dloss_dw:.6f}")

Partial Derivatives and Gradients

When a function has multiple inputs, the partial derivative measures how it changes with respect to one input while holding others fixed.

The gradient is the vector of all partial derivatives:

f(x)=[fx1fx2fxd]

The gradient points in the direction of steepest ascent. To minimize a loss, we move in the opposite direction — this is gradient descent.

python
# gradient.py — Gradient of a function with multiple variables
import numpy as np

# f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
# ∂f/∂x1 = 2*x1 + 3*x2
# ∂f/∂x2 = 3*x1 + 2*x2

def f(x):
    return x[0]**2 + 3*x[0]*x[1] + x[1]**2

def gradient_f(x):
    return np.array([2*x[0] + 3*x[1], 3*x[0] + 2*x[1]])

# Numerical gradient for verification
def numerical_gradient(f, x, h=1e-7):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

x = np.array([1.0, 2.0])
print(f"Analytical gradient: {gradient_f(x)}")
print(f"Numerical gradient:  {numerical_gradient(f, x)}")

# Gradient descent on this function
x = np.array([5.0, 5.0])
lr = 0.1
for i in range(20):
    grad = gradient_f(x)
    x = x - lr * grad
    if i % 5 == 0:
        print(f"Step {i:2d}: x = [{x[0]:.4f}, {x[1]:.4f}], f(x) = {f(x):.4f}")

Gradient Descent Derivation

For a loss function L(θ), gradient descent updates parameters as:

θt+1=θtηL(θt)
Worked Example — Gradient Descent Step

Minimize f(x) = (x - 3)^2. Start at x0 = 0, learning rate eta = 0.4.

Step 1: Compute gradient at x0 = 0 f'(x) = 2(x - 3) f'(0) = 2(0 - 3) = -6

Step 2: Update x x1 = 0 - 0.4 * (-6) = 0 + 2.4 = 2.4

Step 3: Next iteration at x1 = 2.4 f'(2.4) = 2(2.4 - 3) = -1.2 x2 = 2.4 - 0.4 * (-1.2) = 2.4 + 0.48 = 2.88

Step 4: Continue f'(2.88) = 2(2.88 - 3) = -0.24 x3 = 2.88 + 0.096 = 2.976

Interpret: "Starting at x=0, gradient descent moves toward the minimum at x=3. Each step reduces the distance: 3.0 -> 0.6 -> 0.12 -> 0.024. The negative gradient points uphill, so subtracting it moves us downhill."

where η is the learning rate. The derivation comes from the first-order Taylor expansion:

L(θ+δ)L(θ)+L(θ)Tδ

To decrease L, choose δ=ηL(θ), giving:

L(θ+δ)L(θ)ηL(θ)2

Since L20, the loss decreases (for small enough η).


Probability

Basic Probability Rules

ConceptFormulaExample
ComplementP(Ac)=1P(A)P(not spam)=1P(spam)
UnionP(AB)=P(A)+P(B)P(AB)P(rain or cold)
ConditionalP(AB)=P(AB)P(B)P(spam"free")
IndependenceP(AB)=P(A)P(B)Naive Bayes assumption

Bayes' Theorem

Bayes' theorem relates conditional probabilities:

P(AB)=P(BA)P(A)P(B)
Worked Example — Bayes' Theorem (Medical Test)

A disease affects 1% of the population. A test has 95% sensitivity and 90% specificity.

  • P(disease) = 0.01
  • P(positive | disease) = 0.95
  • P(positive | healthy) = 0.10 (1 - specificity)

Step 1: P(positive) using law of total probability P(+) = P(+|disease)*P(disease) + P(+|healthy)*P(healthy) = 0.95 * 0.01 + 0.10 * 0.99 = 0.0095 + 0.099 = 0.1085

Step 2: P(disease | positive) using Bayes' theorem P(disease|+) = P(+|disease)*P(disease) / P(+) = (0.95 * 0.01) / 0.1085 = 0.0095 / 0.1085 = 0.0876

Step 3: Interpret Only 8.76% of positive tests actually have the disease!

Step 4: Why so low? Of 10,000 people: 100 have disease, 9900 are healthy True positives: 100 * 0.95 = 95 False positives: 9900 * 0.10 = 990 P(disease|+) = 95 / (95 + 990) = 95/1085 = 8.76%

Interpret: "This is the base rate fallacy. Despite a 95% sensitive test, the low prevalence (1%) means most positive results are false positives. You'd need a second independent test to raise confidence."

Derivation: Start from the definition of conditional probability in both directions:

P(AB)=P(AB)P(B),P(BA)=P(AB)P(A)

From the second equation: P(AB)=P(BA)P(A). Substitute into the first:

P(AB)=P(BA)P(A)P(B)

Expanding P(B) using the law of total probability:

P(B)=P(BA)P(A)+P(BAc)P(Ac)
python
# bayes.py — Bayes' theorem: medical test example
import numpy as np

# A disease affects 1% of the population
# Test sensitivity (true positive rate): 95%
# Test specificity (true negative rate): 90%
P_disease = 0.01
P_positive_given_disease = 0.95     # sensitivity
P_positive_given_healthy = 0.10     # 1 - specificity

# P(disease | positive) = ?
# Using Bayes' theorem:
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_healthy * (1 - P_disease))

P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print(f"P(disease)               = {P_disease:.2%}")
print(f"P(positive | disease)    = {P_positive_given_disease:.2%}")
print(f"P(positive | healthy)    = {P_positive_given_healthy:.2%}")
print(f"P(positive)              = {P_positive:.4f}")
print(f"P(disease | positive)    = {P_disease_given_positive:.2%}")
# Only ~8.8% — despite a 95% sensitive test!
# This is the base rate fallacy — low prevalence means
# most positives are false positives

# Simulation to verify
np.random.seed(42)
n = 1_000_000
has_disease = np.random.rand(n) < P_disease
test_positive = np.where(
    has_disease,
    np.random.rand(n) < P_positive_given_disease,
    np.random.rand(n) < P_positive_given_healthy
)

actual = has_disease[test_positive].mean()
print(f"\nSimulated P(disease|positive) = {actual:.2%}")

Probability Distributions

python
# distributions.py — Key distributions for ML
import numpy as np
from scipy import stats

# --- Gaussian (Normal) Distribution ---
# PDF: f(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
mu, sigma = 5.0, 2.0
x = np.linspace(-2, 12, 1000)
pdf = stats.norm.pdf(x, mu, sigma)
print(f"Normal: mean={mu}, std={sigma}")
print(f"P(3 < X < 7) = {stats.norm.cdf(7, mu, sigma) - stats.norm.cdf(3, mu, sigma):.4f}")

# --- Bernoulli Distribution ---
# P(X=1) = p, P(X=0) = 1-p
p = 0.3
samples = stats.bernoulli.rvs(p, size=10000)
print(f"\nBernoulli: p={p}, sample mean={samples.mean():.4f}")

# --- Binomial Distribution ---
# n trials, each with probability p
n_trials, p = 100, 0.3
print(f"\nBinomial: n={n_trials}, p={p}")
print(f"Expected value = np = {n_trials * p}")
print(f"P(X = 30) = {stats.binom.pmf(30, n_trials, p):.4f}")
print(f"P(X <= 25) = {stats.binom.cdf(25, n_trials, p):.4f}")

# --- Multinomial Distribution ---
# Used in Naive Bayes for text classification
probs = [0.5, 0.3, 0.2]  # word probabilities
counts = np.random.multinomial(100, probs)
print(f"\nMultinomial sample (100 words): {counts}")

# --- Poisson Distribution ---
# P(X=k) = λ^k e^(-λ) / k!
lam = 4.0
print(f"\nPoisson: λ={lam}")
print(f"P(X=3) = {stats.poisson.pmf(3, lam):.4f}")
print(f"P(X<=5) = {stats.poisson.cdf(5, lam):.4f}")

Maximum Likelihood Estimation (MLE)

MLE finds the parameters that make the observed data most probable. Given data x1,,xn and a model with parameter θ:

θ^MLE=argmaxθi=1nP(xiθ)

Taking the log (since log is monotonic):

θ^MLE=argmaxθi=1nlogP(xiθ)
Worked Example — MLE for Coin Flip

Observed data: 7 heads, 3 tails in 10 flips. Find MLE for p (probability of heads).

Each flip follows Bernoulli(p): P(H) = p, P(T) = 1-p.

Step 1: Write the likelihood L(p) = p^7 * (1-p)^3

Step 2: Write the log-likelihood log L(p) = 7log(p) + 3log(1-p)

Step 3: Take derivative and set to zero d(log L)/dp = 7/p - 3/(1-p) = 0 7(1-p) = 3p 7 - 7p = 3p 7 = 10p p_MLE = 7/10 = 0.70

Step 4: Verify with different p values p=0.5: log L = 7*(-0.693) + 3*(-0.693) = -6.93 p=0.7: log L = 7*(-0.357) + 3*(-1.204) = -2.499 - 3.612 = -6.11 p=0.9: log L = 7*(-0.105) + 3*(-2.303) = -0.735 - 6.909 = -7.64

p=0.7 gives the highest log-likelihood, confirming our answer.

Interpret: "The MLE for a coin's bias given 7 heads in 10 flips is simply the observed proportion: 7/10 = 0.7. This is the most intuitive estimator, and MLE provides the mathematical justification."

python
# mle.py — MLE for Gaussian parameters
import numpy as np

# Generate data from a Gaussian
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, 1000)

# MLE for μ: take the derivative of log-likelihood, set to 0
# log L = -n/2 log(2π) - n/2 log(σ²) - 1/(2σ²) Σ(xi - μ)²
# ∂log L/∂μ = 1/σ² Σ(xi - μ) = 0
# → μ_MLE = (1/n) Σ xi = sample mean

# MLE for σ²: ∂log L/∂σ² = -n/(2σ²) + 1/(2σ⁴) Σ(xi - μ)² = 0
# → σ²_MLE = (1/n) Σ(xi - μ)²

mu_mle = np.mean(data)
sigma_mle = np.sqrt(np.mean((data - mu_mle)**2))

print(f"True μ = {true_mu}, MLE μ = {mu_mle:.4f}")
print(f"True σ = {true_sigma}, MLE σ = {sigma_mle:.4f}")

Statistics

Expectation and Variance

Expected value (mean):

E[X]=xxP(X=x)(discrete)E[X]=xf(x)dx(continuous)

Variance — measures spread:

Var(X)=E[(Xμ)2]=E[X2](E[X])2
python
# expectation_variance.py — Computing and verifying
import numpy as np

# Discrete: fair six-sided die
outcomes = np.arange(1, 7)
probs = np.ones(6) / 6

E_X = np.sum(outcomes * probs)
E_X2 = np.sum(outcomes**2 * probs)
Var_X = E_X2 - E_X**2

print(f"E[X] = {E_X:.4f}")    # 3.5
print(f"Var(X) = {Var_X:.4f}") # 2.9167

# Verify with simulation
rolls = np.random.randint(1, 7, 100000)
print(f"\nSimulated E[X] = {rolls.mean():.4f}")
print(f"Simulated Var(X) = {rolls.var():.4f}")

Bias-Variance Tradeoff

The expected prediction error can be decomposed:

E[(yf^(x))2]=Bias[f^(x)]2+Var[f^(x)]+σ2
Worked Example — Bias-Variance Tradeoff

True f(x) = 5. Noise sigma = 0.5. Three models trained on different data:

Model ComplexityPredictions over datasetsE[f_hat]Bias^2Var
Too simple (const=3)[3, 3, 3, 3, 3]3.0(5-3)^2 = 4.00.0
Just right[4.8, 5.3, 4.7, 5.1, 5.1]5.0(5-5)^2 = 0.00.04
Too complex[3.5, 6.5, 4.0, 7.0, 4.0]5.0(5-5)^2 = 0.02.10

Expected prediction errors (sigma^2 = 0.25): Simple: EPE = 4.0 + 0.0 + 0.25 = 4.25 Right: EPE = 0.0 + 0.04 + 0.25 = 0.29 Complex: EPE = 0.0 + 2.10 + 0.25 = 2.35

Interpret: "The simple model has zero variance but huge bias (always predicts 3 instead of 5). The complex model is unbiased on average but varies wildly. The just-right model achieves the lowest total error (0.29) with a good bias-variance balance. No model can go below 0.25 (the irreducible noise)."

where:

  • Bias = E[f^(x)]f(x) — error from wrong assumptions
  • Variance = E[(f^(x)E[f^(x)])2] — error from sensitivity to training data
  • σ2 = irreducible noise

Derivation:

Let y=f(x)+ϵ where E[ϵ]=0 and Var(ϵ)=σ2.

E[(yf^)2]=E[(f+ϵf^)2]=E[(ff^)2]+E[ϵ2]+2E[(ff^)ϵ]

Since ϵ is independent of f^, the cross term vanishes:

=E[(ff^)2]+σ2

Now expand (ff^)2 by adding and subtracting E[f^]:

E[(ff^)2]=(fE[f^])2+E[(f^E[f^])2]=Bias2+Variance
python
# bias_variance.py — Demonstrate the tradeoff
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# True function
def true_f(x):
    return np.sin(2 * x) + 0.5 * x

# Generate many datasets to estimate bias and variance
np.random.seed(42)
n_datasets = 200
n_train = 30
x_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_true = true_f(x_test.ravel())

for degree in [1, 4, 15]:
    predictions = np.zeros((n_datasets, len(x_test)))

    for i in range(n_datasets):
        x_train = np.sort(np.random.uniform(0, 5, n_train)).reshape(-1, 1)
        y_train = true_f(x_train.ravel()) + np.random.normal(0, 0.5, n_train)

        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_train, y_train)
        predictions[i] = model.predict(x_test)

    mean_pred = predictions.mean(axis=0)
    bias_sq = np.mean((mean_pred - y_true)**2)
    variance = np.mean(predictions.var(axis=0))
    mse = np.mean((predictions - y_true)**2)

    print(f"Degree {degree:2d}: Bias²={bias_sq:.4f}, Var={variance:.4f}, "
          f"Bias²+Var={bias_sq + variance:.4f}, MSE={mse:.4f}")
# Degree 1 — high bias (underfitting)
# Degree 4 — good balance
# Degree 15 — high variance (overfitting)

Covariance and Correlation

Covariance measures how two variables move together:

Cov(X,Y)=E[(XμX)(YμY)]

Pearson correlation normalizes to [1,1]:

ρX,Y=Cov(X,Y)σXσY
python
# correlation.py — Covariance, correlation, and their matrix forms
import numpy as np

# Generate correlated data
np.random.seed(42)
n = 500
x1 = np.random.randn(n)
x2 = 0.8 * x1 + 0.6 * np.random.randn(n)  # correlated with x1
x3 = np.random.randn(n)  # independent

data = np.column_stack([x1, x2, x3])

# Covariance matrix — used in PCA, Gaussian distributions
cov_matrix = np.cov(data.T)
print("Covariance matrix:")
print(np.round(cov_matrix, 3))

# Correlation matrix — used in feature selection
corr_matrix = np.corrcoef(data.T)
print("\nCorrelation matrix:")
print(np.round(corr_matrix, 3))

# x1 and x2 are correlated (~0.8), x3 is independent (~0.0)
print(f"\ncorr(x1, x2) = {corr_matrix[0, 1]:.3f}")
print(f"corr(x1, x3) = {corr_matrix[0, 2]:.3f}")

Hypothesis Testing Intuition

While you rarely do formal hypothesis testing in ML, the concepts appear everywhere:

  • p-value in feature importance tests
  • Confidence intervals for model performance
  • A/B testing for model comparison in production
python
# hypothesis.py — Comparing two model performances
import numpy as np
from scipy import stats

# Two models evaluated on 30 test folds
np.random.seed(42)
model_a_scores = np.random.normal(0.82, 0.03, 30)
model_b_scores = np.random.normal(0.85, 0.03, 30)

# Paired t-test — are the models significantly different?
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"Model A mean: {model_a_scores.mean():.4f}")
print(f"Model B mean: {model_b_scores.mean():.4f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")

if p_value < 0.05:
    print("Difference is statistically significant at α=0.05")
else:
    print("No statistically significant difference")

# Bootstrap confidence interval for Model B accuracy
n_bootstrap = 10000
bootstrap_means = np.array([
    np.random.choice(model_b_scores, size=len(model_b_scores), replace=True).mean()
    for _ in range(n_bootstrap)
])
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"\n95% CI for Model B: [{ci_lower:.4f}, {ci_upper:.4f}]")

Information Theory

Entropy

Entropy measures the uncertainty of a random variable:

H(X)=i=1npilog2pi
Worked Example — Shannon Entropy

Compare entropy of three distributions over {A, B, C}:

Distribution 1: Uniform [1/3, 1/3, 1/3] H = -(1/3)log2(1/3) - (1/3)log2(1/3) - (1/3)log2(1/3) = -3 * (1/3)(-1.585) = 3 * 0.528 = 1.585 bits

Distribution 2: Skewed [0.8, 0.1, 0.1] H = -0.8log2(0.8) - 0.1log2(0.1) - 0.1*log2(0.1) = -0.8(-0.322) - 0.1(-3.322) - 0.1(-3.322) = 0.258 + 0.332 + 0.332 = 0.922 bits

Distribution 3: Certain [1.0, 0.0, 0.0] H = -1.0log2(1.0) = -1.00 = 0 bits

Interpret: "Uniform distribution has maximum entropy (1.585 bits) — maximum uncertainty about which class. The skewed distribution has less entropy (0.922) — we can guess 'A' most of the time. A certain distribution has zero entropy — no uncertainty at all. Decision trees seek splits that minimize child entropy."

Maximum entropy: uniform distribution (maximum uncertainty). Minimum entropy: deterministic (zero uncertainty).

python
# entropy.py — Entropy and its role in decision trees
import numpy as np

def entropy(probs):
    """Shannon entropy in bits."""
    probs = np.array(probs)
    probs = probs[probs > 0]  # avoid log(0)
    return -np.sum(probs * np.log2(probs))

# Fair coin: maximum entropy for binary
print(f"Fair coin entropy:   {entropy([0.5, 0.5]):.4f} bits")  # 1.0

# Biased coin: lower entropy
print(f"Biased (0.9, 0.1):   {entropy([0.9, 0.1]):.4f} bits")  # 0.469

# Certain outcome: zero entropy
print(f"Certain (1.0, 0.0):  {entropy([1.0, 0.0]):.4f} bits")  # 0.0

# 3-class classification: how pure is this split?
print(f"\nPure split [1, 0, 0]:   {entropy([1.0, 0.0, 0.0]):.4f}")
print(f"Mixed [0.33,0.33,0.34]: {entropy([0.33, 0.33, 0.34]):.4f}")
print(f"Skewed [0.8, 0.1, 0.1]: {entropy([0.8, 0.1, 0.1]):.4f}")

KL Divergence

Measures how one probability distribution differs from another:

DKL(PQ)=xP(x)logP(x)Q(x)
Worked Example — KL Divergence and Cross-Entropy

True distribution P = [0, 0, 1] (class 3 is correct). Two model predictions:

  • Good model: Q1 = [0.05, 0.05, 0.90]
  • Bad model: Q2 = [0.40, 0.30, 0.30]

Step 1: Cross-entropy H(P, Q1) (good model) H(P, Q1) = -0ln(0.05) - 0ln(0.05) - 1*ln(0.90) = -ln(0.90) = 0.105

Step 2: Cross-entropy H(P, Q2) (bad model) H(P, Q2) = -0ln(0.40) - 0ln(0.30) - 1*ln(0.30) = -ln(0.30) = 1.204

Step 3: KL divergence (since P is one-hot, KL = cross-entropy - entropy of P) H(P) = 0 (deterministic distribution) D_KL(P||Q1) = 0.105 - 0 = 0.105 D_KL(P||Q2) = 1.204 - 0 = 1.204

Interpret: "The good model's cross-entropy (0.105) is much lower than the bad model's (1.204). Minimizing cross-entropy during training is equivalent to minimizing KL divergence — making the model's predicted distribution closer to the true distribution."

KL divergence is not symmetric: DKL(PQ)DKL(QP).

Cross-entropy (used as loss in classification) is related:

H(P,Q)=xP(x)logQ(x)=H(P)+DKL(PQ)

Since H(P) is constant during training, minimizing cross-entropy is equivalent to minimizing KL divergence.

python
# kl_divergence.py — Cross-entropy loss connection
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy between true distribution p and predicted q."""
    q = np.clip(q, 1e-15, 1)  # avoid log(0)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / np.clip(q[mask], 1e-15, 1)))

# True distribution (one-hot for class 2 in 3-class problem)
p = np.array([0.0, 0.0, 1.0])

# Good prediction
q_good = np.array([0.05, 0.05, 0.90])
# Bad prediction
q_bad = np.array([0.40, 0.30, 0.30])

print(f"Cross-entropy (good): {cross_entropy(p, q_good):.4f}")
print(f"Cross-entropy (bad):  {cross_entropy(p, q_bad):.4f}")
print(f"KL divergence (good): {kl_divergence(p, q_good):.4f}")
print(f"KL divergence (bad):  {kl_divergence(p, q_bad):.4f}")

Putting It All Together: Math in a Real ML Pipeline

python
# full_pipeline_math.py — See all four pillars in action
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data — LINEAR ALGEBRA: data is a matrix
iris = load_iris()
X, y = iris.data, iris.target
print(f"Data matrix shape: {X.shape}")  # (150, 4)

# Train/test split — PROBABILITY: random sampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling — STATISTICS: standardize to zero mean, unit variance
# z = (x - μ) / σ
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Train mean: {X_train_scaled.mean(axis=0).round(10)}")  # ~0
print(f"Train std:  {X_train_scaled.std(axis=0).round(4)}")     # ~1

# Logistic Regression — CALCULUS: gradient descent minimizes cross-entropy
# The model learns: P(y=k|x) = softmax(Wx + b)
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Weights — LINEAR ALGEBRA: weight matrix W ∈ R^(3×4)
print(f"\nWeight matrix shape: {model.coef_.shape}")
print(f"Weights:\n{model.coef_.round(3)}")

# Prediction — LINEAR ALGEBRA: matrix multiply + CALCULUS: softmax
y_pred = model.predict(X_test_scaled)

# Evaluation — STATISTICS: accuracy is an estimator of true performance
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")

# PROBABILITY: predicted probabilities via softmax
probs = model.predict_proba(X_test_scaled)[:3]
print(f"\nPredicted probabilities (first 3):\n{probs.round(4)}")

Quick Reference

Math ConceptML ApplicationKey Formula
Dot productSimilarity, linear modelswx=wixi
Matrix inverseNormal equationθ=(XTX)1XTy
EigenvaluesPCAAv=λv
DerivativeGradient descentθθηLθ
Chain ruleBackpropagationLw=Ly^y^zzw
Bayes' theoremNaive Bayes, Bayesian modelsP(AB)=P(BA)P(A)P(B)
MLEParameter estimationθ^=argmaxlogP(xiθ)
Bias-varianceModel selectionError=Bias2+Var+Noise
EntropyDecision trees, cross-entropy lossH=pilogpi
KL divergenceCross-entropy lossDKL(P|Q)=PlogPQ

Further Reading

"What I cannot create, I do not understand." — Richard Feynman