Math Foundations for Machine Learning
Every machine learning algorithm reduces to mathematical operations. Linear regression is a matrix equation. Gradient descent is calculus. Naive Bayes is probability theory. Understanding the math does not just help you debug models — it lets you invent new ones.
This page covers the four pillars: linear algebra, calculus, probability, and statistics — each with full derivations and runnable Python code.
Linear Algebra
Vectors
A vector is an ordered list of numbers. In ML, a single data point with
# vectors.py — Vector operations with NumPy
import numpy as np
# A data point with 3 features
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
# Addition — element-wise
print(f"x + y = {x + y}") # [5. 7. 9.]
# Scalar multiplication
print(f"3x = {3 * x}") # [3. 6. 9.]
# Dot product — measures similarity
dot = np.dot(x, y) # 1*4 + 2*5 + 3*6 = 32
print(f"x · y = {dot}")
# Magnitude (L2 norm)
norm_x = np.linalg.norm(x)
print(f"||x|| = {norm_x:.4f}") # sqrt(1 + 4 + 9) = 3.7417
# Unit vector
x_hat = x / norm_x
print(f"Unit vector: {x_hat}")
print(f"||x_hat|| = {np.linalg.norm(x_hat):.4f}") # 1.0
# Cosine similarity — angle between vectors
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"cos(x, y) = {cos_sim:.4f}") # 0.9746 — very similar directionThe dot product has deep geometric meaning:
Worked Example — Dot Product and Cosine Similarity
Two feature vectors: x = [3, 4], y = [4, 3]
Step 1: Dot product x . y = 3(4) + 4(3) = 12 + 12 = 24
Step 2: Magnitudes ||x|| = sqrt(9 + 16) = sqrt(25) = 5 ||y|| = sqrt(16 + 9) = sqrt(25) = 5
Step 3: Cosine similarity cos(theta) = 24 / (5 * 5) = 24/25 = 0.96
Step 4: Angle theta = arccos(0.96) = 16.3 degrees
Interpret: "The vectors are highly similar (cos=0.96, angle=16.3 degrees). They point in nearly the same direction. If these were document vectors, the documents would be about 96% similar in content."
When
Matrices
A dataset with
# matrices.py — Matrix operations with NumPy
import numpy as np
# Dataset: 4 samples, 3 features
X = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
])
print(f"Shape: {X.shape}") # (4, 3) — 4 samples, 3 features
# Transpose — swap rows and columns
print(f"X^T shape: {X.T.shape}") # (3, 4)
# Matrix multiplication — X^T X is the Gram matrix
# Used in linear regression normal equation
gram = X.T @ X # (3,4) @ (4,3) = (3,3)
print(f"X^T X =\n{gram}")
# Inverse — only for square, full-rank matrices
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(f"A^-1 =\n{A_inv}")
# Verify: A @ A^-1 = I
identity = A @ A_inv
print(f"A @ A^-1 =\n{np.round(identity, 10)}")
# Determinant — measures how much A scales volume
det = np.linalg.det(A)
print(f"det(A) = {det:.1f}") # 10.0 — nonzero means invertible
# Trace — sum of diagonal elements
print(f"tr(A) = {np.trace(A)}") # 4 + 6 = 10Matrix Multiplication Rules
For
The inner dimensions must match:
Eigendecomposition
An eigenvector of matrix
Worked Example — Eigenvalues and Eigenvectors
Matrix A = [[4, 2],[1, 3]]. Find eigenvalues and eigenvectors.
Step 1: Solve det(A - lambdaI) = 0 det([[4-lambda, 2],[1, 3-lambda]]) = (4-lambda)(3-lambda) - 21 = 0 lambda^2 - 7lambda + 12 - 2 = 0 lambda^2 - 7lambda + 10 = 0 (lambda - 5)(lambda - 2) = 0 lambda1 = 5, lambda2 = 2
Step 2: Find eigenvector for lambda1 = 5 (A - 5I)v = 0 -> [[-1, 2],[1, -2]]v = 0 -v1 + 2v2 = 0 -> v1 = 2v2 Eigenvector: v1 = [2, 1] (normalized: [0.894, 0.447])
Step 3: Find eigenvector for lambda2 = 2 (A - 2I)v = 0 -> [[2, 2],[1, 1]]v = 0 2v1 + 2v2 = 0 -> v1 = -v2 Eigenvector: v2 = [-1, 1] (normalized: [-0.707, 0.707])
Step 4: Verify A @ v1 = lambda1 * v1 A @ [2,1] = [4(2)+2(1), 1(2)+3(1)] = [10, 5] = 5 * [2, 1] -> correct!
Interpret: "The matrix A stretches space along [2,1] by factor 5 and along [-1,1] by factor 2. In PCA, the largest eigenvalue (5) corresponds to the direction of most variance — that's the first principal component."
where
# eigen.py — Eigendecomposition and its role in PCA
import numpy as np
# Covariance matrix of 2D data
np.random.seed(42)
data = np.random.randn(200, 2) @ np.array([[2, 1], [1, 3]])
cov = np.cov(data.T)
print(f"Covariance matrix:\n{cov}")
# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Variance explained ratio — this IS PCA
total_var = eigenvalues.sum()
var_ratio = eigenvalues / total_var
print(f"\nVariance explained: {var_ratio}")
print(f"First component captures {var_ratio[0]:.1%} of variance")
# Project data onto first eigenvector (1D PCA)
projected = data @ eigenvectors[:, 0]
print(f"Original shape: {data.shape}, Projected shape: {projected.shape}")The Normal Equation Uses All of This
Linear regression's closed-form solution ties it all together:
This requires matrix transpose, matrix multiplication, and matrix inversion — three core linear algebra operations.
Calculus
Derivatives
The derivative of
In ML, we care about derivatives because training = minimizing a loss function, and derivatives tell us which direction to move.
# derivatives.py — Derivatives numerically and symbolically
import numpy as np
# Numerical derivative
def numerical_derivative(f, x, h=1e-7):
"""Central difference — more accurate than forward difference."""
return (f(x + h) - f(x - h)) / (2 * h)
# Example: f(x) = x^2 → f'(x) = 2x
f = lambda x: x**2
x = 3.0
print(f"f'({x}) numerical = {numerical_derivative(f, x):.6f}") # ≈ 6.0
print(f"f'({x}) analytical = {2 * x}") # 6.0
# MSE loss derivative
# L(θ) = (1/n) Σ (yi - θxi)^2
# dL/dθ = -(2/n) Σ xi(yi - θxi)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
theta = 1.5
loss = lambda t: np.mean((y - t * X)**2)
grad_numerical = numerical_derivative(loss, theta)
grad_analytical = -2 * np.mean(X * (y - theta * X))
print(f"\nMSE loss at θ={theta}: {loss(theta):.4f}")
print(f"Gradient (numerical): {grad_numerical:.6f}")
print(f"Gradient (analytical): {grad_analytical:.6f}")Key Derivative Rules for ML
| Rule | Formula | ML Use |
|---|---|---|
| Power rule | MSE loss gradient | |
| Chain rule | Backpropagation | |
| Product rule | Regularized loss | |
| Log derivative | Log-likelihood | |
| Exponential | Sigmoid, softmax |
The Chain Rule
The chain rule is the single most important calculus concept for ML. It says: if
This is how backpropagation works — it applies the chain rule repeatedly through layers of a neural network.
# chain_rule.py — Chain rule in action
import numpy as np
# Sigmoid function: σ(z) = 1 / (1 + e^(-z))
def sigmoid(z):
return 1 / (1 + np.exp(-z))
# Derivative of sigmoid using chain rule:
# σ(z) = (1 + e^(-z))^(-1)
# Let u = 1 + e^(-z)
# σ = u^(-1) → dσ/du = -u^(-2)
# du/dz = -e^(-z)
# dσ/dz = (-u^(-2))(-e^(-z)) = e^(-z) / (1 + e^(-z))^2
# Simplifies to: σ(z)(1 - σ(z))
def sigmoid_derivative(z):
s = sigmoid(z)
return s * (1 - s)
# Verify with numerical derivative
z = 2.0
analytical = sigmoid_derivative(z)
numerical = (sigmoid(z + 1e-7) - sigmoid(z - 1e-7)) / (2e-7)
print(f"σ'({z}) analytical = {analytical:.8f}")
print(f"σ'({z}) numerical = {numerical:.8f}")
# Full chain: loss = (y - σ(wx + b))^2
# dloss/dw = dloss/dσ · dσ/dz · dz/dw
# = -2(y - σ(z)) · σ(z)(1-σ(z)) · x
w, b, x, y_true = 0.5, 0.1, 2.0, 1.0
z = w * x + b
y_pred = sigmoid(z)
dloss_dsigma = -2 * (y_true - y_pred)
dsigma_dz = sigmoid_derivative(z)
dz_dw = x
dloss_dw = dloss_dsigma * dsigma_dz * dz_dw
print(f"\n∂loss/∂w = {dloss_dw:.6f}")Partial Derivatives and Gradients
When a function has multiple inputs, the partial derivative measures how it changes with respect to one input while holding others fixed.
The gradient is the vector of all partial derivatives:
The gradient points in the direction of steepest ascent. To minimize a loss, we move in the opposite direction — this is gradient descent.
# gradient.py — Gradient of a function with multiple variables
import numpy as np
# f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
# ∂f/∂x1 = 2*x1 + 3*x2
# ∂f/∂x2 = 3*x1 + 2*x2
def f(x):
return x[0]**2 + 3*x[0]*x[1] + x[1]**2
def gradient_f(x):
return np.array([2*x[0] + 3*x[1], 3*x[0] + 2*x[1]])
# Numerical gradient for verification
def numerical_gradient(f, x, h=1e-7):
grad = np.zeros_like(x)
for i in range(len(x)):
x_plus = x.copy(); x_plus[i] += h
x_minus = x.copy(); x_minus[i] -= h
grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
return grad
x = np.array([1.0, 2.0])
print(f"Analytical gradient: {gradient_f(x)}")
print(f"Numerical gradient: {numerical_gradient(f, x)}")
# Gradient descent on this function
x = np.array([5.0, 5.0])
lr = 0.1
for i in range(20):
grad = gradient_f(x)
x = x - lr * grad
if i % 5 == 0:
print(f"Step {i:2d}: x = [{x[0]:.4f}, {x[1]:.4f}], f(x) = {f(x):.4f}")Gradient Descent Derivation
For a loss function
Worked Example — Gradient Descent Step
Minimize f(x) = (x - 3)^2. Start at x0 = 0, learning rate eta = 0.4.
Step 1: Compute gradient at x0 = 0 f'(x) = 2(x - 3) f'(0) = 2(0 - 3) = -6
Step 2: Update x x1 = 0 - 0.4 * (-6) = 0 + 2.4 = 2.4
Step 3: Next iteration at x1 = 2.4 f'(2.4) = 2(2.4 - 3) = -1.2 x2 = 2.4 - 0.4 * (-1.2) = 2.4 + 0.48 = 2.88
Step 4: Continue f'(2.88) = 2(2.88 - 3) = -0.24 x3 = 2.88 + 0.096 = 2.976
Interpret: "Starting at x=0, gradient descent moves toward the minimum at x=3. Each step reduces the distance: 3.0 -> 0.6 -> 0.12 -> 0.024. The negative gradient points uphill, so subtracting it moves us downhill."
where
To decrease
Since
Probability
Basic Probability Rules
| Concept | Formula | Example |
|---|---|---|
| Complement | ||
| Union | ||
| Conditional | ||
| Independence | Naive Bayes assumption |
Bayes' Theorem
Bayes' theorem relates conditional probabilities:
Worked Example — Bayes' Theorem (Medical Test)
A disease affects 1% of the population. A test has 95% sensitivity and 90% specificity.
- P(disease) = 0.01
- P(positive | disease) = 0.95
- P(positive | healthy) = 0.10 (1 - specificity)
Step 1: P(positive) using law of total probability P(+) = P(+|disease)*P(disease) + P(+|healthy)*P(healthy) = 0.95 * 0.01 + 0.10 * 0.99 = 0.0095 + 0.099 = 0.1085
Step 2: P(disease | positive) using Bayes' theorem P(disease|+) = P(+|disease)*P(disease) / P(+) = (0.95 * 0.01) / 0.1085 = 0.0095 / 0.1085 = 0.0876
Step 3: Interpret Only 8.76% of positive tests actually have the disease!
Step 4: Why so low? Of 10,000 people: 100 have disease, 9900 are healthy True positives: 100 * 0.95 = 95 False positives: 9900 * 0.10 = 990 P(disease|+) = 95 / (95 + 990) = 95/1085 = 8.76%
Interpret: "This is the base rate fallacy. Despite a 95% sensitive test, the low prevalence (1%) means most positive results are false positives. You'd need a second independent test to raise confidence."
Derivation: Start from the definition of conditional probability in both directions:
From the second equation:
Expanding
# bayes.py — Bayes' theorem: medical test example
import numpy as np
# A disease affects 1% of the population
# Test sensitivity (true positive rate): 95%
# Test specificity (true negative rate): 90%
P_disease = 0.01
P_positive_given_disease = 0.95 # sensitivity
P_positive_given_healthy = 0.10 # 1 - specificity
# P(disease | positive) = ?
# Using Bayes' theorem:
P_positive = (P_positive_given_disease * P_disease +
P_positive_given_healthy * (1 - P_disease))
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive
print(f"P(disease) = {P_disease:.2%}")
print(f"P(positive | disease) = {P_positive_given_disease:.2%}")
print(f"P(positive | healthy) = {P_positive_given_healthy:.2%}")
print(f"P(positive) = {P_positive:.4f}")
print(f"P(disease | positive) = {P_disease_given_positive:.2%}")
# Only ~8.8% — despite a 95% sensitive test!
# This is the base rate fallacy — low prevalence means
# most positives are false positives
# Simulation to verify
np.random.seed(42)
n = 1_000_000
has_disease = np.random.rand(n) < P_disease
test_positive = np.where(
has_disease,
np.random.rand(n) < P_positive_given_disease,
np.random.rand(n) < P_positive_given_healthy
)
actual = has_disease[test_positive].mean()
print(f"\nSimulated P(disease|positive) = {actual:.2%}")Probability Distributions
# distributions.py — Key distributions for ML
import numpy as np
from scipy import stats
# --- Gaussian (Normal) Distribution ---
# PDF: f(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
mu, sigma = 5.0, 2.0
x = np.linspace(-2, 12, 1000)
pdf = stats.norm.pdf(x, mu, sigma)
print(f"Normal: mean={mu}, std={sigma}")
print(f"P(3 < X < 7) = {stats.norm.cdf(7, mu, sigma) - stats.norm.cdf(3, mu, sigma):.4f}")
# --- Bernoulli Distribution ---
# P(X=1) = p, P(X=0) = 1-p
p = 0.3
samples = stats.bernoulli.rvs(p, size=10000)
print(f"\nBernoulli: p={p}, sample mean={samples.mean():.4f}")
# --- Binomial Distribution ---
# n trials, each with probability p
n_trials, p = 100, 0.3
print(f"\nBinomial: n={n_trials}, p={p}")
print(f"Expected value = np = {n_trials * p}")
print(f"P(X = 30) = {stats.binom.pmf(30, n_trials, p):.4f}")
print(f"P(X <= 25) = {stats.binom.cdf(25, n_trials, p):.4f}")
# --- Multinomial Distribution ---
# Used in Naive Bayes for text classification
probs = [0.5, 0.3, 0.2] # word probabilities
counts = np.random.multinomial(100, probs)
print(f"\nMultinomial sample (100 words): {counts}")
# --- Poisson Distribution ---
# P(X=k) = λ^k e^(-λ) / k!
lam = 4.0
print(f"\nPoisson: λ={lam}")
print(f"P(X=3) = {stats.poisson.pmf(3, lam):.4f}")
print(f"P(X<=5) = {stats.poisson.cdf(5, lam):.4f}")Maximum Likelihood Estimation (MLE)
MLE finds the parameters that make the observed data most probable. Given data
Taking the log (since log is monotonic):
Worked Example — MLE for Coin Flip
Observed data: 7 heads, 3 tails in 10 flips. Find MLE for p (probability of heads).
Each flip follows Bernoulli(p): P(H) = p, P(T) = 1-p.
Step 1: Write the likelihood L(p) = p^7 * (1-p)^3
Step 2: Write the log-likelihood log L(p) = 7log(p) + 3log(1-p)
Step 3: Take derivative and set to zero d(log L)/dp = 7/p - 3/(1-p) = 0 7(1-p) = 3p 7 - 7p = 3p 7 = 10p p_MLE = 7/10 = 0.70
Step 4: Verify with different p values p=0.5: log L = 7*(-0.693) + 3*(-0.693) = -6.93 p=0.7: log L = 7*(-0.357) + 3*(-1.204) = -2.499 - 3.612 = -6.11 p=0.9: log L = 7*(-0.105) + 3*(-2.303) = -0.735 - 6.909 = -7.64
p=0.7 gives the highest log-likelihood, confirming our answer.
Interpret: "The MLE for a coin's bias given 7 heads in 10 flips is simply the observed proportion: 7/10 = 0.7. This is the most intuitive estimator, and MLE provides the mathematical justification."
# mle.py — MLE for Gaussian parameters
import numpy as np
# Generate data from a Gaussian
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
data = np.random.normal(true_mu, true_sigma, 1000)
# MLE for μ: take the derivative of log-likelihood, set to 0
# log L = -n/2 log(2π) - n/2 log(σ²) - 1/(2σ²) Σ(xi - μ)²
# ∂log L/∂μ = 1/σ² Σ(xi - μ) = 0
# → μ_MLE = (1/n) Σ xi = sample mean
# MLE for σ²: ∂log L/∂σ² = -n/(2σ²) + 1/(2σ⁴) Σ(xi - μ)² = 0
# → σ²_MLE = (1/n) Σ(xi - μ)²
mu_mle = np.mean(data)
sigma_mle = np.sqrt(np.mean((data - mu_mle)**2))
print(f"True μ = {true_mu}, MLE μ = {mu_mle:.4f}")
print(f"True σ = {true_sigma}, MLE σ = {sigma_mle:.4f}")Statistics
Expectation and Variance
Expected value (mean):
Variance — measures spread:
# expectation_variance.py — Computing and verifying
import numpy as np
# Discrete: fair six-sided die
outcomes = np.arange(1, 7)
probs = np.ones(6) / 6
E_X = np.sum(outcomes * probs)
E_X2 = np.sum(outcomes**2 * probs)
Var_X = E_X2 - E_X**2
print(f"E[X] = {E_X:.4f}") # 3.5
print(f"Var(X) = {Var_X:.4f}") # 2.9167
# Verify with simulation
rolls = np.random.randint(1, 7, 100000)
print(f"\nSimulated E[X] = {rolls.mean():.4f}")
print(f"Simulated Var(X) = {rolls.var():.4f}")Bias-Variance Tradeoff
The expected prediction error can be decomposed:
Worked Example — Bias-Variance Tradeoff
True f(x) = 5. Noise sigma = 0.5. Three models trained on different data:
| Model Complexity | Predictions over datasets | E[f_hat] | Bias^2 | Var |
|---|---|---|---|---|
| Too simple (const=3) | [3, 3, 3, 3, 3] | 3.0 | (5-3)^2 = 4.0 | 0.0 |
| Just right | [4.8, 5.3, 4.7, 5.1, 5.1] | 5.0 | (5-5)^2 = 0.0 | 0.04 |
| Too complex | [3.5, 6.5, 4.0, 7.0, 4.0] | 5.0 | (5-5)^2 = 0.0 | 2.10 |
Expected prediction errors (sigma^2 = 0.25): Simple: EPE = 4.0 + 0.0 + 0.25 = 4.25 Right: EPE = 0.0 + 0.04 + 0.25 = 0.29 Complex: EPE = 0.0 + 2.10 + 0.25 = 2.35
Interpret: "The simple model has zero variance but huge bias (always predicts 3 instead of 5). The complex model is unbiased on average but varies wildly. The just-right model achieves the lowest total error (0.29) with a good bias-variance balance. No model can go below 0.25 (the irreducible noise)."
where:
- Bias =
— error from wrong assumptions - Variance =
— error from sensitivity to training data = irreducible noise
Derivation:
Let
Since
Now expand
# bias_variance.py — Demonstrate the tradeoff
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
# True function
def true_f(x):
return np.sin(2 * x) + 0.5 * x
# Generate many datasets to estimate bias and variance
np.random.seed(42)
n_datasets = 200
n_train = 30
x_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_true = true_f(x_test.ravel())
for degree in [1, 4, 15]:
predictions = np.zeros((n_datasets, len(x_test)))
for i in range(n_datasets):
x_train = np.sort(np.random.uniform(0, 5, n_train)).reshape(-1, 1)
y_train = true_f(x_train.ravel()) + np.random.normal(0, 0.5, n_train)
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x_train, y_train)
predictions[i] = model.predict(x_test)
mean_pred = predictions.mean(axis=0)
bias_sq = np.mean((mean_pred - y_true)**2)
variance = np.mean(predictions.var(axis=0))
mse = np.mean((predictions - y_true)**2)
print(f"Degree {degree:2d}: Bias²={bias_sq:.4f}, Var={variance:.4f}, "
f"Bias²+Var={bias_sq + variance:.4f}, MSE={mse:.4f}")
# Degree 1 — high bias (underfitting)
# Degree 4 — good balance
# Degree 15 — high variance (overfitting)Covariance and Correlation
Covariance measures how two variables move together:
Pearson correlation normalizes to
# correlation.py — Covariance, correlation, and their matrix forms
import numpy as np
# Generate correlated data
np.random.seed(42)
n = 500
x1 = np.random.randn(n)
x2 = 0.8 * x1 + 0.6 * np.random.randn(n) # correlated with x1
x3 = np.random.randn(n) # independent
data = np.column_stack([x1, x2, x3])
# Covariance matrix — used in PCA, Gaussian distributions
cov_matrix = np.cov(data.T)
print("Covariance matrix:")
print(np.round(cov_matrix, 3))
# Correlation matrix — used in feature selection
corr_matrix = np.corrcoef(data.T)
print("\nCorrelation matrix:")
print(np.round(corr_matrix, 3))
# x1 and x2 are correlated (~0.8), x3 is independent (~0.0)
print(f"\ncorr(x1, x2) = {corr_matrix[0, 1]:.3f}")
print(f"corr(x1, x3) = {corr_matrix[0, 2]:.3f}")Hypothesis Testing Intuition
While you rarely do formal hypothesis testing in ML, the concepts appear everywhere:
- p-value in feature importance tests
- Confidence intervals for model performance
- A/B testing for model comparison in production
# hypothesis.py — Comparing two model performances
import numpy as np
from scipy import stats
# Two models evaluated on 30 test folds
np.random.seed(42)
model_a_scores = np.random.normal(0.82, 0.03, 30)
model_b_scores = np.random.normal(0.85, 0.03, 30)
# Paired t-test — are the models significantly different?
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"Model A mean: {model_a_scores.mean():.4f}")
print(f"Model B mean: {model_b_scores.mean():.4f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")
if p_value < 0.05:
print("Difference is statistically significant at α=0.05")
else:
print("No statistically significant difference")
# Bootstrap confidence interval for Model B accuracy
n_bootstrap = 10000
bootstrap_means = np.array([
np.random.choice(model_b_scores, size=len(model_b_scores), replace=True).mean()
for _ in range(n_bootstrap)
])
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"\n95% CI for Model B: [{ci_lower:.4f}, {ci_upper:.4f}]")Information Theory
Entropy
Entropy measures the uncertainty of a random variable:
Worked Example — Shannon Entropy
Compare entropy of three distributions over {A, B, C}:
Distribution 1: Uniform [1/3, 1/3, 1/3] H = -(1/3)log2(1/3) - (1/3)log2(1/3) - (1/3)log2(1/3) = -3 * (1/3)(-1.585) = 3 * 0.528 = 1.585 bits
Distribution 2: Skewed [0.8, 0.1, 0.1] H = -0.8log2(0.8) - 0.1log2(0.1) - 0.1*log2(0.1) = -0.8(-0.322) - 0.1(-3.322) - 0.1(-3.322) = 0.258 + 0.332 + 0.332 = 0.922 bits
Distribution 3: Certain [1.0, 0.0, 0.0] H = -1.0log2(1.0) = -1.00 = 0 bits
Interpret: "Uniform distribution has maximum entropy (1.585 bits) — maximum uncertainty about which class. The skewed distribution has less entropy (0.922) — we can guess 'A' most of the time. A certain distribution has zero entropy — no uncertainty at all. Decision trees seek splits that minimize child entropy."
Maximum entropy: uniform distribution (maximum uncertainty). Minimum entropy: deterministic (zero uncertainty).
# entropy.py — Entropy and its role in decision trees
import numpy as np
def entropy(probs):
"""Shannon entropy in bits."""
probs = np.array(probs)
probs = probs[probs > 0] # avoid log(0)
return -np.sum(probs * np.log2(probs))
# Fair coin: maximum entropy for binary
print(f"Fair coin entropy: {entropy([0.5, 0.5]):.4f} bits") # 1.0
# Biased coin: lower entropy
print(f"Biased (0.9, 0.1): {entropy([0.9, 0.1]):.4f} bits") # 0.469
# Certain outcome: zero entropy
print(f"Certain (1.0, 0.0): {entropy([1.0, 0.0]):.4f} bits") # 0.0
# 3-class classification: how pure is this split?
print(f"\nPure split [1, 0, 0]: {entropy([1.0, 0.0, 0.0]):.4f}")
print(f"Mixed [0.33,0.33,0.34]: {entropy([0.33, 0.33, 0.34]):.4f}")
print(f"Skewed [0.8, 0.1, 0.1]: {entropy([0.8, 0.1, 0.1]):.4f}")KL Divergence
Measures how one probability distribution differs from another:
Worked Example — KL Divergence and Cross-Entropy
True distribution P = [0, 0, 1] (class 3 is correct). Two model predictions:
- Good model: Q1 = [0.05, 0.05, 0.90]
- Bad model: Q2 = [0.40, 0.30, 0.30]
Step 1: Cross-entropy H(P, Q1) (good model) H(P, Q1) = -0ln(0.05) - 0ln(0.05) - 1*ln(0.90) = -ln(0.90) = 0.105
Step 2: Cross-entropy H(P, Q2) (bad model) H(P, Q2) = -0ln(0.40) - 0ln(0.30) - 1*ln(0.30) = -ln(0.30) = 1.204
Step 3: KL divergence (since P is one-hot, KL = cross-entropy - entropy of P) H(P) = 0 (deterministic distribution) D_KL(P||Q1) = 0.105 - 0 = 0.105 D_KL(P||Q2) = 1.204 - 0 = 1.204
Interpret: "The good model's cross-entropy (0.105) is much lower than the bad model's (1.204). Minimizing cross-entropy during training is equivalent to minimizing KL divergence — making the model's predicted distribution closer to the true distribution."
KL divergence is not symmetric:
Cross-entropy (used as loss in classification) is related:
Since
# kl_divergence.py — Cross-entropy loss connection
import numpy as np
def cross_entropy(p, q):
"""Cross-entropy between true distribution p and predicted q."""
q = np.clip(q, 1e-15, 1) # avoid log(0)
return -np.sum(p * np.log(q))
def kl_divergence(p, q):
"""KL divergence D_KL(P || Q)."""
mask = p > 0
return np.sum(p[mask] * np.log(p[mask] / np.clip(q[mask], 1e-15, 1)))
# True distribution (one-hot for class 2 in 3-class problem)
p = np.array([0.0, 0.0, 1.0])
# Good prediction
q_good = np.array([0.05, 0.05, 0.90])
# Bad prediction
q_bad = np.array([0.40, 0.30, 0.30])
print(f"Cross-entropy (good): {cross_entropy(p, q_good):.4f}")
print(f"Cross-entropy (bad): {cross_entropy(p, q_bad):.4f}")
print(f"KL divergence (good): {kl_divergence(p, q_good):.4f}")
print(f"KL divergence (bad): {kl_divergence(p, q_bad):.4f}")Putting It All Together: Math in a Real ML Pipeline
# full_pipeline_math.py — See all four pillars in action
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load data — LINEAR ALGEBRA: data is a matrix
iris = load_iris()
X, y = iris.data, iris.target
print(f"Data matrix shape: {X.shape}") # (150, 4)
# Train/test split — PROBABILITY: random sampling
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scaling — STATISTICS: standardize to zero mean, unit variance
# z = (x - μ) / σ
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Train mean: {X_train_scaled.mean(axis=0).round(10)}") # ~0
print(f"Train std: {X_train_scaled.std(axis=0).round(4)}") # ~1
# Logistic Regression — CALCULUS: gradient descent minimizes cross-entropy
# The model learns: P(y=k|x) = softmax(Wx + b)
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
# Weights — LINEAR ALGEBRA: weight matrix W ∈ R^(3×4)
print(f"\nWeight matrix shape: {model.coef_.shape}")
print(f"Weights:\n{model.coef_.round(3)}")
# Prediction — LINEAR ALGEBRA: matrix multiply + CALCULUS: softmax
y_pred = model.predict(X_test_scaled)
# Evaluation — STATISTICS: accuracy is an estimator of true performance
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")
# PROBABILITY: predicted probabilities via softmax
probs = model.predict_proba(X_test_scaled)[:3]
print(f"\nPredicted probabilities (first 3):\n{probs.round(4)}")Quick Reference
| Math Concept | ML Application | Key Formula |
|---|---|---|
| Dot product | Similarity, linear models | |
| Matrix inverse | Normal equation | |
| Eigenvalues | PCA | |
| Derivative | Gradient descent | |
| Chain rule | Backpropagation | |
| Bayes' theorem | Naive Bayes, Bayesian models | |
| MLE | Parameter estimation | |
| Bias-variance | Model selection | |
| Entropy | Decision trees, cross-entropy loss | |
| KL divergence | Cross-entropy loss |
Further Reading
- Linear Regression — Normal equation and gradient descent applied
- Logistic Regression — MLE for classification
- Decision Trees — Entropy and information gain in practice
- SVM — Lagrangian optimization and the dual problem
- Naive Bayes — Bayes' theorem as a classifier