Neural Network Basics

Everything in deep learning rests on a handful of ideas: neurons that compute weighted sums, nonlinear activations, a forward pass that turns input into prediction, a backward pass that turns error into gradients, and an optimizer that updates weights. This page derives every one of those pieces from first principles, then glues them together into a working MLP trained on MNIST with nothing but NumPy.

The Perceptron

The perceptron is the simplest neural network --- a single neuron that makes binary decisions.

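Given an input vector $x \in \mathbb{R}^n$ and a weight vector $w \in \mathbb{R}^n$ with bias $b$, the perceptron computes:

$$\hat{y} = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

The decision boundary is the hyperplane $w \cdot x + b = 0$.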

Worked Example — Perceptron Decision

Input:

| | $x_1$ | $x_2$ |
|---|---|---|
| Sample | 1 | 0.5 |

Weights: $w = [0.6, -0.4]$, bias $b = -0.1$

Step 1: Compute weighted sum

$$w \cdot x + b = (0.6)(1) + (-0.4)(0.5) + (-0.1) = 0.6 - 0.2 - 0.1 = 0.3$$

Step 2: Apply threshold

$$0.3 > 0 \implies \hat{y} = 1$$

Result: The perceptron predicts class 1. The sample is on the positive side of the decision boundary $0.6x_1 - 0.4x_2 - 0.1 = 0$.

Perceptron Learning Rule

The weight update for a misclassified sample is:

$$w \leftarrow w + \eta\,(y - \hat{y})\,x \qquad b \leftarrow b + \eta\,(y - \hat{y})$$

where $\eta$ is the learning rate and $y$ is the true label.

Worked Example — Perceptron Learning Rule

Input: Sample $x = [1, 0.5]$, true label $y = 0$, predicted $\hat{y} = 1$, learning rate $\eta = 0.1$

Current: $w = [0.6, -0.4]$, $b = -0.1$

Step 1: Compute error

$$y - \hat{y} = 0 - 1 = -1$$

Step 2: Update weights

$$w_1 \leftarrow 0.6 + 0.1 \times (-1) \times 1 = 0.5 \qquad w_2 \leftarrow -0.4 + 0.1 \times (-1) \times 0.5 = -0.45$$

Step 3: Update bias

$$b \leftarrow -0.1 + 0.1 \times (-1) = -0.2$$

Result: New weights $w = [0.5, -0.45]$, $b = -0.2$. The decision boundary shifted to reduce the score for this misclassified sample. Recheck: $0.5(1) + (-0.45)(0.5) + (-0.2) = 0.5 - 0.225 - 0.2 = 0.075$. The score is still positive, so more iterations are needed.
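The update in this worked example can be reproduced in a few lines. This is a minimal sketch that assumes the signed values implied by the arithmetic ($w = [0.6, -0.4]$, $b = -0.1$); the helper name `predict` is illustrative.

```python
import numpy as np

# Values from the worked example (signs as used in its arithmetic).
w = np.array([0.6, -0.4])
b = -0.1
x = np.array([1.0, 0.5])
y_true, lr = 0, 0.1

def predict(w, b, x):
    """Threshold the weighted sum at zero."""
    return 1 if np.dot(w, x) + b > 0 else 0

y_hat = predict(w, b, x)      # weighted sum is 0.3, so the prediction is 1
error = y_true - y_hat        # 0 - 1 = -1

# Perceptron learning rule: lower the score of the misclassified sample.
w_new = w + lr * error * x
b_new = b + lr * error
score_new = np.dot(w_new, x) + b_new

print(w_new, b_new, score_new)
```

The new score is smaller than before but still positive, so the same sample would trigger another update on the next pass.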

The XOR Problem

A single perceptron can only learn linearly separable functions. It cannot learn XOR. This limitation drove the development of multi-layer networks.
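A small experiment makes the limitation concrete. The sketch below trains one perceptron on XOR with the learning rule and then evaluates a two-layer threshold network whose weights are set by hand (illustrative values, not learned): one hidden unit computes OR, the other AND, and the output fires for "OR and not AND".

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

def step(z):
    """Heaviside threshold used by the perceptron."""
    return (np.asarray(z) > 0).astype(int)

# 1) Train a single perceptron on XOR with the learning rule.
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    for xi, yi in zip(X, y_xor):
        err = yi - step(xi @ w + b)
        w += lr * err * xi
        b += lr * err
single_acc = np.mean(step(X @ w + b) == y_xor)

# 2) Hand-set two-layer network (illustrative weights, not learned).
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])   # rows: OR unit, AND unit
b1 = np.array([-0.5, -1.5])
w2 = np.array([1.0, -1.0])                # OR minus AND
b2 = -0.5
hidden = step(X @ W1.T + b1)
two_layer_pred = step(hidden @ w2 + b2)

print("single perceptron accuracy:", single_acc)
print("two-layer predictions:", two_layer_pred.tolist())
```

No setting of `w` and `b` classifies all four XOR points correctly, so the single perceptron never reaches 100% accuracy regardless of how long it trains, while the two-layer network reproduces XOR exactly.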

Perceptron in Python

python
import numpy as np

class Perceptron:
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b > 0 else 0

    def train(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.predict(xi)
                error = yi - pred
                self.w += self.lr * error * xi
                self.b += self.lr * error

# AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(2)
p.train(X, y)
print([p.predict(xi) for xi in X])  # [0, 0, 0, 1]

Multi-Layer Perceptron (MLP)

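An MLP stacks multiple layers of neurons. Each neuron in layer $l$ receives input from every neuron in layer $l-1$ (fully connected). With nonlinear activations between layers and enough hidden units, an MLP can approximate any continuous function on a compact domain to arbitrary accuracy (the universal approximation theorem).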

Architecture

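For an MLP with $L$ layers:

$$\begin{aligned}
a^{(0)} &= x && \text{(input)} \\
z^{(l)} &= W^{(l)} a^{(l-1)} + b^{(l)} && \text{(pre-activation)} \\
a^{(l)} &= \sigma^{(l)}(z^{(l)}) && \text{(activation)} \\
\hat{y} &= a^{(L)} && \text{(output)}
\end{aligned}$$

where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix for layer $l$, $b^{(l)} \in \mathbb{R}^{n_l}$ is the bias, and $\sigma^{(l)}$ is the activation function.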

Worked Example — MLP Forward Pass (2-layer)

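Input: $x = [1.0, 0.5]$ (2 features), hidden layer size 2, output size 2

Weights:

$$W^{(1)} = \begin{bmatrix} 0.3 & 0.7 \\ -0.2 & 0.4 \end{bmatrix},\quad b^{(1)} = \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix} \qquad W^{(2)} = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.6 \end{bmatrix},\quad b^{(2)} = \begin{bmatrix} 0.0 \\ 0.0 \end{bmatrix}$$

Step 1: Pre-activation layer 1

$$z^{(1)} = W^{(1)}x + b^{(1)} = \begin{bmatrix} 0.3(1) + 0.7(0.5) + 0.1 \\ -0.2(1) + 0.4(0.5) - 0.1 \end{bmatrix} = \begin{bmatrix} 0.75 \\ -0.1 \end{bmatrix}$$

Step 2: Apply ReLU

$$a^{(1)} = \mathrm{ReLU}(z^{(1)}) = \begin{bmatrix} 0.75 \\ 0 \end{bmatrix}$$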

The second neuron is inactive for this input: its negative pre-activation is zeroed by ReLU.

Step 3: Pre-activation layer 2

$$z^{(2)} = W^{(2)}a^{(1)} + b^{(2)} = \begin{bmatrix} 0.5(0.75) + (-0.3)(0) \\ 0.2(0.75) + 0.6(0) \end{bmatrix} = \begin{bmatrix} 0.375 \\ 0.15 \end{bmatrix}$$

Step 4: Softmax output

$$\hat{y} = \mathrm{softmax}([0.375, 0.15]) = \left[\frac{e^{0.375}}{e^{0.375} + e^{0.15}},\ \frac{e^{0.15}}{e^{0.375} + e^{0.15}}\right] = [0.556, 0.444]$$

Result: The network predicts class 0 with 55.6% probability. Only one hidden neuron contributed because the other had a negative pre-activation and was zeroed by ReLU.
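The same forward pass can be checked numerically. This is a minimal sketch using the column-vector convention of the equations, with the signed weights implied by the arithmetic of the example; `relu` and `softmax` are small local helpers.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

x = np.array([1.0, 0.5])
W1 = np.array([[0.3, 0.7], [-0.2, 0.4]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.5, -0.3], [0.2, 0.6]])
b2 = np.array([0.0, 0.0])

z1 = W1 @ x + b1      # hidden pre-activation
a1 = relu(z1)         # hidden activation
z2 = W2 @ a1 + b2     # output logits
y_hat = softmax(z2)   # class probabilities

print(z1, a1, z2, np.round(y_hat, 3))
```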

Activation Functions

Activation functions introduce nonlinearity. Without them, stacking linear layers collapses to a single linear transformation: $W_2(W_1 x + b_1) + b_2 = Wx + b$, with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.
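The collapse is easy to confirm numerically. The sketch below uses arbitrary random matrices (shapes and seed are illustrative choices) and checks that two stacked linear layers equal one linear layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied in sequence, with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```

Inserting a nonlinearity such as ReLU between the two layers breaks this equivalence, which is what gives depth its extra expressive power.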

Sigmoid

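$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Derivative:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

Derivation: Let $\sigma = (1 + e^{-x})^{-1}$. By the chain rule:

$$\frac{d\sigma}{dx} = -\frac{1}{(1 + e^{-x})^2}\,(-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2}$$

Since $e^{-x} = \frac{1}{\sigma} - 1 = \frac{1 - \sigma}{\sigma}$ and $(1 + e^{-x})^2 = \frac{1}{\sigma^2}$:

$$\sigma' = \frac{\;\frac{1 - \sigma}{\sigma}\;}{\frac{1}{\sigma^2}} = \sigma(1 - \sigma)$$

Properties: Output in $(0, 1)$. Suffers from vanishing gradients: when $|x|$ is large, $\sigma'(x) \to 0$, which kills gradient flow. Used mainly in output layers for binary classification.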

Worked Example — Sigmoid Activation and Derivative

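Input values: $x = [-2, 0, 0.75, 3]$

Step 1: Compute $\sigma(x) = \frac{1}{1 + e^{-x}}$

| $x$ | $e^{-x}$ | $\sigma(x)$ |
|---|---|---|
| -2 | 7.389 | 1/8.389 = 0.119 |
| 0 | 1.000 | 1/2 = 0.500 |
| 0.75 | 0.472 | 1/1.472 = 0.679 |
| 3 | 0.050 | 1/1.050 = 0.953 |

Step 2: Compute derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$

| $x$ | $\sigma(x)$ | $\sigma'(x)$ |
|---|---|---|
| -2 | 0.119 | 0.119 × 0.881 = 0.105 |
| 0 | 0.500 | 0.500 × 0.500 = 0.250 |
| 0.75 | 0.679 | 0.679 × 0.321 = 0.218 |
| 3 | 0.953 | 0.953 × 0.047 = 0.045 |

Result: The derivative peaks at $x = 0$ (max value 0.25) and decays toward zero for large $|x|$. At $x = 3$, the gradient is only 0.045, which is why sigmoid causes vanishing gradients in deep networks.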
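The table values, and the closed-form derivative itself, can be checked with a central finite difference. A minimal sketch; the step size `eps` is an arbitrary small choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 0.75, 3.0])

# Central difference approximation of the derivative.
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.round(sigmoid(x), 3))
print(np.round(sigmoid_grad(x), 3))
print(np.max(np.abs(numeric - sigmoid_grad(x))))
```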

Tanh

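$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Derivative:

$$\tanh'(x) = 1 - \tanh^2(x)$$

Derivation: Using the quotient rule on $\tanh = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$:

$$\tanh' = \frac{(e^{x} + e^{-x})^2 - (e^{x} - e^{-x})^2}{(e^{x} + e^{-x})^2} = 1 - \tanh^2(x)$$

Properties: Output in $(-1, 1)$, zero-centered (unlike sigmoid). Still suffers from vanishing gradients at extremes. Common in RNN hidden states.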

ReLU (Rectified Linear Unit)

$$\mathrm{ReLU}(x) = \max(0, x)$$

Derivative:

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$

(Undefined at x=0; conventionally set to 0 or 1.)

Properties: No vanishing gradient for positive inputs. Sparse activations (many neurons output 0). Computationally efficient. Suffers from "dying ReLU": a neuron whose pre-activation is negative for every training input outputs 0, receives zero gradient, and may never recover.

Leaky ReLU

$$\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$$

where $\alpha$ is typically 0.01. Mitigates the dying ReLU problem by allowing a small gradient for negative inputs.

GELU (Gaussian Error Linear Unit)

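$$\mathrm{GELU}(x) = x\,\Phi(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$

Approximate form:

$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\,\left(x + 0.044715\,x^3\right)\right]\right)$$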

Properties: Smooth, non-monotonic. Used in BERT, GPT, and many other transformers. Combines the benefits of ReLU (non-saturating for large positive $x$) with a smooth transition near zero.
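To see how close the tanh form is to the exact definition, the sketch below evaluates both on a grid. It uses `math.erf` from the standard library (wrapped with `np.vectorize`) so that no extra dependency is needed; the function names and the grid are illustrative choices.

```python
import math
import numpy as np

def gelu_exact(x):
    """x * Phi(x), with Phi written in terms of the error function."""
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + np.tanh(inner))

x = np.linspace(-5.0, 5.0, 1001)
max_gap = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print("largest absolute difference on [-5, 5]:", max_gap)

# Non-monotonic: the output at -1 is lower than the output at -3.
probe = gelu_exact(np.array([-3.0, -1.0]))
print(probe)
```

The two curves differ only by a small amount on this range, and the probe values show the dip below zero for moderately negative inputs that makes GELU non-monotonic.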

Softmax (for output layers)

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Converts a vector of logits into a probability distribution. Used in the output layer of multi-class classifiers.

Worked Example — Softmax

Input logits: z=[2.0,1.0,0.1] (3-class classifier)

Step 1: Compute exponentials

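$$e^{2.0} = 7.389,\quad e^{1.0} = 2.718,\quad e^{0.1} = 1.105$$

Step 2: Sum

$$\sum_j e^{z_j} = 7.389 + 2.718 + 1.105 = 11.212$$

Step 3: Normalize

$$\mathrm{softmax}(z) = \left[\frac{7.389}{11.212},\ \frac{2.718}{11.212},\ \frac{1.105}{11.212}\right] = [0.659, 0.242, 0.099]$$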

Result: The model assigns 65.9% probability to class 0, 24.2% to class 1, and 9.9% to class 2. Probabilities sum to 1.0. The largest logit (2.0) dominates after softmax amplification.
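In code, softmax is usually computed after subtracting the largest logit, which leaves the result unchanged but avoids overflow in `exp`. A minimal sketch for a single logit vector; the second call uses the same logits shifted by 1000 purely to illustrate the shift invariance.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D vector of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))   # largest exponent is exp(0) = 1
    return e / e.sum()

probs = softmax([2.0, 1.0, 0.1])
shifted = softmax([1002.0, 1001.0, 1000.1])   # same logits + 1000

print(np.round(probs, 3), probs.sum())
print(np.round(shifted, 3))
```

A naive `np.exp(1002.0)` overflows to infinity, whereas the shifted version returns the same probabilities as for the original logits.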

Activation Function Comparison

python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Sigmoid
axes[0, 0].plot(x, 1 / (1 + np.exp(-x)), 'b-', linewidth=2)
axes[0, 0].set_title('Sigmoid')
axes[0, 0].axhline(y=0, color='k', linewidth=0.5)
axes[0, 0].axvline(x=0, color='k', linewidth=0.5)
axes[0, 0].grid(True, alpha=0.3)

# Tanh
axes[0, 1].plot(x, np.tanh(x), 'r-', linewidth=2)
axes[0, 1].set_title('Tanh')
axes[0, 1].axhline(y=0, color='k', linewidth=0.5)
axes[0, 1].axvline(x=0, color='k', linewidth=0.5)
axes[0, 1].grid(True, alpha=0.3)

# ReLU
axes[0, 2].plot(x, np.maximum(0, x), 'g-', linewidth=2)
axes[0, 2].set_title('ReLU')
axes[0, 2].axhline(y=0, color='k', linewidth=0.5)
axes[0, 2].axvline(x=0, color='k', linewidth=0.5)
axes[0, 2].grid(True, alpha=0.3)

# Leaky ReLU
axes[1, 0].plot(x, np.where(x > 0, x, 0.01 * x), 'm-', linewidth=2)
axes[1, 0].set_title('Leaky ReLU (α=0.01)')
axes[1, 0].axhline(y=0, color='k', linewidth=0.5)
axes[1, 0].axvline(x=0, color='k', linewidth=0.5)
axes[1, 0].grid(True, alpha=0.3)

# GELU
from scipy.special import erf
gelu = 0.5 * x * (1 + erf(x / np.sqrt(2)))
axes[1, 1].plot(x, gelu, 'c-', linewidth=2)
axes[1, 1].set_title('GELU')
axes[1, 1].axhline(y=0, color='k', linewidth=0.5)
axes[1, 1].axvline(x=0, color='k', linewidth=0.5)
axes[1, 1].grid(True, alpha=0.3)

# Derivatives comparison
sig = 1 / (1 + np.exp(-x))
axes[1, 2].plot(x, sig * (1 - sig), 'b--', label="Sigmoid'", linewidth=2)
axes[1, 2].plot(x, 1 - np.tanh(x)**2, 'r--', label="Tanh'", linewidth=2)
axes[1, 2].plot(x, np.where(x > 0, 1, 0).astype(float), 'g--', label="ReLU'", linewidth=2)
axes[1, 2].set_title('Derivatives')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('activations.png', dpi=150)

The Forward Pass

The forward pass computes the output of the network layer by layer.

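For a 3-layer MLP with input $x$, hidden sizes $h_1, h_2$, and output size $K$:

$$\begin{aligned}
z^{(1)} &= W^{(1)}x + b^{(1)}, & a^{(1)} &= \mathrm{ReLU}(z^{(1)}) \\
z^{(2)} &= W^{(2)}a^{(1)} + b^{(2)}, & a^{(2)} &= \mathrm{ReLU}(z^{(2)}) \\
z^{(3)} &= W^{(3)}a^{(2)} + b^{(3)}, & \hat{y} &= \mathrm{softmax}(z^{(3)})
\end{aligned}$$

Each step is a matrix multiplication followed by a nonlinearity (element-wise for ReLU, normalized across the vector for softmax). The forward pass stores all intermediate values ($z^{(l)}$ and $a^{(l)}$) because backpropagation needs them.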

Loss Functions

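The loss function measures how far the prediction $\hat{y}$ is from the true label $y$. Training minimizes this loss.

Mean Squared Error (MSE)

$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

Used for regression. Derivative: $\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$.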

Cross-Entropy Loss

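For classification with $K$ classes:

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log(\hat{y}_{ik})$$

With one-hot labels ($y_{ik} = 1$ only for the correct class $c_i$):

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\log(\hat{y}_{i,c_i})$$

Why cross-entropy, not MSE, for classification? With a sigmoid or softmax output, the MSE gradient with respect to the pre-activation carries a factor of the activation's derivative, which is tiny when the output saturates near 0 or 1, so learning is very slow even when the prediction is wrong. Cross-entropy cancels that factor and produces large gradients when the prediction is wrong, enabling faster learning.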

Worked Example — MSE and Cross-Entropy Loss

Setup: 3-class problem, true label = class 1 (one-hot: $y = [0, 1, 0]$)

Prediction (softmax output): $\hat{y} = [0.1, 0.7, 0.2]$

MSE Loss:

$$L_{MSE} = \frac{1}{3}\left[(0 - 0.1)^2 + (1 - 0.7)^2 + (0 - 0.2)^2\right] = \frac{0.01 + 0.09 + 0.04}{3} = 0.0467$$

Cross-Entropy Loss:

$$L_{CE} = -\left[0 \cdot \log(0.1) + 1 \cdot \log(0.7) + 0 \cdot \log(0.2)\right] = -\log(0.7) = 0.357$$

Now consider a bad prediction: $\hat{y} = [0.1, 0.1, 0.8]$ (confident but wrong)

MSE: $\frac{0.01 + 0.81 + 0.64}{3} = 0.487$

CE: $-\log(0.1) = 2.303$

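Result: Moving from the good to the bad prediction raises cross-entropy from 0.357 to 2.303 (about 6.5x) and MSE from 0.0467 to 0.487 (about 10x), so the ratio alone does not favor cross-entropy. The difference is in how the losses behave for confident mistakes: cross-entropy grows without bound as the probability assigned to the correct class approaches 0, while MSE on probabilities stays bounded. Combined with softmax, the cross-entropy gradient with respect to the logits is $\hat{y} - y$, which stays large for confident wrong predictions and enables faster learning.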
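The four loss values in this example can be reproduced directly. A minimal sketch for a single sample with natural logarithms; the third prediction (`very_bad`) is an extra illustrative input, not part of the worked example, added to show how cross-entropy keeps growing as the correct-class probability shrinks.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error over the class dimension of one sample."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy for one sample; clipping avoids log(0)."""
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

y = np.array([0.0, 1.0, 0.0])
good = np.array([0.1, 0.7, 0.2])
bad = np.array([0.1, 0.1, 0.8])
very_bad = np.array([0.0005, 0.001, 0.9985])   # illustrative extra case

for name, pred in [("good", good), ("bad", bad), ("very bad", very_bad)]:
    print(f"{name:9s} MSE={mse(y, pred):.4f}  CE={cross_entropy(y, pred):.3f}")
```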

Binary Cross-Entropy

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

Backpropagation: Full Derivation

Backpropagation is just the chain rule applied systematically from the loss back through the network.

Setup

Consider a single training example with a 2-layer network:

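$$\begin{aligned}
z^{(1)} &= W^{(1)}x + b^{(1)}, & a^{(1)} &= \sigma(z^{(1)}) \\
z^{(2)} &= W^{(2)}a^{(1)} + b^{(2)}, & \hat{y} &= \mathrm{softmax}(z^{(2)}) \\
L &= -\sum_k y_k \log(\hat{y}_k)
\end{aligned}$$

Step 1: Output Layer Gradient

We need $\frac{\partial L}{\partial z^{(2)}}$. For softmax + cross-entropy, this simplifies beautifully:

$$\frac{\partial L}{\partial z_k^{(2)}} = \hat{y}_k - y_k$$

Full derivation: The softmax output is $\hat{y}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$. The cross-entropy is $L = -\sum_k y_k \log \hat{y}_k$.

For $\frac{\partial \hat{y}_k}{\partial z_i}$, we have two cases:

When $i = k$:

$$\frac{\partial \hat{y}_k}{\partial z_k} = \hat{y}_k(1 - \hat{y}_k)$$

When $i \neq k$:

$$\frac{\partial \hat{y}_k}{\partial z_i} = -\hat{y}_k \hat{y}_i$$

Combining with the cross-entropy derivative $\frac{\partial L}{\partial \hat{y}_k} = -\frac{y_k}{\hat{y}_k}$:

$$\begin{aligned}
\frac{\partial L}{\partial z_i} &= \sum_k \frac{\partial L}{\partial \hat{y}_k}\frac{\partial \hat{y}_k}{\partial z_i} = -\sum_k \frac{y_k}{\hat{y}_k}\frac{\partial \hat{y}_k}{\partial z_i} \\
&= -y_i(1 - \hat{y}_i) + \sum_{k \neq i} y_k \hat{y}_i \\
&= -y_i + y_i\hat{y}_i + \hat{y}_i\sum_{k \neq i} y_k \\
&= -y_i + \hat{y}_i\sum_k y_k = \hat{y}_i - y_i
\end{aligned}$$

(using $\sum_k y_k = 1$ for one-hot labels).

Let $\delta^{(2)} = \hat{y} - y$.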

Worked Example — Backpropagation Output Layer Gradient

Setup: 2-class output, true label = class 0 ($y = [1, 0]$), softmax output $\hat{y} = [0.6, 0.4]$

Step 1: Compute $\delta^{(2)} = \hat{y} - y$

$$\delta^{(2)} = [0.6 - 1,\ 0.4 - 0] = [-0.4, 0.4]$$

Result: The gradient for class 0 is $-0.4$ (negative because the prediction was too low, so gradient descent pushes it up). The gradient for class 1 is $+0.4$ (positive because the prediction was too high, so gradient descent pushes it down). This simple gradient is why softmax + cross-entropy is the standard pairing.
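The identity $\partial L / \partial z = \hat{y} - y$ can be confirmed with a finite-difference check. A minimal sketch; the logits `z` are arbitrary values chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -1.2, 2.0])   # arbitrary logits
y = np.array([0.0, 0.0, 1.0])    # one-hot label

analytic = softmax(z) - y        # closed-form gradient

# Central differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(analytic)
print(numeric)
```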

Step 2: Weight Gradients (Output Layer)

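$$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)}(a^{(1)})^T \qquad \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$$

Step 3: Propagate to Hidden Layer

$$\delta^{(1)} = (W^{(2)})^T\delta^{(2)} \odot \sigma'(z^{(1)})$$

where $\odot$ is element-wise multiplication and $\sigma'$ is the derivative of the activation function.

Step 4: Weight Gradients (Hidden Layer)

$$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)}x^T \qquad \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$$

General Backpropagation Rule

For layer $l$:

$$\delta^{(l)} = \left[(W^{(l+1)})^T\delta^{(l+1)}\right] \odot \sigma'(z^{(l)}) \qquad \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)}(a^{(l-1)})^T \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

This is why we store $z^{(l)}$ and $a^{(l)}$ during the forward pass: we need them to compute the gradients.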

Worked Example — Full Backprop Through a 2-Layer Network

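Setup: Input $x = [1.0, 0.5]$, true label $y = [1, 0]$ (class 0), sigmoid activation, learning rate $\eta = 0.5$

$W^{(1)} = \begin{bmatrix} 0.3 & 0.7 \\ -0.2 & 0.4 \end{bmatrix}$, $b^{(1)} = [0, 0]$, $W^{(2)} = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.6 \end{bmatrix}$, $b^{(2)} = [0, 0]$

Forward pass:

  • $z^{(1)} = [0.3(1) + 0.7(0.5),\ -0.2(1) + 0.4(0.5)] = [0.65, 0.0]$
  • $a^{(1)} = \sigma([0.65, 0.0]) = [0.657, 0.500]$
  • $z^{(2)} = [0.5(0.657) + (-0.3)(0.5),\ 0.2(0.657) + 0.6(0.5)] = [0.179, 0.431]$
  • $\hat{y} = \mathrm{softmax}([0.179, 0.431]) = [0.437, 0.563]$

Backprop Step 1: Output gradient

$$\delta^{(2)} = \hat{y} - y = [0.437 - 1,\ 0.563 - 0] = [-0.563, 0.563]$$

Backprop Step 2: Weight gradients (output layer)

$$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)}(a^{(1)})^T = \begin{bmatrix} -0.563 \\ 0.563 \end{bmatrix}\begin{bmatrix} 0.657 & 0.500 \end{bmatrix} = \begin{bmatrix} -0.370 & -0.281 \\ 0.370 & 0.281 \end{bmatrix}$$

Backprop Step 3: Propagate to hidden layer

$$\begin{aligned}
\delta^{(1)} &= (W^{(2)})^T\delta^{(2)} \odot \sigma'(z^{(1)}) \\
(W^{(2)})^T\delta^{(2)} &= \begin{bmatrix} 0.5 & 0.2 \\ -0.3 & 0.6 \end{bmatrix}\begin{bmatrix} -0.563 \\ 0.563 \end{bmatrix} = \begin{bmatrix} -0.169 \\ 0.507 \end{bmatrix} \\
\sigma'(z^{(1)}) &= [0.657(1 - 0.657),\ 0.5(1 - 0.5)] = [0.225, 0.250] \\
\delta^{(1)} &= [-0.169 \times 0.225,\ 0.507 \times 0.250] = [-0.038, 0.127]
\end{aligned}$$

Backprop Step 4: Update $W^{(2)}$

$$W^{(2)}_{\text{new}} = W^{(2)} - 0.5 \times \frac{\partial L}{\partial W^{(2)}} = \begin{bmatrix} 0.5 + 0.185 & -0.3 + 0.141 \\ 0.2 - 0.185 & 0.6 - 0.141 \end{bmatrix} = \begin{bmatrix} 0.685 & -0.159 \\ 0.015 & 0.459 \end{bmatrix}$$

Result: The weight connecting the first hidden neuron (activation 0.657) to the correct output class increased from 0.5 to 0.685, strengthening the path to the correct prediction.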
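The whole example fits in a short script, which is a convenient way to check hand calculations. This sketch follows the column-vector convention of the derivation and assumes the signed weights implied by the forward-pass arithmetic; it also computes the hidden-layer gradients `dW1` and `db1`, which the example does not apply.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([1.0, 0.5])
y = np.array([1.0, 0.0])
lr = 0.5
W1 = np.array([[0.3, 0.7], [-0.2, 0.4]])
b1 = np.zeros(2)
W2 = np.array([[0.5, -0.3], [0.2, 0.6]])
b2 = np.zeros(2)

# Forward pass (keep intermediates for the backward pass).
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = softmax(z2)

# Backward pass.
delta2 = y_hat - y                         # softmax + cross-entropy gradient
dW2 = np.outer(delta2, a1)                 # delta2 (a1)^T
db2 = delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # sigmoid'(z1) = a1 * (1 - a1)
dW1 = np.outer(delta1, x)
db1 = delta1

# Gradient descent step on the output layer, as in the example.
W2_new = W2 - lr * dW2
print(np.round(W2_new, 3))
```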

Gradient Descent Variants

Once we have gradients, we need an optimizer to update the weights.

Vanilla SGD

$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L$$

Simple but slow. Oscillates in ravines (dimensions with very different curvatures).

SGD with Momentum

$$v_t = \beta v_{t-1} + (1 - \beta)\,\nabla_\theta L \qquad \theta_{t+1} = \theta_t - \eta\,v_t$$

Momentum ($\beta \approx 0.9$) accumulates past gradients, damping oscillations and accelerating movement along consistent gradient directions. Think of a ball rolling downhill with inertia.

RMSProp

$$s_t = \beta s_{t-1} + (1 - \beta)\,g_t^2 \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon}\,g_t$$

Adapts the learning rate per parameter. Parameters with large gradients get smaller effective learning rates.
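The three update rules above can be compared on a toy "ravine". The sketch below minimizes $f(\theta) = \tfrac{1}{2}(\theta_1^2 + 25\,\theta_2^2)$, a hypothetical test function chosen so the two coordinates have very different curvature; the learning rates and step counts are illustrative, not tuned recommendations.

```python
import numpy as np

CURVATURE = np.array([1.0, 25.0])   # very different curvature per coordinate

def loss(theta):
    return 0.5 * np.sum(CURVATURE * theta ** 2)

def grad(theta):
    return CURVATURE * theta

def run_sgd(theta, lr=0.03, steps=500):
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def run_momentum(theta, lr=0.03, beta=0.9, steps=500):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(theta)   # moving average of gradients
        theta = theta - lr * v
    return theta

def run_rmsprop(theta, lr=0.02, beta=0.9, eps=1e-8, steps=500):
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        s = beta * s + (1 - beta) * g ** 2        # moving average of g^2
        theta = theta - lr * g / (np.sqrt(s) + eps)
    return theta

start = np.array([1.0, 1.0])
results = {
    "sgd": run_sgd(start.copy()),
    "momentum": run_momentum(start.copy()),
    "rmsprop": run_rmsprop(start.copy()),
}
for name, theta in results.items():
    print(f"{name:9s} final loss = {loss(theta):.3e}")
```

Note that RMSProp takes steps of roughly constant size near the minimum, so with a fixed learning rate it hovers around the optimum rather than converging exactly.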

Adam (Adaptive Moment Estimation)

Adam combines momentum with RMSProp:

First moment (mean):

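$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,g_t$$

Second moment (uncentered variance):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,g_t^2$$

Bias correction (crucial for early steps when $m$ and $v$ are biased toward zero):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Parameter update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.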

Worked Example — Adam Optimizer (3 Steps)

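Setup: Single parameter $\theta_0 = 5.0$, $\eta = 0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

Gradients: $g_1 = 2.0$, $g_2 = 1.5$, $g_3 = 3.0$, init $m_0 = 0$, $v_0 = 0$

Step 1 ($t = 1$, $g_1 = 2.0$):

  • $m_1 = 0.9(0) + 0.1(2.0) = 0.2$
  • $v_1 = 0.999(0) + 0.001(4.0) = 0.004$
  • $\hat{m}_1 = 0.2/(1 - 0.9^1) = 0.2/0.1 = 2.0$ (bias correction is huge at $t = 1$)
  • $\hat{v}_1 = 0.004/(1 - 0.999^1) = 0.004/0.001 = 4.0$
  • $\theta_1 = 5.0 - 0.1 \times 2.0/(\sqrt{4.0} + 10^{-8}) = 5.0 - 0.1 = 4.9$

Step 2 ($t = 2$, $g_2 = 1.5$):

  • $m_2 = 0.9(0.2) + 0.1(1.5) = 0.33$
  • $v_2 = 0.999(0.004) + 0.001(2.25) \approx 0.00625$
  • $\hat{m}_2 = 0.33/(1 - 0.81) = 1.737$
  • $\hat{v}_2 = 0.00625/(1 - 0.998) = 3.125$
  • $\theta_2 = 4.9 - 0.1 \times 1.737/\sqrt{3.125} = 4.9 - 0.098 = 4.802$

Step 3 ($t = 3$, $g_3 = 3.0$):

  • $m_3 = 0.9(0.33) + 0.1(3.0) = 0.597$
  • $v_3 = 0.999(0.00625) + 0.001(9.0) \approx 0.01524$
  • $\hat{m}_3 = 0.597/(1 - 0.729) = 0.597/0.271 = 2.203$
  • $\hat{v}_3 = 0.01524/(1 - 0.999^3) \approx 0.01524/0.002997 \approx 5.085$
  • $\theta_3 = 4.802 - 0.1 \times 2.203/\sqrt{5.085} = 4.802 - 0.098 = 4.704$

Result: Adam adapts the effective learning rate per parameter. The bias correction at early steps is critical: without it, $m_1 = 0.2$ would underestimate the mean gradient (2.0) by 10x and $v_1 = 0.004$ would underestimate the mean squared gradient (4.0) by 1000x. Because the two moments are biased by different amounts, the uncorrected ratio $m_1/\sqrt{v_1} \approx 3.2$ would make the first step about three times larger than the corrected one.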
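The recursion in this example is short enough to run directly. A minimal sketch for a single scalar parameter; the function name `adam_steps` is illustrative, and the gradients are the three values listed in the setup.

```python
import math

def adam_steps(theta, grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update per gradient and return the parameter history."""
    m, v = 0.0, 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(theta)
    return history

thetas = adam_steps(5.0, [2.0, 1.5, 3.0])
print([round(th, 3) for th in thetas])
```

Each step moves the parameter by roughly the learning rate (0.1), regardless of the gradient's magnitude, because the update is the ratio of two bias-corrected averages.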

Adam Is the Default

Adam works well out of the box for most problems. Start with Adam and a learning rate of $3 \times 10^{-4}$. Switch to SGD+momentum only if you need the last bit of generalization (SGD often finds flatter minima).

AdamW (Adam with Decoupled Weight Decay)

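In standard Adam, weight decay implemented as L2 regularization is added to the gradient, so the decay term is rescaled by the adaptive per-parameter step size. AdamW applies the decay directly to the weights instead:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_t\right)$$

where $\lambda$ is the weight decay coefficient. AdamW is a common default optimizer for transformer training.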
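The difference between the two schemes shows up clearly when the loss gradient is zero, so that only the decay term acts. The sketch below implements one scalar step of each variant; the function names and the values of `lr` and `lam` are illustrative assumptions.

```python
import math

def adam_l2_step(theta, g, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: decay is folded into the gradient."""
    g = g + lam * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, g, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: decay is applied to the weight, outside the moment estimates."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + lam * theta), m, v

# One step from theta = 1.0 with zero loss gradient isolates the decay term.
theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, lam=0.01)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, lam=0.01)
print(theta_l2, theta_w)
```

With L2 folded into the gradient, the adaptive normalization turns the tiny decay gradient into a full step of about `lr`, independent of `lam`. AdamW shrinks the weight by `lr * lam * theta`, which is the intended amount of decay.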

From-Scratch MLP in NumPy: MNIST

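Here is a complete MLP trained on MNIST, implementing the forward pass, softmax cross-entropy loss, backpropagation, and mini-batch SGD derived above. The model and training loop use only NumPy; scikit-learn is used solely to download the dataset.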

python
import numpy as np

# ── Activation functions ─────────────────────────────────────────────
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def softmax(z):
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# ── Loss ─────────────────────────────────────────────────────────────
def cross_entropy_loss(y_pred, y_true):
    """y_true is one-hot, y_pred is softmax output."""
    N = y_true.shape[0]
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    loss = -np.sum(y_true * np.log(y_pred)) / N
    return loss

# ── Weight initialization (He) ───────────────────────────────────────
def he_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

# ── MLP class ────────────────────────────────────────────────────────
class MLP:
    def __init__(self, layer_sizes, lr=0.001):
        """
        layer_sizes: list, e.g., [784, 128, 64, 10]
        """
        self.lr = lr
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            W = he_init(layer_sizes[i], layer_sizes[i + 1])
            b = np.zeros((1, layer_sizes[i + 1]))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass. Store intermediates for backprop."""
        self.activations = [X]
        self.pre_activations = []
        a = X
        for i in range(len(self.weights) - 1):
            z = a @ self.weights[i] + self.biases[i]
            self.pre_activations.append(z)
            a = relu(z)
            self.activations.append(a)
        # Output layer: softmax
        z = a @ self.weights[-1] + self.biases[-1]
        self.pre_activations.append(z)
        a = softmax(z)
        self.activations.append(a)
        return a

    def backward(self, y_true):
        """Backpropagation: compute gradients and update the weights in place."""
        N = y_true.shape[0]
        n_layers = len(self.weights)

        # Output layer: softmax + cross-entropy gradient
        delta = (self.activations[-1] - y_true) / N

        for i in reversed(range(n_layers)):
            # Weight and bias gradients
            dW = self.activations[i].T @ delta
            db = np.sum(delta, axis=0, keepdims=True)

            # Propagate delta to previous layer (if not input layer)
            if i > 0:
                delta = (delta @ self.weights[i].T) * relu_derivative(
                    self.pre_activations[i - 1]
                )

            # Update weights
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def predict(self, X):
        return np.argmax(self.forward(X), axis=1)

# ── Data loading ─────────────────────────────────────────────────────
def load_mnist():
    """Load MNIST using sklearn (for simplicity)."""
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
    X, y = mnist.data, mnist.target.astype(int)

    # Normalize to [0, 1]
    X = X / 255.0

    # One-hot encode labels
    y_onehot = np.zeros((y.shape[0], 10))
    y_onehot[np.arange(y.shape[0]), y] = 1

    # Split: 60K train, 10K test
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y_onehot[:60000], y_onehot[60000:]
    y_test_labels = y[60000:]

    return X_train, y_train, X_test, y_test, y_test_labels

# ── Training ─────────────────────────────────────────────────────────
def train():
    X_train, y_train, X_test, y_test, y_test_labels = load_mnist()

    model = MLP(layer_sizes=[784, 256, 128, 10], lr=0.001)
    batch_size = 64
    epochs = 20

    for epoch in range(epochs):
        # Shuffle training data
        indices = np.random.permutation(X_train.shape[0])
        X_shuffled = X_train[indices]
        y_shuffled = y_train[indices]

        epoch_loss = 0
        n_batches = 0

        for i in range(0, X_train.shape[0], batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]

            # Forward pass
            y_pred = model.forward(X_batch)
            loss = cross_entropy_loss(y_pred, y_batch)
            epoch_loss += loss
            n_batches += 1

            # Backward pass (updates weights internally)
            model.backward(y_batch)

        # Evaluate
        test_preds = model.predict(X_test)
        accuracy = np.mean(test_preds == y_test_labels)
        print(
            f"Epoch {epoch + 1:2d} | "
            f"Loss: {epoch_loss / n_batches:.4f} | "
            f"Test Accuracy: {accuracy:.4f}"
        )

if __name__ == "__main__":
    train()

Illustrative output (selected epochs; exact values depend on the random initialization and shuffling):

Epoch  1 | Loss: 0.5832 | Test Accuracy: 0.9112
Epoch  5 | Loss: 0.1254 | Test Accuracy: 0.9598
Epoch 10 | Loss: 0.0612 | Test Accuracy: 0.9701
Epoch 20 | Loss: 0.0198 | Test Accuracy: 0.9752

From Scratch to Framework

This NumPy implementation performs the same core computations that PyTorch automates. The key difference is that PyTorch builds a computation graph automatically and computes gradients via autograd, so you never write backward() by hand. See PyTorch Fundamentals for the framework version.

The Computational Graph

Backpropagation works by traversing a directed acyclic graph (DAG) of operations in reverse order. Every operation records its inputs and the local gradient.

Each node stores $\partial\,\text{output} / \partial\,\text{input}$ (the local gradient). Backprop multiplies these local gradients along every path from the loss to each parameter and sums the contributions of all paths.
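A toy scalar implementation shows the mechanism. The `Value` class below is a hypothetical minimal sketch supporting only addition and multiplication, not how any particular framework is implemented: each node records its parents and local gradients, and `backward` walks the graph in reverse topological order.

```python
class Value:
    """Scalar node in a computation graph (minimal illustrative sketch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes of this operation
        self.local_grads = local_grads  # d(output)/d(input) per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Build a topological order of the DAG, then apply the chain rule
        # from the output back to the leaves.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad   # sum over paths

# L = (w * x + b)^2, written as z * z so that z feeds the loss twice.
w, x, b = Value(2.0), Value(3.0), Value(1.0)
z = w * x + b
L = z * z
L.backward()
print(w.grad, x.grad, b.grad)
```

Here $z = 7$ and $\partial L/\partial z = 2z = 14$, so the gradients are $14 \cdot x = 42$ for `w`, $14 \cdot w = 28$ for `x`, and $14$ for `b`.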

Numerical Gradient Checking

Always verify your analytical gradients with numerical gradients during development:

python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Compute numerical gradient of f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]

        x[idx] = old_val + eps
        f_plus = f(x)
        x[idx] = old_val - eps
        f_minus = f(x)
        x[idx] = old_val

        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Usage: compare with analytical gradient
# relative_error = |analytical - numerical| / max(|analytical|, |numerical|)
# Should be < 1e-5 for float64
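As a usage sketch, the check can be run on a function whose gradient is known in closed form. The block below is self-contained, so it restates a compact central-difference helper; the test function $f(W) = \sum \tanh(W)$ and the helper `relative_error` are illustrative choices.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old_val = x[idx]
        x[idx] = old_val + eps
        f_plus = f(x)
        x[idx] = old_val - eps
        f_minus = f(x)
        x[idx] = old_val                 # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(analytical, numerical):
    """Largest element-wise relative error between two gradients."""
    denom = np.maximum(np.abs(analytical), np.abs(numerical))
    return np.max(np.abs(analytical - numerical) / np.maximum(denom, 1e-12))

# f(W) = sum(tanh(W)) has gradient 1 - tanh(W)^2.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
analytical = 1.0 - np.tanh(W) ** 2
numerical = numerical_gradient(lambda w: np.sum(np.tanh(w)), W)

err = relative_error(analytical, numerical)
print("relative error:", err)
```

A relative error far above the threshold usually points to a bug in the analytical gradient rather than in the numerical one.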

Mini-Batch SGD: Why Batches?

| Strategy | Batch Size | Pros | Cons |
|---|---|---|---|
| Full-batch GD | All data | Stable gradient | Slow, needs lots of memory |
| Stochastic GD | 1 sample | Fast updates | Very noisy |
| Mini-batch SGD | 32 to 512 | Balance of speed and stability | Requires batch size tuning |

Mini-batch SGD is the standard. Typical batch sizes are powers of 2 (32, 64, 128, 256) for GPU memory alignment.

The gradient of a mini-batch of size B is:

$$g_B = \frac{1}{B}\sum_{i=1}^{B}\nabla_\theta L(x_i, y_i)$$

This is an unbiased estimate of the true gradient: $\mathbb{E}[g_B] = \nabla_\theta L$.
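One way to see this is to average the mini-batch gradients over a full shuffled epoch: with equal-size batches, the average equals the full-batch gradient, while each individual batch deviates from it. The sketch below uses a hypothetical linear regression problem with synthetic data; all sizes and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 1024, 64                      # dataset size and batch size (N % B == 0)
X = rng.normal(size=(N, 5))
y = rng.normal(size=N)
theta = rng.normal(size=5)

def batch_grad(Xb, yb, theta):
    """Gradient of the mean squared error of a linear model on one batch."""
    residual = Xb @ theta - yb
    return 2.0 * Xb.T @ residual / len(yb)

full = batch_grad(X, y, theta)

# Gradients of all disjoint mini-batches in one shuffled epoch.
perm = rng.permutation(N)
batch_grads = np.array([
    batch_grad(X[perm[i:i + B]], y[perm[i:i + B]], theta)
    for i in range(0, N, B)
])
mean_of_batches = batch_grads.mean(axis=0)

print("max deviation of the average:", np.max(np.abs(mean_of_batches - full)))
print("max deviation of one batch:  ", np.max(np.abs(batch_grads[0] - full)))
```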

Common Pitfalls

| Mistake | Symptom | Fix |
|---|---|---|
| Learning rate too high | Loss explodes or oscillates wildly | Reduce LR by 10x |
| Learning rate too low | Loss decreases extremely slowly | Increase LR or use scheduler |
| No input normalization | Training unstable, slow convergence | Normalize to mean 0, std 1 |
| Incorrect shapes | Dimension mismatch errors | Print shapes at every layer |
| Forgetting bias | Underfitting | Always include bias terms |
| Not shuffling data | Model learns order, not patterns | Shuffle each epoch |
| Integer labels with cross-entropy | Wrong loss values | One-hot encode labels |

"What I cannot create, I do not understand." — Richard Feynman