Neural Network Basics

Everything in deep learning rests on a handful of ideas: neurons that compute weighted sums, nonlinear activations, a forward pass that turns input into prediction, a backward pass that turns error into gradients, and an optimizer that updates weights. This page derives every one of those pieces from first principles, then glues them together into a working MLP trained on MNIST with nothing but NumPy.

The Perceptron

The perceptron is the simplest neural network --- a single neuron that makes binary decisions.

Given an input vector $x \in \mathbb{R}^{n}$ and a weight vector $w \in \mathbb{R}^{n}$ with bias $b$ , the perceptron computes:

\hat{y} = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}

The decision boundary is a hyperplane $w \cdot x + b = 0$ .

Worked Example — Perceptron Decision

Input:

	$x_{1}$	$x_{2}$
Sample	1	0.5

Weights: $w = [0.6, - 0.4]$ , bias $b = - 0.1$

Step 1: Compute weighted sum

w \cdot x + b = (0.6) (1) + (- 0.4) (0.5) + (- 0.1) = 0.6 - 0.2 - 0.1 = 0.3

Step 2: Apply threshold

0.3 > 0 ⟹ \hat{y} = 1

Result: The perceptron predicts class 1. The sample is on the positive side of the decision boundary $0.6 x_{1} - 0.4 x_{2} - 0.1 = 0$ .
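The same decision can be checked in a few lines of NumPy. This is a minimal sketch using the values from the worked example; the variable names `score` and `y_hat` are illustrative.

```python
import numpy as np

# Values from the worked example
x = np.array([1.0, 0.5])
w = np.array([0.6, -0.4])
b = -0.1

score = np.dot(w, x) + b        # weighted sum w . x + b
y_hat = 1 if score > 0 else 0   # threshold at zero

print(f"score = {score:.3f}, prediction = {y_hat}")
```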

Perceptron Learning Rule

The weight update for a misclassified sample is:

w \leftarrow w + η (y - \hat{y}) x

b \leftarrow b + η (y - \hat{y})

where $η$ is the learning rate and $y$ is the true label.

Worked Example — Perceptron Learning Rule

Input: Sample $x = [1, 0.5]$ , true label $y = 0$ , predicted $\hat{y} = 1$ , learning rate $η = 0.1$

Current: $w = [0.6, - 0.4]$ , $b = - 0.1$

Step 1: Compute error

y - \hat{y} = 0 - 1 = - 1

Step 2: Update weights

w_{1} \leftarrow 0.6 + 0.1 \times (- 1) \times 1 = 0.5

w_{2} \leftarrow - 0.4 + 0.1 \times (- 1) \times 0.5 = - 0.45

Step 3: Update bias

b \leftarrow - 0.1 + 0.1 \times (- 1) = - 0.2

Result: New weights $w = [0.5, - 0.45]$ , $b = - 0.2$ . The decision boundary shifted to reduce the score for this misclassified sample. Recheck: $0.5 (1) + (- 0.45) (0.5) + (- 0.2) = 0.5 - 0.225 - 0.2 = 0.075$ --- still positive, so more iterations needed.
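The update can be reproduced with a small helper. This is a sketch only; `perceptron_update` is a hypothetical function name introduced for illustration, and it applies one step of the learning rule to a single sample.

```python
import numpy as np

def perceptron_update(w, b, x, y, lr):
    """Apply one perceptron learning-rule step; returns the new (w, b)."""
    y_hat = 1 if np.dot(w, x) + b > 0 else 0
    error = y - y_hat
    return w + lr * error * x, b + lr * error

x = np.array([1.0, 0.5])
w, b = np.array([0.6, -0.4]), -0.1

w_new, b_new = perceptron_update(w, b, x, y=0, lr=0.1)
score_after = np.dot(w_new, x) + b_new   # recheck the score with the new weights

print(w_new, b_new, score_after)
```

The score drops from 0.3 to 0.075, so the sample is still misclassified and further updates are needed.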

The XOR Problem

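A single perceptron can only learn linearly separable functions, because its decision boundary is a single hyperplane. It cannot learn XOR: no straight line separates the inputs $(0, 1)$ and $(1, 0)$ (output 1) from $(0, 0)$ and $(1, 1)$ (output 0). This limitation drove the development of multi-layer networks.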
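A short experiment makes the limitation concrete. The sketch below trains a perceptron on XOR with the learning rule above; the learning rate and epoch count are arbitrary choices. XOR would require $b \leq 0$ , $w_{1} + b > 0$ , $w_{2} + b > 0$ and $w_{1} + w_{2} + b \leq 0$ at the same time, which is contradictory, so at most three of the four points can ever be classified correctly.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):                       # far more epochs than AND needs
    for xi, yi in zip(X, y_xor):
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        w += lr * (yi - pred) * xi
        b += lr * (yi - pred)

preds = np.array([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])
n_correct = int(np.sum(preds == y_xor))
print(f"XOR: {n_correct}/4 correct")        # a linear boundary cannot reach 4/4
```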

Perceptron in Python

python

import numpy as np

class Perceptron:
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b > 0 else 0

    def train(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.predict(xi)
                error = yi - pred
                self.w += self.lr * error * xi
                self.b += self.lr * error

# AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(2)
p.train(X, y)
print([p.predict(xi) for xi in X])  # [0, 0, 0, 1]

Multi-Layer Perceptron (MLP)

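An MLP stacks multiple layers of neurons. Each neuron in layer $l$ receives input from every neuron in layer $l - 1$ (fully connected). With nonlinear activations between layers and enough hidden units, an MLP can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy (the universal approximation theorem).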

Architecture

For an MLP with $L$ layers:

a^{(0)} = x (input)

z^{(l)} = W^{(l)} a^{(l - 1)} + b^{(l)} (pre-activation)

a^{(l)} = σ^{(l)} (z^{(l)}) (activation)

\hat{y} = a^{(L)} (output)

where $W^{(l)} \in \mathbb{R}^{n_{l} \times n_{l - 1}}$ is the weight matrix for layer $l$ , $b^{(l)} \in \mathbb{R}^{n_{l}}$ is the bias, and $σ^{(l)}$ is the activation function.

Worked Example — MLP Forward Pass (2-layer)

Input: $x = [1.0, 0.5]$ (2 features), hidden layer size 2, output size 2

Weights:

W^{(1)} = [\begin{matrix} 0.3 & 0.7 \\ - 0.2 & 0.4 \end{matrix}], b^{(1)} = [\begin{matrix} 0.1 \\ - 0.1 \end{matrix}]

W^{(2)} = [\begin{matrix} 0.5 & - 0.3 \\ 0.2 & 0.6 \end{matrix}], b^{(2)} = [\begin{matrix} 0.0 \\ 0.0 \end{matrix}]

Step 1: Pre-activation layer 1

z^{(1)} = W^{(1)} x + b^{(1)} = [\begin{matrix} 0.3 (1) + 0.7 (0.5) + 0.1 \\ - 0.2 (1) + 0.4 (0.5) - 0.1 \end{matrix}] = [\begin{matrix} 0.75 \\ - 0.1 \end{matrix}]

Step 2: Apply ReLU

a^{(1)} = ReLU (z^{(1)}) = [\begin{matrix} 0.75 \\ 0 \end{matrix}]

The second neuron is inactive for this input (its negative pre-activation is zeroed by ReLU).

Step 3: Pre-activation layer 2

z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} = [\begin{matrix} 0.5 (0.75) + (- 0.3) (0) \\ 0.2 (0.75) + 0.6 (0) \end{matrix}] = [\begin{matrix} 0.375 \\ 0.15 \end{matrix}]

Step 4: Softmax output

\hat{y} = softmax ([0.375, 0.15]) = [\frac{e^{0.375}}{e^{0.375} + e^{0.15}}, \frac{e^{0.15}}{e^{0.375} + e^{0.15}}] = [0.556, 0.444]

Result: The network predicts class 0 with 55.6% probability. Only one hidden neuron contributed because the other had a negative pre-activation and was zeroed by ReLU.
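The forward pass of this worked example can be verified numerically. The sketch below uses the column-vector convention of the equations ( $z = W a + b$ ); the helper names `relu` and `softmax` are defined here for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

x = np.array([1.0, 0.5])
W1 = np.array([[0.3, 0.7], [-0.2, 0.4]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.5, -0.3], [0.2, 0.6]])
b2 = np.zeros(2)

z1 = W1 @ x + b1        # pre-activation, layer 1
a1 = relu(z1)           # hidden activation
z2 = W2 @ a1 + b2       # pre-activation, layer 2
y_hat = softmax(z2)     # class probabilities

print(np.round(z1, 3), np.round(a1, 3), np.round(z2, 3), np.round(y_hat, 3))
```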

Activation Functions

Activation functions introduce nonlinearity. Without them, stacking linear layers collapses to a single linear transformation: $W_{2} (W_{1} x + b_{1}) + b_{2} = W^{'} x + b^{'}$ .
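The collapse can be checked directly: with $W^{'} = W_{2} W_{1}$ and $b^{'} = W_{2} b_{1} + b_{2}$ , two stacked linear layers give the same output as one. The sketch below uses random matrices with arbitrary, illustrative shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layers, one_layer))
```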

Sigmoid

σ (x) = \frac{1}{1 + e^{- x}}

Derivative:

σ^{'} (x) = σ (x) (1 - σ (x))

Derivation: Let $σ = (1 + e^{- x})^{- 1}$ . By the chain rule:

\frac{d σ}{d x} = - 1 \cdot (1 + e^{- x})^{- 2} \cdot (- e^{- x}) = \frac{e^{- x}}{(1 + e^{- x})^{2}}

Since $e^{- x} = \frac{1}{σ} - 1 = \frac{1 - σ}{σ}$ :

σ^{'} = \frac{\frac{1 - σ}{σ}}{\frac{1}{σ^{2}}} = σ (1 - σ)

Properties: Output in $(0, 1)$ . Suffers from vanishing gradients --- when $| x |$ is large, $σ^{'} \approx 0$ , which kills gradient flow. Used mainly in output layers for binary classification.

Worked Example — Sigmoid Activation and Derivative

Input values: $x = [- 2, 0, 0.75, 3]$

Step 1: Compute $σ (x) = \frac{1}{1 + e^{- x}}$

$x$	$e^{- x}$	$σ (x)$
-2	7.389	$1 / 8.389 = 0.119$
0	1.000	$1 / 2 = 0.500$
0.75	0.472	$1 / 1.472 = 0.679$
3	0.050	$1 / 1.050 = 0.953$

Step 2: Compute derivative $σ^{'} (x) = σ (x) (1 - σ (x))$

$x$	$σ (x)$	$σ^{'} (x)$
-2	0.119	$0.119 \times 0.881 = 0.105$
0	0.500	$0.500 \times 0.500 = 0.250$
0.75	0.679	$0.679 \times 0.321 = 0.218$
3	0.953	$0.953 \times 0.047 = 0.045$

Result: The derivative peaks at $x = 0$ (max value 0.25) and decays toward zero for large $| x |$ . At $x = 3$ , the gradient is only 0.045 --- this is why sigmoid causes vanishing gradients in deep networks.
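The table values, and the derivative formula itself, can be checked against a central finite difference. This is a minimal sketch; the step size `eps` is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 0.75, 3.0])

# Central finite difference as an independent check of the formula
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
max_diff = np.max(np.abs(sigmoid_prime(x) - numeric))

print(np.round(sigmoid(x), 3))
print(np.round(sigmoid_prime(x), 3))
print(max_diff)
```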

Tanh

\tanh (x) = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}

Derivative:

\tanh^{'} (x) = 1 - \tanh^{2} (x)

Derivation: Using the quotient rule on $\tanh = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}$ :

\tanh^{'} = \frac{(e^{x} + e^{- x})^{2} - (e^{x} - e^{- x})^{2}}{(e^{x} + e^{- x})^{2}} = 1 - \tanh^{2} (x)

Properties: Output in $(- 1, 1)$ , zero-centered (unlike sigmoid). Still suffers from vanishing gradients at extremes. Common in RNN hidden states.
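As with sigmoid, the derivative identity can be checked numerically, and the saturation at the extremes is easy to see. The sketch below assumes nothing beyond NumPy's `np.tanh`; the grid of test points is arbitrary.

```python
import numpy as np

x = np.linspace(-4, 4, 9)                 # -4, -3, ..., 4
analytic = 1.0 - np.tanh(x) ** 2          # tanh'(x) = 1 - tanh^2(x)

eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
max_diff = np.max(np.abs(analytic - numeric))

print(np.round(analytic, 4))              # peaks at 1 for x = 0, near 0 at the ends
print(max_diff)
```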

ReLU (Rectified Linear Unit)

ReLU (x) = max (0, x)

Derivative:

{ReLU}^{'} (x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}

(Undefined at $x = 0$ ; conventionally set to 0 or 1.)

Properties: No vanishing gradient for positive inputs. Sparse activations (many neurons output 0). Computationally efficient. Suffers from "dying ReLU": a neuron whose pre-activation is negative for every training input outputs 0 and receives zero gradient, so its weights stop updating and it may never recover.

Leaky ReLU

LeakyReLU (x) = \begin{cases} x & \text{if } x > 0 \\ α x & \text{if } x \leq 0 \end{cases}

where $α$ is typically 0.01. Fixes the dying ReLU problem by allowing a small gradient for negative inputs.
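The difference between the two is clearest in their gradients. The sketch below implements both; the function names are illustrative, and the ReLU gradient at $x = 0$ is set to 0 by convention.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)           # convention: gradient 0 at x = 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)     # small but nonzero for x <= 0

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```

For negative inputs the ReLU gradient is exactly zero, while Leaky ReLU still passes a gradient of $α$ .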

GELU (Gaussian Error Linear Unit)

GELU (x) = x \cdot Φ (x) = x \cdot \frac{1}{2} [1 + erf (\frac{x}{\sqrt{2}})]

Approximate form:

GELU (x) \approx 0.5 x (1 + \tanh [\sqrt{\frac{2}{π}} (x + 0.044715 x^{3})])

Properties: Smooth, non-monotonic. Used in BERT, GPT, and many other transformer models. Combines the benefits of ReLU (non-saturating for large positive $x$ ) with a smooth transition near zero.
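The two forms can be compared numerically. The sketch below uses `math.erf` from the standard library for the exact form; the function names `gelu_exact` and `gelu_tanh` are illustrative. It also locates the minimum, which shows the non-monotonic dip for negative inputs.

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF
    return np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def gelu_tanh(x):
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))

x = np.linspace(-5, 5, 1001)
exact = gelu_exact(x)
max_abs_err = np.max(np.abs(exact - gelu_tanh(x)))
x_min = x[np.argmin(exact)]                # location of the dip below zero

print(f"max |exact - tanh approx| = {max_abs_err:.2e}")
print(f"GELU minimum near x = {x_min:.2f}, value {exact.min():.3f}")
```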

Softmax (for output layers)

softmax (z_{i}) = \frac{e^{z_{i}}}{\sum_{j = 1}^{K} e^{z_{j}}}

Converts a vector of logits into a probability distribution. Used in the output layer of multi-class classifiers.

Worked Example — Softmax

Input logits: $z = [2.0, 1.0, 0.1]$ (3-class classifier)

Step 1: Compute exponentials

e^{2.0} = 7.389, e^{1.0} = 2.718, e^{0.1} = 1.105

Step 2: Sum

\sum = 7.389 + 2.718 + 1.105 = 11.212

Step 3: Normalize

softmax = [\frac{7.389}{11.212}, \frac{2.718}{11.212}, \frac{1.105}{11.212}] = [0.659, 0.242, 0.099]

Result: The model assigns 65.9% probability to class 0, 24.2% to class 1, and 9.9% to class 2. Probabilities sum to 1.0. The largest logit (2.0) dominates after softmax amplification.
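In code, softmax is usually computed after subtracting the largest logit, which leaves the result unchanged but avoids overflow in `exp`. The sketch below reproduces the worked example and shows that shifting all logits by 1000 gives the same probabilities.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))   # shifting by max(z) does not change the result
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
p_large = softmax([1002.0, 1001.0, 1000.1])   # a naive exp() would overflow here

print(np.round(p, 3), p.sum())
print(np.allclose(p, p_large))
```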

Activation Function Comparison

python

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Sigmoid
axes[0, 0].plot(x, 1 / (1 + np.exp(-x)), 'b-', linewidth=2)
axes[0, 0].set_title('Sigmoid')
axes[0, 0].axhline(y=0, color='k', linewidth=0.5)
axes[0, 0].axvline(x=0, color='k', linewidth=0.5)
axes[0, 0].grid(True, alpha=0.3)

# Tanh
axes[0, 1].plot(x, np.tanh(x), 'r-', linewidth=2)
axes[0, 1].set_title('Tanh')
axes[0, 1].axhline(y=0, color='k', linewidth=0.5)
axes[0, 1].axvline(x=0, color='k', linewidth=0.5)
axes[0, 1].grid(True, alpha=0.3)

# ReLU
axes[0, 2].plot(x, np.maximum(0, x), 'g-', linewidth=2)
axes[0, 2].set_title('ReLU')
axes[0, 2].axhline(y=0, color='k', linewidth=0.5)
axes[0, 2].axvline(x=0, color='k', linewidth=0.5)
axes[0, 2].grid(True, alpha=0.3)

# Leaky ReLU
axes[1, 0].plot(x, np.where(x > 0, x, 0.01 * x), 'm-', linewidth=2)
axes[1, 0].set_title('Leaky ReLU (α=0.01)')
axes[1, 0].axhline(y=0, color='k', linewidth=0.5)
axes[1, 0].axvline(x=0, color='k', linewidth=0.5)
axes[1, 0].grid(True, alpha=0.3)

# GELU
from scipy.special import erf
gelu = 0.5 * x * (1 + erf(x / np.sqrt(2)))
axes[1, 1].plot(x, gelu, 'c-', linewidth=2)
axes[1, 1].set_title('GELU')
axes[1, 1].axhline(y=0, color='k', linewidth=0.5)
axes[1, 1].axvline(x=0, color='k', linewidth=0.5)
axes[1, 1].grid(True, alpha=0.3)

# Derivatives comparison
sig = 1 / (1 + np.exp(-x))
axes[1, 2].plot(x, sig * (1 - sig), 'b--', label="Sigmoid'", linewidth=2)
axes[1, 2].plot(x, 1 - np.tanh(x)**2, 'r--', label="Tanh'", linewidth=2)
axes[1, 2].plot(x, np.where(x > 0, 1, 0).astype(float), 'g--', label="ReLU'", linewidth=2)
axes[1, 2].set_title('Derivatives')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('activations.png', dpi=150)

The Forward Pass

The forward pass computes the output of the network layer by layer.

For a 3-layer MLP with input $x$ , hidden sizes $h_{1}, h_{2}$ , and output size $K$ :

z^{(1)} = W^{(1)} x + b^{(1)}, a^{(1)} = ReLU (z^{(1)})

z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, a^{(2)} = ReLU (z^{(2)})

z^{(3)} = W^{(3)} a^{(2)} + b^{(3)}, \hat{y} = softmax (z^{(3)})

Each step is a matrix multiplication followed by an element-wise nonlinearity. The forward pass stores all intermediate values ( $z^{(l)}$ and $a^{(l)}$ ) because backpropagation needs them.
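A forward pass that caches its intermediates might look like the sketch below. The `forward` function, the `cache` dictionary layout, and the layer sizes are all illustrative assumptions; it follows the column-vector convention of the equations above.

```python
import numpy as np

def forward(x, params):
    """Run the forward pass and cache every z and a for backpropagation.

    params is a list of (W, b) pairs; ReLU on hidden layers, softmax on the last.
    """
    cache = {"a": [x], "z": []}
    a = x
    for i, (W, b) in enumerate(params):
        z = W @ a + b
        cache["z"].append(z)
        if i < len(params) - 1:
            a = np.maximum(0.0, z)          # ReLU on hidden layers
        else:
            e = np.exp(z - np.max(z))       # softmax on the output layer
            a = e / e.sum()
        cache["a"].append(a)
    return a, cache

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                        # input, h1, h2, K (hypothetical sizes)
params = [(rng.normal(size=(n_out, n_in)) * 0.5, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

y_hat, cache = forward(rng.normal(size=4), params)
print([z.shape for z in cache["z"]], y_hat.sum())
```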

Loss Functions

The loss function measures how far the prediction $\hat{y}$ is from the true label $y$ . Training minimizes this loss.

Mean Squared Error (MSE)

L_{MSE} = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - {\hat{y}}_{i})^{2}

Used for regression. Derivative: $\frac{\partial L}{\partial {\hat{y}}_{i}} = \frac{2}{N} ({\hat{y}}_{i} - y_{i})$ .

Cross-Entropy Loss

For classification with $K$ classes:

L_{CE} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} y_{i k} \log ({\hat{y}}_{i k})

With one-hot labels ( $y_{i k} = 1$ only for the correct class $c_{i}$ of sample $i$ ):

L_{CE} = - \frac{1}{N} \sum_{i = 1}^{N} \log ({\hat{y}}_{i, c_{i}})

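Why cross-entropy, not MSE, for classification? With a sigmoid or softmax output, the MSE gradient with respect to the logits is multiplied by the activation's derivative, which is close to zero when the output saturates near 0 or 1, so a confidently wrong prediction learns very slowly. Cross-entropy cancels that factor: its gradient with respect to the logits is simply $\hat{y} - y$ , which stays large when the prediction is wrong.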

Worked Example — MSE and Cross-Entropy Loss

Setup: 3-class problem, true label = class 1 (one-hot: $y = [0, 1, 0]$ )

Prediction (softmax output): $\hat{y} = [0.1, 0.7, 0.2]$

MSE Loss:

L_{MSE} = \frac{1}{3} [(0 - 0.1)^{2} + (1 - 0.7)^{2} + (0 - 0.2)^{2}] = \frac{0.01 + 0.09 + 0.04}{3} = 0.0467

Cross-Entropy Loss:

L_{CE} = - [0 \cdot \log (0.1) + 1 \cdot \log (0.7) + 0 \cdot \log (0.2)] = - \log (0.7) = 0.357

Now consider a bad prediction: $\hat{y} = [0.1, 0.1, 0.8]$ (confident but wrong)

MSE: $\frac{(0.01 + 0.81 + 0.64)}{3} = 0.487$

CE: $- \log (0.1) = 2.303$

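Result: Moving from the good prediction to the confidently wrong one, cross-entropy rises from 0.357 to 2.303 (about 6.5x, $2.303 / 0.357$ ) and MSE rises from 0.0467 to 0.487 (about 10.4x, $0.487 / 0.0467$ ). The ratios alone do not favor cross-entropy. The difference is in scale: cross-entropy grows without bound as the probability assigned to the correct class approaches 0, while MSE in this 3-class setup can never exceed $2 / 3$ . That unbounded penalty, and the correspondingly large gradient, is what lets cross-entropy correct confident mistakes quickly.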
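The four loss values in this example can be recomputed with the sketch below. The helper names `mse` and `cross_entropy` are illustrative, and both operate on a single sample.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])             # true class is 1
good = np.array([0.1, 0.7, 0.2])
bad = np.array([0.1, 0.1, 0.8])           # confident but wrong

for name, pred in [("good", good), ("bad", bad)]:
    print(f"{name}: MSE = {mse(y, pred):.4f}, CE = {cross_entropy(y, pred):.3f}")
```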

Binary Cross-Entropy

L_{BCE} = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log ({\hat{y}}_{i}) + (1 - y_{i}) \log (1 - {\hat{y}}_{i})]
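A direct implementation of this formula is sketched below. The clipping constant `eps` is an assumption added to guard against $\log (0)$ ; it is not part of the formula itself, and the sample predictions are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = binary_cross_entropy(y, mostly_right)
loss_wrong = binary_cross_entropy(y, mostly_wrong)
print(f"{loss_right:.3f} vs {loss_wrong:.3f}")
```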

Backpropagation: Full Derivation

Backpropagation is just the chain rule applied systematically from the loss back through the network.

Setup

Consider a single training example with a 2-layer network:

z^{(1)} = W^{(1)} x + b^{(1)}, a^{(1)} = σ (z^{(1)})

z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \hat{y} = softmax (z^{(2)})

L = - \sum_{k} y_{k} \log ({\hat{y}}_{k})

Step 1: Output Layer Gradient

We need $\frac{\partial L}{\partial z^{(2)}}$ . For softmax + cross-entropy, this simplifies beautifully:

\frac{\partial L}{\partial z_{k}^{(2)}} = {\hat{y}}_{k} - y_{k}

Full derivation: The softmax output is ${\hat{y}}_{k} = \frac{e^{z_{k}}}{\sum_{j} e^{z_{j}}}$ . The cross-entropy is $L = - \sum_{k} y_{k} \log {\hat{y}}_{k}$ .

For $\frac{\partial {\hat{y}}_{k}}{\partial z_{i}}$ , we have two cases:

When $i = k$ :

\frac{\partial {\hat{y}}_{k}}{\partial z_{k}} = {\hat{y}}_{k} (1 - {\hat{y}}_{k})

When $i \neq k$ :

\frac{\partial {\hat{y}}_{k}}{\partial z_{i}} = - {\hat{y}}_{k} {\hat{y}}_{i}

Combining with the cross-entropy derivative $\frac{\partial L}{\partial {\hat{y}}_{k}} = - \frac{y_{k}}{{\hat{y}}_{k}}$ :

\frac{\partial L}{\partial z_{i}} = \sum_{k} \frac{\partial L}{\partial {\hat{y}}_{k}} \frac{\partial {\hat{y}}_{k}}{\partial z_{i}} = - \sum_{k} \frac{y_{k}}{{\hat{y}}_{k}} \frac{\partial {\hat{y}}_{k}}{\partial z_{i}}

= - y_{i} (1 - {\hat{y}}_{i}) + \sum_{k \neq i} y_{k} {\hat{y}}_{i} = - y_{i} + y_{i} {\hat{y}}_{i} + {\hat{y}}_{i} \sum_{k \neq i} y_{k}

= - y_{i} + {\hat{y}}_{i} \sum_{k} y_{k} = {\hat{y}}_{i} - y_{i}

(using $\sum_{k} y_{k} = 1$ for one-hot labels).

Let $δ^{(2)} = \hat{y} - y$ .

Worked Example — Backpropagation Output Layer Gradient

Setup: 2-class output, true label = class 0 ( $y = [1, 0]$ ), softmax output $\hat{y} = [0.6, 0.4]$

Step 1: Compute $δ^{(2)} = \hat{y} - y$

δ^{(2)} = [0.6 - 1, 0.4 - 0] = [- 0.4, 0.4]

Result: The gradient for class 0 is $- 0.4$ (negative because prediction was too low --- push it up). The gradient for class 1 is $+ 0.4$ (positive because prediction was too high --- push it down). This elegantly simple gradient is why softmax + cross-entropy is the standard pairing.
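The identity $\frac{\partial L}{\partial z} = \hat{y} - y$ can be confirmed with finite differences. The sketch below uses arbitrary logits chosen for illustration and compares the analytical gradient with a central-difference estimate.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])      # arbitrary logits (illustrative)
y = np.array([0.0, 0.0, 1.0])       # one-hot label

analytic = softmax(z) - y

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```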

Step 2: Weight Gradients (Output Layer)

\frac{\partial L}{\partial W^{(2)}} = δ^{(2)} (a^{(1)})^{T}

\frac{\partial L}{\partial b^{(2)}} = δ^{(2)}

Step 3: Propagate to Hidden Layer

δ^{(1)} = [(W^{(2)})^{T} δ^{(2)}] ⊙ σ^{'} (z^{(1)})

where $⊙$ is element-wise multiplication and $σ^{'}$ is the derivative of the activation function.

Step 4: Weight Gradients (Hidden Layer)

\frac{\partial L}{\partial W^{(1)}} = δ^{(1)} x^{T}

\frac{\partial L}{\partial b^{(1)}} = δ^{(1)}

General Backpropagation Rule

For layer $l$ :

δ^{(l)} = [(W^{(l + 1)})^{T} δ^{(l + 1)}] ⊙ σ^{'} (z^{(l)})

\frac{\partial L}{\partial W^{(l)}} = δ^{(l)} (a^{(l - 1)})^{T}

\frac{\partial L}{\partial b^{(l)}} = δ^{(l)}

This is why we store $z^{(l)}$ and $a^{(l)}$ during the forward pass --- we need them to compute the gradients.

Worked Example — Full Backprop Through a 2-Layer Network

Setup: Input $x = [1.0, 0.5]$ , true label $y = [1, 0]$ (class 0), sigmoid activation, learning rate $η = 0.5$

$W^{(1)} = [\begin{matrix} 0.3 & 0.7 \\ - 0.2 & 0.4 \end{matrix}]$ , $b^{(1)} = [0, 0]$ , $W^{(2)} = [\begin{matrix} 0.5 & - 0.3 \\ 0.2 & 0.6 \end{matrix}]$ , $b^{(2)} = [0, 0]$

Forward pass:

$z^{(1)} = [0.3 (1) + 0.7 (0.5), - 0.2 (1) + 0.4 (0.5)] = [0.65, 0.0]$
$a^{(1)} = σ ([0.65, 0.0]) = [0.657, 0.500]$
$z^{(2)} = [0.5 (0.657) + (- 0.3) (0.5), 0.2 (0.657) + 0.6 (0.5)] = [0.179, 0.431]$
$\hat{y} = softmax ([0.179, 0.431]) = [0.437, 0.563]$

Backprop Step 1: Output gradient

δ^{(2)} = \hat{y} - y = [0.437 - 1, 0.563 - 0] = [- 0.563, 0.563]

Backprop Step 2: Weight gradients (output layer)

\frac{\partial L}{\partial W^{(2)}} = δ^{(2)} (a^{(1)})^{T} = [\begin{matrix} - 0.563 \\ 0.563 \end{matrix}] [\begin{matrix} 0.657 & 0.500 \end{matrix}] = [\begin{matrix} - 0.370 & - 0.281 \\ 0.370 & 0.281 \end{matrix}]

Backprop Step 3: Propagate to hidden layer

δ^{(1)} = [(W^{(2)})^{T} δ^{(2)}] ⊙ σ^{'} (z^{(1)})

(W^{(2)})^{T} δ^{(2)} = [\begin{matrix} 0.5 & 0.2 \\ - 0.3 & 0.6 \end{matrix}] [\begin{matrix} - 0.563 \\ 0.563 \end{matrix}] = [\begin{matrix} - 0.169 \\ 0.507 \end{matrix}]

σ^{'} (z^{(1)}) = [0.657 (1 - 0.657), 0.5 (1 - 0.5)] = [0.225, 0.250]

δ^{(1)} = [- 0.169 \times 0.225, 0.507 \times 0.250] = [- 0.038, 0.127]

Backprop Step 4: Update $W^{(2)}$

W_{new}^{(2)} = W^{(2)} - 0.5 \times \frac{\partial L}{\partial W^{(2)}} = [\begin{matrix} 0.5 + 0.185 & - 0.3 + 0.141 \\ 0.2 - 0.185 & 0.6 - 0.141 \end{matrix}] = [\begin{matrix} 0.685 & - 0.159 \\ 0.015 & 0.459 \end{matrix}]

Result: The weight connecting the active hidden neuron (0.657) to the correct output class increased from 0.5 to 0.685, strengthening the path to the correct prediction.
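The whole example can be replayed in NumPy. This is a sketch under the example's assumptions (sigmoid hidden layer, softmax output, one sample); because the code keeps full precision, its results may differ from the hand-rounded values in the last decimal place.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x, y, lr = np.array([1.0, 0.5]), np.array([1.0, 0.0]), 0.5
W1 = np.array([[0.3, 0.7], [-0.2, 0.4]])
b1 = np.zeros(2)
W2 = np.array([[0.5, -0.3], [0.2, 0.6]])
b2 = np.zeros(2)

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = softmax(z2)

# Backward pass
delta2 = y_hat - y                             # output-layer error
dW2 = np.outer(delta2, a1)                     # dL/dW2
delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)     # sigmoid'(z1) = a1 * (1 - a1)
dW1 = np.outer(delta1, x)                      # dL/dW1

W2_new = W2 - lr * dW2
print(np.round(y_hat, 3), np.round(delta1, 3))
print(np.round(W2_new, 3))
```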

Gradient Descent Variants

Once we have gradients, we need an optimizer to update the weights.

Vanilla SGD

θ_{t + 1} = θ_{t} - η \nabla_{θ} L

Simple but slow. Oscillates in ravines (dimensions with very different curvatures).

SGD with Momentum

v_{t} = β v_{t - 1} + (1 - β) \nabla_{θ} L

θ_{t + 1} = θ_{t} - η v_{t}

Momentum ( $β \approx 0.9$ ) accumulates past gradients, damping oscillations and accelerating movement along consistent gradient directions. Think of a ball rolling downhill with inertia.
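The two update rules can be run side by side on a small ravine-shaped quadratic. The sketch below is illustrative only: the loss function, starting point, and hyperparameters are assumptions chosen so that both methods are stable, and it is meant to show the update rules rather than to benchmark them.

```python
import numpy as np

CURVATURE = np.array([1.0, 25.0])           # steep in one direction, shallow in the other

def loss(theta):
    return 0.5 * np.sum(CURVATURE * theta**2)

def grad(theta):
    return CURVATURE * theta

eta, beta, steps = 0.05, 0.9, 200
theta_sgd = np.array([5.0, 1.0])
theta_mom = np.array([5.0, 1.0])
v = np.zeros(2)

for _ in range(steps):
    theta_sgd = theta_sgd - eta * grad(theta_sgd)     # vanilla SGD step
    v = beta * v + (1 - beta) * grad(theta_mom)       # running average of gradients
    theta_mom = theta_mom - eta * v                   # momentum step

print(f"SGD loss: {loss(theta_sgd):.2e}, momentum loss: {loss(theta_mom):.2e}")
```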

RMSProp

s_{t} = β s_{t - 1} + (1 - β) g_{t}^{2}

θ_{t + 1} = θ_{t} - \frac{η}{\sqrt{s_{t} + ϵ}} g_{t}

Adapts the learning rate per parameter. Parameters with large gradients get smaller effective learning rates.
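The per-parameter scaling is easy to see with two gradients of very different magnitude. In the sketch below, `rmsprop_step` is a hypothetical helper that follows the formula above, and the gradient values are made up for illustration.

```python
import numpy as np

def rmsprop_step(theta, g, s, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update following the formula above; returns (theta, s)."""
    s = beta * s + (1 - beta) * g**2
    theta = theta - eta / np.sqrt(s + eps) * g
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros(2)
g = np.array([100.0, 0.01])            # gradients differing by four orders of magnitude

theta_new, s = rmsprop_step(theta, g, s)
step = theta - theta_new
print(step)                            # both steps are close to eta / sqrt(1 - beta)
```

Although the raw gradients differ by a factor of 10,000, both parameters move by nearly the same amount on this first step.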

Adam (Adaptive Moment Estimation)

Adam combines momentum with RMSProp:

First moment (mean):

m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}

Second moment (uncentered variance):

v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2}

Bias correction (crucial for early steps when $m$ and $v$ are biased toward zero):

{\hat{m}}_{t} = \frac{m_{t}}{1 - β_{1}^{t}}, {\hat{v}}_{t} = \frac{v_{t}}{1 - β_{2}^{t}}

Parameter update:

θ_{t + 1} = θ_{t} - \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} {\hat{m}}_{t}

Default hyperparameters: $β_{1} = 0.9$ , $β_{2} = 0.999$ , $ϵ = 10^{- 8}$ .

Worked Example — Adam Optimizer (3 Steps)

Setup: Single parameter $θ_{0} = 5.0$ , $η = 0.1$ , $β_{1} = 0.9$ , $β_{2} = 0.999$ , $ϵ = 10^{- 8}$

Gradients: $g_{1} = 2.0$ , $g_{2} = 1.5$ , $g_{3} = 3.0$ , init $m_{0} = 0$ , $v_{0} = 0$

Step 1 ( $t = 1$ , $g_{1} = 2.0$ ):

$m_{1} = 0.9 (0) + 0.1 (2.0) = 0.2$
$v_{1} = 0.999 (0) + 0.001 (4.0) = 0.004$
${\hat{m}}_{1} = 0.2 / (1 - {0.9}^{1}) = 0.2 / 0.1 = 2.0$ (bias correction is huge at $t = 1$ )
${\hat{v}}_{1} = 0.004 / (1 - {0.999}^{1}) = 0.004 / 0.001 = 4.0$
$θ_{1} = 5.0 - 0.1 \times 2.0 / (\sqrt{4.0} + 10^{- 8}) = 5.0 - 0.1 = 4.9$

Step 2 ( $t = 2$ , $g_{2} = 1.5$ ):

$m_{2} = 0.9 (0.2) + 0.1 (1.5) = 0.33$
$v_{2} = 0.999 (0.004) + 0.001 (2.25) = 0.00625$
${\hat{m}}_{2} = 0.33 / (1 - 0.81) = 1.737$
${\hat{v}}_{2} = 0.00625 / (1 - 0.998) = 3.125$
$θ_{2} = 4.9 - 0.1 \times 1.737 / \sqrt{3.125} = 4.9 - 0.098 = 4.802$

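Step 3 ( $t = 3$ , $g_{3} = 3.0$ ):

$m_{3} = 0.9 (0.33) + 0.1 (3.0) = 0.597$
$v_{3} = 0.999 (0.00625) + 0.001 (9.0) = 0.01524$
${\hat{m}}_{3} = 0.597 / (1 - 0.729) = 2.203$
${\hat{v}}_{3} = 0.01524 / (1 - {0.999}^{3}) = 0.01524 / 0.002997 = 5.085$
$θ_{3} = 4.802 - 0.1 \times 2.203 / \sqrt{5.085} = 4.802 - 0.098 = 4.704$

Result: After three steps $θ$ has moved from 5.0 to 4.704. Each step is close to $η = 0.1$ even though the gradients range from 1.5 to 3.0, because Adam divides by the running gradient magnitude; with several parameters, this normalization is what gives each one its own effective learning rate. The bias correction at early steps is critical: without it, $m_{1} = 0.2$ underestimates the gradient by a factor of 10 and $v_{1} = 0.004$ underestimates the squared gradient by a factor of 1000, so the uncorrected ratio $m_{1} / \sqrt{v_{1}} = 0.2 / 0.063 \approx 3.16$ would make the first step roughly three times too large.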
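The same trajectory can be generated in code. The sketch below applies the Adam equations to a scalar parameter for the three gradients of the example; `adam_steps` is a hypothetical helper name.

```python
import numpy as np

def adam_steps(theta, grads, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply Adam to a scalar parameter for each gradient; returns the trajectory."""
    m, v, history = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)          # bias-corrected first moment
        v_hat = v / (1 - beta2**t)          # bias-corrected second moment
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        history.append(theta)
    return history

thetas = adam_steps(5.0, [2.0, 1.5, 3.0])
print(np.round(thetas, 3))
```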

Adam Is the Default

Adam works well out of the box for most problems. Start with Adam and a learning rate of $3 \times 10^{- 4}$ . Switch to SGD+momentum only if you need the last bit of generalization (SGD often finds flatter minima).

AdamW (Adam with Decoupled Weight Decay)

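When weight decay is implemented as an L2 penalty, Adam adds $λ θ$ to the gradient, so the decay is rescaled by the adaptive factor $1 / (\sqrt{{\hat{v}}_{t}} + ϵ)$ and parameters with large gradients are decayed less. AdamW decouples the two by applying the decay directly to the weights: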

θ_{t + 1} = θ_{t} - η (\frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} + λ θ_{t})

where $λ$ is the weight decay coefficient. AdamW is the standard optimizer for transformer training.
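A single AdamW step following the formula above is sketched here. The helper name `adamw_step` and its default values are assumptions for illustration; feeding it a zero gradient isolates the decay term.

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; the decay term bypasses the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta = np.array([1.0, -2.0])
g = np.zeros(2)                        # zero gradient isolates the decay term
theta_new, m, v = adamw_step(theta, g, np.zeros(2), np.zeros(2), t=1)
print(theta_new)                       # each weight shrinks by the factor (1 - eta * lambda)
```

With a zero gradient the adaptive term vanishes and every weight is multiplied by $1 - η λ$ , independent of its gradient history.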

From-Scratch MLP in NumPy: MNIST

Here is a complete MLP trained on MNIST using only NumPy, implementing everything derived above.

python

import numpy as np

# ── Activation functions ─────────────────────────────────────────────
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def softmax(z):
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# ── Loss ─────────────────────────────────────────────────────────────
def cross_entropy_loss(y_pred, y_true):
    """y_true is one-hot, y_pred is softmax output."""
    N = y_true.shape[0]
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    loss = -np.sum(y_true * np.log(y_pred)) / N
    return loss

# ── Weight initialization (He) ───────────────────────────────────────
def he_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

# ── MLP class ────────────────────────────────────────────────────────
class MLP:
    def __init__(self, layer_sizes, lr=0.001):
        """
        layer_sizes: list, e.g., [784, 128, 64, 10]
        """
        self.lr = lr
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            W = he_init(layer_sizes[i], layer_sizes[i + 1])
            b = np.zeros((1, layer_sizes[i + 1]))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass. Store intermediates for backprop."""
        self.activations = [X]
        self.pre_activations = []
        a = X
        for i in range(len(self.weights) - 1):
            z = a @ self.weights[i] + self.biases[i]
            self.pre_activations.append(z)
            a = relu(z)
            self.activations.append(a)
        # Output layer: softmax
        z = a @ self.weights[-1] + self.biases[-1]
        self.pre_activations.append(z)
        a = softmax(z)
        self.activations.append(a)
        return a

    def backward(self, y_true):
        """Backpropagation. Computes gradients and updates weights in place."""
        N = y_true.shape[0]
        n_layers = len(self.weights)

        # Output layer: softmax + cross-entropy gradient
        delta = (self.activations[-1] - y_true) / N

        for i in reversed(range(n_layers)):
            # Weight and bias gradients
            dW = self.activations[i].T @ delta
            db = np.sum(delta, axis=0, keepdims=True)

            # Propagate delta to previous layer (if not input layer)
            if i > 0:
                delta = (delta @ self.weights[i].T) * relu_derivative(
                    self.pre_activations[i - 1]
                )

            # Update weights
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def predict(self, X):
        return np.argmax(self.forward(X), axis=1)

# ── Data loading ─────────────────────────────────────────────────────
def load_mnist():
    """Load MNIST using sklearn (for simplicity)."""
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
    X, y = mnist.data, mnist.target.astype(int)

    # Normalize to [0, 1]
    X = X / 255.0

    # One-hot encode labels
    y_onehot = np.zeros((y.shape[0], 10))
    y_onehot[np.arange(y.shape[0]), y] = 1

    # Split: 60K train, 10K test
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y_onehot[:60000], y_onehot[60000:]
    y_test_labels = y[60000:]

    return X_train, y_train, X_test, y_test, y_test_labels

# ── Training ─────────────────────────────────────────────────────────
def train():
    X_train, y_train, X_test, y_test, y_test_labels = load_mnist()

    model = MLP(layer_sizes=[784, 256, 128, 10], lr=0.001)
    batch_size = 64
    epochs = 20

    for epoch in range(epochs):
        # Shuffle training data
        indices = np.random.permutation(X_train.shape[0])
        X_shuffled = X_train[indices]
        y_shuffled = y_train[indices]

        epoch_loss = 0
        n_batches = 0

        for i in range(0, X_train.shape[0], batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]

            # Forward pass
            y_pred = model.forward(X_batch)
            loss = cross_entropy_loss(y_pred, y_batch)
            epoch_loss += loss
            n_batches += 1

            # Backward pass (updates weights internally)
            model.backward(y_batch)

        # Evaluate
        test_preds = model.predict(X_test)
        accuracy = np.mean(test_preds == y_test_labels)
        print(
            f"Epoch {epoch + 1:2d} | "
            f"Loss: {epoch_loss / n_batches:.4f} | "
            f"Test Accuracy: {accuracy:.4f}"
        )

if __name__ == "__main__":
    train()

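Example output (selected epochs; exact values depend on the random seed and on hyperparameters such as the learning rate):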

Epoch  1 | Loss: 0.5832 | Test Accuracy: 0.9112
Epoch  5 | Loss: 0.1254 | Test Accuracy: 0.9598
Epoch 10 | Loss: 0.0612 | Test Accuracy: 0.9701
Epoch 20 | Loss: 0.0198 | Test Accuracy: 0.9752

From Scratch to Framework

This NumPy implementation performs the same forward and backward computations that a framework such as PyTorch would for this network. The key difference is that PyTorch records the computation graph automatically and computes gradients via autograd, so you never write backward() by hand. See PyTorch Fundamentals for the framework version.

The Computational Graph

Backpropagation works by traversing a directed acyclic graph (DAG) of operations in reverse order. Every operation records its inputs and the local gradient.

Each node stores $\frac{\partial output}{\partial input}$ (local gradient). Backprop multiplies these local gradients along every path from the loss to each parameter and sums the products over all such paths.
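For a concrete picture, the sketch below walks a tiny graph by hand for the scalar loss $L = (w x + b - y)^{2}$ . The example function and its values are assumptions chosen for illustration; each backward line multiplies the upstream gradient by one local gradient.

```python
# Forward pass: record each intermediate node of L = (w * x + b - y)^2
w, x, b, y = 2.0, 3.0, 1.0, 5.0
u = w * x          # node 1
s = u + b          # node 2
r = s - y          # node 3
L = r ** 2         # node 4 (loss)

# Backward pass: multiply local gradients in reverse order
dL_dr = 2 * r              # local gradient of r**2
dL_ds = dL_dr * 1.0        # d(s - y)/ds = 1
dL_du = dL_ds * 1.0        # d(u + b)/du = 1
dL_db = dL_ds * 1.0        # d(u + b)/db = 1
dL_dw = dL_du * x          # d(w * x)/dw = x

print(L, dL_dw, dL_db)
```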

Numerical Gradient Checking

Always verify your analytical gradients with numerical gradients during development:

python

import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of scalar f at float array x (x is restored after each probe)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]

        x[idx] = old_val + eps
        f_plus = f(x)
        x[idx] = old_val - eps
        f_minus = f(x)
        x[idx] = old_val

        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Usage: compare with analytical gradient
# relative_error = |analytical - numerical| / max(|analytical|, |numerical|)
# Should be < 1e-5 for float64
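A complete check might look like the sketch below. It is self-contained, so it uses a compact central-difference helper based on `np.ndindex` rather than the function above, and the loss (a linear layer with squared error) and the array shapes are assumptions chosen because the analytical gradient is easy to write down.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of scalar function f at array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old                       # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x_in, target = rng.normal(size=4), rng.normal(size=3)

def loss_fn(W_):
    return 0.5 * np.sum((W_ @ x_in - target) ** 2)

analytical = np.outer(W @ x_in - target, x_in)     # dL/dW for this loss
numerical = numerical_gradient(loss_fn, W)

rel_error = np.max(np.abs(analytical - numerical)
                   / np.maximum(np.abs(analytical), np.abs(numerical)))
print(f"max relative error: {rel_error:.2e}")
```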

Mini-Batch SGD: Why Batches?

Strategy	Batch Size	Pros	Cons
Full-batch GD	All data	Stable gradient	Slow, needs lots of memory
Stochastic GD	1 sample	Fast updates	Very noisy
Mini-batch SGD	32--512	Balance of speed and stability	Requires batch size tuning

Mini-batch SGD is the standard. Typical batch sizes are powers of 2 (32, 64, 128, 256) for GPU memory alignment.

The gradient of a mini-batch of size $B$ is:

g_{B} = \frac{1}{B} \sum_{i = 1}^{B} \nabla_{θ} L (x_{i}, y_{i})

When the mini-batch is drawn uniformly at random from the training set, this is an unbiased estimate of the full-batch gradient: $\mathbb{E} [g_{B}] = \nabla_{θ} L$ .
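A quick way to see this is to split a dataset into equal-sized batches: the average of the batch gradients equals the full gradient, while individual batches scatter around it. The sketch below uses synthetic linear-regression data; the sizes, coefficients, and noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 512, 64
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(3)

def grad(Xb, yb, theta):
    # Gradient of the loss 1/(2B) * sum (x . theta - y)^2 over the batch
    return Xb.T @ (Xb @ theta - yb) / len(yb)

full = grad(X, y, theta)
batch_grads = np.array([grad(X[i:i + B], y[i:i + B], theta)
                        for i in range(0, N, B)])

print(np.allclose(batch_grads.mean(axis=0), full))   # average over batches = full gradient
print(batch_grads.std(axis=0))                        # spread = mini-batch noise
```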

Common Pitfalls

Mistake	Symptom	Fix
Learning rate too high	Loss explodes or oscillates wildly	Reduce LR by 10x
Learning rate too low	Loss decreases extremely slowly	Increase LR or use scheduler
No input normalization	Training unstable, slow convergence	Normalize to mean 0, std 1
Incorrect shapes	Dimension mismatch errors	Print shapes at every layer
Forgetting bias	Underfitting	Always include bias terms
Not shuffling data	Model learns order, not patterns	Shuffle each epoch
Integer labels passed to a cross-entropy that expects one-hot vectors	Wrong loss values or shape errors	One-hot encode labels

Cross-References

Next step: PyTorch Fundamentals --- do this with autograd instead of manual backprop
Training recipes: Training Techniques --- BatchNorm, dropout, LR scheduling
First architecture: Convolutional Neural Networks --- apply these concepts to images
Mathematical foundations: Deep Learning Overview --- the big picture
Optimization deep dive: Reinforcement Learning --- gradient-based optimization in a different setting
