Neural Network Basics
Everything in deep learning rests on a handful of ideas: neurons that compute weighted sums, nonlinear activations, a forward pass that turns input into prediction, a backward pass that turns error into gradients, and an optimizer that updates weights. This page derives every one of those pieces from first principles, then glues them together into a working MLP trained on MNIST with nothing but NumPy.
The Perceptron
The perceptron is the simplest neural network --- a single neuron that makes binary decisions.
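Given an input vector $\mathbf{x} \in \mathbb{R}^n$, a weight vector $\mathbf{w} \in \mathbb{R}^n$, and a bias $b$, the perceptron outputs
$$\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$
The decision boundary is the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$.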
Worked Example — Perceptron Decision
Input:
| Sample | 1 | 0.5 |
Weights:
Step 1: Compute weighted sum
Step 2: Apply threshold
Result: The perceptron predicts class 1. The sample is on the positive side of the decision boundary.
Perceptron Learning Rule
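The weight update for a misclassified sample is:
$$\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y - \hat{y})\,\mathbf{x}, \qquad b \leftarrow b + \eta\,(y - \hat{y})$$
where $\eta$ is the learning rate, $y$ is the true label, and $\hat{y}$ is the prediction. For a correctly classified sample $y - \hat{y} = 0$, so the weights do not change.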
Worked Example — Perceptron Learning Rule
Input: Sample
Current:
Step 1: Compute error
Step 2: Update weights
Step 3: Update bias
Result: New weights
The XOR Problem
A single perceptron can only learn linearly separable functions. It cannot learn XOR. This limitation drove the development of multi-layer networks.
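The limitation can be observed directly. The sketch below trains a single perceptron on the XOR truth table; the learning rate and epoch count are assumed values chosen only for illustration. Because no line separates the two classes, at least one of the four points is always misclassified.

```python
import numpy as np

# XOR truth table: no single line separates the 1s from the 0s.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

w = np.zeros(2)
b = 0.0
lr = 0.1  # assumed learning rate for this illustration

for epoch in range(1000):
    for xi, yi in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        # Perceptron learning rule: only misclassified samples change w and b.
        w += lr * (yi - pred) * xi
        b += lr * (yi - pred)

preds = np.array([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])
accuracy = np.mean(preds == y)
print(preds, accuracy)  # accuracy stays below 1.0 however long we train
```

Adding a hidden layer with a nonlinear activation removes this restriction.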
Perceptron in Python
import numpy as np


class Perceptron:
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b > 0 else 0

    def train(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.predict(xi)
                error = yi - pred
                self.w += self.lr * error * xi
                self.b += self.lr * error


# AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(2)
p.train(X, y)
print([p.predict(xi) for xi in X])  # [0, 0, 0, 1]

Multi-Layer Perceptron (MLP)
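An MLP stacks multiple layers of neurons. Each neuron in layer $l$ receives the outputs of every neuron in layer $l-1$, so consecutive layers are fully connected.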
Architecture
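For an MLP with $L$ layers, layer $l = 1, \dots, L$ computes
$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = f^{(l)}\big(\mathbf{z}^{(l)}\big)$$
where $\mathbf{a}^{(0)} = \mathbf{x}$ is the input, $W^{(l)}$ and $\mathbf{b}^{(l)}$ are the weights and bias of layer $l$, and $f^{(l)}$ is its activation function.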
Worked Example — MLP Forward Pass (2-layer)
Input:
Weights:
Step 1: Pre-activation layer 1
Step 2: Apply ReLU
The second neuron is "dead" (negative pre-activation zeroed by ReLU).
Step 3: Pre-activation layer 2
Step 4: Softmax output
Result: The network predicts class 0 with 55.6% probability. Only one hidden neuron contributed because the other had a negative pre-activation and was zeroed by ReLU.
Activation Functions
Activation functions introduce nonlinearity. Without them, stacking linear layers collapses to a single linear transformation: $W_2 (W_1 \mathbf{x}) = (W_2 W_1)\,\mathbf{x} = W' \mathbf{x}$.
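A quick numerical check of this collapse, as a minimal sketch with randomly drawn matrices (the shapes and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(5, 4))   # first linear layer
W2 = rng.normal(size=(3, 5))   # second linear layer

two_layers = W2 @ (W1 @ x)     # apply the layers one after the other
one_layer = (W2 @ W1) @ x      # apply the single merged matrix W' = W2 W1

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the network is no longer a single matrix product.
with_relu = W2 @ np.maximum(0, W1 @ x)
```

The merged matrix reproduces the two-layer output exactly (up to floating-point error), which is why depth only adds expressive power when a nonlinearity sits between the layers.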
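Sigmoid
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Derivative:
$$\sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big)$$
Derivation: Let $\sigma(z) = (1 + e^{-z})^{-1}$. By the chain rule,
$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}$$
Since $\frac{e^{-z}}{1 + e^{-z}} = 1 - \sigma(z)$, this gives $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.
Properties: Output in $(0, 1)$. Saturates for large $|z|$, where the derivative approaches 0.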
Worked Example — Sigmoid Activation and Derivative
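Input values: $z \in \{-2,\ 0,\ 0.75,\ 3\}$
Step 1: Compute $\sigma(z) = 1 / (1 + e^{-z})$
| $z$ | $e^{-z}$ | $\sigma(z)$ |
|---|---|---|
| -2 | 7.389 | 0.119 |
| 0 | 1.000 | 0.500 |
| 0.75 | 0.472 | 0.679 |
| 3 | 0.050 | 0.953 |
Step 2: Compute derivative $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
| $z$ | $\sigma(z)$ | $\sigma'(z)$ |
|---|---|---|
| -2 | 0.119 | 0.105 |
| 0 | 0.500 | 0.250 |
| 0.75 | 0.679 | 0.218 |
| 3 | 0.953 | 0.045 |
Result: The derivative peaks at $z = 0$ (value 0.25) and falls toward 0 as $|z|$ grows.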
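The same numbers can be reproduced in a few lines of NumPy. This is a minimal sketch; the function names are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 0.75, 3.0])
s = sigmoid(z)
ds = sigmoid_derivative(z)

print(np.round(s, 3))   # [0.119 0.5   0.679 0.953]
print(np.round(ds, 3))  # [0.105 0.25  0.218 0.045]
```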
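Tanh
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Derivative:
$$\tanh'(z) = 1 - \tanh^2(z)$$
Derivation: Using the quotient rule on $\tanh(z) = \sinh(z) / \cosh(z)$:
$$\tanh'(z) = \frac{\cosh^2(z) - \sinh^2(z)}{\cosh^2(z)} = 1 - \tanh^2(z)$$
Properties: Output in $(-1, 1)$. Zero-centered, unlike sigmoid, but it still saturates for large $|z|$.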
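ReLU (Rectified Linear Unit)
$$\text{ReLU}(z) = \max(0, z)$$
Derivative:
$$\text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$$
(Undefined at $z = 0$; implementations pick a value by convention, commonly 0.)
Properties: No vanishing gradient for positive inputs. Sparse activations (many neurons output 0). Computationally efficient. Suffers from "dying ReLU": a neuron whose pre-activation is negative for every input outputs 0, receives zero gradient, and therefore cannot recover.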
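The dying-ReLU effect can be made concrete with a single neuron. In the sketch below the weights and bias are hypothetical values picked so that every pre-activation is negative; the upstream gradient of 1 per sample is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))           # inputs in [0, 1)
w = np.array([-1.0, -1.0, -1.0])   # hypothetical weights that push z negative
b = -0.1

z = X @ w + b                      # every pre-activation is below zero
a = np.maximum(0, z)               # ReLU output: all zeros
relu_grad = (z > 0).astype(float)  # local derivative: all zeros

upstream = np.ones_like(z)         # pretend dL/da = 1 for every sample
grad_w = X.T @ (upstream * relu_grad)
grad_b = np.sum(upstream * relu_grad)

print(grad_w, grad_b)              # all zeros, so gradient descent never moves w or b
```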
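Leaky ReLU
$$\text{LeakyReLU}(z) = \begin{cases} z & z > 0 \\ \alpha z & z \le 0 \end{cases}$$
where $\alpha$ is a small positive slope (typically 0.01), so negative inputs keep a nonzero gradient.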
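GELU (Gaussian Error Linear Unit)
$$\text{GELU}(z) = z\,\Phi(z) = \frac{z}{2}\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]$$
where $\Phi$ is the standard normal CDF.
Approximate form:
$$\text{GELU}(z) \approx \frac{z}{2}\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\big(z + 0.044715\,z^3\big)\right)\right]$$
Properties: Smooth, non-monotonic. Used in BERT, GPT, and many other transformer models. Combines the benefits of ReLU (non-saturating for large positive $z$) with a smooth, nonzero response for small negative $z$.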
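To see how close the tanh form is to the erf-based definition, the sketch below evaluates both on a grid. It uses `math.erf` from the standard library to avoid a SciPy dependency; the grid range is an arbitrary choice.

```python
import math
import numpy as np

def gelu_exact(z):
    # z * Phi(z), with Phi written via the error function
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # tanh-based approximation
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)
    return 0.5 * z * (1.0 + np.tanh(inner))

z = np.linspace(-5, 5, 1001)
max_abs_diff = np.max(np.abs(gelu_exact(z) - gelu_tanh(z)))
print(max_abs_diff)  # small compared with the scale of the function
```

The exact curve also dips slightly below zero for small negative inputs, which is the non-monotonic behavior mentioned in the properties.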
Softmax (for output layers)
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
Converts a vector of logits into a probability distribution. Used in the output layer of multi-class classifiers.
Worked Example — Softmax
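Input logits: $\mathbf{z} = [2.0,\ 1.0,\ 0.1]$
Step 1: Compute exponentials: $e^{2.0} = 7.389$, $e^{1.0} = 2.718$, $e^{0.1} = 1.105$
Step 2: Sum: $7.389 + 2.718 + 1.105 = 11.212$
Step 3: Normalize: $[7.389,\ 2.718,\ 1.105] / 11.212 = [0.659,\ 0.242,\ 0.099]$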
Result: The model assigns 65.9% probability to class 0, 24.2% to class 1, and 9.9% to class 2. Probabilities sum to 1.0. The largest logit (2.0) dominates after softmax amplification.
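A minimal sketch of the same computation, assuming the logits are [2.0, 1.0, 0.1] (these reproduce the probabilities quoted above). Subtracting the maximum logit before exponentiating is a standard stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)   # softmax is invariant to adding a constant to all logits
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

probs = softmax([2.0, 1.0, 0.1])
print(np.round(probs, 3))     # [0.659 0.242 0.099]

# Without the shift, exp(1000) would overflow; with it the result stays finite.
big = softmax([1000.0, 1001.0, 1002.0])
```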
Activation Function Comparison
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-5, 5, 1000)
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
# Sigmoid
axes[0, 0].plot(x, 1 / (1 + np.exp(-x)), 'b-', linewidth=2)
axes[0, 0].set_title('Sigmoid')
axes[0, 0].axhline(y=0, color='k', linewidth=0.5)
axes[0, 0].axvline(x=0, color='k', linewidth=0.5)
axes[0, 0].grid(True, alpha=0.3)
# Tanh
axes[0, 1].plot(x, np.tanh(x), 'r-', linewidth=2)
axes[0, 1].set_title('Tanh')
axes[0, 1].axhline(y=0, color='k', linewidth=0.5)
axes[0, 1].axvline(x=0, color='k', linewidth=0.5)
axes[0, 1].grid(True, alpha=0.3)
# ReLU
axes[0, 2].plot(x, np.maximum(0, x), 'g-', linewidth=2)
axes[0, 2].set_title('ReLU')
axes[0, 2].axhline(y=0, color='k', linewidth=0.5)
axes[0, 2].axvline(x=0, color='k', linewidth=0.5)
axes[0, 2].grid(True, alpha=0.3)
# Leaky ReLU
axes[1, 0].plot(x, np.where(x > 0, x, 0.01 * x), 'm-', linewidth=2)
axes[1, 0].set_title('Leaky ReLU (α=0.01)')
axes[1, 0].axhline(y=0, color='k', linewidth=0.5)
axes[1, 0].axvline(x=0, color='k', linewidth=0.5)
axes[1, 0].grid(True, alpha=0.3)
# GELU
from scipy.special import erf
gelu = 0.5 * x * (1 + erf(x / np.sqrt(2)))
axes[1, 1].plot(x, gelu, 'c-', linewidth=2)
axes[1, 1].set_title('GELU')
axes[1, 1].axhline(y=0, color='k', linewidth=0.5)
axes[1, 1].axvline(x=0, color='k', linewidth=0.5)
axes[1, 1].grid(True, alpha=0.3)
# Derivatives comparison
sig = 1 / (1 + np.exp(-x))
axes[1, 2].plot(x, sig * (1 - sig), 'b--', label="Sigmoid'", linewidth=2)
axes[1, 2].plot(x, 1 - np.tanh(x)**2, 'r--', label="Tanh'", linewidth=2)
axes[1, 2].plot(x, np.where(x > 0, 1, 0).astype(float), 'g--', label="ReLU'", linewidth=2)
axes[1, 2].set_title('Derivatives')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('activations.png', dpi=150)

The Forward Pass
The forward pass computes the output of the network layer by layer.
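For a 3-layer MLP with input $\mathbf{x}$, hidden activation $f$, and a softmax output:
$$\mathbf{z}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = f(\mathbf{z}^{(1)})$$
$$\mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}, \quad \mathbf{a}^{(2)} = f(\mathbf{z}^{(2)})$$
$$\mathbf{z}^{(3)} = W^{(3)}\mathbf{a}^{(2)} + \mathbf{b}^{(3)}, \quad \hat{\mathbf{y}} = \text{softmax}(\mathbf{z}^{(3)})$$
Each step is a matrix multiplication followed by an element-wise nonlinearity. The forward pass stores all intermediate values ($\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$) because backpropagation reuses them.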
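As a minimal sketch of such a forward pass for one input vector: the layer widths, the ReLU hidden activation, the softmax output, and the `cache` dictionary are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # hypothetical widths: input, two hidden layers, output
weights = [
    rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    """Run the network layer by layer, keeping every z and a for later use."""
    cache = {"a": [x], "z": []}
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        is_output = l == len(weights) - 1
        a = softmax(z) if is_output else relu(z)
        cache["z"].append(z)
        cache["a"].append(a)
    return a, cache

x = rng.normal(size=4)
y_hat, cache = forward(x)
print(y_hat, len(cache["z"]), len(cache["a"]))
```

The cache holds one pre-activation per layer and one activation per layer plus the input, which is exactly what the backward pass needs.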
Loss Functions
The loss function measures how far the prediction $\hat{y}$ is from the true target $y$.
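Mean Squared Error (MSE)
$$L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Used for regression. Derivative:
$$\frac{\partial L_{\text{MSE}}}{\partial \hat{y}_i} = \frac{2}{N}\,(\hat{y}_i - y_i)$$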
Cross-Entropy Loss
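For classification with $C$ classes:
$$L_{\text{CE}} = -\sum_{c} y_c \log \hat{y}_c$$
With one-hot labels ($y_k = 1$ for the true class $k$ and $0$ elsewhere), this simplifies to
$$L_{\text{CE}} = -\log \hat{y}_k$$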
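Why cross-entropy, not MSE, for classification? With a sigmoid or softmax output, the MSE gradient with respect to the logits is scaled by the activation's derivative, which is tiny when the output saturates near 0 or 1, so a confidently wrong prediction learns very slowly. With cross-entropy that factor cancels: the gradient with respect to the logits is $\hat{y} - y$, which is large exactly when the prediction is wrong.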
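The difference shows up clearly for a single sigmoid output that is confidently wrong. In this sketch the logit value is a hypothetical choice, and the MSE is taken as the plain squared error for one sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -5.0   # hypothetical logit: the model is confident in class 0
y = 1.0    # but the true label is class 1

y_hat = sigmoid(z)

# Squared error (y_hat - y)^2: the chain rule brings in sigmoid'(z) = y_hat * (1 - y_hat)
grad_mse = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

# Binary cross-entropy: the sigmoid derivative cancels, leaving y_hat - y
grad_bce = y_hat - y

print(grad_mse, grad_bce)
```

The cross-entropy gradient is close to -1 while the squared-error gradient is roughly two orders of magnitude smaller, so the cross-entropy update corrects the mistake much faster.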
Worked Example — MSE and Cross-Entropy Loss
Setup: 3-class problem, true label = class 1 (one-hot: $[0, 1, 0]$)
Prediction (softmax output):
MSE Loss:
Cross-Entropy Loss:
Now consider a bad prediction:
MSE:
CE:
Result: Cross-entropy penalizes the bad prediction 6.5x more harshly (
Binary Cross-Entropy
$$L_{\text{BCE}} = -\big[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,\big]$$
Backpropagation: Full Derivation
Backpropagation is just the chain rule applied systematically from the loss back through the network.
Setup
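Consider a single training example $(\mathbf{x}, \mathbf{y})$ with a 2-layer network:
$$\mathbf{z}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = f(\mathbf{z}^{(1)})$$
$$\mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}, \quad \hat{\mathbf{y}} = \text{softmax}(\mathbf{z}^{(2)})$$
$$L = -\sum_i y_i \log \hat{y}_i$$
Step 1: Output Layer Gradient
We need $\partial L / \partial \mathbf{z}^{(2)}$. Write $z_j$ for the components of $\mathbf{z}^{(2)}$.
Full derivation: The softmax output is $\hat{y}_i = e^{z_i} / \sum_k e^{z_k}$.
For the softmax Jacobian $\partial \hat{y}_i / \partial z_j$ there are two cases.
When $i = j$: $\dfrac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i\,(1 - \hat{y}_i)$
When $i \neq j$: $\dfrac{\partial \hat{y}_i}{\partial z_j} = -\hat{y}_i\,\hat{y}_j$
Combining with the cross-entropy derivative $\partial L / \partial \hat{y}_i = -y_i / \hat{y}_i$:
$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial z_j} = -y_j\,(1 - \hat{y}_j) + \sum_{i \neq j} y_i\,\hat{y}_j = \hat{y}_j \sum_i y_i - y_j = \hat{y}_j - y_j$$
(using $\sum_i y_i = 1$)
Let $\delta^{(2)} = \partial L / \partial \mathbf{z}^{(2)} = \hat{\mathbf{y}} - \mathbf{y}$.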
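The result of this derivation, that the gradient with respect to the logits is the prediction minus the one-hot label, can be checked numerically. The sketch below builds the softmax Jacobian explicitly, applies the chain rule, and compares against both the shortcut and a finite-difference estimate; the logits and label are hypothetical values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.5, -0.3, 0.8])   # hypothetical logits
y = np.array([0.0, 1.0, 0.0])    # one-hot label

y_hat = softmax(z)

# Softmax Jacobian: J[i, j] = d y_hat_i / d z_j
# diagonal entries y_hat_i * (1 - y_hat_i), off-diagonal entries -y_hat_i * y_hat_j
J = np.diag(y_hat) - np.outer(y_hat, y_hat)

dL_dyhat = -y / y_hat            # cross-entropy derivative
delta_chain = J.T @ dL_dyhat     # chain rule: sum over i of dL/dy_hat_i * J[i, j]
delta_shortcut = y_hat - y       # closed form

def loss(z):
    return -np.sum(y * np.log(softmax(z)))

eps = 1e-6
delta_numeric = np.array([
    (loss(z + eps * e_j) - loss(z - eps * e_j)) / (2 * eps)
    for e_j in np.eye(3)
])

print(delta_chain, delta_shortcut, delta_numeric)
```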
Worked Example — Backpropagation Output Layer Gradient
Setup: 2-class output, true label = class 0 (
Step 1: Compute
Result: The gradient for class 0 is
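Step 2: Weight Gradients (Output Layer)
With $\delta^{(2)} = \partial L / \partial \mathbf{z}^{(2)}$ denoting the output-layer gradient from Step 1:
$$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)}\,(\mathbf{a}^{(1)})^\top, \qquad \frac{\partial L}{\partial \mathbf{b}^{(2)}} = \delta^{(2)}$$
Step 3: Propagate to Hidden Layer
$$\delta^{(1)} = \big((W^{(2)})^\top \delta^{(2)}\big) \odot f'(\mathbf{z}^{(1)})$$
where $\odot$ is element-wise multiplication and $f'$ is the derivative of the hidden activation.
Step 4: Weight Gradients (Hidden Layer)
$$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)}\,\mathbf{x}^\top, \qquad \frac{\partial L}{\partial \mathbf{b}^{(1)}} = \delta^{(1)}$$
General Backpropagation Rule
For layer $l$, using the column-vector convention $\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$:
$$\delta^{(l)} = \big((W^{(l+1)})^\top \delta^{(l+1)}\big) \odot f'(\mathbf{z}^{(l)}), \qquad \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)}\,(\mathbf{a}^{(l-1)})^\top, \qquad \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}$$
This is why we store $\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$ during the forward pass: every gradient reuses them.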
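The four steps can be sketched for one training example as follows. This is an illustration under stated assumptions: a tanh hidden layer (chosen because it is smooth, which keeps the finite-difference check reliable), a softmax output with cross-entropy loss, column-vector shapes, and arbitrary sizes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = np.array([1.0, 0.0])                       # one-hot label
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # output layer

def loss_fn(W1, b1, W2, b2):
    a1 = np.tanh(W1 @ x + b1)
    y_hat = softmax(W2 @ a1 + b2)
    return -np.sum(y * np.log(y_hat))

# Forward pass, keeping the intermediates
z1 = W1 @ x + b1
a1 = np.tanh(z1)
y_hat = softmax(W2 @ a1 + b2)

# Backward pass
delta2 = y_hat - y                        # Step 1: output gradient
dW2 = np.outer(delta2, a1)                # Step 2: output-layer weights
db2 = delta2
delta1 = (W2.T @ delta2) * (1.0 - a1**2)  # Step 3: tanh'(z1) = 1 - tanh(z1)^2
dW1 = np.outer(delta1, x)                 # Step 4: hidden-layer weights
db1 = delta1

# Finite-difference check on one entry of W1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2) - loss_fn(W1_minus, b1, W2, b2)) / (2 * eps)

print(dW1[2, 1], numeric)  # the two values should agree closely
```

Each gradient has the same shape as the parameter it belongs to, which is a useful sanity check when writing backprop by hand.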
Worked Example — Full Backprop Through a 2-Layer Network
Setup: Input
Forward pass:
Backprop Step 1: Output gradient
Backprop Step 2: Weight gradients (output layer)
Backprop Step 3: Propagate to hidden layer
Backprop Step 4: Update
Result: The weight connecting the active hidden neuron (0.657) to the correct output class increased from 0.5 to 0.685, strengthening the path to the correct prediction.
Gradient Descent Variants
Once we have gradients, we need an optimizer to update the weights.
Vanilla SGD
$$\theta \leftarrow \theta - \eta\,\nabla_\theta L$$
Simple but slow. Oscillates in ravines (regions where the curvature differs sharply between dimensions).
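SGD with Momentum
$$\mathbf{v}_t = \beta\,\mathbf{v}_{t-1} + \nabla_\theta L, \qquad \theta \leftarrow \theta - \eta\,\mathbf{v}_t$$
Momentum ($\beta$, typically 0.9) accumulates a decaying sum of past gradients, which damps oscillations across a ravine and speeds up progress along it.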
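A small experiment illustrates the ravine behavior. The sketch below minimizes a hypothetical quadratic whose curvature is 100 times larger in one direction than the other, once with plain SGD and once with a heavy-ball momentum update; the learning rate, momentum coefficient, and step count are assumed values, and other momentum formulations differ by constant factors.

```python
import numpy as np

curvature = np.array([1.0, 100.0])   # hypothetical ravine: shallow in dim 0, steep in dim 1

def loss(p):
    return 0.5 * np.sum(curvature * p**2)

def grad(p):
    return curvature * p

lr, beta, steps = 0.015, 0.9, 100    # assumed hyperparameters

# Plain gradient descent
p_sgd = np.array([1.0, 1.0])
for _ in range(steps):
    p_sgd = p_sgd - lr * grad(p_sgd)

# Gradient descent with momentum
p_mom = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(steps):
    v = beta * v + grad(p_mom)       # accumulate past gradients
    p_mom = p_mom - lr * v

loss_sgd, loss_mom = loss(p_sgd), loss(p_mom)
print(loss_sgd, loss_mom)
```

In this setup the learning rate is capped by the steep direction, so plain SGD crawls along the shallow one, while momentum ends with a lower loss after the same number of steps.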
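RMSProp
$$\mathbf{s}_t = \rho\,\mathbf{s}_{t-1} + (1 - \rho)\,\mathbf{g}_t^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{\mathbf{s}_t} + \epsilon}\,\mathbf{g}_t$$
where $\mathbf{g}_t = \nabla_\theta L$ and the square and square root are element-wise. Adapts the learning rate per parameter. Parameters with consistently large gradients get smaller effective learning rates.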
Adam (Adaptive Moment Estimation)
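Adam combines momentum with RMSProp. With $g_t = \nabla_\theta L$ at step $t$ and $m_0 = v_0 = 0$:
First moment (mean): $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,g_t$
Second moment (uncentered variance): $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,g_t^2$
Bias correction (crucial for early steps when $m_t$ and $v_t$ are still biased toward their zero initialization): $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$
Parameter update: $\theta_t = \theta_{t-1} - \eta\,\dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
Default hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$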
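A minimal sketch of one Adam step with the usual default hyperparameters. The function name `adam_step` and the example gradients are assumptions for illustration. On the first step the bias-corrected update has magnitude close to the learning rate for every parameter, regardless of how large its gradient is.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # first moment
    v = beta2 * v + (1 - beta2) * grad**2     # second moment
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
grad = np.array([0.5, 500.0])   # hypothetical gradients with very different scales
m = np.zeros(2)
v = np.zeros(2)

theta_new, m, v = adam_step(theta, grad, m, v, t=1)
step = theta - theta_new
print(step)                     # both entries are approximately 0.001
```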
Worked Example — Adam Optimizer (3 Steps)
Setup: Single parameter
Gradients:
Step 1 ($t = 1$):
(bias correction is huge at $t = 1$)
Step 2 ($t = 2$):
Result: Adam adapts the effective learning rate per parameter. The bias correction at early steps is critical: without it, the moment estimates would stay biased toward their zero initialization.
Adam Is the Default
Adam works well out of the box for most problems. Start with Adam and a learning rate of
AdamW (Adam with Decoupled Weight Decay)
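Standard Adam implements weight decay as an L2 penalty added to the gradient, so the decay term is rescaled by the adaptive learning rate along with everything else. AdamW decouples the two and applies the decay directly to the weights:
$$\theta_t = \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_{t-1}\right)$$
where $\lambda$ is the weight decay coefficient and $\hat{m}_t$, $\hat{v}_t$ are Adam's bias-corrected moments computed from the loss gradient alone.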
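A minimal sketch of a decoupled weight decay step. The function name `adamw_step` and the decay coefficient of 0.01 are assumptions for illustration. With a zero loss gradient the adaptive term vanishes and only the decay acts, shrinking each weight by the factor `1 - lr * weight_decay`.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: the moments see only the loss gradient,
    and the decay is applied to the weights directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta = np.array([2.0, -4.0])
zero_grad = np.zeros(2)
theta_new, m, v = adamw_step(theta, zero_grad, np.zeros(2), np.zeros(2), t=1)
print(theta_new)  # each weight scaled by 1 - lr * weight_decay
```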
From-Scratch MLP in NumPy: MNIST
Here is a complete MLP trained on MNIST using only NumPy, implementing everything derived above.
import numpy as np


# ── Activation functions ─────────────────────────────────────────────
def relu(z):
    return np.maximum(0, z)


def relu_derivative(z):
    return (z > 0).astype(float)


def softmax(z):
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)


# ── Loss ─────────────────────────────────────────────────────────────
def cross_entropy_loss(y_pred, y_true):
    """y_true is one-hot, y_pred is softmax output."""
    N = y_true.shape[0]
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    loss = -np.sum(y_true * np.log(y_pred)) / N
    return loss


# ── Weight initialization (He) ───────────────────────────────────────
def he_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)


# ── MLP class ────────────────────────────────────────────────────────
class MLP:
    def __init__(self, layer_sizes, lr=0.001):
        """
        layer_sizes: list, e.g., [784, 128, 64, 10]
        """
        self.lr = lr
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            W = he_init(layer_sizes[i], layer_sizes[i + 1])
            b = np.zeros((1, layer_sizes[i + 1]))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass. Store intermediates for backprop."""
        self.activations = [X]
        self.pre_activations = []
        a = X
        for i in range(len(self.weights) - 1):
            z = a @ self.weights[i] + self.biases[i]
            self.pre_activations.append(z)
            a = relu(z)
            self.activations.append(a)
        # Output layer: softmax
        z = a @ self.weights[-1] + self.biases[-1]
        self.pre_activations.append(z)
        a = softmax(z)
        self.activations.append(a)
        return a

    def backward(self, y_true):
        """Backpropagation. Computes gradients and updates weights in place."""
        N = y_true.shape[0]
        n_layers = len(self.weights)
        # Output layer: softmax + cross-entropy gradient
        delta = (self.activations[-1] - y_true) / N
        for i in reversed(range(n_layers)):
            # Weight and bias gradients
            dW = self.activations[i].T @ delta
            db = np.sum(delta, axis=0, keepdims=True)
            # Propagate delta to previous layer (if not input layer),
            # using the weights from before this layer's update
            if i > 0:
                delta = (delta @ self.weights[i].T) * relu_derivative(
                    self.pre_activations[i - 1]
                )
            # Update weights
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def predict(self, X):
        return np.argmax(self.forward(X), axis=1)


# ── Data loading ─────────────────────────────────────────────────────
def load_mnist():
    """Load MNIST using sklearn (for simplicity)."""
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
    X, y = mnist.data, mnist.target.astype(int)
    # Normalize to [0, 1]
    X = X / 255.0
    # One-hot encode labels
    y_onehot = np.zeros((y.shape[0], 10))
    y_onehot[np.arange(y.shape[0]), y] = 1
    # Split: 60K train, 10K test
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y_onehot[:60000], y_onehot[60000:]
    y_test_labels = y[60000:]
    return X_train, y_train, X_test, y_test, y_test_labels


# ── Training ─────────────────────────────────────────────────────────
def train():
    X_train, y_train, X_test, y_test, y_test_labels = load_mnist()
    model = MLP(layer_sizes=[784, 256, 128, 10], lr=0.001)
    batch_size = 64
    epochs = 20
    for epoch in range(epochs):
        # Shuffle training data
        indices = np.random.permutation(X_train.shape[0])
        X_shuffled = X_train[indices]
        y_shuffled = y_train[indices]
        epoch_loss = 0
        n_batches = 0
        for i in range(0, X_train.shape[0], batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]
            # Forward pass
            y_pred = model.forward(X_batch)
            loss = cross_entropy_loss(y_pred, y_batch)
            epoch_loss += loss
            n_batches += 1
            # Backward pass (updates weights internally)
            model.backward(y_batch)
        # Evaluate
        test_preds = model.predict(X_test)
        accuracy = np.mean(test_preds == y_test_labels)
        print(
            f"Epoch {epoch + 1:2d} | "
            f"Loss: {epoch_loss / n_batches:.4f} | "
            f"Test Accuracy: {accuracy:.4f}"
        )


if __name__ == "__main__":
    train()

Expected output (abridged to selected epochs; exact numbers depend on the random initialization):
Epoch 1 | Loss: 0.5832 | Test Accuracy: 0.9112
Epoch 5 | Loss: 0.1254 | Test Accuracy: 0.9598
Epoch 10 | Loss: 0.0612 | Test Accuracy: 0.9701
Epoch 20 | Loss: 0.0198 | Test Accuracy: 0.9752

From Scratch to Framework
This NumPy implementation performs the same forward and backward computations that PyTorch carries out for an equivalent model. The key difference is that PyTorch records a computation graph automatically and computes gradients via autograd, so you never write backward() by hand. See PyTorch Fundamentals for the framework version.
The Computational Graph
Backpropagation works by traversing a directed acyclic graph (DAG) of operations in reverse order. Every operation records its inputs and the local gradient.
Each node stores
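The idea of recording inputs and local gradients, then walking the graph in reverse, can be sketched with a tiny scalar autograd. The `Value` class below is a hypothetical minimal illustration that supports only addition and multiplication; it is not how any particular framework is implemented.

```python
class Value:
    """A scalar node in a computation graph (hypothetical minimal class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes of the op that produced this value
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order so each node is handled after all of its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            # Chain rule: pass this node's gradient to each input, scaled by the local gradient.
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


a, b = Value(2.0), Value(3.0)
f = a * b + a          # f = a*b + a, so df/da = b + 1 and df/db = a
f.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Because `a` feeds into two operations, its gradient is the sum of the contributions from both paths, which is why the accumulation uses `+=`.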
Numerical Gradient Checking
Always verify your analytical gradients with numerical gradients during development:
import numpy as np


def numerical_gradient(f, x, eps=1e-5):
    """Compute numerical gradient of f at x (x must be a float array)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]
        x[idx] = old_val + eps
        f_plus = f(x)
        x[idx] = old_val - eps
        f_minus = f(x)
        x[idx] = old_val  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)  # central difference
        it.iternext()
    return grad

# Usage: compare with analytical gradient
# relative_error = |analytical - numerical| / max(|analytical|, |numerical|)
# Should be < 1e-5 for float64

Mini-Batch SGD: Why Batches?
| Strategy | Batch Size | Pros | Cons |
|---|---|---|---|
| Full-batch GD | All data | Stable gradient | Slow, needs lots of memory |
| Stochastic GD | 1 sample | Fast updates | Very noisy |
| Mini-batch SGD | 32--512 | Balance of speed and stability | Requires batch size tuning |
Mini-batch SGD is the standard. Typical batch sizes are powers of 2 (32, 64, 128, 256) for GPU memory alignment.
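The gradient of a mini-batch of size $B$ is:
$$\mathbf{g} = \frac{1}{B}\sum_{i=1}^{B} \nabla_\theta L_i(\theta)$$
This is an unbiased estimate of the true gradient when the batch is sampled uniformly from the training set:
$$\mathbb{E}[\mathbf{g}] = \nabla_\theta L(\theta)$$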
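One way to see this is to split a dataset into equal-sized mini-batches and average their gradients: the average equals the full-batch gradient, even though each individual batch gradient is noisy. The sketch below uses a hypothetical linear regression problem with synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))   # synthetic inputs
y = rng.normal(size=120)        # synthetic targets
w = rng.normal(size=3)          # current parameters of a linear model

def mse_grad(Xb, yb, w):
    """Gradient of the mean squared error of the linear model Xb @ w."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_grad(X, y, w)

batch_size = 30
perm = rng.permutation(len(y))  # shuffle, then cut into equal-sized batches
batch_grads = [
    mse_grad(X[perm[i:i + batch_size]], y[perm[i:i + batch_size]], w)
    for i in range(0, len(y), batch_size)
]
mean_batch_grad = np.mean(batch_grads, axis=0)

print(full_grad)
print(mean_batch_grad)          # matches the full-batch gradient
```

The equality is exact here because the batches have equal size and together cover every sample once; a single randomly chosen batch matches the full gradient only in expectation.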
Common Pitfalls
| Mistake | Symptom | Fix |
|---|---|---|
| Learning rate too high | Loss explodes or oscillates wildly | Reduce LR by 10x |
| Learning rate too low | Loss decreases extremely slowly | Increase LR or use scheduler |
| No input normalization | Training unstable, slow convergence | Normalize to mean 0, std 1 |
| Incorrect shapes | Dimension mismatch errors | Print shapes at every layer |
| Forgetting bias | Underfitting | Always include bias terms |
| Not shuffling data | Model learns order, not patterns | Shuffle each epoch |
| Integer labels with a one-hot cross-entropy | Wrong loss values | One-hot encode labels (the loss above expects one-hot `y_true`) |
Cross-References
- Next step: PyTorch Fundamentals --- do this with autograd instead of manual backprop
- Training recipes: Training Techniques --- BatchNorm, dropout, LR scheduling
- First architecture: Convolutional Neural Networks --- apply these concepts to images
- Mathematical foundations: Deep Learning Overview --- the big picture
- Optimization deep dive: Reinforcement Learning --- gradient-based optimization in a different setting