
Convolutional Neural Networks

Convolutional neural networks exploit the spatial structure of images by using local connectivity, weight sharing, and translation equivariance. This page derives the convolution operation mathematically, traces the evolution from LeNet to EfficientNet, implements a CNN from scratch in PyTorch, trains it on CIFAR-10, and visualizes learned features.

The Convolution Operation

Discrete 2D Convolution

For an input image I and a kernel (filter) K of size k×k:

$$(I * K)(i,j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m,\, j+n)\, K(m,n)$$

Technically, this is cross-correlation. True convolution flips the kernel, but deep learning uses cross-correlation and calls it "convolution."
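The difference between the two operations can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper names `cross_correlate2d` and `convolve2d` and the small test arrays are made up for this example, and it assumes a square kernel, no padding, and stride 1.

```python
def cross_correlate2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation on nested lists."""
    k = len(kernel)  # assumes a square k x k kernel
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(k) for n in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def convolve2d(image, kernel):
    """True convolution: flip the kernel along both axes, then cross-correlate."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(image, flipped)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))  # [[-4, -4], [-4, -4]]
print(convolve2d(image, kernel))         # [[4, 4], [4, 4]]
```

Because the kernel weights are learned, the flip makes no practical difference to what a network can represent, which is why frameworks skip it.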

Multi-Channel Convolution

For an input with $C_{in}$ channels and $C_{out}$ filters:

$$O_f(i,j) = b_f + \sum_{c=1}^{C_{in}} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I_c(i \cdot s + m,\; j \cdot s + n)\, K_{f,c}(m,n)$$

where $f$ indexes the output filter, $c$ indexes the input channel, $s$ is the stride, and $b_f$ is the bias.
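A direct, unoptimized translation of the multi-channel formula may help make the indices concrete. The function name `conv2d_multichannel`, the nested-list layout, and the toy values are assumptions for this sketch; channels are 0-indexed in the code, and there is no padding.

```python
def conv2d_multichannel(inputs, kernels, biases, stride=1):
    """inputs: [C_in][H][W]; kernels: [C_out][C_in][k][k]; biases: [C_out]."""
    c_in = len(inputs)
    k = len(kernels[0][0])
    out_h = (len(inputs[0]) - k) // stride + 1
    out_w = (len(inputs[0][0]) - k) // stride + 1
    output = []
    for f, bias in enumerate(biases):
        # Every output position starts from the filter's bias b_f.
        fmap = [[bias] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                for c in range(c_in):
                    for m in range(k):
                        for n in range(k):
                            fmap[i][j] += (
                                inputs[c][i * stride + m][j * stride + n]
                                * kernels[f][c][m][n]
                            )
        output.append(fmap)
    return output


# Two 4x4 input channels, one 2x2 filter, stride 2.
inputs = [
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
]
kernels = [[[[1, 0], [0, 1]],    # filter 0, input channel 0
            [[1, 1], [1, 1]]]]   # filter 0, input channel 1
biases = [0.5]

print(conv2d_multichannel(inputs, kernels, biases, stride=2))
# [[[11.5, 15.5], [27.5, 31.5]]]
```

Each output value sums over all input channels before the bias is counted once, which is why a filter has one bias regardless of $C_{in}$.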

Output Size Formula

For input size W, kernel size K, padding P, and stride S:

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
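The formula maps directly onto integer arithmetic. The sketch below assumes the floor convention for strides that do not divide evenly; `conv_output_size` is a hypothetical helper name, and the configurations are common CIFAR-sized and ImageNet-sized cases.

```python
def conv_output_size(w, k, p, s):
    """Spatial output size for input w, kernel k, padding p, stride s."""
    return (w - k + 2 * p) // s + 1  # integer division acts as the floor


configs = [(32, 3, 1, 1), (32, 3, 0, 1), (32, 5, 2, 1),
           (32, 3, 1, 2), (224, 7, 3, 2)]
for w, k, p, s in configs:
    print(f"W={w} K={k} P={p} S={s} -> {conv_output_size(w, k, p, s)}")
```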

Examples:

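| Input | Kernel | Padding | Stride | Output |
|-------|--------|---------|--------|--------|
| 32 | 3 | 1 | 1 | 32 (same padding) |
| 32 | 3 | 0 | 1 | 30 |
| 32 | 5 | 2 | 1 | 32 |
| 32 | 3 | 1 | 2 | 16 (halved) |
| 224 | 7 | 3 | 2 | 112 |

Worked Example: Convolution on a 5x5 Image with a 3x3 Filter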

Input image (5×5, single channel):

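$$\begin{bmatrix}
1 & 0 & 2 & 1 & 0 \\
0 & 1 & 3 & 0 & 1 \\
2 & 1 & 0 & 2 & 1 \\
1 & 0 & 1 & 0 & 2 \\
0 & 1 & 2 & 1 & 0
\end{bmatrix}$$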

3x3 Filter (edge detector):

$$\begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1
\end{bmatrix}$$

Stride S=1, Padding P=0

Step 1: Output size = $\lfloor (5 - 3 + 0)/1 \rfloor + 1 = 3$. Output is 3×3.

Step 2: Compute top-left output (0,0):

$$1(1) + 0(0) + 2(-1) + 0(1) + 1(0) + 3(-1) + 2(1) + 1(0) + 0(-1) = 1 + 0 - 2 + 0 + 0 - 3 + 2 + 0 + 0 = -2$$

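Step 3: Compute all 9 output values (sliding the filter):

$$\begin{bmatrix}
-2 & -1 & 3 \\
-1 & 0 & 0 \\
0 & -1 & 0
\end{bmatrix}$$

Result: The filter responds to vertical edges (differences between the left and right columns of each 3x3 window). Because the kernel computes left minus right, the value $-2$ at position (0,0) means the right column of that window is brighter than the left. The largest response, $3$ at position (0,2), marks the strongest edge, with the left column brighter.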
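Hand calculations like this are easy to get wrong, so it is worth recomputing every output position in code. The sketch below slides the edge filter over the 5×5 image from this example; `slide_filter` is a hypothetical helper written for this check, assuming stride 1 and no padding.

```python
image = [[1, 0, 2, 1, 0],
         [0, 1, 3, 0, 1],
         [2, 1, 0, 2, 1],
         [1, 0, 1, 0, 2],
         [0, 1, 2, 1, 0]]
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]


def slide_filter(image, kernel):
    """Cross-correlate a square kernel with a square image (stride 1, no padding)."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(k) for n in range(k))
            for j in range(size)
        ]
        for i in range(size)
    ]


# Print the 3x3 output row by row to compare against the worked example.
for row in slide_filter(image, vertical_edge):
    print(row)
```

Each printed value equals the sum of the window's left column minus the sum of its right column, since the middle column of the kernel is zero.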

Parameter Count

A Conv2d layer with $C_{in}$ input channels, $C_{out}$ output channels, and kernel size $k$:

$$\text{Parameters} = C_{out} \times (C_{in} \times k^2 + 1)$$

The +1 is for the bias per filter.

Example: Conv2d(3, 64, 3) has 64×(3×9+1)=1,792 parameters.
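The count is simple enough to wrap in a helper. `conv2d_params` is a hypothetical name for this sketch; the `bias` flag covers layers created with `bias=False`, which is common when a batch-norm layer follows.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters in a k x k convolution layer."""
    per_filter = c_in * k * k + (1 if bias else 0)
    return c_out * per_filter


print(conv2d_params(3, 64, 3))              # 1792
print(conv2d_params(3, 64, 3, bias=False))  # 1728
```

Note that the count does not depend on the input's height or width, only on channels and kernel size.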

Why Convolutions Beat Fully Connected

For a 224x224x3 image:

  • Fully connected to 1000 neurons: 224×224×3×1000=150M parameters
  • Conv2d(3, 64, 3): 1,792 parameters

Convolutions exploit:

  1. Local connectivity: Each neuron sees only a small receptive field
  2. Weight sharing: Same filter applied everywhere
  3. Translation equivariance: Shifting the input shifts the feature map by the same amount, so a cat is detected wherever it appears
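Weight sharing and equivariance can be seen in a one-dimensional toy case. In this sketch (the helper `correlate1d` and the signals are invented for illustration), the same three-weight kernel is applied at every position, and moving the input pattern by two steps moves the response by two steps. The equality holds here because the pattern stays away from the borders.

```python
def correlate1d(signal, kernel):
    """Valid 1D cross-correlation: the same kernel weights at every position."""
    k = len(kernel)
    return [sum(signal[i + m] * kernel[m] for m in range(k))
            for i in range(len(signal) - k + 1)]


kernel = [1, 0, -1]
signal = [0, 0, 5, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 5, 0, 0, 0]  # same pattern moved right by 2

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

print(out)          # [-5, 0, 5, 0, 0, 0]
print(out_shifted)  # [0, 0, -5, 0, 5, 0]
print(out_shifted == [0, 0] + out[:-2])  # True
```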

Pooling

Pooling reduces spatial dimensions, providing translation invariance and reducing computation.

Max Pooling

$$\text{MaxPool}(i,j) = \max_{(m,n) \in R_{ij}} I(m,n)$$

where $R_{ij}$ is the pooling region that maps to output position $(i,j)$.

Worked Example: Max Pooling (2x2, stride 2)

Input (4×4):

$$\begin{bmatrix}
1 & 3 & 2 & 4 \\
5 & 6 & 1 & 2 \\
0 & 2 & 8 & 3 \\
1 & 4 & 5 & 7
\end{bmatrix}$$

Step 1: Divide into 2x2 regions and take the max:

  • Region (0,0): max(1, 3, 5, 6) = 6
  • Region (0,1): max(2, 4, 1, 2) = 4
  • Region (1,0): max(0, 2, 1, 4) = 4
  • Region (1,1): max(8, 3, 5, 7) = 8

Output (2×2):

$$\begin{bmatrix}
6 & 4 \\
4 & 8
\end{bmatrix}$$

Result: The spatial dimension is halved (4x4 to 2x2) and only the strongest activation in each region survives. This provides a degree of translation invariance: the output 6 would be the same wherever the 6 sat within the top-left 2x2 region.

Average Pooling

$$\text{AvgPool}(i,j) = \frac{1}{|R_{ij}|} \sum_{(m,n) \in R_{ij}} I(m,n)$$
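Max and average pooling differ only in how each region is reduced, which a small sketch makes explicit. The function `pool2d` and its `reducer` argument are hypothetical names for this illustration; it reuses the 4×4 input from the max pooling worked example.

```python
def pool2d(image, size=2, stride=2, reducer=max):
    """Apply `reducer` to each size x size region of a 2D nested list."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            reducer([image[i * stride + m][j * stride + n]
                     for m in range(size) for n in range(size)])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def mean(values):
    return sum(values) / len(values)


image = [[1, 3, 2, 4],
         [5, 6, 1, 2],
         [0, 2, 8, 3],
         [1, 4, 5, 7]]

print(pool2d(image))                # [[6, 4], [4, 8]]
print(pool2d(image, reducer=mean))  # [[3.75, 2.25], [1.75, 5.75]]
```

Setting `size` and `stride` to the full input size turns the average variant into a single mean per feature map.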

Global Average Pooling (GAP)

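Takes the mean over the entire spatial extent of each channel, reducing H×W×C to 1×1×C. It replaces the large fully connected layers that older architectures placed before the classifier. The idea was introduced in Network in Network and adopted by GoogLeNet: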

python
# PyTorch
gap = nn.AdaptiveAvgPool2d(1)  # Output: (batch, channels, 1, 1)

Architecture Evolution

LeNet-5 (1998)

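One of the first successful CNNs. Yann LeCun and colleagues used it for handwritten digit recognition. The version below takes 28x28 inputs (as in MNIST); the original network used 32x32 inputs.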

python
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5),         # 28x28 -> 24x24
            nn.Tanh(),
            nn.AvgPool2d(2, 2),         # 24x24 -> 12x12
            nn.Conv2d(6, 16, 5),        # 12x12 -> 8x8
            nn.Tanh(),
            nn.AvgPool2d(2, 2),         # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 4 * 4, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
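The shape comments in the block above follow from the output size formula, and the same arithmetic gives the parameter count. This is a sketch under the 28x28 input assumption used in those comments; `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel, padding=0, stride=1):
    return (size - kernel + 2 * padding) // stride + 1


size = 28
stages = [("conv1", 5, 1), ("pool1", 2, 2), ("conv2", 5, 1), ("pool2", 2, 2)]
for name, kernel, stride in stages:
    size = conv_out(size, kernel, stride=stride)
    print(f"{name}: {size}x{size}")

flattened = 16 * size * size  # input features of the first Linear layer
print(flattened)              # 256

# Parameters: c_out * (c_in * k^2 + 1) per conv, n_out * (n_in + 1) per linear.
conv_params = 6 * (1 * 25 + 1) + 16 * (6 * 25 + 1)
fc_params = 120 * (flattened + 1) + 84 * (120 + 1) + 10 * (84 + 1)
print(conv_params, fc_params, conv_params + fc_params)  # 2572 41854 44426
```

Most of the parameters sit in the fully connected layers, a pattern that persists in AlexNet and VGG.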

AlexNet (2012)

Won ImageNet 2012 by a huge margin. Key innovations: ReLU, dropout, data augmentation, GPU training.

  • 5 conv layers, 3 FC layers, ~60M parameters
  • Popularized ReLU in place of tanh for deep CNNs
  • Trained on 2 GTX 580 GPUs (3GB VRAM each)
  • Top-5 error: 15.3% (vs 26.2% for second place)

VGG (2014)

Showed that depth matters. Used only 3x3 convolutions stacked deep.

Key insight: Two 3x3 convolutions have the same receptive field as one 5x5, but with fewer parameters and more nonlinearity:

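$$\text{Two 3x3: } 2 \times (3^2 C^2) = 18C^2 \quad \text{vs.} \quad \text{One 5x5: } 25C^2$$

where $C$ is the number of input and output channels of each layer (biases ignored).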

VGG-16: 16 weight layers, ~138M parameters. Simple and elegant but very memory-hungry.
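The trade-off between stacked small kernels and one large kernel can be checked numerically. The helpers below are hypothetical names for this sketch; they assume every layer maps C channels to C channels with stride 1 and ignore biases.

```python
def stacked_conv_weights(kernel, layers, channels):
    """Weights in `layers` stacked kernel x kernel convs, C -> C channels each."""
    return layers * kernel * kernel * channels * channels


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return layers * (kernel - 1) + 1


C = 64
two_3x3 = stacked_conv_weights(3, 2, C)
one_5x5 = stacked_conv_weights(5, 1, C)

print(two_3x3, one_5x5)                                          # 73728 102400
print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))  # 5 5
print(round(two_3x3 / one_5x5, 2))                               # 0.72
```

The ratio 18/25 is independent of C, so the saving applies at every width.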

GoogLeNet/Inception (2014)

Introduced the Inception module: parallel branches with different kernel sizes (1x1, 3x3, 5x5) plus a pooling branch, whose outputs are concatenated along the channel dimension.

1x1 convolutions reduce channel dimensions before expensive 3x3 and 5x5 convolutions (bottleneck).
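A rough multiply-accumulate count shows how much the 1x1 reduction saves on a 5x5 branch. The feature map size and channel counts below are illustrative assumptions, not figures taken from this page, and `conv_macs` is a hypothetical helper.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a k x k conv producing an h x w x c_out output."""
    return h * w * c_out * c_in * k * k


H = W = 28
C_IN, C_MID, C_OUT = 192, 16, 32  # assumed channel counts for illustration

direct = conv_macs(H, W, C_IN, C_OUT, 5)
reduced = conv_macs(H, W, C_IN, C_MID, 1) + conv_macs(H, W, C_MID, C_OUT, 5)

print(direct)                     # 120422400
print(reduced)                    # 12443648
print(round(direct / reduced, 1)) # 9.7
```

With these numbers the bottlenecked branch needs roughly a tenth of the arithmetic of the direct 5x5 convolution.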

ResNet (2015)

The breakthrough that enabled very deep networks (100+ layers). Introduced skip connections (residual learning).

The degradation problem: Deeper plain networks had higher training error than shallower ones. This was not overfitting but an optimization difficulty: adding layers to a good network should not hurt, since at worst the extra layers could learn the identity.

Residual block:

$$y = F(x, \{W_i\}) + x$$

Instead of learning $H(x)$ directly, the network learns the residual $F(x) = H(x) - x$. Learning "how to modify the input" is easier than learning the output from scratch.

python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1,
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # Skip connection
        out = torch.relu(out)
        return out
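The claim that extra layers can "learn identity" is easiest to see in a stripped-down toy. In the sketch below, F is reduced to a single linear map on a vector (an assumption made purely for illustration; the functions are not part of the PyTorch block above, and the final ReLU is omitted). With all weights at zero, the residual block passes its input through unchanged, while a plain layer wipes it out.

```python
def linear(x, weights):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def plain_block(x, weights):
    """y = F(x): the layer must learn the whole mapping."""
    return linear(x, weights)


def residual_block(x, weights):
    """y = F(x) + x: the layer only learns a correction to x."""
    return [f + xi for f, xi in zip(linear(x, weights), x)]


x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

print(residual_block(x, zero_weights))  # [1.0, -2.0, 3.0]
print(plain_block(x, zero_weights))     # [0.0, 0.0, 0.0]
```

Driving weights toward zero is something optimizers and weight decay do readily, whereas a plain layer would have to learn an exact identity matrix to achieve the same result.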

Bottleneck Block (ResNet-50+)

For deeper ResNets, a bottleneck reduces computation:

$$1 \times 1 \text{ (reduce)} \rightarrow 3 \times 3 \text{ (process)} \rightarrow 1 \times 1 \text{ (expand)}$$

python
class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_channels, mid_channels, stride=1):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1,
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = torch.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return torch.relu(out)
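To see why the bottleneck is cheaper, compare convolution weights for a block operating on 256 channels. The sketch assumes `mid_channels = 64` with `expansion = 4` as in the class above, ignores batch norm and the shortcut, and uses a hypothetical helper `conv_weights`.

```python
def conv_weights(c_in, c_out, k):
    """Weights in a bias-free k x k convolution."""
    return c_in * c_out * k * k


# Basic block: two 3x3 convolutions at full width.
basic = 2 * conv_weights(256, 256, 3)

# Bottleneck: 1x1 reduce to 64, 3x3 at 64, 1x1 expand back to 256.
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))

print(basic)                         # 1179648
print(bottleneck)                    # 69632
print(round(basic / bottleneck, 1))  # 16.9
```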

EfficientNet (2019)

Systematically scales depth, width, and resolution together using compound scaling:

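$$\text{depth: } d = \alpha^{\phi}, \quad \text{width: } w = \beta^{\phi}, \quad \text{resolution: } r = \gamma^{\phi}$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$, so that FLOPs roughly double each time $\phi$ increases by 1.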

Uses MBConv blocks (mobile inverted bottleneck with squeeze-and-excitation).
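The compound scaling rule can be tabulated in a few lines. The coefficients below (α = 1.2, β = 1.1, γ = 1.15) are the values reported for the EfficientNet baseline search and are treated here as an assumption, since this page does not state them; `compound_scale` is a hypothetical helper.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # assumed coefficients, not given on this page


def compound_scale(phi):
    """Depth, width, and resolution multipliers for scaling exponent phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


# FLOPs grow linearly with depth and quadratically with width and resolution.
flops_per_step = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_per_step, 2))  # 1.92

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_per_step ** phi:.2f}")
```

The per-step factor comes out slightly under 2, consistent with the "approximately 2" constraint.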

Architecture Timeline

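| Year | Architecture | Top-5 Error (ImageNet) | Key Innovation |
|------|--------------|------------------------|----------------|
| 1998 | LeNet-5 | N/A | First CNN |
| 2012 | AlexNet | 15.3% | ReLU, GPU training |
| 2014 | VGG-16 | 7.3% | Deep 3x3 stacks |
| 2014 | GoogLeNet | 6.7% | Inception module |
| 2015 | ResNet-152 | 3.6% | Skip connections |
| 2017 | SENet | 2.3% | Channel attention |
| 2019 | EfficientNet-B7 | 2.9% | Compound scaling |
| 2021 | ViT | ~1.5% | Transformer on patches |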

Transfer Learning with Pretrained CNNs

Use pretrained ImageNet weights as a starting point for your task.

python
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

# Load pretrained ResNet-50
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace final layer for your number of classes
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Option 1: Fine-tune all layers
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# Option 2: Freeze backbone, train only head
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

# Option 3: Gradual unfreezing (train head first, then unfreeze)
# Train head for 5 epochs, then unfreeze all and train with lower LR

Complete CNN: CIFAR-10

python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# ── Data ─────────────────────────────────────────────────────────────
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = torchvision.datasets.CIFAR10(
    './data', train=True, download=True, transform=train_transform)
test_set = torchvision.datasets.CIFAR10(
    './data', train=False, download=True,
    transform=T.Compose([
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ]))

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)

# ── Model ────────────────────────────────────────────────────────────
class SmallResNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        self.layer1 = self._make_layer(64, 64, 2, stride=1)
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        self.layer3 = self._make_layer(128, 256, 2, stride=2)

        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_classes)

    def _make_layer(self, in_ch, out_ch, num_blocks, stride):
        layers = [ResidualBlock(in_ch, out_ch, stride)]
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_ch, out_ch, 1))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = torch.relu(self.bn1(self.conv1(x)))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.gap(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)

# ── Train ────────────────────────────────────────────────────────────
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SmallResNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()

    if (epoch + 1) % 10 == 0:
        model.eval()
        with torch.no_grad():  # evaluation needs no autograd graph
            correct = sum(
                (model(x.to(device)).argmax(1) == y.to(device)).sum().item()
                for x, y in test_loader
            )
        print(f"Epoch {epoch+1}: {100*correct/len(test_set):.2f}%")

# Expected: ~93% after 200 epochs

Visualizing Feature Maps

python
import matplotlib.pyplot as plt

def visualize_features(model, image, layer_name='conv1'):
    """Visualize feature maps of a specific layer."""
    activation = {}

    def hook_fn(module, input, output):
        activation[layer_name] = output.detach()

    # Register hook
    layer = dict(model.named_modules())[layer_name]
    hook = layer.register_forward_hook(hook_fn)

    # Forward pass
    model.eval()
    with torch.no_grad():
        model(image.unsqueeze(0).to(next(model.parameters()).device))

    hook.remove()

    # Plot first 16 feature maps
    feat = activation[layer_name].cpu().squeeze(0)
    fig, axes = plt.subplots(4, 4, figsize=(10, 10))
    for i, ax in enumerate(axes.flat):
        if i < feat.size(0):
            ax.imshow(feat[i], cmap='viridis')
        ax.axis('off')
    plt.suptitle(f'Feature Maps: {layer_name}')
    plt.tight_layout()
    plt.show()

1x1 Convolutions

A 1×1 convolution is a per-pixel fully connected layer across channels. It:

  1. Reduces/increases channel dimensions (bottleneck)
  2. Adds nonlinearity (with activation after it)
  3. Mixes channel information without spatial mixing

python
# Reduce 512 channels to 64
bottleneck = nn.Conv2d(512, 64, kernel_size=1)
# Parameters: 512 * 64 + 64 = 32,832
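The "per-pixel fully connected layer" view can be made concrete without a framework. In this sketch (hypothetical helper `conv1x1`, toy 3-channel input), every pixel's channel vector is multiplied by the same small weight matrix, and no neighbouring pixels are involved.

```python
def conv1x1(feature_map, weights, biases):
    """feature_map: [C_in][H][W]; weights: [C_out][C_in]; biases: [C_out]."""
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [
        [
            [bias + sum(wt * feature_map[c][i][j] for c, wt in enumerate(row))
             for j in range(w)]
            for i in range(h)
        ]
        for row, bias in zip(weights, biases)
    ]


feature_map = [[[1, 2], [3, 4]],       # channel 0
               [[10, 20], [30, 40]],   # channel 1
               [[0, 1], [0, 1]]]       # channel 2
weights = [[1, 1, 0],   # output 0 = channel 0 + channel 1
           [0, 0, 2]]   # output 1 = 2 * channel 2 (+ bias)
biases = [0, 1]

print(conv1x1(feature_map, weights, biases))
# [[[11, 22], [33, 44]], [[1, 3], [1, 3]]]
```

The spatial size stays 2x2 while the channel count drops from 3 to 2, which is the bottleneck effect on a small scale.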

Receptive Field

The receptive field is the region of the input that influences a particular output neuron. It grows with depth.

For a stack of n layers with kernel size k and stride 1:

$$\text{Receptive field} = n(k - 1) + 1$$

Worked Example: Receptive Field Growth

Setup: Stack of 3 convolutional layers, each with kernel size k=3, stride 1.

  • After 1 layer: $r = 1(3 - 1) + 1 = 3$ pixels (each output neuron "sees" a 3x3 patch)
  • After 2 layers: $r = 2(3 - 1) + 1 = 5$ pixels
  • After 3 layers: $r = 3(3 - 1) + 1 = 7$ pixels

Compare: A single 7x7 convolution also has receptive field 7, but uses $7^2 = 49$ parameters per channel pair versus $3 \times 9 = 27$ for three 3x3 layers. This is why VGG uses stacked 3x3 convolutions: same receptive field, fewer parameters, more nonlinearities.

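When strides are involved, a layer with kernel size $k$ added on top of a stack with receptive field $r$ gives:

$$r_{new} = r + (k - 1) \times \prod_i s_i$$

where the product runs over the strides $s_i$ of all preceding layers.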
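The recurrence is straightforward to apply layer by layer. The sketch below tracks the running product of strides alongside the receptive field; `receptive_field` is a hypothetical helper, and the layer lists are examples chosen for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered input to output."""
    r = 1       # a single input pixel sees itself
    jump = 1    # product of the strides of all preceding layers
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r


print(receptive_field([(3, 1)] * 3))       # 7
print(receptive_field([(3, 2), (3, 1)]))   # 7
print(receptive_field([(7, 2), (3, 2)]))   # 11
```

With all strides equal to 1 this reduces to $n(k - 1) + 1$; a stride-2 layer early in the stack lets later layers reach the same receptive field with fewer layers.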
