Transfer Learning
Transfer learning reuses knowledge learned on one task (the source) to improve performance on a different task (the target). It is among the most widely used techniques in modern deep learning: most practical projects start from pretrained weights rather than training from scratch. This page explains why transfer learning works, compares feature extraction with fine-tuning, covers domain adaptation, builds Siamese networks for few-shot learning, and demonstrates zero-shot classification with CLIP.
Why Transfer Learning Works
Hierarchical Feature Learning
Deep networks learn increasingly abstract features:
| Layer | Learns | Transferable? |
|---|---|---|
| Layer 1-2 | Edges, textures | Highly (universal) |
| Layer 3-5 | Parts, patterns | Moderately |
| Layer 6+ | Task-specific concepts | Least |
Early layers learn features that are useful for virtually any visual task. Later layers specialize. Transfer learning reuses the universal features and only retrains the specialized ones.
Mathematical Intuition
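Let $\theta_S$ be the parameters learned on the source task. Transfer learning initializes the target model from $\theta_S$ rather than from random weights, so optimization starts from features that already encode useful structure.

If the source and target tasks share low-level structure, the target model needs fewer labeled examples and fewer training steps to reach a given accuracy than a model trained from scratch.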
When Does It Help?
Feature Extraction vs Fine-Tuning
Feature Extraction (Frozen Backbone)
Use the pretrained model as a fixed feature extractor:
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set to the number of classes in the target dataset

# Load pretrained ResNet-50
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier (newly created layers have requires_grad=True)
model.fc = nn.Sequential(
    nn.Linear(2048, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, num_classes),
)

# Only classifier parameters train
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

Pros: Fast (fewer parameters to train), works with tiny datasets.

Cons: Limited adaptation to target domain.
Fine-Tuning
Unfreeze some or all pretrained layers:
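# Strategy 1: Discriminative learning rates
# Earlier layers get smaller learning rates. The stem (conv1, bn1) is not
# listed, so this optimizer never updates it. Any layer frozen earlier needs
# requires_grad set back to True before its group has an effect.
optimizer = torch.optim.AdamW([
    {'params': model.layer1.parameters(), 'lr': 1e-5},
    {'params': model.layer2.parameters(), 'lr': 5e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 5e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3},
], weight_decay=0.01)

# Strategy 2: Gradual unfreezing
# Epoch 1-3: Only train fc
# Epoch 4-6: Unfreeze layer4
# Epoch 7+: Unfreeze all
# The optimizer must already hold the parameters being unfrozen (or be
# rebuilt after each change); otherwise they get gradients but no updates.
def unfreeze_schedule(model, epoch):
    if epoch == 4:
        for param in model.layer4.parameters():
            param.requires_grad = True
    elif epoch == 7:
        for param in model.parameters():
            param.requires_grad = True

Complete Transfer Learning Pipeline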
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# ── Data ─────────────────────────────────────────────────────────────
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# ImageFolder dataset structure:
# dataset/train/class1/*.jpg, dataset/train/class2/*.jpg, ...
train_set = torchvision.datasets.ImageFolder('dataset/train', train_transform)
val_set = torchvision.datasets.ImageFolder('dataset/val', val_transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_set, batch_size=32, num_workers=4)

# ── Model ────────────────────────────────────────────────────────────
model = torchvision.models.efficientnet_b0(
    weights=torchvision.models.EfficientNet_B0_Weights.IMAGENET1K_V1
)
model.classifier = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(1280, len(train_set.classes)),
)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# ── Phase 1: Train head only ────────────────────────────────────────
for param in model.features.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

# ── Phase 2: Fine-tune all ──────────────────────────────────────────
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # once per epoch, matching T_max=20

Domain Adaptation
When the source and target domains differ significantly (e.g., photos vs sketches), standard transfer learning may not work. Domain adaptation techniques align the feature distributions.
Types
| Type | Source Labels | Target Labels | Setting |
|---|---|---|---|
| Supervised | Yes | Yes | Both labeled |
| Semi-supervised | Yes | Some | Few target labels |
| Unsupervised | Yes | No | No target labels |
DANN (Domain-Adversarial Neural Network)
Train a feature extractor that is:
- Discriminative for the task (classification loss)
- Indistinguishable across domains (domain confusion loss via gradient reversal)
The gradient reversal layer flips the gradient sign during backprop, making the feature extractor learn domain-invariant features.
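The gradient reversal idea can be sketched as a custom autograd function. This is a minimal illustration, not the reference DANN implementation: the names `GradReverse`, `grad_reverse`, and the scaling factor `lambd` are chosen here for the example.

```python
import torch
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambd on the way back."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward input; lambd is a plain float, so it gets None.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


# Tiny check: y = sum(2 * x) has gradient 2 everywhere without reversal.
x = torch.ones(3, requires_grad=True)
y = (2.0 * grad_reverse(x, lambd=0.5)).sum()
y.backward()
reversed_grad = x.grad.clone()  # each entry is -0.5 * 2 = -1
```

In a DANN-style model, the domain classifier would typically receive `grad_reverse(features)`, so minimizing the domain loss trains the classifier normally while pushing the feature extractor toward features the classifier cannot separate.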
Few-Shot Learning
The Problem
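Given only $K$ labeled examples per class (often 1 to 5), classify new examples among $N$ classes. This setting is called $N$-way $K$-shot classification.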
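Few-shot methods are commonly trained and evaluated on small episodes, each with a handful of labeled support examples per class and some held-out query examples. The sampler below is a sketch under that assumption; `sample_episode` and its parameters (`n_way`, `k_shot`, `n_query`) are hypothetical names, and the dataset is represented only by a list of labels.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Return (support, query) lists of dataset indices for one episode."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only classes with enough examples for support + query are eligible.
    eligible = sorted(c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)

    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += picked[:k_shot]
        query += picked[k_shot:]
    return support, query


# Toy dataset: 10 classes with 20 examples each (labels only).
labels = [c for c in range(10) for _ in range(20)]
support, query = sample_episode(labels, n_way=5, k_shot=1, n_query=3, rng=random.Random(0))
```

The support indices feed the few-shot model as its only labeled data for the episode, and accuracy is measured on the query indices.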
Siamese Networks
Learn a similarity function rather than class labels:
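import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        # backbone must map an image batch to 2048-d features,
        # e.g. a ResNet-50 whose fc layer is replaced by nn.Identity()
        self.backbone = backbone
        self.fc = nn.Sequential(
            nn.Linear(2048, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward_one(self, x):
        return self.backbone(x)

    def forward(self, x1, x2):
        # Both inputs go through the same backbone (shared weights)
        f1 = self.forward_one(x1)
        f2 = self.forward_one(x2)
        diff = torch.abs(f1 - f2)
        # Similarity score in (0, 1)
        return torch.sigmoid(self.fc(diff))

Contrastive Loss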
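For a pair of embeddings $f_1, f_2$ with Euclidean distance $d = \lVert f_1 - f_2 \rVert_2$, a common form of the contrastive loss is

$$\mathcal{L} = \tfrac{1}{2}\left[\, y \, d^2 + (1 - y) \, \max(0,\; m - d)^2 \,\right]$$

where $y = 1$ for a pair from the same class and $y = 0$ otherwise, and $m > 0$ is the margin beyond which dissimilar pairs stop contributing to the loss. Some references use the opposite label convention.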
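A margin-based contrastive loss on embedding pairs can be sketched in NumPy as follows. The convention here is an assumption for the example: `same[i] = 1` marks a pair from the same class, distances are Euclidean, and `margin` is a hyperparameter. With the Siamese network above, the embeddings would come from `forward_one`.

```python
import numpy as np


def contrastive_loss(emb1, emb2, same, margin=1.0):
    """Mean contrastive loss over a batch of embedding pairs.

    same[i] is 1 if pair i shares a class, 0 otherwise.
    """
    d = np.linalg.norm(emb1 - emb2, axis=1)                # Euclidean distance per pair
    pull = same * d ** 2                                    # similar pairs: shrink distance
    push = (1 - same) * np.maximum(0.0, margin - d) ** 2    # dissimilar: push past the margin
    return 0.5 * np.mean(pull + push)


emb1 = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
emb2 = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
same = np.array([1, 0, 0])
loss = contrastive_loss(emb1, emb2, same, margin=1.0)
```

The third pair is dissimilar and already farther apart than the margin, so it contributes nothing; only the first two pairs produce a nonzero loss.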
Prototypical Networks
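Compute class prototypes (mean embeddings) and classify by nearest prototype:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i), \qquad p(y = k \mid x) = \frac{\exp\big(-d(f_\theta(x), c_k)\big)}{\sum_{k'} \exp\big(-d(f_\theta(x), c_{k'})\big)}$$

where $S_k$ is the set of support examples of class $k$, $f_\theta$ is the embedding network, and $d$ is a distance function, typically squared Euclidean distance.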
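The prototype computation and nearest-prototype rule can be sketched on precomputed embeddings. This is an illustrative NumPy version that assumes squared Euclidean distance and integer class labels `0..n_classes-1`; the function names are hypothetical.

```python
import numpy as np


def prototypes(support_emb, support_labels, n_classes):
    """Mean embedding per class; row k is the prototype for class k."""
    return np.stack([support_emb[support_labels == k].mean(axis=0) for k in range(n_classes)])


def classify(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    diff = query_emb[:, None, :] - protos[None, :, :]    # (n_query, n_classes, dim)
    logits = -(diff ** 2).sum(axis=-1)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs


# 2-way 2-shot toy episode with 2-D embeddings.
support_emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]])
support_labels = np.array([0, 0, 1, 1])
protos = prototypes(support_emb, support_labels, n_classes=2)   # [[0, 1], [5, 4]]
pred, probs = classify(np.array([[1.0, 1.0], [4.0, 5.0]]), protos)
```

In practice the embeddings would come from a trained network rather than being written by hand, and the softmax output would feed a cross-entropy loss during episodic training.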
Zero-Shot Learning with CLIP
CLIP (Contrastive Language-Image Pre-training) learns to align image and text embeddings, enabling zero-shot classification without any task-specific training.
How CLIP Works
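CLIP trains on 400M image-text pairs with a contrastive objective. For a batch of $N$ pairs, the loss is a symmetric cross-entropy over the image-text similarity matrix:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii} / \tau)}{\sum_{j=1}^{N} \exp(s_{ij} / \tau)} + \log \frac{\exp(s_{ii} / \tau)}{\sum_{j=1}^{N} \exp(s_{ji} / \tau)} \right]$$

where $s_{ij}$ is the cosine similarity between the embedding of image $i$ and the embedding of text $j$, and $\tau$ is a learned temperature.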
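A symmetric contrastive loss of this kind can be sketched on precomputed embeddings. The version below is a NumPy illustration, not CLIP's training code: it assumes row `i` of each matrix is a matching pair and uses a fixed `temperature` of 0.07, whereas CLIP learns its temperature during training.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature        # (N, N) scaled cosine similarities
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()   # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)


img = np.eye(4)                               # four orthogonal toy "image" embeddings
aligned = clip_loss(img, img)                 # every pair matches
shuffled = clip_loss(img, img[[1, 2, 3, 0]])  # every pair mismatched
```

The aligned batch gives a loss near zero and the shuffled batch a large one, which is the signal that pulls matching image and text embeddings together.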
Zero-Shot Classification
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg").convert("RGB")
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

inputs = processor(
    text=candidate_labels,
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (n_images, n_labels); softmax over the labels
logits = outputs.logits_per_image
probs = logits.softmax(dim=1)

for label, prob in zip(candidate_labels, probs[0]):
    print(f"{label}: {prob.item():.3f}")

# Example output (illustrative; depends on the image):
# a photo of a cat: 0.942
# a photo of a dog: 0.051
# a photo of a bird: 0.007

CLIP for Image Search
import numpy as np
import torch
from PIL import Image

def build_image_index(model, processor, image_paths):
    """Encode all images into L2-normalized CLIP embeddings."""
    embeddings = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        embeddings.append(emb.cpu().numpy())
    return np.vstack(embeddings)

def search(query, model, processor, image_index, image_paths, top_k=5):
    """Search images by text query; returns (path, cosine similarity) pairs."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # Both sides are unit-normalized, so the dot product is cosine similarity
    similarities = (text_emb.cpu().numpy() @ image_index.T)[0]
    top_indices = similarities.argsort()[-top_k:][::-1]
    return [(image_paths[i], similarities[i]) for i in top_indices]

Transfer Learning for NLP
Transfer learning in NLP follows the same principles as vision but with different pretrained models.
HuggingFace Transfer Learning
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# Step 1: Choose a pretrained model
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)

# Step 2: Prepare data
def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=256)

# train_dataset and val_dataset are assumed to be datasets with 'text' and
# 'label' columns that have already been passed through tokenize().

# Step 3: Fine-tune
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # Low LR for fine-tuning
    warmup_ratio=0.1,  # Warm up over the first 10% of steps
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # tokenize() does not pad, so pad each batch to its longest sequence
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()

Adapter-Based Transfer Learning
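Instead of fine-tuning all parameters, parameter-efficient methods keep the pretrained weights frozen and train a small set of added parameters, such as adapter modules, low-rank updates, or learned prompts. The figures below are approximate and vary with model and task: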
| Method | Trainable Params | Performance | Memory |
|---|---|---|---|
| Full fine-tuning | 100% | Best | High |
| LoRA (r=16) | ~0.2% | 98% of full | Very low |
| Adapters | ~2% | 97% of full | Low |
| Prompt tuning | ~0.01% | 90-95% of full | Minimal |
| Prefix tuning | ~0.1% | 95% of full | Low |
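To make the LoRA row concrete, the sketch below wraps one frozen weight matrix with a low-rank update. It is a NumPy illustration of the idea rather than any library's implementation: the class name `LoRALinear`, the scaling `alpha / r`, and the value `alpha=32` are assumptions for the example.

```python
import numpy as np


class LoRALinear:
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trainable."""

    def __init__(self, weight, r=16, alpha=32, seed=0):
        out_features, in_features = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                      # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, in_features))    # trainable
        self.B = np.zeros((out_features, r))                      # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        lora = self.A.size + self.B.size
        return lora / (lora + self.weight.size)


W = np.random.default_rng(1).normal(size=(768, 768))
layer = LoRALinear(W, r=16)
x = np.ones((2, 768))
same_at_init = np.allclose(layer(x), x @ W.T)   # B = 0, so the update starts at zero
fraction = layer.trainable_fraction()           # 24576 / 614400 = 0.04 for this one layer
```

For a single 768x768 matrix the trainable share is 4%. Whole-model figures are much smaller because only some weight matrices are usually adapted and the rest of the model contributes no trainable parameters.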
Common Transfer Learning Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| LR too high for backbone | Pretrained features destroyed | Use 10-100x lower LR for backbone |
| No warmup | Training instability | Warmup for 5-10% of steps |
| Wrong image normalization | Features misaligned | Use same normalization as pretraining (ImageNet mean/std) |
| Freezing everything | Insufficient adaptation | At least fine-tune the last few layers |
| Wrong input size | Performance drop | Match pretrained model's expected input size |
| Using wrong tokenizer | Garbage output | Always use the matching tokenizer |
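The warmup and backbone learning-rate fixes in the table can be sketched in plain Python. The helper names and defaults below (`lr_at_step`, `backbone_lr`, a 10% warmup, cosine decay to zero) are illustrative choices, not a prescribed recipe.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def backbone_lr(head_lr, ratio=0.01):
    """Backbone learning rate as a fixed fraction of the head's (10-100x lower)."""
    return head_lr * ratio


schedule = [lr_at_step(s, total_steps=1000, base_lr=1e-3) for s in range(1000)]
```

With 1000 steps the rate climbs linearly over the first 100, peaks at the base value, and decays smoothly afterwards. The same function could be wrapped in a PyTorch `LambdaLR` by returning the multiplier instead of the absolute rate.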
Measuring Transfer Gap
The effectiveness of transfer depends on the similarity between source and target domains. The worked example below compares three training strategies on the same data.
Worked Example — Feature Extraction vs Fine-Tuning on Same Data
Setup: Custom medical imaging dataset with 500 images, 5 classes. ResNet-50 pretrained on ImageNet.
Approach 1: Feature Extraction (frozen backbone)
- Backbone weights: frozen (about 23.5M params once the original 1000-class fc layer is removed, 0 trainable)
- New classifier: Linear(2048, 256) + ReLU + Linear(256, 5) = 526K trainable params
- Training: 10 epochs, LR = 1e-3, converges fast
| Epoch | Train Acc | Val Acc |
|---|---|---|
| 1 | 62% | 58% |
| 5 | 85% | 78% |
| 10 | 92% | 80% |
Approach 2: Full Fine-Tuning
- All params trainable (about 24M), LR = 1e-5 for backbone, 1e-3 for head
- Training: 20 epochs with cosine LR decay
| Epoch | Train Acc | Val Acc |
|---|---|---|
| 1 | 55% | 52% |
| 5 | 78% | 75% |
| 10 | 91% | 84% |
| 20 | 97% | 87% |
Approach 3: From Scratch (random init)
- Same architecture, all random weights
- Training: 100 epochs, LR = 1e-2
| Epoch | Train Acc | Val Acc |
|---|---|---|
| 10 | 35% | 30% |
| 50 | 75% | 55% |
| 100 | 95% | 62% |
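Transfer Gap: Taking the transfer gap to be the gain in final validation accuracy over the from-scratch baseline, feature extraction gains 18 points (80% vs 62%) and fine-tuning gains 25 points (87% vs 62%).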
Result: Feature extraction (80%) beats from-scratch (62%) with roughly 46x fewer trainable parameters (about 0.53M vs about 24M). Fine-tuning (87%) beats both by adapting the backbone features to the medical domain. From-scratch severely overfits (95% train vs 62% val) because 500 images are insufficient to learn good low-level features.
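The parameter counts and accuracy gaps in this example can be checked with a few lines of arithmetic. The sketch below assumes torchvision's ResNet-50 has 25,557,032 parameters in total and measures each approach against the from-scratch baseline; the variable names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Head from the worked example: Linear(2048, 256) + ReLU + Linear(256, 5)
head = linear_params(2048, 256) + linear_params(256, 5)

# Assumption: torchvision's ResNet-50 has 25,557,032 parameters, of which the
# original 1000-class fc layer accounts for linear_params(2048, 1000).
backbone = 25_557_032 - linear_params(2048, 1000)

ratio = (backbone + head) / head          # full model vs. head-only trainable params

# Final validation accuracies (%) from the three tables.
val_acc = {"feature_extraction": 80, "fine_tuning": 87, "from_scratch": 62}
gain = {k: v - val_acc["from_scratch"] for k, v in val_acc.items() if k != "from_scratch"}
overfit_gap_scratch = 95 - 62             # train minus val for the from-scratch run
```

The head comes to 525,829 parameters, matching the 526K quoted above, and the full model has roughly 46 times as many trainable parameters as the head alone.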
| Source | Expected Benefit |
|---|---|
| ImageNet | Moderate (different domain but shared low-level features) |
| ImageNet | Good (natural images share textures) |
| ImageNet | Moderate (very different visual domain) |
| English BERT | Poor (different language, use multilingual model) |
| General LLM | Excellent (language structure transfers) |
Cross-References
- CNN architectures: CNN --- ResNet, EfficientNet backbones
- Image classification: Image Classification --- ViT, augmentation
- BERT transfer: BERT Family --- fine-tuning NLP models
- Optimization: Model Optimization --- distillation, quantization
- Multimodal: Multimodal Models --- CLIP, vision-language
- Training recipes: Training Techniques --- LR scheduling, dropout