Object Detection
Object detection combines classification (what is it?) with localization (where is it?). This page traces the evolution from R-CNN to DETR, derives IoU and mAP metrics, explains YOLO's grid-cell approach, and trains YOLOv8 on a custom dataset.
The Detection Problem
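Given an image, output a set of bounding boxes with class labels:
$$\{(b_i, c_i, s_i)\}_{i=1}^{N}$$
where $b_i$ is the bounding box of the $i$-th detection (for example $[x_1, y_1, x_2, y_2]$), $c_i$ is its class label, and $s_i$ is its confidence score.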
Intersection over Union (IoU)
IoU measures the overlap between predicted and ground-truth boxes:
$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

```python
def compute_iou(box1, box2):
    """Compute IoU between two boxes [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # Width and height clamp to 0 when the boxes do not overlap
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / (union + 1e-6)  # epsilon avoids division by zero
```

Worked Example: IoU Calculation for Two Boxes
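Input: Two bounding boxes in $[x_1, y_1, x_2, y_2]$ format:
- Predicted: a 150x150 pixel box
- Ground truth: a 150x140 pixel box
Step 1: Compute intersection rectangle: take the larger of the two top-left corners and the smaller of the two bottom-right corners.
Intersection area: $A_I = \max(0, x_2^I - x_1^I) \cdot \max(0, y_2^I - y_1^I)$
Step 2: Compute individual areas: $A_p = 150 \times 150 = 22{,}500$ and $A_g = 150 \times 140 = 21{,}000$
Step 3: Compute union: $A_U = A_p + A_g - A_I = 43{,}500 - A_I$
Step 4: IoU: $\text{IoU} = A_I / A_U \approx 0.381$, which corresponds to $A_I = 12{,}000$ and $A_U = 31{,}500$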
Result: IoU = 0.381. This would fail the PASCAL VOC threshold (0.5) --- the prediction is not a good enough match. The boxes overlap, but the predicted box is shifted too far left and up from the ground truth.
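The arithmetic can be checked in a few lines. The coordinates below are hypothetical, chosen only so that the boxes have the stated sizes (150x150 and 150x140) and reproduce the stated IoU; `iou_xyxy` is a local helper name, not a library function.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Assumed coordinates (illustrative, not taken from the example above)
pred = [50, 50, 200, 200]    # 150 x 150, shifted left and up
gt = [100, 80, 250, 220]     # 150 x 140

iou = iou_xyxy(pred, gt)     # intersection 100 x 120 = 12000, union 31500
print(f"IoU = {iou:.3f}")
```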
IoU Thresholds
| IoU | Interpretation |
|---|---|
| 0.5 | PASCAL VOC standard (loose) |
| 0.75 | COCO strict |
| 0.5:0.95 | COCO AP (average over 10 thresholds) |
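To illustrate how the threshold changes the verdict on a single detection, the sketch below checks one IoU value against the ten COCO thresholds. The value 0.71 is an arbitrary example, and `passing_thresholds` is a hypothetical helper name.

```python
# COCO thresholds: 0.50, 0.55, ..., 0.95 (10 values)
coco_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def passing_thresholds(iou, thresholds):
    """Return the thresholds at which a detection with this IoU counts as a match."""
    return [t for t in thresholds if iou >= t]

iou = 0.71  # arbitrary example value
passed = passing_thresholds(iou, coco_thresholds)
print(f"IoU {iou} matches at {len(passed)} of {len(coco_thresholds)} thresholds: {passed}")
```

The same detection is a true positive at the PASCAL VOC threshold (0.5) but a false positive at the strict 0.75 level.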
R-CNN Family Evolution
R-CNN (2014)
- Generate ~2000 region proposals (Selective Search)
- Warp each to fixed size and pass through CNN
- Classify with SVM + regress bounding box
Problem: Each region is processed independently, which is extremely slow (about 47 seconds per image).
Fast R-CNN (2015)
- Pass the entire image through CNN once to get a feature map
- Project region proposals onto the feature map
- RoI Pooling extracts fixed-size features from each region
- Classify + regress in one forward pass
RoI Pooling: Divide the projected region into a fixed grid (e.g., 7x7) and max-pool within each cell.
Improvement: Computation is shared across proposals, cutting inference to ~0.3 seconds per image (excluding proposal generation).
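RoI Pooling can be sketched on a single-channel feature map as follows. This is a simplified illustration, not the Fast R-CNN implementation: it assumes the RoI is already projected to integer feature-map cells with an exclusive end coordinate, and the function name `roi_max_pool` is hypothetical.

```python
import math

def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one RoI of a 2D feature map into an output_size x output_size grid.

    feature_map: list of rows (H x W), single channel for simplicity.
    roi: (x1, y1, x2, y2) in feature-map cells, end-exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for gy in range(output_size):
        # Floor the bin start and ceil the bin end so every bin is non-empty
        y_start = y1 + math.floor(gy * roi_h / output_size)
        y_end = y1 + math.ceil((gy + 1) * roi_h / output_size)
        row = []
        for gx in range(output_size):
            x_start = x1 + math.floor(gx * roi_w / output_size)
            x_end = x1 + math.ceil((gx + 1) * roi_w / output_size)
            cell = [feature_map[y][x]
                    for y in range(y_start, y_end)
                    for x in range(x_start, x_end)]
            row.append(max(cell))
        pooled.append(row)
    return pooled

# 6x6 feature map holding the values 0..35 in row-major order
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(feature_map, roi=(1, 1, 5, 5), output_size=2)
print(pooled)
```

Whatever the size of the RoI, the output is always `output_size x output_size`, which is what lets a fixed-size classifier head follow.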
Faster R-CNN (2015)
Replace Selective Search with a Region Proposal Network (RPN) that generates proposals from the feature map itself.
RPN: At each position in the feature map, predict for each of $k$ anchors:
- Objectness score: is there an object? (binary)
- Box regression: adjust anchor to fit object ($t_x, t_y, t_w, t_h$)
Anchor Boxes
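Anchors are predefined boxes at each feature map location. For 3 scales and 3 aspect ratios, there are $k = 3 \times 3 = 9$ anchors per location.
The RPN predicts offsets from anchors:
$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$
where $(x, y, w, h)$ are the center and size of the predicted box and $(x_a, y_a, w_a, h_a)$ those of the anchor.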
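A minimal sketch of anchor generation and offset encoding follows. It assumes the usual Faster R-CNN parameterization (center offsets normalized by anchor size, log-scale width and height) and the commonly cited scales 128, 256, 512 with ratios 1:2, 1:1, 2:1. The function names and the ground-truth box are hypothetical.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors (cx, cy, w, h) centred at one location.

    ratio = h / w; each anchor keeps an area of scale ** 2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors

def encode(box, anchor):
    """Box (x, y, w, h) -> regression targets (tx, ty, tw, th) relative to the anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(offsets, anchor):
    """Inverse of encode: apply predicted offsets to the anchor."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchors = make_anchors(cx=320, cy=240)
print(len(anchors), "anchors at this location")

gt_box = (330.0, 250.0, 200.0, 300.0)  # hypothetical ground-truth box (x, y, w, h)
anchor = anchors[5]                    # scale 256, ratio 2.0
targets = encode(gt_box, anchor)
recovered = decode(targets, anchor)
print(targets, recovered)
```

Regressing normalized, log-scale offsets rather than raw pixel coordinates keeps the targets in a similar numeric range for small and large anchors.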
Non-Maximum Suppression (NMS)
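Multiple overlapping detections of the same object must be reduced to one. NMS keeps the highest-scoring box and discards any remaining box whose IoU with it reaches the threshold, then repeats on what is left: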
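
```python
import torch

def compute_iou_batch(box, boxes):
    """IoU of one box (1, 4) against M boxes (M, 4); returns a tensor of shape (M,)."""
    lt = torch.max(box[:, :2], boxes[:, :2])  # top-left corners of the intersections
    rb = torch.min(box[:, 2:], boxes[:, 2:])  # bottom-right corners of the intersections
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area = (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-6)


def nms(boxes, scores, iou_threshold=0.5):
    """Apply Non-Maximum Suppression.

    boxes: (N, 4) [x1, y1, x2, y2]
    scores: (N,) confidence scores
    Returns the indices of the kept boxes, highest score first.
    """
    order = scores.argsort(descending=True)
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i.item())
        if len(order) == 1:
            break
        ious = compute_iou_batch(boxes[i].unsqueeze(0), boxes[order[1:]])
        # Keep only boxes that overlap the selected box less than the threshold
        remaining = (ious < iou_threshold).nonzero(as_tuple=True)[0]
        order = order[remaining + 1]
    return keep
```

YOLO: You Only Look Once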
YOLO (Redmon et al., 2016) frames detection as a single regression problem.
Grid Cell Approach
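- Divide the image into an $S \times S$ grid
- Each cell predicts $B$ bounding boxes + confidence + class probabilities
- Output tensor: $S \times S \times (B \cdot 5 + C)$
For YOLOv1: $S = 7$, $B = 2$, $C = 20$, giving a $7 \times 7 \times 30$ output tensor.
Loss function:
$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$
Here $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when box $j$ of cell $i$ is responsible for an object, $C_i$ is the box confidence (not the class count $C$), and YOLOv1 sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$.
The coordinate loss uses $\sqrt{w}$ and $\sqrt{h}$ rather than $w$ and $h$, so that the same absolute error is penalized more on a small box than on a large one.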
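The grid-cell assignment can be sketched as below, using the YOLOv1 settings $S = 7$, $B = 2$, $C = 20$. The helper name `responsible_cell`, the 448x448 image size, and the box center are illustrative assumptions.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (YOLOv1 on PASCAL VOC)

# Each box contributes 5 numbers (x, y, w, h, confidence); each cell adds C class scores
output_shape = (S, S, B * 5 + C)
print(output_shape)

def responsible_cell(cx, cy, img_w, img_h, grid=S):
    """Return (row, col, x_offset, y_offset) for a box center given in pixels.

    The cell containing the center is responsible for the object; the offsets
    are the center's position inside that cell, each in [0, 1).
    """
    gx = cx / img_w * grid
    gy = cy / img_h * grid
    col = min(int(gx), grid - 1)  # clamp in case the center lies on the right/bottom edge
    row = min(int(gy), grid - 1)
    return row, col, gx - col, gy - row

# Hypothetical object centered at (224, 100) in a 448x448 image
row, col, x_off, y_off = responsible_cell(224, 100, 448, 448)
print(row, col, x_off, y_off)
```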
YOLO Evolution
| Version | Year | Key Innovation | mAP (dataset) |
|---|---|---|---|
| YOLOv1 | 2016 | Single-stage detection | 63.4 (VOC 2007) |
| YOLOv2 | 2017 | Anchor boxes, batch norm | 78.6 (VOC 2007) |
| YOLOv3 | 2018 | Multi-scale predictions, Darknet-53 | 33.0 (COCO) |
| YOLOv4 | 2020 | CSPDarknet, Mosaic augmentation | 43.5 (COCO) |
| YOLOv5 | 2020 | PyTorch native, Focus module | 50.7 (COCO) |
| YOLOv8 | 2023 | Anchor-free, decoupled head | 53.9 (COCO) |
| YOLOv11 | 2024 | Efficient architecture, C3k2 and C2PSA blocks | 54.7 (COCO) |
VOC figures are mAP at IoU 0.5, while COCO figures are AP averaged over IoU 0.5:0.95, so the two groups are not directly comparable.
SSD: Single Shot MultiBox Detector
SSD detects objects at multiple scales using feature maps from different layers:
- Early layers (high resolution): detect small objects
- Later layers (low resolution): detect large objects
Each feature map cell predicts $k$ default boxes, each with $C$ class scores and 4 box offsets, for $k(C + 4)$ outputs per cell.
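To get a feel for the numbers, the sketch below counts default boxes for the SSD300 configuration as commonly described: six feature maps of sizes 38, 19, 10, 5, 3, 1 with 4 or 6 default boxes per cell, and 21 classes (20 PASCAL VOC classes plus background). These values are assumptions about that specific configuration, not something derived from this page.

```python
feature_map_sizes = [38, 19, 10, 5, 3, 1]  # side length of each square feature map
boxes_per_cell = [4, 6, 6, 6, 4, 4]        # k for each feature map
num_classes = 21                           # 20 VOC classes + background

def ssd_box_counts(sizes, boxes):
    """Number of default boxes contributed by each feature map."""
    return [s * s * k for s, k in zip(sizes, boxes)]

counts = ssd_box_counts(feature_map_sizes, boxes_per_cell)
total_boxes = sum(counts)

# Each cell outputs k * (C + 4) values: C class scores and 4 offsets per default box
outputs_per_cell = [k * (num_classes + 4) for k in boxes_per_cell]

for size, n in zip(feature_map_sizes, counts):
    print(f"{size:>2}x{size:<2} map: {n} boxes")
print("total default boxes:", total_boxes)
```

Most boxes come from the highest-resolution map, which is the one responsible for small objects.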
DETR: Detection Transformer
DETR (Carion et al., 2020) eliminates anchors, NMS, and region proposals by treating detection as a set prediction problem.
Architecture
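Image → CNN backbone → Transformer encoder → Transformer decoder (with object queries) → feed-forward heads (class + box)
Object Queries: $N$ learned embeddings (DETR uses $N = 100$) fed to the decoder. Each query yields one prediction: either a class and box, or a special "no object" class.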
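Bipartite Matching Loss: Uses the Hungarian algorithm to find the optimal one-to-one matching between predictions and ground truth:
$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\left(y_i, \hat{y}_{\sigma(i)}\right)$$
where $\sigma$ ranges over permutations of the $N$ predictions and the ground-truth set is padded with "no object" entries to size $N$.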
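The matching objective can be illustrated on a tiny example. The sketch below finds the minimum-cost assignment by brute force over permutations, which is only feasible for a handful of boxes; it stands in for the Hungarian algorithm and is not how DETR implements it. The cost matrix is made up.

```python
from itertools import permutations

def optimal_matching(cost):
    """Return (assignment, total_cost) minimizing the summed matching cost.

    cost[i][j] is the cost of matching ground-truth object i to prediction j,
    with len(cost) <= len(cost[0]). assignment[i] is the prediction index
    matched to ground truth i.
    """
    n_gt, n_pred = len(cost), len(cost[0])
    best_cost, best_assignment = float("inf"), None
    # Try every way of giving each ground-truth object a distinct prediction
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][perm[i]] for i in range(n_gt))
        if total < best_cost:
            best_cost, best_assignment = total, perm
    return list(best_assignment), best_cost

# Hypothetical costs: 2 ground-truth objects, 4 predictions
cost = [
    [0.9, 0.1, 0.8, 0.7],
    [0.2, 0.6, 0.9, 0.5],
]
assignment, total_cost = optimal_matching(cost)
unmatched = sorted(set(range(len(cost[0]))) - set(assignment))
print(assignment, total_cost, unmatched)
```

Predictions left unmatched (here indices 2 and 3) are trained toward the "no object" class, which is why no NMS step is needed afterwards.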
Mean Average Precision (mAP)
Precision and Recall
For each class, sort detections by confidence. At each threshold:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
AP (Average Precision)
Area under the precision-recall curve:
$$AP = \int_0^1 p(r)\,dr$$
Approximated using 11-point interpolation (PASCAL VOC 2007), all-point interpolation (PASCAL VOC 2010 onward), or 101-point interpolation (COCO).
mAP
Average AP across all $C$ classes:
$$\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} AP_c$$
COCO mAP averages over IoU thresholds from 0.5 to 0.95 (step 0.05).
Worked Example — mAP Computation (Simplified)
Setup: 2 classes (cat, dog). 5 detections sorted by confidence:
| Detection | Class | Confidence | IoU with GT | TP/FP |
|---|---|---|---|---|
| D1 | cat | 0.95 | 0.82 | TP |
| D2 | cat | 0.90 | 0.10 | FP |
| D3 | dog | 0.85 | 0.71 | TP |
| D4 | cat | 0.70 | 0.65 | TP |
| D5 | dog | 0.60 | 0.30 | FP |
Ground truth: 2 cats, 2 dogs. IoU threshold = 0.5.
Cat class (2 GT objects, detections: D1-TP, D2-FP, D4-TP):
| Step | Precision | Recall |
|---|---|---|
| After D1 (TP) | 1/1 = 1.000 | 1/2 = 0.500 |
| After D2 (FP) | 1/2 = 0.500 | 1/2 = 0.500 |
| After D4 (TP) | 2/3 = 0.667 | 2/2 = 1.000 |
Dog class (2 GT objects, detections: D3-TP, D5-FP):
| Step | Precision | Recall |
|---|---|---|
| After D3 (TP) | 1/1 = 1.000 | 1/2 = 0.500 |
| After D5 (FP) | 1/2 = 0.500 | 1/2 = 0.500 |
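AP per class (all-point interpolation):
$$AP_{\text{cat}} = 0.5 \times 1.000 + 0.5 \times 0.667 = 0.833, \qquad AP_{\text{dog}} = 0.5 \times 1.000 = 0.500$$
mAP:
$$\text{mAP} = \frac{0.833 + 0.500}{2} = 0.667$$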
Result: mAP = 0.667. The model finds cats well (83.3% AP) but misses one dog entirely, dragging the average down. COCO would also average this across 10 IoU thresholds.
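The worked example can be reproduced with a short pure-Python script. It is a simplified sketch: a detection counts as a true positive whenever its IoU reaches the threshold, whereas a full evaluator also marks duplicate matches to the same ground-truth box as false positives. The function name `average_precision` is a local choice.

```python
IOU_THRESHOLD = 0.5

# (class, confidence, IoU with best-matching ground truth), from the table above
detections = [
    ("cat", 0.95, 0.82),
    ("cat", 0.90, 0.10),
    ("dog", 0.85, 0.71),
    ("cat", 0.70, 0.65),
    ("dog", 0.60, 0.30),
]
num_gt = {"cat": 2, "dog": 2}

def average_precision(is_tp, n_gt):
    """All-point interpolated AP for one class.

    is_tp: True/False per detection, sorted by descending confidence.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_tp:
        tp += hit
        fp += not hit
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)
    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

ap = {}
for cls, n_gt in num_gt.items():
    ranked = sorted((d for d in detections if d[0] == cls), key=lambda d: d[1], reverse=True)
    ap[cls] = average_precision([d[2] >= IOU_THRESHOLD for d in ranked], n_gt)

mean_ap = sum(ap.values()) / len(ap)
print({k: round(v, 3) for k, v in ap.items()}, round(mean_ap, 3))
```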

```python
import numpy as np

def compute_ap(precision, recall):
    """Compute AP using all-point interpolation."""
    # Add sentinel values
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    # Smooth precision curve (make it monotonically decreasing)
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Compute area under curve, summing only where recall changes
    i = np.where(mrec[1:] != mrec[:-1])[0]
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap
```

YOLOv8 on Custom Dataset

```python
from ultralytics import YOLO

# ── Load pretrained YOLOv8 ───────────────────────────────────────────
model = YOLO('yolov8n.pt')  # nano model (fastest)

# ── Dataset structure (YOLO format) ──────────────────────────────────
# dataset/
#   train/
#     images/
#     labels/   (one .txt per image: class cx cy w h)
#   val/
#     images/
#     labels/
#   data.yaml
#
# data.yaml content:
#   train: ./train/images
#   val: ./val/images
#   nc: 3
#   names: ['cat', 'dog', 'bird']

# ── Training ─────────────────────────────────────────────────────────
results = model.train(
    data='dataset/data.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    lrf=0.01,            # Final LR factor
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,
    augment=True,
    mosaic=1.0,          # Mosaic augmentation
    mixup=0.1,
    copy_paste=0.1,
    device=0,            # GPU
    name='custom_detector',
)

# ── Inference ────────────────────────────────────────────────────────
model = YOLO('runs/detect/custom_detector/weights/best.pt')

# Single image
results = model('test_image.jpg')
for r in results:
    boxes = r.boxes
    for box in boxes:
        xyxy = box.xyxy[0]      # [x1, y1, x2, y2]
        conf = box.conf[0]      # confidence
        cls = int(box.cls[0])   # class index
        print(f"Class: {r.names[cls]}, Conf: {conf:.2f}, Box: {xyxy}")

# ── Export for deployment ────────────────────────────────────────────
model.export(format='onnx')         # ONNX
model.export(format='torchscript')  # TorchScript
model.export(format='tflite')       # TensorFlow Lite (mobile)
```

Two-Stage vs One-Stage Detectors
| Feature | Two-Stage (Faster R-CNN) | One-Stage (YOLO) |
|---|---|---|
| Speed | ~5-15 FPS | ~30-160 FPS |
| Accuracy | Higher (especially small objects) | Slightly lower |
| Architecture | RPN + detection head | Single network |
| Use case | Accuracy-critical | Real-time applications |
| Training | More complex | Simpler |
Loss Functions for Object Detection
Classification Loss
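Standard cross-entropy or focal loss for class prediction:
$$\text{CE}(p_t) = -\log(p_t), \qquad \text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where $p_t$ is the predicted probability of the true class.
Focal loss (Lin et al., 2017) down-weights easy examples, which in detection are mostly background negatives. With $\gamma = 2$, an example classified with $p_t = 0.9$ contributes 100x less loss than under cross-entropy, so training concentrates on hard examples.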
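The down-weighting effect is easy to see numerically. The sketch below assumes the standard focal loss form $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ with $\alpha_t = 1$ for simplicity; the probabilities are arbitrary examples.

```python
import math

def cross_entropy(p_t):
    """Cross-entropy for one example; p_t is the probability of the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example: cross-entropy scaled by (1 - p_t) ** gamma."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

for p_t in (0.9, 0.5, 0.1):  # easy, medium, hard example
    ce, fl = cross_entropy(p_t), focal_loss(p_t)
    print(f"p_t={p_t}: CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")

easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # modulating factor (1 - 0.9)^2
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # modulating factor (1 - 0.1)^2
```

With $\gamma = 0$ the modulating factor is 1 and focal loss reduces to cross-entropy.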
Box Regression Loss
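Smooth L1 Loss (Faster R-CNN):
$$\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$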
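As a quick illustration, Smooth L1 is quadratic for small residuals and linear for large ones, which limits the influence of outlier boxes. The sketch below assumes the standard definition with the transition at $|x| = 1$; the residual values are arbitrary.

```python
def smooth_l1(x):
    """Smooth L1 on a single residual x (prediction minus target)."""
    ax = abs(x)
    # Quadratic near zero, linear beyond |x| = 1; the two pieces meet at 0.5
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

for residual in (0.1, 0.5, 1.0, 3.0, 10.0):
    l2 = 0.5 * residual ** 2
    print(f"x={residual:>4}: smooth_l1={smooth_l1(residual):.3f}  0.5*x^2={l2:.3f}")
```

For a residual of 10, the squared loss is 50 while Smooth L1 is 9.5, so a single badly regressed box dominates the gradient far less.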
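CIoU Loss (Complete IoU, used in YOLO):
$$\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v$$
where $\rho$ is the Euclidean distance between the two box centers, $c$ is the diagonal length of the smallest box enclosing both, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ measures aspect-ratio mismatch, and $\alpha = \frac{v}{(1 - \text{IoU}) + v}$ is its weight.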
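A single-pair, pure-Python sketch of CIoU follows, assuming the usual definition: $1 - \text{IoU}$ plus a normalized center-distance term plus a weighted aspect-ratio term. It is meant for reading, not training (a real implementation would be batched and differentiable), and the boxes are made up.

```python
import math

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for one pair of boxes in [x1, y1, x2, y2] format."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain IoU
    inter = max(0.0, min(px2, tx2) - max(px1, tx1)) * max(0.0, min(py2, ty2) - max(py1, ty1))
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared distance between box centers
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal of the smallest enclosing box
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its weight
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

same = ciou_loss([0, 0, 10, 10], [0, 0, 10, 10])
disjoint = ciou_loss([0, 0, 10, 10], [20, 20, 30, 30])
print(f"identical boxes: {same:.6f}, disjoint boxes: {disjoint:.6f}")
```

Unlike plain IoU loss, the disjoint pair still gets a distance-dependent value above 1, so the gradient points the prediction toward the target even with zero overlap.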
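GIoU (Generalized IoU)
$$\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}$$
where $C$ is the smallest box enclosing both $A$ and $B$. The loss is $1 - \text{GIoU}$.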

```python
import torch

def giou_loss(pred_boxes, target_boxes):
    """Compute GIoU loss between predicted and target boxes, both (N, 4) in [x1, y1, x2, y2]."""
    # Compute IoU
    inter_x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    inter_y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    inter_x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    inter_y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])
    inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * torch.clamp(inter_y2 - inter_y1, min=0)
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * (target_boxes[:, 3] - target_boxes[:, 1])
    union = pred_area + target_area - inter_area
    iou = inter_area / (union + 1e-6)
    # Compute enclosing box
    enc_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    enc_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    enc_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    enc_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3])
    enc_area = (enc_x2 - enc_x1) * (enc_y2 - enc_y1)
    # Penalize the part of the enclosing box not covered by the union
    giou = iou - (enc_area - union) / (enc_area + 1e-6)
    return 1 - giou.mean()
```

Data Annotation for Detection
Annotation Formats
| Format | Used By | Structure |
|---|---|---|
| PASCAL VOC | VOC dataset | XML per image |
| COCO JSON | COCO dataset | Single JSON for all images |
| YOLO | Ultralytics | TXT per image: class cx cy w h |
| Label Studio | Generic | JSON with bounding boxes |
YOLO Label Format
Each image has a corresponding .txt file with one line per object:

```
# class_id center_x center_y width height (all normalized 0-1)
0 0.5 0.4 0.3 0.6
1 0.2 0.7 0.15 0.2
```

Converting Between Formats

```python
def voc_to_yolo(voc_box, img_w, img_h):
    """Convert VOC [xmin, ymin, xmax, ymax] to YOLO [cx, cy, w, h] (normalized)."""
    xmin, ymin, xmax, ymax = voc_box
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return [cx, cy, w, h]


def yolo_to_voc(yolo_box, img_w, img_h):
    """Convert YOLO [cx, cy, w, h] to VOC [xmin, ymin, xmax, ymax]."""
    cx, cy, w, h = yolo_box
    xmin = (cx - w / 2) * img_w
    ymin = (cy - h / 2) * img_h
    xmax = (cx + w / 2) * img_w
    ymax = (cy + h / 2) * img_h
    return [xmin, ymin, xmax, ymax]


def coco_to_voc(coco_box):
    """Convert COCO [x, y, w, h] to VOC [xmin, ymin, xmax, ymax]."""
    x, y, w, h = coco_box
    return [x, y, x + w, y + h]
```

Feature Pyramid Network (FPN)
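FPN (Lin et al., 2017) constructs multi-scale feature maps for detecting objects of different sizes. A top-down pathway upsamples coarse, semantically strong feature maps and merges them with higher-resolution backbone features through lateral connections.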
Each level of the pyramid handles a different object scale. Small objects are detected at high-resolution, low-level features; large objects at low-resolution, high-level features.
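The top-down merge can be sketched on toy single-channel maps. This is a deliberately reduced illustration: a real FPN applies a 1x1 convolution on each lateral connection and a 3x3 convolution after each merge, both omitted here, and the function names and input values are hypothetical.

```python
def upsample2x(fm):
    """Nearest-neighbor 2x upsampling of a 2D list."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out

def add_maps(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def fpn_top_down(backbone_maps):
    """Merge backbone maps ordered fine -> coarse, each half the size of the previous.

    Returns the pyramid in the same order. The coarsest level is passed through,
    and every finer level is its lateral input plus the upsampled level above it.
    """
    pyramid = [backbone_maps[-1]]
    for lateral in reversed(backbone_maps[:-1]):
        pyramid.insert(0, add_maps(lateral, upsample2x(pyramid[0])))
    return pyramid

# Toy backbone outputs: 4x4, 2x2, and 1x1 maps with constant values
c3 = [[100] * 4 for _ in range(4)]
c4 = [[10] * 2 for _ in range(2)]
c5 = [[1]]
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p5, p4, p3[0])
```

Every value in the finest output carries a contribution from all coarser levels, which is how high-resolution maps gain the semantic context needed to detect small objects.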
Common Pitfalls
| Mistake | Symptom | Fix |
|---|---|---|
| Wrong annotation format | Model trains but mAP is 0 | Verify box coordinates and class IDs |
| Images not resized | CUDA OOM | Resize to 640x640 for YOLO |
| Forgetting NMS | Duplicate detections | Apply NMS (IoU threshold 0.5) |
| Too few anchors | Missing small/large objects | Use FPN or multi-scale anchors |
| Imbalanced classes | Model ignores rare classes | Use focal loss or oversampling |
| Train/val leakage | Inflated mAP | Ensure no duplicate images across splits |
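For the train/val leakage row, one simple check is to hash every image file and look for digests present in both splits. The sketch below only catches byte-identical files (resized or re-encoded copies would need perceptual hashing), and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

def file_hashes(image_dir):
    """Map SHA-256 digest -> list of files in image_dir with that exact content."""
    hashes = {}
    for path in sorted(Path(image_dir).glob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            hashes.setdefault(digest, []).append(path)
    return hashes

def find_leaks(train_dir, val_dir):
    """Return (train_file, val_file) pairs whose contents are byte-identical."""
    train = file_hashes(train_dir)
    val = file_hashes(val_dir)
    shared = sorted(train.keys() & val.keys())
    return [(train[d][0], val[d][0]) for d in shared]
```

With the dataset layout used earlier, `find_leaks("dataset/train/images", "dataset/val/images")` should return an empty list before training starts.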
Cross-References
- CNN backbones: CNN --- ResNet, EfficientNet for feature extraction
- Transformer-based: Transformers --- DETR architecture
- Segmentation: Image Segmentation --- pixel-level detection
- Classification: Image Classification --- ViT, transfer learning
- Deployment: Model Optimization --- quantization for real-time