Object Detection

Object detection combines classification (what is it?) with localization (where is it?). This page traces the evolution from R-CNN to DETR, derives IoU and mAP metrics, explains YOLO's grid-cell approach, and trains YOLOv8 on a custom dataset.

The Detection Problem

Given an image, output a set of bounding boxes with class labels:

{(x_{1}, y_{1}, x_{2}, y_{2}, c, p) ∣ for each detected object}

where $(x_{1}, y_{1}, x_{2}, y_{2})$ define the box, $c$ is the class, and $p$ is the confidence.

Intersection over Union (IoU)

IoU measures the overlap between predicted and ground-truth boxes:

\text{IoU} = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{Area of Intersection}}{\text{Area of Union}}

python

def compute_iou(box1, box2):
    """Compute IoU between two boxes [x1, y1, x2, y2]."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / (union + 1e-6)

Worked Example — IoU Calculation for Two Boxes

Input: Two bounding boxes in $[x_{1}, y_{1}, x_{2}, y_{2}]$ format:

Predicted: $A = [50, 50, 200, 200]$ (150x150 pixels)
Ground truth: $B = [100, 80, 250, 220]$ (150x140 pixels)

Step 1: Compute intersection rectangle:

x_{1}^{\cap} = max (50, 100) = 100, y_{1}^{\cap} = max (50, 80) = 80

x_{2}^{\cap} = min (200, 250) = 200, y_{2}^{\cap} = min (200, 220) = 200

Intersection area: $(200 - 100) \times (200 - 80) = 100 \times 120 = 12,000$ pixels

Step 2: Compute individual areas:

| A | = (200 - 50) (200 - 50) = 150 \times 150 = 22,500

| B | = (250 - 100) (220 - 80) = 150 \times 140 = 21,000

Step 3: Compute union:

| A \cup B | = 22,500 + 21,000 - 12,000 = 31,500

Step 4: IoU:

IoU = \frac{12,000}{31,500} = 0.381

Result: IoU = 0.381. This would fail the PASCAL VOC threshold (0.5) --- the prediction is not a good enough match. The boxes overlap, but the predicted box is shifted too far left and up from the ground truth.

IoU Thresholds

IoU	Interpretation
0.5	PASCAL VOC standard (loose)
0.75	COCO strict
0.5:0.95	COCO AP (average over 10 thresholds)
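
As a rough illustration of how these thresholds are applied, the sketch below generates the ten COCO thresholds and checks which of them a given IoU clears. The helper names `coco_iou_thresholds` and `thresholds_passed` are hypothetical, not part of any library.

```python
def coco_iou_thresholds():
    """Return the ten COCO IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou, thresholds):
    """List the thresholds at which a detection with this IoU counts as a match."""
    return [t for t in thresholds if iou >= t]


thresholds = coco_iou_thresholds()
# IoU from the worked example above: matches at no threshold
print(thresholds_passed(0.381, thresholds))
# A tighter box: matches at 0.50 through 0.80, fails at 0.85 and above
print(thresholds_passed(0.82, thresholds))
```

A detection therefore contributes to COCO AP only at the thresholds it clears, which is why tight localization is rewarded more under COCO than under PASCAL VOC.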

R-CNN Family Evolution

R-CNN (2014)

Generate ~2000 region proposals (Selective Search)
Warp each to fixed size and pass through CNN
Classify with SVM + regress bounding box

Problem: Process each region independently --- extremely slow (47 seconds per image).

Fast R-CNN (2015)

Pass the entire image through CNN once to get a feature map
Project region proposals onto the feature map
RoI Pooling extracts fixed-size features from each region
Classify + regress in one forward pass

RoI Pooling: Divide the projected region into a fixed grid (e.g., 7x7) and max-pool within each cell.

Improvement: Sharing computation across proposals. ~0.3 seconds per image.
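
To make the RoI Pooling step concrete, here is a minimal pure-Python sketch on a single-channel feature map. It assumes the region is already given in integer feature-map coordinates and uses a 2x2 output grid for readability; real implementations work on multi-channel tensors and handle coordinate quantization differently. The function name `roi_max_pool` is made up for this example.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool a region of a 2D feature map into an out_size x out_size grid.

    feature_map: list of rows (H x W), single channel for simplicity
    roi: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for gy in range(out_size):
        # Integer bin edges; every bin covers at least one cell
        ys = y1 + (gy * roi_h) // out_size
        ye = y1 + max(((gy + 1) * roi_h) // out_size, (gy * roi_h) // out_size + 1)
        row = []
        for gx in range(out_size):
            xs = x1 + (gx * roi_w) // out_size
            xe = x1 + max(((gx + 1) * roi_w) // out_size, (gx * roi_w) // out_size + 1)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
# A 4x4 region starting at column 1 is pooled down to 2x2
print(roi_max_pool(feature_map, roi=(1, 0, 5, 4)))  # [[9, 11], [21, 23]]
```

Whatever the size of the input region, the output grid has a fixed shape, which is what lets the fully connected layers that follow accept any proposal.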

Faster R-CNN (2015)

Replace Selective Search with a Region Proposal Network (RPN) that generates proposals from the feature map itself.

RPN: At each position in the feature map, place $k$ anchor boxes (different scales and aspect ratios) and predict two outputs per anchor:

Objectness score: is there an object? (binary)
Box regression: adjust anchor to fit object ( $Δ x, Δ y, Δ w, Δ h$ )

Anchor Boxes

Anchors are predefined boxes at each feature map location. For 3 scales and 3 aspect ratios, $k = 9$ anchors per position.
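
A small sketch of anchor generation at one location follows. The scales (128, 256, 512) and ratios (0.5, 1, 2) are example values, and treating the ratio as width divided by height is an assumed convention; `make_anchors` is a hypothetical helper.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors [x1, y1, x2, y2] centred at (cx, cy).

    Each anchor has area scale**2; ratio is width / height (assumed convention).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
print(anchors[1])    # the 128x128 square anchor: [236.0, 136.0, 364.0, 264.0]
```

Repeating this at every feature-map position gives the full anchor set that the RPN scores and refines.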

The RPN predicts offsets from anchors:

t_{x} = \frac{x - x_{a}}{w_{a}}, t_{y} = \frac{y - y_{a}}{h_{a}}

t_{w} = \log \frac{w}{w_{a}}, t_{h} = \log \frac{h}{h_{a}}
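
The offset equations can be sketched as a pair of encode and decode functions. Boxes and anchors here are in centre format `(x, y, w, h)`, and the function names are illustrative only.

```python
import math


def encode(box, anchor):
    """Compute regression targets (tx, ty, tw, th) for a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(t, anchor):
    """Invert encode: recover the box (x, y, w, h) from offsets and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 100.0, 100.0)
box = (110.0, 95.0, 100.0, 50.0)
t = encode(box, anchor)
print(t)                  # tx = 0.1, ty = -0.05, tw = 0.0, th = log(0.5)
print(decode(t, anchor))  # recovers the original box up to rounding
```

Normalizing the centre shift by the anchor size and using log-ratios for width and height keeps the targets in a similar numeric range for small and large anchors.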

Non-Maximum Suppression (NMS)

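Detectors typically produce multiple overlapping boxes for the same object. NMS keeps the highest-scoring box, discards any remaining box whose IoU with it exceeds a threshold, and repeats on the boxes that are left: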

python

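import torch


def compute_iou_batch(box, boxes):
    """IoU between one box (1, 4) and M boxes (M, 4); returns shape (M,)."""
    lt = torch.max(box[:, :2], boxes[:, :2])   # top-left corners of intersections
    rb = torch.min(box[:, 2:], boxes[:, 2:])   # bottom-right corners of intersections
    wh = (rb - lt).clamp(min=0)                # zero width/height when boxes are disjoint
    inter = wh[:, 0] * wh[:, 1]
    area_box = (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter + 1e-6)


def nms(boxes, scores, iou_threshold=0.5):
    """Apply Non-Maximum Suppression.
    boxes: (N, 4) [x1, y1, x2, y2]
    scores: (N,) confidence scores
    Returns the indices of the kept boxes, highest score first.
    """
    order = scores.argsort(descending=True)
    keep = []

    while len(order) > 0:
        i = order[0]
        keep.append(i.item())  # store a plain int, not a 0-dim tensor

        if len(order) == 1:
            break

        # IoU of the current best box against all remaining candidates
        ious = compute_iou_batch(boxes[i].unsqueeze(0), boxes[order[1:]])
        # Keep only candidates that do not overlap the best box too much
        remaining = (ious < iou_threshold).nonzero(as_tuple=True)[0]
        order = order[remaining + 1]

    return keep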

YOLO: You Only Look Once

YOLO (Redmon et al., 2016) frames detection as a single regression problem.

Grid Cell Approach

Divide the image into an $S \times S$ grid
Each cell predicts $B$ bounding boxes, each with 5 values ( $x, y, w, h$ and a confidence score), plus $C$ class probabilities shared by the cell
Output tensor: $S \times S \times (B \times 5 + C)$

For YOLOv1: $S = 7$ , $B = 2$ , $C = 20$ (PASCAL VOC) $\to$ $7 \times 7 \times 30$
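
The sketch below computes the output shape and finds which grid cell an object falls into. In YOLOv1 the cell containing the object's centre is the one responsible for predicting it; the 448x448 image size and the helper names are assumptions made for this example.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the YOLOv1 output tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object centre (pixels)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centres on the right/bottom edge
    row = min(int(cy / img_h * S), S - 1)
    return row, col


print(yolo_output_shape())                    # (7, 7, 30)
print(responsible_cell(224, 100, 448, 448))   # (1, 3)
```

Because each cell predicts only one set of class probabilities, two objects of different classes whose centres fall in the same cell cannot both be detected, which is a known limitation of the original design.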

Loss function:

L = λ_{coord} L_{box} + L_{obj} + λ_{noobj} L_{noobj} + L_{class}

The coordinate loss uses $\sqrt{w}$ and $\sqrt{h}$ to weight small box errors more heavily than large box errors.
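
A quick numeric check of the square-root trick: the same 10-pixel width error is penalized far more on a small box than on a large one. The box widths are arbitrary example values.

```python
import math


def sqrt_size_error(w_true, w_pred):
    """Squared error between square-rooted sizes, as in the YOLOv1 size term."""
    return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2


small = sqrt_size_error(20, 30)     # 10 px error on a 20 px wide box
large = sqrt_size_error(200, 210)   # 10 px error on a 200 px wide box
print(round(small, 3), round(large, 3))  # roughly 1.010 vs 0.122
```

With a plain squared error both cases would cost the same (100), even though the error is half the width of the small box and only 5% of the large one.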

YOLO Evolution

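Version	Year	Key Innovation	mAP (benchmark)
YOLOv1	2016	Single-stage detection	63.4 (VOC 2007, AP@0.5)
YOLOv2	2017	Anchor boxes, batch norm	78.6 (VOC 2007, AP@0.5)
YOLOv3	2018	Multi-scale predictions, Darknet-53	33.0 (COCO, AP@0.5:0.95)
YOLOv4	2020	CSPDarknet, Mosaic augmentation	43.5 (COCO, AP@0.5:0.95)
YOLOv5	2020	PyTorch native, Focus module	50.7 (COCO, AP@0.5:0.95)
YOLOv8	2023	Anchor-free, decoupled head	53.9 (COCO, AP@0.5:0.95)
YOLOv11	2024	Efficient architecture, C3k2 and C2PSA blocks	54.7 (COCO, AP@0.5:0.95)

The first two rows report PASCAL VOC results and the rest report COCO results, so scores are not comparable across that boundary. The COCO figures are also for different model sizes and input resolutions.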

SSD: Single Shot MultiBox Detector

SSD detects objects at multiple scales using feature maps from different layers:

Early layers (high resolution): detect small objects
Later layers (low resolution): detect large objects

Each feature map cell predicts $k$ boxes with class scores.
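
To give a sense of scale, the sketch below counts the default boxes for the SSD300 configuration described in the SSD paper (six feature maps with 4 or 6 boxes per cell). The specific sizes are taken from that configuration and are not stated elsewhere on this page; `count_default_boxes` is a hypothetical helper.

```python
def count_default_boxes(feature_map_sizes, boxes_per_cell):
    """Total number of default boxes over square feature maps of the given sizes."""
    return sum(s * s * k for s, k in zip(feature_map_sizes, boxes_per_cell))


# SSD300: feature map side lengths and number of default boxes per cell
sizes = [38, 19, 10, 5, 3, 1]
k_per_cell = [4, 6, 6, 6, 4, 4]

print(count_default_boxes(sizes, k_per_cell))  # 8732
```

Most of these boxes come from the high-resolution 38x38 map, which is the layer responsible for small objects.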

DETR: Detection Transformer

DETR (Carion et al., 2020) eliminates anchors, NMS, and region proposals by treating detection as a set prediction problem.

Architecture

Object Queries: $N$ learnable embeddings (e.g., $N = 100$ ) that each learn to detect one object. The decoder uses cross-attention to attend to relevant image regions.

Bipartite Matching Loss: Uses the Hungarian algorithm to find the optimal one-to-one matching between predictions and ground truth:

\hat{σ} = \arg min_{σ \in S_{N}} \sum_{i = 1}^{N} L_{match} (y_{i}, {\hat{y}}_{σ (i)})
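
The matching objective can be illustrated with a tiny brute-force search over permutations. This is a stand-in for the Hungarian algorithm: it finds the same optimum but only scales to very small $N$; in practice a solver such as `scipy.optimize.linear_sum_assignment` is the usual choice. The cost values are invented for the example.

```python
from itertools import permutations


def best_matching(cost):
    """Find the permutation sigma minimizing sum_i cost[i][sigma[i]].

    cost[i][j] is the matching cost between ground-truth i and prediction j.
    Brute force over all N! permutations, so only suitable for tiny N.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost


cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
sigma, total = best_matching(cost)
print(sigma, round(total, 2))  # (1, 0, 2) 0.6
```

Because the assignment is one-to-one, each ground-truth object is claimed by exactly one query, which is what removes the need for NMS. In DETR the ground-truth set is padded with "no object" entries so that it has the same size $N$ as the prediction set.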

Mean Average Precision (mAP)

Precision and Recall

For each class, sort detections by confidence. At each threshold:

\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}
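
A minimal sketch of how the precision-recall points are accumulated for one class. It assumes each detection has already been marked TP or FP by IoU matching, and uses the fact that $TP + FN$ equals the number of ground-truth objects. The function name is illustrative.

```python
def precision_recall_curve(is_tp, num_gt):
    """Cumulative precision and recall after each detection.

    is_tp: TP/FP flags for one class, sorted by descending confidence
    num_gt: number of ground-truth objects of that class (TP + FN)
    """
    precisions, recalls = [], []
    tp = fp = 0
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    return precisions, recalls


# Three detections (TP, FP, TP) against two ground-truth objects
p, r = precision_recall_curve([True, False, True], num_gt=2)
print(p)  # [1.0, 0.5, 0.666...]
print(r)  # [0.5, 0.5, 1.0]
```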

AP (Average Precision)

Area under the precision-recall curve:

AP = \int_{0}^{1} p(r) \, dr

In practice the integral is approximated by 11-point interpolation (PASCAL VOC 2007), all-point interpolation (PASCAL VOC 2010 onward), or 101-point interpolation (COCO).
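
For comparison with the integral, here is a sketch of the 11-point variant: it averages the maximum precision found at or beyond each recall level 0, 0.1, ..., 1.0. The input lists are example values and `ap_11_point` is a hypothetical name.

```python
def ap_11_point(precisions, recalls):
    """11-point interpolated AP over recall levels 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at any recall >= this level
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


precisions = [1.0, 0.5, 2 / 3]
recalls = [0.5, 0.5, 1.0]
print(round(ap_11_point(precisions, recalls), 3))  # 0.848
```

For these points, six recall levels (0 to 0.5) see precision 1.0 and five levels (0.6 to 1.0) see precision 2/3, giving (6 + 10/3) / 11. The exact area under the same curve is 0.833, so the two schemes can differ noticeably on short curves.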

mAP

Average AP across all classes:

mAP = \frac{1}{C} \sum_{c = 1}^{C} AP_{c}

COCO mAP averages over IoU thresholds from 0.5 to 0.95 (step 0.05).

Worked Example — mAP Computation (Simplified)

Setup: 2 classes (cat, dog). 5 detections sorted by confidence:

Detection	Class	Confidence	IoU with GT	TP/FP
D1	cat	0.95	0.82	TP
D2	cat	0.90	0.10	FP
D3	dog	0.85	0.71	TP
D4	cat	0.70	0.65	TP
D5	dog	0.60	0.30	FP

Ground truth: 2 cats, 2 dogs. IoU threshold = 0.5.

Cat class (2 GT objects, detections: D1-TP, D2-FP, D4-TP):

Step	Precision	Recall
After D1 (TP)	1/1 = 1.000	1/2 = 0.500
After D2 (FP)	1/2 = 0.500	1/2 = 0.500
After D4 (TP)	2/3 = 0.667	2/2 = 1.000

$AP_{cat} = \text{area under P-R curve} \approx 1.0 \times 0.5 + 0.667 \times 0.5 = 0.833$

Dog class (2 GT objects, detections: D3-TP, D5-FP):

Step	Precision	Recall
After D3 (TP)	1/1 = 1.000	1/2 = 0.500
After D5 (FP)	1/2 = 0.500	1/2 = 0.500

$AP_{dog} = 1.0 \times 0.5 = 0.500$ (recall never reaches 1.0 because one dog was never detected)

mAP:

mAP = \frac{AP_{cat} + AP_{dog}}{2} = \frac{0.833 + 0.500}{2} = 0.667

Result: mAP = 0.667. The model finds cats well (83.3% AP) but misses one dog entirely, dragging the average down. COCO would also average this across 10 IoU thresholds.

python

import numpy as np

def compute_ap(precision, recall):
    """Compute AP using all-point interpolation.

    precision, recall: 1-D sequences ordered by descending detection confidence.
    """
    # Add sentinel values
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))

    # Build the precision envelope (make the curve monotonically non-increasing)
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Compute area under curve
    i = np.where(mrec[1:] != mrec[:-1])[0]
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap

YOLOv8 on Custom Dataset

python

from ultralytics import YOLO

# ── Load pretrained YOLOv8 ───────────────────────────────────────────
model = YOLO('yolov8n.pt')  # nano model (fastest)

# ── Dataset structure (YOLO format) ──────────────────────────────────
# dataset/
#   train/
#     images/
#     labels/   (one .txt per image: class cx cy w h)
#   val/
#     images/
#     labels/
#   data.yaml

# data.yaml content:
# train: ./train/images
# val: ./val/images
# nc: 3
# names: ['cat', 'dog', 'bird']

# ── Training ─────────────────────────────────────────────────────────
results = model.train(
    data='dataset/data.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    lrf=0.01,       # Final LR factor
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,
    augment=True,
    mosaic=1.0,      # Mosaic augmentation
    mixup=0.1,
    copy_paste=0.1,
    device=0,        # GPU
    name='custom_detector',
)

# ── Inference ────────────────────────────────────────────────────────
model = YOLO('runs/detect/custom_detector/weights/best.pt')

# Single image
results = model('test_image.jpg')
for r in results:
    boxes = r.boxes
    for box in boxes:
        xyxy = box.xyxy[0]      # [x1, y1, x2, y2]
        conf = box.conf[0]       # confidence
        cls = int(box.cls[0])    # class index
        print(f"Class: {r.names[cls]}, Conf: {conf:.2f}, Box: {xyxy}")

# ── Export for deployment ────────────────────────────────────────────
model.export(format='onnx')  # ONNX
model.export(format='torchscript')  # TorchScript
model.export(format='tflite')  # TensorFlow Lite (mobile)

Two-Stage vs One-Stage Detectors

Feature	Two-Stage (Faster R-CNN)	One-Stage (YOLO)
Speed	~5-15 FPS	~30-160 FPS
Accuracy	Higher (especially small objects)	Slightly lower
Architecture	RPN + detection head	Single network
Use case	Accuracy-critical	Real-time applications
Training	More complex	Simpler

Loss Functions for Object Detection

Classification Loss

Standard cross-entropy or focal loss for class prediction:

L_{focal} = - α_{t} (1 - p_{t})^{γ} \log (p_{t})

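Focal loss (Lin et al., 2017) down-weights well-classified examples through the modulating factor $(1 - p_{t})^{γ}$. With $γ = 2$ , an example with $p_{t} = 0.9$ has its loss scaled by $(1 - 0.9)^{2} = 0.01$ , so easy examples ( $p_{t} > 0.9$ ) contribute very little. This matters because most anchor boxes are background (negative), and without down-weighting these easy negatives dominate the total loss.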
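
The effect of the modulating factor can be checked numerically. The sketch compares focal loss with plain cross-entropy for an easy and a hard example; $α_{t} = 0.25$ is used as an example value and the function names are illustrative.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha_t=0.25):
    """Focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# Ratio of focal loss to cross-entropy = alpha_t * (1 - p_t)^gamma
easy_ratio = focal_loss(0.95) / cross_entropy(0.95)
hard_ratio = focal_loss(0.10) / cross_entropy(0.10)
print(easy_ratio)  # about 0.000625: the easy example is almost silenced
print(hard_ratio)  # about 0.2025: the hard example keeps most of its weight
```

Relative to each other, the hard example is weighted roughly 324 times more than the easy one, on top of its already larger cross-entropy.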

Box Regression Loss

Smooth L1 Loss (Faster R-CNN):

\text{smooth}_{L1}(x) = \begin{cases} 0.5 x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}
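
A direct transcription of the piecewise definition, as a scalar sketch:

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


print(smooth_l1(0.5))   # 0.125 (quadratic region)
print(smooth_l1(1.0))   # 0.5   (the two pieces meet here)
print(smooth_l1(-3.0))  # 2.5   (linear region)
```

The linear tail keeps gradients bounded for large errors, so outlier boxes do not dominate training the way they would under a squared loss.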

CIoU Loss (Complete IoU, used in YOLO):

L_{CIoU} = 1 - IoU + \frac{ρ^{2} (b, b^{g t})}{c^{2}} + α v

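where $ρ$ is the Euclidean distance between the box centers, $c$ is the diagonal length of the smallest box enclosing both boxes, $v$ measures aspect ratio consistency, and $α$ is a non-negative weight that trades the aspect-ratio term off against the overlap term.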
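
The sketch below evaluates the CIoU loss for a single pair of boxes in plain Python. It uses the definitions from the CIoU paper, $v = \frac{4}{π^{2}} (\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h})^{2}$ and $α = \frac{v}{(1 - IoU) + v}$, which are not spelled out above; the small `eps` terms are an assumption added to avoid division by zero.

```python
import math


def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for two boxes given as [x1, y1, x2, y2]."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain IoU
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared centre distance (rho^2) and squared enclosing-box diagonal (c^2)
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


# Boxes from the IoU worked example: IoU is about 0.381
loss = ciou_loss([50, 50, 200, 200], [100, 80, 250, 220])
print(round(loss, 3))  # about 0.664
```

For these boxes almost all of the extra penalty beyond $1 - IoU$ comes from the centre-distance term; the aspect ratios are nearly equal, so $α v$ is negligible.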

GIoU (Generalized IoU)

GIoU = IoU - \frac{| C ∖ (A \cup B) |}{| C |}

where $C$ is the smallest axis-aligned box enclosing both $A$ and $B$ . GIoU provides gradients even when the boxes do not overlap (IoU = 0), because the penalty term still shrinks as the boxes move closer together.

python

import torch

def giou_loss(pred_boxes, target_boxes):
    """Compute GIoU loss between predicted and target boxes."""
    # Compute IoU
    inter_x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    inter_y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    inter_x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    inter_y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])

    inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * torch.clamp(inter_y2 - inter_y1, min=0)
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * (target_boxes[:, 3] - target_boxes[:, 1])
    union = pred_area + target_area - inter_area
    iou = inter_area / (union + 1e-6)

    # Compute enclosing box
    enc_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    enc_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    enc_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    enc_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3])
    enc_area = (enc_x2 - enc_x1) * (enc_y2 - enc_y1)

    giou = iou - (enc_area - union) / (enc_area + 1e-6)
    return 1 - giou.mean()

Data Annotation for Detection

Annotation Formats

Format	Used By	Structure
PASCAL VOC	VOC dataset	XML per image
COCO JSON	COCO dataset	Single JSON for all images
YOLO	Ultralytics	TXT per image: `class cx cy w h`
Label Studio	Generic	JSON with bounding boxes

YOLO Label Format

Each image has a corresponding .txt file with one line per object:

# class_id center_x center_y width height (all normalized 0-1)
0 0.5 0.4 0.3 0.6
1 0.2 0.7 0.15 0.2
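
A small parser for one label line can catch format mistakes early. This is a sketch under the format described above; `parse_yolo_label_line` is a hypothetical helper, not an Ultralytics function, and it expects data lines only (no comment lines).

```python
def parse_yolo_label_line(line):
    """Parse 'class cx cy w h' into (class_id, [cx, cy, w, h]).

    Raises ValueError if the field count is wrong or a coordinate is not in [0, 1].
    """
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    box = [float(v) for v in parts[1:]]
    if not all(0.0 <= v <= 1.0 for v in box):
        raise ValueError(f"coordinates must be normalized to [0, 1]: {line!r}")
    return class_id, box


print(parse_yolo_label_line("0 0.5 0.4 0.3 0.6"))  # (0, [0.5, 0.4, 0.3, 0.6])
```

Pixel coordinates written where normalized ones are expected are a common cause of a model that trains without errors but never detects anything; a range check like this flags them immediately.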

Converting Between Formats

python

def voc_to_yolo(voc_box, img_w, img_h):
    """Convert VOC [xmin, ymin, xmax, ymax] to YOLO [cx, cy, w, h] (normalized)."""
    xmin, ymin, xmax, ymax = voc_box
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return [cx, cy, w, h]

def yolo_to_voc(yolo_box, img_w, img_h):
    """Convert YOLO [cx, cy, w, h] to VOC [xmin, ymin, xmax, ymax]."""
    cx, cy, w, h = yolo_box
    xmin = (cx - w / 2) * img_w
    ymin = (cy - h / 2) * img_h
    xmax = (cx + w / 2) * img_w
    ymax = (cy + h / 2) * img_h
    return [xmin, ymin, xmax, ymax]

def coco_to_voc(coco_box):
    """Convert COCO [x, y, w, h] to VOC [xmin, ymin, xmax, ymax]."""
    x, y, w, h = coco_box
    return [x, y, x + w, y + h]

Feature Pyramid Network (FPN)

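FPN (Lin et al., 2017) constructs multi-scale feature maps for detecting objects of different sizes. A bottom-up backbone produces feature maps at decreasing resolution; a top-down pathway then upsamples the coarser, semantically stronger maps and merges them with the finer maps through lateral connections.

Each level of the pyramid handles a different object scale: small objects are detected on the high-resolution levels and large objects on the low-resolution levels. Because of the top-down merging, the high-resolution levels also carry high-level semantics.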
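
The core of FPN's top-down pathway is "upsample the coarser map and add it to the finer one". The toy sketch below shows only that merge on single-channel 2D lists; the 1x1 lateral convolutions and the 3x3 smoothing convolution used in the actual FPN design are deliberately omitted, and the function names are made up for illustration.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def top_down_merge(coarse, lateral):
    """Add an upsampled coarse map to the same-size finer (lateral) map."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


c5 = [[10]]              # 1x1 coarse, semantically strong map
c4 = [[1, 2], [3, 4]]    # 2x2 finer map from an earlier layer
p4 = top_down_merge(c5, c4)
print(p4)                # [[11, 12], [13, 14]]
```

Applying the merge repeatedly from the coarsest level downward yields one output map per scale, each combining fine spatial detail with context from deeper layers.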

Common Pitfalls

Mistake	Symptom	Fix
Wrong annotation format	Model trains but mAP is 0	Verify box coordinates and class IDs
Images not resized	CUDA OOM	Resize to 640x640 for YOLO
Forgetting NMS	Duplicate detections	Apply NMS (IoU threshold 0.5)
Too few anchors	Missing small/large objects	Use FPN or multi-scale anchors
Imbalanced classes	Model ignores rare classes	Use focal loss or oversampling
Train/val leakage	Inflated mAP	Ensure no duplicate images across splits
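
For the train/val leakage row, one simple check is to hash every image file and look for identical content in both splits. The sketch below only catches byte-identical duplicates; resized or re-encoded copies would need a perceptual hash instead. The function names and directory layout are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_leaked_images(train_dir, val_dir):
    """Return val image paths whose bytes exactly match some train image."""
    train_hashes = {file_digest(p) for p in Path(train_dir).iterdir() if p.is_file()}
    return sorted(
        str(p)
        for p in Path(val_dir).iterdir()
        if p.is_file() and file_digest(p) in train_hashes
    )
```

Calling `find_leaked_images('dataset/train/images', 'dataset/val/images')` on the layout shown earlier should return an empty list for a clean split.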

Cross-References

CNN backbones: CNN --- ResNet, EfficientNet for feature extraction
Transformer-based: Transformers --- DETR architecture
Segmentation: Image Segmentation --- pixel-level detection
Classification: Image Classification --- ViT, transfer learning
Deployment: Model Optimization --- quantization for real-time
