
Object Detection

Object detection combines classification (what is it?) with localization (where is it?). This page traces the evolution from R-CNN to DETR, derives IoU and mAP metrics, explains YOLO's grid-cell approach, and trains YOLOv8 on a custom dataset.

The Detection Problem

Given an image, output a set of bounding boxes with class labels:

$$\{(x_1, y_1, x_2, y_2, c, p) \ \text{for each detected object}\}$$

where (x1,y1,x2,y2) define the box, c is the class, and p is the confidence.

Intersection over Union (IoU)

IoU measures the overlap between predicted and ground-truth boxes:

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$
python
def compute_iou(box1, box2):
    """Compute IoU between two boxes [x1, y1, x2, y2]."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / (union + 1e-6)
Worked Example — IoU Calculation for Two Boxes

Input: Two bounding boxes in [x1,y1,x2,y2] format:

  • Predicted: A=[50,50,200,200] (150x150 pixels)
  • Ground truth: B=[100,80,250,220] (150x140 pixels)

Step 1: Compute intersection rectangle:

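$$x_1 = \max(50, 100) = 100, \quad y_1 = \max(50, 80) = 80$$
$$x_2 = \min(200, 250) = 200, \quad y_2 = \min(200, 220) = 200$$

Intersection area: $(200 - 100) \times (200 - 80) = 100 \times 120 = 12{,}000$ pixels

Step 2: Compute individual areas:

$$|A| = (200 - 50)(200 - 50) = 150 \times 150 = 22{,}500$$
$$|B| = (250 - 100)(220 - 80) = 150 \times 140 = 21{,}000$$

Step 3: Compute union:

$$|A \cup B| = 22{,}500 + 21{,}000 - 12{,}000 = 31{,}500$$

Step 4: IoU:

$$\text{IoU} = \frac{12{,}000}{31{,}500} = 0.381$$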

Result: IoU = 0.381. This fails the PASCAL VOC threshold of 0.5, so the prediction does not count as a match. The boxes overlap, but the predicted box is shifted too far left and up relative to the ground truth.
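The arithmetic in this worked example can be checked in a few lines. This is a minimal sketch in plain Python; the helper name `iou_parts` is hypothetical and returns the intermediate areas so each step can be compared with the numbers above.

```python
def iou_parts(a, b):
    """Return (intersection, union, iou) for two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    return inter, union, iou


pred = [50, 50, 200, 200]
gt = [100, 80, 250, 220]
inter, union, iou = iou_parts(pred, gt)
print(inter, union, round(iou, 3))  # 12000 31500 0.381
```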

IoU Thresholds

| IoU | Interpretation |
|---|---|
| 0.5 | PASCAL VOC standard (loose) |
| 0.75 | COCO strict |
| 0.5:0.95 | COCO AP (average over 10 thresholds) |

R-CNN Family Evolution

R-CNN (2014)

  1. Generate ~2000 region proposals (Selective Search)
  2. Warp each to fixed size and pass through CNN
  3. Classify with SVM + regress bounding box

Problem: Each region is processed independently by the CNN, which is extremely slow (about 47 seconds per image).

Fast R-CNN (2015)

  1. Pass the entire image through CNN once to get a feature map
  2. Project region proposals onto the feature map
  3. RoI Pooling extracts fixed-size features from each region
  4. Classify + regress in one forward pass

RoI Pooling: Divide the projected region into a fixed grid (e.g., 7x7) and max-pool within each cell.

Improvement: Computation is shared across proposals, which brings inference down to about 0.3 seconds per image.
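RoI Pooling as described above can be sketched in NumPy. This is an illustrative version, not the Fast R-CNN implementation: the function name `roi_max_pool` is hypothetical, the RoI is assumed to be already projected to integer feature-map coordinates (end-exclusive), and bins are formed with simple integer edges.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI of a (C, H, W) feature map into (C, out_h, out_w).

    roi is (x1, y1, x2, y2) in feature-map cells, end-exclusive.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    region = feature_map[:, y1:y2, x1:x2]
    channels, h, w = region.shape
    # Integer bin edges; bins can differ by one cell when h or w is not divisible
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.zeros((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Make sure every bin covers at least one cell
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            pooled[:, i, j] = region[:, ys[i]:y_end, xs[j]:x_end].max(axis=(1, 2))
    return pooled


fm = np.arange(2 * 16 * 16, dtype=np.float32).reshape(2, 16, 16)
pooled = roi_max_pool(fm, roi=(2, 2, 16, 16))
print(pooled.shape)  # (2, 7, 7)
```

Whatever the size of the input region, the output has the same fixed shape, which is what lets the fully connected layers that follow accept arbitrary proposals.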

Faster R-CNN (2015)

Replace Selective Search with a Region Proposal Network (RPN) that generates proposals from the feature map itself.

RPN: At each position in the feature map, predict k anchor boxes (different scales and aspect ratios):

  • Objectness score: is there an object? (binary)
  • Box regression: adjust anchor to fit object (Δx,Δy,Δw,Δh)

Anchor Boxes

Anchors are predefined boxes at each feature map location. For 3 scales and 3 aspect ratios, k=9 anchors per position.
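To make the k=9 count concrete, here is a small sketch that enumerates anchor shapes for 3 scales and 3 aspect ratios. The default values (base size 16, scales 8/16/32, ratios 0.5/1/2) are assumptions chosen to match the commonly cited Faster R-CNN setup; the function name `make_anchors` is hypothetical.

```python
import math


def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (w, h) pairs.

    Each anchor has area (base_size * scale)^2 and aspect ratio h / w = ratio.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((w, h))
    return anchors


anchors = make_anchors()
print(len(anchors))  # 9
```

The same nine shapes are reused at every feature-map position, so only the anchor centers change from one location to the next.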

The RPN predicts offsets from anchors:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$
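The offset parameterization is easiest to see as an encode/decode pair. The sketch below assumes boxes and anchors are given as (center x, center y, width, height); the function names `encode` and `decode` are hypothetical.

```python
import math


def encode(box, anchor):
    """Turn a box into offsets (tx, ty, tw, th) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    """Invert encode(): apply offsets to an anchor to recover the box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 128.0, 128.0)
box = (110.0, 90.0, 150.0, 100.0)
deltas = encode(box, anchor)
recovered = decode(deltas, anchor)
```

Dividing by the anchor size makes the center offsets scale-invariant, and the log keeps the predicted width and height positive after decoding.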

Non-Maximum Suppression (NMS)

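Multiple overlapping detections of the same object must be reduced to one. NMS keeps the highest-scoring box and discards every other box whose IoU with it exceeds a threshold, then repeats on what remains: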

python
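import torch


def compute_iou_batch(box, boxes):
    """IoU between one box (1, 4) and M boxes (M, 4); returns shape (M,)."""
    x1 = torch.max(box[:, 0], boxes[:, 0])
    y1 = torch.max(box[:, 1], boxes[:, 1])
    x2 = torch.min(box[:, 2], boxes[:, 2])
    y2 = torch.min(box[:, 3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-6)


def nms(boxes, scores, iou_threshold=0.5):
    """Apply Non-Maximum Suppression.
    boxes: (N, 4) tensor [x1, y1, x2, y2]
    scores: (N,) tensor of confidence scores
    Returns the indices of the kept boxes, highest score first.
    """
    order = scores.argsort(descending=True)
    keep = []

    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)

        if order.numel() == 1:
            break

        # IoU of the selected box against all remaining candidates
        ious = compute_iou_batch(boxes[i].unsqueeze(0), boxes[order[1:]])
        # Keep candidates that overlap the selected box less than the threshold
        remaining = (ious < iou_threshold).nonzero(as_tuple=True)[0]
        order = order[remaining + 1]

    return keep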

YOLO: You Only Look Once

YOLO (Redmon et al., 2016) frames detection as a single regression problem.

Grid Cell Approach

  1. Divide the image into an S×S grid
  2. Each cell predicts B bounding boxes + confidence + C class probabilities
  3. Output tensor: S×S×(B×5+C)

For YOLOv1: $S = 7$, $B = 2$, $C = 20$ (PASCAL VOC), which gives a $7 \times 7 \times 30$ output tensor.
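The output shape and the grid assignment can be sketched as follows. The second helper relies on a detail from the YOLOv1 paper that is not spelled out above: the cell containing an object's center is the one responsible for predicting it. Both function names are hypothetical.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes of (x, y, w, h, confidence) plus C class scores."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, S=7):
    """Grid cell (row, col) that contains a box center given in normalized [0, 1] coordinates."""
    col = min(int(cx * S), S - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * S), S - 1)
    return row, col


print(yolo_v1_output_shape())      # (7, 7, 30)
print(responsible_cell(0.5, 0.4))  # (2, 3)
```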

Loss function:

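$$\mathcal{L} = \lambda_{\text{coord}} \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{obj}} + \lambda_{\text{noobj}} \mathcal{L}_{\text{noobj}} + \mathcal{L}_{\text{class}}$$

The coordinate loss regresses $\sqrt{w}$ and $\sqrt{h}$ rather than $w$ and $h$, so that a given absolute error is penalized more heavily in a small box than in a large one.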
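A quick numeric check shows why the YOLOv1 size term is computed on square roots of width and height. The sketch compares the same 5-pixel width error on a small and a large box; the helper name `size_error` is hypothetical and only the width term is shown.

```python
import math


def size_error(w_pred, w_true, use_sqrt=True):
    """Squared error on the width, optionally computed on square roots."""
    if use_sqrt:
        return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2
    return (w_pred - w_true) ** 2


# The same 5 px error on a 10 px box and on a 200 px box
small_sqrt = size_error(15, 10)
large_sqrt = size_error(205, 200)
small_plain = size_error(15, 10, use_sqrt=False)
large_plain = size_error(205, 200, use_sqrt=False)

print(small_sqrt > large_sqrt)     # True: the small box is penalized more
print(small_plain == large_plain)  # True: plain squared error cannot tell them apart
```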

YOLO Evolution

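| Version | Year | Key Innovation | mAP |
|---|---|---|---|
| YOLOv1 | 2016 | Single-stage detection | 63.4 (VOC 2007) |
| YOLOv2 | 2017 | Anchor boxes, batch norm | 78.6 (VOC 2007) |
| YOLOv3 | 2018 | Multi-scale predictions, Darknet-53 | 33.0 (COCO) |
| YOLOv4 | 2020 | CSPDarknet, Mosaic augmentation | 43.5 (COCO) |
| YOLOv5 | 2020 | PyTorch native, Focus module | 50.7 (COCO) |
| YOLOv8 | 2023 | Anchor-free, decoupled head | 53.9 (COCO) |
| YOLOv11 | 2024 | Efficient architecture (C3k2 and C2PSA blocks) | 54.7 (COCO) |

The YOLOv1 and YOLOv2 figures are PASCAL VOC 2007 mAP at IoU 0.5; the others are COCO AP averaged over IoU 0.5:0.95, so the two groups are not directly comparable.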

SSD: Single Shot MultiBox Detector

SSD detects objects at multiple scales using feature maps from different layers:

  • Early layers (high resolution): detect small objects
  • Later layers (low resolution): detect large objects

Each feature map cell predicts k boxes with class scores.
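To get a feel for how many boxes this produces, the sketch below counts the default boxes for SSD300. The feature-map sizes and boxes per cell are taken from the SSD300 configuration reported in the SSD paper rather than from this page, so treat them as an assumption; the function name is hypothetical.

```python
# (feature map side length, default boxes per cell) for SSD300, as reported in the SSD paper
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(layers):
    """Total boxes = sum over feature maps of (cells in the map) * (boxes per cell)."""
    return sum(side * side * k for side, k in layers)


total_boxes = count_default_boxes(SSD300_LAYERS)
print(total_boxes)  # 8732
```

Most of the boxes come from the first, highest-resolution map, which is where small objects are matched.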

DETR: Detection Transformer

DETR (Carion et al., 2020) eliminates anchors, NMS, and region proposals by treating detection as a set prediction problem.

Architecture

Object Queries: N learnable embeddings (e.g., N=100) that each learn to detect one object. The decoder uses cross-attention to attend to relevant image regions.

Bipartite Matching Loss: Uses the Hungarian algorithm to find the optimal one-to-one matching between predictions and ground truth:

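$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\left(y_i, \hat{y}_{\sigma(i)}\right)$$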
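The matching objective can be illustrated on a toy cost matrix. The sketch below searches all permutations by brute force, which is only practical for tiny N; it is a stand-in for the Hungarian algorithm (in practice `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time). The cost values are made up for illustration and the function name is hypothetical.

```python
from itertools import permutations


def best_matching(cost):
    """Find the permutation sigma minimizing sum_i cost[i][sigma[i]].

    cost[i][j] is the matching cost between ground truth i and prediction j.
    The matrix is assumed square (DETR pads the ground truth with "no object").
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost


cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.6, 0.3],
]
sigma, total = best_matching(cost)
print(sigma)  # (1, 0, 2)
```

Because the assignment is one-to-one, two predictions can never be matched to the same object, which is what removes the need for NMS.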

Mean Average Precision (mAP)

Precision and Recall

For each class, sort detections by confidence. At each threshold:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$

AP (Average Precision)

Area under the precision-recall curve:

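$$\text{AP} = \int_0^1 p(r)\, dr$$

In practice the integral is approximated: 11-point interpolation (PASCAL VOC 2007), all-point interpolation (PASCAL VOC 2010 onward), or 101-point interpolation (COCO).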

mAP

Average AP across all classes:

$$\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$$

COCO mAP averages over IoU thresholds from 0.5 to 0.95 (step 0.05).

Worked Example — mAP Computation (Simplified)

Setup: 2 classes (cat, dog). 5 detections sorted by confidence:

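| Detection | Class | Confidence | IoU with GT | TP/FP |
|---|---|---|---|---|
| D1 | cat | 0.95 | 0.82 | TP |
| D2 | cat | 0.90 | 0.10 | FP |
| D3 | dog | 0.85 | 0.71 | TP |
| D4 | cat | 0.70 | 0.65 | TP |
| D5 | dog | 0.60 | 0.30 | FP |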

Ground truth: 2 cats, 2 dogs. IoU threshold = 0.5.

Cat class (2 GT objects, detections: D1-TP, D2-FP, D4-TP):

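| Step | Precision | Recall |
|---|---|---|
| After D1 (TP) | 1/1 = 1.000 | 1/2 = 0.500 |
| After D2 (FP) | 1/2 = 0.500 | 1/2 = 0.500 |
| After D4 (TP) | 2/3 = 0.667 | 2/2 = 1.000 |

$$\text{AP}_{\text{cat}} = \text{area under P-R curve} \approx 1.0 \times 0.5 + 0.667 \times 0.5 = 0.833$$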

Dog class (2 GT objects, detections: D3-TP, D5-FP):

| Step | Precision | Recall |
|---|---|---|
| After D3 (TP) | 1/1 = 1.000 | 1/2 = 0.500 |
| After D5 (FP) | 1/2 = 0.500 | 1/2 = 0.500 |

$\text{AP}_{\text{dog}} = 1.0 \times 0.5 = 0.500$ (recall never reaches 1.0 because one dog was never detected)

mAP:

$$\text{mAP} = \frac{\text{AP}_{\text{cat}} + \text{AP}_{\text{dog}}}{2} = \frac{0.833 + 0.500}{2} = 0.667$$

Result: mAP = 0.667. The model finds cats well (83.3% AP) but misses one dog entirely, dragging the average down. COCO would also average this across 10 IoU thresholds.

python
import numpy as np


def compute_ap(precision, recall):
    """Compute AP using all-point interpolation."""
    # Add sentinel values
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))

    # Smooth precision curve (make it monotonically decreasing)
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Compute area under curve
    i = np.where(mrec[1:] != mrec[:-1])[0]
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap
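The worked example can be reproduced directly from the TP/FP column of the detection table. This is a self-contained sketch in plain Python using all-point interpolation; the function name `average_precision` and its input format (a list of TP flags sorted by descending confidence) are assumptions made for illustration.

```python
def average_precision(is_tp, num_gt):
    """All-point interpolated AP for one class.

    is_tp: TP (True) / FP (False) flags for detections sorted by descending confidence.
    num_gt: number of ground-truth objects of this class.
    """
    tp = 0
    points = []  # (recall, precision) after each detection
    for rank, hit in enumerate(is_tp, start=1):
        tp += int(hit)
        points.append((tp / num_gt, tp / rank))

    ap, prev_recall = 0.0, 0.0
    for idx, (recall, _) in enumerate(points):
        if recall > prev_recall:
            # Interpolated precision: best precision at this recall level or beyond
            p_interp = max(p for _, p in points[idx:])
            ap += (recall - prev_recall) * p_interp
            prev_recall = recall
    return ap


ap_cat = average_precision([True, False, True], num_gt=2)  # D1, D2, D4
ap_dog = average_precision([True, False], num_gt=2)        # D3, D5
mean_ap = (ap_cat + ap_dog) / 2
print(round(ap_cat, 3), round(ap_dog, 3), round(mean_ap, 3))  # 0.833 0.5 0.667
```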

YOLOv8 on Custom Dataset

python
from ultralytics import YOLO

# ── Load pretrained YOLOv8 ───────────────────────────────────────────
model = YOLO('yolov8n.pt')  # nano model (fastest)

# ── Dataset structure (YOLO format) ──────────────────────────────────
# dataset/
#   train/
#     images/
#     labels/   (one .txt per image: class cx cy w h)
#   val/
#     images/
#     labels/
#   data.yaml

# data.yaml content:
# train: ./train/images
# val: ./val/images
# nc: 3
# names: ['cat', 'dog', 'bird']

# ── Training ─────────────────────────────────────────────────────────
results = model.train(
    data='dataset/data.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    lrf=0.01,       # Final LR factor
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,
    augment=True,
    mosaic=1.0,      # Mosaic augmentation
    mixup=0.1,
    copy_paste=0.1,
    device=0,        # GPU
    name='custom_detector',
)

# ── Inference ────────────────────────────────────────────────────────
model = YOLO('runs/detect/custom_detector/weights/best.pt')

# Single image
results = model('test_image.jpg')
for r in results:
    boxes = r.boxes
    for box in boxes:
        xyxy = box.xyxy[0]      # [x1, y1, x2, y2]
        conf = box.conf[0]       # confidence
        cls = int(box.cls[0])    # class index
        print(f"Class: {r.names[cls]}, Conf: {conf:.2f}, Box: {xyxy}")

# ── Export for deployment ────────────────────────────────────────────
model.export(format='onnx')  # ONNX
model.export(format='torchscript')  # TorchScript
model.export(format='tflite')  # TensorFlow Lite (mobile)

Two-Stage vs One-Stage Detectors

| Feature | Two-Stage (Faster R-CNN) | One-Stage (YOLO) |
|---|---|---|
| Speed | ~5-15 FPS | ~30-160 FPS |
| Accuracy | Higher (especially small objects) | Slightly lower |
| Architecture | RPN + detection head | Single network |
| Use case | Accuracy-critical | Real-time applications |
| Training | More complex | Simpler |

Loss Functions for Object Detection

Classification Loss

Standard cross-entropy or focal loss for class prediction:

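$$\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

Focal loss (Lin et al., 2017) down-weights easy examples, which in detection are mostly easy negatives. With $\gamma = 2$, well-classified examples ($p_t > 0.9$) contribute very little to the loss. This is essential because most anchor boxes are background (negative).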
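The down-weighting effect is easy to quantify. The sketch below compares focal loss with plain cross-entropy for an easy and a hard example; the value 0.25 for the class weight is an assumption (a commonly used setting, not something stated on this page) and the function names are hypothetical.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for one example, where p_t is the probability of the true class."""
    return -math.log(p_t)


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t)^gamma."""
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # well-classified example
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # badly misclassified example
print(round(easy_ratio, 4), round(hard_ratio, 4))  # 0.0025 0.2025
```

Relative to cross-entropy, the easy example is scaled down about 80 times more than the hard one, so the many easy background anchors no longer dominate the gradient.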

Box Regression Loss

Smooth L1 Loss (Faster R-CNN):

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
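A scalar version of the piecewise definition, as a minimal sketch (the function name `smooth_l1` is hypothetical). It is quadratic near zero and linear for large errors, so outliers produce bounded gradients.

```python
def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    if ax < 1:
        return 0.5 * x * x
    return ax - 0.5


print(smooth_l1(0.5))  # 0.125
print(smooth_l1(3.0))  # 2.5
```

For comparison, a plain squared loss $0.5x^2$ at $x = 3$ would give 4.5, so the linear branch grows much more slowly for large residuals.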

CIoU Loss (Complete IoU, used in YOLO):

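$$\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$

where $\rho$ is the Euclidean distance between the centers $b$ and $b^{gt}$ of the predicted and ground-truth boxes, $c$ is the diagonal length of the smallest enclosing box, $v$ measures aspect ratio consistency, and $\alpha$ is a positive weight on $v$.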
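A scalar sketch of the CIoU loss for a single pair of boxes follows. The formulas for $v$ and $\alpha$ are not given above; the ones used here, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \frac{v}{(1 - \text{IoU}) + v}$, follow the CIoU paper. The function name and the small `eps` guard are assumptions.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes in [x1, y1, x2, y2] format."""
    # IoU term
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_t, h_t = target[2] - target[0], target[3] - target[1]
    union = w_p * h_p + w_t * h_t - inter
    iou = inter / (union + eps)

    # rho^2: squared distance between the two box centers
    dx = (pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    rho2 = dx ** 2 + dy ** 2

    # c^2: squared diagonal of the smallest box enclosing both
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its weight
    v = (4 / math.pi ** 2) * (math.atan(w_t / h_t) - math.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


loss = ciou_loss([50, 50, 200, 200], [100, 80, 250, 220])
print(round(loss, 3))  # 0.664
```

For the boxes from the IoU worked example, almost all of the penalty beyond $1 - \text{IoU}$ comes from the center-distance term, since the two aspect ratios are nearly equal.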

GIoU (Generalized IoU)

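$$\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}$$

where $C$ is the smallest convex shape enclosing both A and B; for axis-aligned boxes this is simply the smallest enclosing box. GIoU provides gradients even when boxes do not overlap (IoU = 0).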

python
import torch


def giou_loss(pred_boxes, target_boxes):
    """Compute GIoU loss between predicted and target boxes."""
    # Compute IoU
    inter_x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    inter_y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    inter_x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    inter_y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])

    inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * torch.clamp(inter_y2 - inter_y1, min=0)
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    target_area = (target_boxes[:, 2] - target_boxes[:, 0]) * (target_boxes[:, 3] - target_boxes[:, 1])
    union = pred_area + target_area - inter_area
    iou = inter_area / (union + 1e-6)

    # Compute enclosing box
    enc_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    enc_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    enc_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    enc_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3])
    enc_area = (enc_x2 - enc_x1) * (enc_y2 - enc_y1)

    giou = iou - (enc_area - union) / (enc_area + 1e-6)
    return 1 - giou.mean()

Data Annotation for Detection

Annotation Formats

| Format | Used By | Structure |
|---|---|---|
| PASCAL VOC | VOC dataset | XML per image |
| COCO JSON | COCO dataset | Single JSON for all images |
| YOLO | Ultralytics | TXT per image: class cx cy w h |
| Label Studio | Generic | JSON with bounding boxes |

YOLO Label Format

Each image has a corresponding .txt file with one line per object:

# class_id center_x center_y width height (all normalized 0-1)
0 0.5 0.4 0.3 0.6
1 0.2 0.7 0.15 0.2
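Reading one label line back into pixel coordinates looks roughly like this. The function name `parse_yolo_label` and the 640x480 image size are assumptions for illustration.

```python
def parse_yolo_label(line, img_w, img_h):
    """Parse 'class cx cy w h' (normalized) into (class_id, [x1, y1, x2, y2]) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return class_id, [x1, y1, x2, y2]


class_id, box = parse_yolo_label("0 0.5 0.4 0.3 0.6", img_w=640, img_h=480)
print(class_id, [round(v, 1) for v in box])  # 0 [224.0, 48.0, 416.0, 336.0]
```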

Converting Between Formats

python
def voc_to_yolo(voc_box, img_w, img_h):
    """Convert VOC [xmin, ymin, xmax, ymax] to YOLO [cx, cy, w, h] (normalized)."""
    xmin, ymin, xmax, ymax = voc_box
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return [cx, cy, w, h]

def yolo_to_voc(yolo_box, img_w, img_h):
    """Convert YOLO [cx, cy, w, h] to VOC [xmin, ymin, xmax, ymax]."""
    cx, cy, w, h = yolo_box
    xmin = (cx - w / 2) * img_w
    ymin = (cy - h / 2) * img_h
    xmax = (cx + w / 2) * img_w
    ymax = (cy + h / 2) * img_h
    return [xmin, ymin, xmax, ymax]

def coco_to_voc(coco_box):
    """Convert COCO [x, y, w, h] to VOC [xmin, ymin, xmax, ymax]."""
    x, y, w, h = coco_box
    return [x, y, x + w, y + h]

Feature Pyramid Network (FPN)

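FPN (Lin et al., 2017) constructs multi-scale feature maps for detecting objects of different sizes. It combines the backbone's bottom-up feature hierarchy with a top-down pathway and lateral connections, so that every pyramid level carries semantically strong features.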

Each level of the pyramid handles a different object scale. Small objects are detected at high-resolution, low-level features; large objects at low-resolution, high-level features.
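The top-down merge at the heart of FPN can be sketched in NumPy. This is a simplified illustration under stated assumptions: the input maps already have the same channel count (in FPN a 1x1 lateral convolution does this), each map is exactly half the resolution of the previous one, and the 3x3 smoothing convolution applied after merging is omitted. The function names are hypothetical.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(features):
    """Merge backbone features, listed from fine to coarse, into pyramid levels.

    Starts from the coarsest map and repeatedly adds the upsampled result
    to the next finer lateral feature map.
    """
    merged = [features[-1]]
    for lateral in reversed(features[:-1]):
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]  # back to fine-to-coarse order


c3 = np.ones((4, 32, 32))  # high resolution, low-level features
c4 = np.ones((4, 16, 16))
c5 = np.ones((4, 8, 8))    # low resolution, high-level features
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # (4, 32, 32) (4, 16, 16) (4, 8, 8)
```

Each output level keeps the resolution of its input but now also contains information passed down from the coarser, more semantic levels.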

Common Pitfalls

| Mistake | Symptom | Fix |
|---|---|---|
| Wrong annotation format | Model trains but mAP is 0 | Verify box coordinates and class IDs |
| Images not resized | CUDA OOM | Resize to 640x640 for YOLO |
| Forgetting NMS | Duplicate detections | Apply NMS (IoU threshold 0.5) |
| Too few anchors | Missing small/large objects | Use FPN or multi-scale anchors |
| Imbalanced classes | Model ignores rare classes | Use focal loss or oversampling |
| Train/val leakage | Inflated mAP | Ensure no duplicate images across splits |

"What I cannot create, I do not understand." — Richard Feynman