Architecture Selection Guide

Choosing an architecture is one of the most consequential early decisions in a deep learning project: a poor fit can waste weeks of training time. This guide provides decision trees and comparison tables to help you pick a model for your task.

Master Decision Tree

Architecture Comparison by Task

Image Tasks

| Task | Recommended | Alternative | Notes |
| --- | --- | --- | --- |
| Image Classification | EfficientNet-B0/B2 | ViT (if pretrained) | CNN wins on small data, ViT wins with pretraining |
| Fine-Grained Classification | ViT + fine-tune | EfficientNet with attention | Needs strong augmentation |
| Object Detection | YOLOv8 (real-time) | Faster R-CNN (accuracy) | DETR for end-to-end |
| Instance Segmentation | Mask R-CNN | YOLO-seg | SAM for zero-shot |
| Semantic Segmentation | DeepLabv3+ | U-Net (medical) | SegFormer for efficiency |
| Image Generation | Stable Diffusion | StyleGAN3 | Diffusion models dominate |
| Super Resolution | ESRGAN | SwinIR | Real-ESRGAN for practical use |
| Style Transfer | AdaIN | Neural style transfer | Fast inference with feedforward |

NLP Tasks

| Task | Recommended | Alternative | Notes |
| --- | --- | --- | --- |
| Text Classification | DeBERTa-v3-base | RoBERTa | Fine-tune on your data |
| Named Entity Recognition | DeBERTa + token cls | spaCy (for speed) | Use BIO tagging |
| Question Answering | DeBERTa-v3 | BERT-large | Extractive QA |
| Text Generation | LLaMA 3 / Mistral | GPT-2 (smaller) | Use RLHF/DPO for alignment |
| Summarization | BART / T5 | LLaMA (few-shot) | Encoder-decoder preferred |
| Translation | NLLB / mBART | T5 | Multilingual models |
| Sentiment Analysis | DistilBERT (fast) | DeBERTa (accurate) | Often fine-tune BERT |
| Semantic Search | all-MiniLM-L6-v2 | E5-large | Sentence-transformers |
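
The NER row recommends BIO tagging. As a minimal sketch of what that encoding looks like, the helper below turns entity spans into per-token tags. The function name `spans_to_bio` and the `(start, end, label)` span format are assumptions made for illustration; they do not come from any particular library.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans to BIO tags.

    `end` is exclusive. Overlapping spans are not handled in this sketch.
    """
    tags = ["O"] * len(tokens)          # "O" = outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags


tokens = ["Ada", "Lovelace", "visited", "London"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")])
print(list(zip(tokens, tags)))
```

A token-classification head then predicts one of these tags per token, which is why the tag set (not the span list) determines the size of the output layer.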

Sequence Tasks

| Task | Recommended | Alternative | Notes |
| --- | --- | --- | --- |
| Time Series Forecast | Transformer | LSTM | PatchTST for long series |
| Speech Recognition | Whisper | Wav2Vec2 | Whisper is multilingual |
| Music Generation | Transformer | WaveNet | MusicGen |
| Anomaly Detection | Autoencoder | LSTM | VAE for probabilistic |

Structured Data

| Task | Recommended | Alternative | Notes |
| --- | --- | --- | --- |
| Tabular Classification | XGBoost / LightGBM | CatBoost | DL rarely wins on tabular |
| Tabular Regression | XGBoost | TabNet (if DL needed) | Feature engineering matters more |
| Recommender Systems | Two-tower + ANN | Graph NN | Matrix factorization baseline |
| Molecular Property | GNN (SchNet, DimeNet) | SMILES + Transformer | 3D-aware GNNs best |

CNN vs RNN vs Transformer vs GNN

Fundamental Comparison

| Property | CNN | RNN/LSTM | Transformer | GNN |
| --- | --- | --- | --- | --- |
| Input structure | Grid (images) | Sequence | Sequence or set | Graph |
| Key operation | Convolution | Recurrence | Self-attention | Message passing |
| Parallelizable | Yes | No (sequential) | Yes | Partially |
| Long-range deps | Limited (receptive field) | Moderate (LSTM) | Excellent | K-hop neighborhood |
| Complexity per layer | $O(k^2 C^2 HW)$ | $O(n h^2)$ | $O(n^2 d)$ | $O(Ed)$ |
| Inductive bias | Locality, translation equivariance | Sequential, temporal | Minimal (learned from data) | Permutation invariance |
| Data efficiency | Good (strong bias) | Moderate | Poor (needs lots of data) | Depends on graph size |
| Training speed | Fast | Slow | Fast (parallel) | Moderate |
| Best for | Images, spatial data | Short sequences | Text, long sequences | Graphs, molecules |

When Each Architecture Wins

CNN wins when:

  • Data has spatial structure (images, spectrograms)
  • Dataset is small (CNN's inductive bias helps)
  • Real-time inference needed (efficient on hardware)
  • Translation equivariance is desired

RNN/LSTM wins when:

  • Processing streaming data (one token at a time)
  • Memory budget is very tight (O(1) per step vs O(n) for attention)
  • Sequence is short and simple

Transformer wins when:

  • Long-range dependencies matter
  • Large amounts of training data available
  • Parallelism is important (GPU utilization)
  • Pretrained models exist for the domain
  • Default choice for 2026

GNN wins when:

  • Data is naturally a graph (molecules, social networks, knowledge bases)
  • Relationships between entities matter more than entity features
  • Variable-size inputs with arbitrary connectivity
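
To make "message passing" and the "K-hop neighborhood" limit concrete, here is a minimal sketch of one aggregation round on a toy graph. It uses plain mean aggregation with no learned weights, which is a simplification of what real GNN layers do; the function name and the scalar node features are assumptions for illustration.

```python
def message_passing_step(features, edges):
    """One round of mean aggregation over undirected edges."""
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, value in features.items():
        # Each node averages its neighbors' features together with its own.
        messages = [features[n] for n in neighbors[node]] + [value]
        updated[node] = sum(messages) / len(messages)
    return updated


# Path graph a - b - c, with a signal only on node "a".
features = {"a": 1.0, "b": 0.0, "c": 0.0}
edges = [("a", "b"), ("b", "c")]

one_hop = message_passing_step(features, edges)
two_hop = message_passing_step(one_hop, edges)
print(one_hop["c"], two_hop["c"])
```

Node `c` sees nothing from `a` after one round and only picks up the signal after two, so a K-layer network of this kind can only combine information from nodes at most K hops apart.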

Model Size Guidelines

By Dataset Size

| Training Samples | Max Model Size | Approach |
| --- | --- | --- |
| < 500 | Pretrained (freeze backbone) | Transfer learning |
| 500 -- 5K | Pretrained (fine-tune) | BERT-base, ResNet-50 |
| 5K -- 50K | Medium (10-100M params) | Train from scratch or fine-tune |
| 50K -- 500K | Large (100M-1B params) | Full training viable |
| > 500K | Very large (1B+) | Scale up aggressively |

By Compute Budget

| Budget | Example Workload | Typical Use |
| --- | --- | --- |
| 1 GPU-hour | ResNet-18 on CIFAR-10 | Quick experiment |
| 10 GPU-hours | ResNet-50 on custom dataset | Serious project |
| 100 GPU-hours | BERT fine-tuning | Production NLP |
| 1000 GPU-hours | Train medium LM | Research |
| 10K+ GPU-hours | Train large LM | Company-scale |

Decision Checklist

Before choosing an architecture, answer these questions:

  1. What is my input modality? (image, text, tabular, graph, multimodal)
  2. What is my output? (class label, bounding box, generated text, embedding)
  3. How much data do I have? (determines if you can train from scratch)
  4. What is my latency requirement? (real-time vs batch)
  5. What is my compute budget? (GPU hours available)
  6. Does a pretrained model exist? (almost always start here)
  7. What are my accuracy requirements? (determines model size)
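
One way to keep the checklist honest is to write the answers down as data before picking a model. The sketch below is a hypothetical encoding: the `ProjectSpec` fields mirror the seven questions, and `starting_point` applies a few of this guide's rules of thumb (gradient-boosted trees for tabular data, pretrained models first, frozen backbones for very small datasets). The names and the exact rule order are assumptions, not a definitive procedure.

```python
from dataclasses import dataclass


@dataclass
class ProjectSpec:
    modality: str               # "image", "text", "tabular", "graph", "multimodal"
    output: str                 # "class label", "bounding box", "embedding", ...
    num_samples: int            # labeled training examples
    realtime: bool              # latency requirement
    gpu_hours: float            # compute budget
    pretrained_available: bool  # does a pretrained model exist for the domain?
    accuracy_critical: bool     # pushes toward larger models


def starting_point(spec):
    """Return a first architecture to try; a heuristic, not a guarantee."""
    if spec.modality == "tabular":
        return "gradient-boosted trees (XGBoost / LightGBM)"
    if spec.pretrained_available:
        if spec.num_samples < 500:
            return "pretrained model, frozen backbone"
        return "pretrained model, fine-tuned"
    if spec.modality == "graph":
        return "GNN trained from scratch"
    if spec.modality == "image":
        return "CNN trained from scratch"
    return "Transformer trained from scratch"


spec = ProjectSpec("image", "class label", 300, True, 1.0, True, False)
print(starting_point(spec))
```

The latency, budget, and accuracy fields are recorded but not used by this simple rule set; in practice they decide which size variant of the chosen family you pick.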

Quick Picks for 2026

| Scenario | Just Use This |
| --- | --- |
| Image classification | `torchvision.models.efficientnet_b0(weights='DEFAULT')` |
| Text classification | `AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-base')` |
| Object detection | `YOLO('yolov8n.pt')` |
| Text generation | Fine-tuned LLaMA 3 or Mistral |
| Semantic search | `SentenceTransformer('all-MiniLM-L6-v2')` |
| Image generation | Stable Diffusion XL + LoRA |
| Speech-to-text | whisper-large-v3 |
| Tabular data | XGBoost (not deep learning) |

Complexity and Memory Analysis

Understanding the computational and memory cost of each architecture is critical for choosing the right model under hardware constraints.

FLOPs per Layer

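| Layer Type | FLOPs (1 multiply-add = 2 FLOPs) | Memory |
| --- | --- | --- |
| Linear ($n \times m$) | $2nm$ | $nm$ params |
| Conv2d ($C_{in}, C_{out}, k$) | $2k^2 C_{in} C_{out} HW$ | $k^2 C_{in} C_{out}$ params |
| Self-Attention ($n$ tokens, $d$ dim) | $8nd^2 + 4n^2 d$ | $4d^2$ params, $n^2$ attention matrix |
| LSTM ($d$ hidden) | $16d^2$ per step | $8d^2$ params |
| GCN ($d$ features, $E$ edges) | $2Ed + 2nd^2$ | $d^2$ params |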
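
Self-attention has two cost terms: the projections, which grow linearly in sequence length, and the attention scores, which grow quadratically. The sketch below counts multiply-accumulates (double the result for FLOPs) to show where the quadratic term takes over. The function name and the choice of $d = 768$ are assumptions for illustration.

```python
def attention_macs(n, d):
    """Multiply-accumulates for one self-attention layer.

    projections: Q, K, V and output projections -> 4 * n * d^2
    attention:   Q @ K^T and weights @ V        -> 2 * n^2 * d
    """
    projections = 4 * n * d * d
    attention = 2 * n * n * d
    return projections, attention


for n in (128, 512, 2048, 8192):
    proj, attn = attention_macs(n, d=768)
    share = attn / (proj + attn)
    print(f"n={n:5d}  quadratic term share: {share:.0%}")
```

The two terms are equal at $n = 2d$. Below that length the projections dominate and attention behaves like a stack of linear layers; well above it, the $n^2$ term is what you pay for.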

Memory Budget Calculator

python
def estimate_memory(num_params, batch_size=32, seq_len=512,
                    precision='fp32', hidden_dim=4096):
    """Estimate GPU memory (in bytes) for training with Adam.

    Components:
    1. Model parameters
    2. Gradients (same size as params)
    3. Optimizer states (Adam: 2x params for m and v)
    4. Activations (depends on batch size and architecture)

    Not counted: the fp32 master copy of the weights that mixed-precision
    training usually keeps, and framework overhead.
    """
    bytes_per_param = {'fp32': 4, 'fp16': 2, 'bf16': 2, 'int8': 1}[precision]

    # Parameters + gradients
    param_memory = num_params * bytes_per_param
    grad_memory = num_params * bytes_per_param

    # Adam optimizer states (always fp32)
    optimizer_memory = num_params * 4 * 2  # m and v in fp32

    # Rough activation estimate (varies hugely by architecture):
    # one hidden_dim-wide tensor per token, ignoring depth and checkpointing.
    activation_memory = batch_size * seq_len * hidden_dim * bytes_per_param

    total = param_memory + grad_memory + optimizer_memory + activation_memory
    print(f"Parameters:  {param_memory / 1e9:.2f} GB")
    print(f"Gradients:   {grad_memory / 1e9:.2f} GB")
    print(f"Optimizer:   {optimizer_memory / 1e9:.2f} GB")
    print(f"Activations: {activation_memory / 1e9:.2f} GB (estimate)")
    print(f"Total:       {total / 1e9:.2f} GB")
    return total

# ResNet-50: ~25M params
estimate_memory(25e6, batch_size=32)

# BERT-base: ~110M params
estimate_memory(110e6, batch_size=16, seq_len=512)

# LLaMA-7B: ~7B params
estimate_memory(7e9, batch_size=1, seq_len=2048, precision='fp16')
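
The same accounting can be inverted to answer the more common question: how large a model fits on the GPU you have? This is a rough sketch under the same assumptions as `estimate_memory` (Adam, fp32 optimizer states, no fp32 master weights); `max_trainable_params` and its `activation_gb` argument are hypothetical names.

```python
def max_trainable_params(gpu_gb, precision="fp32", activation_gb=0.0):
    """Largest parameter count that fits in `gpu_gb` when training with Adam."""
    bytes_per_value = {"fp32": 4, "fp16": 2, "bf16": 2}[precision]
    # weights + gradients in the chosen precision, plus Adam m and v in fp32
    bytes_per_param = 2 * bytes_per_value + 8
    usable_bytes = (gpu_gb - activation_gb) * 1e9
    return max(0, int(usable_bytes // bytes_per_param))


print(f"{max_trainable_params(24, 'fp16') / 1e9:.1f}B params on a 24 GB card")
```

With these assumptions training costs 16 bytes per parameter in fp32 and 12 in fp16 or bf16 before any activations, which is why full fine-tuning of a 7B model does not fit on a single consumer GPU.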

Inference Latency Comparison

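| Model | Parameters | GPU Latency | CPU Latency | Mobile Latency |
| --- | --- | --- | --- | --- |
| MobileNetV3-S | 2.5M | 1 ms | 15 ms | 5 ms |
| ResNet-50 | 25M | 4 ms | 80 ms | N/A |
| EfficientNet-B0 | 5M | 3 ms | 50 ms | 20 ms |
| ViT-B/16 | 86M | 8 ms | 200 ms | N/A |
| BERT-base | 110M | 10 ms | 300 ms | N/A |
| DistilBERT | 66M | 6 ms | 150 ms | 80 ms |
| YOLOv8n | 3M | 2 ms | 30 ms | 15 ms |
| Whisper-small | 244M | 50 ms per second of audio | N/A | N/A |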
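
Latency figures like these depend heavily on hardware, batch size, and input size, so treat them as orders of magnitude and measure on your own target. A minimal timing harness might look like the sketch below; `measure_latency_ms` is a hypothetical helper, and the lambda stands in for a real model call.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=10, runs=100):
    """Time a zero-argument callable; returns median and p95 in milliseconds."""
    for _ in range(warmup):      # let caches, JITs, and allocators settle
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }


# Stand-in workload; replace with e.g. lambda: model(example_input)
stats = measure_latency_ms(lambda: sum(range(10_000)))
print(stats)
```

For GPU models, kernels run asynchronously, so synchronize inside `fn` (for PyTorch, call `torch.cuda.synchronize()` after the forward pass); otherwise the timer only measures how long it took to queue the work.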

Architecture Anti-Patterns

Common mistakes when choosing architectures:

1. Using Deep Learning for Tabular Data

Symptom: You have a CSV with 50 features and 10K rows.

Wrong: Train a 5-layer MLP with batch norm and dropout.

Right: Use XGBoost or LightGBM. They almost always win on tabular data, require no GPU, and train in seconds. Only consider deep learning for tabular if you have >100K rows AND complex feature interactions.

2. Training ViT from Scratch on Small Data

Symptom: You have 5K images and train ViT-B from scratch.

Wrong: Training ViT-B from random initialization. ViT lacks the built-in spatial locality bias of a CNN, so it needs far more data to learn what a convolution gets for free.

Right: Use a pretrained ViT or use a CNN (ResNet, EfficientNet) which has built-in spatial bias.

3. Using RNNs for Long Sequences in 2026

Symptom: You have 2000-token sequences and use a bidirectional LSTM.

Wrong: Relying on an LSTM to carry information across the whole sequence. Even with gating, LSTMs struggle to use context beyond a few hundred tokens in practice.

Right: Use a transformer. Flash Attention makes long contexts practical.
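
The reason long contexts were impractical is the attention score matrix, which grows with the square of the sequence length when it is stored in full. Flash Attention computes attention in tiles and never materializes that matrix. The sketch below estimates the size being avoided; the defaults (12 heads, 2 bytes per value for fp16) are assumptions chosen for illustration.

```python
def attention_matrix_bytes(seq_len, num_heads=12, batch_size=1, bytes_per_value=2):
    """Bytes needed to store the full attention score matrix for one layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


for seq_len in (512, 2_000, 32_768):
    size_mb = attention_matrix_bytes(seq_len) / 1e6
    print(f"{seq_len:6d} tokens -> {size_mb:10.1f} MB per layer per sample")
```

At 2000 tokens the matrix is still manageable, but the cost is per layer and per sample, and it grows 4x each time the sequence length doubles.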

4. Ignoring Pretrained Models

Symptom: You train everything from scratch.

Wrong: Training from random initialization when pretrained weights exist.

Right: Always check HuggingFace Model Hub, torchvision, or timm first. Transfer learning is almost always better.

5. Over-Engineering the Architecture

Symptom: Custom attention mechanisms, novel activation functions, bespoke normalization.

Wrong: Inventing new components before a standard baseline exists. Architecture novelty rarely matters as much as data quality and training recipe.

Right: Use a standard architecture (ResNet, BERT, ViT) with proper training techniques. Only innovate on architecture if you have a specific, measurable reason.

Real-World Architecture Choices

Autonomous Driving

| Component | Architecture | Why |
| --- | --- | --- |
| Object detection | BEVFormer / DETR3D | 3D detection from cameras |
| Lane detection | CNN + curve fitting | Fast, reliable |
| Traffic sign | EfficientNet | Small, accurate |
| Depth estimation | MiDaS (ViT) | Monocular depth |
| Planning | Transformer | Sequence decision making |

Medical Imaging

| Task | Architecture | Why |
| --- | --- | --- |
| X-ray classification | DenseNet-121 (CheXpert) | Transfer learning, well-studied |
| Tumor segmentation | U-Net + attention | Skip connections for fine detail |
| Pathology | ViT-H (pretrained) | Large images, patch-based |
| Retinal screening | EfficientNet + CAM | Interpretability needed |

Recommendation Systems

| Component | Architecture | Why |
| --- | --- | --- |
| User/item embeddings | Two-tower (MLP) | Efficient retrieval |
| Ranking | DeepFM / DCN | Feature interactions |
| Sequential | Transformer (SASRec) | Session-based |
| Graph-based | GNN (PinSage) | Social signals |

Search and Retrieval

| Component | Architecture | Why |
| --- | --- | --- |
| Text encoding | Sentence-BERT / E5 | Dense retrieval |
| Image encoding | CLIP | Cross-modal search |
| Re-ranking | Cross-encoder (DeBERTa) | Accurate but slow |
| Multimodal | CLIP + fusion | Image + text queries |
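
The encoder and re-ranker rows are usually combined into a two-stage pipeline: a cheap vector search over everything, then an expensive scorer over the few survivors. Here is a minimal sketch with toy data. The 2-d vectors stand in for bi-encoder embeddings, and the `cross_scores` dictionary is a hypothetical substitute for a real cross-encoder; neither reflects an actual model's output.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, k=3):
    """Stage 1: vector similarity over all documents, keep the top k.
    Stage 2: re-order only those k with the slower, more accurate scorer."""
    candidates = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )[:k]
    return sorted(candidates, key=rerank_score, reverse=True)


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
cross_scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.4}  # made-up relevance scores
ranked = retrieve_then_rerank([1.0, 0.0], docs, cross_scores.get, k=2)
print(ranked)
```

The re-ranker only ever sees `k` documents, which is what makes an "accurate but slow" model affordable at query time. In production the first stage would be an approximate nearest-neighbor index, not a full sort.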

Choosing a Pretrained Model

HuggingFace Model Selection

python
# Text classification
from transformers import pipeline

# Quick selection guide:
# Speed-critical: distilbert-base-uncased
# Accuracy-critical: microsoft/deberta-v3-large
# Balanced: microsoft/deberta-v3-base

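# Note: this checkpoint ships without a trained classification head, so the
# head is randomly initialized. Fine-tune on your labels before trusting outputs.
classifier = pipeline("text-classification",
                      model="microsoft/deberta-v3-base")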

# Image classification
import timm

# Speed-critical: mobilenetv3_small_100
# Accuracy-critical: eva02_large_patch14_448
# Balanced: efficientnet_b0

model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=10)

# Object detection
from ultralytics import YOLO

# Speed-critical: yolov8n (nano)
# Balanced: yolov8m (medium)
# Accuracy-critical: yolov8x (extra-large)

detector = YOLO('yolov8m.pt')  # separate name so the timm model above is not overwritten

Benchmark Resources

| Resource | What It Measures |
| --- | --- |
| Papers With Code | SOTA on standardized benchmarks |
| Hugging Face Open LLM Leaderboard | LLM benchmark comparisons |
| timm leaderboard | Vision model accuracy vs speed |
| MTEB Leaderboard | Sentence embedding quality |


"What I cannot create, I do not understand." — Richard Feynman