Design Content Moderation System

Content moderation is the invisible shield between users and the worst content the internet can produce. Every platform with user-generated content (social media, marketplaces, forums, dating apps) needs a moderation system that balances safety with free expression, scales to billions of pieces of content, and makes defensible decisions within milliseconds to seconds.

This is a system where both failure modes are catastrophic: missing harmful content causes real-world harm; over-moderating silences legitimate speech and drives users away.

1. Problem Statement & Requirements

Functional Requirements

Multi-modal classification — Classify text, images, and video for policy violations
Policy engine — Configurable rules that map classification signals to actions
Pre-publish screening — Score content before it goes live (latency-sensitive)
Post-publish scanning — Retroactively scan existing content when policies change
Human review queue — Route borderline cases to trained human reviewers
Appeal flow — Users can appeal moderation decisions, triggering re-review
Escalation — Severe content (CSAM, terrorism) triggers immediate action and legal reporting
Audit trail — Every decision is logged with reasoning, model version, and reviewer ID
Transparency — Users see why their content was removed and how to appeal

Non-Functional Requirements

Latency — Pre-publish scoring < 500 ms (text), < 2 s (image), < 30 s (video)
Throughput — 500K pieces of content per minute at peak
Availability — 99.99% (content creation path cannot be blocked)
Accuracy — Precision > 95% for auto-removal (very few false positives)
Recall — > 99% for severe categories (CSAM, terrorism)
Consistency — Same content gets the same decision regardless of timing or reviewer
Adaptability — New policy categories deployable within hours

Clarifying Questions

Questions to Ask

What content types? Text, images, video, audio, live streams?
What are the policy categories? (Hate speech, nudity, violence, spam, copyright, CSAM, terrorism)
Is this pre-publish (before content goes live) or post-publish (retroactive)?
What is the reviewer workforce size and cost per review?
What jurisdictions do we operate in? (Different laws per country)
What is the SLA for human review turnaround?
Do we need to handle adversarial attacks (text obfuscation, steganography)?

2. Back-of-Envelope Estimation

Traffic

500K content items per minute at peak = ~8,300 per second
Mix: 60% text, 30% images, 10% video
Text: 5K/s, Images: 2.5K/s, Video: 830/s

Daily content volume = 500K × 60 × 24 = 720M items/day (an upper bound, since it assumes the peak rate holds all day)

ML Inference

Text classifier: ~5 ms per item (GPU), 5K QPS = ~25 GPUs unbatched, far fewer with batching
Image classifier: ~50 ms per image, 2.5K QPS = ~125 GPUs needed
Video: sample 1 frame/second, 30-second average video = 30 frames per video
- 830 videos/s x 30 frames = 24.9K image classifications/s additional

Total image classifications = 2,500 + 24,900 = 27,400 /second (~1,370 GPUs at 50 ms per image, unbatched)
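The arithmetic above can be reproduced with a short script. This is a rough sketch that assumes each GPU serves one request at a time (no batching), so the GPU count is simply QPS times per-item latency. The function and constant names are illustrative and not part of any real capacity-planning tool. Using the exact 10% video share gives ~27,500 image inferences per second; the text rounds video to 830/s and therefore lands on 27,400.

```python
import math

PEAK_ITEMS_PER_MIN = 500_000
MIX = {"text": 0.60, "image": 0.30, "video": 0.10}
FRAMES_PER_VIDEO = 30  # 1 frame/s over a 30-second average video


def gpus_needed(qps: float, latency_s: float) -> int:
    """GPUs required if each GPU handles one request at a time (assumption)."""
    # round() guards against float noise such as 25.000000000000004
    return math.ceil(round(qps * latency_s, 6))


peak_qps = PEAK_ITEMS_PER_MIN / 60
qps = {kind: peak_qps * share for kind, share in MIX.items()}
frame_qps = qps["video"] * FRAMES_PER_VIDEO        # sampled video frames
image_inferences = qps["image"] + frame_qps        # images + video frames

print(f"text GPUs:  {gpus_needed(qps['text'], 0.005)}")
print(f"image GPUs: {gpus_needed(image_inferences, 0.050)}")
```

Batching several items per forward pass would reduce these counts substantially, at the cost of some added latency.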

Human Review

Assume 2% of content needs human review
720M x 2% = 14.4M reviews/day
At 100 reviews/reviewer/day = 144K reviewers needed (for scale, Meta has reported roughly 15K content reviewers within a safety and security workforce of about 40K)
More realistically: use ML to triage priority, only review highest-impact content
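To see how strongly headcount depends on the share of content routed to humans, the estimate can be parameterized. This is a minimal sketch using the figures above; the alternative review rates (0.5%, 0.1%) are illustrative assumptions, not targets from the text.

```python
import math

DAILY_ITEMS = 720_000_000
REVIEWS_PER_REVIEWER_PER_DAY = 100


def reviewers_needed(review_rate: float) -> int:
    """Headcount required to clear one day's review volume within a day."""
    daily_reviews = DAILY_ITEMS * review_rate
    # round() removes float noise before taking the ceiling
    return math.ceil(round(daily_reviews / REVIEWS_PER_REVIEWER_PER_DAY, 6))


for rate in (0.02, 0.005, 0.001):
    print(f"review rate {rate:.1%}: {reviewers_needed(rate):,} reviewers")
```

Each halving of the review rate halves the workforce, which is why ML triage of the queue matters more than reviewer throughput.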

Storage

Decision log per item: ~1 KB (scores, policy, action, reviewer, timestamp)
Daily: 720M x 1 KB = 720 GB/day
1-year retention: ~263 TB
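The storage figure follows directly from the per-item log size. A small sketch, assuming decimal units (1 KB = 1,000 bytes) and ignoring replication, indexes, and compression:

```python
DAILY_ITEMS = 720_000_000
BYTES_PER_DECISION = 1_000  # ~1 KB per decision log entry (decimal units)


def log_storage_tb(retention_days: int) -> float:
    """Raw decision-log size in TB, before replication or compression."""
    return DAILY_ITEMS * BYTES_PER_DECISION * retention_days / 1e12


print(f"1 day:  {log_storage_tb(1):.2f} TB")
print(f"1 year: {log_storage_tb(365):.1f} TB")
```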

3. High-Level Design

API Design

typescript

// POST /api/v1/content/moderate
interface ModerationRequest {
  contentId: string;
  contentType: 'text' | 'image' | 'video';
  text?: string;
  mediaUrls?: string[];
  userId: string;
  context: {
    platform: string;
    contentContext: 'post' | 'comment' | 'message' | 'profile';
    targetAudience?: 'public' | 'friends' | 'private';
  };
}

interface ModerationResponse {
  contentId: string;
  decision: 'ALLOW' | 'REMOVE' | 'REVIEW' | 'ESCALATE';
  scores: {
    category: string;       // 'hate_speech', 'nudity', 'violence', etc.
    score: number;          // 0.0 - 1.0
    subcategory?: string;   // 'racial_slur', 'explicit_nudity'
  }[];
  policyViolations: string[];
  appealEligible: boolean;
  explanation: string;       // Human-readable reason
}

// POST /api/v1/content/{contentId}/appeal
interface AppealRequest {
  contentId: string;
  userId: string;
  reason: string;
}
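On the server side, the request shape above implies some basic validation before any classifier runs. The following Python sketch mirrors `ModerationRequest` with snake_case field names; the specific rules (text content needs `text`, image and video content need at least one media URL) are assumptions for illustration, since the interface itself marks both fields optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CONTENT_TYPES = {"text", "image", "video"}


@dataclass
class ModerationRequest:
    content_id: str
    content_type: str
    user_id: str
    text: Optional[str] = None
    media_urls: List[str] = field(default_factory=list)


def validate(req: ModerationRequest) -> List[str]:
    """Return validation errors; an empty list means the request is usable."""
    errors = []
    if req.content_type not in CONTENT_TYPES:
        errors.append(f"unknown contentType: {req.content_type!r}")
    # Assumed rule: text items must carry text
    if req.content_type == "text" and not req.text:
        errors.append("text content requires a non-empty 'text' field")
    # Assumed rule: media items must carry at least one URL
    if req.content_type in {"image", "video"} and not req.media_urls:
        errors.append("image/video content requires at least one media URL")
    return errors
```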

4. Deep Dive: ML Classification Pipeline

Text Classification

Text moderation must handle:

Direct hate speech ("I hate [group]")
Coded language and dogwhistles ("1488", "globalists")
Context-dependent meaning ("kill it" in gaming vs. threat)
Multilingual content (200+ languages)
Adversarial evasion ("h@te", "s p a c e d", Unicode homoglyphs)

python

class TextModerationPipeline:
    """Multi-stage text classification pipeline."""

    def __init__(self):
        self.normalizer = TextNormalizer()
        self.lang_detector = LanguageDetector()
        self.toxicity_model = load_model("toxicity_multilingual_v5")
        self.spam_model = load_model("spam_classifier_v3")

    def classify(self, text: str) -> dict:
        # Step 1: Normalize adversarial text
        normalized = self.normalizer.normalize(text)
        # "h@t3 sp33ch" → "hate speech"
        # Unicode homoglyphs resolved
        # Zero-width characters stripped

        # Step 2: Detect language
        lang = self.lang_detector.detect(normalized)

        # Step 3: Run classifiers
        toxicity_scores = self.toxicity_model.predict(normalized)
        # Returns: {
        #   'hate': 0.92, 'harassment': 0.15,
        #   'violence': 0.03, 'sexual': 0.01,
        #   'self_harm': 0.00
        # }

        spam_score = self.spam_model.predict(normalized)

        return {
            'language': lang,
            'toxicity': toxicity_scores,
            'spam': spam_score,
            'text_length': len(text),
            'normalized_text': normalized,
        }


class TextNormalizer:
    """Defeat common text-based evasion techniques."""

    def normalize(self, text: str) -> str:
        # Order matters: canonicalize Unicode first so later steps see plain characters.
        text = self.normalize_unicode(text)       # NFKC: also folds fullwidth/compatibility forms
        text = self.remove_zero_width(text)       # strip ZWJ, ZWNJ, ZWSP
        text = self.resolve_homoglyphs(text)      # Cyrillic а → Latin a
        text = self.decode_leetspeak(text)        # h@t3 → hate
        text = self.collapse_whitespace(text)     # "h a t e" → "hate" (spaced-out letters only)
        return text
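The normalizer's helper methods are not shown in the pipeline above. One possible standalone version, using only the standard library, might look like the sketch below. The mapping tables are tiny illustrative subsets, and the spaced-letter rule (three or more single letters separated by spaces) is an assumption chosen to avoid gluing ordinary words together.

```python
import re
import unicodedata

# Illustrative subsets only; real tables are far larger.
HOMOGLYPHS = str.maketrans({"\u0430": "a", "\u0435": "e", "\u043e": "o"})
LEET = str.maketrans({"@": "a", "3": "e", "0": "o", "1": "i", "$": "s"})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
# A run of three or more single letters separated by single spaces.
SPACED_LETTERS = re.compile(r"\b(?:[a-z] ){2,}[a-z]\b")


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)    # fold fullwidth/compat forms
    text = ZERO_WIDTH.sub("", text)               # strip invisible characters
    text = text.lower().translate(HOMOGLYPHS)     # Cyrillic lookalikes to Latin
    text = text.translate(LEET)                   # naive leetspeak decoding
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace runs
    # Join spaced-out letters: "h a t e" becomes "hate"
    return SPACED_LETTERS.sub(lambda m: m.group().replace(" ", ""), text)


print(normalize("h@t3 sp33ch"))
```

A naive leetspeak table also rewrites legitimate digits (and numeric dog whistles such as "1488"), so a production system would likely score both the raw and the normalized text rather than the normalized form alone.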

Image Classification

Technique	Purpose	How
Perceptual hashing	Match known bad images (CSAM, terrorism)	pHash/dHash/PhotoDNA; robust to resizing and re-compression, less so to heavy cropping
CNN classifier	Detect policy violations in new images	Multi-label classification (nudity, violence, hate symbols)
OCR + text classifier	Detect hateful text in images	Extract text via OCR, run through text pipeline
Object detection	Identify weapons, drugs, logos	YOLO / Faster R-CNN for specific objects
Face detection	Age estimation for CSAM, identify public figures	Face detection + age estimation model
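To make the perceptual-hashing row concrete, here is a minimal difference-hash (dHash) sketch. It assumes the image has already been converted to grayscale and resized to 8 rows by 9 columns (done by an image library in practice), and the match threshold of 6 bits is an illustrative assumption. PhotoDNA is proprietary and is not what this code implements.

```python
from typing import Iterable, List


def dhash(gray: List[List[int]]) -> int:
    """64-bit difference hash of an 8x9 grayscale image (rows x columns)."""
    bits = 0
    for row in gray:
        # Each bit records whether a pixel is brighter than its right neighbor
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def matches_known_bad(h: int, known: Iterable[int], max_distance: int = 6) -> bool:
    """True if the hash is within max_distance bits of any known-bad hash."""
    return any(hamming(h, k) <= max_distance for k in known)


# Uniform brightening leaves neighbor comparisons, and so the hash, unchanged
img = [[(8 - c) * 10 + r for c in range(9)] for r in range(8)]
brighter = [[p + 20 for p in row] for row in img]
print(hamming(dhash(img), dhash(brighter)))
```

Because the hash encodes only relative brightness between neighbors, it tolerates resizing and global brightness changes. A linear scan over known hashes is fine for a sketch, but large hash lists need an index built for Hamming-distance search.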

Video Classification

Video is the most expensive modality — you cannot classify every frame.

python

class VideoModerationPipeline:
    """Efficient video moderation via keyframe sampling."""

    def classify(self, video_url: str) -> dict:
        # Step 1: Extract keyframes (not every frame)
        keyframes = self.extract_keyframes(video_url, strategy='scene_change')
        # Typically 1-3 frames per second of content change

        # Step 2: Classify each keyframe
        frame_scores = [self.image_classifier.classify(frame) for frame in keyframes]

        # Step 3: Extract audio and transcribe
        audio = self.extract_audio(video_url)
        transcript = self.speech_to_text(audio)
        text_scores = self.text_classifier.classify(transcript)

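        # Step 4: Aggregate by taking the max score per category across all frames
        aggregated = {}
        for category in ['nudity', 'violence', 'hate', 'gore']:
            aggregated[category] = max(
                (fs.get(category, 0) for fs in frame_scores),
                default=0,  # no keyframes extracted → no visual signal
            )

        # Merge text and visual signals. If text_classifier is the
        # TextModerationPipeline above, per-category scores are nested
        # under 'toxicity', so read 'hate' from there.
        aggregated['text_toxicity'] = text_scores.get('toxicity', {}).get('hate', 0)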

        return aggregated

    def extract_keyframes(self, video_url, strategy='scene_change'):
        """Sample frames intelligently instead of uniformly."""
        if strategy == 'scene_change':
            # Detect scene boundaries, sample 1 frame per scene
            return detect_scene_changes(video_url)
        elif strategy == 'uniform':
            # Fall back to 1 frame per second
            return sample_uniform(video_url, fps=1)
        # Fail loudly instead of silently returning None
        raise ValueError(f"unknown keyframe strategy: {strategy!r}")
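The `detect_scene_changes` helper is left undefined above. One simple way to approximate it is to keep a frame only when it differs enough from the last kept keyframe. The sketch below works on already-decoded frames represented as flat lists of grayscale pixels; that representation, the function names, and the threshold of 30 are all assumptions for illustration.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], threshold: float = 30.0) -> List[int]:
    """Return indices of frames that differ enough from the last kept keyframe."""
    if not frames:
        return []
    keep = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keep[-1]]) > threshold:
            keep.append(i)
    return keep


# Three dark frames, two bright frames, one mid-gray frame
frames = [[0] * 16] * 3 + [[200] * 16] * 2 + [[100] * 16]
print(select_keyframes(frames))
```

Comparing against the last kept keyframe, rather than the immediately preceding frame, also catches slow fades that never produce a large frame-to-frame jump.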

WARNING

Video moderation latency is measured in seconds, not milliseconds. For pre-publish moderation of video, either accept a processing delay (upload, process, then publish) or use a two-phase approach: quick screening on the first few seconds, then full analysis post-publish.

5. Deep Dive: Policy Engine

The policy engine decouples ML scores from business decisions. Models output probability scores; the policy engine maps those scores to actions based on configurable rules.

Policy Configuration

yaml

# policy_config.yaml
policies:
  - name: nudity_explicit
    category: nudity
    subcategory: explicit
    thresholds:
      auto_remove: 0.95      # Very high confidence → remove
      human_review: 0.70      # Moderate confidence → review
      allow: 0.0              # Below review threshold → allow
    context_overrides:
      - context: profile_photo
        auto_remove: 0.85     # Stricter for profile photos
      - context: private_message
        auto_remove: 0.98     # More lenient for private messages
    jurisdictions:
      - region: DE             # Germany: stricter
        auto_remove: 0.90
      - region: US
        auto_remove: 0.95

  - name: hate_speech
    category: toxicity
    subcategory: hate
    thresholds:
      auto_remove: 0.92
      human_review: 0.60
    recidivism_boost:
      prior_violations: 3     # After 3 violations
      threshold_reduction: 0.15  # Lower thresholds by 15%

  - name: csam
    category: csam
    thresholds:
      auto_remove: 0.50       # Very low threshold — always err on side of removal
      escalate: 0.50          # Always escalate for legal reporting
    legal_reporting: true       # Mandatory NCMEC report

python

class PolicyEngine:
    """Map ML scores to moderation actions."""

    def evaluate(self, scores: dict, context: dict) -> dict:
        decisions = []

        for policy in self.policies:
            category_score = scores.get(policy.category, {}).get(
                policy.subcategory, 0
            )

            # Apply context overrides
            thresholds = self.get_thresholds(policy, context)

            # Apply recidivism boost
            if policy.recidivism_boost:
                user_violations = self.get_violation_count(context['user_id'])
                if user_violations >= policy.recidivism_boost['prior_violations']:
                    thresholds = self.reduce_thresholds(
                        thresholds,
                        policy.recidivism_boost['threshold_reduction']
                    )

            # Determine action
            if category_score >= thresholds['auto_remove']:
                action = 'REMOVE'
            elif category_score >= thresholds['human_review']:
                action = 'REVIEW'
            else:
                action = 'ALLOW'

            # Escalation check
            if policy.legal_reporting and action == 'REMOVE':
                action = 'ESCALATE'

            decisions.append({
                'policy': policy.name,
                'score': category_score,
                'action': action,
                'threshold_used': thresholds,
            })

        # Final decision: most severe action wins
        severity = {'ALLOW': 0, 'REVIEW': 1, 'REMOVE': 2, 'ESCALATE': 3}
        final_action = max(
            decisions,
            key=lambda d: severity[d['action']],
            default={'action': 'ALLOW'},  # no policies configured
        )

        return {
            'decision': final_action['action'],
            'policy_violations': [d for d in decisions if d['action'] != 'ALLOW'],
            'all_scores': decisions,
        }
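The engine above calls `get_thresholds` and `reduce_thresholds` without defining them. A possible combined version over the YAML structure is sketched below. Three choices here are assumptions, not taken from the source: context overrides are applied before jurisdiction rules, the stricter (lower) threshold wins when a jurisdiction rule applies, and `threshold_reduction` is read as a relative 15% cut rather than an absolute 0.15. The context keys `content_context` and `region` are also hypothetical.

```python
from typing import Dict


def resolve_thresholds(policy: Dict, context: Dict,
                       prior_violations: int = 0) -> Dict[str, float]:
    """Resolve effective thresholds for one policy and one request context."""
    thresholds = dict(policy["thresholds"])

    # 1. Context overrides replace the base value.
    for override in policy.get("context_overrides", []):
        if override["context"] == context.get("content_context"):
            thresholds.update(
                {k: v for k, v in override.items() if k != "context"})

    # 2. Jurisdiction rules: assume the stricter (lower) threshold wins.
    for rule in policy.get("jurisdictions", []):
        if rule["region"] == context.get("region"):
            for key, value in rule.items():
                if key != "region":
                    thresholds[key] = min(thresholds.get(key, value), value)

    # 3. Recidivism boost: assume a relative reduction of every threshold.
    boost = policy.get("recidivism_boost")
    if boost and prior_violations >= boost["prior_violations"]:
        factor = 1.0 - boost["threshold_reduction"]
        thresholds = {k: round(v * factor, 4) for k, v in thresholds.items()}

    return thresholds
```

Under these assumptions, a private message in Germany resolves to an `auto_remove` threshold of 0.90: the context override raises it to 0.98, then the stricter jurisdiction rule lowers it again. Whichever precedence is chosen, it should be written down in the policy config so that reviewers and auditors can reproduce a decision.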

DANGER

CSAM (child sexual abuse material) requires mandatory legal reporting in most jurisdictions. Automated detection must have a near-zero false-negative rate. Use perceptual hashing databases (PhotoDNA, NCMEC hash lists) in addition to ML classifiers. False positives are acceptable for this category: always escalate to human review.

6. Deep Dive: Human Review System

Queue Prioritization

Not all review items are equal. Prioritize by severity and reach:

python

def compute_review_priority(content, scores, context):
    """Score review items for queue ordering."""
    base_priority = {
        'csam': 1000,
        'terrorism': 900,
        'violence_graphic': 800,
        'hate_speech': 700,
        'harassment': 600,
        'nudity': 500,
        'spam': 200,
    }

    # Start with category-based priority (highest-scoring category)
    category = max(scores, key=scores.get) if scores else None
    priority = base_priority.get(category, 100)

    # Boost for high-reach content
    reach = context.get('author_follower_count', 0)
    if reach > 1_000_000:
        priority += 200
    elif reach > 100_000:
        priority += 100
    elif reach > 10_000:
        priority += 50

    # Boost for viral content
    if context.get('engagement_velocity', 0) > 1000:  # shares/hour
        priority += 150

    # Boost for reports from multiple users
    report_count = context.get('user_report_count', 0)
    priority += min(report_count * 20, 200)

    return priority
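Once each item has a priority score, the queue itself can be a heap. A minimal in-memory sketch follows; the class and method names are hypothetical, and a real deployment would use a durable, distributed queue rather than a Python list.

```python
import heapq
import itertools
from typing import Optional


class ReviewQueue:
    """Max-priority queue; ties are served first-in, first-out."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # insertion order, used as tie-breaker

    def push(self, content_id: str, priority: int) -> None:
        # heapq is a min-heap, so negate the priority.
        heapq.heappush(self._heap, (-priority, next(self._counter), content_id))

    def pop(self) -> Optional[str]:
        """Return the highest-priority content ID, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = ReviewQueue()
queue.push("spam-1", 200)
queue.push("csam-1", 1000)
queue.push("hate-1", 700)
print(queue.pop())
```

A strict priority order can starve low-priority items indefinitely, so some form of aging (raising priority with time in queue) is usually worth adding.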

Reviewer Quality and Calibration

Metric	Target	Purpose
Inter-rater agreement	> 85% (Cohen's kappa > 0.7)	Consistency between reviewers
Agreement with gold set	> 90%	Accuracy against expert-labeled examples
Review throughput	80-120 items/day	Productivity (varies by content type)
Appeal overturn rate (per reviewer)	< 5%	Share of a reviewer's decisions reversed on appeal
Wellness check compliance	100%	Reviewer mental health (mandatory breaks)
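Raw percent agreement overstates consistency when one label dominates, which is why the table pairs it with Cohen's kappa. A small sketch of the kappa calculation for two reviewers labeling the same items; the label values are illustrative.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each reviewer's label frequencies
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_a = ["remove"] * 5 + ["allow"] * 5
reviewer_b = ["remove"] * 4 + ["allow"] * 6
print(round(cohens_kappa(reviewer_a, reviewer_b), 3))
```

Here the reviewers agree on 9 of 10 items (90%), while chance agreement is 50%, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.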

WARNING

Reviewer wellness is a first-class concern. Continuous exposure to violent, abusive, and disturbing content can cause PTSD, anxiety, and depression. Implement mandatory breaks, content blurring by default, wellness check-ins, and access to counseling. Rotate reviewers across severity levels.

7. Deep Dive: Appeals Flow

Appeal Auto-Resolution

When models improve, previously borderline content may now score below thresholds:

python

class AppealProcessor:
    def process_appeal(self, content_id: str, user_id: str):
        # Get original decision
        original = self.decision_store.get(content_id)

        # Re-score with latest models
        content = self.content_store.get(content_id)
        new_scores = self.classify(content)

        # Apply current policy (may have changed since original decision)
        new_decision = self.policy_engine.evaluate(
            new_scores,
            original.context
        )

        if new_decision['decision'] == 'ALLOW':
            # Model improvement or policy change reversed the decision
            self.reinstate(content_id)
            self.notify_user(user_id, "appeal_approved_auto")
            self.log_appeal(content_id, "auto_approved", new_scores)
            return

        # Route to human for manual review
        self.enqueue_for_review(
            content_id,
            priority="appeal",
            original_decision=original,
            new_scores=new_scores
        )

8. Handling Adversarial Content

Evasion Techniques and Countermeasures

Evasion Technique	Example	Countermeasure
Leetspeak	h@t3, $h!t	Leetspeak decoder in text normalizer
Character spacing	"h a t e"	Whitespace collapse
Unicode homoglyphs	Using Cyrillic "a" instead of Latin "a"	Unicode confusable mapping
Text in images	Hate text rendered as image	OCR pipeline on all images
Steganography	Hidden content in image pixels	Hashing and classifiers only see the visible image; re-encode uploads to destroy fragile payloads, or run dedicated steganalysis
Audio evasion	Hate speech over music	ASR + text classification
Code switching	Mixing languages mid-sentence	Multilingual models
Sarcasm/irony	"Oh great, another [group] person"	Context-aware models, hard problem
Dog whistles	Coded language for in-group	Continuously updated lexicon + embedding detection

python

class HomoglyphResolver:
    """Map Unicode confusables to ASCII equivalents."""

    # Partial mapping — real implementation uses Unicode confusable database
    CONFUSABLES = {
        '\u0430': 'a',  # Cyrillic а
        '\u0435': 'e',  # Cyrillic е
        '\u043e': 'o',  # Cyrillic о
        '\u0440': 'p',  # Cyrillic р
        '\u0441': 'c',  # Cyrillic с
        '\u0443': 'y',  # Cyrillic у
        '\u0445': 'x',  # Cyrillic х
        '\u0456': 'i',  # Ukrainian і
        # ... hundreds more
    }

    def resolve(self, text: str) -> str:
        return ''.join(
            self.CONFUSABLES.get(char, char)
            for char in text
        )

9. Monitoring and Metrics

Operational Dashboard

Metric	Formula	Target
Auto-remove precision	True violations / total auto-removed	> 95%
Recall (severe categories)	Detected / total actual violations	> 99%
Review queue latency	Time from enqueue to decision	P0 < 1h, P1 < 4h
Appeal overturn rate	Overturned appeals / total appeals	< 10%
Model latency P99	Classification time	Text < 500ms, Image < 2s
False positive rate	Wrongly removed / total processed	< 0.1%
Prevalence	Violating content seen by users / total content views	< 0.05%
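The precision and recall rows can be computed from labeled outcomes, and it helps to translate a precision target into absolute numbers. In this sketch the daily volume of one million auto-removals is an illustrative assumption.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of auto-removed items that were real violations."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of real violations that the system detected."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


def wrongful_removals(auto_removed: int, precision_value: float) -> int:
    """Expected number of legitimate items removed at a given precision."""
    return round(auto_removed * (1 - precision_value))


# Assumed volume: one million auto-removals per day at the 95% target
print(wrongful_removals(1_000_000, 0.95))
```

Even at the 95% precision target, that volume means tens of thousands of legitimate posts removed per day, which is the load the appeals flow has to absorb.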

Prevalence as the North Star Metric

python

def compute_prevalence(time_window):
    """
    Prevalence = violating content impressions / total impressions

    This is the metric that matters — not how much bad content exists,
    but how much bad content users actually SEE.
    """
    total_impressions = get_total_impressions(time_window)
    violating_impressions = get_violating_impressions(time_window)

    # Violating impressions include:
    # 1. Content later removed (was seen before removal)
    # 2. Content caught by retroactive scans
    # 3. Estimated from sampling (human review of random sample)

    if total_impressions == 0:
        return 0.0  # no traffic in the window; avoid division by zero
    prevalence = violating_impressions / total_impressions
    return prevalence

# Track per policy category
# Target: < 0.05% prevalence for hate speech
# Target: < 0.01% prevalence for CSAM

TIP

Prevalence-based measurement requires periodic random sampling: take a random sample of published content (ideally weighted by views, since prevalence is defined over impressions), send it through human review, and extrapolate. This catches content that the automated system misses entirely. Meta publishes prevalence figures in its transparency reports, and YouTube publishes a comparable Violative View Rate.
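Because prevalence is estimated from a sample, it should be reported with an uncertainty range, especially when the true rate is a small fraction of a percent. The sketch below uses a Wilson score interval, which behaves better than the normal approximation near zero. It assumes the sample is drawn uniformly at random from impressions; the sample size in the example is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(violating: int, sampled: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% confidence interval for a sampled proportion."""
    if sampled <= 0:
        raise ValueError("sample must be non-empty")
    p = violating / sampled
    denom = 1 + z * z / sampled
    centre = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# 5 violating impressions found in a reviewed sample of 10,000
low, high = wilson_interval(5, 10_000)
print(f"prevalence between {low:.4%} and {high:.4%}")
```

With only a handful of violations in the sample the interval is wide, so confirming a target such as 0.05% takes samples in the tens or hundreds of thousands.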

10. Scaling Considerations

Multi-Region Architecture

Cost Optimization

Strategy	Savings	Tradeoff
Cascade classifiers	60% GPU cost	Simple model screens first, expensive model only for borderline
Batch GPU inference	40% GPU cost	Adds 50-100 ms latency for batching
Model distillation	50% inference cost	Slightly lower accuracy
Hash-first pipeline	Skip ML for known content	Only works for exact/near duplicates
Content-type routing	30% overall	Text-only posts skip image pipeline
Auto-resolution for low-risk	20% review cost	May miss subtle violations
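The cascade row is the largest saving in the table, and its control flow is short. A minimal sketch: a cheap model scores everything, and only scores in the uncertain band go to the expensive model. The band edges (0.05 and 0.98) and the scorer signatures are assumptions for illustration.

```python
from typing import Callable, Tuple

Scorer = Callable[[str], float]


def cascade_score(text: str, cheap: Scorer, expensive: Scorer,
                  low: float = 0.05, high: float = 0.98) -> Tuple[float, str]:
    """Return (score, model_used); call the expensive model only if unsure."""
    score = cheap(text)
    if score <= low or score >= high:
        return score, "cheap"           # confident either way: stop here
    return expensive(text), "expensive"  # borderline: pay for the big model


# Stub scorers standing in for real models
def cheap_model(text: str) -> float:
    return 0.01 if "hello" in text else 0.5


def expensive_model(text: str) -> float:
    return 0.9


print(cascade_score("hello world", cheap_model, expensive_model))
```

The savings depend entirely on how much traffic the cheap model can confidently clear. For severe categories the lower band edge should be set conservatively, since anything the cheap model waves through never reaches the stronger model.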

Key Takeaways

Multi-modal pipeline — text, image, and video each need specialized classifiers; aggregate signals in a policy engine
Policy engine decouples ML from decisions — thresholds are configurable per category, context, and jurisdiction
Hash databases first — perceptual hashing catches known-bad content instantly (especially CSAM) before expensive ML runs
Adversarial resistance — normalize text (homoglyphs, leetspeak, spacing), OCR images, and continuously update evasion countermeasures
Human review is unavoidable — ML handles volume, humans handle nuance; queue prioritization by severity and reach is critical
Prevalence over precision — the real metric is how much violating content users see, not how much exists
Reviewer wellness is non-negotiable — content moderation causes real psychological harm; build in mandatory safeguards
Appeals must exist — false positives are inevitable; a transparent appeal process is both ethically and legally required
