Design GitHub Copilot

An AI code completion system that suggests code in real-time as developers type — inline completions, multi-line suggestions, and chat-based code generation.

1. Requirements Clarification

Functional Requirements

Inline code suggestions as user types (ghost text)
Multi-line / whole-function completion
Chat interface for code Q&A
Context-aware (understands open files, imports, project structure)
Support multiple languages (Python, TypeScript, Go, Java, Rust, etc.)

Non-Functional Requirements

Latency: < 300ms for inline suggestions (user perception threshold)
Availability: 99.9% (on failure, degrade to showing no suggestions rather than breaking the IDE)
Scale: 1M+ concurrent developers
Privacy: Enterprise customers need code to never leave their network

2. Back-of-the-Envelope Estimation

DAU: 5M developers
Suggestions/day per user: ~200 (a request fires every few keystrokes)
Total requests/day: 5M × 200 = 1B
QPS: 1B / 86,400 ≈ 11,500 QPS (peak: ~35K QPS)
Average context size: ~4KB (surrounding code)
Average response size: ~500 bytes (suggestion)
Bandwidth: 35K × 4.5KB ≈ 157 MB/s
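
The arithmetic above can be reproduced in a few lines. This is only a sketch of the estimate: the 3x peak factor is an assumption inferred from the ~35K figure, not something the estimate states explicitly.

```python
# Back-of-the-envelope numbers from the estimates above.
DAU = 5_000_000
SUGGESTIONS_PER_USER_PER_DAY = 200
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # Assumption: peak traffic is roughly 3x the daily average

requests_per_day = DAU * SUGGESTIONS_PER_USER_PER_DAY
avg_qps = requests_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

CONTEXT_KB = 4.0
RESPONSE_KB = 0.5
# The estimate rounds peak QPS up to 35K before multiplying.
bandwidth_mb_s = 35_000 * (CONTEXT_KB + RESPONSE_KB) / 1_000

print(f"requests/day: {requests_per_day:,}")
print(f"average QPS:  {avg_qps:,.0f}")
print(f"peak QPS:     {peak_qps:,.0f}")
print(f"bandwidth:    {bandwidth_mb_s:.1f} MB/s")
```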

3. High-Level Design

4. Detailed Design

Context Collection (IDE Side)

Suggestion quality depends heavily on the quality of the context sent with each request:

typescript

interface CopilotContext {
  // Current file
  prefix: string        // Code before cursor
  suffix: string        // Code after cursor
  language: string      // File language
  filePath: string      // Relative path

  // Neighboring files
  openTabs: FileSnippet[]      // Other open files (relevant parts)
  importedFiles: FileSnippet[] // Files referenced by imports

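  // Project context
  projectType: string          // Inferred from manifest files (package.json, go.mod, etc.)
  recentEdits: Edit[]          // Last 5 edits across files
}

// Hypothetical shape of a recent edit (fields are illustrative)
interface Edit {
  path: string            // File that was edited
  range: [number, number] // Start and end offsets of the change
  newText: string         // Text inserted by the edit
}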

interface FileSnippet {
  path: string
  content: string        // Truncated to relevant sections
  relevanceScore: number // How related to current file
}
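
One simple way to fill in relevanceScore is to compare the identifiers used in the current file with those in a candidate file. The sketch below uses Jaccard similarity as a stand-in; the function names are hypothetical, and a production client may well use a different signal (embeddings, import graph distance, recency).

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifiers(code: str) -> set[str]:
    """Collect the distinct identifier-like tokens in a piece of code."""
    return set(IDENTIFIER.findall(code))


def relevance_score(current: str, candidate: str) -> float:
    """Jaccard similarity of identifier sets, in the range [0, 1]."""
    a, b = identifiers(current), identifiers(candidate)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Files that share many names with the file being edited score close to 1, unrelated files score close to 0, and the client can keep only the top-scoring snippets.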

Context budget: ~8K tokens total, so the client must be selective about what it sends:

Current file prefix/suffix: 4K tokens
Most relevant open tab: 2K tokens
Import signatures: 1K tokens
Project metadata: 1K tokens
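
The budget split above could be enforced with a small truncation step on the client. This is a minimal sketch under two assumptions that are not in the design itself: tokens are approximated as 4 characters each, and the current-file budget is split 3:1 between prefix and suffix. The function names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # Assumption: rough average, not a real tokenizer

# Token budgets from the list above (4K + 2K + 1K + 1K = 8K)
BUDGET = {"current_file": 4_000, "open_tab": 2_000, "imports": 1_000, "metadata": 1_000}


def estimate_tokens(text: str) -> int:
    """Approximate token count by character length, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def keep_tail(text: str, max_tokens: int) -> str:
    """Keep the end of the text (the part closest to the cursor)."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return text[-max_chars:] if max_chars > 0 else ""


def keep_head(text: str, max_tokens: int) -> str:
    """Keep the start of the text."""
    return text[: max_tokens * CHARS_PER_TOKEN]


def fit_to_budget(prefix: str, suffix: str, open_tab: str,
                  imports: str, metadata: str) -> dict[str, str]:
    """Truncate each context source to its share of the budget."""
    prefix_budget = BUDGET["current_file"] * 3 // 4  # Assumed 3:1 prefix/suffix split
    suffix_budget = BUDGET["current_file"] - prefix_budget
    return {
        "prefix": keep_tail(prefix, prefix_budget),
        "suffix": keep_head(suffix, suffix_budget),
        "open_tab": keep_head(open_tab, BUDGET["open_tab"]),
        "imports": keep_head(imports, BUDGET["imports"]),
        "metadata": keep_head(metadata, BUDGET["metadata"]),
    }
```

The prefix is truncated from the front because the code nearest the cursor carries the most signal for the next token.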

Model Router

Not every keystroke needs GPT-4:

Trigger	Model	Latency Target
Single line completion	Small (1-3B params)	< 100ms
Multi-line / function body	Medium (7-13B params)	< 300ms
Chat / explain / refactor	Large (GPT-4 class)	< 2s
Simple bracket/quote close	Rule-based (no model)	< 10ms

python

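def route_request(context: CopilotContext) -> str:
    """Pick the cheapest backend that can handle this request."""
    # Chat requests always go to the large model. is_chat is assumed to be
    # set by the client; it is not a field of the CopilotContext interface.
    if getattr(context, "is_chat", False):
        return "large_model"

    # Rule-based completions (no model needed)
    if is_bracket_close(context.prefix):
        return "rule_engine"

    # Classify where the cursor sits to estimate how much code is needed
    cursor_position = analyze_cursor(context)

    if cursor_position == "mid_line":
        return "small_model"
    elif cursor_position == "empty_line_in_function":
        return "medium_model"
    elif cursor_position == "new_function":
        return "large_model"

    return "small_model"  # Default to fast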
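
The router relies on an is_bracket_close helper that is not shown. One possible reading, sketched here as an assumption, is that it fires when the cursor sits right after an opening bracket, in which case the rule engine can return the matching closer without calling a model. rule_engine_completion is a hypothetical name, and this version ignores brackets inside strings and comments.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}


def is_bracket_close(prefix: str) -> bool:
    """True when the last typed character is an opening bracket."""
    return bool(prefix) and prefix[-1] in PAIRS


def rule_engine_completion(prefix: str) -> str:
    """Return the matching closing bracket, or an empty string if none applies."""
    return PAIRS[prefix[-1]] if is_bracket_close(prefix) else ""
```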

Inference Pipeline

Post-Processing

Raw model output needs filtering:

python

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Suggestion:
    text: str
    confidence: float  # Model-reported score, higher is better

def post_process(raw_suggestions: list[Suggestion], context: CopilotContext) -> list[Suggestion]:
    results = []
    for suggestion in raw_suggestions:
        # 1. Syntax validation: must parse
        if not is_valid_syntax(suggestion.text, context.language):
            continue

        # 2. Security filter: no secrets, no license violations
        if contains_secrets(suggestion.text) or matches_copyrighted_code(suggestion.text):
            continue

        # 3. Deduplication: don't suggest what's already there
        if suggestion.text.strip() in context.prefix:
            continue

        # 4. Trim: stop at natural boundary
        trimmed = trim_at_boundary(suggestion.text, context.language)

        results.append(replace(suggestion, text=trimmed))

    # Rank by confidence score and keep the top 3
    return sorted(results, key=lambda s: s.confidence, reverse=True)[:3]
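
Two of the helpers used in the filter can be sketched with the standard library. The patterns below are illustrative assumptions, not an exhaustive secret scanner, and trim_at_blank_line is a hypothetical, language-agnostic stand-in for trim_at_boundary that simply stops at the first blank line.

```python
import re

# Illustrative patterns only; a real scanner would use a much larger rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # Shape of an AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    re.compile(r"(?i)(?:api[_-]?key|secret|password|token)\s*[:=]\s*[\"'][^\"']{8,}[\"']"),
]


def contains_secrets(text: str) -> bool:
    """True if any known secret pattern appears in the text."""
    return any(pattern.search(text) for pattern in SECRET_PATTERNS)


def trim_at_blank_line(text: str) -> str:
    """Cut the suggestion at the first blank line and drop trailing whitespace."""
    head, _, _ = text.partition("\n\n")
    return head.rstrip()
```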

Telemetry & Learning

Track acceptance to improve suggestions:

Event	What it tells us
`suggestion_shown`	Model produced output
`suggestion_accepted`	User pressed Tab — good suggestion
`suggestion_partially_accepted`	User took part of it
`suggestion_rejected`	User kept typing — bad suggestion
`suggestion_accepted_then_deleted`	Looked good but wasn't — worst case

Acceptance rate is the north star metric (~30% is good).
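
As a rough illustration, the acceptance rate can be computed from a stream of event names. The sketch assumes partial acceptances count as accepted and that accepted-then-deleted suggestions are subtracted; the text does not define the metric this precisely, so treat both choices as assumptions.

```python
from collections import Counter


def acceptance_rate(events: list[str]) -> float:
    """Share of shown suggestions that were accepted and kept."""
    counts = Counter(events)
    shown = counts["suggestion_shown"]
    if shown == 0:
        return 0.0
    accepted = counts["suggestion_accepted"] + counts["suggestion_partially_accepted"]
    retained = accepted - counts["suggestion_accepted_then_deleted"]
    return max(retained, 0) / shown
```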

5. Data Model

sql

-- Telemetry events (ClickHouse for analytics)
CREATE TABLE suggestion_events (
    event_id UUID,
    user_id UUID,
    timestamp DateTime64(3),
    language String,
    model_used String,
    latency_ms UInt32,
    tokens_generated UInt16,
    event_type Enum('shown', 'accepted', 'partial', 'rejected', 'deleted'),
    context_hash String  -- For semantic cache
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

6. API Design

POST /v1/completions
{
  "prefix": "function fibonacci(n: number): number {\n  ",
  "suffix": "\n}\n",
  "language": "typescript",
  "filePath": "src/utils/math.ts",
  "maxTokens": 128,
  "temperature": 0.0,
  "n": 3
}

Response (returned whole for inline completions; only chat responses are streamed):
{
  "suggestions": [
    { "text": "if (n <= 1) return n;\n  return fibonacci(n - 1) + fibonacci(n - 2);", "confidence": 0.92 },
    { "text": "const dp = [0, 1];\n  for (let i = 2; i <= n; i++) dp[i] = dp[i-1] + dp[i-2];\n  return dp[n];", "confidence": 0.87 }
  ],
  "model": "copilot-small-v3",
  "latencyMs": 89
}
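
On the client side, the request and response above map naturally onto a small data class and a JSON helper. This is a sketch using only the field names shown in the example payloads; CompletionRequest and best_suggestion are hypothetical names, and no network call is made here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CompletionRequest:
    """Mirrors the fields of the example request body."""
    prefix: str
    suffix: str
    language: str
    filePath: str
    maxTokens: int
    temperature: float
    n: int

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def best_suggestion(response_body: str) -> Optional[str]:
    """Return the text of the highest-confidence suggestion, if any."""
    suggestions = json.loads(response_body).get("suggestions", [])
    if not suggestions:
        return None
    return max(suggestions, key=lambda s: s["confidence"])["text"]
```

The IDE would render the returned text as ghost text and keep the remaining suggestions available for cycling.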

7. Scaling

GPU Infrastructure

Small model: Deploy on L4 GPUs, batch inference, 8 replicas per region
Large model: A100/H100 clusters, vLLM with continuous batching
Regions: US-East, US-West, EU-West, AP-Southeast (< 50ms network hop)

Handling 35K QPS

Semantic cache hit rate ~40% → actual GPU load ~21K QPS
Speculative decoding for faster generation
Request coalescing for similar contexts
Graceful degradation: drop to smaller model under load
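
The semantic cache idea can be sketched as a nearest-neighbor lookup over context embeddings. The example below is a toy: it uses a bag-of-tokens vector as a stand-in for a real embedding model and a linear scan instead of a vector index, and the 0.9 threshold is an arbitrary assumption.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy embedding: token counts (a real system would use a learned model)."""
    return Counter(re.findall(r"\w+", text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def put(self, context: str, completion: str) -> None:
        self.entries.append((embed(context), completion))

    def get(self, context: str) -> Optional[str]:
        """Return the cached completion for the most similar context, if close enough."""
        query = embed(context)
        best_score, best = 0.0, None
        for vector, completion in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best = score, completion
        return best if best_score >= self.threshold else None
```

A hit skips the GPU entirely, which is where the reduction from ~35K to ~21K QPS at a 40% hit rate comes from (35K × 0.6 = 21K).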

Enterprise (Self-Hosted)

Deploy model on customer's cloud (Azure, AWS)
Code never leaves their VPC
Smaller model (7B) for cost, fine-tuned on their codebase

8. Trade-offs

Decision	Choice	Alternative	Why
Multiple models	Yes — small + large	Single model	Latency vs quality trade-off
Streaming	No for inline, yes for chat	Always stream	Ghost text needs complete suggestion
Context	Selective 8K tokens	Full file	Token budget, latency
Caching	Semantic (embedding-based)	Exact match	Similar code patterns reuse

9. Common Interview Questions

How do you handle latency? → Model routing, semantic cache, edge deployment, speculative decoding
How do you prevent suggesting copyrighted code? → Code fingerprinting against known OSS, license detection
How do you handle multiple languages? → Polyglot models trained on all languages, language-specific post-processing
How do you personalize? → Fine-tune on org's codebase (enterprise), learn from acceptance patterns
What about privacy? → Self-hosted option, no training on customer code, SOC 2 compliance
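
For the copyrighted-code question, fingerprinting can be illustrated with hashed token k-grams. This is a simplified sketch, not the method any particular product uses: the function names are hypothetical, k=5 is an arbitrary choice, and a real index would cover a large corpus and typically sample fingerprints rather than store all of them.

```python
import hashlib
import re


def fingerprints(code: str, k: int = 5) -> set[str]:
    """Hash every run of k consecutive tokens, ignoring whitespace."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    grams = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return {hashlib.sha1(gram.encode()).hexdigest()[:16] for gram in grams}


def overlap_ratio(suggestion: str, known_index: set[str], k: int = 5) -> float:
    """Fraction of the suggestion's fingerprints found in the known-code index."""
    fps = fingerprints(suggestion, k)
    if not fps:
        return 0.0
    return len(fps & known_index) / len(fps)
```

A suggestion whose overlap ratio exceeds some threshold would be dropped or flagged with its license before it reaches the user.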
