Real-World Design Scenarios

These 15 scenarios test how you think under pressure. Unlike whiteboard system design where you build from scratch, these scenarios drop you into a running system with a real problem. Senior engineers are evaluated on their ability to methodically diagnose issues, prioritize actions, and communicate clearly. For each scenario, use the framework: Assess → Hypothesize → Investigate → Act → Prevent.

Scenario 1: "The API is Slow — What Do You Check?"

Situation: Users are reporting that the application feels sluggish. Your monitoring shows p99 latency has increased from 200ms to 2.5 seconds over the past hour.

Structured Approach

Step 1 — Assess scope:

Question	How to Check
Is it all endpoints or specific ones?	Filter metrics by route
Is it all users or specific regions?	Check by region/PoP
When did it start?	Correlate with recent deployments, config changes
Is it getting worse or stable?	Check trend over past hour

Step 2 — Check distributed traces (fastest path to root cause):

Look at a slow trace. Where is the time spent? Common findings:

Finding	Likely Cause	Fix
Database query takes 2s	Missing index, table lock, full table scan	Add index, optimize query
External API call takes 2s	Third-party is slow	Add timeout, circuit breaker, cache
Time between spans	Thread pool exhaustion, GC pause	Increase pool size, tune GC
First request slow, subsequent fast	Cold start, connection establishment	Connection pooling, warm-up
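
Reading a slow trace comes down to simple arithmetic on span timings. The sketch below is a minimal illustration, assuming a hypothetical `Span` shape with millisecond timestamps and child spans that run sequentially inside the root; real tracing backends expose richer data and overlapping spans.

```typescript
// Hypothetical span shape for illustration only.
interface Span {
  name: string;
  startMs: number;
  endMs: number;
}

interface TraceBreakdown {
  totalMs: number; // duration of the root span
  slowest: Span;   // child span with the longest duration
  gapMs: number;   // root time not covered by any child span
}

const durationMs = (span: Span): number => span.endMs - span.startMs;

// Assumes at least one child span and no overlap between children.
function breakDownTrace(root: Span, children: Span[]): TraceBreakdown {
  const coveredMs = children.reduce((sum, span) => sum + durationMs(span), 0);
  const slowest = children.reduce((a, b) => (durationMs(b) > durationMs(a) ? b : a));
  return { totalMs: durationMs(root), slowest, gapMs: durationMs(root) - coveredMs };
}

const traceResult = breakDownTrace(
  { name: 'GET /orders', startMs: 0, endMs: 2500 },
  [
    { name: 'auth', startMs: 10, endMs: 40 },
    { name: 'db.query', startMs: 50, endMs: 2150 },
    { name: 'render', startMs: 2160, endMs: 2200 },
  ],
);
// traceResult.slowest.name is 'db.query' (2100 ms); traceResult.gapMs is 330
```

A dominant child span points at that dependency; a large `gapMs` points at queuing, thread pool exhaustion, or GC pauses inside the service itself.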

Step 3 — Quick wins while investigating:

Scale horizontally if the cause is saturation
Enable query caching for expensive read paths
Set tight client-side timeouts so slow calls fail fast instead of tying up threads (prevents cascading failures)
Redirect traffic away from problematic region/AZ

See our API Slow Debugging Playbook for detailed runbooks.

Scenario 2: "Database is at 90% Capacity"

Situation: CloudWatch alert: RDS storage utilization at 90%. The database is a 500GB PostgreSQL instance. Growth rate is 2GB/day.
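
The remaining runway follows directly from the alert figures. This is a rough estimate, assuming the 500GB figure is the allocated storage and that growth stays constant at 2GB/day.

```typescript
// Assumes constant growth; real growth is often bursty, so treat the result as an upper bound.
function daysUntilFull(allocatedGb: number, usedFraction: number, growthGbPerDay: number): number {
  const freeGb = allocatedGb * (1 - usedFraction);
  return freeGb / growthGbPerDay;
}

// 500 GB at 90% leaves about 50 GB free; at 2 GB/day that is roughly 25 days.
const runwayDays = Math.round(daysUntilFull(500, 0.9, 2));
```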

Structured Approach

Step 1 — Buy time (you have ~25 days before 100%: about 50GB free at 2GB/day):

sql

-- What's consuming space?
SELECT
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS data_size,
    pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 20;

-- Check for bloat (dead tuples)
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

Step 2 — Immediate actions by priority:

Action	Space Freed	Risk	Time
VACUUM FULL on bloated tables	10-50% of table size	Locks table (ACCESS EXCLUSIVE); needs free space for a full copy of the table	Hours
Drop unused indexes	5-20%	None if truly unused	Minutes
Archive old data to S3	Varies	Need to ensure no queries hit it	Days
Increase storage (AWS allows online resize)	N/A (buys time)	None	Minutes
Enable storage autoscaling	N/A (prevents recurrence)	Cost increase	Minutes

Step 3 — Long-term solutions:

Implement data retention policy: archive data older than N months
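Partition large tables by date: CREATE TABLE orders (...) PARTITION BY RANGE (created_at). An existing table cannot be converted in place, so create the partitioned table and migrate rows into it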
Move analytics data to a separate store (ClickHouse, S3 + Athena)
Implement CDC to stream changes to cold storage, so old rows can be deleted from the primary once they are safely archived

Scenario 3: "Service is Failing Intermittently"

Situation: Service X returns 503 errors about 5% of the time. Retry usually succeeds. The error rate is not correlated with traffic.

Structured Approach

Intermittent failures are the hardest to debug because they are not reproducible on demand.

Hypothesis	Investigation	Evidence
One unhealthy instance in the pool	Check per-instance error rates	Error rate 0% on 3/4 instances, 20% on 1
DNS resolution failing intermittently	Check DNS cache TTL, resolution times	Spikes in DNS lookup time
Connection timeout to dependency	Check connection pool metrics	Pool exhaustion every few minutes
Memory pressure causing OOM kills	Check for OOM events in system logs	`dmesg` shows killed processes
Database connection limit reached	Check active connections vs max	Periodic spikes to max_connections
TLS certificate issue on some paths	Check certificate chain for all load balancer targets	One target has expired intermediate cert

Step 1 — Isolate the failure pattern:

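# metrics_query is a placeholder for your metrics client; queries are PromQL.

# Is it one instance?
per_instance_error_rate = metrics_query(
  'sum(rate(http_errors[5m])) by (instance)'
)

# Does it track a dependency? Plot this next to the service error rate
# and check whether the spikes line up.
dependency_error_rate = metrics_query(
  'sum(rate(dependency_errors[5m]))'
)

# Is it time-based? Run this over a multi-hour range and look for
# periodic spikes (cron jobs, cache expiry, scheduled GC).
error_over_time = metrics_query(
  'sum(rate(http_errors[1m]))'
)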

Step 2 — Common culprits for 5% intermittent failure:

One bad instance (most common) — one instance in the pool is unhealthy but health check passes. Remove it, investigate, replace.
Connection pool exhaustion — pool is sized for average load, not burst. Under transient bursts, requests queue and then time out.
Downstream flapping — a dependency toggles between healthy and unhealthy, causing circuit breaker oscillation.
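
The "one bad instance" check can be automated once per-instance counts are available. The sketch below is one possible approach, not a standard algorithm: it flags instances whose error rate exceeds a multiple of the fleet-wide rate. The `InstanceStats` shape and the factor of 3 are assumptions chosen for illustration.

```typescript
// Hypothetical per-instance counters, e.g. exported from your metrics system.
interface InstanceStats {
  instance: string;
  requests: number;
  errors: number;
}

// Flags instances whose error rate is more than `factor` times the fleet-wide rate.
// Only meaningful for pools with more than `factor` instances of similar traffic.
function findOutlierInstances(stats: InstanceStats[], factor = 3): string[] {
  const totalRequests = stats.reduce((sum, s) => sum + s.requests, 0);
  const totalErrors = stats.reduce((sum, s) => sum + s.errors, 0);
  if (totalRequests === 0 || totalErrors === 0) return [];
  const fleetRate = totalErrors / totalRequests;
  return stats
    .filter((s) => s.requests > 0 && s.errors / s.requests > factor * fleetRate)
    .map((s) => s.instance);
}

// Mirrors the table above: 0% on three instances, 20% on one, 5% overall.
const outliers = findOutlierInstances([
  { instance: 'api-1', requests: 1000, errors: 0 },
  { instance: 'api-2', requests: 1000, errors: 0 },
  { instance: 'api-3', requests: 1000, errors: 0 },
  { instance: 'api-4', requests: 1000, errors: 200 },
]);
// outliers is ['api-4']
```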

Scenario 4: "Deployment Caused a 5xx Spike"

Situation: The latest deployment went out at 14:00. Error rate jumped from 0.1% to 5% immediately after. The previous version was running fine.

Structured Approach

Step 1 — Stop the bleeding:

Action	When
Roll back immediately if error rate > 1% and impact is customer-facing	Within 5 minutes
Pause canary rollout if in canary phase	Immediately
Check if rollback is safe (database migrations?)	Before rolling back

Step 2 — If rollback is safe, do it now, investigate later.

Step 3 — If rollback is risky (irreversible migration), investigate fast:

Common deployment failure causes:

Cause	How to Detect	Prevention
Breaking API change	Client errors in logs	API versioning, backward compatibility testing
Missing environment variable	"undefined" in error logs	Config validation on startup
Database migration failed	Migration error in deploy logs	Test migrations on staging
Dependency version mismatch	ImportError / ClassNotFound	Lock file, integration tests
Memory/CPU limits too low	OOM killed, throttled	Load test before deploy
Race condition in new code	Intermittent errors only under concurrency	Concurrency testing
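
"Config validation on startup" from the table can be as small as the sketch below. The variable names are examples only; in a Node.js service you would pass `process.env` as `env` and call `assertConfig` before accepting traffic, so a bad deploy fails its health check instead of serving errors.

```typescript
// Returns the required keys that are absent or blank in `env`.
function findMissingConfig(env: Record<string, string | undefined>, required: string[]): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === '';
  });
}

// Throws at startup so the process never begins serving with incomplete config.
function assertConfig(env: Record<string, string | undefined>, required: string[]): void {
  const missing = findMissingConfig(env, required);
  if (missing.length > 0) {
    throw new Error(`Missing required configuration: ${missing.join(', ')}`);
  }
}

// Example keys are hypothetical.
const missingKeys = findMissingConfig(
  { DATABASE_URL: 'postgres://db.internal/app', API_KEY: '' },
  ['DATABASE_URL', 'API_KEY', 'REDIS_URL'],
);
// missingKeys is ['API_KEY', 'REDIS_URL']
```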

Scenario 5: "We Need to Migrate Databases"

Situation: You need to migrate from MySQL to PostgreSQL for your main application database (50 tables, 500GB, 24/7 uptime requirement).

Structured Approach

Phase 1: Preparation (weeks 1-4)

Schema conversion: MySQL types to PostgreSQL types
Application compatibility: ORM changes, raw SQL audit
Set up PostgreSQL alongside MySQL
Build data migration pipeline

Phase 2: Dual-Write (weeks 5-8)

Write to both databases
Backfill historical data from MySQL into PostgreSQL
Read from MySQL (source of truth)
Compare row counts and checksums between the two to find discrepancies

Phase 3: Shadow Read (weeks 9-10)

Read from PostgreSQL, compare with MySQL response
Log any differences
Fix data inconsistencies
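
The comparison step in the dual-write and shadow-read phases can be sketched as a keyed diff of two result sets. This is a minimal illustration, assuming both sides have already been normalized to the same types (for example MySQL TINYINT(1) versus PostgreSQL boolean) and that rows share a primary-key column; the `Row` and `RowDiff` shapes are hypothetical.

```typescript
type Row = Record<string, string | number | boolean | null>;

interface RowDiff {
  missingInTarget: string[]; // present in source only
  extraInTarget: string[];   // present in target only
  mismatched: string[];      // present in both, but column values differ
}

function indexById(rows: Row[], idColumn: string): Map<string, Row> {
  const byId = new Map<string, Row>();
  for (const row of rows) byId.set(String(row[idColumn]), row);
  return byId;
}

function sameRow(a: Row, b: Row): boolean {
  const keys = Object.keys(a).concat(Object.keys(b));
  return keys.every((key) => a[key] === b[key]);
}

function diffRows(source: Row[], target: Row[], idColumn = 'id'): RowDiff {
  const sourceById = indexById(source, idColumn);
  const targetById = indexById(target, idColumn);
  const diff: RowDiff = { missingInTarget: [], extraInTarget: [], mismatched: [] };
  sourceById.forEach((row, id) => {
    const other = targetById.get(id);
    if (other === undefined) diff.missingInTarget.push(id);
    else if (!sameRow(row, other)) diff.mismatched.push(id);
  });
  targetById.forEach((_row, id) => {
    if (!sourceById.has(id)) diff.extraInTarget.push(id);
  });
  return diff;
}

const migrationDiff = diffRows(
  [{ id: 1, email: 'a@example.com' }, { id: 2, email: 'b@example.com' }, { id: 3, email: 'c@example.com' }],
  [{ id: 1, email: 'a@example.com' }, { id: 2, email: 'B@example.com' }, { id: 4, email: 'd@example.com' }],
);
// missingInTarget: ['3'], mismatched: ['2'], extraInTarget: ['4']
```

For 500GB of data a full in-memory diff is not practical; in practice you would run this per primary-key range or compare per-range checksums first and diff only the ranges that disagree.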

Phase 4: Cutover (week 11)

Switch reads to PostgreSQL and make it the source of truth
Keep dual-writing to MySQL so it remains a hot standby for 1 week
Decommission MySQL once the rollback window has passed

Scenario 6: "Traffic Grew 10x Overnight"

Situation: Your product went viral. Traffic went from 1,000 to 10,000 requests per second overnight. Some services are failing.

Structured Approach

Immediate (0-1 hour):

Priority	Action
1	Scale horizontally: increase auto-scaling group max, add nodes
2	Enable aggressive caching: cache public, non-personalized responses for 60s
3	Disable non-essential features: analytics tracking, recommendations
4	Rate limit by client to prevent any single client from consuming all capacity
5	Check database connections — likely the first bottleneck
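
Per-client rate limiting (priority 4) is commonly done with a token bucket. The sketch below is an in-memory, single-process illustration; the class name, parameters, and injectable clock are assumptions made for the example, and a multi-instance deployment would need shared state (for example in Redis).

```typescript
class TokenBucketLimiter {
  private buckets = new Map<string, { tokens: number; lastRefillMs: number }>();
  private capacity: number;
  private refillPerSecond: number;
  private now: () => number;

  // capacity: maximum burst size; refillPerSecond: sustained rate per client.
  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
  }

  allow(clientId: string): boolean {
    const nowMs = this.now();
    const bucket = this.buckets.get(clientId) ?? { tokens: this.capacity, lastRefillMs: nowMs };
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSeconds = (nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
    bucket.lastRefillMs = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(clientId, bucket);
    return allowed;
  }
}

// Fake clock so the example is deterministic.
let fakeNowMs = 0;
const limiter = new TokenBucketLimiter(2, 1, () => fakeNowMs);
const firstCall = limiter.allow('client-a');   // true
const secondCall = limiter.allow('client-a');  // true
const thirdCall = limiter.allow('client-a');   // false: burst of 2 is used up
const otherClient = limiter.allow('client-b'); // true: separate bucket
fakeNowMs = 1000;
const afterRefill = limiter.allow('client-a'); // true: one token refilled after 1s
```

Note that the bucket map grows with the number of distinct clients; a production version would expire idle entries.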

Short-term (1-24 hours):

Add read replicas for the database
Scale cache cluster (more Redis nodes)
Review and increase connection pool sizes
Add CDN for all static and semi-static content
Set up auto-scaling policies if not already present

Medium-term (1-7 days):

Identify the actual bottlenecks from monitoring data
Optimize the top 5 slowest queries
Implement proper horizontal scaling patterns
Add circuit breakers to prevent cascade failures
Review architecture for single points of failure

Scenario 7: "Customer Data Breach Detected"

Situation: Security team detected unusual data access patterns. A database dump of 100,000 customer records may have been exfiltrated.

Structured Approach

Hour 0-1: Contain

Action	Responsible
Revoke compromised credentials immediately	Security team
Block the source IP/access pattern	Network team
Preserve evidence (do not delete logs)	Security team
Activate incident response team	Incident commander
Check if exfiltration is ongoing	SOC / SIEM

Hour 1-4: Assess

What data was accessed? (PII, financial, health?)
How many records affected?
What was the attack vector? (SQL injection, stolen credentials, insider?)
Is the vulnerability still present?
Were other systems affected?

Hour 4-24: Notify and remediate

Legal team: determine regulatory notification requirements (GDPR: notify the supervisory authority within 72 hours of becoming aware of the breach)
Patch the vulnerability
Rotate all secrets and API keys in the affected scope
Review access logs for the past 30 days for similar patterns
Force password reset for affected users

Week 1-2: Post-incident

Full incident postmortem
Implement missing controls (see what failed)
Penetration test to find similar vulnerabilities
Review and update security monitoring rules

Scenario 8: "Third-Party API is Going Down"

Situation: Your payment provider (Stripe, PayPal, etc.) is experiencing intermittent outages. 20% of payment requests are failing.

Structured Approach

Action	Priority	Implementation
Activate circuit breaker	Immediate	Stop sending requests to failing provider
Failover to backup provider	Immediate	Route to secondary payment provider
Queue failed payments for retry	Immediate	Store payment intent with an idempotency key, retry when provider recovers
Notify affected users	Within 30 min	"Payment is being processed, you'll be notified"
Cache authorization tokens	If applicable	Avoid re-auth during outage

Longer-term prevention:

typescript

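// Minimal shapes so the example type-checks; field names are illustrative.
interface PaymentRequest { idempotencyKey: string; amountCents: number; currency: string; }
interface PaymentResult { status: 'SUCCEEDED' | 'PENDING'; message?: string; }
interface PaymentProvider { name: string; charge(request: PaymentRequest): Promise<PaymentResult>; }
interface CircuitBreaker { isOpen(name: string): boolean; recordFailure(name: string): void; }
interface RetryJob { request: PaymentRequest; maxRetries: number; retryAfter: Date; }
interface RetryQueue { enqueue(job: RetryJob): Promise<void>; }

// Placeholder classification; adapt to how your HTTP client reports failures.
function isNetworkError(error: unknown): boolean {
  return error instanceof Error && /ECONNRESET|ETIMEDOUT|ENOTFOUND/.test(error.message);
}

class PaymentService {
  private providers: PaymentProvider[];
  private circuitBreaker: CircuitBreaker;
  private retryQueue: RetryQueue;

  // providers are ordered: primary first (e.g. StripeProvider), then failover (e.g. AdyenProvider)
  constructor(providers: PaymentProvider[], circuitBreaker: CircuitBreaker, retryQueue: RetryQueue) {
    this.providers = providers;
    this.circuitBreaker = circuitBreaker;
    this.retryQueue = retryQueue;
  }

  async processPayment(request: PaymentRequest): Promise<PaymentResult> {
    for (const provider of this.providers) {
      if (this.circuitBreaker.isOpen(provider.name)) continue; // Skip providers marked unhealthy
      try {
        return await provider.charge(request);
      } catch (error) {
        if (!isNetworkError(error)) throw error; // Business errors (card declined) are not retried
        // A timeout does not prove the charge failed: reconcile with this provider later.
        this.circuitBreaker.recordFailure(provider.name);
        // Fall through and try the next provider
      }
    }

    // All providers down: queue for retry
    await this.retryQueue.enqueue({
      request,
      maxRetries: 10,
      retryAfter: new Date(Date.now() + 5 * 60 * 1000),
    });

    return { status: 'PENDING', message: 'Payment is being processed' };
  }
}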
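
The code above relies on a circuit breaker with `isOpen` and `recordFailure`. A minimal in-memory sketch of such a breaker follows, assuming a consecutive-failure threshold and a fixed cooldown. The class name, defaults, and `recordSuccess` method are illustrative, and this version simply closes again after the cooldown rather than implementing a full half-open state.

```typescript
class SimpleCircuitBreaker {
  private failures = new Map<string, number>();
  private openedAtMs = new Map<string, number>();
  private failureThreshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(failureThreshold = 5, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  isOpen(name: string): boolean {
    const openedAt = this.openedAtMs.get(name);
    if (openedAt === undefined) return false;
    if (this.now() - openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and let traffic probe the provider again.
      this.openedAtMs.delete(name);
      this.failures.set(name, 0);
      return false;
    }
    return true;
  }

  recordFailure(name: string): void {
    const count = (this.failures.get(name) ?? 0) + 1;
    this.failures.set(name, count);
    if (count >= this.failureThreshold) this.openedAtMs.set(name, this.now());
  }

  // Call after a successful request so only consecutive failures trip the breaker.
  recordSuccess(name: string): void {
    this.failures.set(name, 0);
  }
}

// Fake clock so the example is deterministic.
let breakerClockMs = 0;
const breaker = new SimpleCircuitBreaker(2, 1000, () => breakerClockMs);
breaker.recordFailure('stripe');
const openAfterOne = breaker.isOpen('stripe');      // false: below threshold
breaker.recordFailure('stripe');
const openAfterTwo = breaker.isOpen('stripe');      // true: threshold reached
breakerClockMs = 1000;
const openAfterCooldown = breaker.isOpen('stripe'); // false: cooldown elapsed
```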

Scenario 9: "Service Latency is High Only in One Region"

Situation: Users in EU are experiencing 3-5 second latency. US and APAC are normal.

Investigation Checklist

Check	Tool	What to Look For
Regional health	CloudWatch by region	Error rate, CPU, memory per region
DNS resolution	`dig api.example.com` from EU	Wrong/slow resolution
Database replication lag	DB monitoring	EU read replica is behind
CDN cache hit rate	CloudFront metrics by PoP	Low hit rate = origin calls for every request
Network path	`traceroute` from EU	Unexpected routing through US
Recent deployment	Deploy history	EU deployed a bad version?

Scenario 10: "Memory Usage is Growing Over Time"

Situation: Service memory usage increases steadily: 500MB at startup, 1.5GB after 24 hours, 3GB after 48 hours. Eventually OOM-killed.

Common Memory Leak Sources

Source	Detection	Fix
Event listeners not removed	Heap dump shows growing listener arrays	Remove listeners on cleanup
Cache without size limit	Heap dump shows large Map/object	Add max size + LRU eviction
Closures holding references	Growing retained size in heap profile	Break closure references
Database connection leaks	Connection pool shows all connections active, none idle	Always close connections in `finally` block
Buffer accumulation	Growing Buffer objects in heap	Stream processing instead of buffering
Global state accumulation	Growing arrays/maps in module scope	Add cleanup / rotation
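
For the "cache without size limit" case, the fix "add max size + LRU eviction" can be sketched with a plain `Map`, which preserves insertion order. This is a minimal illustration rather than a production cache (no TTL, no size-in-bytes accounting); the class name is an assumption.

```typescript
class LruCache<K, V> {
  private entries = new Map<K, V>();
  private maxSize: number;

  constructor(maxSize: number) {
    if (maxSize < 1) throw new Error('maxSize must be at least 1');
    this.maxSize = maxSize;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldestKey = this.entries.keys().next().value as K;
      this.entries.delete(oldestKey);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new LruCache<string, number>(2);
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');    // touch 'a' so 'b' becomes the eviction candidate
cache.set('c', 3); // evicts 'b'
```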

Scenario 11: "Need to Deprecate an API Version"

Situation: API v1 has 10,000 active clients. v2 has been available for 6 months. You need to sunset v1.

Deprecation Timeline

Phase	Duration	Actions
Announce	Month 0	Email all v1 users, add `Sunset` header to responses
Warn	Month 1-3	Return `Deprecation` header, log v1 usage per client
Degrade	Month 4-5	Rate-limit v1, return `Warning: 299` headers (299 is a warn-code, not an HTTP status)
Sunset	Month 6	Return 410 Gone with migration guide link
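
The headers in the timeline can be generated from two dates. The sketch below follows my reading of RFC 8594 (`Sunset` as an HTTP-date, `rel="sunset"` link) and RFC 9745 (`Deprecation` as a structured-field date, `@` followed by Unix seconds); verify the formats against the specs before relying on them. The function name and migration guide URL are hypothetical.

```typescript
function deprecationHeaders(
  deprecatedAt: Date,
  sunsetAt: Date,
  migrationGuideUrl: string,
): Record<string, string> {
  return {
    // Structured-field date: "@" followed by seconds since the Unix epoch.
    Deprecation: `@${Math.floor(deprecatedAt.getTime() / 1000)}`,
    // HTTP-date, e.g. "Thu, 01 Jan 2026 00:00:00 GMT".
    Sunset: sunsetAt.toUTCString(),
    // Points clients at the migration guide.
    Link: `<${migrationGuideUrl}>; rel="sunset"`,
  };
}

const v1Headers = deprecationHeaders(
  new Date(Date.UTC(2025, 6, 1)),  // deprecated 1 July 2025 (example date)
  new Date(Date.UTC(2026, 0, 1)),  // sunset 1 January 2026 (example date)
  'https://api.example.com/docs/v2-migration', // hypothetical URL
);
```

Attaching these in one middleware for all v1 routes keeps the dates consistent across endpoints.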

Scenario 12: "Feature Flags Causing Inconsistent Behavior"

Situation: Users are reporting that a feature appears and disappears randomly. Feature flags are cached with 5-minute TTL.

Root Cause Analysis

The problem is inconsistent flag evaluation across requests. If a user hits Server A (cache has new value) then Server B (cache has old value), the feature flickers.

Fixes:

Sticky sessions — route same user to same server
Reduce cache TTL to 30 seconds (shrinks the inconsistency window but does not eliminate it)
Server-side evaluation with user ID — feature flag service returns consistent result per user ID
Client-side SDK — evaluate flags on the client, not the server
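
Consistent per-user evaluation usually means hashing the user ID into a stable bucket, so every server reaches the same decision without coordination. The sketch below assumes a percentage rollout and uses a 32-bit FNV-1a hash purely for brevity; any stable hash works, and the function names are illustrative.

```typescript
// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// The same (flagKey, userId) pair always lands in the same bucket (0..99),
// so the decision does not depend on which server handles the request.
function isEnabled(flagKey: string, userId: string, rolloutPercent: number): boolean {
  const bucket = fnv1a(`${flagKey}:${userId}`) % 100;
  return bucket < rolloutPercent;
}

const firstEvaluation = isEnabled('new-checkout', 'user-42', 50);
const secondEvaluation = isEnabled('new-checkout', 'user-42', 50);
// firstEvaluation === secondEvaluation on every server
```

Including the flag key in the hash input keeps the same users from always being first in line for every rollout. This removes per-request randomness; servers can still disagree briefly while a changed rollout percentage propagates through their caches.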

Scenario 13: "Deployment Pipeline Takes 45 Minutes"

Situation: CI/CD pipeline from commit to production takes 45 minutes. Developer velocity is suffering.

Pipeline Optimization

Stage	Current	Optimized	Technique
Build	8 min	2 min	Docker layer caching, parallel builds
Unit tests	12 min	4 min	Parallel test execution, test impact analysis
Integration tests	15 min	5 min	Run only affected tests, use test containers
Security scan	5 min	1 min	Incremental scanning, cache vulnerability DB
Deploy	5 min	3 min	Rolling deploy (not full replacement)
Total	45 min	15 min

Scenario 14: "Microservices Are Too Complex for Our Team"

Situation: 5-person team manages 12 microservices. Most engineering time is spent on infrastructure, not product features.

Honest Assessment

If the number of services exceeds the number of engineers, you likely have premature microservices. See our Anti-Patterns page.

Action plan:

Identify services that are always deployed together — merge them
Identify services with < 1 deploy/month — they are not earning their keep
Target: 2-3 engineers per service
Consider a "majestic monolith" with clear module boundaries
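
Finding services that are "always deployed together" can be done from deploy history. The sketch below is one way to do it, assuming each deploy record is simply the list of services released in that deploy (a hypothetical shape). It scores each pair by how often the two ship together relative to how often either ships at all.

```typescript
interface CoupledPair {
  a: string;
  b: string;
  togetherRatio: number; // deploys containing both / deploys containing either
}

function findCoupledServices(deploys: string[][], minRatio = 0.9): CoupledPair[] {
  const deployCount = new Map<string, number>();
  const pairs = new Map<string, { a: string; b: string; together: number }>();

  for (const deploy of deploys) {
    const services = Array.from(new Set(deploy)).sort();
    services.forEach((service, i) => {
      deployCount.set(service, (deployCount.get(service) ?? 0) + 1);
      for (const other of services.slice(i + 1)) {
        const key = JSON.stringify([service, other]);
        const pair = pairs.get(key) ?? { a: service, b: other, together: 0 };
        pair.together += 1;
        pairs.set(key, pair);
      }
    });
  }

  const coupled: CoupledPair[] = [];
  pairs.forEach(({ a, b, together }) => {
    const eitherDeployed = (deployCount.get(a) ?? 0) + (deployCount.get(b) ?? 0) - together;
    const togetherRatio = together / eitherDeployed;
    if (togetherRatio >= minRatio) coupled.push({ a, b, togetherRatio });
  });
  return coupled;
}

// Service names are examples.
const mergeCandidates = findCoupledServices([
  ['orders', 'payments'],
  ['orders', 'payments'],
  ['search'],
  ['orders', 'payments', 'search'],
]);
// mergeCandidates contains only the orders/payments pair, with togetherRatio 1
```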

Scenario 15: "Estimated Completion Time for Major Refactor?"

Situation: CTO asks: "How long will it take to refactor the monolith into microservices?"

Honest Answer Framework

Factor	Assessment
Current codebase size	LOC, modules, dependencies
Team size and experience	Have they done this before?
Business continuity requirements	Can you stop new features during refactor?
Data entanglement	How tightly coupled is the database?
Testing coverage	Can you refactor safely?

Rule of thumb: It usually takes 2-3x longer than estimated. A 200KLOC monolith with 10 engineers, good test coverage, and clear module boundaries takes 6-12 months for a meaningful split. Without those prerequisites, expect 12-24 months.

Better answer: "Let me extract one bounded context first as a proof of concept. That will give us a realistic timeline for the rest."

Scenario Answer Template

For any troubleshooting scenario in an interview:

Clarify — Ask what monitoring/tools are available
Assess — Determine scope and impact (how many users? which regions?)
Stabilize — Stop the bleeding (rollback, scale up, enable caching)
Investigate — Check metrics, traces, logs in that order
Fix — Apply the targeted fix
Prevent — What monitoring, testing, or architecture change prevents recurrence?

Related Pages

API Slow Debugging Playbook — detailed slow API runbook
Intermittent 502 Debugging — debugging intermittent failures
Anti-Patterns — architectural mistakes that cause these scenarios
Observability in Design — monitoring that catches issues early
Circuit Breaker — handling dependency failures
Incident Response — structured incident management
Postmortem Framework — learning from incidents
