Incident Response Playbook

An incident is an unplanned event that disrupts or degrades service for customers. Every engineering organization will have incidents — the difference between mature and immature teams is not the number of incidents but how quickly and effectively they respond.

This playbook is a complete, executable guide for the entire incident lifecycle: detection through postmortem. It is designed to be read once for learning and referenced during actual incidents. Every template, escalation path, and communication script is written to be copy-pasted under pressure — because during an incident, you do not have the cognitive bandwidth to compose prose from scratch.

The Incident Lifecycle

Every incident, regardless of severity, follows the same five-phase lifecycle: Detection → Triage → Mitigation → Communication → Postmortem.

The phases overlap in practice — you communicate while mitigating, and detection continues as you triage — but the mental model of five phases keeps the incident response structured. Teams that skip phases (especially communication and postmortem) accumulate organizational debt that makes future incidents worse.

Phase Durations by Severity

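| Phase | SEV-1 | SEV-2 | SEV-3 | SEV-4 |
|-------|-------|-------|-------|-------|
| Detect → Triage | < 5 min | < 15 min | < 30 min | < 2 hr |
| Triage → Mitigation started | < 15 min | < 30 min | < 1 hr | < 4 hr |
| Mitigation → Customer comms | < 5 min | < 15 min | < 1 hr | Best effort |
| Incident → Postmortem draft | < 48 hr | < 72 hr | < 1 week | Optional |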

Severity Levels

Severity classification is the first and most consequential decision during an incident. It determines who gets paged, how fast you need to act, what communication channels activate, and whether leadership is informed.

SEV-1: Critical

Definition: Complete service outage, data loss, security breach, or revenue-blocking failure affecting all or a majority of users.

Examples:

  • Production database is down — no reads or writes
  • Payment processing is entirely broken — zero transactions completing
  • Customer data is exposed to unauthorized parties
  • Authentication system is unavailable — no user can log in
  • Primary data center unreachable

Response requirements:

  • On-call engineer responds within 5 minutes
  • Incident commander assigned within 10 minutes
  • War room opened immediately
  • Engineering leadership notified within 15 minutes
  • External status page updated within 20 minutes
  • Customer communications within 30 minutes
  • All-hands-on-deck if initial responders cannot mitigate within 30 minutes

SEV-2: Major

Definition: Significant degradation of a core feature affecting a large subset of users. The service is up but meaningfully impaired.

Examples:

  • Search is returning errors for 40% of queries
  • Mobile app API is responding with 5-second latency (normal: 200ms)
  • Image uploads are failing but all other features work
  • A specific geographic region cannot access the service
  • Email delivery is delayed by 2+ hours

Response requirements:

  • On-call engineer responds within 15 minutes
  • Incident commander assigned within 30 minutes
  • Dedicated Slack channel created
  • Engineering manager notified within 30 minutes
  • External status page updated within 1 hour
  • Customer communications for affected segment within 2 hours

SEV-3: Minor

Definition: Minor feature degradation affecting a small subset of users. Core functionality is unaffected.

Examples:

  • Dashboard analytics are showing stale data (last updated 6 hours ago)
  • Dark mode CSS is broken on one browser
  • Export to CSV times out for reports over 10,000 rows
  • Webhook deliveries are delayed by 10 minutes
  • Admin panel is slow but functional

Response requirements:

  • On-call engineer acknowledges within 30 minutes during business hours
  • No war room required — Slack channel is sufficient
  • Fix during normal working hours
  • Status page update only if customer-visible
  • Internal comms only

SEV-4: Low

Definition: Cosmetic issue, minor inconvenience, or internal tooling problem with no customer impact.

Examples:

  • Internal monitoring dashboard has a broken graph
  • Staging environment CI is flaky
  • Log aggregation is 5 minutes behind
  • Internal wiki search is slow
  • Non-critical cron job failed on a single run

Response requirements:

  • Ticket created, addressed in next sprint
  • No escalation, no war room, no comms
  • On-call may defer to next business day

Severity Decision Tree

WARNING

When in doubt, escalate to a higher severity. You can always downgrade. Upgrading severity mid-incident is slow (pages need to go out, war rooms need to form) — downgrading is instant (just announce it). The cost of over-classifying is a few unnecessary pages; the cost of under-classifying is a prolonged outage.
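
The severity definitions above can be sketched as a small classification function. This is a minimal sketch, not a definitive rule set: the `ImpactReport` shape, its field names, and the "more than half of users" cut-off for "majority" are assumptions made for illustration.

```typescript
// Hypothetical impact summary filled in by the on-call engineer during triage.
interface ImpactReport {
  customerImpact: boolean;      // false => internal tooling or cosmetic only
  dataLossOrBreach: boolean;    // data loss or exposure to unauthorized parties
  coreFeatureAffected: boolean; // e.g. login, payments, primary API
  fractionOfUsers: number;      // best estimate between 0 and 1
  unsure: boolean;              // responder is torn between two levels
}

type Severity = 1 | 2 | 3 | 4;

function classifySeverity(r: ImpactReport): Severity {
  let sev: Severity;
  if (r.dataLossOrBreach) sev = 1;                                      // always critical
  else if (!r.customerImpact) sev = 4;                                  // no customer impact
  else if (r.coreFeatureAffected && r.fractionOfUsers > 0.5) sev = 1;   // majority of users (assumed cut-off)
  else if (r.coreFeatureAffected) sev = 2;                              // core feature, subset of users
  else sev = 3;                                                         // minor feature degradation
  // When in doubt, escalate one level; downgrading later is cheap.
  if (r.unsure && sev > 1) sev = (sev - 1) as Severity;
  return sev;
}
```

The `unsure` flag encodes the rule from the warning above: a responder who is torn between two levels takes the higher one and downgrades later.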

Phase 1: Detection

You cannot respond to what you do not know about. Detection is the phase where you first learn that something is wrong.

Detection Sources (Ordered by Speed)

| Source | Typical Latency | Reliability | Example |
|--------|-----------------|-------------|---------|
| Automated alerts (metrics) | 1-5 min | High | Error rate > 5% for 2 min |
| Synthetic monitoring | 1-5 min | High | Canary transaction failed |
| Anomaly detection | 3-10 min | Medium | Request latency 3 sigma above baseline |
| Internal report (engineer notices) | 5-30 min | Medium | "Something looks weird in Grafana" |
| Customer support tickets | 15-60 min | Low | Multiple tickets about same issue |
| Social media | 30-120 min | Low | Twitter thread about your outage |
| Customer email to exec | 1-24 hr | Low | "Your service has been down all day" |

Goal: Detect 90%+ of incidents through automated alerts before any human notices. Every incident detected via customer report or social media is a detection failure worth a postmortem action item.

Alert Design for Incident Detection

Effective alerting requires balancing sensitivity (catching real incidents) against specificity (not generating false alarms that erode on-call trust).

yaml
# Good alert — symptom-based, high signal
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.05
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error rate above 5% for 2 minutes"
    runbook: "https://wiki.internal/runbooks/high-error-rate"
    dashboard: "https://grafana.internal/d/api-health"

yaml
# Bad alert — cause-based, low signal
- alert: HighCpuUsage
  expr: node_cpu_usage > 0.90
  for: 1m
  labels:
    severity: page
  annotations:
    summary: "CPU above 90%"
    # High CPU might not cause any user impact
    # This will page on-call for non-incidents

TIP

Alert on symptoms, not causes. Users care about error rates, latency, and availability — not CPU usage, disk I/O, or memory pressure. Cause-based alerts generate noise. Symptom-based alerts are actionable.
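
To make the "above 5% for 2 minutes" condition concrete, the sketch below evaluates a symptom-based rule over a series of samples. It is an illustration of the idea only, not how any alerting system is implemented; the `Sample` shape and the `shouldPage` name are hypothetical.

```typescript
// One evaluation of the error-ratio expression at a point in time.
interface Sample {
  timestampMs: number;
  errors: number; // 5xx responses counted in the rate window
  total: number;  // all responses counted in the rate window
}

// Returns true when the error ratio has stayed above `threshold`
// for at least `holdMs`, measured up to the most recent sample.
function shouldPage(samples: Sample[], threshold = 0.05, holdMs = 120000): boolean {
  let breachStart: number | null = null;
  let lastTs = 0;
  for (const s of samples) {
    const ratio = s.total === 0 ? 0 : s.errors / s.total;
    if (ratio > threshold) {
      if (breachStart === null) breachStart = s.timestampMs;
    } else {
      breachStart = null; // a single healthy sample resets the timer
    }
    lastTs = s.timestampMs;
  }
  return breachStart !== null && lastTs - breachStart >= holdMs;
}
```

The hold period is what filters out blips that resolve on their own: a short spike never accumulates two continuous minutes above the threshold, so it never pages.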

The Detection Checklist

When an alert fires or someone reports an issue, the on-call engineer runs this checklist:

  1. Verify the signal is real — Check the dashboard linked in the alert annotation. Is the metric actually elevated, or was it a blip that auto-resolved?
  2. Check for known causes — Is there an active deploy? A scheduled maintenance window? A known upstream dependency issue?
  3. Reproduce if possible — Can you trigger the error yourself? Does a synthetic transaction fail?
  4. Assess scope — Is this affecting one endpoint, one region, one user segment, or everything?
  5. Move to Triage — If the issue is real and ongoing, immediately move to triage.

Phase 2: Triage

Triage is the phase where you assess the impact, classify severity, and decide who needs to be involved. The goal is to spend under 5 minutes on triage for SEV-1 and under 15 minutes for SEV-2.

The Triage Framework

Answer these four questions in order:

1. What is broken? Identify the specific symptom. "API errors are elevated" is better than "something is wrong." "POST /api/orders returns 500 with connection refused from payment service" is best.

2. Who is affected?

  • All users vs. a subset (geographic region, plan tier, specific feature)
  • Internal vs. external users
  • Revenue-generating flows vs. non-critical flows

3. What is the blast radius?

Blast Radius Assessment:

Single user        → SEV-3/4, low urgency
Single feature     → SEV-3, moderate urgency
Single region      → SEV-2, high urgency
Core workflow      → SEV-2, high urgency
All users          → SEV-1, maximum urgency
Data integrity     → SEV-1, maximum urgency
Security breach    → SEV-1, maximum urgency

4. Who needs to be paged?

| Blast Radius | Page |
|--------------|------|
| Single feature, non-critical | On-call for owning team |
| Core workflow degraded | On-call + team lead |
| Full outage or data loss | On-call + team lead + incident commander + eng management |
| Security breach | On-call + security team + CISO + incident commander + eng management + legal |
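
The paging table can be kept as data so that the lookup is not left to memory during an incident. A minimal sketch, assuming hypothetical role identifiers such as `"on-call"` and `"ciso"`; a real setup would map these to paging-tool escalation policies.

```typescript
type BlastRadius =
  | "single-feature"
  | "core-workflow"
  | "full-outage-or-data-loss"
  | "security-breach";

// Role identifiers are illustrative placeholders, one list per table row.
const PAGE_LIST: Record<BlastRadius, string[]> = {
  "single-feature": ["on-call"],
  "core-workflow": ["on-call", "team-lead"],
  "full-outage-or-data-loss": ["on-call", "team-lead", "incident-commander", "eng-management"],
  "security-breach": ["on-call", "security-team", "ciso", "incident-commander", "eng-management", "legal"],
};

// An incident can match several rows; page the union without duplicates.
function whoToPage(radii: BlastRadius[]): string[] {
  const roles = new Set<string>();
  for (const radius of radii) {
    for (const role of PAGE_LIST[radius]) roles.add(role);
  }
  return Array.from(roles);
}
```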

Triage Anti-Patterns

DANGER

Anti-pattern: Solo debugging. The on-call engineer spends 45 minutes trying to figure out the root cause alone before telling anyone. Meanwhile, the outage is growing. Fix: If you cannot mitigate within 10 minutes, escalate immediately.

Anti-pattern: Severity denial. "It's probably fine, let's wait and see." By the time you confirm it's not fine, you've wasted 30 minutes. Fix: Assume the worst, downgrade later.

Anti-pattern: Wrong responders. Paging the backend team for a CDN issue. Fix: Maintain a clear service ownership map and page the right team from the start.

Phase 3: Mitigation

Mitigation is about restoring service as fast as possible. It is explicitly not about finding the root cause. You can investigate the root cause in the postmortem — right now the goal is to stop the bleeding.

The Mitigation Hierarchy

Try these strategies in order. Each subsequent strategy takes longer and carries more risk:

Strategy 1: Rollback

The fastest and safest mitigation. If the incident started after a deploy, roll it back first, ask questions later.

bash
# Kubernetes rollback
kubectl rollout undo deployment/api-server -n production

# Verify rollback
kubectl rollout status deployment/api-server -n production

# Check the previous revision it rolled back to
kubectl rollout history deployment/api-server -n production

bash
# AWS ECS — update to previous task definition
aws ecs update-service \
  --cluster production \
  --service api-server \
  --task-definition api-server:$PREVIOUS_REVISION

# Argo Rollouts — abort and rollback
kubectl argo rollouts abort api-server -n production
kubectl argo rollouts undo api-server -n production

When rollback fails: Either the deploy was not the cause, or a database schema change has made the old code incompatible with the new schema. Move to the next strategy.

Strategy 2: Feature Flag Toggle

If you can identify the specific feature causing the issue, disable it without a full rollback.

typescript
// LaunchDarkly / Unleash / custom feature flag
if (featureFlags.isEnabled('new-search-algorithm', user)) {
  return newSearchAlgorithm(query);
} else {
  return legacySearchAlgorithm(query); // fallback
}

// During incident: disable 'new-search-algorithm' in flag dashboard
// Effect: takes hold as soon as clients pick up the flag change; no deploy required

TIP

Feature flags are the single best investment for incident mitigation speed. If every new feature ships behind a flag, 80%+ of feature-related incidents can be mitigated in under 60 seconds by flipping a toggle.
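
A kill switch can be sketched without any vendor SDK. The `FlagStore` class below is a hypothetical in-memory stand-in and does not mirror the LaunchDarkly or Unleash APIs; it only shows the one property that matters during an incident, which is that flipping the flag changes the code path without a deploy.

```typescript
// Hypothetical in-memory flag store, for illustration only.
class FlagStore {
  private flags = new Map<string, boolean>();

  set(name: string, enabled: boolean): void {
    this.flags.set(name, enabled);
  }

  // Unknown flags default to off so a missing flag never enables new code.
  isEnabled(name: string): boolean {
    return this.flags.get(name) ?? false;
  }
}

const flags = new FlagStore();
flags.set("new-search-algorithm", true);

function search(query: string): string {
  return flags.isEnabled("new-search-algorithm")
    ? `new:${query}`     // stand-in for the new code path
    : `legacy:${query}`; // stand-in for the known-good fallback
}
```

Defaulting unknown flags to off is a deliberate choice here: if the flag service loses a definition, traffic falls back to the legacy path rather than the untested one.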

Strategy 3: Traffic Shift

If the problem is localized to a specific region, availability zone, or backend instance, shift traffic away.

bash
# Remove unhealthy backend from load balancer pool
aws elbv2 deregister-targets \
  --target-group-arn $TG_ARN \
  --targets Id=$UNHEALTHY_INSTANCE

# Route53 failover: force traffic to the secondary region by inverting the
# primary's health check (assumes failover records tied to $HEALTH_CHECK_ID)
aws route53 update-health-check \
  --health-check-id $HEALTH_CHECK_ID \
  --inverted

# After recovery, restore normal health evaluation
aws route53 update-health-check \
  --health-check-id $HEALTH_CHECK_ID \
  --no-inverted

Strategy 4: Scale / Restart

Sometimes the mitigation is simply restarting pods or scaling up to handle unexpected load.

bash
# Kubernetes — restart all pods in a deployment
kubectl rollout restart deployment/api-server -n production

# Scale up to handle load
kubectl scale deployment/api-server --replicas=20 -n production

# If a specific pod is in a bad state
kubectl delete pod api-server-abc123 -n production

Strategy 5: Hotfix

A targeted code change deployed directly to production. This is the riskiest mitigation because you're deploying under pressure with minimal testing.

bash
# Hotfix workflow
git checkout main
git pull origin main
git checkout -b hotfix/payment-null-check

# Make the minimal possible change
# ONE line fix, no refactoring, no "while I'm here" changes

git add src/payment/processor.ts
git commit -m "hotfix: null check on payment response"
git push origin hotfix/payment-null-check

# Fast-track through CI — skip non-essential checks
# Deploy via expedited pipeline

DANGER

Hotfixes under pressure are the most common source of "the fix made it worse" incidents. Every hotfix should be reviewed by at least one other engineer, even if the review takes only 60 seconds. A second pair of eyes catches the typo that turns a 1-hour outage into a 4-hour outage.

Strategy 6: Partial Degradation / Graceful Failover

If you cannot fully restore service, degrade gracefully. Serve stale data, disable non-essential features, or switch to a read-only mode.

typescript
// Fallback pattern: degrade to cached response
async function getProductRecommendations(userId: string) {
  try {
    return await recommendationService.getPersonalized(userId);
  } catch (error) {
    // Service is down — return cached/static recommendations
    logger.warn('Recommendation service unavailable, serving fallback');
    return getCachedRecommendations();
  }
}
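
The try/catch fallback above still calls the failing service on every request. A circuit breaker goes one step further and stops calling it for a cooldown period after repeated failures. The following is a minimal sketch under stated assumptions: the class name, the consecutive-failure threshold, and the `withBreaker` helper are illustrative, not a specific library's API.

```typescript
// Opens after `maxFailures` consecutive failures and stays open for `cooldownMs`.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly maxFailures: number,
    private readonly cooldownMs: number,
  ) {}

  // True when the dependency may be called: closed, or cooldown elapsed (trial call).
  allowRequest(nowMs: number): boolean {
    return this.openedAt === null || nowMs - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = nowMs;
  }
}

// Wraps an async call: skips the dependency while the breaker is open.
async function withBreaker<T>(
  breaker: CircuitBreaker,
  primary: () => Promise<T>,
  fallback: () => T,
): Promise<T> {
  if (!breaker.allowRequest(Date.now())) return fallback();
  try {
    const result = await primary();
    breaker.recordSuccess();
    return result;
  } catch {
    breaker.recordFailure(Date.now());
    return fallback();
  }
}
```

Keeping the state transitions synchronous and passing the clock in as a parameter makes the breaker easy to test without real timers.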

Phase 4: Communication

Communication is not optional and it is not secondary to mitigation. Silence during an outage is worse than bad news — it signals that either you do not know about the problem or you do not care.

The Communication Timeline

| Time Since Detection | Action | Channel |
|----------------------|--------|---------|
| 0-5 min | Internal alert: "We're aware and investigating" | Incident Slack channel |
| 5-15 min | Severity classified, IC assigned | Incident Slack channel |
| 15-30 min | External status page updated | Statuspage / Instatus |
| 30-60 min | First customer email (if SEV-1/2) | Email, in-app banner |
| Every 30 min | Internal status update | Incident Slack channel |
| Every 60 min | External status update | Statuspage, Twitter/X |
| Resolution | "Resolved" update on all channels | All channels |
| +24-72 hr | Postmortem shared | Internal wiki, customer blog (if major) |
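
The recurring rows of the timeline (internal updates every 30 minutes, external every 60) are the ones most often missed. A minimal sketch of a reminder check, assuming a hypothetical `updatesDue` helper that a bot or the IC's timer could call:

```typescript
type Audience = "internal" | "external";

// Cadence taken from the timeline: internal every 30 min, external every 60 min.
const UPDATE_INTERVAL_MIN: Record<Audience, number> = { internal: 30, external: 60 };

// Returns the audiences whose next status update is due at `now`.
function updatesDue(lastUpdate: Record<Audience, Date>, now: Date): Audience[] {
  const audiences: Audience[] = ["internal", "external"];
  return audiences.filter(
    (a) => now.getTime() - lastUpdate[a].getTime() >= UPDATE_INTERVAL_MIN[a] * 60000,
  );
}
```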

Internal Communication Templates

Incident Declaration (Slack)

:rotating_light: INCIDENT DECLARED — SEV-[1/2/3]

What: [One sentence describing the symptom]
Impact: [Who is affected and how]
Started: [Time in UTC] ([X minutes ago])
IC: @[incident-commander]
Channel: #inc-[YYYY-MM-DD]-[short-description]
Status: Investigating

War room: [Zoom/Meet link]
Dashboard: [Grafana link]
Runbook: [Link]
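
Filling the declaration template by hand is error-prone under pressure, so it can be generated from a few fields. A minimal sketch: the `Declaration` shape and function names are assumptions, and the war room, dashboard, and runbook links are left out for brevity.

```typescript
interface Declaration {
  severity: 1 | 2 | 3;
  what: string;
  impact: string;
  startedAt: Date;
  commander: string; // Slack handle without the leading "@"
  slug: string;      // short description, e.g. "api-errors"
}

// Builds the channel name in the #inc-YYYY-MM-DD-short-description format.
function channelName(d: Declaration): string {
  return `#inc-${d.startedAt.toISOString().slice(0, 10)}-${d.slug}`;
}

function renderDeclaration(d: Declaration, now: Date): string {
  const minutesAgo = Math.floor((now.getTime() - d.startedAt.getTime()) / 60000);
  const startedUtc = d.startedAt.toISOString().slice(11, 16); // HH:MM in UTC
  return [
    `:rotating_light: INCIDENT DECLARED - SEV-${d.severity}`,
    "",
    `What: ${d.what}`,
    `Impact: ${d.impact}`,
    `Started: ${startedUtc} UTC (${minutesAgo} minutes ago)`,
    `IC: @${d.commander}`,
    `Channel: ${channelName(d)}`,
    "Status: Investigating",
  ].join("\n");
}
```

Using `toISOString()` keeps every timestamp in UTC regardless of the responder's local time zone.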

Status Update (Every 30 min)

:mag: INCIDENT UPDATE — SEV-[1/2/3] — [HH:MM] UTC

Current status: [Investigating / Identified / Mitigating / Monitoring]
What we know: [2-3 sentences on current understanding]
What we're doing: [Current mitigation actions]
Next update: [Time] UTC

Impact: [Updated impact assessment]
Duration so far: [X hours Y minutes]

Resolution (Slack)

:white_check_mark: INCIDENT RESOLVED — SEV-[1/2/3]

Duration: [X hours Y minutes]
What happened: [2-3 sentence summary]
Mitigation: [What fixed it]
Customer impact: [Brief impact summary]

Postmortem will be scheduled within [48/72] hours.
Channel will be archived in 24 hours — add any notes before then.

External Communication Templates

Status Page — Investigating

Title: Elevated error rates on [Service Name]

We are currently investigating elevated error rates affecting
[description of affected functionality]. Some users may experience
[specific symptom — e.g., slow page loads, failed API requests,
inability to log in].

Our engineering team is actively working to resolve this issue.
We will provide updates every 30 minutes.

Posted: [Time] UTC

Status Page — Identified

Title: Elevated error rates on [Service Name] — Cause Identified

We have identified the cause of the elevated error rates affecting
[Service Name]. The issue is related to [general category — e.g.,
a configuration change, increased load, a dependency failure].

Our team is implementing a fix and we expect service to be restored
within [estimated timeframe].

Updated: [Time] UTC

Status Page — Resolved

Title: Elevated error rates on [Service Name] — Resolved

The issue affecting [Service Name] has been resolved. Service has
been restored to normal operation as of [Time] UTC.

The incident lasted approximately [duration]. During this time,
[brief impact summary — e.g., approximately X% of API requests
returned errors].

We will publish a detailed incident report within [timeframe].
We apologize for the disruption.

Updated: [Time] UTC

Customer Email (SEV-1, Major Outage)

Subject: [Service Name] Service Disruption — [Date]

Dear [Customer Name],

We want to inform you about a service disruption that affected
[Service Name] today.

WHAT HAPPENED
Starting at [time] UTC, [brief description of the symptom from the
customer's perspective — e.g., users were unable to access their
dashboards / API requests returned errors / data processing was
delayed].

IMPACT
The disruption lasted [duration] and affected [scope of impact].
[If applicable: Your account was / was not directly affected.]

RESOLUTION
Our engineering team identified the issue at [time] and implemented
a fix at [time]. Service was fully restored at [time] UTC.

WHAT WE'RE DOING TO PREVENT RECURRENCE
We are conducting a thorough incident review and will implement
[brief description of preventive measures]. We take reliability
seriously and are committed to preventing similar disruptions.

[If applicable: SLA CREDITS
If this disruption affected your service level agreement, credits
will be applied automatically to your next invoice.]

If you have questions, please contact your account manager or
our support team at support@example.com.

Sincerely,
[VP Engineering / CTO Name]
[Company Name]

WARNING

Never lie in external communications. Do not say "a small number of users" when 80% were affected. Do not say "briefly" when the outage lasted 3 hours. Customers who experienced the outage will read your status update and lose trust if the description does not match their experience.

Incident Commander Role

The Incident Commander (IC) is the single point of authority during an incident. They do not debug — they coordinate.

IC Responsibilities

  1. Declare the incident and set the initial severity
  2. Open the war room (Slack channel, video call)
  3. Assign roles (communications lead, technical lead, scribe)
  4. Drive the triage — ensure the right people are investigating the right things
  5. Make decisions — when responders disagree on mitigation approach, the IC decides
  6. Authorize risky actions — rollbacks, hotfixes, data migrations during the incident
  7. Manage communication cadence — ensure status updates go out on schedule
  8. Call for escalation — page additional responders, notify leadership
  9. Declare resolution — confirm the incident is over
  10. Ensure postmortem is scheduled

IC Anti-Patterns

| Anti-Pattern | Impact | Correct Behavior |
|--------------|--------|------------------|
| IC debugs the issue themselves | Coordination stops, response slows | Delegate debugging, focus on coordination |
| IC makes all technical decisions | Bottleneck, slower response | Let domain experts decide; IC breaks ties |
| IC skips communication updates | Stakeholders escalate, chaos | Set a timer, delegate to comms lead |
| IC never escalates | Team drowns in a SEV-1 | Escalate early and often |
| IC assigns blame during incident | People stop sharing information | Focus on facts and actions, never blame |

War Room Etiquette

WAR ROOM RULES (Pin this in every incident channel)

1. IC runs the room. Follow their instructions.
2. One conversation at a time. Raise hand to speak.
3. State facts, not theories. "Error rate is 47%" not "I think the DB is slow."
4. Prefix messages with your role: "[DB] Replication lag is 30 seconds"
5. If you're not actively contributing, mute and observe.
6. No blame. No "who deployed this?" — focus on "what is the state now?"
7. Keep a timeline. Scribe records every significant action with timestamp.
8. Announce before making changes: "I'm going to restart pod X" → wait for IC OK.
9. No side-channel debugging. All findings reported in the war room.
10. Take breaks. Fatigue causes mistakes. IC rotates responders on long incidents.

Timeline Documentation

Every incident must have a written timeline. The scribe (or the IC if no scribe is assigned) records every significant event with a UTC timestamp.

Timeline Template

markdown
## Incident Timeline

| Time (UTC) | Event | Actor |
|------------|-------|-------|
| 14:02 | Alert fires: API error rate > 5% | PagerDuty |
| 14:04 | On-call acknowledges alert | @alice |
| 14:06 | Confirms error rate is real, 12% and climbing | @alice |
| 14:08 | Incident declared SEV-2, channel #inc-2026-04-05-api-errors created | @alice |
| 14:10 | IC assigned: @bob | @alice |
| 14:12 | War room opened, @carol paged for database expertise | @bob (IC) |
| 14:15 | Identified: errors are all "connection refused" to payment service | @alice |
| 14:17 | Payment service pods are in CrashLoopBackOff | @carol |
| 14:19 | IC decision: rollback payment service to previous version | @bob (IC) |
| 14:21 | Rollback initiated | @alice |
| 14:24 | Payment service pods healthy, error rate dropping | @alice |
| 14:28 | Error rate at 0.1% (normal), monitoring for 15 min | @alice |
| 14:30 | Status page updated: "Identified and mitigating" | @dave (comms) |
| 14:43 | 15 min stable at normal levels, incident resolved | @bob (IC) |
| 14:45 | Status page updated: "Resolved" | @dave (comms) |
| 14:47 | Postmortem scheduled for 2026-04-07 | @bob (IC) |

TIP

Write the timeline as events happen, not after the incident. Memory is unreliable under stress, and chat logs are hard to reconstruct into a coherent timeline after the fact. A dedicated scribe who records events in real-time is the single highest-value role during a long incident.
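
A scribe's notes can be captured as structured entries and rendered into the table format of the template. This is a minimal sketch with hypothetical names (`Timeline`, `record`, `toMarkdown`); tools such as incident.io automate the same idea.

```typescript
interface TimelineEntry {
  at: Date;
  event: string;
  actor: string;
}

class Timeline {
  private entries: TimelineEntry[] = [];

  // Defaults to the current time so the scribe only types event and actor.
  record(event: string, actor: string, at: Date = new Date()): void {
    this.entries.push({ at, event, actor });
  }

  // Renders a three-column Markdown table, sorted by time, with UTC timestamps.
  toMarkdown(): string {
    const rows = this.entries
      .slice()
      .sort((a, b) => a.at.getTime() - b.at.getTime())
      .map((e) => `| ${e.at.toISOString().slice(11, 16)} | ${e.event} | ${e.actor} |`);
    return ["| Time (UTC) | Event | Actor |", "|------------|-------|-------|"]
      .concat(rows)
      .join("\n");
  }
}
```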

Phase 5: Postmortem

The postmortem converts the incident from a crisis into a permanent improvement. It is the most important phase of the incident lifecycle because it is the only phase that prevents future incidents.

Postmortem Scheduling

| Severity | Postmortem Required? | Due Date |
|----------|----------------------|----------|
| SEV-1 | Required | Within 48 hours |
| SEV-2 | Required | Within 72 hours |
| SEV-3 | Recommended | Within 1 week |
| SEV-4 | Optional | Best effort |
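
The due dates can be computed when the incident is resolved so that the postmortem lands on a calendar immediately. A minimal sketch, assuming the clock starts at resolution time (the table does not state the reference point) and treating SEV-4 as having no deadline:

```typescript
type Severity = 1 | 2 | 3 | 4;

// Hours allowed per severity; null means best effort with no fixed deadline.
const POSTMORTEM_DUE_HOURS: Record<Severity, number | null> = {
  1: 48,
  2: 72,
  3: 168, // one week
  4: null,
};

// Assumption: the deadline is measured from the moment the incident is resolved.
function postmortemDueDate(severity: Severity, resolvedAt: Date): Date | null {
  const hours = POSTMORTEM_DUE_HOURS[severity];
  return hours === null ? null : new Date(resolvedAt.getTime() + hours * 3600000);
}
```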

The 5 Whys Technique

The 5 Whys is a simple root cause analysis technique. You ask "why" iteratively until you reach a systemic root cause that can be fixed.

Incident: Payment processing was down for 40 minutes.

Why 1: Why was payment processing down?
  → The payment service pods were in CrashLoopBackOff.

Why 2: Why were the pods crashing?
  → The new deployment had a null pointer exception when
    the payment gateway returned an unexpected response format.

Why 3: Why did the code not handle the unexpected response?
  → The integration test mocked the payment gateway response
    and never tested the real response format.

Why 4: Why didn't the test cover the real response format?
  → There is no contract test or integration test that validates
    against the actual payment gateway API.

Why 5: Why is there no contract test?
  → Contract testing was never added to the testing standards,
    and there is no linting rule or CI check that requires it for
    external API integrations.

Root Cause: Missing contract testing standard for external integrations.

Action Items:
1. Add contract tests for payment gateway (P0 — @alice, due next sprint)
2. Add CI check requiring contract tests for all external API clients (P1 — @bob)
3. Audit other external integrations for contract test coverage (P1 — @carol)

WARNING

The 5 Whys technique has a known limitation: it tends to converge on a single linear causal chain. Real incidents have multiple contributing causes. Use 5 Whys as a starting point, then explicitly ask "what other factors contributed?" to surface systemic issues that a single chain misses.

Postmortem Template

markdown
# Postmortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** HH:MM (detection to resolution)
**Severity:** SEV-X
**Incident Commander:** [Name]
**Authors:** [Names]
**Status:** Draft / In Review / Final

---

## Impact

- **Users affected:** ~N
- **Revenue impact:** $X (estimated)
- **SLO budget consumed:** X% of monthly budget
- **Services affected:** [list]
- **Error rate peak:** X%
- **Duration of customer impact:** X minutes

## Summary

[2-3 paragraph narrative of what happened, written for someone who
was not in the war room. Describe the timeline at a high level.]

## Timeline

[Detailed timeline table — see template above]

## Root Cause Analysis

### 5 Whys

[Walk through the chain]

### Contributing Factors

[List all systemic factors that enabled the incident]

## What Went Well

- [Things that worked during the response]
- [Detection was fast, rollback worked, communication was clear]

## What Went Poorly

- [Things that slowed down or hampered the response]
- [Detection was slow, wrong team paged, rollback failed]

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | [Must-do to prevent recurrence] | @name | Date | Open |
| P1 | [Should-do to improve resilience] | @name | Date | Open |
| P2 | [Nice-to-have improvement] | @name | Date | Open |

## Lessons Learned

[Key insights for the broader engineering organization]

Blameless Culture

Blameless does not mean consequence-free. It means:

  • Focus on systems, not individuals. "The deploy pipeline has no canary analysis" — not "Alice deployed without testing."
  • Assume good intent. Engineers made reasonable decisions given the information they had at the time.
  • Avoid counterfactual blame. "If only Bob had checked the logs" is unhelpful because it does not prevent recurrence. "The monitoring did not surface the relevant logs" is actionable.
  • Psychological safety. If engineers fear punishment, they will hide information, avoid on-call, and under-report incidents. This makes the organization less reliable, not more.

Recurring Incident Patterns

Over time, certain patterns emerge that cause the same classes of incidents repeatedly. Recognizing these patterns accelerates both triage and mitigation.

The Twelve Recurring Patterns

| Pattern | Description | Mitigation Strategy |
|---------|-------------|---------------------|
| Deploy-and-pray | No canary, no staged rollout | Canary deployments, automated rollback on error spike |
| Config as code bomb | Config change bypasses CI/CD safety | Config changes through same pipeline as code |
| Dependency roulette | Upstream service fails, takes you down | Circuit breakers, timeouts, fallback responses |
| Thundering herd | All caches expire simultaneously | Staggered TTLs, cache stampede protection |
| DNS time bomb | TTL too high, failover is slow | Low TTL for critical records, pre-warm DNS |
| Certificate expiry | TLS cert expires, HTTPS breaks | Automated cert renewal (Let's Encrypt), expiry alerting |
| Disk full | Logs or data fill disk, service crashes | Log rotation, disk usage alerts at 80% |
| Connection pool exhaustion | Slow queries hold connections, pool drains | Connection timeouts, pool size monitoring |
| Memory leak | Gradual OOM over days/weeks | Memory usage trending alerts, periodic restarts |
| Schema drift | DB migration breaks backward compatibility | Expand/contract pattern, zero-downtime migrations |
| Secret rotation failure | Rotated secret, forgot a service | Centralized secret management, automated rotation |
| Region failover untested | DR plan exists but was never tested | Regular failover drills (GameDay) |
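
As one example of these mitigations, staggered TTLs for the thundering herd pattern can be as simple as adding random jitter to each cache entry's expiry. A minimal sketch; the function name and the 10% default jitter are assumptions, and the random source is injectable so the behavior can be tested.

```typescript
// Returns a TTL spread over roughly [baseTtl * (1 - jitter), baseTtl * (1 + jitter)]
// so that entries written at the same moment do not all expire together.
function jitteredTtlSeconds(
  baseTtl: number,
  jitter = 0.1,
  random: () => number = Math.random,
): number {
  const offset = (random() * 2 - 1) * jitter; // in [-jitter, +jitter)
  return Math.round(baseTtl * (1 + offset));
}
```

With a 300-second base TTL and 10% jitter, expirations spread across about a minute instead of landing on the same second.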

Pattern Recognition During Triage

Quick diagnostic questions:

1. Did a deploy happen in the last 2 hours?          → Deploy-and-pray
2. Did a config change happen in the last 2 hours?   → Config bomb
3. Is an upstream dependency having issues?           → Dependency roulette
4. Did the issue start at a round time (top of hour)? → Thundering herd / cron collision
5. Has the issue been getting gradually worse?        → Memory leak / disk full
6. Is a certificate or secret near expiry?            → Cert/secret rotation
7. Are connection pools at capacity?                  → Connection pool exhaustion

On-Call Handbook

On-Call Responsibilities

You are on-call to detect, triage, and mitigate — not to fix root causes, refactor code, or implement permanent solutions during the on-call shift.

On-Call Checklist (Start of Rotation)

markdown
- [ ] Verify PagerDuty/OpsGenie is configured and you receive test pages
- [ ] Verify VPN access works
- [ ] Verify you can access production dashboards (Grafana, Datadog, etc.)
- [ ] Verify you can deploy and rollback
- [ ] Review recent deploys and changes from the previous on-call shift
- [ ] Review open incidents or known issues
- [ ] Confirm you have the on-call phone/laptop charged and nearby
- [ ] Know who to escalate to (secondary on-call, team lead, IC roster)

On-Call Escalation Path

On-Call Handoff Template

markdown
## On-Call Handoff — [Date]

### Open Issues
- [Issue 1]: [Brief status, what to watch for]
- [Issue 2]: [Brief status, what to watch for]

### Recent Changes
- [Deploy X]: Shipped at [time], monitoring [metric]
- [Config change Y]: Applied at [time], watch for [symptom]

### Alerts That Fired
- [Alert A]: [Was it real? What was done?]
- [Alert B]: [Known flaky, ignore unless sustained > 10 min]

### Known Risks
- [Upcoming deploy of feature Z on Tuesday]
- [Upstream provider maintenance window Thursday 2-4am UTC]

### Notes for Next On-Call
- [Anything the next person should know]

Incident Response Tools

Alerting and Paging

| Tool | Best For | Key Feature |
|------|----------|-------------|
| PagerDuty | Enterprise on-call management | Intelligent routing, escalation policies, analytics |
| OpsGenie | Atlassian-native teams | Deep Jira/Confluence integration |
| Grafana OnCall | Open-source, Grafana-native | Free, integrates with Grafana alerting |
| incident.io | Slack-native incident management | Incident lifecycle in Slack, auto-documentation |
| FireHydrant | Full incident lifecycle | Runbooks, status pages, retrospectives |

Status Pages

| Tool | Best For | Key Feature |
|------|----------|-------------|
| Atlassian Statuspage | Enterprise, established | Widely recognized format, component status |
| Instatus | Modern, fast setup | Beautiful UI, fast, affordable |
| Cachet | Self-hosted | Open-source, full control |
| incident.io | Integrated workflow | Status page as part of incident management |

Incident Documentation

| Tool | Best For | Key Feature |
|------|----------|-------------|
| incident.io | Automated timeline | Pulls Slack messages into timeline automatically |
| Jeli | Deep postmortem analysis | Narrative-focused, learning-oriented |
| Blameless | SRE-focused teams | SLO integration, error budget tracking |
| Google Docs | Simplicity | Free, everyone knows how to use it |

Runbook Automation

| Tool | Best For | Key Feature |
|------|----------|-------------|
| Rundeck | Self-hosted automation | Job scheduling, access control |
| PagerDuty Automation | PagerDuty customers | Event-driven automation |
| Shoreline | Real-time remediation | Op scripts run on fleet in seconds |

Metrics to Track

Measure your incident response to improve it over time:

| Metric | Definition | Target |
|--------|------------|--------|
| MTTD (Mean Time to Detect) | Time from incident start to first alert | < 5 min |
| MTTA (Mean Time to Acknowledge) | Time from alert to human acknowledgment | < 5 min (SEV-1) |
| MTTM (Mean Time to Mitigate) | Time from detection until customer impact stops | < 30 min (SEV-1) |
| MTTR (Mean Time to Resolve) | Time from detection until the root cause fix is deployed | Varies by severity |
| Postmortem completion rate | % of required postmortems completed on time | > 95% |
| Action item completion rate | % of P0/P1 action items completed by due date | > 90% |
| Incidents per month | Total incident count by severity | Trending downward |
| Customer-detected rate | % of incidents first reported by customers | < 10% |
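
The time-based metrics follow directly from four timestamps per incident. A minimal sketch, assuming a hypothetical `IncidentTimes` record; in practice these timestamps would come from the incident timeline or the paging tool's export.

```typescript
interface IncidentTimes {
  startedAt: Date;      // when customer impact began
  detectedAt: Date;     // first alert
  acknowledgedAt: Date; // human acknowledgment
  mitigatedAt: Date;    // customer impact stopped
}

const minutesBetween = (a: Date, b: Date): number => (b.getTime() - a.getTime()) / 60000;

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Means are in minutes, following the definitions in the table above.
function responseMetrics(incidents: IncidentTimes[]) {
  return {
    mttdMin: mean(incidents.map((i) => minutesBetween(i.startedAt, i.detectedAt))),
    mttaMin: mean(incidents.map((i) => minutesBetween(i.detectedAt, i.acknowledgedAt))),
    mttmMin: mean(incidents.map((i) => minutesBetween(i.detectedAt, i.mitigatedAt))),
  };
}
```

Computing these per severity rather than across all incidents keeps a run of quick SEV-4 tickets from masking a slow SEV-1 response.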

Key Takeaway

Incident response is a practiced skill, not an improvised reaction. The difference between a 15-minute mitigation and a 4-hour outage is almost never technical ability — it is having pre-built playbooks, clear escalation paths, communication templates ready to paste, and a team that has practiced the process before the real incident hits. Invest in the boring operational infrastructure (severity definitions, on-call handoffs, status page templates) and the exciting emergencies become routine.


Misconceptions

7 Incident Response Misconceptions

1. "The best incident responders are the ones who fix things fastest." The best incident responders are the ones who restore service fastest — which usually means rollback, not debugging. An engineer who rolls back in 2 minutes and investigates later is more effective than one who spends 45 minutes finding the root cause while the outage continues.

2. "We need to find the root cause before we can mitigate." Root cause analysis happens in the postmortem, not during the incident. Mitigation is about stopping customer impact, which can usually be done without understanding why the failure occurred. Roll back, toggle the flag, shift traffic — then investigate.

3. "More people in the war room means faster resolution." Beyond 5-7 active responders, additional people create coordination overhead that slows response. The IC should page specific domain experts, not broadcast for volunteers. Observers should mute and watch.

4. "SEV-1 means the incident is our fault." Severity classification is about customer impact, not blame. A SEV-1 caused by an upstream provider failure is still SEV-1 for your customers. Classify by impact, not by root cause.

5. "If we communicate the outage, customers will panic." Customers are already experiencing the outage. Silence does not prevent panic — it creates it. Proactive, honest communication builds trust even during failures. Companies that communicate well during incidents (Cloudflare, GitHub, Stripe) are trusted more because of their transparency.

6. "Blameless means no one is accountable." Blameless means not punishing honest mistakes made in good faith. It does not mean ignoring systemic accountability. Action items have owners, due dates, and are tracked to completion. Teams are accountable for their service reliability. Individuals are not punished for being human.

7. "We'll set up incident response when we're bigger." Incident response practices scale down to teams of 2-3. A startup with clear severity definitions, a Slack channel convention, and a postmortem template handles incidents better than a 200-person company with no process. Start simple and grow the process with the team.


When NOT to Use This Playbook

| Scenario | Why Not | What to Do Instead |
| --- | --- | --- |
| Planned maintenance windows | Maintenance is expected downtime, not an incident | Use a maintenance runbook and pre-notify customers |
| Bug reports with no current customer impact | Not an active incident; it is a bug | File a ticket, prioritize normally |
| Performance issue with no customer-facing degradation | Proactive optimization, not incident response | Track in sprint planning, use performance budgets |
| Internal tooling issues with no customer impact | SEV-4 at most, does not need war rooms | Handle during business hours, file a ticket |
| Third-party outage with no impact on your service | Not your incident | Monitor, but do not declare an incident |
| Security vulnerability discovered (no active exploit) | This is a security response process, not incident response | Follow vulnerability management process |

In Production

Production Considerations

Start with severity definitions. Before anything else, get your team to agree on what SEV-1 through SEV-4 mean for your specific product. A payment processing company and a social media app have very different SEV-1 thresholds. Write it down, share it, and reference it during every incident.

Practice before you need it. Run tabletop exercises (scenario walkthroughs) quarterly. Simulate a SEV-1 without touching production. Have the on-call declare an incident, open a war room, and practice the communication flow. Teams that have never practiced will fumble during real incidents.

Automate the ceremony. Use incident.io, PagerDuty, or even a Slack bot to automate channel creation, role assignment, timeline recording, and status page updates. The less cognitive overhead during an incident, the more brainpower is available for mitigation.
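
As a rough illustration of what "automating the ceremony" captures, the sketch below builds the bookkeeping a bot would create when an incident is declared: a channel name, a role roster, and the first timeline entry. It calls no real Slack, PagerDuty, or incident.io API. The channel naming convention, the role names other than the IC, and the `open_incident` function are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical role names; adapt to the roles your playbook defines.
ROLES = ("incident_commander", "communications_lead", "scribe")


def open_incident(severity, summary, declared_at, commander):
    """Return the channel name, role roster, and initial timeline for an incident."""
    # Assumed convention: inc-<date>-sev<N>-<short-slug>
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower())[:40].strip("-")
    channel = f"inc-{declared_at:%Y%m%d}-sev{severity}-{slug}"

    # Whoever declares the incident is IC until the role is handed off.
    roles = {role: None for role in ROLES}
    roles["incident_commander"] = commander

    # The timeline starts at declaration so the postmortem has a first entry.
    timeline = [(declared_at, f"SEV-{severity} declared by {commander}: {summary}")]
    return {"channel": channel, "roles": roles, "timeline": timeline}


incident = open_incident(1, "API 503 errors", datetime(2024, 1, 6, 2, 30), "alice")
unfilled = [role for role, person in incident["roles"].items() if person is None]
print(incident["channel"])
print("Roles still to assign:", unfilled)
```

The point of the sketch is that none of these steps requires judgment, so none of them should consume responder attention during an incident.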

Track completion religiously. The most common failure mode is writing great postmortems with action items that never get done. P0 and P1 action items should become Jira/Linear tickets with owners and due dates, and completion rates should be reviewed in recurring engineering meetings.
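
A completion review can be as small as the sketch below, which computes the on-time completion rate and lists what is overdue. It assumes action items have been exported from your ticket tracker into simple records. The `ActionItem` class, its fields, and the sample items are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical export of a postmortem action item from a ticket tracker."""
    title: str
    priority: str                    # "P0", "P1", "P2", ...
    owner: str
    due: date
    completed: Optional[date] = None


def on_time_completion_rate(items, today, priorities=("P0", "P1")):
    """Share of tracked-priority items already due that were finished by their due date."""
    due_items = [i for i in items if i.priority in priorities and i.due <= today]
    if not due_items:
        return None  # nothing has come due yet, so there is no rate to report
    on_time = sum(1 for i in due_items if i.completed is not None and i.completed <= i.due)
    return on_time / len(due_items)


def overdue(items, today, priorities=("P0", "P1")):
    """Open tracked-priority items whose due date has passed."""
    return [i for i in items
            if i.priority in priorities and i.completed is None and i.due < today]


# Made-up example data.
items = [
    ActionItem("Add gateway failover", "P0", "alice", date(2024, 1, 10), date(2024, 1, 9)),
    ActionItem("Alert on 503 rate", "P1", "bob", date(2024, 1, 12), date(2024, 1, 15)),
    ActionItem("Document rollback", "P0", "carol", date(2024, 1, 14)),
    ActionItem("Tidy dashboard", "P2", "dave", date(2024, 1, 10)),
    ActionItem("Load test failover", "P1", "erin", date(2024, 2, 1)),
]
today = date(2024, 1, 20)
rate = on_time_completion_rate(items, today)
late = overdue(items, today)
print(f"On-time completion: {rate:.0%}")
print("Overdue:", [i.title for i in late])
```

Items completed after their due date count against the rate, which matches a "completed by due date" definition.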

Review on-call health. Track pages per on-call shift, off-hours pages, and false alarm rate. If on-call is miserable (woken up 5 times for false alarms), engineers will dread it, morale drops, and response quality degrades. Invest in alert quality.
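
The three on-call health numbers named above can be computed from a page log. This is a minimal sketch, assuming each page is recorded with the responder's local time and a flag for whether it required action. The `Page` class, the 09:00 to 18:00 business-hours window, and the sample data are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    """Hypothetical page-log entry."""
    fired_at: datetime   # responder's local time
    actionable: bool     # False means the page was a false alarm


def oncall_health(pages, shifts, business_hours=(9, 18)):
    """Summarize pages per shift, off-hours pages, and false alarm rate."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    start, end = business_hours
    # Off-hours: any weekend page, or a weekday page outside the business window.
    off_hours = [p for p in pages
                 if p.fired_at.weekday() >= 5 or not (start <= p.fired_at.hour < end)]
    false_alarms = [p for p in pages if not p.actionable]
    return {
        "pages_per_shift": len(pages) / shifts,
        "off_hours_pages": len(off_hours),
        "false_alarm_rate": len(false_alarms) / len(pages) if pages else 0.0,
    }


# Made-up example data covering two shifts.
pages = [
    Page(datetime(2024, 1, 6, 2, 30), actionable=False),   # Saturday night
    Page(datetime(2024, 1, 8, 10, 0), actionable=True),    # Monday morning
    Page(datetime(2024, 1, 8, 23, 0), actionable=False),   # Monday late evening
    Page(datetime(2024, 1, 9, 14, 0), actionable=True),    # Tuesday afternoon
]
health = oncall_health(pages, shifts=2)
print(health)
```

A rising false alarm rate combined with rising off-hours pages is the signal to spend time on alert quality before response quality degrades.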


Quiz

Quiz — 7 Questions

Q1: What is the correct order of the incident lifecycle phases? Detect, Triage, Mitigate, Communicate, Postmortem. Communication actually overlaps with Mitigate in practice (you communicate while mitigating), but the mental model sequence starts with detection and ends with postmortem.

Q2: Why should you roll back before debugging during a SEV-1? The goal of mitigation is to restore service as fast as possible, not to understand why the failure occurred. A rollback typically takes 1-2 minutes and immediately restores service if the deploy was the cause. Debugging can take 30+ minutes. Root cause analysis belongs in the postmortem, not during the active incident.

Q3: What is the difference between MTTD and MTTR? MTTD (Mean Time to Detect) measures the time from when the incident starts to when you first know about it (alert fires, human notices). MTTR (Mean Time to Resolve) measures the time from detection to the root cause fix being deployed. MTTM (Mean Time to Mitigate) is the metric in between — from detection to customer impact stopping.

Q4: When should you escalate severity during an incident? When in doubt, escalate immediately. If you initially classified as SEV-3 but the blast radius is growing, upgrade to SEV-2 or SEV-1 right away. The cost of over-classifying (a few extra pages) is much lower than the cost of under-classifying (delayed response to a major outage).

Q5: What does "blameless" mean in a postmortem context? Blameless means focusing on systemic factors rather than individual blame. It assumes engineers acted reasonably given the information they had. It does not mean no accountability — action items have owners and due dates. It means not punishing honest mistakes, so that engineers freely share what happened without fear of retribution.

Q6: Why are status page updates important even when you do not have full information? Customers who experience the outage will check your status page. If it says "All Systems Operational" while they cannot log in, they lose trust in your communication. An honest "Investigating" update within 20 minutes signals awareness and responsibility, even before you know the cause.

Q7: What is the most common failure mode of postmortem action items? They are written but never completed. Action items live in a postmortem document that no one revisits, they are never converted to tracked tickets, and there is no regular review of completion. The fix is to automatically create tickets from P0/P1 action items and review completion rates in recurring engineering meetings.


Exercise

Incident Response Tabletop Exercise

Scenario: It is 2:30 AM on a Saturday. Your monitoring system fires an alert: the API error rate has spiked to 35% (normal baseline: 0.5%). The errors are all HTTP 503 (Service Unavailable). Your service handles financial transactions.

Part 1 — Triage (5 minutes)

  1. What severity level do you assign? Justify your decision.
  2. Who do you page? What roles do you need?
  3. What are the first three questions you ask to assess blast radius?

Part 2 — Mitigation (10 minutes)

  4. You check the deploy log: a deploy went out 45 minutes ago (before the error spike). The deploy added a new payment validation step. What is your first mitigation action?
  5. You roll back, but the error rate does not improve. What is your second mitigation action?
  6. You discover the error is "connection refused" from an upstream payment gateway. What mitigation strategy do you use when the problem is an external dependency?

Part 3 — Communication (10 minutes)

  7. Write the initial internal Slack message declaring the incident (use the template).
  8. Write the status page update for "Investigating" status.
  9. The outage is now 90 minutes long. A major customer's CTO emails your CEO asking what is going on. Draft a brief, honest response.

Part 4 — Postmortem (15 minutes)

  10. The incident is resolved after 2 hours: the payment gateway had a regional outage, and you mitigated by failing over to their secondary endpoint. Write the 5 Whys analysis.
  11. Identify at least 3 action items (with priority, owner, and due date).
  12. What went well and what went poorly in your response?

Evaluation criteria:

  • Severity classification matches the impact (financial transactions, 35% error rate = SEV-1)
  • Mitigation attempts are ordered correctly (rollback first, then investigate)
  • Communication is honest, timely, and uses the templates
  • Postmortem identifies systemic factors, not individual blame
  • Action items are specific, actionable, and assigned

One-Liner Summary

Incident response is five phases practiced in advance — detect fast, triage accurately, mitigate before debugging, communicate honestly, and postmortem relentlessly — because the team that rehearses the boring playbook handles the terrifying outage.


Further Reading

"What I cannot create, I do not understand." — Richard Feynman