Incident Response

Why It Exists

Production incidents are inevitable. Every system, no matter how well-designed, will eventually fail in unexpected ways. The question is not whether you will have incidents, but how effectively you respond to them. Companies with mature incident response processes resolve incidents 3-5x faster, experience 60% less customer impact, and learn more from each failure than companies that treat incidents as chaotic, ad-hoc events.

Incident response is a discipline borrowed from emergency management (firefighting, disaster response, military operations) and adapted for software systems. The core insight is that under the stress of an active incident, people make poor decisions, communication breaks down, and actions are uncoordinated unless there is a well-practiced process to follow.

The Incident Timeline

Every incident follows a predictable lifecycle:

| Phase | Key Activity | Success Metric |
|---|---|---|
| Detection | Alert fires or customer reports | MTTD (Mean Time to Detect) |
| Triage | Classify severity, assess impact | Time to severity assignment |
| Mobilization | Assemble the right people | Time to first responder |
| Investigation | Diagnose root cause | Time to root cause identification |
| Mitigation | Stop the bleeding | MTTM (Mean Time to Mitigate) |
| Resolution | Fully resolve and verify | MTTR (Mean Time to Resolve) |
| Post-Incident | Learn and improve | Postmortem quality score |

The Cost of Incidents

$$\text{Incident Cost} = \text{Revenue Impact} + \text{Engineering Cost} + \text{Customer Cost} + \text{Reputation Cost}$$

For a $100M ARR SaaS company:

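| Component | P0 (1 hour) | P1 (4 hours) | P2 (8 hours) |
|---|---|---|---|
| Revenue loss | $11,400/hr | $11,400/hr | $2,850/hr |
| Engineering (5 people) | $750/hr | $500/hr | $250/hr |
| Customer compensation | $5,000 | $2,000 | $500 |
| Reputation/churn | $50,000 | $10,000 | $2,000 |
| Total | $67,150 | $59,600 | $27,300 |

Rows quoted per hour are multiplied by the incident duration; customer compensation and reputation/churn are flat per-incident amounts.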
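The arithmetic behind the table can be sketched as follows. This is a minimal illustration, assuming hourly rows scale with duration and the other rows are flat; the `IncidentCostInputs` type and `incidentCost` function are hypothetical names introduced here, not part of any tool.

```typescript
// Hypothetical shape for one severity column of the cost table.
interface IncidentCostInputs {
  durationHours: number;
  revenueLossPerHour: number;     // hourly rate
  engineeringCostPerHour: number; // hourly rate for the whole response team
  customerCompensation: number;   // flat amount per incident
  reputationCost: number;         // flat amount per incident
}

// Hourly components scale with duration; flat components are added once.
function incidentCost(c: IncidentCostInputs): number {
  const hourly = (c.revenueLossPerHour + c.engineeringCostPerHour) * c.durationHours;
  return hourly + c.customerCompensation + c.reputationCost;
}

// Values taken from the P0 column of the table.
const p0Inputs: IncidentCostInputs = {
  durationHours: 1,
  revenueLossPerHour: 11_400,
  engineeringCostPerHour: 750,
  customerCompensation: 5_000,
  reputationCost: 50_000,
};

console.log(incidentCost(p0Inputs)); // 67150
```

Separating hourly from flat components makes it easy to see which part of the cost shrinks when incidents are resolved faster.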

Investing $200K/year in incident response maturity (tooling, training, practice) can save $500K-2M in incident costs.

First Principles

The Three Priorities of Incident Response

In order of importance:

  1. Protect people: Safety of employees, users, and the public
  2. Preserve the system: Minimize damage, prevent cascading failures
  3. Restore service: Return to normal operation

This ordering matters. You never sacrifice safety for speed. You never make the system worse to appear to be doing something.

Incident Response as a Feedback Loop

Each incident should make the system more resilient. If you're having the same type of incident repeatedly, the feedback loop is broken.
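One way to notice a broken feedback loop is to group closed incidents by cause and flag any cause that keeps recurring. The sketch below assumes each incident record carries a `rootCauseCategory` label assigned during the postmortem; that field, the `ClosedIncident` type, and the threshold of 2 are illustrative assumptions.

```typescript
// Hypothetical record: the category would come from the postmortem.
interface ClosedIncident {
  id: string;
  rootCauseCategory: string;
}

// Returns every category seen at least `threshold` times, with the incident IDs.
function repeatedCategories(
  incidents: ClosedIncident[],
  threshold = 2
): Map<string, string[]> {
  const byCategory = new Map<string, string[]>();
  for (const incident of incidents) {
    const ids = byCategory.get(incident.rootCauseCategory) ?? [];
    ids.push(incident.id);
    byCategory.set(incident.rootCauseCategory, ids);
  }

  const repeats = new Map<string, string[]>();
  byCategory.forEach((ids, category) => {
    if (ids.length >= threshold) repeats.set(category, ids);
  });
  return repeats;
}

// Illustrative data only.
const history: ClosedIncident[] = [
  { id: 'INC-1', rootCauseCategory: 'expired-certificate' },
  { id: 'INC-2', rootCauseCategory: 'bad-config-push' },
  { id: 'INC-3', rootCauseCategory: 'expired-certificate' },
];

console.log(repeatedCategories(history)); // Map with one key: 'expired-certificate'
```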

The Incident Command System (ICS)

Adapted from emergency management, the Incident Command System defines clear roles during an incident:

| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, controls communication | Senior engineer or SRE |
| Operations Lead | Executes technical investigation and remediation | Subject matter expert |
| Communications Lead | Internal updates, customer comms, status page | Product or ops person |
| Scribe | Documents timeline, actions, decisions | Anyone available |

The IC does NOT debug the issue. They coordinate people who debug the issue. This is the most common mistake: the most senior engineer becomes IC and then gets sucked into debugging, losing oversight of the overall response.
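The separation between coordinating and debugging can be checked mechanically when roles are assigned. The following is a small sketch, assuming roles are recorded as plain name strings; `IncidentRoles` and `checkRoles` are hypothetical names used only for illustration.

```typescript
// Hypothetical role record mirroring the ICS roles in the table above.
interface IncidentRoles {
  commander: string;
  operationsLead?: string;
  communicationsLead?: string;
  scribe?: string;
}

// Returns human-readable warnings; an empty array means no conflicts were found.
function checkRoles(roles: IncidentRoles): string[] {
  const warnings: string[] = [];

  if (!roles.operationsLead) {
    warnings.push('No Operations Lead assigned: the IC may be pulled into debugging.');
  } else if (roles.operationsLead === roles.commander) {
    warnings.push('IC is also Operations Lead: nobody is left to coordinate.');
  }

  if (!roles.communicationsLead) {
    warnings.push('No Communications Lead assigned: stakeholders may be left without updates.');
  }

  return warnings;
}

console.log(checkRoles({ commander: 'alice', operationsLead: 'alice', communicationsLead: 'carol' }));
// prints one warning about the IC doubling as Operations Lead
```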

Core Mechanics

Incident Severity Classification

Incident Response Workflow

Implementation

Incident Management System

typescript
interface Incident {
  id: string;
  title: string;
  severity: 'SEV-0' | 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  status: 'detected' | 'investigating' | 'mitigating' | 'monitoring' | 'resolved';
  commander: string;
  operationsLead?: string;
  communicationsLead?: string;
  affectedServices: string[];
  detectedAt: Date;
  acknowledgedAt?: Date;
  mitigatedAt?: Date;
  resolvedAt?: Date;
  timeline: TimelineEntry[];
  customerImpact: string;
  internalImpact: string;
  postmortemUrl?: string;
}

interface TimelineEntry {
  timestamp: Date;
  author: string;
  type: 'detection' | 'action' | 'decision' | 'communication' | 'escalation' | 'resolution';
  content: string;
}

class IncidentManager {
  private incidents: Map<string, Incident> = new Map();

  createIncident(params: {
    title: string;
    severity: Incident['severity'];
    commander: string;
    affectedServices: string[];
    customerImpact: string;
  }): Incident {
    const id = `INC-${Date.now().toString(36).toUpperCase()}`;

    const incident: Incident = {
      id,
      title: params.title,
      severity: params.severity,
      status: 'detected',
      commander: params.commander,
      affectedServices: params.affectedServices,
      detectedAt: new Date(),
      timeline: [
        {
          timestamp: new Date(),
          author: 'system',
          type: 'detection',
          content: `Incident created: ${params.title}`,
        },
      ],
      customerImpact: params.customerImpact,
      internalImpact: '',
    };

    this.incidents.set(id, incident);

    // Auto-escalation based on severity
    if (params.severity === 'SEV-0' || params.severity === 'SEV-1') {
      this.openWarRoom(incident);
      this.notifyStakeholders(incident);
    }

    return incident;
  }

  addTimelineEntry(
    incidentId: string,
    entry: Omit<TimelineEntry, 'timestamp'>
  ): void {
    const incident = this.incidents.get(incidentId);
    if (!incident) throw new Error(`Incident ${incidentId} not found`);

    incident.timeline.push({
      ...entry,
      timestamp: new Date(),
    });
  }

  updateStatus(
    incidentId: string,
    status: Incident['status'],
    updatedBy: string
  ): void {
    const incident = this.incidents.get(incidentId);
    if (!incident) throw new Error(`Incident ${incidentId} not found`);

    incident.status = status;

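    // Stamp each milestone once, on the first transition that implies it.
    // Assumption: moving to 'investigating' counts as acknowledgement.
    if (status === 'investigating' && !incident.acknowledgedAt) {
      incident.acknowledgedAt = new Date();
    }
    // 'mitigating' means mitigation is still in progress, so the mitigation
    // time is recorded when the incident first reaches 'monitoring' or 'resolved'.
    if ((status === 'monitoring' || status === 'resolved') && !incident.mitigatedAt) {
      incident.mitigatedAt = new Date();
    }
    if (status === 'resolved' && !incident.resolvedAt) {
      incident.resolvedAt = new Date();
    }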

    this.addTimelineEntry(incidentId, {
      author: updatedBy,
      type: 'action',
      content: `Status changed to ${status}`,
    });
  }

  getMetrics(incidentId: string): {
    mttd: number;
    mtta: number;
    mttm: number;
    mttr: number;
  } {
    const incident = this.incidents.get(incidentId);
    if (!incident) throw new Error(`Incident ${incidentId} not found`);

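    const detected = incident.detectedAt.getTime();
    // An unacknowledged incident falls back to the detection time (mtta = 0).
    const acknowledged = incident.acknowledgedAt?.getTime() ?? detected;
    // Open incidents fall back to "now", so these values are elapsed time so far.
    const mitigated = incident.mitigatedAt?.getTime() ?? Date.now();
    const resolved = incident.resolvedAt?.getTime() ?? Date.now();

    // All values are minutes for this single incident; the "mean" figures
    // come from averaging them across incidents.
    return {
      mttd: 0, // Time from actual failure to detection - requires external signal
      mtta: (acknowledged - detected) / 60_000,
      mttm: (mitigated - detected) / 60_000,
      mttr: (resolved - detected) / 60_000,
    };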
  }

  private openWarRoom(incident: Incident): void {
    console.log(`Opening war room for ${incident.id}: ${incident.title}`);
    // Create Slack channel, Zoom bridge, etc.
  }

  private notifyStakeholders(incident: Incident): void {
    console.log(`Notifying stakeholders for ${incident.severity} incident`);
    // Page on-call, notify management, update status page
  }
}

Edge Cases and Failure Modes

Common Incident Response Failures

  1. Everyone debugs, nobody coordinates: Without an IC, five engineers investigate independently, duplicate work, and conflict.
  2. Communication blackout: Internal stakeholders and customers are left in the dark, escalating anxiety and pressure.
  3. Premature root cause declaration: "It's the database!" - team spends 30 minutes on the database while the real cause is a config change.
  4. Hero culture: One senior engineer fixes everything, creating a single point of failure and preventing others from learning.
  5. Blame culture: Fear of blame prevents honest postmortems, which prevents systemic improvements.

Performance Characteristics

Industry Benchmarks

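| Metric | Top Quartile | Median | Bottom Quartile |
|---|---|---|---|
| MTTD | 2 min | 15 min | 60+ min |
| MTTA | 3 min | 10 min | 30+ min |
| MTTM | 15 min | 60 min | 4+ hours |
| MTTR | 30 min | 4 hours | 24+ hours |
| Postmortem completion | 100% for SEV-0/1 | 60% | < 20% |
| Action item completion | 90% | 50% | < 25% |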

Mathematical Foundations

Incident Frequency Model

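Incidents arrive as a Poisson process with rate $\lambda$:

$$P(k \text{ incidents in time } t) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$$

For a service with an average of 2 incidents per month:

$$P(0 \text{ incidents in a month}) = e^{-2} \approx 13.5\%$$

$$P(\geq 5 \text{ incidents in a month}) = 1 - \sum_{k=0}^{4} \frac{2^k e^{-2}}{k!} \approx 5.3\%$$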
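The two figures can be reproduced with a few lines of code. This sketch assumes the Poisson model stated above with a rate of 2 incidents per month; note that the 5.3% figure is the probability of five or more incidents, obtained by subtracting the probabilities of 0 through 4 from 1. The function names are illustrative.

```typescript
// Poisson probability of exactly k events, given rate `lambda` per unit time over `t` units.
function poissonPmf(k: number, lambda: number, t = 1): number {
  const mean = lambda * t;
  let factorial = 1;
  for (let i = 2; i <= k; i++) factorial *= i;
  return (Math.pow(mean, k) * Math.exp(-mean)) / factorial;
}

// Probability of at least k events: 1 minus the probability of 0..k-1 events.
function poissonAtLeast(k: number, lambda: number, t = 1): number {
  let below = 0;
  for (let i = 0; i < k; i++) below += poissonPmf(i, lambda, t);
  return 1 - below;
}

console.log((poissonPmf(0, 2) * 100).toFixed(1));     // "13.5"
console.log((poissonAtLeast(5, 2) * 100).toFixed(1)); // "5.3"
```

A quiet month and a month with five or more incidents are both unremarkable at this rate, so neither one on its own says much about whether reliability has changed.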

MTTR Decomposition

$$\text{MTTR} = \text{MTTD} + \text{MTTE} + \text{MTTF} + \text{MTTV}$$

Where:

  • MTTD = Mean Time to Detect
  • MTTE = Mean Time to Engage (triage + mobilize)
  • MTTF = Mean Time to Fix (investigate + mitigate)
  • MTTV = Mean Time to Verify (confirm resolution)

Each component is an independent optimization target.
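Because the components add up, the largest one shows where improvement effort is likely to pay off first. The sketch below assumes all four components are measured in minutes; the `MttrComponents` type and the sample numbers are illustrative, not measured data.

```typescript
// The four components from the decomposition, in minutes.
interface MttrComponents {
  mttd: number; // detect
  mtte: number; // engage (triage + mobilize)
  mttf: number; // fix (investigate + mitigate)
  mttv: number; // verify
}

function totalMttr(c: MttrComponents): number {
  return c.mttd + c.mtte + c.mttf + c.mttv;
}

// Returns the name of the component contributing the most time.
function largestComponent(c: MttrComponents): keyof MttrComponents {
  const names: (keyof MttrComponents)[] = ['mttd', 'mtte', 'mttf', 'mttv'];
  let largest = names[0];
  for (const name of names) {
    if (c[name] > c[largest]) largest = name;
  }
  return largest;
}

// Illustrative values only.
const sample: MttrComponents = { mttd: 15, mtte: 10, mttf: 60, mttv: 5 };

console.log(totalMttr(sample));        // 90
console.log(largestComponent(sample)); // "mttf"
```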

Real-World War Stories

The Incident With No Incident Commander (2022)

A SaaS company had a SEV-1 outage affecting 30% of users. Six engineers joined the war room at once, and each started investigating their own theory. Three of them ran queries against the production database at the same time, adding load to it. Nobody was communicating with customers. After 45 minutes, someone noticed that two engineers had independently found the same root cause but hadn't shared it. Once they coordinated, the fix took 5 minutes.

Total incident duration: 55 minutes. Estimated duration with an IC: 20 minutes. The 35-minute difference cost $40,000 in revenue.

Lesson: Always designate an IC, even for "small" incidents. The IC's job is to prevent exactly this kind of uncoordinated chaos.

Decision Framework

Incident Response Maturity Model

| Level | Characteristics | MTTR (P0) | Postmortem Rate |
|---|---|---|---|
| 1 - Ad Hoc | No process, heroic individuals | 4+ hours | < 10% |
| 2 - Defined | Written process, basic roles | 2-4 hours | 50% |
| 3 - Practiced | Regular drills, clear roles, runbooks | 1-2 hours | 80% |
| 4 - Measured | SLOs, metrics dashboards, feedback loops | 30-60 min | 95% |
| 5 - Optimized | Automated detection/mitigation, chaos engineering | < 30 min | 100% |
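As a rough self-assessment, the MTTR column can be turned into a lookup. This is only a sketch: it uses P0 MTTR alone as a proxy, whereas the model also depends on process characteristics and postmortem rate, and it assumes half-open ranges so that a boundary value (for example exactly 60 minutes) falls into the slower band. The function name is hypothetical.

```typescript
// Maps a P0 MTTR in minutes to the maturity level whose MTTR range contains it.
// Assumption: ranges are half-open, [lower, upper), so 30 -> level 4 and 240 -> level 1.
function maturityLevelFromMttr(p0MttrMinutes: number): 1 | 2 | 3 | 4 | 5 {
  if (p0MttrMinutes < 30) return 5;  // < 30 min
  if (p0MttrMinutes < 60) return 4;  // 30-60 min
  if (p0MttrMinutes < 120) return 3; // 1-2 hours
  if (p0MttrMinutes < 240) return 2; // 2-4 hours
  return 1;                          // 4+ hours
}

console.log(maturityLevelFromMttr(45));  // 4
console.log(maturityLevelFromMttr(300)); // 1
```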

"What I cannot create, I do not understand." — Richard Feynman