On-Call Handbook

Being on-call is one of the most consequential responsibilities in engineering. When a production system breaks at 3 AM, the on-call engineer is the first line of defense, and often the difference between a minor blip and a catastrophic outage. Yet most teams put engineers into on-call rotations with minimal preparation: no clear escalation paths, no severity definitions, no runbooks, and no support structure. The result is burned-out engineers, slow incident response, and recurring outages. This handbook defines what good on-call looks like: clear responsibilities, structured incident response, blameless postmortems, and sustainable rotation strategies that protect both the system and the people who run it.

On-Call Responsibilities

What On-Call Means

Being on-call means you are the first responder for production issues affecting your service or domain. Specifically:

| Responsibility | Details |
|----------------|---------|
| Acknowledge alerts | Respond to pages within the SLA (typically 5-15 minutes) |
| Triage incidents | Determine severity, scope, and initial impact |
| Mitigate | Take immediate action to reduce customer impact (rollback, scale up, disable feature) |
| Escalate | Pull in additional help when needed; do not be a hero |
| Communicate | Keep stakeholders informed via the incident channel |
| Document | Record actions taken during the incident for the postmortem |
| Hand off | Brief the next on-call engineer at rotation boundaries |

What On-Call Does NOT Mean

  • You are NOT expected to fix every issue yourself
  • You are NOT expected to write code during an incident (mitigation first, fix later)
  • You are NOT expected to be chained to your laptop — you need to be reachable and able to respond within the SLA
  • You are NOT responsible for problems caused by other teams — but you ARE responsible for escalating to them

The First Rule of On-Call

Mitigate first, investigate second. The goal during an incident is to restore service, not to find the root cause. Roll back the deploy, scale up the cluster, toggle the feature flag. Root cause analysis happens in the postmortem, not at 3 AM.

Incident Response Process

Severity Levels

Every incident must be classified by severity. The severity determines the response urgency, communication requirements, and escalation path:

| Severity | Name | Definition | Response SLA | Example |
|----------|------|------------|--------------|---------|
| SEV-1 | Critical | Complete service outage; data loss or corruption; security breach | 5 minutes acknowledge, 15 minutes response | Production database down, payment system broken, data breach |
| SEV-2 | Major | Significant degradation; major feature unavailable; large subset of users affected | 15 minutes acknowledge, 30 minutes response | Search is down, checkout is intermittently failing, API latency > 10x normal |
| SEV-3 | Minor | Minor degradation; workaround available; small subset of users affected | 30 minutes acknowledge, 2 hours response | Email notifications delayed, non-critical admin feature broken |
| SEV-4 | Low | Cosmetic issue; no user impact; monitoring alert that needs attention | Next business day | Warning alert on disk usage trending up, non-production environment issue |
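
The severity table can also be kept as data so that tooling can check acknowledgement times against it. The following is a minimal Python sketch, assuming the SLA values from the table above; the names `SeveritySla`, `SEVERITY_SLAS`, and `ack_within_sla` are hypothetical, and SEV-4 is modeled as having no minute-level SLA because its target is "next business day".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeveritySla:
    name: str
    ack_minutes: Optional[int]       # None means "next business day"
    response_minutes: Optional[int]  # None means "next business day"


# Values copied from the severity table; adjust to your own SLAs.
SEVERITY_SLAS = {
    "SEV-1": SeveritySla("Critical", 5, 15),
    "SEV-2": SeveritySla("Major", 15, 30),
    "SEV-3": SeveritySla("Minor", 30, 120),
    "SEV-4": SeveritySla("Low", None, None),
}


def ack_within_sla(severity: str, minutes_to_ack: float) -> bool:
    """Return True if the acknowledgement time meets the SLA for this severity."""
    sla = SEVERITY_SLAS[severity]
    if sla.ack_minutes is None:
        return True  # no minute-level SLA to violate
    return minutes_to_ack <= sla.ack_minutes
```

For example, `ack_within_sla("SEV-2", 12)` is true under these values, while a 6-minute acknowledgement of a SEV-1 is not.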

Incident Response Flow

Incident Roles

For SEV-1 and SEV-2 incidents, assign explicit roles:

| Role | Responsibility | Who |
|------|----------------|-----|
| Incident Commander (IC) | Owns the incident. Coordinates responders, makes decisions, controls the pace. Does NOT debug; they orchestrate. | Senior on-call or engineering manager |
| Communications Lead | Posts status updates to stakeholders, customers, status page. Shields responders from questions. | Product manager, engineering manager, or designated engineer |
| Subject Matter Expert (SME) | Debugs and mitigates the specific technical issue. The person with the deepest knowledge of the affected system. | Domain expert for the affected service |
| Scribe | Records timeline, actions taken, decisions made. This becomes the foundation for the postmortem. | Any available engineer |

The IC Does Not Debug

The most common incident anti-pattern is having the IC also be the primary debugger. The IC's job is to maintain situational awareness, coordinate multiple responders, and make escalation decisions. If the IC is deep in logs, nobody is coordinating.

Escalation Paths

When to Escalate

Escalate when:

  • You have been working on the issue for 15 minutes without progress
  • The issue is outside your area of expertise
  • The impact is growing (more users affected, more services failing)
  • You need to make a decision with business impact (e.g., taking down a feature to protect the database)
  • You are unsure of the severity
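
The escalation triggers above can be written down as an explicit checklist, which makes them harder to rationalize away in the middle of an incident. This is a rough Python sketch, not a prescribed tool; `IncidentState` and `escalation_reasons` are hypothetical names, and the 15-minute stall limit is taken from the first bullet.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    minutes_without_progress: int
    within_my_expertise: bool
    impact_growing: bool
    needs_business_decision: bool
    severity_is_clear: bool


def escalation_reasons(state: IncidentState, stall_limit_minutes: int = 15) -> List[str]:
    """Return every reason to escalate; an empty list means keep working."""
    reasons = []
    if state.minutes_without_progress >= stall_limit_minutes:
        reasons.append("no progress within the stall limit")
    if not state.within_my_expertise:
        reasons.append("outside area of expertise")
    if state.impact_growing:
        reasons.append("impact is growing")
    if state.needs_business_decision:
        reasons.append("decision with business impact required")
    if not state.severity_is_clear:
        reasons.append("severity is unclear")
    return reasons
```

Any non-empty result is a signal to page the next person in the escalation matrix.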

Escalation Matrix

markdown
## Escalation Contacts (Keep Updated!)

### Infrastructure
- Primary: @alice (Slack) / +1-555-0101
- Secondary: @bob (Slack) / +1-555-0102
- Manager: @charlie (Slack) / +1-555-0103

### Database
- Primary: @dave (Slack) / +1-555-0201
- Secondary: @eve (Slack) / +1-555-0202
- DBA on-call: PagerDuty "Database" service

### Security (always escalate immediately for security incidents)
- Security on-call: PagerDuty "Security" service
- CISO: @frank (Slack) / +1-555-0301

### Executive (SEV-1 only)
- VP Engineering: @grace (Slack) / +1-555-0401
- CTO: @heidi (Slack) / +1-555-0501

Escalation is NOT Failure

One of the most damaging on-call cultural norms is treating escalation as weakness. Escalation is the correct response when you need help. An engineer who spends 2 hours struggling alone when a 5-minute escalation would have resolved the issue has cost the company 2 hours of downtime.

Communication Templates

Incident Channel Opening

markdown
:rotating_light: **INCIDENT — SEV-2: Checkout API returning 500 errors**

**Impact:** ~30% of checkout attempts are failing. Revenue impact estimated.
**Start time:** 2026-03-20 14:32 UTC
**IC:** @alice
**Comms:** @bob
**Channel:** #inc-20260320-checkout-500s

**Current status:** Investigating. Initial evidence suggests recent deploy
(v2.14.3, deployed at 14:15 UTC) may be the cause. Evaluating rollback.

Next update in 15 minutes.
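
Teams that open incidents from a bot or script can render this opening message from structured fields. The sketch below is illustrative only: `incident_channel_name` and `render_opening` are hypothetical helpers, the channel naming scheme is inferred from the example channel above, and posting to a chat tool is left out.

```python
from datetime import datetime, timezone


def incident_channel_name(start: datetime, slug: str) -> str:
    """Build a channel name like #inc-20260320-checkout-500s."""
    return f"#inc-{start:%Y%m%d}-{slug}"


def render_opening(severity: str, title: str, impact: str, start: datetime,
                   ic: str, comms: str, slug: str, status: str,
                   next_update_minutes: int = 15) -> str:
    """Render the opening message in the shape of the template above."""
    lines = [
        # \u2014 is the dash character used in the template header.
        f":rotating_light: **INCIDENT \u2014 {severity}: {title}**",
        "",
        f"**Impact:** {impact}",
        f"**Start time:** {start:%Y-%m-%d %H:%M} UTC",
        f"**IC:** {ic}",
        f"**Comms:** {comms}",
        f"**Channel:** {incident_channel_name(start, slug)}",
        "",
        f"**Current status:** {status}",
        "",
        f"Next update in {next_update_minutes} minutes.",
    ]
    return "\n".join(lines)


message = render_opening(
    severity="SEV-2",
    title="Checkout API returning 500 errors",
    impact="~30% of checkout attempts are failing. Revenue impact estimated.",
    start=datetime(2026, 3, 20, 14, 32, tzinfo=timezone.utc),
    ic="@alice",
    comms="@bob",
    slug="checkout-500s",
    status="Investigating. Evaluating rollback.",
)
print(message)
```

Keeping the fields explicit means the status update and resolution messages can reuse the same incident record instead of being retyped under pressure.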

Status Update

markdown
:hourglass: **UPDATE — SEV-2: Checkout API returning 500 errors**
**Time:** 2026-03-20 14:47 UTC

**What we know:**
- Root cause identified: v2.14.3 introduced a query that deadlocks under
  concurrent checkout requests
- Rollback to v2.14.2 initiated at 14:42 UTC
- Rollback deployment is in progress (ETA: 5 minutes)

**What we're doing:**
- Monitoring error rates during rollback
- Scaling up checkout service replicas to handle queued requests

**Customer impact:** Error rate has decreased from 30% to 15% as old pods
are replaced. Full recovery expected within 10 minutes.

Next update in 15 minutes or when resolved.

Resolution Message

markdown
:white_check_mark: **RESOLVED — SEV-2: Checkout API returning 500 errors**
**Time:** 2026-03-20 15:02 UTC
**Duration:** 30 minutes
**Resolution:** Rolled back to v2.14.2

**Impact summary:**
- Duration: 14:32 - 15:02 UTC (30 minutes)
- ~450 checkout attempts failed (estimated $12,000 in delayed revenue)
- No data loss or corruption

**Next steps:**
- Postmortem scheduled for 2026-03-22 10:00 UTC
- Fix for the deadlock query is in review: PR #1234
- Affected customers will receive email notification

CC: @vp-engineering @product-lead

Post-Incident Review (Blameless Postmortem)

The Blameless Principle

The most important principle: blame the system, not the person. If a human made an error, the question is not "why did they do that?" but "why did the system make it easy to make that error?" Human error is a symptom. The root cause is always a system that allowed the error to have impact.

Postmortem Template

markdown
# Postmortem: [Incident Title]

**Date of incident:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-N
**Author:** [Name]
**Postmortem date:** YYYY-MM-DD (within 48 hours of resolution)
**Attendees:** [Names]

## Summary

2-3 sentences: what happened, how long it lasted, what the impact was.

## Timeline (all times UTC)

| Time | Event |
|------|-------|
| 14:15 | Deploy v2.14.3 rolls out to production |
| 14:32 | Alert fires: checkout API error rate > 5% |
| 14:34 | On-call acknowledges alert |
| 14:38 | Incident declared SEV-2, channel opened |
| 14:42 | Root cause identified: deadlocking query in new code |
| 14:42 | Rollback to v2.14.2 initiated |
| 14:52 | Rollback complete, old pods serving traffic |
| 15:02 | Error rate returns to baseline, incident resolved |

## Root Cause

The new endpoint introduced in v2.14.3 executed a SELECT ... FOR UPDATE
query on the orders table without a consistent ordering clause. Under
concurrent access (which never occurred in staging due to low traffic),
two transactions would lock rows in opposite order, causing a deadlock.
PostgreSQL detected the deadlock and killed one transaction, returning a
500 error to the client.

## Impact

- ~450 checkout requests failed over 30 minutes
- Estimated $12,000 in delayed revenue (customers retried after resolution)
- No data loss or corruption
- No SLA breach (SLA allows 30 minutes of degradation per month)

## What Went Well

- Alert fired within 2 minutes of error rate increase
- On-call acknowledged within 2 minutes
- Root cause identified within 10 minutes
- Rollback was clean and fast (10 minutes)
- Communication was clear and timely

## What Went Wrong

- The deadlock scenario was not caught in code review
- Staging environment does not simulate concurrent traffic
- No automated canary deployment — the deploy went to 100% of pods
  simultaneously

## Action Items

| Action | Owner | Priority | Deadline | Status |
|--------|-------|----------|----------|--------|
| Fix deadlocking query with consistent ORDER BY | @dave | P1 | 2026-03-23 | In review |
| Add concurrent checkout load test to CI | @eve | P1 | 2026-03-27 | Not started |
| Implement canary deployments (10% → 50% → 100%) | @alice | P2 | 2026-04-15 | Not started |
| Add deadlock detection alert to database monitoring | @bob | P2 | 2026-03-30 | Not started |

## Lessons Learned

1. Code review cannot reliably catch concurrency bugs — automated
   concurrent testing is required
2. Staging environments must simulate production traffic patterns
3. Canary deployments limit blast radius — if only 10% of pods had
   the new code, only ~45 requests would have failed
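
The timeline in the template is enough to derive basic response metrics for the postmortem. A small Python sketch, assuming all events fall on the same UTC day; the event keys and the `minutes_between` helper are hypothetical names, and the times are the ones from the example timeline.

```python
from datetime import datetime

# Times (UTC) from the example timeline in the template.
TIMELINE = {
    "deploy": "14:15",
    "alert": "14:32",
    "acknowledged": "14:34",
    "declared": "14:38",
    "rollback_started": "14:42",
    "rollback_complete": "14:52",
    "resolved": "15:02",
}


def minutes_between(start_event: str, end_event: str) -> int:
    """Whole minutes between two timeline events on the same day."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE[start_event], fmt)
    end = datetime.strptime(TIMELINE[end_event], fmt)
    return int((end - start).total_seconds() // 60)


metrics = {
    "time_to_acknowledge": minutes_between("alert", "acknowledged"),
    "time_to_start_mitigation": minutes_between("alert", "rollback_started"),
    "rollback_duration": minutes_between("rollback_started", "rollback_complete"),
    "incident_duration": minutes_between("alert", "resolved"),
}
print(metrics)
```

These figures line up with the "What Went Well" section of the example: 2 minutes to acknowledge, 10 minutes to start the rollback, a 10-minute rollback, and a 30-minute incident.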

Postmortem Meeting Ground Rules

  1. No blame. Statements like "if X had just..." are not allowed
  2. Focus on systems. "How can we prevent this?" not "Who caused this?"
  3. Be specific about action items. Every action item has an owner, priority, and deadline
  4. Follow up. Track action items to completion. An unfinished postmortem action item is a future incident

Managing On-Call Health

Rotation Strategies

| Strategy | Structure | Best For |
|----------|-----------|----------|
| Weekly rotation | One engineer per week, 24/7 | Teams of 5+ engineers |
| Split rotation | Business hours primary + after-hours secondary | Geographically distributed teams |
| Follow-the-sun | Hand off to the next timezone at end of business day | Global teams (US → EU → APAC) |
| Primary + Secondary | Primary handles all alerts; secondary is backup if primary is unreachable | All team sizes |

Minimum Rotation Size

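The minimum sustainable on-call rotation is 5 engineers. With fewer than 5, each engineer spends too many weeks per year on call:

$$\text{On-call frequency} = \frac{1}{\text{team size}} \times 52 \text{ weeks/year}$$

| Team Size | Weeks On-Call per Year | Sustainable? |
|-----------|------------------------|--------------|
| 3 | 17.3 weeks | No (burnout guaranteed) |
| 4 | 13 weeks | Barely (no room for vacation) |
| 5 | 10.4 weeks | Minimum viable |
| 6 | 8.7 weeks | Good |
| 8 | 6.5 weeks | Ideal |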
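
The on-call frequency figures are a one-line calculation, shown here as a short Python sketch; `weeks_on_call_per_year` is a hypothetical helper, and it assumes a single-person weekly rotation with no secondary.

```python
def weeks_on_call_per_year(team_size: int, weeks_per_year: int = 52) -> float:
    """Weeks each engineer spends on call per year in an even weekly rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return round(weeks_per_year / team_size, 1)


for size in (3, 4, 5, 6, 8):
    print(size, weeks_on_call_per_year(size))
```

Vacations and sick leave push the real number higher for everyone else, which is why 4 engineers leaves no slack.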

On-Call Compensation

On-call work must be compensated. Common models:

| Model | Description | Typical Amount |
|-------|-------------|----------------|
| Flat stipend | Fixed amount per on-call shift | $200-500/week |
| Per-page payment | Additional payment per incident responded to | $50-200/page |
| Comp time | Time off equal to time spent on incidents | 1:1 or 1.5:1 ratio |
| Hybrid | Flat stipend + per-page for incidents outside business hours | Varies |

On-Call Without Compensation Is a Retention Problem

Engineers who are paged at 3 AM without compensation will leave. Even if the company culture normalizes uncompensated on-call, the best engineers will move to companies that value their time. Budget for on-call compensation or accept the turnover cost.

Reducing On-Call Burden

The goal is not to make on-call more bearable — it is to reduce the number of pages:

Target: fewer than 2 actionable pages per weekly on-call shift. If your team is getting more than 2 actionable pages per week, the on-call rotation is not sustainable. Invest in reliability improvements before adding more engineers to the rotation.
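
Page volume is easy to track if each actionable page is recorded with its date. The sketch below groups pages by ISO week and flags weeks above the limit; `pages_per_week` and `weeks_over_limit` are hypothetical names, the sample dates are made up for illustration, and the default limit of 2 mirrors the "more than 2 actionable pages per week" threshold above.

```python
from collections import Counter
from datetime import date
from typing import List, Tuple


def pages_per_week(page_dates: List[date]) -> Counter:
    """Count pages per (ISO year, ISO week)."""
    return Counter(tuple(d.isocalendar()[:2]) for d in page_dates)


def weeks_over_limit(page_dates: List[date], limit: int = 2) -> List[Tuple[int, int]]:
    """Return the weeks with more than `limit` actionable pages, in order."""
    counts = pages_per_week(page_dates)
    return sorted(week for week, count in counts.items() if count > limit)


# Illustrative data: three pages in one week, one page in the next.
pages = [date(2026, 3, 16), date(2026, 3, 17), date(2026, 3, 18), date(2026, 3, 24)]
print(weeks_over_limit(pages))
```

A recurring entry in this list is the cue to prioritize reliability work over feature work.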

On-Call Handoff

At the end of every rotation, the outgoing engineer briefs the incoming engineer:

markdown
## On-Call Handoff: 2026-03-13 → 2026-03-20

### Active Issues
- **Checkout latency elevated** (P2): p99 is 800ms (target: 200ms).
  Root cause is slow inventory queries. @dave is working on an index fix.
  Expected resolution: March 21. No action needed unless it crosses 2000ms.

### Recent Incidents
- **SEV-3 on March 15**: Redis primary failover caused 2 minutes of
  elevated errors. Sentinel handled it. No action needed.

### Upcoming Risks
- **v2.15.0 deploy** scheduled for March 21. Contains migration #1234.
  Rollback plan is in the deploy ticket. Watch for elevated error rates
  for 30 minutes post-deploy.

### Environment Notes
- Staging is currently broken (disk full). Ticket #5678 filed.
- New Grafana dashboard for checkout: [link]

PagerDuty / OpsGenie Integration

PagerDuty Configuration

hcl
# Terraform configuration for PagerDuty.
# The pagerduty_schedule and pagerduty_user resources referenced below are
# assumed to be defined elsewhere in the same configuration.
resource "pagerduty_service" "checkout" {
  name                    = "Checkout Service"
  escalation_policy       = pagerduty_escalation_policy.checkout.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 900   # 15 minutes; re-trigger if acknowledged but not resolved

  # Always high urgency for a critical service. A "use_support_hours" rule
  # also requires a support_hours block and is only worth the extra
  # configuration when urgency differs inside and outside support hours.
  incident_urgency_rule {
    type    = "constant"
    urgency = "high"
  }
}

resource "pagerduty_escalation_policy" "checkout" {
  name      = "Checkout Escalation"
  num_loops = 2 # Repeat the rules twice more if nobody acknowledges

  # Rule 1: page the primary on-call schedule.
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.checkout_primary.id
    }
  }

  # Rule 2: page the secondary on-call schedule.
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.checkout_secondary.id
    }
  }

  # Rule 3: page the engineering manager.
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.engineering_manager.id
    }
  }
}
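
To see what this escalation policy means in wall-clock terms, the notification times can be worked out by hand or with a few lines of Python. This sketch makes two assumptions that should be checked against PagerDuty's own documentation: that the rules run once and then repeat `num_loops` more times, and that the last rule's delay is the wait before the loop restarts. `notification_schedule` is a hypothetical helper and does not call any PagerDuty API.

```python
from typing import List, Tuple


def notification_schedule(delays_minutes: List[int], num_loops: int) -> List[Tuple[int, int]]:
    """Return (minutes since trigger, rule number) for each notification,
    assuming nobody acknowledges the incident."""
    schedule = []
    elapsed = 0
    for _ in range(1 + num_loops):  # first pass plus the repeats
        for rule_number, delay in enumerate(delays_minutes, start=1):
            schedule.append((elapsed, rule_number))
            elapsed += delay
    return schedule


# Three rules with 15-minute delays and num_loops = 2, as configured above.
for minutes, rule in notification_schedule([15, 15, 15], num_loops=2):
    print(f"t+{minutes:>3} min: notify rule {rule}")
```

Under these assumptions the engineering manager is first paged 30 minutes after the incident triggers, which is worth comparing against the SEV-1 response SLA.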

Alert Routing from Prometheus

yaml
# Alertmanager configuration routing to PagerDuty
route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true  # keep evaluating sibling routes after this match

    - match:
        severity: warning
      receiver: 'pagerduty-warning'
      group_wait: 5m

receivers:
  # Fallback receiver referenced by the root route. With no notifier
  # configured, alerts that match neither severity are not sent anywhere.
  - name: 'default'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
        severity: critical
        # Notification templates see group-level data, so annotations are
        # read from .CommonAnnotations rather than .Annotations.
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          service: '{{ .GroupLabels.service }}'
          dashboard: '{{ .CommonAnnotations.dashboard }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

  - name: 'pagerduty-warning'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
        severity: warning
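
The routing tree is easier to reason about with a small model of how an alert's labels select a receiver. The Python sketch below is a simplified illustration, not Alertmanager's implementation: it supports only exact-match label conditions, and `Route` and `receivers_for` are hypothetical names. It follows the documented behavior that the first matching child route wins unless `continue` is set, and that an alert matching no child is handled by the parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)
    continue_: bool = False  # mirrors the `continue` key in the YAML
    routes: List["Route"] = field(default_factory=list)


def receivers_for(route: Route, labels: Dict[str, str]) -> List[str]:
    """Return the receivers that would be notified for these alert labels."""
    matched: List[str] = []
    for child in route.routes:
        if all(labels.get(key) == value for key, value in child.match.items()):
            matched.extend(receivers_for(child, labels))
            if not child.continue_:
                break  # stop at the first match unless continue is set
    # If no child matched, the current route handles the alert itself.
    return matched or [route.receiver]


root = Route(
    receiver="default",
    routes=[
        Route("pagerduty-critical", {"severity": "critical"}, continue_=True),
        Route("pagerduty-warning", {"severity": "warning"}),
    ],
)
print(receivers_for(root, {"severity": "critical", "service": "checkout"}))
```

Running the same function with `{"severity": "info"}` shows such alerts falling through to the `default` receiver.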

Alert Quality

Not all alerts should page. Apply this framework:

| Alert Type | Action | Pages On-Call? |
|------------|--------|----------------|
| Service is down (0% success rate) | Immediate response | Yes (SEV-1/SEV-2) |
| Error rate > 5% for 5+ minutes | Investigate and mitigate | Yes (SEV-2/SEV-3) |
| Latency p99 > SLA for 10+ minutes | Investigate | Yes (SEV-3) |
| Disk usage > 80% | Plan capacity increase | No (ticket) |
| Certificate expiring in 30 days | Renew certificate | No (ticket) |
| Non-production environment issue | Fix during business hours | No (Slack notification) |
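
Conditions such as "error rate > 5% for 5+ minutes" page only when the threshold is breached continuously, which filters out short spikes. A minimal Python sketch of that check, assuming one sample per minute with the most recent sample last; `sustained_breach` is a hypothetical helper, similar in spirit to the `for` clause of a Prometheus alerting rule.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if each of the last `minutes` per-minute samples exceeds the threshold."""
    if len(samples) < minutes:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-minutes:])


# Error rates as fractions, one sample per minute.
spike = [0.01, 0.09, 0.02, 0.01, 0.01, 0.02]
outage = [0.01, 0.06, 0.07, 0.08, 0.06, 0.09]

print(sustained_breach(spike, threshold=0.05, minutes=5))
print(sustained_breach(outage, threshold=0.05, minutes=5))
```

The one-minute spike does not qualify, while the five consecutive minutes above 5% do.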

Every Alert That Pages Must Be Actionable

If an alert pages the on-call engineer and the correct response is "ignore it" or "it will resolve itself," that alert should be deleted or converted to a non-paging notification. Alert fatigue — the desensitization caused by too many false or non-actionable alerts — is the number one killer of on-call effectiveness.

On-Call Maturity Model

| Level | Characteristics | Target |
|-------|-----------------|--------|
| Level 1: Firefighting | No runbooks, no severity levels, everyone pages everyone, no postmortems | Get to Level 2 within 3 months |
| Level 2: Reactive | Severity levels defined, basic escalation, postmortems happen but action items are not tracked | Get to Level 3 within 6 months |
| Level 3: Structured | Runbooks for common issues, postmortem action items tracked, on-call compensation, < 5 pages/week | Get to Level 4 within 12 months |
| Level 4: Proactive | Auto-remediation for known issues, canary deployments, < 2 pages/week, on-call handoffs are thorough | Maintain |
| Level 5: Preventive | Chaos engineering, game days, error budgets, on-call is boring (rare pages, fast resolution) | Aspire |
