
Production Readiness Checklists

Checklists are the cheapest reliability investment you will ever make. A Boeing 747 has roughly six million parts and carries four hundred people at 35,000 feet. It does this safely not because pilots are geniuses but because they follow checklists. Software systems are no less complex, and the humans operating them are no less fallible.

The WHO Surgical Safety Checklist, developed by a team led by Atul Gawande, cut surgical deaths by 47% in its eight-hospital pilot study (from 1.5% to 0.8% of patients). Not through new technology. Not through hiring better surgeons. Through a one-page, 19-item checklist that takes a minute or two to complete. Production readiness checklists achieve the same thing for software launches: they catch the obvious failures that smart, experienced engineers miss under pressure.

Related: Pre-Launch Checklist | Security Review Checklist | Performance Review Checklist | Observability Readiness Checklist


Why Checklists Matter

The Problem with Human Memory

Engineers are smart. They are also human. Under the pressure of a launch deadline, even senior engineers forget things:

| Failure Mode | Real-World Example | Checklist Catches It? |
|--------------|--------------------|-----------------------|
| Forgot to set up monitoring | Service runs for 3 weeks with no metrics; silent data corruption goes unnoticed | Yes: "Dashboards created and reviewed" |
| Skipped load testing | Launch day traffic overwhelms the service; 90 minutes of downtime | Yes: "Load test completed at 2x expected traffic" |
| No rollback plan | Bad deploy ships; team spends 45 minutes figuring out how to roll back | Yes: "Rollback procedure documented and tested" |
| Hardcoded secrets | API key committed to git; detected by security scanner 6 months later | Yes: "All secrets in vault/parameter store" |
| No alerting | Database fills up overnight; nobody gets paged; users cannot write data for 5 hours | Yes: "Alerts configured for disk, CPU, memory, error rate" |

The Checklist Manifesto Insight

Gawande identifies two kinds of errors: errors of ignorance (we do not know enough) and errors of ineptitude (we know enough but fail to apply what we know). Checklists target errors of ineptitude. In mature engineering organizations, most production incidents fall into that second category: the team knew it should have set up alerting and simply forgot.

The Data

Organizations that implement production readiness reviews with checklists consistently report:

| Metric | Before Checklists | After Checklists | Improvement |
|--------|-------------------|------------------|-------------|
| Launch-related incidents (first 30 days) | 4.2 per service | 0.8 per service | 81% reduction |
| Mean time to detect (MTTD) | 47 minutes | 8 minutes | 83% reduction |
| Rollback success rate | 62% | 97% | 56% increase |
| On-call escalations from new services | 3.1/week | 0.4/week | 87% reduction |
| Time spent on launch-day firefighting | 12+ hours | < 2 hours | 83% reduction |
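
The Improvement column reports relative change, so the 56% on the rollback row is measured against the 62% baseline (a gain of 35 percentage points). A minimal sketch of that arithmetic follows; `relative_change` is a hypothetical helper name, not part of any tool referenced here.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from before to after as a percentage of before."""
    if before == 0:
        raise ValueError("relative change is undefined when the baseline is zero")
    return (after - before) / before * 100.0


# Figures taken from the table above.
incidents = relative_change(4.2, 0.8)  # a negative value means a reduction
rollback = relative_change(62, 97)     # a positive value means an increase

print(f"launch incidents: {incidents:.0f}%")
print(f"rollback success: {rollback:+.0f}%")
```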

How to Use These Checklists

The Production Readiness Review (PRR) Process

Step-by-Step

  1. Start early — begin the checklist when the service design is finalized, not the week before launch
  2. Assign an owner — one person is responsible for driving the checklist to completion, even if they delegate individual items
  3. Review with a partner — a fresh pair of eyes catches things the builder assumes are done
  4. Document exceptions — if an item genuinely does not apply, document why (do not just skip it)
  5. Archive completed checklists — they become invaluable during postmortems ("did we actually verify X before launch?")
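
Several of these steps can be checked mechanically before a review is archived. The sketch below is one possible shape, not a prescribed format: the `Item` and `Review` records, the status strings, and `review_problems` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    text: str
    status: str       # "done", "not_done", or "not_applicable"
    reason: str = ""  # expected whenever status is "not_applicable"


@dataclass
class Review:
    service: str
    owner: str
    reviewer: str
    items: list[Item] = field(default_factory=list)


def review_problems(review: Review) -> list[str]:
    """Return process problems; an empty list means the review is well-formed."""
    problems = []
    if not review.owner:
        problems.append("no owner assigned")
    if not review.reviewer or review.reviewer == review.owner:
        problems.append("reviewer must be someone other than the owner")
    for item in review.items:
        if item.status == "not_applicable" and not item.reason.strip():
            problems.append(f"exception without a reason: {item.text}")
    return problems


# Illustrative data only: the owner reviewed their own work and skipped an item silently.
review = Review(
    service="search-api",
    owner="carol",
    reviewer="carol",
    items=[
        Item("Rollback procedure documented and tested", "done"),
        Item("CDN cache rules reviewed", "not_applicable"),
    ],
)
for problem in review_problems(review):
    print(problem)
```

Returning a list of problems rather than a pass/fail flag keeps the output usable as a conversation starter with the reviewer.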

Checklist Categories

This section contains four checklists. Use them together for a comprehensive production readiness review, or individually for focused reviews:

| Checklist | When to Use | Items | Time to Complete |
|-----------|-------------|-------|------------------|
| Pre-Launch Checklist | Before any new service or major feature goes to production | 50+ items | 2-4 hours (review), 1-3 weeks (remediation) |
| Security Review Checklist | Before launch and quarterly thereafter | 40+ items | 1-2 hours (review), 1-2 weeks (remediation) |
| Performance Review Checklist | Before launch and before major traffic events | 35+ items | 1-2 hours (review), 1-2 weeks (remediation) |
| Observability Readiness Checklist | Before launch and when onboarding new on-call engineers | 35+ items | 1-2 hours (review), 1-2 weeks (remediation) |

Priority Levels

Every checklist item is tagged with a priority:

| Priority | Meaning | Blocking? |
|----------|---------|-----------|
| P0 (Critical) | Must be completed before production traffic is served. Failure to complete will cause outages or security incidents. | Yes: launch blocked |
| P1 (High) | Should be completed before launch. Acceptable risk window of 1-2 weeks post-launch with a tracking ticket. | Soft block: needs exception |
| P2 (Medium) | Should be completed within 30 days of launch. Improves operational posture but not immediately dangerous. | No: tracked in backlog |
| P3 (Nice to Have) | Best practice that should be adopted eventually. | No: aspirational |
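
The blocking rules in the table translate directly into a launch gate. The following is a sketch under stated assumptions: `ChecklistItem`, `launch_decision`, the ticket ID `OPS-123`, and the P2 item text are hypothetical, and a real gate would read items from wherever your team stores its checklists.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    text: str
    priority: str               # "P0", "P1", "P2", or "P3"
    done: bool
    exception_ticket: str = ""  # hypothetical tracking-ticket ID for a P1 exception


def launch_decision(items: list[ChecklistItem]) -> tuple[bool, list[str]]:
    """Return (can_launch, notes) following the priority table."""
    can_launch = True
    notes = []
    for item in items:
        if item.done:
            continue
        if item.priority == "P0":
            # Hard block: no exception path exists for P0 items.
            can_launch = False
            notes.append(f"BLOCKED by P0: {item.text}")
        elif item.priority == "P1":
            # Soft block: allowed only with a tracking ticket on file.
            if item.exception_ticket:
                notes.append(f"P1 exception {item.exception_ticket}: {item.text}")
            else:
                can_launch = False
                notes.append(f"P1 needs an exception ticket: {item.text}")
        else:
            # P2 and P3 never block; they are recorded for the backlog.
            notes.append(f"{item.priority} backlog: {item.text}")
    return can_launch, notes


items = [
    ChecklistItem("Rollback procedure documented and tested", "P0", done=True),
    ChecklistItem("Load test completed at 2x expected traffic", "P1", done=False,
                  exception_ticket="OPS-123"),
    ChecklistItem("Dashboards linked from the runbook", "P2", done=False),
]
can_launch, notes = launch_decision(items)
print(can_launch)
for note in notes:
    print(note)
```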

Building a Checklist Culture

What Goes Wrong Without Culture

You can write the best checklists in the world and they will be ignored if the culture does not support them. Common failure modes:

The Checklist Death Spiral

The most common failure is the death spiral: an incident happens, someone adds a checklist item, the checklist grows, engineers stop reading it carefully, another incident happens, another item is added. Within a year, you have a 200-item checklist that nobody takes seriously. Keep checklists under 60 items. Ruthlessly prune items that have never caught a real issue.

Principles for Checklist Culture

  1. Checklists are collaborative, not compliance — a checklist is a conversation starter between the launch team and the platform/SRE team, not a form to fill out and submit
  2. Keep them short — if a checklist is longer than two pages, it will not be read carefully. Consolidate, simplify, or split into focused checklists
  3. Make them living documents — review checklists quarterly. Remove items that never catch issues. Add items inspired by recent postmortems
  4. Celebrate catches — when a checklist catches a real issue before production, publicize it. "The security review caught an unauthenticated endpoint" is a success story
  5. No blame for honest gaps — if an engineer completes a checklist and marks an item as "not done" with a reason, that is better than lying

The Two Types of Checklists

Gawande distinguishes two types:

| Type | Description | When to Use | Example |
|------|-------------|-------------|---------|
| DO-CONFIRM | Perform tasks from memory, then pause and confirm each item against the checklist | Experienced teams with well-understood processes | Pre-launch review for a team's fifth microservice |
| READ-DO | Read each item, then perform it, in sequence | New teams, unfamiliar processes, or high-stakes situations | First-ever production launch; disaster recovery execution |

For production readiness, start with READ-DO checklists when the team is new to the process. Transition to DO-CONFIRM as the team matures and internalizes the items.


Integrating Checklists into Your Workflow

GitHub PR Template

Create a PR template that references the appropriate checklist:

markdown
## Production Readiness

- [ ] [Pre-Launch Checklist](/devops/checklists/pre-launch) completed
- [ ] [Security Review](/devops/checklists/security-review) completed
- [ ] [Performance Review](/devops/checklists/performance-review) completed
- [ ] [Observability Readiness](/devops/checklists/observability-readiness) completed

### Exceptions

<!-- List any checklist items that were waived and why -->
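
A template only helps if someone notices the boxes that were left empty. As a rough sketch, a review bot or pre-merge script could scan the PR description for unchecked task-list items. The function name `unchecked_items` is hypothetical, and how the PR body is fetched is left out on purpose.

```python
import re

# Matches markdown task-list lines such as "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$", re.MULTILINE)


def unchecked_items(pr_body: str) -> list[str]:
    """Return the text of every task-list item that is still unchecked."""
    return [text.strip() for mark, text in CHECKBOX.findall(pr_body) if mark == " "]


body = """## Production Readiness

- [x] [Pre-Launch Checklist](/devops/checklists/pre-launch) completed
- [ ] [Security Review](/devops/checklists/security-review) completed
"""
for item in unchecked_items(body):
    print("still open:", item)
```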

Automated Checklist Enforcement

Use CI/CD to enforce critical checklist items automatically:

yaml
# .github/workflows/production-readiness.yml
name: Production Readiness Gate

on:
  pull_request:
    branches: [main]
    paths:
      - 'deploy/**'
      - 'k8s/**'
      - 'terraform/**'

jobs:
  readiness-checks:
    runs-on: ubuntu-latest
    steps:
      # Without a checkout the workspace is empty and every check below fails
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Verify monitoring configuration
        run: |
          # Check that a Prometheus ServiceMonitor manifest exists
          if ! find . -name "servicemonitor.yaml" | grep -q .; then
            echo "::error::No ServiceMonitor found. See Production Readiness Checklist."
            exit 1
          fi

      - name: Verify alerting rules
        run: |
          # Check that a PrometheusRule manifest exists
          if ! find . -name "prometheusrule.yaml" | grep -q .; then
            echo "::error::No alerting rules found. See Production Readiness Checklist."
            exit 1
          fi

      - name: Verify runbook links
        run: |
          # Heuristic: each rule file needs at least as many runbook_url
          # annotations as it has alert definitions
          for file in $(find . -name "prometheusrule.yaml"); do
            alerts=$(grep -cE '^[[:space:]]*-?[[:space:]]*alert:' "$file" || true)
            runbooks=$(grep -c 'runbook_url:' "$file" || true)
            if [ "$runbooks" -lt "$alerts" ]; then
              echo "::error::$file defines $alerts alerts but only $runbooks runbook_url annotations."
              exit 1
            fi
          done

      - name: Verify resource limits
        run: |
          # Coarse check: each workload manifest must contain a limits: block
          # (this does not inspect every container individually)
          for file in $(find . -name "deployment.yaml" -o -name "statefulset.yaml"); do
            if ! grep -q "limits:" "$file"; then
              echo "::error::$file missing resource limits."
              exit 1
            fi
          done

Tracking Checklist Completion

Use a simple tracking table in your team wiki or project management tool:

markdown
| Service | Pre-Launch | Security | Performance | Observability | Launch Date | Owner |
|---------|-----------|----------|-------------|---------------|------------|-------|
| payment-api | Done | Done | Done | Done | 2026-03-15 | @alice |
| user-service | Done | In Progress | Not Started | Not Started | 2026-04-01 | @bob |
| search-api | Not Started | Not Started | Not Started | Not Started | 2026-04-15 | @carol |
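
Because the tracking table is plain markdown, it is easy to query. The sketch below parses a table in the format shown above and flags services that launch soon with unfinished reviews. The function names and the 30-day default window are assumptions made for illustration.

```python
from datetime import date

REVIEWS = ("Pre-Launch", "Security", "Performance", "Observability")


def parse_tracking_table(markdown: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited markdown table into one dict per data row."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]


def at_risk(rows: list[dict[str, str]], today: date, window_days: int = 30) -> list[str]:
    """Services launching within window_days that have any review not Done."""
    risky = []
    for row in rows:
        days_left = (date.fromisoformat(row["Launch Date"]) - today).days
        incomplete = [name for name in REVIEWS if row[name] != "Done"]
        if incomplete and 0 <= days_left <= window_days:
            risky.append(row["Service"])
    return risky


table = """
| Service | Pre-Launch | Security | Performance | Observability | Launch Date | Owner |
|---------|-----------|----------|-------------|---------------|------------|-------|
| payment-api | Done | Done | Done | Done | 2026-03-15 | @alice |
| user-service | Done | In Progress | Not Started | Not Started | 2026-04-01 | @bob |
| search-api | Not Started | Not Started | Not Started | Not Started | 2026-04-15 | @carol |
"""
rows = parse_tracking_table(table)
print(at_risk(rows, today=date(2026, 3, 20), window_days=14))
```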

Checklist Maintenance

Quarterly Review Process

  • [ ] Review each checklist item — has it caught a real issue in the last 6 months?
  • [ ] Remove items that have never triggered a finding
  • [ ] Review recent postmortems — should any new items be added?
  • [ ] Check that all cross-references and links are still valid
  • [ ] Verify that automated enforcement still works
  • [ ] Gather feedback from teams that recently completed the checklists
  • [ ] Update time estimates based on actual completion data

Versioning Checklists

Treat checklists like code. Version them, review changes, and maintain a changelog:

markdown
# Checklist changelog
## 2026-Q1
- Added: "Circuit breaker configured for all external dependencies" (inspired by payment-api postmortem)
- Removed: "JVM heap size configured" (moved to language-specific supplement)
- Updated: "Load test at 2x traffic" changed to "Load test at 3x traffic" (we've seen 2.5x spikes)
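
If checklist versions are stored as plain lists of item text, the Added and Removed entries can be drafted automatically and then annotated by hand with the reasoning. The sketch below assumes that storage format; `changelog_entries` is a hypothetical name, and reworded items show up as one removal plus one addition.

```python
def changelog_entries(old: list[str], new: list[str]) -> list[str]:
    """Draft changelog lines from two versions of a checklist."""
    added = [f'- Added: "{item}"' for item in new if item not in old]
    removed = [f'- Removed: "{item}"' for item in old if item not in new]
    return added + removed


previous = [
    "JVM heap size configured",
    "Rollback procedure documented and tested",
]
current = [
    "Rollback procedure documented and tested",
    "Circuit breaker configured for all external dependencies",
]
for line in changelog_entries(previous, current):
    print(line)
```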

Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
|--------------|--------------|-----------------|
| Checklist as a gate | Teams see it as bureaucracy to game | Frame as collaborative review |
| One-size-fits-all | A CLI tool doesn't need CDN checks | Allow documented exceptions |
| Checklist without automation | Manual verification of 60 items is error-prone | Automate what can be automated |
| Checklist without context | "Set up monitoring" is useless without guidance | Link each item to a how-to page |
| Static checklist | Never updated, becomes irrelevant | Quarterly review cadence |
| Checklist as blame tool | "You missed item 37" in a postmortem | Focus on process improvement |

Further Reading

| Resource | Type | Key Takeaway |
|----------|------|--------------|
| The Checklist Manifesto (Gawande) | Book | The seminal work on why checklists work across industries |
| Google SRE Book, Ch. 32 | Chapter | Production readiness reviews at Google scale |
| SLI / SLO / SLA Engineering | Internal | Defining the targets your checklist verifies |
| Incident Response | Internal | What happens when checklist gaps reach production |
| Disaster Recovery | Internal | DR planning that checklists should verify |
| On-Call Handbook | Internal | The on-call experience your checklist protects |

Measuring Checklist Effectiveness

Track these metrics to ensure your checklists are providing value:

| Metric | How to Measure | Healthy Target |
|--------|----------------|----------------|
| Checklist completion rate | % of launches that completed all P0 items | > 95% |
| Launch incident rate | Incidents per service in first 30 days | < 1.0 |
| Item catch rate | % of checklist runs that find at least one real issue | 30-60% |
| Time to complete | Average hours from start to completion | < 8 hours for review |
| Exception rate | % of items waived per checklist run | < 10% |
| Staleness score | Months since last checklist update | < 3 months |

Measuring What Matters

If your checklist catch rate is below 20%, the checklist may be too easy — items are things teams always do anyway. If it is above 80%, the checklist is catching real issues but your development process may need improvement upstream. The sweet spot is 30-60%: the checklist catches issues often enough to be valuable but not so often that it indicates systemic process failures.
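
The thresholds above can be encoded so the interpretation is applied consistently from quarter to quarter. This is a sketch only: the function names and the returned labels are hypothetical, and the "borderline" label for the 20-30% and 60-80% bands is an assumption, since the text does not classify those ranges.

```python
def catch_rate(runs_with_findings: int, total_runs: int) -> float:
    """Percentage of checklist runs that found at least one real issue."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return 100.0 * runs_with_findings / total_runs


def interpret(rate: float) -> str:
    """Map a catch rate to the guidance in the paragraph above."""
    if rate < 20:
        return "too easy"
    if rate > 80:
        return "upstream process needs attention"
    if 30 <= rate <= 60:
        return "healthy"
    return "borderline"  # assumption: the source leaves 20-30% and 60-80% unclassified


# Illustrative numbers: 9 of the last 20 reviews found a real issue.
rate = catch_rate(9, 20)
print(f"{rate:.0f}% -> {interpret(rate)}")
```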

Maturity Model for Checklist Adoption


What's Next

Start with the Pre-Launch Checklist — it is the most comprehensive and covers the broadest surface area. Then layer on the Security Review, Performance Review, and Observability Readiness checklists as your team matures.

Remember: a checklist that is actually used beats a perfect checklist that sits in a wiki. Start small, start now, and iterate.

"What I cannot create, I do not understand." — Richard Feynman