
Migration Playbooks

Why Migrations Matter

Every production system eventually needs to change. The database schema that worked for 1,000 users cannot serve 10 million. The monolith that launched the startup cannot sustain 50 engineering teams. The on-premises infrastructure that met compliance in 2018 cannot compete with cloud-native competitors shipping features weekly.

Migration is the engineering discipline of moving a production system from state A to state B without losing data, breaking functionality, or causing downtime. It sounds simple. It is not. Migrations are among the highest-risk categories of engineering work because they combine three dangerous properties:

  1. Irreversibility pressure — once you start moving data or traffic, rolling back becomes progressively harder
  2. Blast radius — migrations typically affect the entire system, not a single feature
  3. Coordination complexity — migrations require synchronizing changes across code, data, infrastructure, and often multiple teams

The history of catastrophic outages is littered with failed migrations and botched operational changes. Knight Capital lost $440 million in 45 minutes in 2012 due to a botched code deployment that reactivated dead code. GitLab lost roughly six hours of production data in 2017 when an engineer troubleshooting database replication deleted the primary's data directory, and the backups were discovered to be non-functional. British Airways cancelled hundreds of flights over a holiday weekend in 2017 after a power failure at one of its data centers took down its IT systems.

These are not stories of incompetence. They are stories of complexity overwhelming preparation. The purpose of migration playbooks is to systematically reduce this complexity through repeatable processes, risk frameworks, and rollback strategies.

The Migration Spectrum

Not all migrations carry equal risk. Where your migration falls on the risk spectrum determines how much preparation it needs.


Types of Migrations

Database Migrations

Database migrations modify the schema or engine that stores your application's data. They are the most common and most dangerous migration type because data is the one thing you absolutely cannot lose.

| Migration Type | Risk Level | Typical Duration | Rollback Difficulty |
| --- | --- | --- | --- |
| Add nullable column | Low | Seconds | Easy: drop the column |
| Add index (concurrent) | Low | Minutes to hours | Easy: drop the index |
| Rename column | Medium | Hours (expand-contract) | Medium: requires code coordination |
| Change column type | High | Hours to days | Hard: data conversion may be lossy |
| Shard a database | Critical | Weeks | Very hard: data is distributed |
| Switch database engine | Critical | Weeks to months | Very hard: application layer changes |

Start with the Expand-Contract Pattern

Almost every database migration can be made safe by decomposing it into expand (add new) and contract (remove old) phases. This is covered in depth in the Database Schema Migrations page.
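
As a rough illustration of the expand and contract phases, the sketch below lays out a column rename as an ordered plan. The table and column names (`users`, `full_name`, `display_name`), the intermediate `migrate` phase label, and the `MigrationStep` shape are all hypothetical. The SQL is generic, and exact syntax and locking behavior depend on your database engine.

```typescript
// Hypothetical plan for renaming users.full_name to users.display_name.
type Phase = 'expand' | 'migrate' | 'contract';

interface MigrationStep {
  phase: Phase;
  description: string;
  sql?: string; // omitted for steps that are code deploys, not schema changes
}

const renameColumnPlan: MigrationStep[] = [
  { phase: 'expand', description: 'Add the new column alongside the old one',
    sql: 'ALTER TABLE users ADD COLUMN display_name TEXT NULL' },
  { phase: 'expand', description: 'Deploy code that writes to both columns' },
  { phase: 'migrate', description: 'Backfill existing rows (in batches on large tables)',
    sql: 'UPDATE users SET display_name = full_name WHERE display_name IS NULL' },
  { phase: 'migrate', description: 'Deploy code that reads from the new column' },
  { phase: 'contract', description: 'Deploy code that stops writing the old column' },
  { phase: 'contract', description: 'Drop the old column',
    sql: 'ALTER TABLE users DROP COLUMN full_name' },
];

// A plan is well ordered if phases never move backwards
// (expand, then migrate, then contract).
function isWellOrdered(plan: MigrationStep[]): boolean {
  const order: Record<Phase, number> = { expand: 0, migrate: 1, contract: 2 };
  let previous = 0;
  for (const step of plan) {
    if (order[step.phase] < previous) return false;
    previous = order[step.phase];
  }
  return true;
}

console.log(isWellOrdered(renameColumnPlan)); // true
```

Every step before the final `DROP COLUMN` can be undone without losing data, which is why the contract phase is usually delayed until the new column has proven itself in production.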

Application Architecture Migrations

These migrations change how your application code is structured and deployed. The canonical example is monolith-to-microservices, but it also includes:

  • Framework migrations (e.g., Express to Fastify, Angular to React)
  • Language migrations (e.g., Ruby to Go for performance-critical services)
  • API version migrations (e.g., REST v1 to v2, or REST to GraphQL)
  • Authentication migrations (e.g., session-based to JWT, or to OAuth 2.0)

Application migrations are generally lower risk than database migrations because code itself holds no persistent state: a bad release can be redeployed or reverted without losing data. The primary risk lies in the transition period, when old and new code must coexist.

See also: Monolith to Microservices for the most complex application migration pattern.

Infrastructure Migrations

Infrastructure migrations move your compute, storage, and networking from one platform or configuration to another:

  • Cloud migrations (on-premises to cloud, or cloud-to-cloud)
  • Container migrations (VMs to Docker/Kubernetes)
  • Network migrations (IP range changes, VPN restructuring, DNS provider changes)
  • Storage migrations (NFS to S3, EBS to EFS, HDD to SSD)

Infrastructure migrations carry high risk because they affect every layer of the stack simultaneously. A misconfigured security group or DNS record can take down the entire system.

See also: Cloud Migration Playbook for the full 6 R's framework.

Data Migrations

Data migrations move data between systems, formats, or locations without necessarily changing the schema:

  • ETL migrations (changing data pipeline tools)
  • Data format migrations (XML to JSON, CSV to Parquet)
  • Data residency migrations (moving data to comply with GDPR, data sovereignty laws)
  • Archive migrations (moving cold data to cheaper storage tiers)

Risk Assessment Framework

Before starting any migration, run it through this risk assessment. The goal is not to avoid risk — it is to understand it well enough to prepare appropriate safeguards.

The DICE Framework

Adapted from the Boston Consulting Group's DICE framework for change management, this version is tailored for technical migrations:

| Factor | Score 1 (Low Risk) | Score 2 (Medium Risk) | Score 3 (High Risk) | Score 4 (Critical Risk) |
| --- | --- | --- | --- | --- |
| Duration | < 1 day | 1-7 days | 1-4 weeks | > 4 weeks |
| Integrity (data risk) | No data changes | Additive only | Data transformation | Data deletion/movement |
| Coordination | Single team | 2-3 teams | Cross-org | External vendors |
| Effort (% of team) | < 10% | 10-25% | 25-50% | > 50% |

Total score interpretation:

  • 4-6: Low risk. Standard deployment process is sufficient.
  • 7-10: Medium risk. Dedicated migration plan with explicit rollback steps.
  • 11-14: High risk. War room, dedicated on-call, rehearsed rollback.
  • 15-16: Critical risk. Board-level visibility, phased execution over weeks, kill switches at every stage.

```typescript
// Migration risk calculator: sums the four DICE factor scores (range 4-16)
interface MigrationRisk {
  duration: 1 | 2 | 3 | 4;
  integrity: 1 | 2 | 3 | 4;
  coordination: 1 | 2 | 3 | 4;
  effort: 1 | 2 | 3 | 4;
}

function assessRisk(risk: MigrationRisk): string {
  const total = risk.duration + risk.integrity
    + risk.coordination + risk.effort;

  if (total <= 6) return 'LOW: standard deployment process';
  if (total <= 10) return 'MEDIUM: dedicated migration plan';
  if (total <= 14) return 'HIGH: war room + rehearsed rollback';
  return 'CRITICAL: phased execution, board visibility';
}

// Example: sharding a production database
const shardingMigration: MigrationRisk = {
  duration: 4,   // > 4 weeks
  integrity: 4,  // data movement across shards
  coordination: 3, // cross-org (DBA, backend, SRE)
  effort: 3,     // 25-50% of engineering team
};

console.log(assessRisk(shardingMigration));
// Total is 14, so this prints "HIGH: war room + rehearsed rollback".
// One more point on any factor (for example, effort > 50%) would make it CRITICAL.
```

Pre-Migration Checklist

Every migration, regardless of risk level, should pass this checklist before execution:

```markdown
## Pre-Migration Checklist

### Data Safety
- [ ] Full backup taken and verified (restore tested)
- [ ] Point-in-time recovery confirmed working
- [ ] Data validation queries written and tested
- [ ] Rollback scripts written and tested in staging

### Observability
- [ ] Migration-specific dashboards created
- [ ] Alert thresholds adjusted for migration window
- [ ] Error rate baseline recorded
- [ ] Latency baseline recorded

### Communication
- [ ] Stakeholders notified of migration window
- [ ] Status page prepared (if customer-facing impact)
- [ ] Escalation contacts confirmed available
- [ ] Rollback decision criteria documented

### Execution
- [ ] Migration rehearsed in staging environment
- [ ] Timing estimates validated against staging run
- [ ] Feature flags in place for traffic control
- [ ] Kill switch tested and documented
```
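
If the checklist lives in a markdown file, a go/no-go gate can be as small as scanning for unchecked boxes. This is a minimal sketch under that assumption. The function names `uncheckedItems` and `readyToExecute` are hypothetical, and it only recognizes the `- [ ]` and `- [x]` syntax used above.

```typescript
// Returns the text of every unchecked "- [ ]" item in a markdown checklist.
function uncheckedItems(checklist: string): string[] {
  const items: string[] = [];
  for (const line of checklist.split('\n')) {
    const match = /^\s*- \[ \]\s+(.*)$/.exec(line);
    if (match) items.push(match[1].trim());
  }
  return items;
}

// The migration may proceed only when nothing is left unchecked.
function readyToExecute(checklist: string): boolean {
  return uncheckedItems(checklist).length === 0;
}

const sampleChecklist = [
  '### Data Safety',
  '- [x] Full backup taken and verified (restore tested)',
  '- [ ] Rollback scripts written and tested in staging',
].join('\n');

console.log(uncheckedItems(sampleChecklist).join('; '));
// Rollback scripts written and tested in staging
console.log(readyToExecute(sampleChecklist)); // false
```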

Never Skip the Staging Rehearsal

Skipping the staging rehearsal is one of the most common causes of migration failure. Production data is always messier, larger, and more interconnected than you expect. A migration that takes 30 seconds in staging may take 3 hours in production due to index sizes, foreign key constraints, or replication lag, so rehearse against a dataset as close to production scale as you can get.


Rollback Strategies

A migration without a rollback plan is not a migration — it is a gamble. Every migration must have a documented rollback strategy before it begins.

Rollback Strategy Matrix

The Rollback Time Budget

Every migration has a point of no return — the moment after which rollback becomes more expensive than pushing forward. You must identify this point before starting.

```typescript
interface RollbackBudget {
  /** Maximum acceptable downtime in minutes */
  maxDowntimeMinutes: number;
  /** Time to execute rollback from any point */
  rollbackExecutionMinutes: number;
  /** Time to verify rollback succeeded */
  rollbackVerificationMinutes: number;
  /** Buffer for unexpected issues (typically 2x) */
  bufferMultiplier: number;
}

function calculatePointOfNoReturn(budget: RollbackBudget): number {
  const rollbackTotal =
    (budget.rollbackExecutionMinutes + budget.rollbackVerificationMinutes)
    * budget.bufferMultiplier;

  const pointOfNoReturn = budget.maxDowntimeMinutes - rollbackTotal;

  if (pointOfNoReturn <= 0) {
    throw new Error(
      'Rollback time exceeds downtime budget: migration cannot be done safely in this window'
    );
  }

  return pointOfNoReturn;
}

// Example: 4-hour maintenance window
const budget: RollbackBudget = {
  maxDowntimeMinutes: 240,
  rollbackExecutionMinutes: 45,
  rollbackVerificationMinutes: 15,
  bufferMultiplier: 2,
};

console.log(`Point of no return: ${
  calculatePointOfNoReturn(budget)
} minutes into migration`);
// "Point of no return: 120 minutes into migration"
```

Know Your Point of No Return

If you reach the point of no return and the migration is not on track, you must roll back immediately. Do not negotiate with the timeline. Do not convince yourself "it will probably finish." The further past the point of no return you go, the more catastrophic a failure becomes. This decision should be made by a single designated owner, not by committee.
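
One way to make "on track" objective is to project the finish time from progress so far and compare it with the window. The sketch below assumes a roughly constant migration rate, which real migrations often violate. The `Checkpoint` shape and the `decide` function are hypothetical names introduced for illustration.

```typescript
type Decision = 'continue' | 'rollback';

interface Checkpoint {
  elapsedMinutes: number;          // time since the migration started
  fractionComplete: number;        // 0..1, e.g. rows migrated / total rows
  pointOfNoReturnMinutes: number;  // computed before the migration begins
  windowMinutes: number;           // total maintenance window
}

// Projects total duration from progress so far, assuming a constant rate.
function projectedTotalMinutes(c: Checkpoint): number {
  return c.fractionComplete > 0
    ? c.elapsedMinutes / c.fractionComplete
    : Infinity;
}

function decide(c: Checkpoint): Decision {
  // Before the point of no return there is still time to roll back later.
  if (c.elapsedMinutes < c.pointOfNoReturnMinutes) return 'continue';
  // At or past it, continue only if the projection still fits the window.
  return projectedTotalMinutes(c) <= c.windowMinutes ? 'continue' : 'rollback';
}

// 4-hour window with the point of no return at 120 minutes.
const plan = { pointOfNoReturnMinutes: 120, windowMinutes: 240 };
console.log(decide({ ...plan, elapsedMinutes: 120, fractionComplete: 0.4 })); // rollback
console.log(decide({ ...plan, elapsedMinutes: 120, fractionComplete: 0.6 })); // continue
```

Writing the rule down before the migration starts gives the designated owner a criterion to apply under pressure instead of a judgment call.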

Forward-Fix vs. Rollback

Sometimes rolling back is worse than fixing forward. The decision framework:

| Condition | Rollback | Fix Forward |
| --- | --- | --- |
| Data has been deleted or corrupted | Roll back to backup | Never |
| New code has a bug but data is intact | Either, depending on speed | If fix takes < 15 min |
| Migration is 90% complete with no errors | No (finish it) | Fix forward |
| Users are reporting errors | Roll back immediately | Only if root cause is known |
| Old infrastructure already decommissioned | Cannot roll back | Fix forward (only option) |
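
The table can be encoded as a small decision function so the rule is agreed on before an incident rather than during one. This is a sketch, not a complete policy: the `MigrationIncident` fields are hypothetical, the order in which conflicting rows are checked is an assumption, and the "90% complete with no errors" row is left out because there is nothing to recover from in that case.

```typescript
interface MigrationIncident {
  dataLostOrCorrupted: boolean;
  oldInfrastructureDecommissioned: boolean;
  usersReportingErrors: boolean;
  rootCauseKnown: boolean;
  estimatedFixMinutes: number;
}

type Recovery = 'rollback' | 'fix-forward';

function chooseRecovery(i: MigrationIncident): Recovery {
  // Nothing left to roll back to: fixing forward is the only option.
  if (i.oldInfrastructureDecommissioned) return 'fix-forward';
  // Lost or corrupted data is restored from backup, never patched in place.
  if (i.dataLostOrCorrupted) return 'rollback';
  // User-facing errors with no known root cause: roll back immediately.
  if (i.usersReportingErrors && !i.rootCauseKnown) return 'rollback';
  // Otherwise decide on speed, using the 15-minute cutoff from the table.
  return i.estimatedFixMinutes < 15 ? 'fix-forward' : 'rollback';
}

const baseline: MigrationIncident = {
  dataLostOrCorrupted: false,
  oldInfrastructureDecommissioned: false,
  usersReportingErrors: false,
  rootCauseKnown: false,
  estimatedFixMinutes: 10,
};

console.log(chooseRecovery({ ...baseline, dataLostOrCorrupted: true })); // rollback
console.log(chooseRecovery({ ...baseline, usersReportingErrors: true, rootCauseKnown: true })); // fix-forward
```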

Migration Execution Patterns

The Traffic Shift Pattern

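For any migration that involves moving traffic between systems, use a gradual shift with automated rollback. Send a small share of traffic to the new system, compare its error rate and latency against the recorded baseline, and raise the share only while those metrics stay within thresholds. If they do not, route all traffic back to the old system automatically.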
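
The control logic behind a gradual shift can be sketched as a pure decision function that a deployment loop would call after each soak period. The ramp percentages, threshold values, and all names here (`RAMP`, `HealthSample`, `nextTrafficPercent`) are hypothetical. Actually routing traffic and collecting metrics is left to your load balancer and monitoring stack.

```typescript
interface HealthSample { errorRate: number; p99LatencyMs: number; }
interface Thresholds { maxErrorRate: number; maxP99LatencyMs: number; }

// Hypothetical ramp: percentage of traffic sent to the new system per stage.
const RAMP = [1, 5, 25, 50, 100];

function isHealthy(h: HealthSample, t: Thresholds): boolean {
  return h.errorRate <= t.maxErrorRate && h.p99LatencyMs <= t.maxP99LatencyMs;
}

// Returns the next traffic percentage for the new system; 0 means roll back.
function nextTrafficPercent(
  current: number,
  health: HealthSample,
  thresholds: Thresholds,
): number {
  if (!isHealthy(health, thresholds)) return 0; // automated rollback
  for (const percent of RAMP) {
    if (percent > current) return percent;      // advance one stage
  }
  return current;                               // already at the final stage
}

const limits: Thresholds = { maxErrorRate: 0.001, maxP99LatencyMs: 200 };
console.log(nextTrafficPercent(5, { errorRate: 0.0002, p99LatencyMs: 140 }, limits)); // 25
console.log(nextTrafficPercent(25, { errorRate: 0.02, p99LatencyMs: 140 }, limits));  // 0
```

Keeping the decision separate from the side effects makes the rollback rule easy to test before it is trusted with production traffic.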

The Dual-Write Pattern

For data migrations, dual-writing ensures no data is lost during the transition:

  1. Start dual-writing — write to both old and new systems
  2. Backfill — copy historical data to the new system
  3. Validate — compare data in both systems
  4. Switch reads — read from the new system
  5. Stop writing to old — remove the dual-write path
  6. Decommission — remove old system after a safety period
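
```typescript
// Minimal store shape assumed by this example; substitute your real storage client.
interface DataStore<T> {
  write(key: string, value: T): Promise<void>;
  read(key: string): Promise<T | null>;
}

// Placeholder metrics client so the example runs; replace with your own.
const metrics = {
  increment(name: string): void {
    console.warn(`metric incremented: ${name}`);
  },
};

class DualWriteProxy<T> {
  constructor(
    private oldStore: DataStore<T>,
    private newStore: DataStore<T>,
    private mode: 'shadow' | 'dual-write' | 'new-primary' | 'new-only'
  ) {}

  async write(key: string, value: T): Promise<void> {
    switch (this.mode) {
      case 'shadow':
        // Old store is the source of truth; new-store failures are only counted.
        await this.oldStore.write(key, value);
        // Fire-and-forget to new store for testing
        this.newStore.write(key, value).catch(() =>
          metrics.increment('dual_write.shadow.error')
        );
        break;

      case 'dual-write':
        // Both must succeed. If only one write fails the stores diverge,
        // so callers should retry and the validation step must catch leftovers.
        await Promise.all([
          this.oldStore.write(key, value),
          this.newStore.write(key, value),
        ]);
        break;

      case 'new-primary':
        // New store is the source of truth; old store is kept current for rollback.
        await this.newStore.write(key, value);
        this.oldStore.write(key, value).catch(() =>
          metrics.increment('dual_write.old_fallback.error')
        );
        break;

      case 'new-only':
        // Dual-write path removed; the old store can be decommissioned.
        await this.newStore.write(key, value);
        break;
    }
  }

  async read(key: string): Promise<T | null> {
    // Reads follow the source of truth for the current mode.
    if (this.mode === 'new-primary' || this.mode === 'new-only') {
      return this.newStore.read(key);
    }
    return this.oldStore.read(key);
  }
}
```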
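
The validation step can be sketched as a comparison of two snapshots, each mapping a key to a serialized value or checksum. This assumes both systems can export such a snapshot. The `Snapshot` and `ValidationReport` types and the `compareSnapshots` function are hypothetical, and for large datasets you would compare in batches or by sampled key ranges rather than all at once.

```typescript
type Snapshot = Record<string, string>; // key -> serialized value or checksum

interface ValidationReport {
  missingInNew: string[];    // present in old, absent in new
  mismatched: string[];      // present in both, values differ
  unexpectedInNew: string[]; // present in new, absent in old
}

function has(data: Snapshot, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(data, key);
}

function compareSnapshots(oldData: Snapshot, newData: Snapshot): ValidationReport {
  const report: ValidationReport = {
    missingInNew: [], mismatched: [], unexpectedInNew: [],
  };
  for (const key of Object.keys(oldData)) {
    if (!has(newData, key)) report.missingInNew.push(key);
    else if (newData[key] !== oldData[key]) report.mismatched.push(key);
  }
  for (const key of Object.keys(newData)) {
    if (!has(oldData, key)) report.unexpectedInNew.push(key);
  }
  return report;
}

const validation = compareSnapshots(
  { a: '1', b: '2', c: '3' },
  { a: '1', b: 'changed', d: '4' },
);
console.log(JSON.stringify(validation));
// {"missingInNew":["c"],"mismatched":["b"],"unexpectedInNew":["d"]}
```

Reads should switch to the new system only once all three lists are empty, or every remaining entry has been explained.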

Communication During Migrations

The RACI Matrix for Migrations

| Activity | Responsible | Accountable | Consulted | Informed |
| --- | --- | --- | --- | --- |
| Migration plan | Lead engineer | Engineering manager | DBA, SRE | All engineering |
| Risk assessment | Lead engineer | VP Engineering | Security, Legal | CTO |
| Staging rehearsal | Migration team | Lead engineer | QA | SRE |
| Go/no-go decision | Lead engineer | VP Engineering | SRE, DBA | All engineering |
| Execution | Migration team | Lead engineer | SRE (on-call) | Status page |
| Rollback decision | Incident commander | VP Engineering | Lead engineer | All engineering |
| Post-migration review | Lead engineer | Engineering manager | All involved | All engineering |

Status Update Template

During long-running migrations, send status updates every 30 minutes:

```markdown
## Migration Status Update: {{ timestamp }}

**Phase:** 3 of 5 (Data backfill)
**Progress:** 67% (2.1M / 3.1M rows)
**ETA:** 45 minutes remaining
**Status:** ON TRACK

### Metrics
- Error rate: 0.001% (threshold: 0.1%)
- p99 latency: 142ms (baseline: 135ms)
- Replication lag: 2.3s (threshold: 10s)

### Issues
- None

### Next checkpoint
- 14:30 UTC: Phase 3 completion expected
```
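
Updates sent every 30 minutes are easier to keep consistent when the summary lines are generated from live counters rather than typed by hand. The sketch below renders only the Phase, Progress, ETA, and Status lines of the template. The `MigrationStatus` shape and the `AT RISK` label are assumptions introduced for illustration.

```typescript
interface MigrationStatus {
  phase: number;
  totalPhases: number;
  phaseName: string;
  rowsDone: number;
  rowsTotal: number;
  etaMinutes: number;
  onTrack: boolean;
}

// Renders the summary lines of the status update template.
function renderStatusSummary(s: MigrationStatus): string {
  // Round down so the update never overstates progress.
  const percent = Math.floor((s.rowsDone / s.rowsTotal) * 100);
  return [
    `**Phase:** ${s.phase} of ${s.totalPhases} (${s.phaseName})`,
    `**Progress:** ${percent}% (${s.rowsDone} / ${s.rowsTotal} rows)`,
    `**ETA:** ${s.etaMinutes} minutes remaining`,
    `**Status:** ${s.onTrack ? 'ON TRACK' : 'AT RISK'}`,
  ].join('\n');
}

const summary = renderStatusSummary({
  phase: 3,
  totalPhases: 5,
  phaseName: 'Data backfill',
  rowsDone: 2100000,
  rowsTotal: 3100000,
  etaMinutes: 45,
  onTrack: true,
});
console.log(summary);
```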

This overview establishes the foundational concepts. Dive into the specific migration type you are planning:

| Migration Type | Page | When To Use |
| --- | --- | --- |
| Database schema changes | Zero-Downtime Database Migrations | Adding columns, changing types, renaming fields |
| Application re-architecture | Monolith to Microservices | Breaking apart a monolith |
| Infrastructure relocation | Cloud Migration Playbook | Moving to cloud or between clouds |

See also: Deployment Strategies for deploying code changes safely, and Disaster Recovery for when migrations go catastrophically wrong.
