Multi-Region Architecture Patterns
Why Architecture Patterns Matter
Choosing the wrong multi-region pattern can cost millions in unnecessary infrastructure, create intractable data consistency problems, or worse — provide a false sense of safety that collapses during the very outage it was designed to survive.
Each pattern represents a specific trade-off between cost, complexity, consistency, and recovery guarantees. Understanding these trade-offs from first principles — rather than cargo-culting a pattern from a FAANG blog post — is the difference between a multi-region architecture that works and one that fails when you need it most.
The Three Fundamental Patterns
| Pattern | Description | RPO | RTO | Cost | Use Case |
|---|---|---|---|---|---|
| Active-Passive | One region serves traffic; another waits | Minutes | Minutes-Hours | 1.3-1.5x | DR compliance |
| Active-Active | All regions serve traffic simultaneously | Seconds | Seconds | 2.0-2.5x | Global performance |
| Cell-Based | Users partitioned into independent cells | Zero | Seconds | 2.0-3.0x | Massive scale |
First Principles
Failure Domain Theory
A failure domain is the set of components that fail together. Multi-region architecture creates isolated failure domains:
Failure domain hierarchy (from smallest to largest):
- Process: Single container/VM failure
- Host: Physical machine failure
- Rack: Power/network switch failure
- Availability Zone: Datacenter-level failure (fire, power, cooling)
- Region: Geographic area failure (natural disaster, fiber cut)
- Cloud Provider: Provider-wide failure (control plane bug)
- Global: Internet-level disruption (BGP hijack, DNS root)
The Blast Radius Equation
The expected downtime from a failure event scales with its blast radius, the fraction of users/requests affected:
| Architecture | Blast Radius of Region Failure | Expected Annual Downtime |
|---|---|---|
| Single region | 100% | |
| Active-passive | 100% (during failover) -> 0% | |
| Active-active | 50% with two regions (instant failover) | |
| Cell-based (10 cells) | 10% | |
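To make the blast-radius column concrete, one plausible form of the relationship is: expected downtime = failures per year × minutes per failure × blast radius. The sketch below assumes that form. The function name and the input figures (one failure per year, 240 minutes per failure) are illustrative assumptions, not values from this article.

```typescript
// Assumed model: user-weighted downtime = frequency x duration x blast radius.
interface FailureScenario {
  failuresPerYear: number;   // expected failures per year (assumed input)
  minutesPerFailure: number; // minutes affected users stay impacted per failure
  blastRadius: number;       // fraction of users/requests affected, 0..1
}

function expectedAnnualDowntimeMinutes(s: FailureScenario): number {
  if (s.blastRadius < 0 || s.blastRadius > 1) {
    throw new RangeError('blastRadius must be between 0 and 1');
  }
  return s.failuresPerYear * s.minutesPerFailure * s.blastRadius;
}

// Same failure frequency and duration, different blast radius.
const singleRegion = expectedAnnualDowntimeMinutes({
  failuresPerYear: 1,
  minutesPerFailure: 240,
  blastRadius: 1.0,
});
const tenCells = expectedAnnualDowntimeMinutes({
  failuresPerYear: 1,
  minutesPerFailure: 240,
  blastRadius: 0.1,
});
console.log({ singleRegion, tenCells });
```

Holding frequency and duration fixed, the user-weighted downtime shrinks in direct proportion to the blast radius, which is the argument the table is making.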
Core Mechanics
Pattern 1: Active-Passive
In active-passive, one region handles all production traffic while the secondary region receives replicated data and stands ready to take over.
Failover process:
Implementation:
// src/failover/active-passive-controller.ts
interface FailoverConfig {
primaryRegion: string;
secondaryRegion: string;
healthCheckInterval: number;
failureThreshold: number;
recoveryThreshold: number;
dnsRecordId: string;
dnsTTL: number;
}
interface RegionHealth {
region: string;
healthy: boolean;
lastCheck: Date;
consecutiveFailures: number;
consecutiveSuccesses: number;
latencyMs: number;
}
class ActivePassiveController {
private regionHealth: Map<string, RegionHealth> = new Map();
private currentPrimary: string;
constructor(private config: FailoverConfig) {
this.currentPrimary = config.primaryRegion;
}
async performHealthCheck(region: string): Promise<boolean> {
const health = this.regionHealth.get(region) ?? {
region,
healthy: true,
lastCheck: new Date(),
consecutiveFailures: 0,
consecutiveSuccesses: 0,
latencyMs: 0,
};
try {
const start = Date.now();
const response = await fetch(`https://${region}.myapp.internal/health`, {
signal: AbortSignal.timeout(5000),
});
health.latencyMs = Date.now() - start;
if (response.ok) {
health.consecutiveSuccesses++;
health.consecutiveFailures = 0;
// Recovery is decided below via recoveryThreshold, not on a single success
} else {
health.consecutiveFailures++;
health.consecutiveSuccesses = 0;
}
} catch {
health.consecutiveFailures++;
health.consecutiveSuccesses = 0;
health.latencyMs = -1;
}
health.lastCheck = new Date();
// Determine health status
if (health.consecutiveFailures >= this.config.failureThreshold) {
health.healthy = false;
}
if (health.consecutiveSuccesses >= this.config.recoveryThreshold) {
health.healthy = true;
}
this.regionHealth.set(region, health);
return health.healthy;
}
async evaluateFailover(): Promise<{
action: 'failover' | 'failback' | 'none';
reason: string;
}> {
const primaryHealth = this.regionHealth.get(this.config.primaryRegion);
const secondaryHealth = this.regionHealth.get(this.config.secondaryRegion);
if (!primaryHealth || !secondaryHealth) {
return { action: 'none', reason: 'Insufficient health data' };
}
// Failover: primary is down, secondary is up
if (
this.currentPrimary === this.config.primaryRegion &&
!primaryHealth.healthy &&
secondaryHealth.healthy
) {
return {
action: 'failover',
reason: `Primary ${this.config.primaryRegion} unhealthy ` +
`(${primaryHealth.consecutiveFailures} consecutive failures)`,
};
}
// Failback: primary recovered, currently on secondary
if (
this.currentPrimary === this.config.secondaryRegion &&
primaryHealth.healthy &&
primaryHealth.consecutiveSuccesses >= this.config.recoveryThreshold * 2
) {
return {
action: 'failback',
reason: `Primary ${this.config.primaryRegion} recovered ` +
`(${primaryHealth.consecutiveSuccesses} consecutive successes)`,
};
}
return { action: 'none', reason: 'No action needed' };
}
async executeFailover(targetRegion: string): Promise<void> {
console.log(`Executing failover to ${targetRegion}`);
// Step 1: Promote RDS replica
if (targetRegion !== this.config.primaryRegion) {
await this.promoteRdsReplica(targetRegion);
}
// Step 2: Scale up compute in target region
await this.scaleCompute(targetRegion, 'up');
// Step 3: Update DNS
await this.updateDns(targetRegion);
// Step 4: Wait for DNS propagation
await this.waitForDnsPropagation();
// Step 5: Scale down old primary (after traffic drains)
const oldPrimary = this.currentPrimary;
this.currentPrimary = targetRegion;
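setTimeout(() => {
// Catch errors here: a rejected promise inside setTimeout would otherwise go unhandled
this.scaleCompute(oldPrimary, 'down').catch(err =>
console.error(`Scale-down of ${oldPrimary} failed`, err));
}, 5 * 60 * 1000); // Wait 5 minutes for traffic to drain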
}
private async promoteRdsReplica(region: string): Promise<void> {
// AWS SDK call to promote read replica
console.log(`Promoting RDS replica in ${region} to primary`);
}
private async scaleCompute(region: string, direction: 'up' | 'down'): Promise<void> {
const targetCount = direction === 'up' ? 10 : 2;
console.log(`Scaling ${region} to ${targetCount} tasks`);
}
private async updateDns(targetRegion: string): Promise<void> {
console.log(`Updating DNS to point to ${targetRegion}`);
}
private async waitForDnsPropagation(): Promise<void> {
// Wait for 2x TTL to ensure propagation
const waitTime = this.config.dnsTTL * 2 * 1000;
await new Promise(resolve => setTimeout(resolve, waitTime));
}
}
Pattern 2: Active-Active
In active-active, all regions serve production traffic simultaneously. Each region can handle the full load independently.
The Write Conflict Challenge:
Active-active's hardest problem is concurrent writes to the same data in different regions.
Conflict resolution strategies:
| Strategy | Description | Data Loss Risk | Use Case |
|---|---|---|---|
| Last-Writer-Wins (LWW) | Timestamp-based, latest write wins | One write lost | User profiles, preferences |
| Application-level merge | Custom merge logic | None (if logic correct) | Documents, collaborative editing |
| CRDTs | Mathematically convergent types | None | Counters, sets, flags |
| Region ownership | Each region owns specific data | None | User data by home region |
| Conflict-free routing | Route writes to one region | None | Shopping carts, orders |
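To make the CRDT row concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest convergent types. The type and function names are illustrative and not part of this article's codebase.

```typescript
// G-Counter: each region increments only its own slot.
type GCounter = Record<string, number>;

function gcIncrement(counter: GCounter, region: string, by = 1): GCounter {
  return { ...counter, [region]: (counter[region] ?? 0) + by };
}

// Merge takes the per-region maximum, so it is commutative, associative,
// and idempotent: replicas converge regardless of delivery order.
function gcMerge(a: GCounter, b: GCounter): GCounter {
  const out: GCounter = { ...a };
  for (const [region, count] of Object.entries(b)) {
    out[region] = Math.max(out[region] ?? 0, count);
  }
  return out;
}

function gcValue(counter: GCounter): number {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}

// Two regions accept writes concurrently, then exchange state.
const usEast = gcIncrement(gcIncrement({}, 'us-east-1'), 'us-east-1');
const euWest = gcIncrement({}, 'eu-west-1');
const mergedAtUs = gcMerge(usEast, euWest);
const mergedAtEu = gcMerge(euWest, usEast);
console.log(gcValue(mergedAtUs), gcValue(mergedAtEu));
```

Because no region ever writes another region's slot, there is nothing to conflict on, and no write is lost. The cost is that the type only supports increments.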
LWW with vector clocks:
// src/replication/conflict-resolver.ts
interface VectorClock {
[regionId: string]: number;
}
interface VersionedValue<T> {
value: T;
vectorClock: VectorClock;
timestamp: number;
region: string;
}
class ConflictResolver<T> {
/**
* Compare two vector clocks.
* Returns:
* 'before' if a happened-before b
* 'after' if b happened-before a
* 'concurrent' if neither happened-before the other
*/
compareVectorClocks(a: VectorClock, b: VectorClock): 'before' | 'after' | 'concurrent' {
let aBeforeB = false;
let bBeforeA = false;
const allKeys = new Set([...Object.keys(a), ...Object.keys(b)]);
for (const key of allKeys) {
const aVal = a[key] ?? 0;
const bVal = b[key] ?? 0;
if (aVal < bVal) aBeforeB = true;
if (bVal < aVal) bBeforeA = true;
}
if (aBeforeB && !bBeforeA) return 'before';
if (bBeforeA && !aBeforeB) return 'after';
return 'concurrent';
}
resolve(
local: VersionedValue<T>,
remote: VersionedValue<T>,
strategy: 'lww' | 'custom',
customMerge?: (a: T, b: T) => T
): VersionedValue<T> {
const relation = this.compareVectorClocks(
local.vectorClock,
remote.vectorClock
);
switch (relation) {
case 'before':
// Local is older, use remote
return remote;
case 'after':
// Remote is older, keep local
return local;
case 'concurrent': {
// True conflict (or identical clocks): apply the resolution strategy
const vectorClock = this.mergeClocks(local.vectorClock, remote.vectorClock);
if (strategy === 'custom' && customMerge) {
return {
value: customMerge(local.value, remote.value),
vectorClock,
timestamp: Math.max(local.timestamp, remote.timestamp),
region: local.region,
};
}
// LWW (also the fallback when no customMerge is supplied):
// latest timestamp wins; ties break on region name so every
// replica picks the same winner regardless of which side is "local"
const winner =
local.timestamp !== remote.timestamp
? (local.timestamp > remote.timestamp ? local : remote)
: (local.region > remote.region ? local : remote);
// Carry the merged clock so the resolved value dominates both inputs
return { ...winner, vectorClock };
}
}
}
private mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
const merged: VectorClock = { ...a };
for (const [key, val] of Object.entries(b)) {
merged[key] = Math.max(merged[key] ?? 0, val);
}
return merged;
}
}
Pattern 3: Cell-Based Architecture
Cell-based architecture partitions users into independent cells, each containing a complete copy of the application stack. Cells are isolated — a failure in one cell cannot affect another.
Cell assignment strategies:
// src/cell/cell-router.ts
interface Cell {
id: string;
region: string;
capacity: number;
currentLoad: number;
healthy: boolean;
}
interface CellAssignment {
userId: string;
cellId: string;
assignedAt: Date;
reason: string;
}
class CellRouter {
private cells: Map<string, Cell>;
private assignments: Map<string, CellAssignment>;
constructor(cells: Cell[]) {
this.cells = new Map(cells.map(c => [c.id, c]));
this.assignments = new Map();
}
/**
* Assign a user to a cell.
* Strategy: Geographic affinity + load balancing
*/
assignCell(userId: string, userRegion: string): CellAssignment {
// Check existing assignment
const existing = this.assignments.get(userId);
if (existing) {
const cell = this.cells.get(existing.cellId);
if (cell?.healthy) return existing;
// Cell is unhealthy — reassign
}
// Find best cell
const healthyCells = [...this.cells.values()].filter(c => c.healthy);
if (healthyCells.length === 0) {
throw new Error('No healthy cells available');
}
// Prefer cells in the same region
const sameRegion = healthyCells.filter(c => c.region === userRegion);
const candidates = sameRegion.length > 0 ? sameRegion : healthyCells;
// Among candidates, pick the one with most available capacity
const bestCell = candidates.reduce((best, cell) => {
const availableCapacity = cell.capacity - cell.currentLoad;
const bestAvailable = best.capacity - best.currentLoad;
return availableCapacity > bestAvailable ? cell : best;
});
const assignment: CellAssignment = {
userId,
cellId: bestCell.id,
assignedAt: new Date(),
reason: sameRegion.length > 0
? `Geographic affinity (${userRegion})`
: `Best available capacity (${bestCell.region})`,
};
this.assignments.set(userId, assignment);
bestCell.currentLoad++;
return assignment;
}
/**
* Evacuate a cell — move all users to other cells
*/
async evacuateCell(cellId: string): Promise<{
evacuated: number;
failed: number;
}> {
const cell = this.cells.get(cellId);
if (!cell) throw new Error(`Unknown cell: ${cellId}`);
cell.healthy = false;
const usersInCell = [...this.assignments.entries()]
.filter(([, a]) => a.cellId === cellId);
let evacuated = 0;
let failed = 0;
for (const [userId] of usersInCell) {
try {
// assignCell sees the unhealthy cell and reassigns the user; if no
// healthy cell exists it throws and the old assignment is kept
this.assignCell(userId, cell.region);
cell.currentLoad = Math.max(0, cell.currentLoad - 1);
evacuated++;
} catch {
failed++;
}
}
return { evacuated, failed };
}
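/**
* Hash-based deterministic cell assignment
* (alternative to database-backed assignments).
* Note: this is modulo hashing over the healthy cells, not true consistent
* hashing; when a cell is added or removed, most users map to a new cell.
*/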
assignCellConsistentHash(userId: string): string {
const hash = this.hashString(userId);
const healthyCells = [...this.cells.values()]
.filter(c => c.healthy)
.sort((a, b) => a.id.localeCompare(b.id));
if (healthyCells.length === 0) {
throw new Error('No healthy cells');
}
const index = hash % healthyCells.length;
return healthyCells[index].id;
}
private hashString(str: string): number {
let hash = 5381;
for (let i = 0; i < str.length; i++) {
hash = ((hash << 5) + hash) + str.charCodeAt(i);
hash = hash & 0x7fffffff; // Keep positive
}
return hash;
}
}
Cell boundary challenges:
The hardest part of cell-based architecture is handling cross-cell operations. For example, if User A (Cell 1) wants to send a message to User B (Cell 2):
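A minimal sketch of one way to handle this case, assuming each cell can look up which cell owns a user and forwards non-local messages through an outbox. `CellMessenger`, the in-memory directory, and the inline relay step are all hypothetical; a real system would use a durable queue and a replicated directory.

```typescript
// Hypothetical in-memory model of cross-cell message forwarding.
interface Message {
  from: string;
  to: string;
  body: string;
}

class CellMessenger {
  private readonly cellId: string;
  private readonly cellOf: (userId: string) => string;
  private readonly inboxes = new Map<string, Message[]>();
  // Pending cross-cell messages, keyed by destination cell.
  readonly outbox = new Map<string, Message[]>();

  constructor(cellId: string, cellOf: (userId: string) => string) {
    this.cellId = cellId;
    this.cellOf = cellOf;
  }

  // Deliver locally when possible; otherwise queue for the target cell.
  send(msg: Message): 'local' | 'forwarded' {
    const targetCell = this.cellOf(msg.to);
    if (targetCell === this.cellId) {
      this.deliver(msg);
      return 'local';
    }
    const queue = this.outbox.get(targetCell) ?? [];
    queue.push(msg);
    this.outbox.set(targetCell, queue);
    return 'forwarded';
  }

  // Used for local sends and when another cell's outbox is drained here.
  deliver(msg: Message): void {
    const inbox = this.inboxes.get(msg.to) ?? [];
    inbox.push(msg);
    this.inboxes.set(msg.to, inbox);
  }

  inboxFor(userId: string): Message[] {
    return this.inboxes.get(userId) ?? [];
  }
}

const directory: Record<string, string> = { userA: 'cell-1', userB: 'cell-2' };
const lookup = (userId: string): string => directory[userId] ?? 'cell-1';
const cell1 = new CellMessenger('cell-1', lookup);
const cell2 = new CellMessenger('cell-2', lookup);

const route = cell1.send({ from: 'userA', to: 'userB', body: 'hello' });
// A background relay would normally drain the outbox; done inline here.
for (const msg of cell1.outbox.get('cell-2') ?? []) cell2.deliver(msg);
console.log(route, cell2.inboxFor('userB').length);
```

The outbox keeps the sending cell independent of the receiving cell's availability: if Cell 2 is down, the message waits in Cell 1 rather than failing the send.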
Edge Cases & Failure Modes
Active-Passive Pitfalls
| Pitfall | Description | Prevention |
|---|---|---|
| Untested failover | Failover never tested, breaks in production | Monthly failover drills |
| Stale standby | Secondary has months-old code/config | Deploy to both regions always |
| Replication lag surprise | 5 minutes of data lost on failover | Monitor and alert on replication lag |
| DNS TTL trap | Cached DNS prevents failover for hours | Use TTL 60s, not 3600s |
| Manual promotion complexity | DBA needed to promote DB replica at 3 AM | Automate full failover |
| Warm pool too cold | Secondary can't handle full traffic | Keep secondary at 30%+ capacity |
| Failback harder than failover | Getting back to primary is complex | Document and automate failback |
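Several of these pitfalls (replication lag, stale standby, cold warm pool) can be caught by a periodic readiness check on the secondary. The following is a minimal sketch under stated assumptions: the `StandbyStatus` fields are hypothetical inputs you would gather from your own monitoring, and apart from the 30% capacity floor taken from the table, the thresholds are illustrative.

```typescript
interface StandbyStatus {
  replicationLagSeconds: number;
  deployedVersion: string;
  certificateExpiresAt: Date;
  capacityFraction: number; // standby capacity relative to primary, 0..1
}

interface ReadinessPolicy {
  maxReplicationLagSeconds: number; // effectively the RPO budget
  expectedVersion: string;          // version running in the primary
  minCertificateDaysLeft: number;
  minCapacityFraction: number;
}

// Returns a list of human-readable problems; an empty list means "ready".
function standbyReadinessIssues(
  status: StandbyStatus,
  policy: ReadinessPolicy,
  now: Date = new Date(),
): string[] {
  const issues: string[] = [];
  if (status.replicationLagSeconds > policy.maxReplicationLagSeconds) {
    issues.push(`replication lag ${status.replicationLagSeconds}s exceeds ${policy.maxReplicationLagSeconds}s`);
  }
  if (status.deployedVersion !== policy.expectedVersion) {
    issues.push(`version drift: standby ${status.deployedVersion}, primary ${policy.expectedVersion}`);
  }
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysLeft = (status.certificateExpiresAt.getTime() - now.getTime()) / msPerDay;
  if (daysLeft < policy.minCertificateDaysLeft) {
    issues.push(`certificate expires in ${Math.floor(daysLeft)} days`);
  }
  if (status.capacityFraction < policy.minCapacityFraction) {
    issues.push(`capacity ${status.capacityFraction} below ${policy.minCapacityFraction}`);
  }
  return issues;
}

const readinessIssues = standbyReadinessIssues(
  {
    replicationLagSeconds: 300,
    deployedVersion: '1.4.0',
    certificateExpiresAt: new Date('2024-03-01T00:00:00Z'),
    capacityFraction: 0.3,
  },
  {
    maxReplicationLagSeconds: 60,
    expectedVersion: '1.5.0',
    minCertificateDaysLeft: 14,
    minCapacityFraction: 0.3,
  },
  new Date('2024-01-01T00:00:00Z'),
);
console.log(readinessIssues);
```

Running a check like this on a schedule, and alerting on a non-empty result, turns silent drift into a visible signal well before a failover is needed.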
Active-Active Pitfalls
| Pitfall | Description | Prevention |
|---|---|---|
| Write amplification | Every write replicated N times | Write to nearest region, async replicate |
| Conflict storms | High-contention data causes cascading conflicts | Region ownership for hot data |
| Inconsistent reads | Read-your-writes not guaranteed cross-region | Session affinity or synchronous reads |
| Clock skew | LWW depends on synchronized clocks | NTP + bounded-staleness protocols |
| Global service dependency | Auth/billing is single-region | Make global services multi-region too |
| Replication topology complexity | N regions = N*(N-1)/2 replication links | Hub-and-spoke or mesh with limits |
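The replication topology row is easy to check numerically. This small sketch compares the link count of a full mesh, N*(N-1)/2 as stated in the table, with a hub-and-spoke layout, which needs N-1 links. The function names are illustrative.

```typescript
function assertRegionCount(regions: number): void {
  if (!Number.isInteger(regions) || regions < 1) {
    throw new RangeError('regions must be a positive integer');
  }
}

// Every region replicates directly with every other region.
function fullMeshLinks(regions: number): number {
  assertRegionCount(regions);
  return (regions * (regions - 1)) / 2;
}

// One hub region; every other region replicates only with the hub.
function hubAndSpokeLinks(regions: number): number {
  assertRegionCount(regions);
  return regions - 1;
}

for (const n of [2, 3, 5, 8]) {
  console.log(`${n} regions: mesh=${fullMeshLinks(n)}, hub-and-spoke=${hubAndSpokeLinks(n)}`);
}
```

The trade-off is that hub-and-spoke makes the hub a shared dependency and adds a hop between spokes, while the mesh grows quadratically in links to operate and monitor.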
Cell-Based Pitfalls
| Pitfall | Description | Prevention |
|---|---|---|
| Hot cells | Viral user overloads one cell | Cell-level auto-scaling, re-sharding |
| Cross-cell latency | Operations crossing cells are slow | Minimize cross-cell operations |
| Global state | Some data must be global (billing, auth) | Separate global services from cell-local |
| Cell migration | Moving users between cells is complex | Zero-downtime migration tooling |
| Uneven growth | Some cells grow faster than others | Periodic rebalancing |
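Cell migration gets harder if the assignment function itself reshuffles users whenever the set of healthy cells changes, as `hash % cellCount` does. One alternative, sketched here as an assumption rather than as this article's method, is rendezvous (highest-random-weight) hashing: removing a cell moves only the users that were on it. The hash choice (FNV-1a) and all names are illustrative.

```typescript
// FNV-1a 32-bit string hash; any well-mixed hash would serve.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Rendezvous hashing: score every (cell, user) pair, pick the top score.
function rendezvousCell(userId: string, healthyCellIds: string[]): string {
  if (healthyCellIds.length === 0) throw new Error('No healthy cells');
  let best = '';
  let bestScore = -1;
  for (const cellId of healthyCellIds) {
    const score = fnv1a(`${cellId}:${userId}`);
    // Ties break on cell id so the choice never depends on array order.
    if (score > bestScore || (score === bestScore && cellId < best)) {
      best = cellId;
      bestScore = score;
    }
  }
  return best;
}

// Removing a cell should move only the users that lived on it.
const allCells = ['cell-a', 'cell-b', 'cell-c', 'cell-d', 'cell-e'];
const remainingCells = allCells.filter(id => id !== 'cell-c');
let movedUnexpectedly = 0;
for (let i = 0; i < 1000; i++) {
  const before = rendezvousCell(`user-${i}`, allCells);
  const after = rendezvousCell(`user-${i}`, remainingCells);
  if (before !== 'cell-c' && before !== after) movedUnexpectedly++;
}
console.log({ movedUnexpectedly });
```

Each lookup costs one hash per cell, which is fine for tens of cells. It does not address hot cells or uneven growth; those still need explicit rebalancing.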
Performance Characteristics
Failover Speed Comparison
| Component | Active-Passive | Active-Active | Cell-Based |
|---|---|---|---|
| Detection time | 30s (3x 10s checks) | 10s (health check) | 10s (cell health) |
| DNS propagation | 60-300s | 0s (already routing) | 0s (re-route user) |
| DB promotion | 60-300s | 0s (already writable) | 0s (cell-local DB) |
| Scale-up | 120-600s | 0s (already scaled) | 0s (already scaled) |
| Cache warming | 300-1800s | 0s (already warm) | 0s (cell-local cache) |
| Total RTO | ~5-30 min | ~10-30s | ~10-30s |
Resource Utilization
Active-active requires each region to absorb the other's traffic during failover. With two regions, each must therefore run below 50% utilization, which means roughly half of your compute sits idle under normal operation.
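The 50% figure is the two-region case of a more general headroom rule. As a rough sketch, assuming traffic is spread evenly and redistributes evenly across the surviving regions, the maximum safe steady-state utilization is (N - lost) / N. The function name is illustrative.

```typescript
// Highest per-region utilization that still lets the survivors absorb the
// traffic of `regionsLost` failed regions (even traffic split assumed).
function maxSafeUtilization(regionCount: number, regionsLost = 1): number {
  if (!Number.isInteger(regionCount) || regionCount < 2) {
    throw new RangeError('regionCount must be an integer >= 2');
  }
  if (!Number.isInteger(regionsLost) || regionsLost < 1 || regionsLost >= regionCount) {
    throw new RangeError('regionsLost must be an integer between 1 and regionCount - 1');
  }
  return (regionCount - regionsLost) / regionCount;
}

for (const n of [2, 3, 4]) {
  const pct = (maxSafeUtilization(n) * 100).toFixed(1);
  console.log(`${n} regions: run each at or below ${pct}% to survive one region loss`);
}
```

Under these assumptions, adding a third region reduces the idle headroom from half of the fleet to a third, at the price of one more region to operate.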
Mathematical Foundations
Availability Calculation per Pattern
Active-Passive (with failover time):
With monthly primary failure rate of 0.1% and failover time of 10 minutes:
Active-Active (instant failover):
With independent region failure rates of 0.01% and correlated failure probability of 0.0001%:
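One way to work this through, assuming a simplified two-region model in which the service is unavailable only when both regions fail independently at the same time or a correlated failure takes out both (the symbols $p_1$, $p_2$, and $p_{\text{corr}}$ are introduced here for illustration):

$$
A_{\text{active-active}} = 1 - \left(p_1 p_2 + p_{\text{corr}}\right)
= 1 - \left(10^{-4} \cdot 10^{-4} + 10^{-6}\right)
= 1 - 1.01 \times 10^{-6} \approx 99.9999\%
$$

In this model the correlated term is a hundred times larger than the independent term, so shared dependencies such as a global control plane, DNS, or a bad deploy rolled out everywhere set the ceiling, not the regions themselves.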
Cell-Based (per cell):
With 10 cells and each cell failing 0.1% of the time, the probability that a specific user is affected:
This is the same as single-region, but the blast radius is 1/10. Combined with fast evacuation (seconds), effective availability approaches active-active levels.
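Written out, assuming cell failures are independent and each cell is unavailable a fraction $p_{\text{cell}} = 0.1\%$ of the time (the symbol is introduced here for illustration):

$$
P(\text{specific user affected}) = p_{\text{cell}} = 10^{-3}
$$

$$
P(\text{at least one of 10 cells down}) = 1 - (1 - p_{\text{cell}})^{10} \approx 0.00996
$$

$$
\text{Blast radius per cell failure} = \tfrac{1}{10}
$$

Under these assumptions, something is degraded somewhere in the fleet about ten times as often as in a single region, but each incident touches only a tenth of the users.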
Cost Efficiency Frontier
| Transition | Availability Gain (percentage points) | Cost Increase | Cost Increase / Availability Gain |
|---|---|---|---|
| Single -> Active-Passive | +0.0098 (99.99% -> 99.9998%) | +30% | ~3,061 |
| Active-Passive -> Active-Active | +0.00009 | +70% | ~777,778 |
| Active-Active -> Cell-Based | +0.000005 | +20% | 4,000,000 |
Each additional "nine" of availability becomes exponentially more expensive. This is why most companies stop at active-passive unless they have extremely high availability requirements (financial services, healthcare).
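Availability percentages this close to 100% are hard to compare by eye. A common convention, used here as an assumption about what "nines" means in this article, is nines = -log10(1 - availability). A minimal sketch:

```typescript
// Convert an availability percentage into a (possibly fractional) count of nines.
function nines(availabilityPercent: number): number {
  if (availabilityPercent < 0 || availabilityPercent >= 100) {
    throw new RangeError('availabilityPercent must be in [0, 100)');
  }
  return -Math.log10(1 - availabilityPercent / 100);
}

// Expected downtime per year implied by an availability percentage.
function downtimeMinutesPerYear(availabilityPercent: number): number {
  const minutesPerYear = 365 * 24 * 60;
  return (1 - availabilityPercent / 100) * minutesPerYear;
}

for (const a of [99.9, 99.99, 99.9998]) {
  console.log(
    `${a}% = ${nines(a).toFixed(2)} nines, ` +
    `about ${downtimeMinutesPerYear(a).toFixed(1)} min/year of downtime`,
  );
}
```

Expressed this way, each step up the availability scale buys a shrinking number of minutes per year, which is why the cost per unit of gain climbs so steeply.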
Real-World War Stories
War Story — The Active-Passive That Wasn't
A healthcare SaaS company maintained an active-passive setup for 3 years without ever testing failover. When their primary region experienced a 4-hour outage, they attempted failover and discovered:
- The database replica was 18 hours behind (replication had silently broken 2 weeks prior)
- The secondary region was running a 6-month-old version of the application
- The secondary region's SSL certificates had expired
- The configuration pointed to the primary region's cache and queue services
- The failover runbook referenced infrastructure that no longer existed
Result: Instead of a 10-minute failover, they experienced a 12-hour outage.
Fix: Implemented automated monthly failover tests (chaos engineering). The failover pipeline runs every month:
- Scale secondary to production capacity
- Switch 10% of traffic to secondary for 1 hour
- Validate all functionality works
- Switch back
They also added monitoring for replication lag, certificate expiration, and version drift between regions.
Lesson: An untested failover is not a failover. It's a wish.
War Story — The Cell Architecture Migration
Slack's infrastructure evolved from a single-region monolith to a cell-based architecture (which they call "bedrock"). The migration took 2 years and involved:
- Identifying natural data boundaries (workspaces are independent)
- Building a cell router that maps workspace IDs to cells
- Creating tooling for zero-downtime cell migration
- Building cross-cell messaging for shared channels
- Implementing per-cell deployment and rollback
The key insight was that Slack workspaces are natural cells — users within a workspace interact frequently with each other but rarely across workspaces. This made the data partitioning natural rather than forced.
Result: When a cell experiences issues, only the workspaces in that cell are affected. Global incidents became cell-local incidents, reducing blast radius by 10-50x.
Lesson: Cell-based architecture works best when there are natural data boundaries. Don't force it if your data model requires extensive cross-cell communication.
Decision Framework
Choosing Your Pattern
Pattern Selection Matrix
| Requirement | Active-Passive | Active-Active | Cell-Based |
|---|---|---|---|
| DR compliance | Great | Great | Great |
| RTO in seconds | No | Yes | Yes |
| Zero RPO | No | With sync replication | Yes (per cell) |
| Global low latency | No | Yes | Yes |
| Cost efficiency | Best | Moderate | Moderate |
| Operational simplicity | Best | Complex | Very complex |
| Independent deployments | No | Possible | Yes |
| Blast radius reduction | Limited | Moderate | Excellent |
| Regulatory compliance | Good | Complex | Good |
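The matrix can be read as a rough decision procedure. The sketch below is one hedged encoding of it, not a definitive rule: the requirement flags and their priority order are assumptions, chosen so that the simplest and cheapest pattern that meets the stated needs wins.

```typescript
type Pattern = 'active-passive' | 'active-active' | 'cell-based';

interface Requirements {
  needsSecondsRto: boolean;             // failover measured in seconds
  needsGlobalLowLatency: boolean;       // users served from a nearby region
  needsIndependentDeployments: boolean; // per-partition deploy and rollback
  needsSmallBlastRadius: boolean;       // an incident must hit only a slice of users
}

function suggestPattern(r: Requirements): Pattern {
  // Only cell-based rates "Yes"/"Excellent" on these two rows of the matrix.
  if (r.needsIndependentDeployments || r.needsSmallBlastRadius) {
    return 'cell-based';
  }
  // Active-active is the simpler of the two patterns that meet these rows.
  if (r.needsSecondsRto || r.needsGlobalLowLatency) {
    return 'active-active';
  }
  // Otherwise take the cheapest and operationally simplest option.
  return 'active-passive';
}

const drOnly = suggestPattern({
  needsSecondsRto: false,
  needsGlobalLowLatency: false,
  needsIndependentDeployments: false,
  needsSmallBlastRadius: false,
});
console.log(drOnly);
```

Treat the result as a starting point for discussion; cost ceilings, team experience, and regulatory constraints from the matrix are not modeled here.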
Advanced Topics
Hybrid Patterns
Real-world architectures often combine patterns:
- Active-active compute + active-passive database: Serve reads everywhere, funnel writes to one region
- Cell-based per service: Order service is cell-based, catalog service is active-active (read-heavy)
- Active-passive with read replicas: Primary handles writes, all regions handle reads
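The first and third hybrids share one routing rule: reads go to the caller's nearest serving region, writes go to the single write region. A minimal sketch, assuming HTTP semantics where GET and HEAD are reads; the function and parameter names are illustrative.

```typescript
type HttpMethod = 'GET' | 'HEAD' | 'POST' | 'PUT' | 'PATCH' | 'DELETE';

// Pick the region that should handle a request in a
// read-everywhere / write-to-one-region hybrid.
function targetRegion(
  method: HttpMethod,
  clientRegion: string,
  writeRegion: string,
  servingRegions: string[],
): string {
  const isRead = method === 'GET' || method === 'HEAD';
  if (isRead && servingRegions.includes(clientRegion)) {
    return clientRegion; // serve the read locally
  }
  // Writes, and reads from regions with no local stack, go to the write region.
  return writeRegion;
}

const regions = ['us-east-1', 'eu-west-1'];
console.log(targetRegion('GET', 'eu-west-1', 'us-east-1', regions));
console.log(targetRegion('POST', 'eu-west-1', 'us-east-1', regions));
```

Local reads may lag behind the write region by the replication delay, so flows that need read-your-writes behavior would have to read from the write region right after a write.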
Testing Multi-Region Architectures
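The failover test in this section polls with a `waitForCondition` helper whose implementation is not shown. A minimal sketch of such a helper, assuming the `{ timeout, interval }` options are in milliseconds as the test's call suggests:

```typescript
interface WaitOptions {
  timeout: number;  // give up after this many milliseconds
  interval: number; // pause between checks, in milliseconds
}

// Poll `check` until it resolves to true or the timeout elapses.
async function waitForCondition(
  check: () => Promise<boolean>,
  opts: WaitOptions,
): Promise<void> {
  const deadline = Date.now() + opts.timeout;
  for (;;) {
    // A thrown error counts as "not yet": endpoints may be unreachable mid-failover.
    const ok = await check().catch(() => false);
    if (ok) return;
    if (Date.now() + opts.interval > deadline) {
      throw new Error(`Condition not met within ${opts.timeout}ms`);
    }
    await new Promise<void>(resolve => setTimeout(resolve, opts.interval));
  }
}
```

Swallowing errors from `check` is a deliberate choice for failover tests, where connection failures are expected while traffic moves; a stricter helper could rethrow after the deadline with the last error attached.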
// tests/multi-region/failover.test.ts
describe('Multi-region failover', () => {
it('should failover within RTO when primary fails', async () => {
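// Assumed baseline: the test record was at version 1 or later before the
// failure; in a real suite, capture this from the primary during setup.
const expectedMinVersion = 1;
const startTime = Date.now();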
// Simulate primary region failure
await simulateRegionFailure('us-east-1');
// Wait for detection + failover
await waitForCondition(
async () => {
const health = await checkEndpoint('https://api.myapp.com/health');
return health.status === 200 &&
health.headers['x-served-from'] === 'eu-west-1';
},
{ timeout: 120_000, interval: 1_000 }
);
const rto = Date.now() - startTime;
console.log(`Failover completed in ${rto}ms`);
expect(rto).toBeLessThan(120_000); // 2 min RTO
// Verify data integrity
const testData = await fetchTestRecord('eu-west-1');
expect(testData).toBeDefined();
expect(testData.version).toBeGreaterThanOrEqual(expectedMinVersion);
// Restore primary
await restoreRegion('us-east-1');
});
});
For detailed implementation of data replication across these patterns, see Data Replication. For traffic management strategies, see Traffic Routing.