Skip to content
Unverified — AI-generated content. Help verify this page

Multi-Region Architecture Design

A multi-region architecture deploys your application across multiple geographic regions to achieve low latency for global users, high availability during regional outages, and compliance with data residency regulations. Going multi-region is one of the most complex infrastructure decisions you can make. It transforms every simple problem — caching, sessions, database writes, deployments — into a distributed systems problem with no clean answers.

Why Go Multi-Region?

MotivationSingle-Region RiskMulti-Region Benefit
AvailabilityRegional AWS outage takes you offlineTraffic shifts to healthy region
LatencyUsers in Asia get 200ms+ to us-east-1Users hit nearest region, < 50ms
ComplianceEU data stored in US violates GDPRData stays in eu-west-1
Disaster recoverySingle region failure = total outageRPO/RTO measured in minutes, not hours
Traffic distributionSingle region absorbs all loadLoad split across regions

Active-Passive vs Active-Active

The two fundamental multi-region strategies have drastically different complexity and cost profiles.

Active-Passive

AspectDetails
How it worksAll traffic goes to primary. DR region is warm standby.
FailoverDNS or load balancer switches traffic to DR region
RTO5-30 minutes (DNS propagation + cache warming)
RPOSeconds to minutes (depends on replication lag)
Cost~1.3-1.5x single region (standby infra is underutilized)
ComplexityModerate — replication + failover automation

Failover process:

typescript
// Automated failover with health checks
class FailoverController {
  private consecutiveFailures = 0;
  private readonly failoverThreshold = 3;

  async checkPrimaryHealth(): Promise<void> {
    try {
      const response = await fetch('https://primary.example.com/health', {
        signal: AbortSignal.timeout(5000),
      });

      if (response.ok) {
        this.consecutiveFailures = 0;
        return;
      }

      this.consecutiveFailures++;
    } catch {
      this.consecutiveFailures++;
    }

    if (this.consecutiveFailures >= this.failoverThreshold) {
      await this.initiateFailover();
    }
  }

  private async initiateFailover(): Promise<void> {
    console.log('Initiating failover to DR region...');

    // 1. Promote DB replica to primary
    await this.promoteReplica('us-west-2');

    // 2. Update DNS to point to DR region
    await this.updateDNS('api.example.com', 'dr-lb.us-west-2.example.com');

    // 3. Warm caches in DR region
    await this.warmCaches('us-west-2');

    // 4. Alert on-call team
    await this.notify('FAILOVER EXECUTED: traffic now in us-west-2');
  }
}

Active-Active

AspectDetails
How it worksAll regions serve traffic simultaneously. Writes accepted everywhere.
FailoverAutomatic — DNS removes unhealthy region
RTONear-zero (other regions already serving traffic)
RPODepends on conflict resolution strategy
Cost~2-3x single region (all regions fully active)
ComplexityVery high — write conflicts, data consistency, split-brain

Comparison Matrix

FactorActive-PassiveActive-Active
LatencyHigh for remote usersLow globally
Availability99.95% (with fast failover)99.99%+
Cost1.3-1.5x2-3x
ComplexityModerateVery high
Data consistencyStrong (single writer)Eventual (multi-writer)
Write conflictsNoneMust be resolved
Failover time5-30 minutesSeconds
Operational burdenMediumVery high

DNS Routing Strategies

DNS is the front door of multi-region architecture. The routing strategy determines which region serves each user.

Latency-Based Routing

Route users to the region with the lowest measured latency. AWS Route 53 continuously measures latency from its global network to each region.

hcl
# Terraform — Route 53 latency-based routing
resource "aws_route53_record" "api_us" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_eu" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}

Geolocation-Based Routing

Route based on the geographic location of the DNS resolver. Useful for compliance (GDPR) and content localization.

StrategyRoute byBest For
LatencyMeasured network latencyPerformance optimization
GeolocationCountry/continent of userCompliance, localization
WeightedConfigured percentagesGradual migration, canary
FailoverPrimary + secondary with health checksActive-passive DR

Weighted Routing for Gradual Migration

hcl
# Gradually shift traffic from old to new region
resource "aws_route53_record" "api_old" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "old-region"

  weighted_routing_policy {
    weight = 90  # Start at 90%, reduce over time
  }

  alias {
    name    = aws_lb.old_region.dns_name
    zone_id = aws_lb.old_region.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_new" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "new-region"

  weighted_routing_policy {
    weight = 10  # Start at 10%, increase over time
  }

  alias {
    name    = aws_lb.new_region.dns_name
    zone_id = aws_lb.new_region.zone_id
    evaluate_target_health = true
  }
}

Data Replication Strategies

Strategy 1: Single-Writer with Read Replicas

Replication lag handling:

typescript
// Read-your-writes consistency with replication lag
class MultiRegionReadService {
  constructor(
    private primaryDb: Database,  // us-east-1
    private localReplica: Database, // local region replica
  ) {}

  async getUser(userId: string, afterVersion?: number): Promise<User> {
    // If the client specifies a minimum version (from a recent write),
    // check if the local replica has caught up
    if (afterVersion) {
      const localVersion = await this.localReplica.query(
        'SELECT version FROM users WHERE id = $1',
        [userId]
      );

      if (localVersion < afterVersion) {
        // Replica hasn't caught up — read from primary
        return this.primaryDb.query(
          'SELECT * FROM users WHERE id = $1',
          [userId]
        );
      }
    }

    // Read from local replica (low latency)
    return this.localReplica.query(
      'SELECT * FROM users WHERE id = $1',
      [userId]
    );
  }
}

Strategy 2: Multi-Writer with Conflict Resolution

When both regions accept writes, conflicts are inevitable. Two users updating the same record simultaneously in different regions creates a conflict that must be resolved.

Conflict Resolution Strategies

Last-Writer-Wins (LWW)

The write with the latest timestamp wins. Simple but can lose data.

typescript
// Last-Writer-Wins with Lamport timestamps
interface VersionedRecord {
  data: any;
  timestamp: number;     // Lamport timestamp
  regionId: string;      // Tiebreaker
}

function resolveConflict(local: VersionedRecord, remote: VersionedRecord): VersionedRecord {
  if (local.timestamp > remote.timestamp) return local;
  if (remote.timestamp > local.timestamp) return remote;
  // Tiebreaker: lexicographic comparison of region IDs
  return local.regionId > remote.regionId ? local : remote;
}

When to use LWW: User profile updates, preferences, non-critical data where "last update wins" is acceptable.

When to avoid LWW: Counters, inventory, financial data — losing a write means losing money.

CRDTs (Conflict-Free Replicated Data Types)

CRDTs are data structures designed to merge concurrent updates without conflicts. Every update can be applied in any order and the result converges to the same state. See our CRDT Fundamentals for a deep dive.

typescript
// G-Counter (Grow-only Counter) CRDT
class GCounter {
  // Each region maintains its own counter
  private counts: Map<string, number> = new Map();

  constructor(private regionId: string) {}

  increment(amount: number = 1): void {
    const current = this.counts.get(this.regionId) || 0;
    this.counts.set(this.regionId, current + amount);
  }

  value(): number {
    let total = 0;
    for (const count of this.counts.values()) {
      total += count;
    }
    return total;
  }

  // Merge with another replica — always converges
  merge(other: GCounter): void {
    for (const [region, count] of other.counts) {
      const current = this.counts.get(region) || 0;
      this.counts.set(region, Math.max(current, count));
    }
  }
}

// Usage: page view counter across regions
const usCounter = new GCounter('us-east-1');
const euCounter = new GCounter('eu-west-1');

usCounter.increment(); // US visitor
usCounter.increment(); // US visitor
euCounter.increment(); // EU visitor

// After replication, both converge to 3
usCounter.merge(euCounter);
euCounter.merge(usCounter);
// usCounter.value() === 3
// euCounter.value() === 3
typescript
// LWW-Element-Set CRDT for shopping carts
class LWWSet<T> {
  private addSet: Map<string, { value: T; timestamp: number }> = new Map();
  private removeSet: Map<string, { value: T; timestamp: number }> = new Map();

  add(key: string, value: T, timestamp: number): void {
    const existing = this.addSet.get(key);
    if (!existing || timestamp > existing.timestamp) {
      this.addSet.set(key, { value, timestamp });
    }
  }

  remove(key: string, timestamp: number): void {
    const existing = this.removeSet.get(key);
    if (!existing || timestamp > existing.timestamp) {
      this.removeSet.set(key, { value: this.addSet.get(key)!.value, timestamp });
    }
  }

  lookup(key: string): T | undefined {
    const addEntry = this.addSet.get(key);
    const removeEntry = this.removeSet.get(key);

    if (!addEntry) return undefined;
    if (!removeEntry) return addEntry.value;
    // Added after removed? Item is in the set
    return addEntry.timestamp > removeEntry.timestamp ? addEntry.value : undefined;
  }

  merge(other: LWWSet<T>): void {
    for (const [key, entry] of other.addSet) {
      this.add(key, entry.value, entry.timestamp);
    }
    for (const [key, entry] of other.removeSet) {
      this.remove(key, entry.timestamp);
    }
  }
}

Custom Merge Functions

For domain-specific conflicts, define a custom merge strategy.

typescript
// Custom merge for an inventory system
interface InventoryRecord {
  productId: string;
  stock: number;
  reservations: Reservation[];
  lastModified: number;
}

function mergeInventory(
  local: InventoryRecord,
  remote: InventoryRecord,
): InventoryRecord {
  return {
    productId: local.productId,
    // Stock: take the lower value (conservative approach)
    stock: Math.min(local.stock, remote.stock),
    // Reservations: union of both sets (no lost reservations)
    reservations: mergeReservations(local.reservations, remote.reservations),
    lastModified: Math.max(local.lastModified, remote.lastModified),
  };
}

function mergeReservations(a: Reservation[], b: Reservation[]): Reservation[] {
  const merged = new Map<string, Reservation>();
  for (const r of [...a, ...b]) {
    const existing = merged.get(r.id);
    if (!existing || r.updatedAt > existing.updatedAt) {
      merged.set(r.id, r);
    }
  }
  return Array.from(merged.values());
}

Conflict Resolution Strategy Selection

StrategyData SafetyComplexityBest For
LWWLow (lossy)LowPreferences, profiles, settings
CRDTsHigh (lossless)MediumCounters, sets, collaborative editing
Custom mergeHighHighDomain-specific business rules
Region-pinningHighLowAvoid conflicts entirely — route user to owner region

Consistency Tradeoffs

CAP Theorem in Multi-Region

In a multi-region deployment, you are always choosing between consistency and availability during a network partition between regions.

Database Options for Multi-Region

DatabaseMulti-Region ModelConsistencyWrite Latency
Aurora GlobalSingle writer + read replicasStrong (single writer)Low in writer region
DynamoDB Global TablesMulti-writerEventual (LWW)Low in all regions
CockroachDBMulti-writer with consensusStrong (serializable)Higher (cross-region consensus)
CassandraMulti-writerTunable (per-query)Low in all regions
SpannerMulti-writer with TrueTimeStrong (external consistency)Higher (GPS-synchronized)
PostgreSQL (Citus)Sharded, distributedStrong per shardLow for shard-local

Testing Multi-Region

Testing multi-region systems requires simulating conditions that are hard to reproduce in development.

What to Test

typescript
// Multi-region test scenarios
describe('Multi-Region System', () => {
  it('should failover within 30 seconds when primary region is unavailable', async () => {
    // Simulate region failure
    await blockRegionTraffic('us-east-1');

    const startTime = Date.now();

    // Wait for health check to detect failure
    await waitForCondition(async () => {
      const response = await fetch('https://api.example.com/health');
      const region = response.headers.get('X-Served-By-Region');
      return region === 'eu-west-1'; // Failover detected
    }, { timeout: 30_000 });

    const failoverTime = Date.now() - startTime;
    expect(failoverTime).toBeLessThan(30_000);

    // Verify data is consistent
    const data = await fetch('https://api.example.com/orders/recent');
    expect(data.status).toBe(200);
  });

  it('should handle replication lag gracefully', async () => {
    // Write to primary
    const order = await createOrder({ region: 'us-east-1' });

    // Read from replica immediately — may not be there yet
    const result = await getOrder(order.id, {
      region: 'eu-west-1',
      afterVersion: order.version, // Read-your-writes header
    });

    expect(result.id).toBe(order.id);
  });

  it('should resolve write conflicts correctly', async () => {
    // Concurrent writes to different regions
    const [resultUS, resultEU] = await Promise.all([
      updateProfile('user-1', { name: 'Alice' }, { region: 'us-east-1' }),
      updateProfile('user-1', { name: 'Bob' }, { region: 'eu-west-1' }),
    ]);

    // Wait for replication convergence
    await sleep(5000);

    // Both regions should agree on the same value
    const usProfile = await getProfile('user-1', { region: 'us-east-1' });
    const euProfile = await getProfile('user-1', { region: 'eu-west-1' });
    expect(usProfile.name).toBe(euProfile.name);
  });
});

Chaos Engineering for Multi-Region

TestSimulatesTool
Block inter-region trafficNetwork partition between regionsAWS VPC NACLs, Chaos Monkey
Kill primary databasePrimary region DB failureAWS FIS
Inject replication lagDegraded network between regionstc netem
DNS resolution failureDNS outageModify Route 53 health checks
Saturate cross-region bandwidthBandwidth contentioniperf3 + tc

Cost Considerations

Component2 Regions (Active-Passive)3 Regions (Active-Active)
Compute1.3x (standby at reduced capacity)3x (full capacity in each)
Database1.5x (replica cost)3x (full instances)
Data transfer$0.02/GB cross-region replication$0.02/GB x replication factor
Load balancers2x3x
Monitoring1.5x3x
Estimated multiplier1.3-1.5x single region2.5-3x single region

See our Cost of Scale page for detailed cost breakdowns.

Key Takeaways

  1. Start with active-passive — it gives you DR without the complexity of multi-writer conflicts
  2. Active-active is for companies that need 99.99%+ availability and can afford the engineering investment
  3. DNS routing strategy matters — latency-based routing is the best default for performance
  4. Choose your conflict resolution strategy before you write code — retrofitting is painful
  5. CRDTs are underused — they solve many multi-writer problems without custom conflict logic
  6. Test the failover regularly — untested failover plans fail when you need them most
  7. Data transfer costs are significant — cross-region replication at TB scale adds up fast

Real-World Examples

Netflix

Netflix runs active-active across 3 AWS regions (us-east-1, us-west-2, eu-west-1). Every region can handle 100% of global traffic if the other two fail. They use EVCache with consistent hashing for per-region caching, Cassandra with cross-region replication for persistence, and Zuul for regional routing. Their annual Chaos Engineering exercises (Chaos Kong) simulate entire region failures to validate failover.

Uber

Uber uses a cell-based multi-region architecture. Each city is assigned to a primary data center, and driver/rider data is partitioned geographically. A trip in Paris is served by their EU region with sub-50ms latency. If a region fails, affected cities fail over to the backup region. This geographic partitioning avoids the write conflict problem of active-active because each city's data has a single writer.

DynamoDB Global Tables

Amazon's DynamoDB Global Tables provides multi-writer replication across any AWS regions using last-writer-wins (LWW) conflict resolution. Each region accepts reads and writes locally with single-digit millisecond latency. Writes are asynchronously replicated to other regions within 1-2 seconds. Companies like Lyft and Samsung use Global Tables for session stores and user profiles where LWW is acceptable.

Interview Tip

What to say

"I'd start with active-passive for multi-region because it gives disaster recovery without multi-writer conflict complexity — one region handles all writes, others serve reads locally. The RTO is 5-30 minutes, which is acceptable for most applications. I'd move to active-active only when I need zero-downtime failover or global write latency under 50ms, and I'd choose my conflict resolution strategy upfront: LWW for user preferences, CRDTs for counters and sets, or region-pinning to avoid conflicts entirely. The cost is 2-3x single region for active-active, so it needs to be justified. Netflix validates their multi-region setup by running Chaos Kong exercises that kill entire regions — untested failover plans fail when you need them."

"What I cannot create, I do not understand." — Richard Feynman