Multi-Region Architecture Design

A multi-region architecture deploys your application across multiple geographic regions to achieve low latency for global users, high availability during regional outages, and compliance with data residency regulations. Going multi-region is one of the most complex infrastructure decisions you can make. It transforms every simple problem — caching, sessions, database writes, deployments — into a distributed systems problem with no clean answers.

Why Go Multi-Region?

Motivation	Single-Region Risk	Multi-Region Benefit
Availability	Regional AWS outage takes you offline	Traffic shifts to healthy region
Latency	Users in Asia get 200ms+ to us-east-1	Users hit nearest region, < 50ms
Compliance	EU data stored in US violates GDPR	Data stays in eu-west-1
Disaster recovery	Single region failure = total outage	RPO/RTO measured in minutes, not hours
Traffic distribution	Single region absorbs all load	Load split across regions

Active-Passive vs Active-Active

The two fundamental multi-region strategies have drastically different complexity and cost profiles.

Active-Passive

Aspect	Details
How it works	All traffic goes to primary. DR region is warm standby.
Failover	DNS or load balancer switches traffic to DR region
RTO	5-30 minutes (DNS propagation + cache warming)
RPO	Seconds to minutes (depends on replication lag)
Cost	~1.3-1.5x single region (standby infra is underutilized)
Complexity	Moderate — replication + failover automation

Failover process:

typescript

// Automated failover with health checks
class FailoverController {
  private consecutiveFailures = 0;
  private readonly failoverThreshold = 3;

  async checkPrimaryHealth(): Promise<void> {
    try {
      const response = await fetch('https://primary.example.com/health', {
        signal: AbortSignal.timeout(5000),
      });

      if (response.ok) {
        this.consecutiveFailures = 0;
        return;
      }

      this.consecutiveFailures++;
    } catch {
      this.consecutiveFailures++;
    }

    if (this.consecutiveFailures >= this.failoverThreshold) {
      await this.initiateFailover();
    }
  }

  private async initiateFailover(): Promise<void> {
    console.log('Initiating failover to DR region...');

    // 1. Promote DB replica to primary
    await this.promoteReplica('us-west-2');

    // 2. Update DNS to point to DR region
    await this.updateDNS('api.example.com', 'dr-lb.us-west-2.example.com');

    // 3. Warm caches in DR region
    await this.warmCaches('us-west-2');

    // 4. Alert on-call team
    await this.notify('FAILOVER EXECUTED: traffic now in us-west-2');
  }
}

Active-Active

Aspect	Details
How it works	All regions serve traffic simultaneously. Writes accepted everywhere.
Failover	Automatic — DNS removes unhealthy region
RTO	Near-zero (other regions already serving traffic)
RPO	Depends on conflict resolution strategy
Cost	~2-3x single region (all regions fully active)
Complexity	Very high — write conflicts, data consistency, split-brain

Comparison Matrix

Factor	Active-Passive	Active-Active
Latency	High for remote users	Low globally
Availability	99.95% (with fast failover)	99.99%+
Cost	1.3-1.5x	2-3x
Complexity	Moderate	Very high
Data consistency	Strong (single writer)	Eventual (multi-writer)
Write conflicts	None	Must be resolved
Failover time	5-30 minutes	Seconds
Operational burden	Medium	Very high

DNS Routing Strategies

DNS is the front door of multi-region architecture. The routing strategy determines which region serves each user.

Latency-Based Routing

Route users to the region with the lowest measured latency. AWS Route 53 continuously measures latency from its global network to each region.

hcl

# Terraform — Route 53 latency-based routing
resource "aws_route53_record" "api_us" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_eu" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}

Geolocation-Based Routing

Route based on the geographic location of the DNS resolver. Useful for compliance (GDPR) and content localization.

Strategy	Route by	Best For
Latency	Measured network latency	Performance optimization
Geolocation	Country/continent of user	Compliance, localization
Weighted	Configured percentages	Gradual migration, canary
Failover	Primary + secondary with health checks	Active-passive DR

Weighted Routing for Gradual Migration

hcl

# Gradually shift traffic from old to new region
resource "aws_route53_record" "api_old" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "old-region"

  weighted_routing_policy {
    weight = 90  # Start at 90%, reduce over time
  }

  alias {
    name    = aws_lb.old_region.dns_name
    zone_id = aws_lb.old_region.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_new" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "new-region"

  weighted_routing_policy {
    weight = 10  # Start at 10%, increase over time
  }

  alias {
    name    = aws_lb.new_region.dns_name
    zone_id = aws_lb.new_region.zone_id
    evaluate_target_health = true
  }
}

Data Replication Strategies

Strategy 1: Single-Writer with Read Replicas

Replication lag handling:

typescript

// Read-your-writes consistency with replication lag
class MultiRegionReadService {
  constructor(
    private primaryDb: Database,  // us-east-1
    private localReplica: Database, // local region replica
  ) {}

  async getUser(userId: string, afterVersion?: number): Promise<User> {
    // If the client specifies a minimum version (from a recent write),
    // check if the local replica has caught up
    if (afterVersion) {
      const localVersion = await this.localReplica.query(
        'SELECT version FROM users WHERE id = $1',
        [userId]
      );

      if (localVersion < afterVersion) {
        // Replica hasn't caught up — read from primary
        return this.primaryDb.query(
          'SELECT * FROM users WHERE id = $1',
          [userId]
        );
      }
    }

    // Read from local replica (low latency)
    return this.localReplica.query(
      'SELECT * FROM users WHERE id = $1',
      [userId]
    );
  }
}

Strategy 2: Multi-Writer with Conflict Resolution

When both regions accept writes, conflicts are inevitable. Two users updating the same record simultaneously in different regions creates a conflict that must be resolved.

Conflict Resolution Strategies

Last-Writer-Wins (LWW)

The write with the latest timestamp wins. Simple but can lose data.

typescript

// Last-Writer-Wins with Lamport timestamps
interface VersionedRecord {
  data: any;
  timestamp: number;     // Lamport timestamp
  regionId: string;      // Tiebreaker
}

function resolveConflict(local: VersionedRecord, remote: VersionedRecord): VersionedRecord {
  if (local.timestamp > remote.timestamp) return local;
  if (remote.timestamp > local.timestamp) return remote;
  // Tiebreaker: lexicographic comparison of region IDs
  return local.regionId > remote.regionId ? local : remote;
}

When to use LWW: User profile updates, preferences, non-critical data where "last update wins" is acceptable.

When to avoid LWW: Counters, inventory, financial data — losing a write means losing money.

CRDTs (Conflict-Free Replicated Data Types)

CRDTs are data structures designed to merge concurrent updates without conflicts. Every update can be applied in any order and the result converges to the same state. See our CRDT Fundamentals for a deep dive.

typescript

// G-Counter (Grow-only Counter) CRDT
class GCounter {
  // Each region maintains its own counter
  private counts: Map<string, number> = new Map();

  constructor(private regionId: string) {}

  increment(amount: number = 1): void {
    const current = this.counts.get(this.regionId) || 0;
    this.counts.set(this.regionId, current + amount);
  }

  value(): number {
    let total = 0;
    for (const count of this.counts.values()) {
      total += count;
    }
    return total;
  }

  // Merge with another replica — always converges
  merge(other: GCounter): void {
    for (const [region, count] of other.counts) {
      const current = this.counts.get(region) || 0;
      this.counts.set(region, Math.max(current, count));
    }
  }
}

// Usage: page view counter across regions
const usCounter = new GCounter('us-east-1');
const euCounter = new GCounter('eu-west-1');

usCounter.increment(); // US visitor
usCounter.increment(); // US visitor
euCounter.increment(); // EU visitor

// After replication, both converge to 3
usCounter.merge(euCounter);
euCounter.merge(usCounter);
// usCounter.value() === 3
// euCounter.value() === 3

typescript

// LWW-Element-Set CRDT for shopping carts
class LWWSet<T> {
  private addSet: Map<string, { value: T; timestamp: number }> = new Map();
  private removeSet: Map<string, { value: T; timestamp: number }> = new Map();

  add(key: string, value: T, timestamp: number): void {
    const existing = this.addSet.get(key);
    if (!existing || timestamp > existing.timestamp) {
      this.addSet.set(key, { value, timestamp });
    }
  }

  remove(key: string, timestamp: number): void {
    const existing = this.removeSet.get(key);
    if (!existing || timestamp > existing.timestamp) {
      this.removeSet.set(key, { value: this.addSet.get(key)!.value, timestamp });
    }
  }

  lookup(key: string): T | undefined {
    const addEntry = this.addSet.get(key);
    const removeEntry = this.removeSet.get(key);

    if (!addEntry) return undefined;
    if (!removeEntry) return addEntry.value;
    // Added after removed? Item is in the set
    return addEntry.timestamp > removeEntry.timestamp ? addEntry.value : undefined;
  }

  merge(other: LWWSet<T>): void {
    for (const [key, entry] of other.addSet) {
      this.add(key, entry.value, entry.timestamp);
    }
    for (const [key, entry] of other.removeSet) {
      this.remove(key, entry.timestamp);
    }
  }
}

Custom Merge Functions

For domain-specific conflicts, define a custom merge strategy.

typescript

// Custom merge for an inventory system
interface InventoryRecord {
  productId: string;
  stock: number;
  reservations: Reservation[];
  lastModified: number;
}

function mergeInventory(
  local: InventoryRecord,
  remote: InventoryRecord,
): InventoryRecord {
  return {
    productId: local.productId,
    // Stock: take the lower value (conservative approach)
    stock: Math.min(local.stock, remote.stock),
    // Reservations: union of both sets (no lost reservations)
    reservations: mergeReservations(local.reservations, remote.reservations),
    lastModified: Math.max(local.lastModified, remote.lastModified),
  };
}

function mergeReservations(a: Reservation[], b: Reservation[]): Reservation[] {
  const merged = new Map<string, Reservation>();
  for (const r of [...a, ...b]) {
    const existing = merged.get(r.id);
    if (!existing || r.updatedAt > existing.updatedAt) {
      merged.set(r.id, r);
    }
  }
  return Array.from(merged.values());
}

Conflict Resolution Strategy Selection

Strategy	Data Safety	Complexity	Best For
LWW	Low (lossy)	Low	Preferences, profiles, settings
CRDTs	High (lossless)	Medium	Counters, sets, collaborative editing
Custom merge	High	High	Domain-specific business rules
Region-pinning	High	Low	Avoid conflicts entirely — route user to owner region

Consistency Tradeoffs

CAP Theorem in Multi-Region

In a multi-region deployment, you are always choosing between consistency and availability during a network partition between regions.

Database Options for Multi-Region

Database	Multi-Region Model	Consistency	Write Latency
Aurora Global	Single writer + read replicas	Strong (single writer)	Low in writer region
DynamoDB Global Tables	Multi-writer	Eventual (LWW)	Low in all regions
CockroachDB	Multi-writer with consensus	Strong (serializable)	Higher (cross-region consensus)
Cassandra	Multi-writer	Tunable (per-query)	Low in all regions
Spanner	Multi-writer with TrueTime	Strong (external consistency)	Higher (GPS-synchronized)
PostgreSQL (Citus)	Sharded, distributed	Strong per shard	Low for shard-local

Testing Multi-Region

Testing multi-region systems requires simulating conditions that are hard to reproduce in development.

What to Test

typescript

// Multi-region test scenarios
describe('Multi-Region System', () => {
  it('should failover within 30 seconds when primary region is unavailable', async () => {
    // Simulate region failure
    await blockRegionTraffic('us-east-1');

    const startTime = Date.now();

    // Wait for health check to detect failure
    await waitForCondition(async () => {
      const response = await fetch('https://api.example.com/health');
      const region = response.headers.get('X-Served-By-Region');
      return region === 'eu-west-1'; // Failover detected
    }, { timeout: 30_000 });

    const failoverTime = Date.now() - startTime;
    expect(failoverTime).toBeLessThan(30_000);

    // Verify data is consistent
    const data = await fetch('https://api.example.com/orders/recent');
    expect(data.status).toBe(200);
  });

  it('should handle replication lag gracefully', async () => {
    // Write to primary
    const order = await createOrder({ region: 'us-east-1' });

    // Read from replica immediately — may not be there yet
    const result = await getOrder(order.id, {
      region: 'eu-west-1',
      afterVersion: order.version, // Read-your-writes header
    });

    expect(result.id).toBe(order.id);
  });

  it('should resolve write conflicts correctly', async () => {
    // Concurrent writes to different regions
    const [resultUS, resultEU] = await Promise.all([
      updateProfile('user-1', { name: 'Alice' }, { region: 'us-east-1' }),
      updateProfile('user-1', { name: 'Bob' }, { region: 'eu-west-1' }),
    ]);

    // Wait for replication convergence
    await sleep(5000);

    // Both regions should agree on the same value
    const usProfile = await getProfile('user-1', { region: 'us-east-1' });
    const euProfile = await getProfile('user-1', { region: 'eu-west-1' });
    expect(usProfile.name).toBe(euProfile.name);
  });
});

Chaos Engineering for Multi-Region

Test	Simulates	Tool
Block inter-region traffic	Network partition between regions	AWS VPC NACLs, Chaos Monkey
Kill primary database	Primary region DB failure	AWS FIS
Inject replication lag	Degraded network between regions	tc netem
DNS resolution failure	DNS outage	Modify Route 53 health checks
Saturate cross-region bandwidth	Bandwidth contention	iperf3 + tc

Cost Considerations

Component	2 Regions (Active-Passive)	3 Regions (Active-Active)
Compute	1.3x (standby at reduced capacity)	3x (full capacity in each)
Database	1.5x (replica cost)	3x (full instances)
Data transfer	$0.02/GB cross-region replication	$0.02/GB x replication factor
Load balancers	2x	3x
Monitoring	1.5x	3x
Estimated multiplier	1.3-1.5x single region	2.5-3x single region

See our Cost of Scale page for detailed cost breakdowns.

Key Takeaways

Start with active-passive — it gives you DR without the complexity of multi-writer conflicts
Active-active is for companies that need 99.99%+ availability and can afford the engineering investment
DNS routing strategy matters — latency-based routing is the best default for performance
Choose your conflict resolution strategy before you write code — retrofitting is painful
CRDTs are underused — they solve many multi-writer problems without custom conflict logic
Test the failover regularly — untested failover plans fail when you need them most
Data transfer costs are significant — cross-region replication at TB scale adds up fast

CRDT Fundamentals — deep dive into conflict-free data types
Consistency Models — CAP, PACELC, linearizability
Replication — database replication strategies
Global Load Balancing — routing users to regions
Multi-Region Infrastructure — Terraform multi-region setup
DNS Deep Dive — how DNS routing works
Traffic Routing — infra-level routing configuration
Failover Strategies — automated failover implementation
Cost of Scale — multi-region cost analysis

Real-World Examples

Netflix

Netflix runs active-active across 3 AWS regions (us-east-1, us-west-2, eu-west-1). Every region can handle 100% of global traffic if the other two fail. They use EVCache with consistent hashing for per-region caching, Cassandra with cross-region replication for persistence, and Zuul for regional routing. Their annual Chaos Engineering exercises (Chaos Kong) simulate entire region failures to validate failover.

Uber

Uber uses a cell-based multi-region architecture. Each city is assigned to a primary data center, and driver/rider data is partitioned geographically. A trip in Paris is served by their EU region with sub-50ms latency. If a region fails, affected cities fail over to the backup region. This geographic partitioning avoids the write conflict problem of active-active because each city's data has a single writer.

DynamoDB Global Tables

Amazon's DynamoDB Global Tables provides multi-writer replication across any AWS regions using last-writer-wins (LWW) conflict resolution. Each region accepts reads and writes locally with single-digit millisecond latency. Writes are asynchronously replicated to other regions within 1-2 seconds. Companies like Lyft and Samsung use Global Tables for session stores and user profiles where LWW is acceptable.

Interview Tip

What to say

"I'd start with active-passive for multi-region because it gives disaster recovery without multi-writer conflict complexity — one region handles all writes, others serve reads locally. The RTO is 5-30 minutes, which is acceptable for most applications. I'd move to active-active only when I need zero-downtime failover or global write latency under 50ms, and I'd choose my conflict resolution strategy upfront: LWW for user preferences, CRDTs for counters and sets, or region-pinning to avoid conflicts entirely. The cost is 2-3x single region for active-active, so it needs to be justified. Netflix validates their multi-region setup by running Chaos Kong exercises that kill entire regions — untested failover plans fail when you need them."

Multi-Region Architecture Design ​

Why Go Multi-Region? ​

Active-Passive vs Active-Active ​

Active-Passive ​

Active-Active ​

Comparison Matrix ​

DNS Routing Strategies ​

Latency-Based Routing ​

Geolocation-Based Routing ​

Weighted Routing for Gradual Migration ​

Data Replication Strategies ​

Strategy 1: Single-Writer with Read Replicas ​

Strategy 2: Multi-Writer with Conflict Resolution ​

Conflict Resolution Strategies ​

Last-Writer-Wins (LWW) ​

CRDTs (Conflict-Free Replicated Data Types) ​

Custom Merge Functions ​

Conflict Resolution Strategy Selection ​

Consistency Tradeoffs ​

CAP Theorem in Multi-Region ​

Database Options for Multi-Region ​

Testing Multi-Region ​

What to Test ​

Chaos Engineering for Multi-Region ​

Cost Considerations ​

Key Takeaways ​

Related Pages ​

Real-World Examples ​

Interview Tip ​

Related Pages

Multi-Region Architecture Design

Why Go Multi-Region?

Active-Passive vs Active-Active

Active-Passive

Active-Active

Comparison Matrix

DNS Routing Strategies

Latency-Based Routing

Geolocation-Based Routing

Weighted Routing for Gradual Migration

Data Replication Strategies

Strategy 1: Single-Writer with Read Replicas

Strategy 2: Multi-Writer with Conflict Resolution

Conflict Resolution Strategies

Last-Writer-Wins (LWW)

CRDTs (Conflict-Free Replicated Data Types)

Custom Merge Functions

Conflict Resolution Strategy Selection

Consistency Tradeoffs

CAP Theorem in Multi-Region

Database Options for Multi-Region

Testing Multi-Region

What to Test

Chaos Engineering for Multi-Region

Cost Considerations

Key Takeaways

Related Pages

Real-World Examples

Interview Tip