Availability Patterns

Availability measures the percentage of time a system is operational and accessible. Five nines (99.999%) sounds impressive until you calculate it — that is 5.26 minutes of total downtime per year. Every extra nine costs exponentially more money and engineering effort. The patterns on this page are how you buy those nines.

Availability is not about preventing failures. Hardware fails. Networks partition. Software has bugs. Availability is about ensuring failures do not become outages. The difference between a system that is 99% available and one that is 99.99% is not better hardware — it is better patterns.

Availability Math

Before diving into patterns, understand the math that drives decisions.

| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
| --- | --- | --- | --- |
| 99% (two nines) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.83 min | 10.08 min |
| 99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min |
| 99.999% (five nines) | 5.26 min | 26.3 sec | 6.05 sec |
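
The table values follow directly from multiplying the unavailable fraction by the length of the period. A minimal sketch of that arithmetic, assuming an average year of 365.25 days (the function name `downtime_seconds` is illustrative, not from any library):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, including leap days


def downtime_seconds(availability: float,
                     period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime in seconds for a given availability over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_seconds


for availability in (0.99, 0.999, 0.9999, 0.99999):
    minutes = downtime_seconds(availability) / 60
    print(f"{availability:.3%} -> {minutes:,.2f} minutes of downtime per year")
```

Passing a different `period_seconds` (for example `7 * 24 * 3600`) gives the weekly budget instead of the yearly one.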

Serial dependencies reduce availability: If Service A (99.9%) calls Service B (99.9%), the combined availability is 99.9% x 99.9% = 99.8%. Add more services in a chain and availability drops fast.

Parallel redundancy increases availability: If you have two independent instances, each at 99%, the system availability is 1 - (1 - 0.99)^2 = 99.99%. Redundancy is how you buy nines.

python
def serial_availability(*components: float) -> float:
    """Combined availability of components in series (all must work)."""
    result = 1.0
    for a in components:
        result *= a
    return result


def parallel_availability(*components: float) -> float:
    """Combined availability of components in parallel (at least one must work)."""
    failure_prob = 1.0
    for a in components:
        failure_prob *= (1 - a)
    return 1 - failure_prob


# Example: Three 99.9% services in series
serial = serial_availability(0.999, 0.999, 0.999)
print(f"Serial: {serial:.4%}")  # 99.7003% — worse than any single component

# Example: Two 99.9% instances in parallel
parallel = parallel_availability(0.999, 0.999)
print(f"Parallel: {parallel:.6%}")  # 99.999900% — better than either alone
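
The parallel formula can also be run in reverse: given one instance's availability and a target, how many redundant instances are needed? The sketch below assumes failures are fully independent, which real deployments rarely achieve (shared power, shared deploys, shared bugs), so treat the answer as a lower bound. The name `replicas_needed` and the `max_replicas` guard are illustrative.

```python
def replicas_needed(instance_availability: float, target: float,
                    max_replicas: int = 100) -> int:
    """Smallest n with 1 - (1 - a)^n >= target, assuming independent failures."""
    if not (0.0 < instance_availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("both arguments must be fractions strictly between 0 and 1")
    failure_prob = 1.0
    for n in range(1, max_replicas + 1):
        failure_prob *= (1.0 - instance_availability)
        # Small tolerance so floating-point rounding does not demand an extra replica
        if 1.0 - failure_prob >= target - 1e-12:
            return n
    raise ValueError("target not reachable within max_replicas")


print(replicas_needed(0.99, 0.9999))   # two 99% instances reach four nines
print(replicas_needed(0.9, 0.999))     # three 90% instances reach three nines
```

A loop is used instead of a closed-form logarithm because rounding in `log` can push the result just past an integer boundary.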

Failover Patterns

Failover is the process of switching to a standby component when the primary fails.

Active-Passive (Hot Standby)

One server handles all traffic. A standby server is ready to take over if the primary fails. The standby receives replicated data but does not serve traffic.

How failover happens:

  1. Health monitor detects primary is unresponsive (missed heartbeats)
  2. Monitor promotes standby to primary (VIP/DNS switch)
  3. Standby starts serving traffic

Recovery time ranges from seconds to minutes, depending on the implementation.

Trade-offs:

  • Simple to understand and implement
  • Standby wastes resources (paying for idle capacity)
  • Data loss possible if replication is asynchronous (last few writes may be lost)
  • Failover is not instant — there is a detection window + promotion time
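
The detection window mentioned above can be made concrete with a small heartbeat monitor. This is a sketch under stated assumptions: the class name, the one-second interval, and the three-miss threshold are all illustrative, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Promotes the standby after several consecutive missed heartbeats."""
    heartbeat_interval: float = 1.0  # seconds between expected heartbeats (assumed)
    missed_threshold: int = 3        # misses tolerated before failover (assumed)
    last_heartbeat: float = 0.0
    active: str = "primary"

    def record_heartbeat(self, now: float) -> None:
        """Called whenever the primary reports in."""
        self.last_heartbeat = now

    def tick(self, now: float) -> str:
        """Called periodically; returns the node that should serve traffic."""
        missed = (now - self.last_heartbeat) / self.heartbeat_interval
        if self.active == "primary" and missed >= self.missed_threshold:
            # A real system would move the VIP or update DNS at this point.
            self.active = "standby"
        return self.active


monitor = FailoverMonitor()
monitor.record_heartbeat(now=10.0)
print(monitor.tick(now=11.5))  # primary: only 1.5 intervals without a heartbeat
print(monitor.tick(now=13.0))  # standby: 3 intervals without a heartbeat
```

With these settings the detection window is `heartbeat_interval * missed_threshold`, or three seconds. A lower threshold fails over faster but promotes the standby on brief network blips. The sketch deliberately has no automatic failback: once promoted, the standby stays active until an operator decides otherwise.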

Active-Active

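All servers handle traffic simultaneously. If one fails, the others absorb its load, so each server needs enough headroom to take a share of a failed peer's traffic. No server sits idle as a pure standby.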

Trade-offs:

  • No wasted capacity — all servers are productive
  • Higher throughput (N servers serving traffic, not N-1)
  • Conflict resolution needed (two servers modify the same data)
  • More complex to implement correctly
  • Split-brain risk if network partitions

| Aspect | Active-Passive | Active-Active |
| --- | --- | --- |
| Resource utilization | 50% (standby idle) | 100% |
| Failover time | Seconds to minutes | Immediate (LB routes around) |
| Complexity | Lower | Higher |
| Data conflicts | None (single writer) | Must handle conflicts |
| Cost efficiency | Lower | Higher |
| Best for | Databases, stateful services | Stateless app servers, CDNs |
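
"LB routes around" in the table means the load balancer simply stops handing requests to a backend marked unhealthy. A minimal round-robin sketch of that idea follows; the class and method names are illustrative and not taken from any particular load balancer.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Round-robin over backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend: str) -> None:
        self.healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self) -> str:
        # Look at each backend at most once per call
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["a", "b", "c"])
balancer.mark_down("b")
print([balancer.next_backend() for _ in range(4)])  # ['a', 'c', 'a', 'c']
```

Traffic keeps flowing with no promotion step, but the two survivors now carry the failed node's share, which is why active-active deployments need spare capacity.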

Multi-Region Active-Active

For global availability, run active instances in multiple regions.

Health Checks

Health checks are the foundation of automated failover. Without them, you cannot detect failures.

Health Check Types

python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Awaitable, Callable, Optional


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    status: HealthStatus
    component: str
    latency_ms: float
    message: Optional[str] = None
    # Timezone-aware UTC timestamp, set when the result is created
    checked_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class HealthChecker:
    """Comprehensive health check implementation."""

    def __init__(self, db, cache):
        # db must provide `await db.execute(sql)`;
        # cache must provide `await cache.ping()`
        self.db = db
        self.cache = cache

    async def liveness_check(self) -> HealthCheckResult:
        """Is the process running? (Kubernetes liveness probe)
        If this fails, the container should be restarted.
        Keep it simple: just confirm the process is not deadlocked."""
        return HealthCheckResult(
            status=HealthStatus.HEALTHY,
            component="process",
            latency_ms=0.1,
            message="Process is alive"
        )

    async def _probe(
        self,
        component: str,
        call: Callable[[], Awaitable[object]],
        degraded_above_ms: float,
    ) -> HealthCheckResult:
        """Time one dependency call and classify the result."""
        start = time.perf_counter()  # monotonic, unaffected by clock changes
        try:
            await call()
        except Exception as e:
            return HealthCheckResult(
                status=HealthStatus.UNHEALTHY,
                component=component,
                latency_ms=0,
                message=str(e)
            )
        latency = (time.perf_counter() - start) * 1000
        return HealthCheckResult(
            status=HealthStatus.HEALTHY if latency < degraded_above_ms
            else HealthStatus.DEGRADED,
            component=component,
            latency_ms=latency
        )

    async def readiness_check(self) -> HealthCheckResult:
        """Can the service handle requests? (Kubernetes readiness probe)
        Check dependencies: DB connection, cache, required services."""
        checks = [
            # Database connectivity: degraded above 100 ms
            await self._probe(
                "database", lambda: self.db.execute("SELECT 1"), 100
            ),
            # Cache connectivity: degraded above 10 ms
            await self._probe("cache", lambda: self.cache.ping(), 10),
        ]

        # Aggregate: unhealthy if any critical dependency is down
        if any(c.status == HealthStatus.UNHEALTHY for c in checks):
            overall = HealthStatus.UNHEALTHY
        elif any(c.status == HealthStatus.DEGRADED for c in checks):
            overall = HealthStatus.DEGRADED
        else:
            overall = HealthStatus.HEALTHY

        return HealthCheckResult(
            status=overall,
            component="service",
            latency_ms=max(c.latency_ms for c in checks)
        )

Health Check Patterns

| Pattern | Mechanism | Detection Speed | False Positives |
| --- | --- | --- | --- |
| TCP check | Connect to port | Fast (ms) | Low |
| HTTP check | GET /health | Fast | Low |
| Deep health | Check all dependencies | Slower (100ms+) | Higher |
| Peer health | Services check each other | Medium | Medium |
| Synthetic transactions | Execute real operations | Slow (seconds) | Very low |
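
The cheapest row in the table, the TCP check, only confirms that something is accepting connections on the port. A minimal sketch using the standard library's `asyncio.open_connection`; the function name `tcp_check` and the one-second default timeout are assumptions for illustration.

```python
import asyncio


async def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout
        )
    except (OSError, asyncio.TimeoutError):
        # Refused, unreachable, or too slow: treat all of them as "down"
        return False
    writer.close()
    await writer.wait_closed()
    return True
```

It can be called as `asyncio.run(tcp_check("127.0.0.1", 5432))`. A passing TCP check says nothing about whether the service can actually answer requests, which is the gap the HTTP and deep checks in the table close at the cost of speed.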

Graceful Degradation

When a component fails, the system should degrade gracefully — offering reduced functionality rather than complete failure.

python
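import asyncio
from dataclasses import dataclass
from typing import Any


class NotFoundException(Exception):
    """Raised when the core product record does not exist."""


@dataclass
class ProductPage:
    product: Any
    recommendations: Any
    reviews: Any
    personalization: Any


class GracefulDegradation:
    """Degrade functionality rather than fail entirely."""

    def __init__(self, product_store, recommendation_service,
                 review_service, personalization_service):
        # All four collaborators are supplied by the application;
        # each is expected to expose an async `get(...)` method.
        self.product_store = product_store
        self.recommendation_service = recommendation_service
        self.review_service = review_service
        self.personalization_service = personalization_service

    async def _get_product(self, product_id: str):
        return await self.product_store.get(product_id)

    async def get_product_page(self, product_id: str, user_id: str):
        # Core data: if this fails, we cannot serve the page
        product = await self._get_product(product_id)
        if not product:
            raise NotFoundException(f"Product {product_id} not found")

        # Non-critical enrichments: degrade if unavailable.
        # Run concurrently so added latency is bounded by the longest
        # timeout (1.0s) rather than the sum of all three.
        recommendations, reviews, personalization = await asyncio.gather(
            self._safe_call(
                self.recommendation_service.get(product_id),
                fallback=[],
                timeout=0.5
            ),
            self._safe_call(
                self.review_service.get(product_id),
                fallback={"count": 0, "items": [], "degraded": True},
                timeout=1.0
            ),
            self._safe_call(
                self.personalization_service.get(product_id, user_id),
                fallback=None,
                timeout=0.3
            ),
        )

        return ProductPage(
            product=product,
            recommendations=recommendations,
            reviews=reviews,
            personalization=personalization
        )

    async def _safe_call(self, coroutine, fallback, timeout: float):
        """Call with timeout and fallback on any failure."""
        try:
            return await asyncio.wait_for(coroutine, timeout=timeout)
        except Exception:
            # Includes timeouts; task cancellation is not swallowed
            return fallback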

Bulkhead Isolation

Named after ship bulkheads that contain flooding to one compartment, this pattern isolates failures so they cannot spread.

go
package bulkhead

import (
	"context"
	"errors"
	"sync"
	"time"
)

var ErrBulkheadFull = errors.New("bulkhead: max concurrent calls reached")

// Bulkhead limits concurrent calls to a downstream service.
type Bulkhead struct {
	name      string
	semaphore chan struct{}
	maxWait   time.Duration
	mu        sync.Mutex // guards active and rejected
	active    int
	rejected  int
}

func NewBulkhead(name string, maxConcurrent int, maxWait time.Duration) *Bulkhead {
	return &Bulkhead{
		name:      name,
		semaphore: make(chan struct{}, maxConcurrent),
		maxWait:   maxWait,
	}
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() (interface{}, error)) (interface{}, error) {
	// Try to acquire a slot
	timer := time.NewTimer(b.maxWait)
	defer timer.Stop()

	select {
	case b.semaphore <- struct{}{}:
		// Got a slot
		defer func() { <-b.semaphore }()

		b.mu.Lock()
		b.active++
		b.mu.Unlock()

		defer func() {
			b.mu.Lock()
			b.active--
			b.mu.Unlock()
		}()

		return fn()

	case <-timer.C:
		b.mu.Lock()
		b.rejected++
		b.mu.Unlock()
		return nil, ErrBulkheadFull

	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func (b *Bulkhead) Stats() (active int, rejected int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.active, b.rejected
}

Bulkhead Strategies

| Strategy | Isolation | Overhead | Best For |
| --- | --- | --- | --- |
| Thread pool | Per-service thread pool | Higher (thread cost) | JVM-based services |
| Semaphore | Count-based concurrency limit | Low | Non-blocking I/O |
| Process isolation | Separate processes | Highest | Critical isolation |
| Container isolation | Separate containers/pods | Medium | Kubernetes |
| Connection pool | Dedicated DB connections | Medium | Database calls |
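
For the semaphore row, the same idea as the Go example can be sketched for non-blocking Python code with `asyncio.Semaphore`. The names `AsyncBulkhead` and `BulkheadFullError` are illustrative, and the limits in the demo are arbitrary.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when no slot frees up within max_wait seconds."""


class AsyncBulkhead:
    """Caps concurrent calls to one dependency with a semaphore."""

    def __init__(self, max_concurrent: int, max_wait: float):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._max_wait = max_wait
        self.rejected = 0

    async def execute(self, fn):
        """Run the zero-argument async callable fn inside the bulkhead."""
        try:
            await asyncio.wait_for(self._semaphore.acquire(),
                                   timeout=self._max_wait)
        except asyncio.TimeoutError:
            self.rejected += 1
            raise BulkheadFullError("max concurrent calls reached") from None
        try:
            return await fn()
        finally:
            self._semaphore.release()


async def main():
    # Created inside the running loop so it also works on older Python versions
    bulkhead = AsyncBulkhead(max_concurrent=2, max_wait=0.05)

    async def slow_call():
        await asyncio.sleep(0.2)
        return "ok"

    results = await asyncio.gather(
        *(bulkhead.execute(slow_call) for _ in range(3)),
        return_exceptions=True,
    )
    return results, bulkhead.rejected


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With two slots and three simultaneous callers, the third caller waits 50 ms and is then rejected instead of queueing behind the slow dependency. In practice each downstream service would get its own instance so one slow dependency cannot use up capacity meant for the others.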

Circuit Breaker Pattern

Circuit breakers prevent cascading failures by stopping calls to a failing dependency. When a service starts failing, the circuit "opens" and calls fail immediately without attempting the network call.

For a complete deep dive into circuit breaker implementation, state machines, and tuning, see our dedicated Circuit Breaker Pattern page.

The key availability insight: circuit breakers trade partial failure (one feature unavailable) for system stability (everything else keeps working).
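
To make the open and fail-fast behavior concrete, here is a deliberately small sketch. It is not the implementation from the dedicated page: the class name, the thresholds, and the injectable `clock` parameter are assumptions for illustration, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        """Invoke fn() unless the circuit is open."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through
        try:
            result = fn()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately rather than waiting on a timeout, which is what keeps threads and connections free for the rest of the system.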

Retry with Exponential Backoff

Retries handle transient failures. Exponential backoff prevents retry storms from overwhelming a recovering service.

python
import random
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],  # zero-argument async callable
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
    retryable_exceptions: tuple = (Exception,)
) -> T:
    """Retry with exponential backoff and optional jitter."""
    last_exception = None

    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except retryable_exceptions as e:
            last_exception = e

            if attempt == max_retries:
                break

            delay = min(base_delay * (2 ** attempt), max_delay)

            if jitter:
                # Full jitter: random between 0 and calculated delay
                delay = random.uniform(0, delay)

            await asyncio.sleep(delay)

    raise last_exception

Jitter is critical. Without jitter, clients that failed at the same moment also retry at the same moment, creating a "thundering herd." With jitter, retries spread out across the backoff window.

| Backoff Type | Delay Pattern | Thundering Herd Risk |
| --- | --- | --- |
| No backoff | 1s, 1s, 1s | Very high |
| Linear | 1s, 2s, 3s | High |
| Exponential | 1s, 2s, 4s, 8s | Medium |
| Exponential + full jitter | rand(0,1), rand(0,2), rand(0,4) | Low |
| Decorrelated jitter | rand(base, prev*3) | Lowest |
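
The last row is the least obvious: each delay is drawn from a range that depends on the previous delay rather than on the attempt number. A short sketch of that schedule, assuming a cap on the maximum delay; the function name and default values are illustrative.

```python
import random


def decorrelated_jitter_delays(base: float = 1.0, cap: float = 30.0,
                               attempts: int = 5, rng=random) -> list:
    """Delays where each one is rand(base, previous * 3), capped at `cap`."""
    delays = []
    prev = base
    for _ in range(attempts):
        prev = min(cap, rng.uniform(base, prev * 3))
        delays.append(prev)
    return delays


print([round(d, 2) for d in decorrelated_jitter_delays()])
```

Every delay stays between `base` and `cap`, and because each client's sequence depends on its own earlier random draws, two clients that failed together drift apart quickly. Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.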

Load Shedding

When a system is overloaded, it is better to reject some requests quickly than to serve all requests slowly. Load shedding intentionally drops low-priority traffic to maintain service for high-priority traffic.

python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0   # Health checks, payments in progress
    HIGH = 1       # Authenticated user actions
    NORMAL = 2     # General API requests
    LOW = 3        # Analytics, batch operations
    BEST_EFFORT = 4  # Prefetch, speculative requests


class LoadShedder:
    """Shed load by priority when approaching capacity."""

    def __init__(self, max_concurrent: int, shed_threshold: float = 0.8):
        self.max_concurrent = max_concurrent
        self.shed_threshold = shed_threshold
        self.current_load = 0  # in-flight requests, updated by try_acquire/release

    def should_accept(self, priority: Priority) -> bool:
        load_ratio = self.current_load / self.max_concurrent

        if load_ratio < self.shed_threshold:
            return True  # Under threshold, accept everything

        # Over threshold: shed by priority
        # Higher load = more aggressive shedding
        if priority == Priority.CRITICAL:
            return True  # Always accept critical
        elif priority == Priority.HIGH:
            return load_ratio < 0.95
        elif priority == Priority.NORMAL:
            return load_ratio < 0.90
        elif priority == Priority.LOW:
            return load_ratio < 0.85
        else:
            return False  # Shed best-effort first

    def try_acquire(self, priority: Priority) -> bool:
        """Admit the request and count it as in-flight, or reject it."""
        if not self.should_accept(priority):
            return False
        self.current_load += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        self.current_load = max(0, self.current_load - 1)

Chaos Engineering

You cannot be confident in your availability patterns unless you test them. Chaos engineering deliberately introduces failures to verify your system handles them correctly.

Principles

  1. Define "steady state" — What does normal look like? (latency, error rate, throughput)
  2. Hypothesize — "If we kill one database replica, latency stays under 200ms"
  3. Introduce failure — Kill the replica
  4. Observe — Did the system behave as expected?
  5. Learn — Fix what broke, add monitoring, improve runbooks
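
Steps 1, 2, and 4 amount to writing the steady state down as thresholds and comparing what was observed during the experiment against them. A minimal sketch, assuming two metrics: the 200 ms latency limit comes from the example hypothesis above, while the error-rate limit and all names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SteadyState:
    """Thresholds that define 'normal' for the experiment."""
    max_p99_latency_ms: float
    max_error_rate: float


@dataclass
class Observation:
    """Metrics measured while the failure is being injected."""
    p99_latency_ms: float
    error_rate: float


def check_hypothesis(steady: SteadyState, observed: Observation) -> List[str]:
    """Return the violated thresholds; an empty list means the hypothesis held."""
    violations = []
    if observed.p99_latency_ms > steady.max_p99_latency_ms:
        violations.append(
            f"p99 latency {observed.p99_latency_ms}ms exceeds "
            f"{steady.max_p99_latency_ms}ms"
        )
    if observed.error_rate > steady.max_error_rate:
        violations.append(
            f"error rate {observed.error_rate:.2%} exceeds "
            f"{steady.max_error_rate:.2%}"
        )
    return violations


steady = SteadyState(max_p99_latency_ms=200.0, max_error_rate=0.01)
during_chaos = Observation(p99_latency_ms=350.0, error_rate=0.002)
print(check_hypothesis(steady, during_chaos))
```

Writing the thresholds down before the experiment is the point: it turns "the system seemed fine" into a pass or fail result that can also serve as an automatic abort condition.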

Common Chaos Experiments

| Experiment | What It Tests | Tools |
| --- | --- | --- |
| Kill a server instance | Auto-scaling, load balancing | Chaos Monkey, kill -9 |
| Inject network latency | Timeout handling, circuit breakers | tc netem, Toxiproxy |
| Fill disk | Log rotation, alerting | dd, stress-ng |
| DNS failure | Fallback, caching | iptables, Chaos Mesh |
| Kill a database primary | Failover, data consistency | Manual, LitmusChaos |
| Corrupt a response | Input validation, error handling | Toxiproxy |
| Region failure | Multi-region failover | AWS Fault Injection Simulator |

Chaos in Practice

yaml
# LitmusChaos experiment: kill a pod
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-experiment
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"

Availability Pattern Decision Matrix

| Pattern | Nines It Buys | Cost | Complexity | Start Using At |
| --- | --- | --- | --- | --- |
| Health checks | Base requirement | Low | Low | Day 1 |
| Load balancing | 99% -> 99.9% | Low | Low | Day 1 |
| Active-passive failover | 99.9% -> 99.95% | Medium | Medium | 1K+ users |
| Active-active | 99.95% -> 99.99% | High | High | 100K+ users |
| Circuit breakers | Prevents cascading | Low | Medium | Any microservice |
| Bulkhead isolation | Prevents cascading | Low | Medium | 3+ dependencies |
| Graceful degradation | Reduces blast radius | Medium | Medium | Complex features |
| Multi-region | 99.99% -> 99.999% | Very high | Very high | Global product |
| Chaos engineering | Validates everything | Medium | Medium | After patterns are in place |

High availability is not a feature you add at the end. It is a property that emerges from patterns applied consistently across every layer of your architecture. Start with health checks and load balancing, add circuit breakers and bulkheads as you grow, and validate everything with chaos engineering.

Real-World Examples

Netflix (Chaos Monkey)

Netflix introduced Chaos Monkey around 2011. It randomly terminates production EC2 instances to verify that their systems handle failures gracefully. They expanded this into the Simian Army, including Chaos Gorilla (simulates the loss of an entire availability zone) and Chaos Kong (simulates the loss of an entire region). This approach validates that their availability patterns actually work under real failure conditions rather than only on paper.

Amazon

Amazon uses bulkhead isolation through their cell-based architecture. Each "cell" is an independent unit with its own compute, storage, and database. If one cell fails, only the customers assigned to that cell are affected — other cells continue operating normally. This limits the blast radius of any failure to ~1/N of total customers.
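
The ~1/N blast radius follows from pinning each customer to exactly one cell. The sketch below illustrates that property with a stable hash; it is not Amazon's actual routing mechanism, and the function name and cell count are assumptions.

```python
import hashlib
from collections import Counter


def assign_cell(customer_id: str, num_cells: int) -> int:
    """Deterministically map a customer to one of num_cells cells."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


# Spread 10,000 hypothetical customers over 10 cells
counts = Counter(assign_cell(f"customer-{i}", 10) for i in range(10_000))
largest_share = max(counts.values()) / 10_000
print(f"largest cell holds {largest_share:.1%} of customers")
```

Because the hash is stable, a customer always lands in the same cell, so a failed cell affects only its own slice. A plain modulo reshuffles most customers when the number of cells changes, so a production system would more likely keep an explicit customer-to-cell mapping.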

Slack

Slack uses graceful degradation extensively. When their real-time messaging infrastructure is under pressure, they disable non-critical features like typing indicators, presence updates, and message previews. Users can still send and read messages — the core function — while peripheral features are temporarily unavailable. This keeps the service usable during incidents.

Interview Tip

What to say

"Availability is about making failures invisible, not preventing them. My approach has three layers. First, redundancy at every level: multiple app servers, Multi-AZ database, replicated cache. Second, resilience patterns: circuit breakers prevent cascading failures, retries with exponential backoff handle transient issues, and graceful degradation keeps core features working when non-critical services fail. Third, validation: chaos engineering to prove it actually works. The math matters too: availabilities of serial dependencies multiply, so every synchronous hop lowers the total, and I prefer async communication to limit synchronous chains. Netflix is the well-known example: they routinely terminate production instances to confirm the system survives it."

"What I cannot create, I do not understand." — Richard Feynman