Availability Patterns
Availability measures the percentage of time a system is operational and accessible. Five nines (99.999%) sounds impressive until you calculate it: that is 5.26 minutes of total downtime per year. Each additional nine cuts the downtime budget by a factor of ten and typically costs disproportionately more money and engineering effort. The patterns on this page are how you buy those nines.
Availability is not about preventing failures. Hardware fails. Networks partition. Software has bugs. Availability is about ensuring failures do not become outages. The difference between a system that is 99% available and one that is 99.99% is not better hardware — it is better patterns.
Availability Math
Before diving into patterns, understand the math that drives decisions.
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.83 min | 10.08 min |
| 99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min |
| 99.999% (five nines) | 5.26 min | 26.3 sec | 6.05 sec |
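The table values come from a single multiplication: the unavailable fraction times the length of the period. A minimal sketch, assuming an average year of 365.25 days (the function name `downtime_seconds` is illustrative, not from any library):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days


def downtime_seconds(availability: float,
                     period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime in seconds for a given availability over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


for label, a in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    minutes = downtime_seconds(a) / 60
    print(f"{label}: {minutes:.2f} min/year")
```

Passing a different `period_seconds` (for example, 7 days in seconds) gives the weekly column.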
Serial dependencies reduce availability: If Service A (99.9%) calls Service B (99.9%) and their failures are independent, the combined availability is 99.9% x 99.9% ≈ 99.8%. Add more services in a chain and availability drops fast.
Parallel redundancy increases availability: If you have two independent instances, each at 99%, the system availability is 1 - (1 - 0.99)^2 = 99.99%. Redundancy is how you buy nines.
def serial_availability(*components: float) -> float:
    """Combined availability of components in series (all must work)."""
    result = 1.0
    for a in components:
        result *= a
    return result


def parallel_availability(*components: float) -> float:
    """Combined availability of independent components in parallel (at least one must work)."""
    failure_prob = 1.0
    for a in components:
        failure_prob *= (1 - a)
    return 1 - failure_prob


# Example: Three 99.9% services in series
serial = serial_availability(0.999, 0.999, 0.999)
print(f"Serial: {serial:.4%}")  # 99.7003%, worse than any single component

# Example: Two 99.9% instances in parallel
parallel = parallel_availability(0.999, 0.999)
print(f"Parallel: {parallel:.6%}")  # 99.999900%, better than either alone

Failover Patterns
Failover is the process of switching to a standby component when the primary fails.
Active-Passive (Hot Standby)
One server handles all traffic. A standby server is ready to take over if the primary fails. The standby receives replicated data but does not serve traffic.
How failover happens:
- Health monitor detects primary is unresponsive (missed heartbeats)
- Monitor promotes standby to primary (VIP/DNS switch)
- Standby starts serving traffic
- Recovery time: seconds to minutes depending on implementation
Trade-offs:
- Simple to understand and implement
- Standby wastes resources (paying for idle capacity)
- Data loss possible if replication is asynchronous (last few writes may be lost)
- Failover is not instant — there is a detection window + promotion time
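The detection window described above can be sketched as a small heartbeat monitor. This is an illustrative sketch, not a production failover controller: the class name `FailoverMonitor`, its parameters, and the injected `clock` are assumptions made for the example, and the promotion step is reduced to flipping a string where a real system would move a VIP or update DNS.

```python
import time
from typing import Callable


class FailoverMonitor:
    """Promote the standby after too many consecutive missed heartbeats."""

    def __init__(self, heartbeat_interval: float = 1.0, missed_threshold: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self.heartbeat_interval = heartbeat_interval  # seconds between expected heartbeats
        self.missed_threshold = missed_threshold      # misses tolerated before failover
        self.clock = clock                            # injectable so the sketch is testable
        self.active = "primary"
        self.last_heartbeat = clock()

    def record_heartbeat(self) -> None:
        """Called whenever the primary reports in."""
        self.last_heartbeat = self.clock()

    def check(self) -> str:
        """Return the node that should serve traffic right now."""
        silence = self.clock() - self.last_heartbeat
        missed = int(silence // self.heartbeat_interval)
        if self.active == "primary" and missed >= self.missed_threshold:
            # Stand-in for the real promotion step (VIP move or DNS update).
            self.active = "standby"
        return self.active


# Simulated time: the primary goes silent at t=0.
now = [0.0]
monitor = FailoverMonitor(heartbeat_interval=1.0, missed_threshold=3,
                          clock=lambda: now[0])
now[0] = 2.5
before = monitor.check()  # only 2 heartbeats missed, primary stays active
now[0] = 3.2
after = monitor.check()   # 3 heartbeats missed, standby is promoted
print(before, after)
```

The worst-case detection time is roughly `heartbeat_interval * missed_threshold`. Lowering either value speeds up failover but raises the chance of promoting the standby because of a brief network blip.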
Active-Active
All servers handle traffic simultaneously. If one fails, the others absorb its load, so each server needs enough spare capacity to take on its share of a failed peer's traffic. No server sits idle.
Trade-offs:
- No wasted capacity — all servers are productive
- Higher throughput (N servers serving traffic, not N-1)
- Conflict resolution needed (two servers modify the same data)
- More complex to implement correctly
- Split-brain risk if network partitions
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Resource utilization | ~50% (standby idle) | Up to 100%, minus failover headroom |
| Failover time | Seconds to minutes | Immediate (LB routes around) |
| Complexity | Lower | Higher |
| Data conflicts | None (single writer) | Must handle conflicts |
| Cost efficiency | Lower | Higher |
| Best for | Databases, stateful services | Stateless app servers, CDNs |
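In active-active, the surviving servers can only absorb a failed peer's load if they were not already full. A rough sizing sketch, assuming traffic is spread evenly and redistributes evenly after a failure (the function name is illustrative):

```python
def max_safe_utilization(nodes: int, tolerated_failures: int = 1) -> float:
    """Highest per-node utilization that still lets survivors absorb the load."""
    if nodes <= tolerated_failures:
        raise ValueError("need more nodes than tolerated failures")
    return (nodes - tolerated_failures) / nodes


for n in (2, 3, 5, 10):
    print(f"{n} nodes: run each at <= {max_safe_utilization(n):.0%}")
```

With two nodes the ceiling is 50%, which matches active-passive in cost; the efficiency advantage of active-active grows with the number of nodes.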
Multi-Region Active-Active
For global availability, run active instances in multiple regions.
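One common way to route traffic across regions is to send each user to the lowest-latency region that is currently healthy. The sketch below illustrates that idea only; the region names and latency figures are made up, and real deployments usually do this in DNS or a global load balancer rather than in application code.

```python
from typing import Mapping, Optional


def pick_region(latency_ms: Mapping[str, float],
                healthy: Mapping[str, bool]) -> Optional[str]:
    """Choose the lowest-latency region that is currently healthy."""
    candidates = [r for r in latency_ms if healthy.get(r, False)]
    if not candidates:
        return None  # every region is down; the caller must handle this
    return min(candidates, key=lambda r: latency_ms[r])


# Hypothetical measurements from one client's point of view.
latency = {"us-east": 20.0, "eu-west": 90.0, "ap-south": 180.0}
normal = pick_region(latency, {"us-east": True, "eu-west": True, "ap-south": True})
failover = pick_region(latency, {"us-east": False, "eu-west": True, "ap-south": True})
print(normal, failover)
```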
Health Checks
Health checks are the foundation of automated failover. Without them, you cannot detect failures.
Health Check Types
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    status: HealthStatus
    component: str
    latency_ms: float
    message: Optional[str] = None
    # Timezone-aware default; datetime.utcnow() is deprecated as of Python 3.12.
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class HealthChecker:
    """Comprehensive health check implementation."""

    def __init__(self, db, cache):
        # Async clients: db must expose execute(), cache must expose ping().
        self.db = db
        self.cache = cache

    async def liveness_check(self) -> HealthCheckResult:
        """Is the process running? (Kubernetes liveness probe)

        If this fails, the container should be restarted.
        Keep it simple: just confirm the process is not deadlocked.
        """
        return HealthCheckResult(
            status=HealthStatus.HEALTHY,
            component="process",
            latency_ms=0.1,
            message="Process is alive",
        )

    async def readiness_check(self) -> HealthCheckResult:
        """Can the service handle requests? (Kubernetes readiness probe)

        Check dependencies: DB connection, cache, required services.
        """
        checks = []

        # Database connectivity (monotonic timer, immune to wall-clock jumps)
        try:
            start = time.perf_counter()
            await self.db.execute("SELECT 1")
            latency = (time.perf_counter() - start) * 1000
            checks.append(HealthCheckResult(
                status=HealthStatus.HEALTHY if latency < 100 else HealthStatus.DEGRADED,
                component="database",
                latency_ms=latency,
            ))
        except Exception as e:
            checks.append(HealthCheckResult(
                status=HealthStatus.UNHEALTHY,
                component="database",
                latency_ms=0,
                message=str(e),
            ))

        # Cache connectivity
        try:
            start = time.perf_counter()
            await self.cache.ping()
            latency = (time.perf_counter() - start) * 1000
            checks.append(HealthCheckResult(
                status=HealthStatus.HEALTHY if latency < 10 else HealthStatus.DEGRADED,
                component="cache",
                latency_ms=latency,
            ))
        except Exception as e:
            checks.append(HealthCheckResult(
                status=HealthStatus.UNHEALTHY,
                component="cache",
                latency_ms=0,
                message=str(e),
            ))

        # Aggregate: unhealthy if any critical dependency is down
        if any(c.status == HealthStatus.UNHEALTHY for c in checks):
            overall = HealthStatus.UNHEALTHY
        elif any(c.status == HealthStatus.DEGRADED for c in checks):
            overall = HealthStatus.DEGRADED
        else:
            overall = HealthStatus.HEALTHY

        return HealthCheckResult(
            status=overall,
            component="service",
            latency_ms=max(c.latency_ms for c in checks),
        )

Health Check Patterns
| Pattern | Mechanism | Detection Speed | False Positives |
|---|---|---|---|
| TCP check | Connect to port | Fast (ms) | Low |
| HTTP check | GET /health | Fast | Low |
| Deep health | Check all dependencies | Slower (100ms+) | Higher |
| Peer health | Services check each other | Medium | Medium |
| Synthetic transactions | Execute real operations | Slow (seconds) | Very low |
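False positives in the table above are usually tamed by requiring several consecutive probe results before changing an instance's state, similar in spirit to the `failureThreshold` and `successThreshold` settings on Kubernetes probes. A minimal sketch of that idea (the class name `FlapDamper` and its defaults are assumptions for illustration):

```python
class FlapDamper:
    """Change reported health only after N consecutive disagreeing probes."""

    def __init__(self, unhealthy_after: int = 3, healthy_after: int = 2):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._streak = 0  # consecutive probes that disagree with the current state

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the (possibly updated) health state."""
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state, reset the streak
            return self.healthy
        self._streak += 1
        needed = self.healthy_after if probe_ok else self.unhealthy_after
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


damper = FlapDamper()
probes = [True, False, True, False, False, False, True, True]
states = [damper.observe(ok) for ok in probes]
print(states)
# [True, True, True, True, True, False, False, True]
```

A single failed probe (the second one) does not remove the instance from rotation; three in a row do, and two successes in a row bring it back.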
Graceful Degradation
When a component fails, the system should degrade gracefully — offering reduced functionality rather than complete failure.
import asyncio


class GracefulDegradation:
    """Degrade functionality rather than fail entirely."""
    # Assumes self.user_id, the service clients, NotFoundException,
    # and ProductPage are defined elsewhere in the application.

    async def get_product_page(self, product_id: str):
        # Core data: if this fails, we cannot serve the page
        product = await self._get_product(product_id)
        if not product:
            raise NotFoundException(f"Product {product_id} not found")

        # Non-critical enrichments: degrade if unavailable.
        # Run them concurrently so the worst-case added latency is the
        # longest timeout (1.0s), not the sum of all three.
        recommendations, reviews, personalization = await asyncio.gather(
            self._safe_call(
                self.recommendation_service.get(product_id),
                fallback=[],
                timeout=0.5,
            ),
            self._safe_call(
                self.review_service.get(product_id),
                fallback={"count": 0, "items": [], "degraded": True},
                timeout=1.0,
            ),
            self._safe_call(
                self.personalization_service.get(product_id, self.user_id),
                fallback=None,
                timeout=0.3,
            ),
        )

        return ProductPage(
            product=product,
            recommendations=recommendations,
            reviews=reviews,
            personalization=personalization,
        )

    async def _safe_call(self, coroutine, fallback, timeout: float):
        """Call with timeout and fallback on any failure (including timeout)."""
        try:
            return await asyncio.wait_for(coroutine, timeout=timeout)
        except Exception:
            return fallback

Bulkhead Isolation
Named after ship bulkheads that contain flooding to one compartment, this pattern isolates failures so they cannot spread.
package bulkhead

import (
	"context"
	"errors"
	"sync"
	"time"
)

var ErrBulkheadFull = errors.New("bulkhead: max concurrent calls reached")

// Bulkhead limits concurrent calls to a downstream service.
type Bulkhead struct {
	name      string
	semaphore chan struct{}
	maxWait   time.Duration
	mu        sync.Mutex
	active    int
	rejected  int
}

func NewBulkhead(name string, maxConcurrent int, maxWait time.Duration) *Bulkhead {
	return &Bulkhead{
		name:      name,
		semaphore: make(chan struct{}, maxConcurrent),
		maxWait:   maxWait,
	}
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() (interface{}, error)) (interface{}, error) {
	// Try to acquire a slot
	timer := time.NewTimer(b.maxWait)
	defer timer.Stop()

	select {
	case b.semaphore <- struct{}{}:
		// Got a slot
		defer func() { <-b.semaphore }()
		b.mu.Lock()
		b.active++
		b.mu.Unlock()
		defer func() {
			b.mu.Lock()
			b.active--
			b.mu.Unlock()
		}()
		return fn()
	case <-timer.C:
		b.mu.Lock()
		b.rejected++
		b.mu.Unlock()
		return nil, ErrBulkheadFull
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func (b *Bulkhead) Stats() (active int, rejected int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.active, b.rejected
}

Bulkhead Strategies
| Strategy | Isolation | Overhead | Best For |
|---|---|---|---|
| Thread pool | Per-service thread pool | Higher (thread cost) | JVM-based services |
| Semaphore | Count-based concurrency limit | Low | Non-blocking I/O |
| Process isolation | Separate processes | Highest | Critical isolation |
| Container isolation | Separate containers/pods | Medium | Kubernetes |
| Connection pool | Dedicated DB connections | Medium | Database calls |
Circuit Breaker Pattern
Circuit breakers prevent cascading failures by stopping calls to a failing dependency. When a service starts failing, the circuit "opens" and calls fail immediately without attempting the network call.
For a complete deep dive into circuit breaker implementation, state machines, and tuning, see our dedicated Circuit Breaker Pattern page.
The key availability insight: circuit breakers trade partial failure (one feature unavailable) for system stability (everything else keeps working).
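To make the open/closed behavior concrete, here is a deliberately small, single-threaded sketch. It is not the implementation from the dedicated page: the class names, thresholds, and injected `clock` are assumptions for illustration, and it omits concerns such as thread safety and limiting trial calls in the half-open state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency marked as failing")
            # Timeout elapsed: half-open, let this trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky() -> str:
    raise ConnectionError("downstream unavailable")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

rejected = False
try:
    breaker.call(flaky)
except CircuitOpenError:
    rejected = True  # rejected without calling the dependency

now[0] = 11.0  # reset timeout has passed
recovered = breaker.call(lambda: "ok")  # trial call succeeds, circuit closes
print(rejected, recovered)
```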
Retry with Exponential Backoff
Retries handle transient failures. Exponential backoff prevents retry storms from overwhelming a recovering service.
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
    retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,),
) -> T:
    """Retry an async callable with exponential backoff and optional jitter."""
    last_exception = None
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except retryable_exceptions as e:
            last_exception = e
            if attempt == max_retries:
                break
            delay = min(base_delay * (2 ** attempt), max_delay)
            if jitter:
                # Full jitter: random between 0 and calculated delay
                delay = random.uniform(0, delay)
            await asyncio.sleep(delay)
    raise last_exception

Jitter is critical. Without jitter, clients that failed at the same moment all retry at the same moment, creating a "thundering herd." With jitter, retries spread out.
| Backoff Type | Delay Pattern | Thundering Herd Risk |
|---|---|---|
| Fixed delay (no backoff) | 1s, 1s, 1s | Very high |
| Linear | 1s, 2s, 3s | High |
| Exponential | 1s, 2s, 4s, 8s | Medium |
| Exponential + full jitter | rand(0,1), rand(0,2), rand(0,4) | Low |
| Decorrelated jitter | min(cap, rand(base, prev*3)) | Low |
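The decorrelated jitter row has no counterpart in the code above, so here is a sketch of it as a generator. It follows the commonly cited formula `min(cap, uniform(base, previous * 3))`; the function name and the optional `rng` parameter are assumptions for illustration.

```python
import random
from itertools import islice
from typing import Iterator, Optional


def decorrelated_jitter(base: float = 1.0, cap: float = 30.0,
                        rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield an endless sequence of delays: min(cap, uniform(base, prev * 3))."""
    rng = rng or random.Random()
    delay = base
    while True:
        # Each delay depends on the previous one, not on the attempt number.
        delay = min(cap, rng.uniform(base, delay * 3))
        yield delay


delays = list(islice(decorrelated_jitter(base=1.0, cap=30.0), 6))
print([round(d, 2) for d in delays])
```

Every delay stays between `base` and `cap`, and the upper bound tends to grow as retries continue, which keeps clients spread out without tracking an attempt counter.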
Load Shedding
When a system is overloaded, it is better to reject some requests quickly than to serve all requests slowly. Load shedding intentionally drops low-priority traffic to maintain service for high-priority traffic.
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0     # Health checks, payments in progress
    HIGH = 1         # Authenticated user actions
    NORMAL = 2       # General API requests
    LOW = 3          # Analytics, batch operations
    BEST_EFFORT = 4  # Prefetch, speculative requests


class LoadShedder:
    """Shed load by priority when approaching capacity."""

    def __init__(self, max_concurrent: int, shed_threshold: float = 0.8):
        self.max_concurrent = max_concurrent
        self.shed_threshold = shed_threshold
        self.current_load = 0  # requests currently in flight

    def should_accept(self, priority: Priority) -> bool:
        load_ratio = self.current_load / self.max_concurrent
        if load_ratio < self.shed_threshold:
            return True  # Under threshold, accept everything

        # Over threshold: shed by priority
        # Higher load = more aggressive shedding
        if priority == Priority.CRITICAL:
            return True  # Always accept critical
        elif priority == Priority.HIGH:
            return load_ratio < 0.95
        elif priority == Priority.NORMAL:
            return load_ratio < 0.90
        elif priority == Priority.LOW:
            return load_ratio < 0.85
        else:
            return False  # Shed best-effort first

    def try_acquire(self, priority: Priority) -> bool:
        """Admit the request and count it as in flight, or reject it."""
        if not self.should_accept(priority):
            return False
        self.current_load += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        self.current_load = max(0, self.current_load - 1)

Chaos Engineering
You cannot be confident in your availability patterns unless you test them. Chaos engineering deliberately introduces failures to verify your system handles them correctly.
Principles
- Define "steady state" — What does normal look like? (latency, error rate, throughput)
- Hypothesize — "If we kill one database replica, latency stays under 200ms"
- Introduce failure — Kill the replica
- Observe — Did the system behave as expected?
- Learn — Fix what broke, add monitoring, improve runbooks
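The loop above can be sketched as a small harness that refuses to start from an unhealthy baseline and always rolls back. This is an illustration against a simulated system, not a real chaos tool; `run_experiment`, `ExperimentResult`, and the toy `system` dictionary are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    steady_before: bool
    held_during_failure: bool


def run_experiment(hypothesis: str,
                   steady_state: Callable[[], bool],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> ExperimentResult:
    """Verify steady state, inject the failure, observe, and always roll back."""
    if not steady_state():
        # Never inject failure into a system that is already unhealthy.
        return ExperimentResult(hypothesis, steady_before=False,
                                held_during_failure=False)
    try:
        inject()
        held = steady_state()
    finally:
        rollback()
    return ExperimentResult(hypothesis, steady_before=True,
                            held_during_failure=held)


# Toy system: latency stays low as long as at least two replicas are up.
system = {"replicas": 3}


def latency_ms() -> int:
    return 120 if system["replicas"] >= 2 else 450


result = run_experiment(
    hypothesis="Killing one database replica keeps latency under 200ms",
    steady_state=lambda: latency_ms() < 200,
    inject=lambda: system.update(replicas=system["replicas"] - 1),
    rollback=lambda: system.update(replicas=3),
)
print(result)
```

In a real experiment, `steady_state` would query monitoring and `inject` would call a fault-injection tool; the structure of the loop stays the same.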
Common Chaos Experiments
| Experiment | What It Tests | Tools |
|---|---|---|
| Kill a server instance | Auto-scaling, load balancing | Chaos Monkey, kill -9 |
| Inject network latency | Timeout handling, circuit breakers | tc netem, Toxiproxy |
| Fill disk | Log rotation, alerting | dd, stress-ng |
| DNS failure | Fallback, caching | iptables, Chaos Mesh |
| Kill a database primary | Failover, data consistency | Manual, LitmusChaos |
| Corrupt a response | Input validation, error handling | Toxiproxy |
| Region failure | Multi-region failover | AWS Fault Injection Simulator |
Chaos in Practice
# LitmusChaos experiment: kill a pod
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-experiment
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"

Availability Pattern Decision Matrix
| Pattern | Nines It Buys | Cost | Complexity | Start Using At |
|---|---|---|---|---|
| Health checks | Base requirement | Low | Low | Day 1 |
| Load balancing | 99% -> 99.9% | Low | Low | Day 1 |
| Active-passive failover | 99.9% -> 99.95% | Medium | Medium | 1K+ users |
| Active-active | 99.95% -> 99.99% | High | High | 100K+ users |
| Circuit breakers | Prevents cascading | Low | Medium | Any microservice |
| Bulkhead isolation | Prevents cascading | Low | Medium | 3+ dependencies |
| Graceful degradation | Reduces blast radius | Medium | Medium | Complex features |
| Multi-region | 99.99% -> 99.999% | Very high | Very high | Global product |
| Chaos engineering | Validates everything | Medium | Medium | After patterns are in place |
Cross-References
- Circuit Breaker Pattern — full implementation deep dive
- Health Checks — load balancer health check configuration
- Load Balancing Algorithms — distributing traffic across healthy instances
- CAP Theorem — availability vs consistency trade-off
- Database Replication — replication for failover
- Consistency Patterns — what you trade for availability
High availability is not a feature you add at the end. It is a property that emerges from patterns applied consistently across every layer of your architecture. Start with health checks and load balancing, add circuit breakers and bulkheads as you grow, and validate everything with chaos engineering.
Real-World Examples
Netflix (Chaos Monkey)
Netflix built Chaos Monkey, which it described publicly in 2011, to randomly terminate production EC2 instances and verify that its systems handle failures gracefully. It expanded this into the full Simian Army, including Chaos Gorilla (simulates the loss of an entire AZ) and Chaos Kong (simulates the loss of an entire region). This approach validated that their availability patterns actually work under real failure conditions, helping them achieve 99.99% availability.
Amazon
Amazon uses bulkhead isolation through their cell-based architecture. Each "cell" is an independent unit with its own compute, storage, and database. If one cell fails, only the customers assigned to that cell are affected — other cells continue operating normally. This limits the blast radius of any failure to ~1/N of total customers.
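The ~1/N blast radius follows from how customers are mapped to cells. The sketch below uses a stable hash for that mapping; it is an illustration of the idea only, not a description of how Amazon assigns customers, and the customer IDs and cell count are made up.

```python
import hashlib
from collections import Counter


def assign_cell(customer_id: str, num_cells: int) -> int:
    """Map a customer to a cell with a stable hash (same input, same cell)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


customers = [f"customer-{i}" for i in range(10_000)]
cells = Counter(assign_cell(c, num_cells=10) for c in customers)

failed_cell = 3
blast_radius = cells[failed_cell] / len(customers)
print(f"customers affected by losing cell {failed_cell}: {blast_radius:.1%}")
```

With 10 cells, the affected share lands near 10%. Plain modulo hashing reshuffles most customers when the number of cells changes, so a production design would more likely keep an explicit customer-to-cell mapping.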
Slack
Slack uses graceful degradation extensively. When their real-time messaging infrastructure is under pressure, they disable non-critical features like typing indicators, presence updates, and message previews. Users can still send and read messages — the core function — while peripheral features are temporarily unavailable. This keeps the service usable during incidents.
Interview Tip
What to say
"Availability is about making failures invisible, not preventing them. My approach has three layers: first, redundancy at every level — multiple app servers, Multi-AZ database, replicated cache. Second, resilience patterns — circuit breakers prevent cascading failures, retries with exponential backoff handle transient issues, and graceful degradation keeps core features working when non-critical services fail. Third, validation — chaos engineering to prove it actually works. The math matters too: serial dependencies multiply failure probability, so I prefer async communication to limit synchronous chains. Netflix proves this works — they kill production servers daily and maintain 99.99% availability."