Failure Detectors
How do you know if a remote process is dead? You send it a message. It doesn't respond. Is it dead? Or is the network slow? Or is the process overloaded, about to respond in 3 seconds? This seemingly simple question — "is it dead?" — is actually impossible to answer correctly in an asynchronous distributed system. This impossibility result is one of the most important theorems in computer science, and the entire field of failure detectors exists to work around it.
Why It Exists
The Fundamental Problem
In a distributed system, the only way to learn about a remote process is through messages. If you send a message and get no reply, there are exactly two possibilities:
- The process has crashed — it will never respond
- The process is slow or the network is slow — it will respond eventually
There is no way to distinguish these cases in finite time. Any timeout you choose is arbitrary: too short and you falsely declare live processes dead (false positive); too long and you waste time waiting for dead processes (slow detection).
This isn't an engineering limitation to be solved with better hardware. It is a proven mathematical impossibility.
Why Correct Failure Detection Matters
Getting failure detection wrong has consequences:
- False positive (declare alive process as dead): The system removes a healthy node, rebalances data, triggers unnecessary failovers. If the "dead" node comes back, you have a split-brain scenario
- False negative (fail to detect a dead process): Requests pile up waiting for a response that will never come. Timeouts propagate. Latency cascades through the system
- Slow detection: A dead node sits in the cluster for minutes, receiving traffic it can't process. Users experience errors
War Story
A large e-commerce platform used a 30-second heartbeat timeout for their service mesh. During a garbage collection pause on one of the primary database nodes, the GC pause lasted 31 seconds. The failure detector declared the node dead. The cluster initiated failover: promoted a replica, redirected traffic, and began re-replicating data. When the GC pause ended and the old primary came back online, both nodes believed they were the primary. Each accepted writes for 47 seconds before the conflict was detected. The result: 2,300 orders with inconsistent state, requiring three days of manual reconciliation. The root cause was not the GC pause — it was a failure detector that could only say "dead" or "alive" with no nuance in between.
First Principles
System Model: Synchronous vs. Asynchronous
The theoretical treatment of failure detection begins with the system model:
Synchronous system:
- There exists a known upper bound
on message delivery time - There exists a known upper bound
on process step time - Failure detection is trivial: if no response arrives within
, the process is dead
Asynchronous system:
- No bound on message delivery time
- No bound on process step time
- Messages can be delayed arbitrarily (but eventually delivered if the sender and receiver are correct)
Real systems are neither purely synchronous nor purely asynchronous. They are partially synchronous — most of the time, messages arrive within predictable bounds, but occasionally, delays exceed any fixed bound (GC pauses, network congestion, CPU saturation).
The FLP Impossibility Result
In 1985, Michael Fischer, Nancy Lynch, and Michael Paterson proved the most important impossibility result in distributed computing:
FLP Theorem: In an asynchronous distributed system where even a single process can crash, there is no deterministic algorithm that solves consensus.
What is consensus? All correct processes must agree on a value. More formally, consensus requires three properties:
- Agreement — No two correct processes decide differently
- Validity — If all processes propose the same value
, then is the decided value - Termination — Every correct process eventually decides
FLP proves that no algorithm can guarantee all three in an asynchronous system with even one faulty process.
The FLP Proof (Intuition)
The proof constructs an adversarial execution. The key insight:
There must exist some initial configuration where the decision could go either way (a bivalent configuration). If all initial configurations led to the same decision, the algorithm would be trivially useless (violating validity)
From any bivalent configuration, the adversary can always find a way to reach another bivalent configuration — by delaying the right message at the right time
Therefore, the algorithm can be kept in a bivalent state forever, never reaching a decision
The crucial step is (2). The adversary exploits the asynchronous model: since there is no upper bound on message delay, the adversary can delay a message from process
Why FLP Matters Practically
FLP seems to say "consensus is impossible," but Paxos, Raft, and ZooKeeper solve consensus every day. The resolution:
- FLP assumes pure asynchrony. Real systems are partially synchronous — timeouts work most of the time
- FLP proves impossibility for deterministic algorithms. Randomized algorithms (like Ben-Or's) can solve consensus with probability 1
- FLP assumes guaranteed termination. If you allow algorithms that might not terminate in pathological scenarios (but always terminate in practice), you can solve consensus
Failure detectors are the theoretical framework for understanding option (1) — they formalize what assumptions about timing are needed to make consensus solvable.
Core Mechanics: Unreliable Failure Detectors
Chandra-Toueg Framework (1996)
Tushar Deepak Chandra and Sam Toueg introduced the concept of unreliable failure detectors — oracles that provide (possibly incorrect) hints about which processes have crashed. They showed that even imperfect failure information is sufficient to solve consensus.
A failure detector
Completeness and Accuracy
Failure detectors are classified along two dimensions:
Completeness — do we eventually suspect all crashed processes?
| Property | Definition |
|---|---|
| Strong completeness | Eventually, every crashed process is permanently suspected by every correct process |
| Weak completeness | Eventually, every crashed process is permanently suspected by some correct process |
Accuracy — do we avoid falsely suspecting correct processes?
| Property | Definition |
|---|---|
| Strong accuracy | No correct process is ever suspected |
| Weak accuracy | Some correct process is never suspected |
| Eventual strong accuracy | After some time |
| Eventual weak accuracy | After some time |
The Eight Failure Detector Classes
Combining completeness and accuracy yields eight classes:
The Key Result: Is Sufficient for Consensus
Chandra and Toueg proved two landmark results:
(Eventually Strong) is the weakest failure detector that can solve consensus in asynchronous systems with crash failures and a majority of correct processes (the leader oracle) — which outputs a single process ID, and eventually all correct processes output the same correct process — is equivalent to for solving consensus
This means: to solve consensus, you need a failure detector that eventually:
- Suspects all crashed processes (eventual completeness)
- Stops falsely suspecting at least one correct process (eventual weak accuracy)
This is remarkably weak. The failure detector can make arbitrarily many mistakes for an arbitrarily long time. It just needs to eventually get its act together.
Reducibility Between Failure Detectors
The eight classes form a hierarchy:
where
Also: weak completeness can be transformed into strong completeness by having processes gossip their suspicions. If process
Core Mechanics: Heartbeat-Based Detection
The Basic Protocol
The simplest practical failure detector:
- Every process sends a heartbeat message to its neighbors every
seconds - If process
doesn't receive a heartbeat from process within a timeout , suspects - If
later receives a heartbeat from , un-suspects
Choosing the Timeout
The timeout
- Small
: Detect failures quickly, but more false positives (falsely suspect slow processes) - Large
: Fewer false positives, but slow detection
For a system where message delay
where
Problems with fixed timeouts:
- Network conditions change over time
- Different process pairs may have different latencies
- Sudden load spikes can cause temporary delays that exceed
Adaptive Timeout (TCP-Style)
Inspired by TCP's retransmission timeout calculation:
where:
= smoothed round-trip time = round-trip time variation , (standard values from RFC 6298) = measured arrival time of the -th heartbeat
This adapts to changing network conditions but still produces a binary output: suspected or not.
Core Mechanics: The Phi Accrual Failure Detector
The Key Insight
Hayashibara, Défago, Yared, and Katayama (2004) observed that the binary "suspected/not-suspected" output of traditional failure detectors loses information. A process that missed one heartbeat by 1ms is in the same state as one that missed 10 heartbeats. This is absurd.
The phi accrual failure detector outputs a suspicion level
How It Works
The detector maintains a sliding window of inter-arrival times of heartbeats from each monitored process.
Step 1: Collect samples
Let
Step 2: Model the distribution
Assume heartbeat inter-arrival times follow a normal distribution
Step 3: Compute
When the current time is
The probability that a heartbeat would be this late or later if the process is alive:
where
The suspicion level is defined as:
Interpreting
| Interpretation | ||
|---|---|---|
| 0 | ~100% | Heartbeat expected soon, nothing unusual |
| 1 | ~90% | Slightly late, probably fine |
| 3 | ~99.9% | Quite late, suspicious |
| 5 | ~99.999% | Very suspicious |
| 8 | ~99.999999% | Almost certainly dead |
| 12 | ~ | Dead |
The application sets a threshold
- Cassandra uses
by default - Applications tolerant of false positives might use
- Mission-critical applications might use
Why Is Better Than Timeouts
- Self-tuning — The window of inter-arrival times automatically tracks changes in network conditions
- Calibrated —
means the same thing regardless of the network (the probability of a false positive is ) - Application-configurable — Different parts of the application can use different thresholds from the same detector
- Smooth degradation — As network conditions worsen,
increases gradually instead of suddenly flipping from "alive" to "dead"
Mathematical Properties
The phi accrual detector satisfies:
If a process is truly dead,
The probability of a false positive is exponentially small in the threshold (parametric accuracy).
TypeScript Implementation
// Phi Accrual Failure Detector Implementation
interface PhiAccrualConfig {
/** Number of inter-arrival samples to keep */
windowSize: number;
/** Minimum standard deviation to avoid division issues (ms) */
minStdDeviation: number;
/** Initial inter-arrival time estimate when no data (ms) */
initialHeartbeatInterval: number;
/** Phi threshold above which a node is considered down */
threshold: number;
/** Maximum number of missed intervals before forced suspicion */
maxSampleSize: number;
}
const DEFAULT_CONFIG: PhiAccrualConfig = {
windowSize: 1000,
minStdDeviation: 100,
initialHeartbeatInterval: 500,
threshold: 8,
maxSampleSize: 200,
};
class ArrivalWindow {
private intervals: number[] = [];
private readonly maxSize: number;
private mean: number = 0;
private variance: number = 0;
private count: number = 0;
constructor(maxSize: number) {
this.maxSize = maxSize;
}
/**
* Record a new inter-arrival interval.
* Uses Welford's online algorithm for numerically stable
* mean and variance computation.
*/
record(interval: number): void {
if (this.intervals.length >= this.maxSize) {
// Remove oldest and adjust running stats
this.intervals.shift();
}
this.intervals.push(interval);
this.count++;
// Recompute mean and variance from scratch for accuracy
// (Welford's with window removal is tricky; full recompute
// is fine for window sizes < 1000)
const n = this.intervals.length;
this.mean = this.intervals.reduce((a, b) => a + b, 0) / n;
this.variance =
this.intervals.reduce((sum, val) => sum + (val - this.mean) ** 2, 0) / n;
}
getMean(): number {
return this.mean;
}
getStdDev(minValue: number): number {
return Math.max(Math.sqrt(this.variance), minValue);
}
getSampleCount(): number {
return this.intervals.length;
}
}
class HeartbeatHistory {
private lastHeartbeatTime: number | null = null;
private readonly window: ArrivalWindow;
private readonly config: PhiAccrualConfig;
constructor(config: PhiAccrualConfig) {
this.config = config;
this.window = new ArrivalWindow(config.windowSize);
}
recordHeartbeat(timestamp: number): void {
if (this.lastHeartbeatTime !== null) {
const interval = timestamp - this.lastHeartbeatTime;
if (interval > 0) {
this.window.record(interval);
}
}
this.lastHeartbeatTime = timestamp;
}
getLastHeartbeatTime(): number | null {
return this.lastHeartbeatTime;
}
/**
* Compute the phi value for the given timestamp.
*
* phi = -log10(1 - CDF(timeSinceLastHeartbeat))
*
* where CDF is the cumulative distribution function of
* the normal distribution with parameters estimated from
* the arrival window.
*/
phi(currentTime: number): number {
if (this.lastHeartbeatTime === null) {
return 0; // No data yet, cannot compute
}
const timeSinceLast = currentTime - this.lastHeartbeatTime;
const mean =
this.window.getSampleCount() > 0
? this.window.getMean()
: this.config.initialHeartbeatInterval;
const stdDev = this.window.getStdDev(this.config.minStdDeviation);
return this.computePhi(timeSinceLast, mean, stdDev);
}
private computePhi(
timeSinceLast: number,
mean: number,
stdDev: number
): number {
// Compute the probability of observing a delay >= timeSinceLast
// assuming normal distribution N(mean, stdDev^2)
const y = (timeSinceLast - mean) / stdDev;
const probability = 1.0 - this.normalCDF(y);
// phi = -log10(probability)
// Clamp probability to avoid -log10(0) = Infinity
const clampedProbability = Math.max(probability, 1e-15);
return -Math.log10(clampedProbability);
}
/**
* Approximation of the standard normal CDF using
* Abramowitz and Stegun formula 7.1.26.
* Maximum error: 1.5 × 10^-7
*/
private normalCDF(x: number): number {
if (x < -8.0) return 0.0;
if (x > 8.0) return 1.0;
// Constants for the approximation
const a1 = 0.254829592;
const a2 = -0.284496736;
const a3 = 1.421413741;
const a4 = -1.453152027;
const a5 = 1.061405429;
const p = 0.3275911;
const sign = x < 0 ? -1 : 1;
const absX = Math.abs(x);
const t = 1.0 / (1.0 + p * absX);
const y =
1.0 -
((((a5 * t + a4) * t + a3) * t + a2) * t + a1) *
t *
Math.exp(-absX * absX / 2) /
Math.sqrt(2 * Math.PI);
return 0.5 * (1.0 + sign * y);
}
}
class PhiAccrualFailureDetector {
private readonly config: PhiAccrualConfig;
private readonly histories: Map<string, HeartbeatHistory> = new Map();
private readonly listeners: Array<(nodeId: string, phi: number, isDown: boolean) => void> = [];
constructor(config: Partial<PhiAccrualConfig> = {}) {
this.config = { ...DEFAULT_CONFIG, ...config };
}
/**
* Record a heartbeat from a node.
*/
heartbeat(nodeId: string, timestamp: number = Date.now()): void {
let history = this.histories.get(nodeId);
if (!history) {
history = new HeartbeatHistory(this.config);
this.histories.set(nodeId, history);
}
history.recordHeartbeat(timestamp);
}
/**
* Get the current phi value for a node.
*/
phi(nodeId: string, currentTime: number = Date.now()): number {
const history = this.histories.get(nodeId);
if (!history) {
return 0; // Unknown node, no suspicion
}
return history.phi(currentTime);
}
/**
* Check if a node is considered down.
*/
isAvailable(nodeId: string, currentTime: number = Date.now()): boolean {
return this.phi(nodeId, currentTime) < this.config.threshold;
}
/**
* Get all nodes and their current status.
*/
getNodeStatuses(currentTime: number = Date.now()): Map<string, {
phi: number;
isAvailable: boolean;
lastHeartbeat: number | null;
}> {
const statuses = new Map();
for (const [nodeId, history] of this.histories) {
const phiValue = history.phi(currentTime);
statuses.set(nodeId, {
phi: phiValue,
isAvailable: phiValue < this.config.threshold,
lastHeartbeat: history.getLastHeartbeatTime(),
});
}
return statuses;
}
/**
* Register a listener for failure detection events.
*/
onSuspicionChange(
callback: (nodeId: string, phi: number, isDown: boolean) => void
): void {
this.listeners.push(callback);
}
/**
* Run a monitoring cycle: check all nodes and notify listeners.
*/
monitor(currentTime: number = Date.now()): void {
for (const [nodeId] of this.histories) {
const phiValue = this.phi(nodeId, currentTime);
const isDown = phiValue >= this.config.threshold;
for (const listener of this.listeners) {
listener(nodeId, phiValue, isDown);
}
}
}
remove(nodeId: string): void {
this.histories.delete(nodeId);
}
}
// ─── Simulation ─────────────────────────────────────────────────
function runPhiAccrualSimulation(): void {
console.log('=== Phi Accrual Failure Detector Simulation ===\n');
const detector = new PhiAccrualFailureDetector({
threshold: 8,
windowSize: 100,
minStdDeviation: 50,
initialHeartbeatInterval: 1000,
});
// Register listener
detector.onSuspicionChange((nodeId, phi, isDown) => {
const status = isDown ? '*** SUSPECTED DOWN ***' : 'OK';
console.log(` [Monitor] ${nodeId}: phi=${phi.toFixed(2)} ${status}`);
});
let time = 0;
// Simulate Node A: healthy, regular heartbeats ~1000ms apart
console.log('--- Phase 1: Regular heartbeats from Node A ---');
for (let i = 0; i < 20; i++) {
time += 1000 + Math.random() * 100 - 50; // 950-1050ms
detector.heartbeat('NodeA', time);
}
detector.monitor(time);
// Simulate Node B: irregular heartbeats ~1000ms with high variance
console.log('\n--- Phase 2: Irregular heartbeats from Node B ---');
time = 0;
for (let i = 0; i < 20; i++) {
time += 1000 + Math.random() * 600 - 300; // 700-1300ms
detector.heartbeat('NodeB', time);
}
detector.monitor(time);
// Now simulate Node A going silent (crash)
console.log('\n--- Phase 3: Node A stops heartbeating (crash) ---');
const lastHeartbeat = 20000;
for (let elapsed = 1000; elapsed <= 10000; elapsed += 1000) {
const checkTime = lastHeartbeat + elapsed;
const phi = detector.phi('NodeA', checkTime);
console.log(
` t+${elapsed}ms: phi=${phi.toFixed(2)} ` +
`available=${detector.isAvailable('NodeA', checkTime)}`
);
}
// Compare: how long until each node is detected as failed
console.log('\n--- Phase 4: Detection time comparison ---');
console.log(' Node A (low variance): detected faster');
console.log(' Node B (high variance): detected slower');
console.log(' This is correct! High variance means longer silences are normal.');
}
runPhiAccrualSimulation();Edge Cases and Failure Modes
Byzantine Failures
All the failure detectors discussed so far assume crash-stop failures: a process either works correctly or stops entirely. Byzantine failures are far worse — a process can send arbitrary, malicious messages.
With Byzantine failures:
- A process can send heartbeats while producing wrong results
- A process can send heartbeats to some monitors but not others
- A process can forge heartbeats on behalf of other processes
Byzantine failure detection requires fundamentally different approaches:
- Challenge-response protocols — Send a computation challenge; verify the response
- Cross-checking — Compare outputs of multiple processes
- Reputation systems — Track historical accuracy
Partial Failures
In cloud environments, a process might be "partially failed":
- Able to send heartbeats but unable to process requests (e.g., out of memory for application data but not for the heartbeat thread)
- Network interface partially failed (can reach some nodes but not others)
- Disk failed but process still running in memory
Heartbeat-based detectors miss partial failures. Solutions:
- Application-level health checks — The heartbeat includes a small computation or database query
- Liveness vs. readiness probes — Kubernetes distinguishes between "process is alive" (liveness) and "process can serve requests" (readiness)
- Deep health checks — Verify not just that the process responds, but that it responds correctly
Gray Failures
A "gray failure" occurs when a component's failure is observable by some but not all observers. Example:
- Node A can reach Node C
- Node B cannot reach Node C
- A and B disagree about whether C is alive
This is particularly insidious because:
- A sees C as healthy and continues sending it traffic
- B sees C as failed and triggers failover
- The system is in an inconsistent state
Detection strategies:
- Multi-probe detection — Multiple monitors probe each node from different network paths
- Consensus on failure — Require a quorum of monitors to agree before declaring failure
- Indirect monitoring — A monitors B's perception of C (gossip-based approach)
Clock Skew
If the failure detector uses timestamps, clock skew between nodes can cause:
- Heartbeats appearing to arrive in the future (negative inter-arrival time)
- Heartbeats appearing much more delayed than they actually are
The phi accrual detector is particularly vulnerable because it computes inter-arrival statistics. A clock jump on the monitored node can corrupt the arrival window.
Mitigation: use monotonic clocks (not wall clocks) for local time measurements, and measure inter-arrival times locally (the monitoring node's clock only).
Performance Analysis
Detection Time vs. False Positive Rate
For a heartbeat-based detector with interval
where
| Detection time (if | ||
|---|---|---|
| 1 | 15.87% | 2s |
| 2 | 2.28% | 3s |
| 3 | 0.13% | 4s |
| 4 | 0.003% | 5s |
| 5 | 0.00003% | 6s |
Network Overhead
For
For
Solutions:
- Ring topology — Each node monitors only its neighbors:
messages - Random sampling — Each node monitors
random peers: messages - Hierarchical — Monitor in groups; group leaders monitor each other
- SWIM — Combines direct and indirect probing (see below)
Bandwidth
Each heartbeat message is typically small (< 100 bytes), but at high message rates:
For
This is significant — using heartbeats for failure detection in large clusters requires careful topology design.
Math Foundations
Modeling Failure Detection Quality
Define the quality of service (QoS) of a failure detector by three metrics (Chen, Toueg, and Aguilera, 2002):
- Detection time (
) — Time from actual crash to detection - Mistake rate (
) — Frequency of false positives (falsely suspecting a correct process) - Mistake duration (
) — How long a false suspicion lasts before being corrected
These are related by the heartbeat interval
where
The Impossibility Spectrum
The FLP result is one point on a spectrum of impossibility results:
Probabilistic Guarantees of Phi Accrual
For the phi accrual detector with threshold
Expected time between false suspicions:
For
This is why Cassandra's default of
The SWIM Protocol
Scalable Weakly-consistent Infection-style Membership
SWIM (Das, Gupta, Maniatis, 2002) is a membership and failure detection protocol designed for large-scale clusters. It was adopted by HashiCorp's Serf and Consul (via the memberlist library).
Problems with Traditional Heartbeating
In a cluster of
- Message load:
messages per detection period - False positive probability increases with
: With monitored nodes and false positive probability per node, the probability of at least one false positive is
SWIM solves both problems.
SWIM Protocol Mechanics
Each protocol period
- Direct probe:
selects a random node and sends a PING - If
responds with ACKwithin a timeout,is alive. Done. - Indirect probe (if direct fails):
selects random nodes and sends each an INDIRECT-PING(M_j)request - Each of these
nodes sends a PINGtoand forwards the ACK(if received) back to - If
receives no ACK(direct or indirect) for, marks as suspect - After a suspect timeout, if no refutation arrives,
is declared failed
SWIM Analysis
Message complexity:
Detection time: Expected
False positive rate: The indirect probe mechanism reduces false positives caused by network issues between specific node pairs. If
SWIM Suspicion Mechanism
SWIM introduces a suspicion sub-protocol to further reduce false positives:
- Instead of directly marking a node as failed, mark it as SUSPECT
- Broadcast the suspicion through gossip
- The suspected node has a window of time to refute the suspicion (by sending an
ALIVEmessage with a higher incarnation number) - Only after the suspicion timer expires without refutation is the node declared CONFIRM DEAD
The incarnation number prevents stale suspicions from causing false failures. When a node ALIVE message. Other nodes accept ALIVE messages with higher incarnation numbers, overriding suspicions.
War Story
HashiCorp's Consul uses a modified SWIM protocol (via the memberlist library) for cluster membership. In a production deployment with 5,000 nodes across three datacenters, they discovered that SWIM's random probe target selection was causing uneven detection times — nodes that happened to not be selected for probing for several rounds could be unresponsive for up to 30 seconds before being detected. The fix was to use a round-robin probe target selection instead of purely random: each node maintains a shuffled list of all other nodes and probes them in order, ensuring every node is probed within
Comparison of Failure Detection Strategies
| Strategy | Detection Time | False Positive Rate | Message Complexity | Scalability | Used By |
|---|---|---|---|---|---|
| Fixed timeout | Depends on | Poor | Simple systems | ||
| Adaptive timeout | Lower than fixed | Poor | TCP | ||
| Phi accrual | Moderate | Cassandra, Akka | |||
| SWIM | Low (indirect probes) | Excellent | Consul, Serf | ||
| SWIM + suspicion | Very low | Excellent | Consul | ||
| Hierarchical | Depends on hierarchy | Varies | Good | ZooKeeper |
Real-World Implementations
How Akka Detects Failures
Akka (now Apache Pekko) uses a phi accrual failure detector with:
- Default threshold:
(configurable) - Heartbeat interval: 1 second
- Acceptable heartbeat pause: 3 seconds (accounts for GC pauses)
- Monitoring: each node monitors a configurable number of other nodes (not all-to-all)
Akka also provides a cluster sharding rebalance mechanism that uses failure detection to trigger shard rebalancing when nodes join or leave.
How Cassandra Detects Failures
Cassandra uses the phi accrual failure detector with:
- Default threshold:
- Heartbeat arrives via gossip protocol (every 1 second)
- Window size: 1000 samples
- Arrival window is per-node, allowing for different network characteristics to different peers
When a node is suspected:
- It is marked as DOWN in the gossip state
- If the node recovers and sends heartbeats again, it is marked UP
- If the node remains DOWN for a configurable time (default: 10 minutes), a human operator must manually remove it (to prevent data loss from automatic removal)
How ZooKeeper Detects Failures
ZooKeeper uses a session-based failure detection mechanism:
- Each client maintains a session with the ZooKeeper ensemble
- The session has a timeout
(configurable, typically 2-20 seconds) - The client sends heartbeats (pings) at
intervals - If the server doesn't receive a heartbeat within
, the session expires - Session expiration triggers ephemeral node deletion and watch notifications
This is essentially a fixed-timeout heartbeat mechanism, but the server side uses the concept of "session" to provide higher-level semantics (ephemeral nodes, watches).
ZooKeeper's leader election (for the ZooKeeper ensemble itself) uses:
tickTime: base heartbeat interval (default 2000ms)initLimit: ticks for initial synchronization (followers connecting to leader)syncLimit: ticks for followers to sync with leader during normal operation- If a follower doesn't hear from the leader within
syncLimit × tickTime, it triggers a new election
How Kubernetes Detects Node Failures
Kubernetes uses a multi-layered approach:
- kubelet heartbeats: Each node's kubelet sends heartbeats to the API server (default: every 10 seconds)
- Node controller: The kube-controller-manager monitors heartbeats. If no heartbeat is received within
node-monitor-grace-period(default: 40 seconds), the node is marked asNotReady - Pod eviction: After
pod-eviction-timeout(default: 5 minutes), pods onNotReadynodes are evicted (rescheduled elsewhere)
Total time from node failure to pod rescheduling: ~5 minutes 40 seconds by default. This is deliberately conservative to avoid cascading evictions during network blips.
Decision Framework
Choosing a Failure Detection Strategy
How many nodes?
├── < 50: All-to-all heartbeating is fine
│ ├── Need calibrated false positive rate? → Phi accrual
│ └── Fixed timeout is acceptable? → Adaptive timeout
├── 50 - 5000: Need scalable approach
│ ├── SWIM (O(n) messages, proven at scale)
│ └── Hierarchical (group-based monitoring)
└── > 5000: Hierarchical + SWIM hybrid
What's your priority?
├── Minimize false positives:
│ ├── Phi accrual with high threshold (ϕ ≥ 10)
│ └── SWIM with suspicion sub-protocol
├── Minimize detection time:
│ ├── Aggressive timeout (accept higher false positives)
│ └── Multiple probe paths (SWIM indirect probes)
└── Minimize network overhead:
└── SWIM (O(n) messages vs O(n²))
What failure model?
├── Crash-stop only:
│ └── Any heartbeat-based approach works
├── Crash-recovery:
│ └── Need incarnation numbers (SWIM) or epoch tracking
├── Omission (messages lost):
│ └── SWIM indirect probes handle this well
└── Byzantine (malicious):
└── Challenge-response with cryptographic verificationTimeout Configuration Guide
| Environment | Heartbeat Interval | Timeout | Rationale |
|---|---|---|---|
| Same rack | 100ms | 500ms | Ultra-low latency, negligible jitter |
| Same datacenter | 500ms - 1s | 2-5s | Low latency, minor jitter |
| Cross-datacenter | 2-5s | 10-30s | Higher latency, more jitter |
| Over internet | 5-10s | 30-60s | High latency, high variance |
| With GC pauses | Interval + max GC | 3× max GC | Must survive worst-case GC |
Advanced Topics
Failure Detectors and Consensus
The theoretical importance of failure detectors lies in their relationship to consensus:
Theorem (Chandra, Hadzilacos, Toueg, 1996):
This means every consensus algorithm (Paxos, Raft, Viewstamped Replication) implicitly relies on
Paxos: The leader is the proposer. If multiple proposers compete, Paxos may not terminate (livelock). In practice, a leader election protocol (implementing
Raft: The leader election mechanism IS the
Quorum-Based Failure Detection
Instead of each node independently deciding about failures, require a quorum of monitors to agree:
Benefits:
- Prevents split-brain caused by network partition between a single monitor and the monitored node
- Reduces false positives from localized network issues
- Provides stronger guarantees (closer to
than )
Costs:
- Requires communication between monitors to reach agreement
- Slower detection (must wait for quorum)
- Adds complexity
Self-Tuning Failure Detectors
Research systems that automatically optimize failure detection:
- Machine learning-based: Train a classifier on historical heartbeat patterns to predict crashes. Features include inter-arrival time statistics, trend direction, and system metrics (CPU, memory, disk)
- Workload-aware: Adjust timeouts based on current system load. During high-load periods, increase timeouts; during low-load periods, decrease them
- Network-topology-aware: Use knowledge of the network topology (same rack, same datacenter, cross-region) to set per-link timeout parameters
Failure Detection in Serverless
Serverless platforms face unique failure detection challenges:
- Functions are ephemeral — they start, execute, and die. Traditional heartbeating doesn't apply
- Cold starts look like slow responses, not failures
- The platform (not the function) must detect function failures
Serverless platforms use invocation-level timeout (each function call has a deadline) combined with platform-level health checks (the platform monitors the worker VMs that execute functions).
Speculative Failure Detection
Instead of waiting to confirm a failure, speculatively treat a slow response as a failure and hedge:
- Send the request to the primary node
- If no response within
-th percentile latency, send the same request to a backup node - Use whichever response arrives first
This is not traditional failure detection — it's a latency optimization that treats slow nodes as "effectively failed." Google's "tail at scale" paper describes this approach for reducing tail latency.
For
The Relationship Between Failure Detection and Consensus
There is a deep duality between failure detection and consensus:
- Failure detection can be used to build consensus (
consensus possible) - Consensus can be used to build failure detection (agree on who is alive)
This creates a chicken-and-egg problem: you need failure detection to build consensus, but the best failure detection uses consensus. In practice, this is resolved by using weaker, standalone failure detection (
War Story
A distributed database team implemented a phi accrual failure detector but set the threshold too low (
Failure Detectors Beyond Crash Failures
The classical theory focuses on crash failures, but modern systems face richer failure modes:
| Failure Mode | Detection Approach | Example |
|---|---|---|
| Crash-stop | Heartbeat timeout | Process terminates |
| Crash-recovery | Heartbeat + epoch number | Process restarts |
| Omission | Indirect probing | Dropped packets |
| Timing | Statistical analysis | Slow responses |
| Performance | Latency monitoring | Degraded throughput |
| Byzantine | Challenge-response + voting | Corrupted responses |
| Gray | Multi-observer correlation | Partial reachability |
Each failure mode requires specialized detection mechanisms. Production systems typically layer multiple detectors:
- L1 — Network liveness: Can we reach the node at all? (ICMP ping, TCP connect)
- L2 — Process liveness: Is the application process running? (Heartbeat response)
- L3 — Functional liveness: Can the process serve requests? (Health check endpoint)
- L4 — Correctness liveness: Is the process producing correct results? (Semantic health check)
A node might pass L1-L3 but fail L4 (producing wrong results due to data corruption). A comprehensive failure detection strategy checks all levels.
Key Takeaways
Failure detection in distributed systems is fundamentally constrained by the FLP impossibility result — in a purely asynchronous system, you cannot reliably distinguish a dead process from a slow one. The Chandra-Toueg framework provides the theoretical foundation for understanding what kinds of imperfect failure detection are sufficient for solving consensus.
In practice, the choice comes down to: phi accrual detectors for calibrated suspicion levels with tunable false positive rates, SWIM for scalable detection in large clusters, and application-specific health checks for detecting partial and gray failures. The most robust production systems layer multiple detection mechanisms, use conservative timeouts with gradual escalation, and never take irreversible action (like permanent node removal) based on a single failure detector's output.