Distributed Systems
A distributed system is a collection of independent computers that appears to its users as a single coherent system. This seemingly simple definition conceals decades of research, thousands of papers, and some of the hardest problems in computer science.
This section doesn't just describe distributed systems — it takes you from the fundamental impossibility results (FLP, CAP) through the algorithms that work around them (Raft, Paxos, CRDTs) to the engineering decisions you'll make when building real systems.
Why Distributed Systems Matter
Every production system you build will be distributed. The moment you have:
- A web server talking to a database → distributed system
- Two services communicating over HTTP → distributed system
- A cache layer between your app and storage → distributed system
- A read replica for your database → distributed system
You cannot escape distribution. You can only choose whether to understand it or be surprised by it.
The Eight Fallacies
In 1994, Peter Deutsch and James Gosling identified assumptions that developers make about distributed systems that are all false:
- The network is reliable — packets drop, connections reset, cables get cut
- Latency is zero — every network call adds milliseconds to seconds
- Bandwidth is infinite — you can saturate any link with enough traffic
- The network is secure — every byte traversing the network can be intercepted
- Topology doesn't change — nodes join, leave, and fail constantly
- There is one administrator — multiple teams, multiple policies, multiple agendas
- Transport cost is zero — serialization, deserialization, encryption all cost CPU
- The network is homogeneous — different hardware, OS versions, protocol versions
Every decision in this section traces back to these realities.
Concept Map
Learning Path
Follow this order for the most coherent understanding. Richer than any single course — covers theory, algorithms, and production patterns.
Part 1 — Foundations
The theorems and models that shape every design decision.
| Order | Topic | Why This Order |
|---|---|---|
| 1 | CAP Theorem | The foundational trade-off — consistency vs availability during partitions |
| 2 | Consistency Models | What "consistent" actually means — strong, causal, eventual, and everything between |
| 3 | FLP Impossibility | Why consensus is impossible in async systems — and how real protocols escape it |
Part 2 — Data Distribution
How data gets spread across nodes and kept in sync.
| Order | Topic | Why This Order |
|---|---|---|
| 4 | Replication | Single-leader, multi-leader, leaderless — the full spectrum with trade-offs |
| 5 | Consistent Hashing | How to partition data across nodes with minimal reshuffling |
Part 3 — Consensus & Coordination
The algorithms that let distributed nodes agree on things.
| Order | Topic | Why This Order |
|---|---|---|
| 6 | Distributed Transactions | 2PC, 3PC, Sagas — coordinating writes across service boundaries |
| 7 | Paxos | The original consensus algorithm — theory every engineer should know |
| 8 | Leader Election | Bully, Ring, Raft election, ZooKeeper — and how to avoid split-brain |
| 9 | Distributed Locking | Mutex semantics over a network — fencing tokens, Redlock, and why it's hard |
Part 4 — Time & Order
Why clocks lie and how to order events without them.
| Order | Topic | Why This Order |
|---|---|---|
| 10 | Clock Synchronization | NTP, PTP, TrueTime — and why you still can't trust wall clocks |
| 11 | Vector Clocks & Lamport Timestamps | Causal ordering of events without synchronized clocks |
| 12 | Exactly-Once Semantics | Idempotency, transactional outbox, Kafka EOS — the hardest messaging guarantee |
Part 5 — Failure & Recovery
How nodes detect, gossip about, and recover from failures.
| Order | Topic | Why This Order |
|---|---|---|
| 13 | Failure Detectors | Φ-accrual, heartbeats, timeouts — how nodes know when peers are dead |
| 14 | Gossip Protocols | Epidemic information spreading — membership, failure detection at scale |
| 15 | Distributed Snapshots | Chandy-Lamport — capturing consistent global state of a running system |
Part 6 — Advanced Theory
The deep end — CRDTs, Byzantine failures, probabilistic structures.
| Order | Topic | Why This Order |
|---|---|---|
| 16 | CRDTs | Data structures that merge without conflicts — the elegant alternative to locking |
| 17 | Byzantine Fault Tolerance | When nodes can lie — PBFT, PoW, the 3f+1 requirement |
Part 7 — Practical Tools
Applied patterns you'll use daily in production.
| Order | Topic | Why This Order |
|---|---|---|
| 18 | Rate Limiting | Token bucket, sliding window, Redis-based distributed enforcement |
| 19 | Circuit Breaker | Failure isolation — preventing cascading failures across services |
| 20 | Bloom Filters | Probabilistic set membership — space-efficient duplicate detection |
| 21 | Queueing Theory | Little's Law, M/M/1 queues — the math behind capacity planning |
Key Insight
The entire field of distributed systems can be reduced to one question:
How do we get multiple machines to agree on something when any of them can fail at any time, messages can be lost or delayed, and there is no shared clock?
Every algorithm, every protocol, every pattern in this section is an answer to some aspect of this question, each making different trade-offs about what guarantees it provides and what it gives up.