Skip to content
System Design Interview0%

Distributed Systems

A distributed system is a collection of independent computers that appears to its users as a single coherent system. This seemingly simple definition conceals decades of research, thousands of papers, and some of the hardest problems in computer science.

This section doesn't just describe distributed systems — it takes you from the fundamental impossibility results (FLP, CAP) through the algorithms that work around them (Raft, Paxos, CRDTs) to the engineering decisions you'll make when building real systems.

Why Distributed Systems Matter

Every production system you build will be distributed. The moment you have:

  • A web server talking to a database → distributed system
  • Two services communicating over HTTP → distributed system
  • A cache layer between your app and storage → distributed system
  • A read replica for your database → distributed system

You cannot escape distribution. You can only choose whether to understand it or be surprised by it.

The Eight Fallacies

In 1994, Peter Deutsch and James Gosling identified assumptions that developers make about distributed systems that are all false:

  1. The network is reliable — packets drop, connections reset, cables get cut
  2. Latency is zero — every network call adds milliseconds to seconds
  3. Bandwidth is infinite — you can saturate any link with enough traffic
  4. The network is secure — every byte traversing the network can be intercepted
  5. Topology doesn't change — nodes join, leave, and fail constantly
  6. There is one administrator — multiple teams, multiple policies, multiple agendas
  7. Transport cost is zero — serialization, deserialization, encryption all cost CPU
  8. The network is homogeneous — different hardware, OS versions, protocol versions

Every decision in this section traces back to these realities.

Concept Map

Learning Path

Follow this order for the most coherent understanding. Richer than any single course — covers theory, algorithms, and production patterns.

Part 1 — Foundations

The theorems and models that shape every design decision.

OrderTopicWhy This Order
1CAP TheoremThe foundational trade-off — consistency vs availability during partitions
2Consistency ModelsWhat "consistent" actually means — strong, causal, eventual, and everything between
3FLP ImpossibilityWhy consensus is impossible in async systems — and how real protocols escape it

Part 2 — Data Distribution

How data gets spread across nodes and kept in sync.

OrderTopicWhy This Order
4ReplicationSingle-leader, multi-leader, leaderless — the full spectrum with trade-offs
5Consistent HashingHow to partition data across nodes with minimal reshuffling

Part 3 — Consensus & Coordination

The algorithms that let distributed nodes agree on things.

OrderTopicWhy This Order
6Distributed Transactions2PC, 3PC, Sagas — coordinating writes across service boundaries
7PaxosThe original consensus algorithm — theory every engineer should know
8Leader ElectionBully, Ring, Raft election, ZooKeeper — and how to avoid split-brain
9Distributed LockingMutex semantics over a network — fencing tokens, Redlock, and why it's hard

Part 4 — Time & Order

Why clocks lie and how to order events without them.

OrderTopicWhy This Order
10Clock SynchronizationNTP, PTP, TrueTime — and why you still can't trust wall clocks
11Vector Clocks & Lamport TimestampsCausal ordering of events without synchronized clocks
12Exactly-Once SemanticsIdempotency, transactional outbox, Kafka EOS — the hardest messaging guarantee

Part 5 — Failure & Recovery

How nodes detect, gossip about, and recover from failures.

OrderTopicWhy This Order
13Failure DetectorsΦ-accrual, heartbeats, timeouts — how nodes know when peers are dead
14Gossip ProtocolsEpidemic information spreading — membership, failure detection at scale
15Distributed SnapshotsChandy-Lamport — capturing consistent global state of a running system

Part 6 — Advanced Theory

The deep end — CRDTs, Byzantine failures, probabilistic structures.

OrderTopicWhy This Order
16CRDTsData structures that merge without conflicts — the elegant alternative to locking
17Byzantine Fault ToleranceWhen nodes can lie — PBFT, PoW, the 3f+1 requirement

Part 7 — Practical Tools

Applied patterns you'll use daily in production.

OrderTopicWhy This Order
18Rate LimitingToken bucket, sliding window, Redis-based distributed enforcement
19Circuit BreakerFailure isolation — preventing cascading failures across services
20Bloom FiltersProbabilistic set membership — space-efficient duplicate detection
21Queueing TheoryLittle's Law, M/M/1 queues — the math behind capacity planning

Key Insight

The entire field of distributed systems can be reduced to one question:

How do we get multiple machines to agree on something when any of them can fail at any time, messages can be lost or delayed, and there is no shared clock?

Every algorithm, every protocol, every pattern in this section is an answer to some aspect of this question, each making different trade-offs about what guarantees it provides and what it gives up.

"What I cannot create, I do not understand." — Richard Feynman