Skip to content
Unverified — AI-generated content. Help verify this page

Designing Data-Intensive Applications Summary

Martin Kleppmann's Designing Data-Intensive Applications (DDIA) is the most important book in modern system design. It covers the fundamental principles behind the databases, message queues, and distributed systems you use daily. This page summarizes each part with the key concepts, connects them to our deep-dive pages, and highlights the insights that matter most for system design interviews and real-world architecture.

Part I: Foundations of Data Systems

Chapter 1: Reliable, Scalable, and Maintainable Applications

Every data system must balance three goals:

GoalDefinitionThreats
ReliabilitySystem works correctly even when things go wrongHardware faults, software bugs, human errors
ScalabilitySystem handles growth in data, traffic, or complexityLoad increases, data growth
MaintainabilitySystem can be worked on productively over timeComplexity, knowledge silos, technical debt

Key insight: Reliability is not just "stays up." It means the system performs correctly (returns right answers), tolerates faults (hardware, software, human), and prevents unauthorized access. A system that returns wrong answers with 100% uptime is not reliable.

Scalability is not "handles any load." It is the ability to describe load (what are your key parameters?), describe performance (what happens when load increases?), and identify approaches for coping with growth.

Maintainability has three sub-goals:

  • Operability — easy for ops teams to keep running (monitoring, automation, documentation)
  • Simplicity — easy for new engineers to understand (abstractions, clean APIs)
  • Evolvability — easy to make changes (decoupled modules, clear boundaries)

Chapter 2: Data Models and Query Languages

The data model shapes how you think about the problem. The choice between relational, document, and graph models is not about technology preference — it is about the shape of your data and access patterns.

ModelStructureBest ForWeakness
RelationalTables, rows, foreign keysStructured data with relationships, complex joinsSchema rigidity, object-relational mismatch
DocumentNested JSON/BSON documentsSelf-contained objects, variable schemasPoor joins, data duplication
GraphNodes + edgesHighly connected data, relationship queriesComplex query optimization

Key insight: Document databases are not "schema-less" — they are "schema-on-read." The schema still exists; it is just enforced by the application code instead of the database. This trades database-level guarantees for application-level flexibility.

The relational vs document debate is really about: do your data items have many relationships (relational wins), or do your data items are mostly self-contained (document wins)?

See our Database Selection Guide and Graph Databases.

Chapter 3: Storage and Data Structures

How databases physically store and retrieve data. The two main families:

EngineWrite SpeedRead SpeedSpace AmplificationUse Case
LSM-TreeFaster (sequential writes)Slower (check multiple levels)Lower (compaction)Write-heavy: logs, time-series
B-TreeSlower (random I/O for in-place updates)Faster (single tree traversal)Higher (page overhead)Read-heavy: OLTP, indexes

See our Storage Engines and Write-Ahead Logging pages.

Chapter 4: Encoding and Evolution

Data encoding (serialization) matters more than most engineers think. When you change a schema, old code must read new data (backward compatibility) and new code must read old data (forward compatibility).

FormatHuman-ReadableSchemaSizeSpeedCompatibility
JSONYesExternal (JSON Schema)LargeSlowManual
Protocol BuffersNoBuilt-in (.proto files)SmallFastExcellent
AvroNoBuilt-in (schema registry)SmallestFastBest (schema evolution)
ThriftNoBuilt-in (.thrift files)SmallFastGood

Key insight: Schema evolution is critical for zero-downtime deployments. You cannot update all producers and consumers simultaneously. Avro and Protocol Buffers handle this gracefully with field numbering and optional fields.

See our gRPC Internals (which uses Protocol Buffers).

Part II: Distributed Data

Chapter 5: Replication

Replication keeps copies of data on multiple machines for reliability, availability, and latency reduction.

Key insight: Replication lag is not a bug — it is a feature of asynchronous replication. The question is not whether lag exists, but what guarantees your application needs:

GuaranteeMeaningImplementation
Read-your-writesYou see your own writes immediatelyRoute reads to leader after write
Monotonic readsYou never see time go backwardPin user to same replica
Consistent prefix readsYou see causally related writes in orderUse causal ordering

See our Replication and Consistency Models pages.

Chapter 6: Partitioning (Sharding)

Partitioning splits data across nodes so each node handles a subset. This is the primary mechanism for scaling beyond a single machine.

StrategyHow It WorksProsCons
Key rangePartition by range (A-F, G-M, N-Z)Range scans are efficientHot spots if keys are sequential
HashPartition by hash(key) % NEven distributionRange scans require scatter-gather
CompositeHash the first part, range on the secondBalanced for time-series-like dataMore complex routing

Key insight: The hardest problem in partitioning is not splitting data — it is handling operations that span multiple partitions. A query that touches one partition is fast. A query that touches all partitions (scatter-gather) is slow. Design your partition key to match your access patterns.

See our Sharding and Consistent Hashing pages.

Chapter 7: Transactions

Transactions provide safety guarantees grouped under ACID:

PropertyMeaningReality
AtomicityAll or nothing — if any part fails, everything is rolled backWell-implemented everywhere
ConsistencyApplication invariants are always maintainedActually the application's responsibility
IsolationConcurrent transactions don't interfere with each otherThe complicated one — many levels
DurabilityCommitted data survives crashesWAL + fsync, but not absolute (disk failure)

Isolation levels (from weakest to strongest):

LevelPreventsAllowsPerformance
Read UncommittedNothingDirty reads, non-repeatable reads, phantomsFastest
Read CommittedDirty readsNon-repeatable reads, phantomsFast
Snapshot IsolationDirty reads, non-repeatable readsWrite skew, phantomsMedium
SerializableEverythingNothingSlowest

Key insight: Most databases default to Read Committed (PostgreSQL) or Repeatable Read (MySQL/InnoDB). True serializable isolation is rare in practice because of performance cost. Understanding the anomalies each level permits is crucial for designing correct systems.

See our Isolation Levels and MVCC pages.

Chapter 8: The Trouble with Distributed Systems

This chapter is a cold shower of reality. In distributed systems, you cannot rely on:

AssumptionReality
The network is reliablePackets get lost, delayed, duplicated, reordered
Latency is zeroCross-datacenter: 100ms+, cross-continent: 200ms+
Clocks are synchronizedClock skew of seconds is common; NTP is best-effort
Nodes don't failAny node can crash at any time, including mid-operation

Partial failures are the defining characteristic of distributed systems. In a single machine, either everything works or everything fails. In a distributed system, some parts can fail while others continue — and you cannot always tell which is which.

Key insight: "I sent a request and got no response" is ambiguous. The request may have been lost. The response may have been lost. The remote node may be dead. The remote node may be alive but slow. You cannot distinguish these cases without additional mechanisms (timeouts, heartbeats, consensus protocols).

See our Failure Detectors and CAP Theorem pages.

Chapter 9: Consistency and Consensus

Consensus is the fundamental problem of distributed systems: getting multiple nodes to agree on a value.

Linearizability is the strongest consistency model — every operation appears to take effect atomically at some point between invocation and response. It is expensive (requires coordination) and often unnecessary.

Key insight: Most applications do not need linearizability. They need causal consistency (if A happened before B, everyone sees them in that order). Causal consistency is achievable without the performance penalty of linearizability.

See our Raft Full Walkthrough, Paxos Made Simple, and Leader Election pages.

Part III: Derived Data

Chapter 10: Batch Processing

Batch processing transforms large datasets from one form to another. The fundamental model is MapReduce, but modern systems (Spark, Flink) improve on it.

Key insight: Batch processing has a beautiful property — if something goes wrong, you can rerun the entire job. Input data is immutable, so the process is repeatable. This makes batch systems much easier to reason about than systems that mutate state in place.

The Unix philosophy applied to data: small tools that do one thing well, connected by pipes (in data terms: datasets). MapReduce jobs are the "pipes" of data processing — each job reads input, transforms it, and writes output for the next job.

Chapter 11: Stream Processing

Stream processing handles events as they arrive, rather than waiting for a batch. An event stream is an unbounded dataset — it never ends.

ConceptBatchStream
InputBounded dataset (files)Unbounded event stream
ProcessingRun-to-completionContinuous
LatencyMinutes to hoursMilliseconds to seconds
Fault toleranceRerun the jobExactly-once semantics (harder)
TimeTrivial (all data is historical)Complex (event time vs processing time)

Key insight: The hardest problem in stream processing is not speed — it is handling time correctly. An event might arrive late (delayed in the network), and your windowed aggregation might have already closed. Do you drop it? Recompute? This "event time vs processing time" distinction is fundamental.

See our Kafka Internals, Kafka Streams, and Exactly-Once Semantics pages.

Chapter 12: The Future of Data Systems

Kleppmann's vision of the future revolves around "derived data" — systems where the source of truth is an immutable log of events, and all other representations (databases, caches, search indexes) are derived from that log.

Key insight: If the event log is the source of truth, you can rebuild any derived view by replaying the log. Lost your search index? Replay the log. Need a new analytics view? Create a new consumer. This is the foundation of event sourcing and CQRS.

See our CQRS: When to Use It and Event-Driven APIs pages.

DDIA Concepts to Knowledge Vault Mapping

DDIA ConceptOur Deep Dive Page
Reliability / Fault ToleranceCircuit Breaker
ScalabilityCost of Scale
Storage EnginesStorage Engines
ReplicationReplication
Partitioning / ShardingSharding
TransactionsIsolation Levels
ConsistencyConsistency Models
ConsensusRaft Full Walkthrough
Batch ProcessingData Engineering
Stream ProcessingKafka Internals
Derived DataCQRS
EncodinggRPC Internals
CAP TheoremCAP Theorem

Key Takeaways from DDIA

  1. There is no single "best" database — every choice is a tradeoff between read speed, write speed, consistency, and complexity
  2. Replication and partitioning are orthogonal — you can have either, both, or neither
  3. Transactions are not free — stronger isolation levels reduce concurrency
  4. Distributed systems are fundamentally different from single-node systems — partial failures change everything
  5. Consensus is expensive — avoid it when possible, but you need it for leader election and atomic commits
  6. Immutable event logs are the most flexible data architecture — you can derive any view from the log
  7. The most important question is always "what happens when X fails?" — not if, when

"What I cannot create, I do not understand." — Richard Feynman