Designing Data-Intensive Applications Summary

Martin Kleppmann's Designing Data-Intensive Applications (DDIA) is the most important book in modern system design. It covers the fundamental principles behind the databases, message queues, and distributed systems you use daily. This page summarizes each part with the key concepts, connects them to our deep-dive pages, and highlights the insights that matter most for system design interviews and real-world architecture.

Part I: Foundations of Data Systems

Chapter 1: Reliable, Scalable, and Maintainable Applications

Every data system must balance three goals:

Goal	Definition	Threats
Reliability	System works correctly even when things go wrong	Hardware faults, software bugs, human errors
Scalability	System handles growth in data, traffic, or complexity	Load increases, data growth
Maintainability	System can be worked on productively over time	Complexity, knowledge silos, technical debt

Key insight: Reliability is not just "stays up." It means the system performs correctly (returns right answers), tolerates faults (hardware, software, human), and prevents unauthorized access. A system that returns wrong answers with 100% uptime is not reliable.

Scalability is not "handles any load." It is the ability to describe load (what are your key parameters?), describe performance (what happens when load increases?), and identify approaches for coping with growth.

Maintainability has three sub-goals:

Operability — easy for ops teams to keep running (monitoring, automation, documentation)
Simplicity — easy for new engineers to understand (abstractions, clean APIs)
Evolvability — easy to make changes (decoupled modules, clear boundaries)

Chapter 2: Data Models and Query Languages

The data model shapes how you think about the problem. The choice between relational, document, and graph models is not about technology preference — it is about the shape of your data and access patterns.

Model	Structure	Best For	Weakness
Relational	Tables, rows, foreign keys	Structured data with relationships, complex joins	Schema rigidity, object-relational mismatch
Document	Nested JSON/BSON documents	Self-contained objects, variable schemas	Poor joins, data duplication
Graph	Nodes + edges	Highly connected data, relationship queries	Complex query optimization

Key insight: Document databases are not "schema-less" — they are "schema-on-read." The schema still exists; it is just enforced by the application code instead of the database. This trades database-level guarantees for application-level flexibility.

The relational vs document debate is really about: do your data items have many relationships (relational wins), or do your data items are mostly self-contained (document wins)?

See our Database Selection Guide and Graph Databases.

Chapter 3: Storage and Data Structures

How databases physically store and retrieve data. The two main families:

Engine	Write Speed	Read Speed	Space Amplification	Use Case
LSM-Tree	Faster (sequential writes)	Slower (check multiple levels)	Lower (compaction)	Write-heavy: logs, time-series
B-Tree	Slower (random I/O for in-place updates)	Faster (single tree traversal)	Higher (page overhead)	Read-heavy: OLTP, indexes

See our Storage Engines and Write-Ahead Logging pages.

Chapter 4: Encoding and Evolution

Data encoding (serialization) matters more than most engineers think. When you change a schema, old code must read new data (backward compatibility) and new code must read old data (forward compatibility).

Format	Human-Readable	Schema	Size	Speed	Compatibility
JSON	Yes	External (JSON Schema)	Large	Slow	Manual
Protocol Buffers	No	Built-in (.proto files)	Small	Fast	Excellent
Avro	No	Built-in (schema registry)	Smallest	Fast	Best (schema evolution)
Thrift	No	Built-in (.thrift files)	Small	Fast	Good

Key insight: Schema evolution is critical for zero-downtime deployments. You cannot update all producers and consumers simultaneously. Avro and Protocol Buffers handle this gracefully with field numbering and optional fields.

See our gRPC Internals (which uses Protocol Buffers).

Part II: Distributed Data

Chapter 5: Replication

Replication keeps copies of data on multiple machines for reliability, availability, and latency reduction.

Key insight: Replication lag is not a bug — it is a feature of asynchronous replication. The question is not whether lag exists, but what guarantees your application needs:

Guarantee	Meaning	Implementation
Read-your-writes	You see your own writes immediately	Route reads to leader after write
Monotonic reads	You never see time go backward	Pin user to same replica
Consistent prefix reads	You see causally related writes in order	Use causal ordering

See our Replication and Consistency Models pages.

Chapter 6: Partitioning (Sharding)

Partitioning splits data across nodes so each node handles a subset. This is the primary mechanism for scaling beyond a single machine.

Strategy	How It Works	Pros	Cons
Key range	Partition by range (A-F, G-M, N-Z)	Range scans are efficient	Hot spots if keys are sequential
Hash	Partition by hash(key) % N	Even distribution	Range scans require scatter-gather
Composite	Hash the first part, range on the second	Balanced for time-series-like data	More complex routing

Key insight: The hardest problem in partitioning is not splitting data — it is handling operations that span multiple partitions. A query that touches one partition is fast. A query that touches all partitions (scatter-gather) is slow. Design your partition key to match your access patterns.

See our Sharding and Consistent Hashing pages.

Chapter 7: Transactions

Transactions provide safety guarantees grouped under ACID:

Property	Meaning	Reality
Atomicity	All or nothing — if any part fails, everything is rolled back	Well-implemented everywhere
Consistency	Application invariants are always maintained	Actually the application's responsibility
Isolation	Concurrent transactions don't interfere with each other	The complicated one — many levels
Durability	Committed data survives crashes	WAL + fsync, but not absolute (disk failure)

Isolation levels (from weakest to strongest):

Level	Prevents	Allows	Performance
Read Uncommitted	Nothing	Dirty reads, non-repeatable reads, phantoms	Fastest
Read Committed	Dirty reads	Non-repeatable reads, phantoms	Fast
Snapshot Isolation	Dirty reads, non-repeatable reads	Write skew, phantoms	Medium
Serializable	Everything	Nothing	Slowest

Key insight: Most databases default to Read Committed (PostgreSQL) or Repeatable Read (MySQL/InnoDB). True serializable isolation is rare in practice because of performance cost. Understanding the anomalies each level permits is crucial for designing correct systems.

See our Isolation Levels and MVCC pages.

Chapter 8: The Trouble with Distributed Systems

This chapter is a cold shower of reality. In distributed systems, you cannot rely on:

Assumption	Reality
The network is reliable	Packets get lost, delayed, duplicated, reordered
Latency is zero	Cross-datacenter: 100ms+, cross-continent: 200ms+
Clocks are synchronized	Clock skew of seconds is common; NTP is best-effort
Nodes don't fail	Any node can crash at any time, including mid-operation

Partial failures are the defining characteristic of distributed systems. In a single machine, either everything works or everything fails. In a distributed system, some parts can fail while others continue — and you cannot always tell which is which.

Key insight: "I sent a request and got no response" is ambiguous. The request may have been lost. The response may have been lost. The remote node may be dead. The remote node may be alive but slow. You cannot distinguish these cases without additional mechanisms (timeouts, heartbeats, consensus protocols).

See our Failure Detectors and CAP Theorem pages.

Chapter 9: Consistency and Consensus

Consensus is the fundamental problem of distributed systems: getting multiple nodes to agree on a value.

Linearizability is the strongest consistency model — every operation appears to take effect atomically at some point between invocation and response. It is expensive (requires coordination) and often unnecessary.

Key insight: Most applications do not need linearizability. They need causal consistency (if A happened before B, everyone sees them in that order). Causal consistency is achievable without the performance penalty of linearizability.

See our Raft Full Walkthrough, Paxos Made Simple, and Leader Election pages.

Part III: Derived Data

Chapter 10: Batch Processing

Batch processing transforms large datasets from one form to another. The fundamental model is MapReduce, but modern systems (Spark, Flink) improve on it.

Key insight: Batch processing has a beautiful property — if something goes wrong, you can rerun the entire job. Input data is immutable, so the process is repeatable. This makes batch systems much easier to reason about than systems that mutate state in place.

The Unix philosophy applied to data: small tools that do one thing well, connected by pipes (in data terms: datasets). MapReduce jobs are the "pipes" of data processing — each job reads input, transforms it, and writes output for the next job.

Chapter 11: Stream Processing

Stream processing handles events as they arrive, rather than waiting for a batch. An event stream is an unbounded dataset — it never ends.

Concept	Batch	Stream
Input	Bounded dataset (files)	Unbounded event stream
Processing	Run-to-completion	Continuous
Latency	Minutes to hours	Milliseconds to seconds
Fault tolerance	Rerun the job	Exactly-once semantics (harder)
Time	Trivial (all data is historical)	Complex (event time vs processing time)

Key insight: The hardest problem in stream processing is not speed — it is handling time correctly. An event might arrive late (delayed in the network), and your windowed aggregation might have already closed. Do you drop it? Recompute? This "event time vs processing time" distinction is fundamental.

See our Kafka Internals, Kafka Streams, and Exactly-Once Semantics pages.

Chapter 12: The Future of Data Systems

Kleppmann's vision of the future revolves around "derived data" — systems where the source of truth is an immutable log of events, and all other representations (databases, caches, search indexes) are derived from that log.

Key insight: If the event log is the source of truth, you can rebuild any derived view by replaying the log. Lost your search index? Replay the log. Need a new analytics view? Create a new consumer. This is the foundation of event sourcing and CQRS.

See our CQRS: When to Use It and Event-Driven APIs pages.

DDIA Concepts to Knowledge Vault Mapping

DDIA Concept	Our Deep Dive Page
Reliability / Fault Tolerance	Circuit Breaker
Scalability	Cost of Scale
Storage Engines	Storage Engines
Replication	Replication
Partitioning / Sharding	Sharding
Transactions	Isolation Levels
Consistency	Consistency Models
Consensus	Raft Full Walkthrough
Batch Processing	Data Engineering
Stream Processing	Kafka Internals
Derived Data	CQRS
Encoding	gRPC Internals
CAP Theorem	CAP Theorem

Key Takeaways from DDIA

There is no single "best" database — every choice is a tradeoff between read speed, write speed, consistency, and complexity
Replication and partitioning are orthogonal — you can have either, both, or neither
Transactions are not free — stronger isolation levels reduce concurrency
Distributed systems are fundamentally different from single-node systems — partial failures change everything
Consensus is expensive — avoid it when possible, but you need it for leader election and atomic commits
Immutable event logs are the most flexible data architecture — you can derive any view from the log
The most important question is always "what happens when X fails?" — not if, when

Database Selection Guide — choosing the right database
Consistency Models — deep dive into consistency
Raft Full Walkthrough — consensus in practice
Kafka Internals — the event log in practice
Sharding — partitioning strategies
Storage Engines — LSM vs B-Tree in detail
Isolation Levels — transaction isolation deep dive
CAP Theorem — consistency vs availability

Designing Data-Intensive Applications Summary ​

Part I: Foundations of Data Systems ​

Chapter 1: Reliable, Scalable, and Maintainable Applications ​

Chapter 2: Data Models and Query Languages ​

Chapter 3: Storage and Data Structures ​

Chapter 4: Encoding and Evolution ​

Part II: Distributed Data ​

Chapter 5: Replication ​

Chapter 6: Partitioning (Sharding) ​

Chapter 7: Transactions ​

Chapter 8: The Trouble with Distributed Systems ​

Chapter 9: Consistency and Consensus ​

Part III: Derived Data ​

Chapter 10: Batch Processing ​

Chapter 11: Stream Processing ​

Chapter 12: The Future of Data Systems ​

DDIA Concepts to Knowledge Vault Mapping ​

Key Takeaways from DDIA ​

Related Pages ​

Related Pages