System Design Characteristics

When someone asks "Is this a good system?", they are really asking about a set of measurable characteristics. Every system design discussion — every interview, every architecture review, every production decision — revolves around these properties.

This page teaches you the eight most important characteristics, what they actually mean, how they are measured, and what real-world numbers look like. Understanding these characteristics is like learning the vocabulary of system design. You cannot have a meaningful conversation about architecture without them.

The Eight Characteristics

These characteristics are often in tension with each other. Making a system more available might make it less consistent. Making it more durable might increase latency. System design is the art of choosing the right trade-offs for your specific use case.

1. Scalability

Scalability is the ability of a system to handle increased load by adding resources.

A system is scalable if performance does not degrade proportionally as load increases. If your system handles 1,000 users at 50ms latency, and it handles 10,000 users at 55ms latency (after adding resources), that is good scalability. If it handles 10,000 users at 500ms latency, that is bad scalability.

Measuring Scalability

Scalability is measured by how performance changes as you add resources:

Real-world example: A stateless web server scales almost linearly. If one server handles 1,000 requests/second, two servers handle ~2,000 requests/second, and ten servers handle ~10,000 requests/second. A database does not scale this well — adding a read replica might give you 1.7x read throughput (not 2x) because of replication overhead.

For a complete guide to scaling approaches, see Scaling Fundamentals.

Types of Scalability

Type	Definition	Example
Horizontal	Add more machines	Add web servers behind a load balancer
Vertical	Upgrade to a bigger machine	Upgrade from 16 GB to 128 GB RAM
Data	System handles growing data volume	Sharding a database as it grows to 10 TB
Geographic	System serves global users	Multi-region deployment

2. Availability

Availability is the percentage of time a system is operational and accessible.

This is usually expressed as a percentage, and the industry talks about "nines." The more nines, the more available the system.

The Nines Table

This table is critical. Memorize it.

Availability	Designation	Downtime/Year	Downtime/Month	Downtime/Week
99%	Two nines	3.65 days	7.3 hours	1.68 hours
99.9%	Three nines	8.77 hours	43.8 minutes	10.1 minutes
99.95%	Three and a half nines	4.38 hours	21.9 minutes	5.04 minutes
99.99%	Four nines	52.6 minutes	4.38 minutes	1.01 minutes
99.999%	Five nines	5.26 minutes	26.3 seconds	6.05 seconds
99.9999%	Six nines	31.5 seconds	2.63 seconds	0.605 seconds

WARNING

Going from 99.9% to 99.99% is much harder than going from 99% to 99.9%. Each additional nine typically costs 10x more in engineering effort and infrastructure.

Real-World Availability Numbers

Service	Typical SLA	What It Means
AWS S3	99.99%	52.6 minutes downtime/year
AWS EC2	99.99%	52.6 minutes downtime/year
Google Cloud SQL	99.95%	4.38 hours downtime/year
Stripe API	99.99%+	Almost no downtime
Your startup's first server	~99%	3.65 days downtime/year (be honest)

How to Calculate Availability of Combined Systems

If your system has multiple components in series (all must work), the total availability is the product:

System = Web Server → Database

Web Server availability: 99.9%
Database availability:   99.9%

System availability: 99.9% × 99.9% = 99.8%

That is worse than either component alone. Every component you add in series reduces overall availability.

If components are in parallel (either can work), availability improves:

Two database replicas, each 99.9% available:

Probability both are down: (1 - 0.999) × (1 - 0.999) = 0.000001
System availability: 1 - 0.000001 = 99.9999%

For strategies to improve availability, see Redundancy & Replication.

3. Reliability

Reliability is the probability that a system performs its intended function without failure over a given time period.

Availability asks "Is it up?". Reliability asks "Is it working correctly?" A system can be available but unreliable — it is running but returning wrong results.

Measuring Reliability: MTBF and MTTR

Two key metrics:

MTBF (Mean Time Between Failures): How long, on average, the system runs before something breaks.

MTTR (Mean Time To Recovery): How long, on average, it takes to fix the system after a failure.

Availability = MTBF / (MTBF + MTTR)

Example:

Your system fails once every 30 days (MTBF = 30 days)
It takes 2 hours to fix each time (MTTR = 2 hours)
Availability = 720 hours / (720 + 2) = 99.72%

To improve availability, you either:

Increase MTBF — Make failures less frequent (better hardware, better code, redundancy)
Decrease MTTR — Fix failures faster (automated failover, runbooks, monitoring)

Decreasing MTTR is usually easier and more impactful. This is why companies invest heavily in on-call processes, automated recovery, and monitoring. See Observability Tools.

Real-World Reliability Numbers

Component	MTBF	Source
Server-grade HDD	300,000 hours (~34 years)	Backblaze data
Server-grade SSD	2,000,000 hours (~228 years)	Manufacturer spec
AWS EC2 instance	~7,000 hours (~290 days)	Industry observation
In-house server	~1,000-5,000 hours	Varies widely

These numbers are averages. A server that "should" last 290 days might die on day 1. This is why you always plan for failures.

4. Consistency

Consistency means all users see the same data at the same time. In a distributed system (multiple servers), keeping data consistent is surprisingly hard.

The Three Levels of Consistency

Level	Guarantee	Latency Impact	Example
Strong	Every read sees the latest write	Higher (must wait for sync)	Bank balance after transfer
Eventual	Reads will catch up "eventually"	Lower (reads can be stale)	Social media likes count
Weak	No guarantee	Lowest	Video streaming (dropped frames)

When to use which:

Bank transactions: Strong consistency (you must see the correct balance)
Social media feed: Eventual consistency (it is OK if a like shows up 2 seconds late)
Live video: Weak consistency (a dropped frame is better than a paused video)

For a deep dive, see Consistency Models and CAP Theorem.

5. Latency

Latency is the time between a request being sent and the response being received. It is what users feel as "speed."

Understanding Percentiles

Latency is measured in percentiles because averages lie. If 99 requests take 10ms and 1 request takes 10,000ms, the average is 109ms — which describes nobody's actual experience.

Percentile	Meaning	Why It Matters
p50 (median)	50% of requests are faster than this	The "typical" experience
p90	90% of requests are faster than this	The experience for most users
p95	95% of requests are faster than this	Used in most SLAs
p99	99% of requests are faster than this	The tail — where expensive users live
p99.9	99.9% of requests are faster than this	Extreme tail — often reveals bugs

Real-World Latency Numbers

These are approximate numbers for reference. Memorize the order of magnitude.

Operation	Time
L1 cache reference	0.5 ns
L2 cache reference	7 ns
Main memory reference	100 ns
SSD random read	16 microseconds
HDD random read	2-10 ms
Read 1 MB from memory	3 microseconds
Read 1 MB from SSD	49 microseconds
Read 1 MB from HDD	825 microseconds
Redis GET	0.1-0.5 ms
PostgreSQL simple query	1-5 ms
PostgreSQL complex query	10-100 ms
Round trip within data center	0.5 ms
Round trip US coast-to-coast	40 ms
Round trip US to Europe	80-100 ms
Round trip US to Asia	150-200 ms

Jeff Dean's Numbers

The latency numbers above are famously attributed to Jeff Dean at Google. They help you estimate how fast or slow a system will be before you build it. When someone says "let's add a cache," you know a Redis GET (0.1ms) is 10-50x faster than a PostgreSQL query (5ms), which tells you caching is worth it.

What Causes High Latency

For performance debugging techniques, see API Slow Debugging Playbook and Node.js Profiling.

6. Throughput

Throughput is the amount of work a system can handle per unit of time. While latency is about how fast one request completes, throughput is about how many requests complete per second.

System	Throughput Metric	Typical Numbers
Web server (Node.js)	Requests per second	5,000-30,000 RPS
Web server (Go)	Requests per second	50,000-100,000 RPS
PostgreSQL	Queries per second	10,000-50,000 QPS
Redis	Commands per second	100,000-500,000 ops/sec
Kafka (single broker)	Messages per second	200,000-1,000,000 msg/sec
Nginx (static files)	Requests per second	100,000+ RPS

Throughput vs Latency: The Tradeoff

Analogy: A highway has two relevant metrics — how fast a single car can go (latency) and how many cars pass per hour (throughput). Making each car faster (wider lanes) is like reducing latency. Adding more lanes is like increasing throughput. At some point, too many cars on the highway cause traffic jams — throughput goes up but latency suffers.

Bottleneck Identification

The throughput of a system is limited by its slowest component (the bottleneck):

In this example, no matter how fast your load balancer, cache, or database are, the system can only handle 10,000 requests per second because the app server is the bottleneck.

7. Durability

Durability is the guarantee that once data is successfully written, it will not be lost — even if the server crashes, the disk fails, or the data center burns down.

Levels of Durability

Level	Protection Against	Example
In-memory only	Nothing (data lost on restart)	Redis without persistence
Single disk	Process crash	PostgreSQL with fsync
RAID (multiple disks)	Single disk failure	RAID-1 or RAID-10
Replication (multiple servers)	Server failure	PostgreSQL streaming replication
Multi-AZ	Availability zone failure	AWS RDS Multi-AZ
Multi-region	Regional disaster	Cross-region replication
Multi-cloud	Cloud provider outage	Data replicated to AWS + GCP

Real-World Durability Numbers

Service	Durability	What It Means
AWS S3	99.999999999% (11 nines)	In 10 million objects, you might lose 1 every 10 million years
AWS EBS	99.999% (5 nines)	Annual failure rate of 0.1-0.2%
Redis (AOF, every second)	~99.99%	Might lose up to 1 second of writes on crash
Redis (no persistence)	0%	Everything lost on restart

WARNING

Durability and availability are different. A system can be durable (data is safe) but unavailable (you cannot access it right now). S3 has 99.99% availability but 99.999999999% durability — meaning your data is almost certainly safe even if S3 has a brief outage.

8. Maintainability

Maintainability is how easy it is to operate, modify, and extend the system over time.

This is the least glamorous characteristic but arguably the most important in the long run. Most software spends far more time being maintained than being built.

Three Aspects of Maintainability

Real-world example of poor maintainability: A system with no tests, no documentation, no monitoring, and business logic scattered across 200 microservices that no one fully understands. Every change risks breaking something, deployments take a full day, and debugging requires tribal knowledge from the one engineer who built it in 2019.

For testing approaches that improve maintainability, see Unit Testing and Property-Based Testing.

How Characteristics Trade Off Against Each Other

Almost every design decision involves trading one characteristic for another:

If You Want More...	You Might Sacrifice...	Example
Consistency	Availability and latency	Synchronous replication waits for all nodes
Availability	Consistency	Eventual consistency allows stale reads
Low latency	Durability	Write to memory instead of disk
Throughput	Latency	Batch processing instead of real-time
Simplicity	Scalability	Monolith is simpler but harder to scale
Availability (more nines)	Cost	Redundancy costs money

The CAP theorem formalizes the most famous of these trade-offs: you cannot have perfect Consistency, Availability, and Partition tolerance simultaneously. See CAP Theorem.

Putting It Together: A Real Example

Consider designing a payments system (like Stripe):

Characteristic	Requirement	Why
Availability	99.999% (5 nines)	Payments must always work; downtime = lost revenue
Consistency	Strong	You cannot charge someone twice or lose a charge
Durability	99.999999999%	Financial records must never be lost
Latency	p99 < 500ms	Users expect fast checkout
Throughput	10,000+ TPS	Handle peak traffic (Black Friday)
Reliability	MTBF > 90 days, MTTR < 5 min	Failures must be rare and recovery automatic
Maintainability	High	Financial regulations require auditable changes
Scalability	Must handle 10x peak loads	Seasonal traffic spikes

Now compare that to a social media feed:

Characteristic	Requirement	Why
Availability	99.9%	Annoying but not catastrophic if down
Consistency	Eventual	OK if a like shows up 2 seconds late
Durability	99.99%	Losing a post is bad but not legally disastrous
Latency	p50 < 100ms	Feed must feel instant
Throughput	100,000+ RPS	Millions of users scrolling simultaneously
Scalability	Horizontal	Must handle viral moments

Different systems, different trade-offs, different designs.

Key Vocabulary Summary

Term	One-Line Definition
Scalability	Ability to handle increased load by adding resources
Availability	Percentage of time the system is operational
Reliability	Probability the system works correctly over time
Consistency	All users see the same data at the same time
Latency	Time from request to response
Throughput	Amount of work completed per time unit
Durability	Guarantee that written data survives failures
Maintainability	Ease of operating, understanding, and changing the system
MTBF	Mean Time Between Failures
MTTR	Mean Time To Recovery
SLA	Service Level Agreement (contractual availability guarantee)
p99	99th percentile latency

What to Learn Next

Scaling Fundamentals — How to achieve scalability in practice
Redundancy & Replication — How to achieve availability through redundancy
CAP Theorem — The formal proof that consistency and availability trade off
Estimation Practice — Practice calculating these numbers for real systems
Zero to Million Users — See how characteristics drive architectural decisions

Real-World Examples

Google Spanner

Google Spanner achieves strong consistency globally using GPS-synchronized TrueTime clocks across data centers. It provides serializable transactions with 99.999% availability — proving the CAP theorem is about trade-offs, not impossibilities. Spanner sacrifices some write latency (cross-region consensus) to give both consistency and availability.

Amazon DynamoDB

DynamoDB defaults to eventual consistency for reads (cheaper, faster) and offers strongly consistent reads as an opt-in per-request. This design choice reflects Amazon's "availability first" philosophy — they would rather show a slightly stale shopping cart than show an error page.

Stripe

Stripe designs for 99.999% availability on their payments API. They achieve this with multi-region active-active deployment, automatic failover, and extensive chaos engineering. Their durability requirement is absolute — losing a payment record is legally and financially unacceptable, so they use synchronous replication with write-ahead logs.

Interview Tip

What to say

"Every system design decision is a trade-off between characteristics. I always start by asking: which characteristics are non-negotiable? For a payments system, consistency and durability are mandatory — I'd use strong consistency even though it costs latency. For a social media feed, I'd choose eventual consistency and optimize for low latency and high throughput, because showing a like 2 seconds late is acceptable. The key insight is that each additional nine of availability costs roughly 10x more engineering effort."

System Design Characteristics ​

The Eight Characteristics ​

1. Scalability ​

Measuring Scalability ​

Types of Scalability ​

2. Availability ​

The Nines Table ​

Real-World Availability Numbers ​

How to Calculate Availability of Combined Systems ​

3. Reliability ​

Measuring Reliability: MTBF and MTTR ​

Real-World Reliability Numbers ​

4. Consistency ​

The Three Levels of Consistency ​

5. Latency ​

Understanding Percentiles ​

Real-World Latency Numbers ​

What Causes High Latency ​

6. Throughput ​

Throughput vs Latency: The Tradeoff ​

Bottleneck Identification ​

7. Durability ​

Levels of Durability ​

Real-World Durability Numbers ​

8. Maintainability ​

Three Aspects of Maintainability ​

How Characteristics Trade Off Against Each Other ​

Putting It Together: A Real Example ​

Key Vocabulary Summary ​

What to Learn Next ​

Real-World Examples ​

Interview Tip ​

Related Pages