NewSQL

NewSQL is the category of databases that solves the problem traditional databases left unsolved: how to provide the full power of SQL — ACID transactions, relational schema, secondary indexes, JOINs — while scaling horizontally across many machines and, in many cases, across multiple geographic regions.

For two decades, the choice was binary: use a relational database and accept its scaling limits, or use a NoSQL database and give up transactions, JOINs, and the relational model. NewSQL databases reject this trade-off. They achieve horizontal scalability through distributed consensus protocols (Raft, Paxos) while maintaining the SQL interface and ACID guarantees that developers depend on.

This page covers the four most important NewSQL databases — Google Spanner, CockroachDB, TiDB, and YugabyteDB — examining not just what they do but HOW they work internally.

What NewSQL Solves

The problems NewSQL addresses:

Problem	Traditional Solution	NewSQL Solution
Single-node write bottleneck	Application-level sharding (complex, error-prone)	Automatic range-based sharding with rebalancing
Cross-shard transactions	Two-phase commit across application shards (brittle)	Built-in distributed transactions with serializable isolation
Geographic distribution	Read replicas (stale) or manual multi-master (conflicting)	Consensus-based replication with configurable locality
Schema changes at scale	Downtime or complex rolling migrations	Online schema changes (DDL without locks)
Failover	Manual promotion of replicas	Automatic leader election via consensus

The core insight: if you split data into ranges, replicate each range via a consensus protocol (Raft or Paxos), and layer a SQL engine on top that can coordinate distributed transactions — you get a horizontally scalable SQL database with strong consistency.

Google Spanner

Spanner is Google's globally distributed database. It is the original NewSQL database (though Google doesn't use that term) and remains the most technically impressive. Its key innovation — TrueTime — is something no other database has replicated.

TrueTime

TrueTime is a globally synchronized clock API that returns not a single timestamp but an interval representing the uncertainty in the current time:

TrueTime.now() = [t_{earliest}, t_{latest}]

The actual time is guaranteed to be within this interval. The uncertainty (epsilon, $ε$ ) is typically 1-7 milliseconds.

How TrueTime works:

Each Google datacenter has a set of time masters — servers equipped with:

GPS receivers — synchronized to atomic clocks on GPS satellites
Atomic clocks (rubidium or cesium) — provide holdover when GPS signal is lost

Every server in the datacenter polls multiple time masters every 30 seconds using Marzullo's algorithm to compute its local uncertainty interval. Between polls, the uncertainty grows at the rate of the local clock's drift (about 200 microseconds per second).

External Consistency

Spanner provides external consistency — the strongest possible consistency guarantee for a distributed database. It is stronger than linearizability:

If transaction T_{1} commits before T_{2} starts (in real time),

then T_{1} ’s commit timestamp < T_{2} ’s commit timestamp.

This means the database's timestamp order matches the real-world order of events — even across continents.

How Spanner achieves this using TrueTime:

When a transaction commits at timestamp $s$ :

The leader acquires the commit timestamp $s$
The leader waits until $TrueTime.now().earliest > s$ — this is called the commit wait
Only after the wait does the leader respond to the client

The commit wait ensures that no other transaction can start with a timestamp less than $s$ after the commit response is sent to the client. This is because the waiting guarantees that $s$ is truly in the past by the time the response is sent.

Commit wait duration \leq 2 ε \approx 2 - 14 ms

Why Nobody Else Has TrueTime

TrueTime requires specialized hardware (GPS receivers and atomic clocks in every datacenter) and a global network of time masters. No cloud provider other than Google offers this. CockroachDB and YugabyteDB use regular NTP-synchronized clocks with looser uncertainty bounds — which means they cannot provide external consistency (they provide serializable isolation instead, which is slightly weaker but sufficient for virtually all applications).

Spanner Architecture

Key concepts:

Splits: Data is divided into contiguous ranges of the primary key (splits). Each split is replicated across zones using a Paxos group.

Paxos groups: Each split is managed by a Paxos group with a leader and followers across multiple zones. The leader handles all reads and writes for that split. Followers participate in consensus and serve stale reads.

Split/merge: When a split grows too large, Spanner automatically splits it into two. When adjacent splits shrink, they are merged. This is transparent to the application.

Directories: A directory is a set of contiguous keys that share a common prefix. Directories are the unit of data placement — you can control which zone a directory's Paxos leader lives in, enabling locality-aware data placement.

How Paxos Groups Work in Spanner

Each Paxos group in Spanner manages one split (range of data). The protocol ensures that all replicas agree on the order of operations:

CockroachDB

CockroachDB is the most successful open-source NewSQL database. It provides PostgreSQL-compatible SQL on top of a distributed key-value store replicated via Raft consensus.

Architecture Layers

The KV Layer

At its core, CockroachDB is a sorted, distributed key-value store. Every SQL table, index, and system table is encoded into key-value pairs:

Table: users (id INT PRIMARY KEY, name STRING, email STRING)

Encoded KV pairs:
/Table/users/1/name  → "Alice"
/Table/users/1/email → "alice@example.com"
/Table/users/2/name  → "Bob"
/Table/users/2/email → "bob@example.com"

Index: users_email_idx ON users(email)
/Index/users_email_idx/"alice@example.com" → /Table/users/1
/Index/users_email_idx/"bob@example.com"   → /Table/users/2

This sorted KV space is divided into ranges — contiguous spans of the key space. Each range is typically 512 MB and is replicated across 3 nodes using a Raft consensus group.

Range Distribution

Full KV space:
├─── Range 1: /Min ... /Table/users/1000    → Node 1 (leader), Node 3, Node 5
├─── Range 2: /Table/users/1001 ... /2000   → Node 2 (leader), Node 4, Node 6
├─── Range 3: /Table/users/2001 ... /3000   → Node 3 (leader), Node 1, Node 4
├─── Range 4: /Table/orders/1 ... /5000     → Node 4 (leader), Node 2, Node 6
└─── Range N: ...                           → ...

Range splitting: When a range exceeds 512 MB, it is automatically split into two ranges. The split happens at a key boundary that minimizes future splits.

Range merging: When adjacent ranges are both small (< 128 MB), they are merged into one range to reduce overhead.

Rebalancing: The system continuously monitors the distribution of ranges across nodes. If one node has significantly more ranges than others, the rebalancing algorithm moves ranges to achieve even distribution.

How Distributed Transactions Work

CockroachDB implements distributed transactions using a protocol inspired by Percolator (Google's incremental processing system on top of Bigtable). Every transaction in CockroachDB is serializable — the strongest isolation level.

Transaction lifecycle:

Key concepts:

Write intents: Instead of directly writing values, CockroachDB writes "intents" — provisional values tagged with the transaction ID. Intents are visible to concurrent transactions that can check the transaction record to determine if the intent is committed, aborted, or still pending.

Transaction record: Each transaction has a record stored in the range that contains the first write. This record tracks the transaction's status (PENDING, COMMITTED, ABORTED) and its commit timestamp.

Intent resolution: After a transaction commits, its intents are asynchronously resolved — converted from provisional intents to regular MVCC values. This is an optimization that avoids making the commit path wait for intent cleanup.

Timestamp ordering: CockroachDB uses MVCC with hybrid-logical clocks (HLC). Each value has a timestamp, and reads at timestamp T see only values written at timestamps <= T. Conflicts are resolved by timestamp ordering.

Serializable Isolation

CockroachDB provides serializable isolation by default — every transaction behaves as if it executed in serial order:

For any concurrent execution of transactions T_{1}, T_{2}, \dots, T_{n}

\exists a serial ordering T_{i_{1}}, T_{i_{2}}, \dots, T_{i_{n}} that produces the same result

This is achieved through:

MVCC timestamps — every read and write has a timestamp
Read refreshing — if a transaction's read timestamp is pushed forward due to a conflicting write, CockroachDB checks if the read would return the same result at the new timestamp. If yes, the transaction continues. If no, it retries.
Write-write conflict detection — if two transactions write to the same key, one will be aborted and retried.
Closed timestamp tracking — each range tracks a "closed timestamp" below which no new writes can occur, enabling consistent reads across ranges.

Follow-the-Workload

CockroachDB can automatically move range leaseholders closer to the nodes that read from them most frequently. This reduces read latency for geographic workloads.

Initial state:
  Range R1 leaseholder: Node in us-east
  Most reads for R1: coming from eu-west clients

After follow-the-workload:
  Range R1 leaseholder: Moved to Node in eu-west
  Read latency: Reduced from ~100ms to ~10ms

This is configured through zone configurations and locality flags:

sql

-- Tell CockroachDB where each node is located
cockroach start --locality=region=us-east,zone=us-east-1

-- Configure a table to prefer leases in a specific region
ALTER TABLE users CONFIGURE ZONE USING
    lease_preferences = '[[+region=eu-west]]';

TiDB

TiDB is a MySQL-compatible NewSQL database that separates compute (TiDB servers) from storage (TiKV) and supports both OLTP and OLAP workloads (HTAP) through TiFlash.

Architecture

TiKV Storage

TiKV is a distributed key-value store that uses RocksDB as its local storage engine (LSM tree) and Raft for replication.

Data organization:

The KV space is divided into Regions (similar to CockroachDB's Ranges). Each Region:

Covers a contiguous range of keys
Default size: 96 MB (configurable)
Is replicated across 3 TiKV nodes via a Raft group
Has one leader that handles all reads and writes

Region 1: [key_0000, key_1000) → TiKV-1 (leader), TiKV-2, TiKV-3
Region 2: [key_1000, key_2000) → TiKV-2 (leader), TiKV-3, TiKV-1
Region 3: [key_2000, key_3000) → TiKV-3 (leader), TiKV-1, TiKV-2

Raft Groups in TiKV

Each Region is a Raft group. The Raft leader handles all reads and writes, and replicates the write-ahead log to followers:

PD (Placement Driver) Scheduling

PD is TiDB's brain — it makes all scheduling decisions:

Timestamp Oracle (TSO):

Allocates globally unique, monotonically increasing timestamps
Every transaction gets a start timestamp and a commit timestamp from PD
Timestamps are used for MVCC — determines which versions a transaction can see
TSO throughput: ~1 million timestamps per second

Region scheduling:

Monitors Region sizes and splits large Regions
Monitors Region load and moves hot Regions to less-loaded nodes
Ensures each Region has the configured number of replicas
Handles node failures by re-replicating under-replicated Regions

Leader balancing:

Distributes Raft leaders evenly across TiKV nodes
Moves leaders closer to the clients that access them most

MySQL Compatibility

TiDB speaks the MySQL wire protocol and is compatible with most MySQL syntax:

sql

-- These all work in TiDB, same as MySQL:
CREATE TABLE users (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');

SELECT * FROM users WHERE email = 'alice@example.com';

-- MySQL dump/restore works:
-- mysqldump --host=tidb-host ... | mysql --host=tidb-host ...

Compatibility gaps:

Stored procedures and triggers: limited support
Some MySQL-specific functions may behave differently
AUTO_INCREMENT behavior differs (not strictly sequential — allocated in batches per TiDB server)
Foreign key constraints: supported since TiDB 6.6, but not recommended for performance reasons
SELECT ... FOR UPDATE semantics differ under certain conditions

HTAP with TiFlash

TiFlash is TiDB's columnar storage engine for analytical queries. It receives data from TiKV via Raft log replication, meaning it is always consistent with TiKV.

How the optimizer chooses:

The optimizer uses cost-based analysis to decide whether to read from TiKV (row store) or TiFlash (columnar store). You can also force the choice with a hint:

sql

-- Force TiFlash for an analytical query
SELECT /*+ READ_FROM_STORAGE(TIFLASH[orders]) */
    DATE(order_date) AS day,
    SUM(amount) AS total
FROM orders
WHERE order_date > '2024-01-01'
GROUP BY DATE(order_date);

YugabyteDB

YugabyteDB is a PostgreSQL-compatible NewSQL database built on a distributed document store called DocDB.

DocDB Storage

DocDB is YugabyteDB's custom storage engine. It uses a modified version of RocksDB underneath but adds a document-oriented layer on top:

SQL Table: users (id, name, email)

DocDB encoding:
SubDocKey(table_id, hash(id=1), column_id=name) → "Alice"
SubDocKey(table_id, hash(id=1), column_id=email) → "alice@example.com"
SubDocKey(table_id, hash(id=2), column_id=name) → "Bob"
SubDocKey(table_id, hash(id=2), column_id=email) → "bob@example.com"

Key features of DocDB:

Each column value is stored as a separate key-value pair (enables fine-grained MVCC)
Keys include MVCC timestamps (hybrid timestamps, similar to CockroachDB's HLC)
Values are encoded as a custom document format supporting nested structures
Uses hash-based or range-based sharding for the primary key

Raft Replication

Like CockroachDB and TiDB, YugabyteDB uses Raft for replication. The data is divided into tablets (equivalent to CockroachDB ranges or TiDB regions):

Tablet = Range of keys + Raft group (leader + followers)

Each tablet is replicated across 3+ nodes. The Raft leader handles all reads and writes for that tablet.

PostgreSQL Compatibility

YugabyteDB uses the PostgreSQL query layer — it is not just "compatible" with PostgreSQL, it actually embeds the PostgreSQL parser, analyzer, and rewrite engine:

sql

-- Full PostgreSQL syntax works:
CREATE TABLE events (
    id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now(),
    tags TEXT[] DEFAULT '{}'
);

CREATE INDEX idx_events_payload ON events USING GIN (payload);

SELECT id, payload->>'type' AS event_type
FROM events
WHERE payload @> '{"source": "web"}'
  AND created_at > now() - INTERVAL '7 days'
ORDER BY created_at DESC;

Compatibility level:

PostgreSQL wire protocol: full compatibility
SQL syntax: ~99% of PostgreSQL SQL
Extensions: many work (PostGIS, pg_trgm, etc.), but not all
System catalogs: fully compatible
Stored procedures (PL/pgSQL): supported
Foreign keys: supported (distributed enforcement)

YugabyteDB vs CockroachDB

Aspect	CockroachDB	YugabyteDB
SQL compatibility	PostgreSQL (partial)	PostgreSQL (deeper — embeds PG query engine)
Storage engine	Pebble (custom LSM)	DocDB (modified RocksDB)
Default isolation	Serializable	Snapshot (serializable available)
Sharding	Range-based only	Hash-based (default) or range-based
License	BSL (source-available)	Apache 2.0 (core)
HTAP	No columnar engine	No columnar engine
Geo-partitioning	Table-level locality	Row-level geo-partitioning
Change data capture	Built-in (changefeeds)	Built-in (CDC)
Maturity	More mature, larger community	Growing, strong PostgreSQL compat

Comparison Table: All NewSQL Databases

Feature	Spanner	CockroachDB	TiDB	YugabyteDB
SQL Compatibility	Google SQL (non-standard)	PostgreSQL	MySQL	PostgreSQL
Consensus Protocol	Paxos	Raft	Raft	Raft
Storage Engine	Custom (Colossus-based)	Pebble (LSM)	RocksDB (LSM)	DocDB (RocksDB-based)
Strongest Isolation	External consistency	Serializable	Snapshot + serializable	Serializable
Clock Sync	TrueTime (atomic clocks + GPS)	NTP + HLC	PD Timestamp Oracle	NTP + HLC
Sharding	Range-based (splits)	Range-based (ranges)	Range-based (regions)	Hash or range (tablets)
HTAP	No (use BigQuery)	No	Yes (TiFlash)	No
Deployment	Google Cloud only	Self-hosted or Cloud	Self-hosted or Cloud	Self-hosted or Cloud
License	Proprietary (managed)	BSL	Apache 2.0	Apache 2.0 (core)
Global Distribution	Native (designed for it)	Supported	Supported (limited)	Supported
Online DDL	Yes	Yes	Yes	Yes
Max Tested Scale	Exabytes (Google-internal)	100s of TB	100s of TB	100s of TB
Community Size	N/A (managed)	Large	Large (esp. in China)	Growing

When to Use NewSQL

NewSQL vs Sharded PostgreSQL

Choose NewSQL when:

You need distributed ACID transactions that span multiple shards
You want automatic shard management (splitting, merging, rebalancing)
You need geographic distribution with consistency guarantees
You want automatic failover without manual replica promotion
Your application's SQL complexity exceeds what application-level sharding can support

Choose sharded PostgreSQL (Citus, manual sharding, or Vitess) when:

Your sharding key cleanly partitions the data with few cross-shard queries
You need full PostgreSQL feature compatibility (all extensions, stored procedures)
Your data fits on a small number of shards (< 10)
You have existing PostgreSQL expertise and tooling
Write latency must be minimal (single-node writes are faster than distributed consensus)

The breakeven:

Sharded PG wins when: \frac{cross-shard queries}{total queries} < 5 %

NewSQL wins when: \frac{cross-shard queries}{total queries} > 20 %

Between 5% and 20%, both approaches work — choose based on operational preference.

NewSQL vs DynamoDB

Choose NewSQL when:

You need SQL (JOINs, aggregations, subqueries)
You need distributed transactions across multiple records
Access patterns are not fully known upfront (need ad-hoc queries)
You want to avoid vendor lock-in
You need secondary indexes that are consistent with the primary data

Choose DynamoDB when:

Access patterns are well-defined and stable (known key lookups)
You want zero operational overhead (fully managed, serverless)
You are already deep in the AWS ecosystem
You don't need JOINs or complex queries
Cost predictability at extreme scale matters more than flexibility

Decision Flowchart

Migration from PostgreSQL to CockroachDB

Migrating from single-node PostgreSQL to CockroachDB is the most common NewSQL migration path. Here is the practical process:

Step 1: Schema Compatibility Audit

sql

-- Features that need modification:
-- 1. SERIAL columns → Use UUID or INT with unique_rowid()
-- Instead of:
CREATE TABLE users (id SERIAL PRIMARY KEY, ...);
-- Use:
CREATE TABLE users (id UUID DEFAULT gen_random_uuid() PRIMARY KEY, ...);

-- 2. Sequences → Avoid if possible (distributed sequences are slow)
-- CockroachDB supports sequences but they become a bottleneck

-- 3. Enums → Supported since CockroachDB 20.2

-- 4. Triggers → NOT supported (move logic to application)

-- 5. Stored procedures → Limited support (move complex logic to application)

-- 6. Extensions → Most not available (pg_trgm, PostGIS limited, no custom C extensions)

Step 2: Data Migration

bash

# Option 1: IMPORT (fastest for initial load)
cockroach sql --execute="
IMPORT TABLE users (id UUID, name STRING, email STRING)
CSV DATA ('gs://mybucket/users.csv')
WITH delimiter = ',', skip = '1';
"

# Option 2: pg_dump + psql (compatibility mode)
pg_dump --no-owner --no-privileges source_db | \
    cockroach sql --database=target_db

# Option 3: CDC-based (zero-downtime migration)
# Use a tool like molt (CockroachDB's migration tool)
# or custom CDC pipeline with Debezium

Step 3: Query Compatibility Testing

Common query patterns that need adjustment:

sql

-- PostgreSQL: Subquery in FROM
SELECT * FROM (SELECT ...) AS subq;
-- CockroachDB: Usually works, but may need optimization hints

-- PostgreSQL: LATERAL JOIN
SELECT * FROM users, LATERAL (SELECT ...) AS sub;
-- CockroachDB: Supported since v21.1

-- PostgreSQL: Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) ...
-- CockroachDB: Fully supported

-- PostgreSQL: CTEs (WITH clause)
WITH regional_sales AS (SELECT ...)
SELECT * FROM regional_sales;
-- CockroachDB: Supported (note: CTEs are NOT materialized by default,
-- same as PostgreSQL 12+)

Step 4: Performance Tuning

sql

-- Key differences to tune for:
-- 1. Primary key design — AVOID sequential IDs
--    Bad:  id INT AUTO_INCREMENT  (creates hotspot on one range)
--    Good: id UUID DEFAULT gen_random_uuid()  (distributed across ranges)

-- 2. Add zone configurations for locality
ALTER TABLE users CONFIGURE ZONE USING
    num_replicas = 5,
    constraints = '{+region=us-east: 2, +region=eu-west: 2, +region=ap-south: 1}',
    lease_preferences = '[[+region=us-east]]';

-- 3. Use EXPLAIN ANALYZE to understand distributed query plans
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- Shows: which nodes were involved, network hops, execution time per node

Performance Characteristics and Trade-offs

Latency Comparison

Operation	Single-Node PostgreSQL	CockroachDB (same region)	CockroachDB (cross-region)	Spanner (cross-region)
Point read (by PK)	0.5ms	2-5ms	50-200ms*	10-100ms
Point write	0.5ms	5-15ms	100-300ms	15-50ms + commit wait
Range scan (1000 rows)	5ms	10-30ms	50-200ms*	20-100ms
Distributed txn (2 ranges)	N/A	10-30ms	200-500ms	30-100ms

*Cross-region reads can be fast if the leaseholder/leader is in the reading region.

Write Amplification

NewSQL databases have inherent write amplification due to consensus replication:

Write amplification = \underset{1x}{\underset{⏟}{W_{WAL}}} + \underset{1x per replica}{\underset{⏟}{W_{Raft log}}} + \underset{10-30x per replica}{\underset{⏟}{W_{LSM compaction}}}

For a 3-replica setup with LSM compaction factor of 10:

Total write amplification \approx 1 + 3 \times (1 + 10) = 34 x

This means a 1 MB write to the application generates ~34 MB of actual disk I/O across the cluster. This is the fundamental cost of distributed consistency.

Read Amplification

Reads in NewSQL databases are typically served by a single node (the Raft leader / leaseholder), so read amplification is similar to a single-node database. The exception is follower reads, which may introduce additional latency but reduce load on the leader.

Read latency = {\begin{cases} Local & if leaseholder is local \\ {RTT}_{to leader} + local & if leaseholder is remote \\ Stale read & if follower read is acceptable \end{cases}

The Fundamental Trade-off

NewSQL databases trade single-node latency for distributed correctness:

{Latency}_{NewSQL} = {Latency}_{single-node} + Consensus overhead + Network RTT

This trade-off is worth it when:

Your data exceeds a single node's capacity
You need automatic failover with zero data loss
You need geographic distribution
You need distributed transactions

It is NOT worth it when:

Your data fits on a single node
You can tolerate manual failover (promote a replica)
All your traffic is in one region
Cross-shard transactions are rare (< 5% of queries)

The Honest Assessment

If your dataset is under 1 TB, your write throughput is under 10,000 writes/second, and you operate in a single region — single-node PostgreSQL with a streaming replica for failover is simpler, faster, and cheaper than any NewSQL database. NewSQL's complexity is justified only when you have exhausted the vertical scaling of a single-node database or when you need multi-region consistency.

NewSQL ​

What NewSQL Solves ​

Google Spanner ​

TrueTime ​

External Consistency ​

Spanner Architecture ​

How Paxos Groups Work in Spanner ​

CockroachDB ​

Architecture Layers ​

The KV Layer ​

Range Distribution ​

How Distributed Transactions Work ​

Serializable Isolation ​

Follow-the-Workload ​

TiDB ​

Architecture ​

TiKV Storage ​

Raft Groups in TiKV ​

PD (Placement Driver) Scheduling ​

MySQL Compatibility ​

HTAP with TiFlash ​

YugabyteDB ​

DocDB Storage ​

Raft Replication ​

PostgreSQL Compatibility ​

YugabyteDB vs CockroachDB ​

Comparison Table: All NewSQL Databases ​

When to Use NewSQL ​

NewSQL vs Sharded PostgreSQL ​

NewSQL vs DynamoDB ​

Decision Flowchart ​

Migration from PostgreSQL to CockroachDB ​

Step 1: Schema Compatibility Audit ​

Step 2: Data Migration ​

Step 3: Query Compatibility Testing ​

Step 4: Performance Tuning ​

Performance Characteristics and Trade-offs ​

Latency Comparison ​

Write Amplification ​

Read Amplification ​

The Fundamental Trade-off ​

Related Pages