Skip to content
Unverified — AI-generated content. Help verify this page

Distributed ID Generation

Every record needs a unique identifier. In a single-database system, AUTO_INCREMENT works perfectly. In distributed systems with multiple databases, services, and data centers, generating globally unique IDs without coordination is a fundamental challenge. The ID generation strategy you choose affects performance, sortability, storage efficiency, and even security.

Why Auto-Increment Fails at Scale

ProblemDescription
Single point of failureOne DB generates all IDs — if it goes down, nothing gets an ID
Throughput bottleneckAll services must call the same DB for IDs
Collision on shardingTwo shards independently generate the same ID
Information leakageSequential IDs reveal total count and creation rate
Merge conflictsCannot merge data from independent databases

UUID (Universally Unique Identifier)

UUIDs are 128-bit identifiers that can be generated anywhere without coordination. The probability of collision is astronomically low.

UUID v4 (Random)

550e8400-e29b-41d4-a716-446655440000
^^^^^^^^ ^^^^ ^^^^ ^^^^ ^^^^^^^^^^^^
 random  rand ver  var    random
python
import uuid

# UUID v4: 122 random bits, 6 bits for version/variant
id1 = uuid.uuid4()
print(id1)  # e.g., 7c9e6679-7425-40de-944b-e07fc1f90ae7

# Properties
print(f"Version: {id1.version}")  # 4
print(f"Bytes: {len(id1.bytes)}")  # 16 bytes = 128 bits
print(f"String length: {len(str(id1))}")  # 36 chars (with hyphens)

Pros: No coordination needed, universally unique, simple Cons: Not sortable by time, poor index performance (random inserts into B-tree), 36 chars as string, no embedded metadata

UUID v7 (Time-Ordered, RFC 9562)

UUID v7 is the modern answer. It embeds a Unix timestamp in the most significant bits, making it naturally sortable by creation time while maintaining uniqueness.

018e5c2e-9b3a-7def-8123-456789abcdef
^^^^^^^^ ^^^^ ^^^^ ^^^^ ^^^^^^^^^^^^
 unix_ms  ms  ver  var   random
python
import time
import os
import struct


def uuid_v7() -> str:
    """Generate a UUID v7 (time-ordered, RFC 9562)."""
    # 48-bit Unix timestamp in milliseconds
    timestamp_ms = int(time.time() * 1000)
    ts_bytes = struct.pack(">Q", timestamp_ms)[2:]  # 6 bytes

    # 10 bytes of randomness
    rand_bytes = bytearray(os.urandom(10))

    # Set version (7) in bits 48-51
    rand_bytes[0] = (rand_bytes[0] & 0x0F) | 0x70

    # Set variant (10) in bits 64-65
    rand_bytes[2] = (rand_bytes[2] & 0x3F) | 0x80

    raw = ts_bytes + bytes(rand_bytes)

    # Format as UUID string
    hex_str = raw.hex()
    return f"{hex_str[:8]}-{hex_str[8:12]}-{hex_str[12:16]}-{hex_str[16:20]}-{hex_str[20:]}"


# Generate time-ordered UUIDs
ids = [uuid_v7() for _ in range(5)]
for id in ids:
    print(id)
# They sort lexicographically by creation time!
assert ids == sorted(ids)

Twitter Snowflake

Twitter Snowflake generates 64-bit IDs that are time-ordered, unique across data centers, and can be generated without coordination. This is the most influential ID generation scheme in distributed systems.

Bit Layout

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|                    Timestamp (41 bits)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Timestamp    | Datacenter | Machine  |    Sequence (12)    |
|     (continued)  |  ID (5)    | ID (5)   |                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Bit allocation:
- 1 bit:  Sign (always 0)
- 41 bits: Timestamp (ms since custom epoch) — ~69 years
- 5 bits:  Datacenter ID (0-31)
- 5 bits:  Machine ID (0-31)
- 12 bits: Sequence number (0-4095 per ms)

Max IDs per machine per second: 4,096 * 1,000 = 4,096,000
python
import time
import threading


class SnowflakeGenerator:
    """Twitter Snowflake ID generator."""

    # Custom epoch: 2024-01-01T00:00:00Z
    EPOCH = 1704067200000

    TIMESTAMP_BITS = 41
    DATACENTER_BITS = 5
    MACHINE_BITS = 5
    SEQUENCE_BITS = 12

    MAX_DATACENTER_ID = (1 << DATACENTER_BITS) - 1   # 31
    MAX_MACHINE_ID = (1 << MACHINE_BITS) - 1          # 31
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1            # 4095

    def __init__(self, datacenter_id: int, machine_id: int):
        if datacenter_id > self.MAX_DATACENTER_ID or datacenter_id < 0:
            raise ValueError(f"Datacenter ID must be 0-{self.MAX_DATACENTER_ID}")
        if machine_id > self.MAX_MACHINE_ID or machine_id < 0:
            raise ValueError(f"Machine ID must be 0-{self.MAX_MACHINE_ID}")

        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self.sequence = 0
        self.last_timestamp = -1
        self.lock = threading.Lock()

    def generate(self) -> int:
        with self.lock:
            timestamp = self._current_time()

            if timestamp < self.last_timestamp:
                raise Exception("Clock moved backwards!")

            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & self.MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted in this ms, wait for next ms
                    timestamp = self._wait_next_ms(self.last_timestamp)
            else:
                self.sequence = 0

            self.last_timestamp = timestamp

            return (
                ((timestamp - self.EPOCH) << (self.DATACENTER_BITS + self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self.datacenter_id << (self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self.machine_id << self.SEQUENCE_BITS)
                | self.sequence
            )

    def _current_time(self) -> int:
        return int(time.time() * 1000)

    def _wait_next_ms(self, last_ts: int) -> int:
        ts = self._current_time()
        while ts <= last_ts:
            ts = self._current_time()
        return ts

    @staticmethod
    def parse(snowflake_id: int) -> dict:
        """Extract components from a Snowflake ID."""
        sequence = snowflake_id & 0xFFF
        machine_id = (snowflake_id >> 12) & 0x1F
        datacenter_id = (snowflake_id >> 17) & 0x1F
        timestamp = (snowflake_id >> 22) + SnowflakeGenerator.EPOCH
        return {
            "timestamp": timestamp,
            "datacenter_id": datacenter_id,
            "machine_id": machine_id,
            "sequence": sequence,
            "datetime": time.strftime(
                "%Y-%m-%d %H:%M:%S",
                time.gmtime(timestamp / 1000)
            ),
        }


# Usage
gen = SnowflakeGenerator(datacenter_id=1, machine_id=5)
id1 = gen.generate()
id2 = gen.generate()
print(f"ID: {id1}")                         # e.g., 7189462532997120005
print(f"Parsed: {gen.parse(id1)}")          # Shows timestamp, dc, machine, seq
print(f"Time ordered: {id1 < id2}")         # True
print(f"Fits in 64-bit int: {id1 < 2**63}")  # True

Go Implementation

go
package snowflake

import (
	"errors"
	"sync"
	"time"
)

const (
	epoch            = int64(1704067200000) // 2024-01-01T00:00:00Z
	datacenterBits   = 5
	machineBits      = 5
	sequenceBits     = 12
	maxDatacenterID  = -1 ^ (-1 << datacenterBits)
	maxMachineID     = -1 ^ (-1 << machineBits)
	maxSequence      = -1 ^ (-1 << sequenceBits)
	machineShift     = sequenceBits
	datacenterShift  = sequenceBits + machineBits
	timestampShift   = sequenceBits + machineBits + datacenterBits
)

type Generator struct {
	mu            sync.Mutex
	datacenterID  int64
	machineID     int64
	sequence      int64
	lastTimestamp int64
}

func New(datacenterID, machineID int64) (*Generator, error) {
	if datacenterID < 0 || datacenterID > maxDatacenterID {
		return nil, errors.New("datacenter ID out of range")
	}
	if machineID < 0 || machineID > maxMachineID {
		return nil, errors.New("machine ID out of range")
	}
	return &Generator{
		datacenterID: datacenterID,
		machineID:    machineID,
	}, nil
}

func (g *Generator) Generate() (int64, error) {
	g.mu.Lock()
	defer g.mu.Unlock()

	now := time.Now().UnixMilli()

	if now < g.lastTimestamp {
		return 0, errors.New("clock moved backwards")
	}

	if now == g.lastTimestamp {
		g.sequence = (g.sequence + 1) & maxSequence
		if g.sequence == 0 {
			for now <= g.lastTimestamp {
				now = time.Now().UnixMilli()
			}
		}
	} else {
		g.sequence = 0
	}

	g.lastTimestamp = now

	id := ((now - epoch) << timestampShift) |
		(g.datacenterID << datacenterShift) |
		(g.machineID << machineShift) |
		g.sequence

	return id, nil
}

ULID (Universally Unique Lexicographically Sortable Identifier)

ULID combines timestamp ordering with randomness in a URL-safe format.

 01AN4Z07BY      79KA1307SR9X4MV3
|----------|    |----------------|
 Timestamp          Randomness
  48 bits            80 bits
  10 chars           16 chars
python
import os
import time
import struct

# Crockford's Base32 encoding
ENCODING = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def generate_ulid() -> str:
    """Generate a ULID: 48-bit timestamp + 80-bit random."""
    timestamp_ms = int(time.time() * 1000)

    # Encode timestamp (48 bits -> 10 chars)
    ts_chars = []
    for _ in range(10):
        ts_chars.append(ENCODING[timestamp_ms & 0x1F])
        timestamp_ms >>= 5
    ts_part = "".join(reversed(ts_chars))

    # Encode randomness (80 bits -> 16 chars)
    rand_bytes = int.from_bytes(os.urandom(10), "big")
    rand_chars = []
    for _ in range(16):
        rand_chars.append(ENCODING[rand_bytes & 0x1F])
        rand_bytes >>= 5
    rand_part = "".join(reversed(rand_chars))

    return ts_part + rand_part


# ULIDs sort lexicographically by time
ids = [generate_ulid() for _ in range(5)]
print(ids)
assert ids == sorted(ids)  # Time-ordered!

# Properties:
# - 26 characters (vs 36 for UUID)
# - Case-insensitive
# - No special characters (URL-safe)
# - Monotonic within same millisecond (if implemented correctly)
# - 128 bits total (same as UUID)

Instagram ID Scheme

Instagram needed IDs that are:

  • Time-ordered (for feed sorting)
  • Generated across multiple PostgreSQL shards
  • 64-bit (fit in a bigint column)

Their solution: a custom scheme similar to Snowflake but using PostgreSQL sequences.

sql
-- Instagram's ID generation using PostgreSQL
-- Each shard has its own sequence, ensuring uniqueness

CREATE SCHEMA IF NOT EXISTS insta5;

CREATE SEQUENCE insta5.table_id_seq;

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;  -- Instagram epoch
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;  -- This shard's ID
BEGIN
    SELECT nextval('insta5.table_id_seq') INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;

    result := (now_millis - our_epoch) << 23;   -- 41 bits timestamp
    result := result | (shard_id << 10);         -- 13 bits shard ID
    result := result | (seq_id & 1023);          -- 10 bits sequence
END;
$$ LANGUAGE PLPGSQL;

-- Usage:
-- INSERT INTO photos (id, ...) VALUES (insta5.next_id(), ...);

Bit layout:

  • 41 bits: Timestamp (ms since custom epoch) — ~69 years
  • 13 bits: Shard ID (0-8191 logical shards)
  • 10 bits: Sequence (0-1023 per ms per shard)

Flickr Ticket Servers

Flickr uses dedicated "ticket server" databases that do nothing but generate sequential IDs. Two servers for redundancy, using MySQL's auto-increment with different offsets.

sql
-- Ticket Server 1: generates odd numbers
-- auto_increment_increment = 2
-- auto_increment_offset = 1
-- IDs: 1, 3, 5, 7, 9, ...

-- Ticket Server 2: generates even numbers
-- auto_increment_increment = 2
-- auto_increment_offset = 2
-- IDs: 2, 4, 6, 8, 10, ...

CREATE TABLE tickets_64 (
    id BIGINT NOT NULL AUTO_INCREMENT,
    stub CHAR(1) NOT NULL DEFAULT '',
    PRIMARY KEY (id),
    UNIQUE KEY stub (stub)
) ENGINE=InnoDB;

-- Get next ID:
REPLACE INTO tickets_64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

Pros: Simple, sequential, proven Cons: Single point of coordination (even with two servers), network hop required, not truly distributed

Comprehensive Comparison

SchemeSizeSortableCoordinationSpeedInfo LeakageIndex Performance
Auto-increment32/64 bitYesSingle DBHighHigh (count visible)Excellent
UUID v4128 bitNoNoneHighNonePoor (random B-tree inserts)
UUID v7128 bitYes (time)NoneHighMinimal (time only)Good (sequential)
Snowflake64 bitYes (time)Machine ID assignmentVery highLow (time, DC)Excellent
ULID128 bitYes (time)NoneHighMinimal (time only)Good
Instagram64 bitYes (time)Shard sequenceHighLow (time, shard)Excellent
Flickr tickets64 bitYesTicket serversMediumHigh (sequential)Excellent
NanoIDVariableNoNoneHighNonePoor

Choosing the Right Strategy

Implementation Considerations

Clock Skew Handling

Snowflake-style IDs depend on monotonic clocks. If the clock jumps backward (NTP correction, VM migration), you can generate duplicate IDs.

python
class ClockSkewSafeGenerator:
    """Handle clock skew in Snowflake-style ID generation."""

    def __init__(self, machine_id: int):
        self.machine_id = machine_id
        self.sequence = 0
        self.last_timestamp = 0

    def generate(self) -> int:
        current_ts = self._current_time_ms()

        if current_ts < self.last_timestamp:
            # Clock went backward!
            skew_ms = self.last_timestamp - current_ts

            if skew_ms < 5:
                # Small skew: wait it out
                import time
                time.sleep(skew_ms / 1000.0)
                current_ts = self._current_time_ms()
            else:
                # Large skew: use last timestamp + increment
                current_ts = self.last_timestamp

        # ... rest of generation logic
        self.last_timestamp = current_ts
        return self._build_id(current_ts)

    def _current_time_ms(self) -> int:
        import time
        return int(time.time() * 1000)

    def _build_id(self, timestamp: int) -> int:
        # Simplified Snowflake ID assembly
        self.sequence = (self.sequence + 1) & 0xFFF
        return (timestamp << 22) | (self.machine_id << 12) | self.sequence

Database Index Performance

sql
-- UUID v4: Random inserts fragment the B-tree index
-- Each insert may go to a random page, causing cache misses and page splits
CREATE TABLE users_v4 (
    id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    name TEXT
);
-- Insert performance degrades as table grows

-- UUID v7 / Snowflake: Sequential inserts are index-friendly
-- New IDs always go to the end of the B-tree (append-only pattern)
CREATE TABLE users_v7 (
    id BIGINT PRIMARY KEY,  -- Snowflake ID
    name TEXT
);
-- Insert performance remains constant

-- Benchmark results (1M rows, PostgreSQL 16):
-- UUID v4 inserts: ~8,000/sec (degrades to ~3,000/sec at 10M rows)
-- Snowflake inserts: ~25,000/sec (consistent at any table size)

Cross-References


Choose the simplest ID scheme that meets your requirements. If you are on a single database, auto-increment is fine. If you need distributed uniqueness without time ordering, UUID v4 is the zero-coordination choice. If you need time ordering, UUID v7 is the modern standard. Only reach for Snowflake if you need 64-bit IDs at massive scale.

Real-World Examples

Twitter (Snowflake)

Twitter created the Snowflake ID generator in 2010 to solve tweet ID generation across multiple data centers. Each tweet ID encodes the creation timestamp, data center, machine, and sequence number in a 64-bit integer — enabling 4 million IDs per second per machine with zero coordination between machines. The time-ordered property allows chronological sorting without an additional index.

Instagram

Instagram designed a custom 64-bit ID scheme using PostgreSQL sequences. Each database shard has its own sequence, and the ID encodes timestamp (41 bits) + shard ID (13 bits) + sequence (10 bits). This fits in a PostgreSQL bigint column, is time-sortable for feed ordering, and avoids the index fragmentation problem of random UUIDs — all using standard database features without external services.

Stripe

Stripe uses prefixed random IDs like ch_1234abcd (charge), pi_5678efgh (payment intent), and cus_9012ijkl (customer). The prefix makes IDs self-documenting in logs and API calls — you can tell at a glance what type of resource an ID refers to. The random suffix provides uniqueness without coordination. This pattern prioritizes developer experience over sortability.

Interview Tip

What to say

"My default is UUID v7 for most distributed systems — it's time-ordered (good for database indexes, since sequential inserts are 3x faster than random UUID v4 in B-trees), requires zero coordination, and is a formal RFC standard. If I need 64-bit IDs (for storage efficiency or JavaScript number safety), I'd use Snowflake — it generates 4 million IDs per second per machine and embeds the timestamp for natural sorting. The key trade-off is that Snowflake requires assigning machine IDs, while UUID v7 is truly coordination-free. I'd only use auto-increment for single-database systems because it creates a single point of failure in distributed setups."

"What I cannot create, I do not understand." — Richard Feynman