Skip to content
Unverified — AI-generated content. Help verify this page

Transactional Outbox Pattern

The Dual-Write Problem

Every event-driven system faces the same fundamental problem: a service needs to update its database and publish an event. These are two separate operations targeting two separate systems. If either one fails after the other succeeds, the system becomes inconsistent.

Consider an order service that must save an order to its database and publish an OrderPlaced event to Kafka:

Three failure scenarios:

ScenarioWhat HappensResult
DB succeeds, Kafka failsOrder saved but event never publishedDownstream services never learn about the order (inventory not reserved, email not sent)
DB fails, Kafka succeedsEvent published but order not savedDownstream services react to a phantom order that doesn't exist
Both succeed, then DB rolls backTransaction rolled back after event publishedSame as above — phantom event

You cannot wrap a database transaction and a Kafka publish in a single atomic operation. They are separate systems with separate transaction managers. Distributed transactions (2PC/XA) technically work but are fragile, slow, and most message brokers don't support them.

The Dual-Write Problem Is Not Theoretical

This is one of the most common bugs in event-driven systems. It often goes undetected for months because the failure rate is low (only happens during network blips or broker outages). Then during a Kafka upgrade or a database failover, hundreds of events are lost or duplicated. The transactional outbox pattern eliminates this class of bugs entirely.

The Solution: Transactional Outbox

Instead of writing to two systems, write to one system — the database — in a single transaction. Store the event in an outbox table within the same database, inside the same transaction as the business data change. A separate process reads the outbox table and publishes events to the message broker.

The atomicity guarantee comes from the database transaction. Either both the order and the outbox event are committed, or neither is. The event relay (publisher) is a separate process that handles the second write (to Kafka) independently, with its own retry logic.

Outbox Table Design

Schema

sql
CREATE TABLE outbox (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type  VARCHAR(255) NOT NULL,     -- e.g., 'Order', 'Payment'
    aggregate_id    VARCHAR(255) NOT NULL,     -- e.g., the order_id
    event_type      VARCHAR(255) NOT NULL,     -- e.g., 'OrderPlaced', 'OrderCancelled'
    payload         JSONB NOT NULL,            -- the full event payload
    metadata        JSONB,                     -- correlation_id, causation_id, etc.
    created_at      TIMESTAMP NOT NULL DEFAULT NOW(),
    published_at    TIMESTAMP,                 -- NULL until published
    retry_count     INT NOT NULL DEFAULT 0,
    status          VARCHAR(20) NOT NULL DEFAULT 'PENDING'
        CHECK (status IN ('PENDING', 'PUBLISHED', 'FAILED'))
);

-- Index for the polling publisher
CREATE INDEX idx_outbox_pending ON outbox (created_at)
    WHERE status = 'PENDING';

-- Index for cleanup
CREATE INDEX idx_outbox_published ON outbox (published_at)
    WHERE status = 'PUBLISHED';

Key Design Decisions

DecisionRecommendationRationale
Payload formatJSONB (PostgreSQL)Allows querying event content for debugging without deserializing
Orderingcreated_at + idGuarantees deterministic ordering for the publisher
Aggregate fieldsaggregate_type + aggregate_idEnables routing events to different Kafka topics per aggregate
Status trackingPENDING / PUBLISHED / FAILEDThe publisher needs to know which events to process
RetentionDelete after 7 days of PUBLISHEDOutbox is transient, not an event store

Writing to the Outbox

typescript
// order.service.ts
import { Pool } from 'pg';
import { v4 as uuidv4 } from 'uuid';

interface CreateOrderRequest {
  customerId: string;
  items: Array<{ productId: string; quantity: number; price: number }>;
}

async function createOrder(pool: Pool, request: CreateOrderRequest): Promise<string> {
  const client = await pool.connect();

  try {
    await client.query('BEGIN');

    // 1. Insert the order (business data)
    const orderId = uuidv4();
    const totalAmount = request.items.reduce((sum, item) => sum + item.price * item.quantity, 0);

    await client.query(
      `INSERT INTO orders (id, customer_id, total_amount, status, created_at)
       VALUES ($1, $2, $3, 'CREATED', NOW())`,
      [orderId, request.customerId, totalAmount]
    );

    // 2. Insert order items
    for (const item of request.items) {
      await client.query(
        `INSERT INTO order_items (order_id, product_id, quantity, price)
         VALUES ($1, $2, $3, $4)`,
        [orderId, item.productId, item.quantity, item.price]
      );
    }

    // 3. Insert the outbox event (SAME TRANSACTION)
    const eventPayload = {
      orderId,
      customerId: request.customerId,
      totalAmount,
      items: request.items,
      occurredAt: new Date().toISOString(),
    };

    await client.query(
      `INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload, metadata)
       VALUES ($1, $2, $3, $4, $5)`,
      [
        'Order',
        orderId,
        'OrderPlaced',
        JSON.stringify(eventPayload),
        JSON.stringify({
          correlationId: uuidv4(),
          userId: request.customerId,
        }),
      ]
    );

    await client.query('COMMIT');
    return orderId;
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}

The Outbox Write Must Be in the Same Transaction

This is the entire point of the pattern. If the outbox write is in a separate transaction (or worse, a separate database), you have reintroduced the dual-write problem. The outbox table must be in the same database as the business data, and the writes must share the same transaction.

Publishing: Polling vs CDC

There are two approaches to reading events from the outbox and publishing them to the message broker.

Approach 1: Polling Publisher

The simplest approach. A background process periodically queries the outbox table for unpublished events and publishes them.

typescript
// polling-publisher.ts
import { Pool } from 'pg';
import { Kafka, Producer } from 'kafkajs';

class PollingPublisher {
  private pool: Pool;
  private producer: Producer;
  private intervalMs: number;
  private batchSize: number;

  constructor(pool: Pool, producer: Producer, intervalMs = 1000, batchSize = 100) {
    this.pool = pool;
    this.producer = producer;
    this.intervalMs = intervalMs;
    this.batchSize = batchSize;
  }

  async start(): Promise<void> {
    await this.producer.connect();

    setInterval(async () => {
      try {
        await this.publishBatch();
      } catch (error) {
        console.error('Polling publisher error:', error);
      }
    }, this.intervalMs);
  }

  private async publishBatch(): Promise<void> {
    const client = await this.pool.connect();

    try {
      // SELECT FOR UPDATE SKIP LOCKED prevents concurrent publishers
      // from processing the same events
      await client.query('BEGIN');

      const result = await client.query(
        `SELECT id, aggregate_type, aggregate_id, event_type, payload, metadata
         FROM outbox
         WHERE status = 'PENDING'
         ORDER BY created_at ASC
         LIMIT $1
         FOR UPDATE SKIP LOCKED`,
        [this.batchSize]
      );

      if (result.rows.length === 0) {
        await client.query('COMMIT');
        return;
      }

      // Publish to Kafka
      const messages = result.rows.map(row => ({
        topic: `${row.aggregate_type.toLowerCase()}.events`,
        messages: [{
          key: row.aggregate_id,
          value: JSON.stringify({
            eventId: row.id,
            eventType: row.event_type,
            aggregateId: row.aggregate_id,
            payload: row.payload,
            metadata: row.metadata,
          }),
        }],
      }));

      await this.producer.sendBatch({ topicMessages: messages });

      // Mark as published
      const ids = result.rows.map(r => r.id);
      await client.query(
        `UPDATE outbox SET status = 'PUBLISHED', published_at = NOW()
         WHERE id = ANY($1)`,
        [ids]
      );

      await client.query('COMMIT');
    } catch (error) {
      await client.query('ROLLBACK');
      throw error;
    } finally {
      client.release();
    }
  }
}

Polling Publisher Trade-offs:

AdvantageDisadvantage
Simple to implementAdds latency (polling interval)
No external dependenciesDatabase load from frequent polling
Easy to debug and monitorOrdering requires careful locking
Works with any databaseScaling requires SKIP LOCKED or partitioning

Approach 2: CDC with Debezium

Change Data Capture (CDC) reads the database's Write-Ahead Log (WAL) to capture inserts to the outbox table in real-time, without polling.

Debezium, the most popular open-source CDC tool, reads the PostgreSQL WAL and publishes every outbox insert as a Kafka message. This eliminates polling, reduces latency to near real-time, and adds zero load to the database.

Debezium Connector Configuration:

json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "orderservice",
    "database.server.name": "orderservice",
    "table.include.list": "public.outbox",

    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.fields.additional.placement": "event_type:header:eventType",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "${routedByValue}.events"
  }
}

Debezium's EventRouter transform is purpose-built for the outbox pattern. It extracts the event payload from the outbox row, routes it to the correct Kafka topic based on aggregate_type, and adds the event_type as a Kafka header.

CDC Publisher Trade-offs:

AdvantageDisadvantage
Near real-time latency (<100ms)Requires Debezium infrastructure (Kafka Connect)
Zero database polling loadMore complex to operate and monitor
Guaranteed ordering (WAL order)Database must support logical replication
Exactly the events that were committedWAL retention must be managed

Choosing Between Polling and CDC

Start with polling. It is simpler, easier to debug, and sufficient for most use cases. Move to CDC when: (a) you need sub-second latency, (b) polling frequency is causing database load, or (c) you already have Debezium/Kafka Connect in your infrastructure. Most teams never need CDC for the outbox — polling at 1-second intervals is fast enough.

The Inbox Pattern: Reliable Consumers

The outbox solves the producer side. The inbox pattern solves the consumer side: ensuring that events are processed exactly once, even if they are delivered multiple times (which is inevitable with at-least-once delivery).

Inbox Table

sql
CREATE TABLE inbox (
    event_id    UUID PRIMARY KEY,
    event_type  VARCHAR(255) NOT NULL,
    received_at TIMESTAMP NOT NULL DEFAULT NOW(),
    processed_at TIMESTAMP
);

-- Cleanup: delete entries older than 30 days
-- (events older than your Kafka retention period cannot be redelivered)

Consumer Implementation

typescript
// inbox-consumer.ts

async function handleEvent(pool: Pool, event: DomainEvent): Promise<void> {
  const client = await pool.connect();

  try {
    await client.query('BEGIN');

    // Attempt to insert into inbox (idempotency check)
    const insertResult = await client.query(
      `INSERT INTO inbox (event_id, event_type)
       VALUES ($1, $2)
       ON CONFLICT (event_id) DO NOTHING
       RETURNING event_id`,
      [event.eventId, event.eventType]
    );

    if (insertResult.rowCount === 0) {
      // Event already processed — skip
      await client.query('ROLLBACK');
      console.log(`Duplicate event ${event.eventId}, skipping`);
      return;
    }

    // Process the event (business logic)
    switch (event.eventType) {
      case 'OrderPlaced':
        await reserveInventory(client, event.payload);
        break;
      case 'OrderCancelled':
        await releaseInventory(client, event.payload);
        break;
      default:
        console.warn(`Unknown event type: ${event.eventType}`);
    }

    // Mark as processed
    await client.query(
      `UPDATE inbox SET processed_at = NOW() WHERE event_id = $1`,
      [event.eventId]
    );

    await client.query('COMMIT');
  } catch (error) {
    await client.query('ROLLBACK');
    throw error; // Will trigger redelivery from Kafka
  } finally {
    client.release();
  }
}

The inbox insert and the business logic are in the same transaction. If the business logic fails, the inbox entry is rolled back, and the event will be redelivered and retried. If the business logic succeeds, the inbox entry persists, and any redelivery is detected and skipped.

See Idempotent Consumers for a deeper exploration of idempotency strategies beyond the inbox pattern.

Outbox Maintenance

The outbox table accumulates rows over time. Without maintenance, it will grow unbounded and degrade performance.

sql
-- Delete published events older than 7 days
DELETE FROM outbox
WHERE status = 'PUBLISHED'
  AND published_at < NOW() - INTERVAL '7 days';

-- Alert on stuck events (pending for more than 5 minutes)
SELECT COUNT(*) AS stuck_events
FROM outbox
WHERE status = 'PENDING'
  AND created_at < NOW() - INTERVAL '5 minutes';

-- Move permanently failed events to a dead letter table
INSERT INTO outbox_dead_letter
SELECT * FROM outbox
WHERE status = 'FAILED'
  AND retry_count >= 5;

DELETE FROM outbox
WHERE status = 'FAILED'
  AND retry_count >= 5;

Monitor the Outbox

The outbox is a critical component. If the publisher stops working, events pile up and downstream systems go stale. Set up alerts on:

  • Number of PENDING events (should stay near zero)
  • Age of the oldest PENDING event (should be < polling interval)
  • FAILED event count (should be zero)
  • Outbox table size (should not grow unbounded)

End-to-End Architecture

Common Pitfalls

PitfallProblemSolution
Outbox in a separate databaseReintroduces dual-write problemOutbox must be in the same database as business data
Publishing before commitEvent published but transaction rolls backAlways publish after commit (the relay reads committed data)
No ordering guaranteeEvents arrive out of order at consumersPartition by aggregate_id in Kafka (all events for the same order go to the same partition)
Forgetting cleanupOutbox grows to millions of rows, slowing queriesSchedule daily cleanup of published events
No dead letter handlingPermanently failing events block the publisherMove failed events to a DLQ after N retries
Consumer processes before inbox checkSide effects executed before idempotency checkAlways check inbox first, then process

Further Reading

"What I cannot create, I do not understand." — Richard Feynman