Skip to content
Unverified — AI-generated content. Help verify this page

Webhook Infrastructure at Scale

The Webhook Design Patterns page covers the fundamentals — payload design, signatures, and retry logic. This page goes deeper: building webhook infrastructure that reliably delivers millions of events per day, handles consumer failures gracefully, provides operational visibility, and scales horizontally.

If you are deciding whether to build or buy, we also compare the three leading webhook-as-a-service platforms: Svix, Convoy, and Hookdeck.


The Scale of the Problem

When you send a few hundred webhooks per day, everything works. When you send millions, every edge case becomes a daily occurrence:

Failure ModeFrequency at ScaleImpact
Consumer returns 500Hundreds per hourRetry backlog grows
Consumer is down for hoursWeeklyMillions of retries queued
Consumer is permanently goneMonthlyDead letter queue fills up
DNS resolution failsDailyTransient delivery failures
TLS certificate expiredMonthlyCannot connect to consumer
Consumer is slow (> 30s)Thousands per dayTies up worker threads
Consumer returns 200 but does not processUnknownSilent data loss
Network partitionQuarterlyEntire regions lose connectivity
Your outage sends burst on recoveryQuarterlyThundering herd to consumers

Architecture for Reliable Delivery

Key Design Decisions

DecisionRecommendationWhy
Queue technologyKafka or SQS with DLQDurability, at-least-once, built-in retry
Worker poolDedicated per-consumer poolsOne slow consumer cannot block others
Timeout30 seconds maxPrevent thread exhaustion
Max retries5-8 attempts over 24-48 hoursBalance reliability vs resource waste
IdempotencyInclude event ID, advise consumers to deduplicateAt-least-once delivery means duplicates
OrderingBest-effort (not guaranteed)Strict ordering is extremely expensive

Retry Strategy

Exponential Backoff with Jitter

python
import random
import time

def compute_retry_delay(attempt: int, base_delay: int = 30) -> int:
    """
    Exponential backoff with full jitter.

    Attempt 1: 0-60s    (random within 0 to 2*base)
    Attempt 2: 0-120s
    Attempt 3: 0-240s
    Attempt 4: 0-480s
    Attempt 5: 0-960s   (~16 min max)
    Attempt 6: 0-1920s  (~32 min max)
    Attempt 7: 0-3600s  (capped at 1 hour)
    """
    max_delay = min(base_delay * (2 ** attempt), 3600)  # Cap at 1 hour
    return random.randint(0, max_delay)

# Retry schedule (expected values with jitter):
# Attempt 1:  ~30 seconds later
# Attempt 2:  ~1 minute later
# Attempt 3:  ~2 minutes later
# Attempt 4:  ~4 minutes later
# Attempt 5:  ~8 minutes later
# Attempt 6:  ~16 minutes later
# Attempt 7:  ~30 minutes later
# Attempt 8:  ~1 hour later
# Total span: ~2 hours (but up to 24 hours with variance)

TIP

Full jitter (random between 0 and max) is better than equal jitter (delay/2 + random(delay/2)). It provides better spread and reduces the chance of correlated retries across consumers. See AWS Architecture Blog's analysis for the math.

Retry Response Code Handling

ResponseActionReasoning
2xxSuccess, log, move onConsumer accepted the event
301/302Follow redirect onceConsumer moved; update endpoint if permanent
400Dead letter, do NOT retryPayload is malformed — retrying will not fix it
401/403Dead letter, alert consumerAuth issue — consumer needs to fix credentials
404Dead letter after 3 attemptsEndpoint may not exist yet (race condition)
408RetryRequest timeout, likely transient
410Dead letter, disable endpointConsumer explicitly says "gone"
429Retry with Retry-After headerConsumer is rate limiting you
500/502/503Retry with backoffServer error, likely transient
Connection refusedRetry with backoffConsumer is down
DNS failureRetry with backoffDNS is likely transient
TLS errorRetry 2x, then dead letter + alertCertificate issue — consumer needs to fix
python
class WebhookDeliveryWorker:
    def deliver(self, event, endpoint):
        try:
            response = httpx.post(
                endpoint.url,
                json=event.payload,
                headers=self.build_headers(event, endpoint),
                timeout=httpx.Timeout(
                    connect=5.0,
                    read=25.0,
                    write=5.0,
                    pool=5.0
                ),
            )

            if response.status_code < 300:
                self.mark_success(event, endpoint, response)
                return

            if response.status_code in (400, 401, 403, 410):
                self.dead_letter(event, endpoint, response,
                               reason=f"Client error: {response.status_code}")
                if response.status_code == 410:
                    self.disable_endpoint(endpoint)
                return

            if response.status_code == 429:
                retry_after = int(response.headers.get('Retry-After', 60))
                self.schedule_retry(event, endpoint, delay=retry_after)
                return

            # 5xx — retry with backoff
            self.schedule_retry(event, endpoint)

        except httpx.ConnectError:
            self.schedule_retry(event, endpoint)
        except httpx.TimeoutException:
            self.schedule_retry(event, endpoint)
        except ssl.SSLError:
            if event.attempt < 2:
                self.schedule_retry(event, endpoint)
            else:
                self.dead_letter(event, endpoint, reason="TLS error")
                self.alert_consumer(endpoint, "TLS certificate issue")

Consumer Isolation

The most critical design requirement: one slow or failing consumer must not affect delivery to others.

Per-Consumer Queue Architecture

python
class WebhookDispatcher:
    """Fan out events to per-consumer queues."""

    def dispatch(self, event):
        # Find all subscriptions for this event type
        subscriptions = self.subscription_store.get_active(
            event_type=event.type
        )

        for sub in subscriptions:
            # Each consumer gets its own queue
            queue_name = f"webhook-delivery-{sub.endpoint_id}"

            self.queue.enqueue(
                queue_name,
                {
                    'event_id': event.id,
                    'endpoint_id': sub.endpoint_id,
                    'payload': event.payload,
                    'attempt': 0,
                    'max_attempts': sub.max_retries or 8,
                }
            )

    def get_consumer_health(self, endpoint_id):
        """Track per-consumer delivery health."""
        recent = self.delivery_log.get_recent(endpoint_id, count=100)

        success_rate = sum(1 for d in recent if d.success) / len(recent)
        avg_latency = sum(d.latency_ms for d in recent) / len(recent)

        return {
            'success_rate': success_rate,
            'avg_latency_ms': avg_latency,
            'consecutive_failures': self.count_consecutive_failures(endpoint_id),
        }

Circuit Breaker for Endpoints

When a consumer is persistently failing, stop hammering it:

python
class EndpointCircuitBreaker:
    """Circuit breaker per consumer endpoint."""

    CLOSED = 'closed'       # Normal operation
    OPEN = 'open'           # Endpoint is down, stop sending
    HALF_OPEN = 'half_open' # Testing if endpoint recovered

    def __init__(self, failure_threshold=10, recovery_time=300):
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time  # seconds
        self.state = self.CLOSED
        self.failure_count = 0
        self.last_failure_time = None

    def should_attempt(self) -> bool:
        if self.state == self.CLOSED:
            return True

        if self.state == self.OPEN:
            # Check if recovery period has passed
            if time.time() - self.last_failure_time > self.recovery_time:
                self.state = self.HALF_OPEN
                return True  # Allow one test request
            return False

        if self.state == self.HALF_OPEN:
            return True  # Already in test mode

    def record_success(self):
        self.failure_count = 0
        self.state = self.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            self.state = self.OPEN

WARNING

When the circuit breaker opens, you must still queue events — do not drop them. Queue events with a future delivery time (after the circuit breaker recovery window). When the breaker transitions to half-open, deliver the oldest queued event as a test. If it succeeds, drain the queue.


Signature Verification

Standard Webhook Signature (Svix-Compatible)

python
import hashlib
import hmac
import time

class WebhookSigner:
    """Generate and verify webhook signatures using the Standard Webhooks spec."""

    def sign(self, payload: str, secret: str) -> dict:
        msg_id = generate_uuid()
        timestamp = str(int(time.time()))

        # Sign: timestamp.payload
        to_sign = f"{msg_id}.{timestamp}.{payload}"
        signature = hmac.new(
            secret.encode(),
            to_sign.encode(),
            hashlib.sha256
        ).hexdigest()

        return {
            'webhook-id': msg_id,
            'webhook-timestamp': timestamp,
            'webhook-signature': f"v1,{signature}",
        }

    def verify(self, payload: str, headers: dict, secret: str,
               tolerance_seconds: int = 300) -> bool:
        msg_id = headers['webhook-id']
        timestamp = headers['webhook-timestamp']
        expected_sig = headers['webhook-signature']

        # Check timestamp tolerance (prevent replay attacks)
        ts = int(timestamp)
        if abs(time.time() - ts) > tolerance_seconds:
            raise ValueError("Timestamp too old or too far in the future")

        # Compute expected signature
        to_sign = f"{msg_id}.{timestamp}.{payload}"
        computed = hmac.new(
            secret.encode(),
            to_sign.encode(),
            hashlib.sha256
        ).hexdigest()

        expected = expected_sig.split(',')[1]  # Remove "v1," prefix
        return hmac.compare_digest(computed, expected)

Key Rotation

python
class SecretRotation:
    """Support two active signing secrets during rotation."""

    def sign_with_current(self, payload, endpoint):
        # Always sign with the current (newest) secret
        current_secret = endpoint.signing_secrets[-1]
        return self.signer.sign(payload, current_secret)

    def verify_with_any(self, payload, headers, endpoint):
        # Accept signatures from any active secret
        for secret in endpoint.signing_secrets:
            try:
                if self.signer.verify(payload, headers, secret):
                    return True
            except Exception:
                continue
        return False

    def rotate_secret(self, endpoint):
        new_secret = generate_secret()
        endpoint.signing_secrets.append(new_secret)

        # Keep only last 2 secrets
        if len(endpoint.signing_secrets) > 2:
            endpoint.signing_secrets = endpoint.signing_secrets[-2:]

        # Notify consumer of new secret via API/dashboard
        self.notify_consumer(endpoint, new_secret)

Event Ordering

Strict ordering is expensive and usually unnecessary. Here is the spectrum:

Ordering LevelImplementationCost
No orderingParallel delivery, fastest winsLowest
Best-effortFIFO queue per consumer, but retries may reorderLow
Per-entityPartition by entity ID, sequential within partitionMedium
Strict globalSingle sequential queue per consumerVery high
python
class OrderedDelivery:
    """Per-entity ordering: events for the same entity are delivered in order."""

    def enqueue(self, event, endpoint):
        entity_key = event.payload.get('entity_id', event.id)

        # Partition key ensures same entity goes to same queue partition
        self.queue.enqueue(
            queue=f"webhook-{endpoint.id}",
            message=event,
            partition_key=entity_key
        )

    def process(self, endpoint_id, partition):
        """Process events for one entity sequentially."""
        while event := self.queue.peek(endpoint_id, partition):
            success = self.deliver(event, endpoint_id)

            if success:
                self.queue.ack(event)
            else:
                # Block this partition until retry succeeds
                # Other partitions (other entities) continue independently
                delay = compute_retry_delay(event.attempt)
                self.queue.requeue(event, delay=delay)
                break  # Stop processing this partition

TIP

Per-entity ordering is the sweet spot: events for order_123 arrive in order (created → paid → shipped), but order_123 and order_456 are independent and can be delivered in parallel. This matches most consumers' expectations without the performance cost of global ordering.


Monitoring and Observability

Delivery Dashboard Metrics

MetricComputationAlert Threshold
Delivery success rateSuccessful / total attempts< 95% (15-min window)
Delivery latency P95Time from event to successful delivery> 5 seconds
Retry rateRetried events / total events> 10%
Dead letter rateDead-lettered / total events> 1%
Queue depthPending events across all consumer queues> 100K
Consumer success ratePer-consumer delivery success< 80% per consumer
Event lagOldest undelivered event age> 1 hour

Consumer Health Notifications

When a consumer's endpoint is failing, proactively notify them:

python
class ConsumerHealthNotifier:
    def check_and_notify(self, endpoint):
        health = self.get_consumer_health(endpoint.id)

        if health['consecutive_failures'] >= 50:
            # Email the consumer's admin
            self.send_notification(
                endpoint.admin_email,
                subject="Webhook endpoint failing",
                body=f"""
Your webhook endpoint {endpoint.url} has failed
{health['consecutive_failures']} consecutive deliveries.

Last error: {health['last_error']}
Last attempt: {health['last_attempt_time']}

Events are being queued and will be retried.
If failures continue for 72 hours, the endpoint
will be automatically disabled.

Check your endpoint health at:
https://dashboard.example.com/webhooks/{endpoint.id}
"""
            )

        if health['consecutive_failures'] >= 500:
            self.disable_endpoint(endpoint)
            self.send_notification(
                endpoint.admin_email,
                subject="Webhook endpoint disabled",
                body="Endpoint disabled after 500 consecutive failures."
            )

Build vs Buy: Webhook Platforms

Comparison: Svix vs Convoy vs Hookdeck

FeatureSvixConvoyHookdeck
TypeWebhook sendingWebhook sending + receivingWebhook receiving / proxy
Open sourceYes (MIT)Yes (MPL-2.0)No (SaaS only)
Self-hostedYesYesNo
Primary useSend webhooks to your customersSend + receive webhooksReceive and manage inbound webhooks
LanguageRustGoN/A (SaaS)
DashboardConsumer-facing portalAdmin portalDeveloper portal
Retry logicExponential backoff, configurableConfigurable retry policiesAutomatic retries
SignaturesStandard Webhooks spec (HMAC)HMAC, API key, basic authVerification built-in
Event typesSchema registryEvent typesTopic-based routing
Rate limitingPer-endpointPer-endpointPer-source
FilteringConsumer-side event type filteringServer-side filteringTransform + filter
IdempotencyEvent ID basedBuilt-in dedupBuilt-in
Logs & replayFull event logs, one-click replayEvent logs, replayFull request logs, replay
SDKsPython, Node, Go, Java, Ruby, KotlinGo, JS, PythonJS, Python
PricingFree tier + usageFree (self-hosted) + cloudFree tier + usage

When to Build vs Buy

TIP

The build-vs-buy threshold is roughly at 100K events/month. Below that, a simple queue worker is sufficient. Above that, the operational complexity of retry management, consumer isolation, monitoring, and the consumer-facing portal justifies using a dedicated platform.


Building Your Own: Minimal Architecture

If you decide to build, here is the minimum viable webhook infrastructure:

python
# Minimal webhook system using PostgreSQL + worker

# Schema
"""
CREATE TABLE webhook_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE webhook_endpoints (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url TEXT NOT NULL,
    secret TEXT NOT NULL,
    event_types TEXT[] NOT NULL,
    active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE webhook_deliveries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id UUID REFERENCES webhook_events(id),
    endpoint_id UUID REFERENCES webhook_endpoints(id),
    status TEXT DEFAULT 'pending',  -- pending, success, failed, dead_letter
    attempt INT DEFAULT 0,
    next_attempt_at TIMESTAMPTZ,
    last_response_code INT,
    last_error TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    delivered_at TIMESTAMPTZ
);

CREATE INDEX idx_deliveries_pending
    ON webhook_deliveries (next_attempt_at)
    WHERE status = 'pending';
"""

class MinimalWebhookWorker:
    """Poll-based webhook delivery worker."""

    def run(self):
        while True:
            # Fetch next batch of due deliveries
            deliveries = self.db.query("""
                SELECT d.*, e.payload, ep.url, ep.secret
                FROM webhook_deliveries d
                JOIN webhook_events e ON d.event_id = e.id
                JOIN webhook_endpoints ep ON d.endpoint_id = ep.id
                WHERE d.status = 'pending'
                  AND d.next_attempt_at <= NOW()
                ORDER BY d.next_attempt_at
                LIMIT 100
                FOR UPDATE SKIP LOCKED
            """)

            for delivery in deliveries:
                self.process(delivery)

            if not deliveries:
                time.sleep(1)

    def process(self, delivery):
        try:
            headers = self.sign(delivery.payload, delivery.secret)
            resp = httpx.post(
                delivery.url,
                json=delivery.payload,
                headers=headers,
                timeout=30
            )

            if resp.status_code < 300:
                self.mark_success(delivery)
            elif resp.status_code < 500:
                self.mark_dead_letter(delivery, resp.status_code)
            else:
                self.schedule_retry(delivery, resp.status_code)

        except Exception as e:
            self.schedule_retry(delivery, error=str(e))

    def schedule_retry(self, delivery, status_code=None, error=None):
        attempt = delivery.attempt + 1
        if attempt > 8:
            self.mark_dead_letter(delivery, status_code, error)
            return

        delay = compute_retry_delay(attempt)
        self.db.execute("""
            UPDATE webhook_deliveries
            SET attempt = %s,
                next_attempt_at = NOW() + INTERVAL '%s seconds',
                last_response_code = %s,
                last_error = %s
            WHERE id = %s
        """, [attempt, delay, status_code, error, delivery.id])

DANGER

This minimal implementation works for low volume but lacks per-consumer isolation, circuit breaking, and horizontal scaling. If a single consumer is slow, FOR UPDATE SKIP LOCKED helps but does not fully isolate. For production at scale, use per-consumer queues (SQS, Redis Streams) instead of polling a database table.


Key Takeaways

  1. Consumer isolation is the hardest problem — one failing consumer must not block others; use per-consumer queues
  2. Exponential backoff with full jitter — prevents thundering herd and distributes retry load evenly
  3. Circuit breakers per endpoint — stop hammering dead consumers; queue events for later delivery
  4. 4xx vs 5xx handling — do not retry client errors (400, 401, 403); they will never succeed
  5. Signature verification is mandatory — use HMAC-SHA256 with timestamp to prevent replay attacks
  6. Per-entity ordering is the sweet spot — strict global ordering is too expensive; no ordering loses business semantics
  7. Build below 100K events/month, buy above — Svix (cloud) or Convoy (self-hosted) handle the operational complexity you do not want to own
  8. Monitor per-consumer health — proactively notify consumers when their endpoints are failing

"What I cannot create, I do not understand." — Richard Feynman