Webhook Infrastructure at Scale
The Webhook Design Patterns page covers the fundamentals — payload design, signatures, and retry logic. This page goes deeper: building webhook infrastructure that reliably delivers millions of events per day, handles consumer failures gracefully, provides operational visibility, and scales horizontally.
If you are deciding whether to build or buy, we also compare the three leading webhook-as-a-service platforms: Svix, Convoy, and Hookdeck.
The Scale of the Problem
When you send a few hundred webhooks per day, everything works. When you send millions, every edge case becomes a daily occurrence:
| Failure Mode | Frequency at Scale | Impact |
|---|---|---|
| Consumer returns 500 | Hundreds per hour | Retry backlog grows |
| Consumer is down for hours | Weekly | Millions of retries queued |
| Consumer is permanently gone | Monthly | Dead letter queue fills up |
| DNS resolution fails | Daily | Transient delivery failures |
| TLS certificate expired | Monthly | Cannot connect to consumer |
| Consumer is slow (> 30s) | Thousands per day | Ties up worker threads |
| Consumer returns 200 but does not process | Unknown | Silent data loss |
| Network partition | Quarterly | Entire regions lose connectivity |
| Your outage sends burst on recovery | Quarterly | Thundering herd to consumers |
Architecture for Reliable Delivery
Key Design Decisions
| Decision | Recommendation | Why |
|---|---|---|
| Queue technology | Kafka or SQS with DLQ | Durability, at-least-once, built-in retry |
| Worker pool | Dedicated per-consumer pools | One slow consumer cannot block others |
| Timeout | 30 seconds max | Prevent thread exhaustion |
| Max retries | 5-8 attempts over 24-48 hours | Balance reliability vs resource waste |
| Idempotency | Include event ID, advise consumers to deduplicate | At-least-once delivery means duplicates |
| Ordering | Best-effort (not guaranteed) | Strict ordering is extremely expensive |
Retry Strategy
Exponential Backoff with Jitter
import random
import time
def compute_retry_delay(attempt: int, base_delay: int = 30) -> int:
"""
Exponential backoff with full jitter.
Attempt 1: 0-60s (random within 0 to 2*base)
Attempt 2: 0-120s
Attempt 3: 0-240s
Attempt 4: 0-480s
Attempt 5: 0-960s (~16 min max)
Attempt 6: 0-1920s (~32 min max)
Attempt 7: 0-3600s (capped at 1 hour)
"""
max_delay = min(base_delay * (2 ** attempt), 3600) # Cap at 1 hour
return random.randint(0, max_delay)
# Retry schedule (expected values with jitter):
# Attempt 1: ~30 seconds later
# Attempt 2: ~1 minute later
# Attempt 3: ~2 minutes later
# Attempt 4: ~4 minutes later
# Attempt 5: ~8 minutes later
# Attempt 6: ~16 minutes later
# Attempt 7: ~30 minutes later
# Attempt 8: ~1 hour later
# Total span: ~2 hours (but up to 24 hours with variance)TIP
Full jitter (random between 0 and max) is better than equal jitter (delay/2 + random(delay/2)). It provides better spread and reduces the chance of correlated retries across consumers. See AWS Architecture Blog's analysis for the math.
Retry Response Code Handling
| Response | Action | Reasoning |
|---|---|---|
| 2xx | Success, log, move on | Consumer accepted the event |
| 301/302 | Follow redirect once | Consumer moved; update endpoint if permanent |
| 400 | Dead letter, do NOT retry | Payload is malformed — retrying will not fix it |
| 401/403 | Dead letter, alert consumer | Auth issue — consumer needs to fix credentials |
| 404 | Dead letter after 3 attempts | Endpoint may not exist yet (race condition) |
| 408 | Retry | Request timeout, likely transient |
| 410 | Dead letter, disable endpoint | Consumer explicitly says "gone" |
| 429 | Retry with Retry-After header | Consumer is rate limiting you |
| 500/502/503 | Retry with backoff | Server error, likely transient |
| Connection refused | Retry with backoff | Consumer is down |
| DNS failure | Retry with backoff | DNS is likely transient |
| TLS error | Retry 2x, then dead letter + alert | Certificate issue — consumer needs to fix |
class WebhookDeliveryWorker:
def deliver(self, event, endpoint):
try:
response = httpx.post(
endpoint.url,
json=event.payload,
headers=self.build_headers(event, endpoint),
timeout=httpx.Timeout(
connect=5.0,
read=25.0,
write=5.0,
pool=5.0
),
)
if response.status_code < 300:
self.mark_success(event, endpoint, response)
return
if response.status_code in (400, 401, 403, 410):
self.dead_letter(event, endpoint, response,
reason=f"Client error: {response.status_code}")
if response.status_code == 410:
self.disable_endpoint(endpoint)
return
if response.status_code == 429:
retry_after = int(response.headers.get('Retry-After', 60))
self.schedule_retry(event, endpoint, delay=retry_after)
return
# 5xx — retry with backoff
self.schedule_retry(event, endpoint)
except httpx.ConnectError:
self.schedule_retry(event, endpoint)
except httpx.TimeoutException:
self.schedule_retry(event, endpoint)
except ssl.SSLError:
if event.attempt < 2:
self.schedule_retry(event, endpoint)
else:
self.dead_letter(event, endpoint, reason="TLS error")
self.alert_consumer(endpoint, "TLS certificate issue")Consumer Isolation
The most critical design requirement: one slow or failing consumer must not affect delivery to others.
Per-Consumer Queue Architecture
class WebhookDispatcher:
"""Fan out events to per-consumer queues."""
def dispatch(self, event):
# Find all subscriptions for this event type
subscriptions = self.subscription_store.get_active(
event_type=event.type
)
for sub in subscriptions:
# Each consumer gets its own queue
queue_name = f"webhook-delivery-{sub.endpoint_id}"
self.queue.enqueue(
queue_name,
{
'event_id': event.id,
'endpoint_id': sub.endpoint_id,
'payload': event.payload,
'attempt': 0,
'max_attempts': sub.max_retries or 8,
}
)
def get_consumer_health(self, endpoint_id):
"""Track per-consumer delivery health."""
recent = self.delivery_log.get_recent(endpoint_id, count=100)
success_rate = sum(1 for d in recent if d.success) / len(recent)
avg_latency = sum(d.latency_ms for d in recent) / len(recent)
return {
'success_rate': success_rate,
'avg_latency_ms': avg_latency,
'consecutive_failures': self.count_consecutive_failures(endpoint_id),
}Circuit Breaker for Endpoints
When a consumer is persistently failing, stop hammering it:
class EndpointCircuitBreaker:
"""Circuit breaker per consumer endpoint."""
CLOSED = 'closed' # Normal operation
OPEN = 'open' # Endpoint is down, stop sending
HALF_OPEN = 'half_open' # Testing if endpoint recovered
def __init__(self, failure_threshold=10, recovery_time=300):
self.failure_threshold = failure_threshold
self.recovery_time = recovery_time # seconds
self.state = self.CLOSED
self.failure_count = 0
self.last_failure_time = None
def should_attempt(self) -> bool:
if self.state == self.CLOSED:
return True
if self.state == self.OPEN:
# Check if recovery period has passed
if time.time() - self.last_failure_time > self.recovery_time:
self.state = self.HALF_OPEN
return True # Allow one test request
return False
if self.state == self.HALF_OPEN:
return True # Already in test mode
def record_success(self):
self.failure_count = 0
self.state = self.CLOSED
def record_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = self.OPENWARNING
When the circuit breaker opens, you must still queue events — do not drop them. Queue events with a future delivery time (after the circuit breaker recovery window). When the breaker transitions to half-open, deliver the oldest queued event as a test. If it succeeds, drain the queue.
Signature Verification
Standard Webhook Signature (Svix-Compatible)
import hashlib
import hmac
import time
class WebhookSigner:
"""Generate and verify webhook signatures using the Standard Webhooks spec."""
def sign(self, payload: str, secret: str) -> dict:
msg_id = generate_uuid()
timestamp = str(int(time.time()))
# Sign: timestamp.payload
to_sign = f"{msg_id}.{timestamp}.{payload}"
signature = hmac.new(
secret.encode(),
to_sign.encode(),
hashlib.sha256
).hexdigest()
return {
'webhook-id': msg_id,
'webhook-timestamp': timestamp,
'webhook-signature': f"v1,{signature}",
}
def verify(self, payload: str, headers: dict, secret: str,
tolerance_seconds: int = 300) -> bool:
msg_id = headers['webhook-id']
timestamp = headers['webhook-timestamp']
expected_sig = headers['webhook-signature']
# Check timestamp tolerance (prevent replay attacks)
ts = int(timestamp)
if abs(time.time() - ts) > tolerance_seconds:
raise ValueError("Timestamp too old or too far in the future")
# Compute expected signature
to_sign = f"{msg_id}.{timestamp}.{payload}"
computed = hmac.new(
secret.encode(),
to_sign.encode(),
hashlib.sha256
).hexdigest()
expected = expected_sig.split(',')[1] # Remove "v1," prefix
return hmac.compare_digest(computed, expected)Key Rotation
class SecretRotation:
"""Support two active signing secrets during rotation."""
def sign_with_current(self, payload, endpoint):
# Always sign with the current (newest) secret
current_secret = endpoint.signing_secrets[-1]
return self.signer.sign(payload, current_secret)
def verify_with_any(self, payload, headers, endpoint):
# Accept signatures from any active secret
for secret in endpoint.signing_secrets:
try:
if self.signer.verify(payload, headers, secret):
return True
except Exception:
continue
return False
def rotate_secret(self, endpoint):
new_secret = generate_secret()
endpoint.signing_secrets.append(new_secret)
# Keep only last 2 secrets
if len(endpoint.signing_secrets) > 2:
endpoint.signing_secrets = endpoint.signing_secrets[-2:]
# Notify consumer of new secret via API/dashboard
self.notify_consumer(endpoint, new_secret)Event Ordering
Strict ordering is expensive and usually unnecessary. Here is the spectrum:
| Ordering Level | Implementation | Cost |
|---|---|---|
| No ordering | Parallel delivery, fastest wins | Lowest |
| Best-effort | FIFO queue per consumer, but retries may reorder | Low |
| Per-entity | Partition by entity ID, sequential within partition | Medium |
| Strict global | Single sequential queue per consumer | Very high |
class OrderedDelivery:
"""Per-entity ordering: events for the same entity are delivered in order."""
def enqueue(self, event, endpoint):
entity_key = event.payload.get('entity_id', event.id)
# Partition key ensures same entity goes to same queue partition
self.queue.enqueue(
queue=f"webhook-{endpoint.id}",
message=event,
partition_key=entity_key
)
def process(self, endpoint_id, partition):
"""Process events for one entity sequentially."""
while event := self.queue.peek(endpoint_id, partition):
success = self.deliver(event, endpoint_id)
if success:
self.queue.ack(event)
else:
# Block this partition until retry succeeds
# Other partitions (other entities) continue independently
delay = compute_retry_delay(event.attempt)
self.queue.requeue(event, delay=delay)
break # Stop processing this partitionTIP
Per-entity ordering is the sweet spot: events for order_123 arrive in order (created → paid → shipped), but order_123 and order_456 are independent and can be delivered in parallel. This matches most consumers' expectations without the performance cost of global ordering.
Monitoring and Observability
Delivery Dashboard Metrics
| Metric | Computation | Alert Threshold |
|---|---|---|
| Delivery success rate | Successful / total attempts | < 95% (15-min window) |
| Delivery latency P95 | Time from event to successful delivery | > 5 seconds |
| Retry rate | Retried events / total events | > 10% |
| Dead letter rate | Dead-lettered / total events | > 1% |
| Queue depth | Pending events across all consumer queues | > 100K |
| Consumer success rate | Per-consumer delivery success | < 80% per consumer |
| Event lag | Oldest undelivered event age | > 1 hour |
Consumer Health Notifications
When a consumer's endpoint is failing, proactively notify them:
class ConsumerHealthNotifier:
def check_and_notify(self, endpoint):
health = self.get_consumer_health(endpoint.id)
if health['consecutive_failures'] >= 50:
# Email the consumer's admin
self.send_notification(
endpoint.admin_email,
subject="Webhook endpoint failing",
body=f"""
Your webhook endpoint {endpoint.url} has failed
{health['consecutive_failures']} consecutive deliveries.
Last error: {health['last_error']}
Last attempt: {health['last_attempt_time']}
Events are being queued and will be retried.
If failures continue for 72 hours, the endpoint
will be automatically disabled.
Check your endpoint health at:
https://dashboard.example.com/webhooks/{endpoint.id}
"""
)
if health['consecutive_failures'] >= 500:
self.disable_endpoint(endpoint)
self.send_notification(
endpoint.admin_email,
subject="Webhook endpoint disabled",
body="Endpoint disabled after 500 consecutive failures."
)Build vs Buy: Webhook Platforms
Comparison: Svix vs Convoy vs Hookdeck
| Feature | Svix | Convoy | Hookdeck |
|---|---|---|---|
| Type | Webhook sending | Webhook sending + receiving | Webhook receiving / proxy |
| Open source | Yes (MIT) | Yes (MPL-2.0) | No (SaaS only) |
| Self-hosted | Yes | Yes | No |
| Primary use | Send webhooks to your customers | Send + receive webhooks | Receive and manage inbound webhooks |
| Language | Rust | Go | N/A (SaaS) |
| Dashboard | Consumer-facing portal | Admin portal | Developer portal |
| Retry logic | Exponential backoff, configurable | Configurable retry policies | Automatic retries |
| Signatures | Standard Webhooks spec (HMAC) | HMAC, API key, basic auth | Verification built-in |
| Event types | Schema registry | Event types | Topic-based routing |
| Rate limiting | Per-endpoint | Per-endpoint | Per-source |
| Filtering | Consumer-side event type filtering | Server-side filtering | Transform + filter |
| Idempotency | Event ID based | Built-in dedup | Built-in |
| Logs & replay | Full event logs, one-click replay | Event logs, replay | Full request logs, replay |
| SDKs | Python, Node, Go, Java, Ruby, Kotlin | Go, JS, Python | JS, Python |
| Pricing | Free tier + usage | Free (self-hosted) + cloud | Free tier + usage |
When to Build vs Buy
TIP
The build-vs-buy threshold is roughly at 100K events/month. Below that, a simple queue worker is sufficient. Above that, the operational complexity of retry management, consumer isolation, monitoring, and the consumer-facing portal justifies using a dedicated platform.
Building Your Own: Minimal Architecture
If you decide to build, here is the minimum viable webhook infrastructure:
# Minimal webhook system using PostgreSQL + worker
# Schema
"""
CREATE TABLE webhook_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE webhook_endpoints (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL,
secret TEXT NOT NULL,
event_types TEXT[] NOT NULL,
active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE webhook_deliveries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID REFERENCES webhook_events(id),
endpoint_id UUID REFERENCES webhook_endpoints(id),
status TEXT DEFAULT 'pending', -- pending, success, failed, dead_letter
attempt INT DEFAULT 0,
next_attempt_at TIMESTAMPTZ,
last_response_code INT,
last_error TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
delivered_at TIMESTAMPTZ
);
CREATE INDEX idx_deliveries_pending
ON webhook_deliveries (next_attempt_at)
WHERE status = 'pending';
"""
class MinimalWebhookWorker:
"""Poll-based webhook delivery worker."""
def run(self):
while True:
# Fetch next batch of due deliveries
deliveries = self.db.query("""
SELECT d.*, e.payload, ep.url, ep.secret
FROM webhook_deliveries d
JOIN webhook_events e ON d.event_id = e.id
JOIN webhook_endpoints ep ON d.endpoint_id = ep.id
WHERE d.status = 'pending'
AND d.next_attempt_at <= NOW()
ORDER BY d.next_attempt_at
LIMIT 100
FOR UPDATE SKIP LOCKED
""")
for delivery in deliveries:
self.process(delivery)
if not deliveries:
time.sleep(1)
def process(self, delivery):
try:
headers = self.sign(delivery.payload, delivery.secret)
resp = httpx.post(
delivery.url,
json=delivery.payload,
headers=headers,
timeout=30
)
if resp.status_code < 300:
self.mark_success(delivery)
elif resp.status_code < 500:
self.mark_dead_letter(delivery, resp.status_code)
else:
self.schedule_retry(delivery, resp.status_code)
except Exception as e:
self.schedule_retry(delivery, error=str(e))
def schedule_retry(self, delivery, status_code=None, error=None):
attempt = delivery.attempt + 1
if attempt > 8:
self.mark_dead_letter(delivery, status_code, error)
return
delay = compute_retry_delay(attempt)
self.db.execute("""
UPDATE webhook_deliveries
SET attempt = %s,
next_attempt_at = NOW() + INTERVAL '%s seconds',
last_response_code = %s,
last_error = %s
WHERE id = %s
""", [attempt, delay, status_code, error, delivery.id])DANGER
This minimal implementation works for low volume but lacks per-consumer isolation, circuit breaking, and horizontal scaling. If a single consumer is slow, FOR UPDATE SKIP LOCKED helps but does not fully isolate. For production at scale, use per-consumer queues (SQS, Redis Streams) instead of polling a database table.
Key Takeaways
- Consumer isolation is the hardest problem — one failing consumer must not block others; use per-consumer queues
- Exponential backoff with full jitter — prevents thundering herd and distributes retry load evenly
- Circuit breakers per endpoint — stop hammering dead consumers; queue events for later delivery
- 4xx vs 5xx handling — do not retry client errors (400, 401, 403); they will never succeed
- Signature verification is mandatory — use HMAC-SHA256 with timestamp to prevent replay attacks
- Per-entity ordering is the sweet spot — strict global ordering is too expensive; no ordering loses business semantics
- Build below 100K events/month, buy above — Svix (cloud) or Convoy (self-hosted) handle the operational complexity you do not want to own
- Monitor per-consumer health — proactively notify consumers when their endpoints are failing