Observability

Observability is the ability to understand the internal state of a system by examining its external outputs. It is not the same as monitoring. Monitoring tells you when something is wrong (CPU is at 95%). Observability tells you why something is wrong (a new deployment introduced an N+1 query that is hammering the database connection pool).

In a monolith, you can attach a debugger. In a distributed system of 50 services, you cannot. A single request might enter through an API gateway and touch three services, two caches, a message queue, and two databases. When it fails, the error often surfaces in a different service from the one where the root cause lives. Observability gives you the tools to trace that causal chain backward from symptom to cause, in production, under load, in real time.

The Three Pillars

Logs

Logs are discrete, timestamped records of events that happened in your system. They are the most fundamental observability signal — humans have been reading logs since the first computer.

Structured logging is non-negotiable in production. Unstructured logs ("Error processing request for user 123") are nearly impossible to query at scale. Structured logs are machine-parseable:

json
{
  "timestamp": "2026-03-20T14:23:45.123Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "user_id": "user-42",
  "order_id": "ord-9001",
  "message": "Payment processing failed",
  "error": "connection refused",
  "downstream_service": "payment-service",
  "latency_ms": 5023,
  "retry_count": 3
}
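
A JSON logger such as pino emits this shape directly, and a child logger attaches request context to every line it writes:

typescript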
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  serializers: {
    err: pino.stdSerializers.err,
  },
});

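// chargePayment and isRetryable are assumed to be implemented elsewhere in the service
declare function chargePayment(orderId: string): { id: string; amount: number };
declare function isRetryable(error: unknown): boolean;

// Structured logging with context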
function processOrder(orderId: string, userId: string) {
  const log = logger.child({ orderId, userId, operation: 'processOrder' });

  log.info('Starting order processing');

  try {
    const result = chargePayment(orderId);
    log.info({ paymentId: result.id, amount: result.amount },
             'Payment successful');
  } catch (error) {
    log.error({ err: error, retryable: isRetryable(error) },
              'Payment processing failed');
    throw error;
  }
}
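The same idea can be sketched in Python with only the standard library. This is a minimal illustration, not a production logger: the `JsonFormatter` class, the `build_logger` helper, and the `context` key passed through `extra` are names assumed for this example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become top-level keys
        # (the "context" key is a convention assumed for this sketch).
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger("order-service", stream)
log.error(
    "Payment processing failed",
    extra={"context": {"order_id": "ord-9001", "retry_count": 3}},
)

# Each line is machine-parseable, so fields can be queried directly.
entry = json.loads(stream.getvalue())
print(entry["level"], entry["order_id"], entry["message"])
```

Because every field is a key rather than part of a sentence, a log backend can filter on `order_id` or `retry_count` without regex parsing.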

Log Levels Strategy

| Level | When to Use | Alert? |
| --- | --- | --- |
| FATAL | Process is about to crash | Page immediately |
| ERROR | Operation failed, user impact | Alert |
| WARN | Something unexpected but handled | Monitor trend |
| INFO | Significant business events (order placed, user signed up) | No |
| DEBUG | Diagnostic information for debugging | No (disable in prod) |
| TRACE | Extremely verbose, per-request detail | No (never in prod) |

Metrics

Metrics are numeric measurements of system behavior over time. Unlike logs (which record individual events), metrics aggregate data into time series that you can query, alert on, and graph.

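The four golden signals (from Google's SRE book):

  • Latency: how long it takes to serve a request
  • Traffic: how much demand is placed on the system (for example, requests per second)
  • Errors: the rate of requests that fail
  • Saturation: how close the service is to its resource limits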

Metric types in Prometheus:

python
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)

# Counter: monotonically increasing (requests, errors, bytes)
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram: distribution of values (latency, response size)
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Gauge: can go up and down (temperature, queue size, connections)
active_connections = Gauge(
    'active_connections',
    'Number of active connections',
    ['service']
)

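# Usage in request handler (process_order is assumed to be defined elsewhere)
@app.route('/api/orders', methods=['POST'])
def create_order():
    status = '500'  # assume failure until the handler completes
    try:
        # Time only the work itself; the elapsed seconds are observed on exit
        with http_request_duration.labels(
                method='POST', endpoint='/api/orders').time():
            result = process_order()
        status = '200'
        return jsonify(result)
    finally:
        # Count the request with the status it actually finished with
        http_requests_total.labels(
            method='POST', endpoint='/api/orders', status=status).inc()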

Histogram quantile calculation:

Prometheus histograms use buckets. The p99 latency is estimated as:

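$$
\phi\text{-quantile} \approx b_{\text{lower}} + \frac{\phi N - \text{count}_{\text{lower}}}{\text{count}_{\text{upper}} - \text{count}_{\text{lower}}} \times \left(b_{\text{upper}} - b_{\text{lower}}\right)
$$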
promql
# P99 latency over 5 minutes
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Request rate by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
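The bucket interpolation behind a histogram quantile can be sketched in plain Python. This is an illustrative reimplementation under simplifying assumptions, not Prometheus's own code: buckets are given as `(upper_bound, cumulative_count)` pairs sorted by bound, the last bound is `+Inf`, at least one finite bucket exists, and the lowest bucket starts at 0.

```python
import math
from typing import List, Tuple


def histogram_quantile(phi: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the phi-quantile from cumulative histogram buckets."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must be between 0 and 1")
    total = buckets[-1][1]  # the +Inf bucket holds the total count N
    if total == 0:
        return math.nan
    rank = phi * total  # phi * N in the formula above
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper_bound, upper_count = buckets[index]
    if math.isinf(upper_bound):
        # The rank falls beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower_bound, lower_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if upper_count == lower_count:
        return upper_bound
    # Linear interpolation inside the bucket.
    fraction = (rank - lower_count) / (upper_count - lower_count)
    return lower_bound + fraction * (upper_bound - lower_bound)


# 100 observations: 50 under 0.1s, 90 under 0.25s, 98 under 0.5s, all under 1s.
example_buckets = [
    (0.1, 50.0),
    (0.25, 90.0),
    (0.5, 98.0),
    (1.0, 100.0),
    (math.inf, 100.0),
]
p50 = histogram_quantile(0.50, example_buckets)
p99 = histogram_quantile(0.99, example_buckets)
print(f"p50={p50:.3f}s p99={p99:.3f}s")
```

The estimate can never be more precise than the bucket layout: here the p99 lands somewhere between 0.5s and 1s, and interpolation simply assumes observations are spread evenly inside that bucket.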

Traces

Distributed traces track a single request as it flows through multiple services. A trace is a tree of spans, where each span represents a unit of work:

Trace ID: abc123def456

├── [120ms] API Gateway (span-1)
│   └── [95ms] Order Service (span-2, parent: span-1)
│       ├── [45ms] Payment Service (span-3, parent: span-2)
│       │   └── [38ms] Stripe API (span-5, parent: span-3)
│       └── [12ms] Inventory Service (span-4, parent: span-2)

Each span carries:

  • Trace ID — Unique identifier for the entire request
  • Span ID — Unique identifier for this span
  • Parent Span ID — Links to the parent span
  • Start time and duration
  • Attributes — Key-value pairs (e.g., http.method=POST, db.statement=SELECT...)
  • Events — Timestamped annotations within a span
  • Status — OK, ERROR, UNSET
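To make the tree structure concrete, here is a minimal sketch that rebuilds the example trace from flat span records using parent span IDs. The `Span` dataclass and helper names are assumptions for illustration and do not mirror any tracing SDK; the self-time calculation also assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: int
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> Span:
    """Link spans to their parents and return the root span."""
    by_id: Dict[str, Span] = {span.span_id: span for span in spans}
    root: Optional[Span] = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            by_id[span.parent_id].children.append(span)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its (sequential) children."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def render(span: Span, depth: int = 0) -> List[str]:
    """Return an indented text view of the span tree."""
    lines = [f"{'  ' * depth}[{span.duration_ms}ms] {span.name} (self: {self_time_ms(span)}ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Spans usually arrive out of order from different services.
spans = [
    Span("span-3", "span-2", "Payment Service", 45),
    Span("span-1", None, "API Gateway", 120),
    Span("span-5", "span-3", "Stripe API", 38),
    Span("span-2", "span-1", "Order Service", 95),
    Span("span-4", "span-2", "Inventory Service", 12),
]
tree = build_tree(spans)
print("\n".join(render(tree)))
```

Self time is what usually points at the culprit: a span with a long duration but a short self time is merely waiting on its children.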

OpenTelemetry

OpenTelemetry (OTel) is the industry standard for instrumentation. It provides vendor-neutral APIs and SDKs for generating logs, metrics, and traces.

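Architecture

Instrumented services export telemetry over OTLP to the OpenTelemetry Collector, which receives, processes, and forwards it to backends. The collector configuration later in this section uses Jaeger for traces, Prometheus for metrics, and Loki for logs.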

Instrumentation

python
# OpenTelemetry setup for a Python FastAPI application
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (
    OTLPMetricExporter,
)
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Configure trace provider
# insecure=True sends plaintext gRPC, matching a collector receiver without TLS
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(trace_provider)

# Configure metrics provider
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=10000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))

# Auto-instrument frameworks
# (db_engine is assumed to be the application's SQLAlchemy engine, created elsewhere)
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db_engine)

# Custom instrumentation
tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")

order_counter = meter.create_counter(
    "orders_created_total",
    description="Total orders created",
)

# OrderRequest (a Pydantic model) and payment_service are assumed to be defined elsewhere
@app.post("/api/orders")
async def create_order(order: OrderRequest):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.items_count", len(order.items))
        span.set_attribute("order.total_amount", order.total)

        # Payment processing: record the event on the payment span itself
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment = await payment_service.charge(order.total)
            payment_span.add_event("payment_completed", {
                "payment_id": payment.id
            })

        order_counter.add(1, {"payment_method": order.payment_method})
        return {"order_id": order.id}

OpenTelemetry Collector Configuration

yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Tail-based sampling: a trace is kept if ANY policy matches
  # (errors, requests slower than 1s, or a 10% sample of everything)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: percentage
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

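# Note: the prometheus and loki exporters and the tail_sampling processor ship
# in the Collector "contrib" distribution, not the core one.
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  # The dedicated jaeger exporter was removed from recent Collector releases.
  # On those versions, send traces with the otlp exporter to Jaeger's OTLP
  # gRPC port (4317) instead.
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push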

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample before batching so the sampler sees spans as they arrive; batch last
      processors: [tail_sampling, resource, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]
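The tail-sampling policies above combine as a logical OR. The decision can be sketched as follows; this is a simplified model for illustration only, not the Collector's implementation. `TraceSummary` and both function names are hypothetical, and hashing the trace ID for the percentage policy is an assumption chosen here so the decision is repeatable.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float


def sampled_by_percentage(trace_id: str, percentage: float) -> bool:
    """Deterministically keep roughly `percentage` percent of trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percentage * 100


def keep_trace(
    trace: TraceSummary,
    latency_threshold_ms: float = 1000.0,
    sampling_percentage: float = 10.0,
) -> bool:
    """Keep the trace if any policy matches: error, slow, or sampled."""
    if trace.has_error:
        return True
    if trace.duration_ms > latency_threshold_ms:
        return True
    return sampled_by_percentage(trace.trace_id, sampling_percentage)


failed = TraceSummary("trace-a", has_error=True, duration_ms=40)
slow = TraceSummary("trace-b", has_error=False, duration_ms=2300)
print(keep_trace(failed), keep_trace(slow))

# Fast, successful traces are kept at roughly the configured percentage.
healthy = [TraceSummary(f"trace-{i}", False, 50) for i in range(10_000)]
kept = sum(keep_trace(t) for t in healthy)
print(f"kept {kept} of {len(healthy)} healthy traces")
```

The decision needs the whole trace, which is why the processor waits (`decision_wait`) before choosing: an error may only appear in the last span to arrive.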

The Observability Stack

Prometheus + Grafana (Metrics)

yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Jaeger (Traces)

Jaeger is a widely used open-source distributed tracing backend. It stores traces and provides a UI for querying and visualizing them.

yaml
# Jaeger deployment in Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
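spec:
  replicas: 1
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec: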
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.54
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
          ports:
            - containerPort: 16686  # UI
            - containerPort: 14250  # gRPC (collector)
            - containerPort: 14268  # HTTP (collector)

ELK Stack (Logs)

Elasticsearch + Logstash (or Fluentd/Fluent Bit) + Kibana for log aggregation:

yaml
# Fluent Bit DaemonSet (collects logs from every node)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
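spec:
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec: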
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.2
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
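data:
  # Mounting this ConfigMap at /fluent-bit/etc/ hides the image's bundled
  # parsers.conf, so a parsers.conf key that defines the docker parser must
  # also be added here for Parsers_File to resolve.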
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info
        Parsers_File parsers.conf

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        Mem_Buf_Limit     5MB

    [FILTER]
        Name              kubernetes
        Match             kube.*
        Merge_Log         On
        K8S-Logging.Parser On

    [OUTPUT]
        Name              es
        Match             *
        Host              elasticsearch
        Port              9200
        Logstash_Format   On
        Logstash_Prefix   k8s-logs
        Retry_Limit       5

SLIs, SLOs, and SLAs

Definitions

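| Concept | Audience | Consequence of Violation |
| --- | --- | --- |
| SLI (Service Level Indicator) | Engineering team | None (it is a measurement) |
| SLO (Service Level Objective) | Engineering + product | Prioritize reliability work |
| SLA (Service Level Agreement) | Customers | Financial penalties, credits |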

Choosing SLIs

Good SLIs measure the user experience, not internal metrics:

| Bad SLI | Good SLI | Why |
| --- | --- | --- |
| CPU utilization | Request latency (p99) | Users do not care about CPU |
| Free disk space | Error rate (5xx / total) | Users care about failures |
| Pod count | Availability (successful requests / total) | Users care about uptime |

promql
# Availability SLI: fraction of successful requests
sum(rate(http_requests_total{status!~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))

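# Latency SLI: fraction of requests under 250ms
# (le must match a configured bucket boundary; the buckets defined earlier include 0.25 but not 0.2)
sum(rate(http_request_duration_seconds_bucket{le="0.25"}[1h]))
/
sum(rate(http_request_duration_seconds_count[1h]))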
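The two ratios are simple enough to check by hand. The sketch below computes them from raw counts for one window; the `WindowCounts` type, the sample numbers, and the SLO targets are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    total_requests: int
    server_errors: int  # responses with a 5xx status
    fast_requests: int  # requests that finished under the latency threshold


def availability_sli(counts: WindowCounts) -> float:
    """Fraction of requests that did not fail with a server error."""
    if counts.total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return (counts.total_requests - counts.server_errors) / counts.total_requests


def latency_sli(counts: WindowCounts) -> float:
    """Fraction of requests that finished under the latency threshold."""
    if counts.total_requests == 0:
        return 1.0
    return counts.fast_requests / counts.total_requests


def meets_slo(sli: float, target: float) -> bool:
    return sli >= target


counts = WindowCounts(total_requests=120_000, server_errors=84, fast_requests=118_200)
availability = availability_sli(counts)
latency = latency_sli(counts)
print(f"availability={availability:.4%} meets 99.9%: {meets_slo(availability, 0.999)}")
print(f"latency={latency:.4%} meets 99%: {meets_slo(latency, 0.99)}")
```

In this made-up window the service meets its availability target while missing its latency target, which is why the two SLIs are tracked separately.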

Error Budgets

An error budget is the complement of your SLO: the fraction of requests or time that is allowed to fail. If your SLO is 99.9% availability, your error budget is 0.1%, or roughly 43 minutes of downtime in a 30-day month.

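$$
\text{Error Budget} = 1 - \text{SLO Target}
$$

$$
\text{Monthly Budget (minutes)} = 30 \times 24 \times 60 \times (1 - \text{SLO})
$$

| SLO | Error Budget / Month | Error Budget / Quarter |
| --- | --- | --- |
| 99% | 7.2 hours | 21.6 hours |
| 99.9% | 43.2 minutes | 2.16 hours |
| 99.95% | 21.6 minutes | 1.08 hours |
| 99.99% | 4.32 minutes | 12.96 minutes |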
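The table values follow directly from the budget formula. A short sketch reproduces them, assuming a 30-day month and a 90-day quarter; the function name is chosen for this example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window_days * 24 * 60 * (1 - slo)


for slo in (0.99, 0.999, 0.9995, 0.9999):
    monthly = error_budget_minutes(slo)
    quarterly = error_budget_minutes(slo, window_days=90)
    print(f"{slo:.2%}: {monthly:7.2f} min/month, {quarterly:8.2f} min/quarter")
```

Each additional nine cuts the budget by a factor of ten, which is why moving from 99.9% to 99.99% is an engineering investment rather than a configuration change.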

Error Budgets Drive Decisions

When the error budget is healthy, ship features fast. When the error budget is nearly exhausted, freeze deployments and focus on reliability. This creates a natural feedback loop between velocity and reliability without subjective debates.

python
class ErrorBudget:
    def __init__(self, slo_target: float, window_days: int = 30):
        if not 0 < slo_target <= 1:
            raise ValueError("slo_target must be a fraction in (0, 1]")
        self.slo_target = slo_target
        self.window_seconds = window_days * 24 * 3600
        self.budget = 1 - slo_target  # e.g., 0.001 for 99.9%

    def remaining(self, total_requests: int,
                  failed_requests: int) -> float:
        """Returns remaining error budget as a fraction (0 to 1)."""
        if failed_requests == 0:
            return 1.0  # nothing consumed, including when there was no traffic
        allowed_failures = total_requests * self.budget
        if allowed_failures <= 0:
            return 0.0  # failures occurred but none were allowed
        return max(0.0, 1 - failed_requests / allowed_failures)

    def should_freeze_deployments(self, total: int, failed: int) -> bool:
        return self.remaining(total, failed) < 0.1  # < 10% remaining

Alerting Strategies

Alert on Symptoms, Not Causes

| Bad Alert (cause) | Good Alert (symptom) |
| --- | --- |
| CPU > 90% | P99 latency > 500ms |
| Disk > 80% | Error rate > 1% for 5 minutes |
| Pod restarted | Availability SLO is burning error budget too fast |

Multi-Window Burn Rate Alerting

Instead of alerting on instantaneous SLO violations, alert when the error budget is being consumed faster than expected:

$$
\text{Burn Rate} = \frac{\text{Actual Error Rate}}{\text{Allowed Error Rate}}
$$

A burn rate of 1 means you will exactly exhaust the budget by the end of the window. A burn rate of 10 means you will exhaust the budget in 1/10 of the window.

yaml
# Prometheus alerting rules for multi-window burn rate
groups:
  - name: slo-alerts
    rules:
      # Fast burn: 14.4x burn rate over 1 hour (pages immediately)
      - alert: HighErrorBurnRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error burn rate - will exhaust monthly budget in 2 days"

      # Slow burn: 3x burn rate over 3 days (tickets)
      - alert: SlowErrorBurnRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[3d]))
            / sum(rate(http_requests_total[3d]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error burn rate - will exhaust monthly budget in 10 days"
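The arithmetic behind these thresholds can be sketched in a few lines. The function names are chosen for this example, and the error rates in the usage section are made-up inputs; the 99.9% SLO and 30-day window match the rules above.

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_error_rate = 1 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1")
    return error_rate / allowed_error_rate


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days until the full budget is gone if the burn rate stays constant."""
    return math.inf if rate <= 0 else window_days / rate


def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 slo_target: float,
                 threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_window_error_rate, slo_target) > threshold
            and burn_rate(short_window_error_rate, slo_target) > threshold)


fast = burn_rate(0.0144, 0.999)
print(f"burn rate {fast:.1f}x, budget gone in {days_until_exhausted(fast):.2f} days")

# Still failing in the last 5 minutes: page.
ongoing = should_alert(0.02, 0.03, slo_target=0.999, threshold=14.4)
# The 1h window is still elevated but the last 5 minutes recovered: stay quiet.
recovered = should_alert(0.02, 0.0005, slo_target=0.999, threshold=14.4)
print(ongoing, recovered)
```

The short window is what lets the alert reset quickly after an incident ends; without it, the long window would keep the alert firing long after the errors stopped.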

Runbook Automation

Every alert should link to a runbook. Runbooks should be automatable:

yaml
# Example runbook structure
alert: HighDatabaseLatency
runbook: |
  ## Diagnosis
  1. Check connection pool saturation:
     `kubectl exec -it $POD -- curl localhost:9090/metrics | grep db_pool`
  2. Check slow queries:
     `SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;`
  3. Check for lock contention:
     `SELECT * FROM pg_locks WHERE NOT granted;`

  ## Mitigation
  1. If connection pool saturated → Scale up pool size or add read replicas
  2. If slow query detected → Kill query: `SELECT pg_cancel_backend($PID);`
  3. If lock contention → Identify blocking transaction and consider termination

  ## Escalation
  - If not resolved in 15 minutes → Page database team

Correlating Signals

The real power of observability comes from correlating logs, metrics, and traces. When an alert fires:

  1. Metrics tell you what changed (latency spike at 14:23)
  2. Traces tell you where the slowness is (payment service, span 3)
  3. Logs tell you why (connection pool exhausted, 0 available connections)

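Use the trace ID as the correlation key. Inject it into every log line and propagate it through every service call. Do not add it as a metric label: one label value per request creates unbounded cardinality. Link metrics to traces with exemplars instead.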
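Putting the trace ID on every log line can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the `x-trace-id` header, `TraceIdFilter`, and `handle_request` are hypothetical names for this example. OpenTelemetry itself propagates context in the W3C `traceparent` header and would normally handle this step for you.

```python
import contextvars
import io
import logging
import uuid
from typing import Dict

# Holds the trace ID for whichever request is currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(logging.Formatter(
        "trace_id=%(trace_id)s level=%(levelname)s msg=%(message)s"))
    logger = logging.getLogger("payment-service")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(headers: Dict[str, str], logger: logging.Logger) -> Dict[str, str]:
    """Reuse the caller's trace ID (or start one) and pass it downstream."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("charging payment")
        # Headers to send with any downstream call made for this request.
        return {"x-trace-id": trace_id}
    finally:
        trace_id_var.reset(token)


stream = io.StringIO()
log = build_logger(stream)
outgoing = handle_request({"x-trace-id": "abc123def456"}, log)
print(stream.getvalue().strip())
print(outgoing)
```

A context variable is used instead of a global so that concurrent requests, whether on threads or in async tasks, each see their own trace ID.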

Further Reading

  • Rate Limiting — How to monitor and alert on rate limit metrics
  • Circuit Breaker Pattern — Monitoring circuit breaker state changes
  • Nginx Deep Dive — Nginx metrics and access log analysis
  • Service Mesh — Automatic observability from the mesh layer
  • Prometheus Deep Dive — Advanced PromQL and Prometheus architecture
  • Grafana Dashboards — Building effective dashboards
  • Structured Logging — Production logging patterns
  • Google SRE book, Chapter 4 (Service Level Objectives) and Chapter 6 (Monitoring Distributed Systems)
  • Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022)
  • OpenTelemetry documentation: opentelemetry.io/docs/