Service Degradation Runbook

Overview

Service degradation means the service is running but performing poorly — elevated error rates, increased latency, partial failures, or reduced throughput. Unlike a complete outage, degradation is insidious: it may affect only some users, some endpoints, or some regions, making it harder to detect and diagnose.

This runbook provides a systematic approach to identifying the degraded component, determining the root cause, and applying the appropriate mitigation strategy.

Related: Database Failover Runbook | DDoS Response Runbook | Performance Review Checklist | Incident Response


Impact Assessment

Before diving into diagnosis, assess the blast radius:

| Question | How to Check | Action Based on Answer |
|---|---|---|
| What percentage of users are affected? | Error rate on dashboard | > 50% = SEV1, 10-50% = SEV2, < 10% = SEV3 |
| Which endpoints are degraded? | Per-endpoint latency/error dashboards | Focus diagnosis on affected endpoints |
| Is this getting worse or stable? | 15-minute trend on error rate graph | Getting worse = escalate immediately |
| When did it start? | Dashboard time range, deployment markers | Correlate with deploys, traffic changes, dependency issues |
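
The severity thresholds in the first row can be encoded so that bots and dashboards apply them the same way every time. This is a minimal sketch; the function name `classify_severity` is an assumption, and the boundaries follow the table literally (exactly 10% and exactly 50% both count as SEV2).

```python
def classify_severity(affected_pct: float) -> str:
    """Map the percentage of affected users (0-100) to a severity level."""
    if not 0 <= affected_pct <= 100:
        raise ValueError(f"percentage out of range: {affected_pct}")
    if affected_pct > 50:
        return "SEV1"
    if affected_pct >= 10:
        return "SEV2"
    return "SEV3"


if __name__ == "__main__":
    for pct in (3, 10, 50, 72.5):
        print(f"{pct}% affected -> {classify_severity(pct)}")
```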

Systematic Diagnosis

Follow this decision tree to identify the root cause. Each step takes 1-2 minutes.

Step 1: Was There a Recent Deployment?

bash
# Check deployment history
kubectl rollout history deployment/my-service -n production

# Check when the last rollout happened (the newest ReplicaSet is listed last)
kubectl get replicasets -n production -l app=my-service \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,READY:.status.readyReplicas

# Check deploy annotations on Grafana (visual correlation)
# Dashboard: https://grafana.example.com/d/my-service/overview

| Finding | Action |
|---|---|
| Deployment in last 2 hours and metrics degraded at deploy time | Go to Rollback Procedure |
| No recent deployment | Continue to Step 2 |
| Deployment, but degradation started before it | Continue to Step 2 |
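
The Step 1 decision can be written down as a small function, which makes the timing rules explicit. This is a sketch under assumptions: the name `next_step` is hypothetical, and `CORRELATION_SLACK` (how soon after a deploy the degradation must begin to count as "at deploy time") is an illustrative value that the runbook does not specify.

```python
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(hours=2)         # "deployment in last 2 hours"
CORRELATION_SLACK = timedelta(minutes=10)  # assumption: tune for your rollout speed


def next_step(now, last_deploy, degradation_start):
    """Return 'rollback' or 'step-2' following the Step 1 table."""
    if last_deploy is None or now - last_deploy > RECENT_WINDOW:
        return "step-2"  # no recent deployment
    if degradation_start < last_deploy:
        return "step-2"  # degradation started before the deployment
    if degradation_start - last_deploy <= CORRELATION_SLACK:
        return "rollback"  # metrics degraded at deploy time
    return "step-2"  # recent deploy, but the timing does not line up
```

A degradation that begins long after a deploy (a slow memory leak, for example) falls through to Step 2 here, where the resource checks are more likely to catch it.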

Step 2: Is a Dependency Down or Degraded?

bash
# Check circuit breaker states
kubectl exec -it $(kubectl get pod -l app=my-service -n production -o jsonpath='{.items[0].metadata.name}') \
  -n production -- curl -s http://localhost:8080/actuator/health | jq '.components.circuitBreakers'

# Check dependency dashboards
# PostgreSQL: https://grafana.example.com/d/postgresql/overview
# Redis: https://grafana.example.com/d/redis/overview
# External APIs: https://grafana.example.com/d/external-deps/overview

# Quick dependency connectivity test
kubectl exec -it $(kubectl get pod -l app=my-service -n production -o jsonpath='{.items[0].metadata.name}') \
  -n production -- bash -c "
    echo '=== PostgreSQL ===' && pg_isready -h db-primary.example.com -p 5432
    echo '=== Redis ===' && redis-cli -h redis.example.com ping
    echo '=== External API ===' && curl -s -o /dev/null -w '%{http_code} %{time_total}s' https://api.external.com/health
  "

| Finding | Action |
|---|---|
| Database is down | Go to Database Failover Runbook |
| External dependency is slow/down | Go to Activate Circuit Breaker |
| All dependencies healthy | Continue to Step 3 |

Step 3: Is Traffic Unusually High?

bash
# Compare current RPS to baseline
# Check Prometheus/Grafana for traffic patterns
# PromQL: sum(rate(http_requests_total{service="my-service"}[5m]))

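# Check which endpoints have served the most requests. These counters are cumulative
# since pod start, so confirm a current spike with the PromQL rate above.
# Sorting on the second field assumes label values contain no spaces.
kubectl exec -it $(kubectl get pod -l app=my-service -n production -o jsonpath='{.items[0].metadata.name}') \
  -n production -- curl -s http://localhost:8080/metrics \
  | grep '^http_requests_total' | sort -k2,2nr | head -20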

| Finding | Action |
|---|---|
| Traffic is 3x+ normal (potential DDoS) | Go to DDoS Response Runbook |
| Traffic is 1.5-3x normal (organic spike) | Go to Scale Up |
| Traffic is normal | Continue to Step 4 |
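
The traffic thresholds amount to a ratio of current requests per second to the baseline. A minimal sketch, assuming you already have both numbers from Prometheus; the function name and return labels are hypothetical, and exactly 3x is treated as the "3x+" case.

```python
def classify_traffic(current_rps: float, baseline_rps: float) -> str:
    """Map the current/baseline traffic ratio to the next runbook action."""
    if baseline_rps <= 0:
        raise ValueError("baseline_rps must be positive")
    ratio = current_rps / baseline_rps
    if ratio >= 3.0:
        return "ddos-runbook"  # 3x+ normal: potential DDoS
    if ratio >= 1.5:
        return "scale-up"      # 1.5-3x normal: organic spike
    return "step-4"            # traffic is normal
```

The ratio alone cannot tell an attack from a successful marketing campaign, so treat "ddos-runbook" as a prompt to inspect source IPs and user agents, not as a verdict.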

Step 4: Are Resources Exhausted?

bash
# Check CPU and memory
kubectl top pods -n production -l app=my-service

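# Check for OOM kills (last termination reason per pod; look for "OOMKilled")
kubectl get pods -n production -l app=my-service \
  -o custom-columns='POD:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,LAST_TERMINATION:.status.containerStatuses[*].lastState.terminated.reason'

# Check for CPU throttling (compare nr_throttled with nr_periods).
# The first path is cgroup v2, the fallback is cgroup v1.
kubectl exec -it $(kubectl get pod -l app=my-service -n production -o jsonpath='{.items[0].metadata.name}') \
  -n production -- sh -c 'cat /sys/fs/cgroup/cpu.stat 2>/dev/null || cat /sys/fs/cgroup/cpu/cpu.stat'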

# Check connection pools
kubectl exec -it $(kubectl get pod -l app=my-service -n production -o jsonpath='{.items[0].metadata.name}') \
  -n production -- curl -s http://localhost:8080/actuator/metrics/hikaricp.connections | jq '.'

# Check disk usage
kubectl exec -it $(kubectl get pod -l app=my-service -n production -o jsonpath='{.items[0].metadata.name}') \
  -n production -- df -h

| Finding | Action |
|---|---|
| CPU near 100% | Go to Scale Up or optimize hot path |
| Memory near limit / OOM kills | Increase memory limits or investigate memory leak |
| Connection pool exhausted | Go to Shed Load and tune pool |
| Disk full | Emergency cleanup, expand volume |
| Resources look normal | Continue to Step 5 |
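
Raw `cpu.stat` output is easier to judge as a ratio: the share of scheduler periods in which the container hit its CPU limit. The sketch below parses the `nr_periods` and `nr_throttled` counters, which appear in both the cgroup v1 and v2 versions of the file. The function name and the sample values are made up for illustration.

```python
def throttled_fraction(cpu_stat_text: str) -> float:
    """Fraction of CPU enforcement periods in which the container was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or the container has not run yet
    return stats.get("nr_throttled", 0) / periods


# Illustrative cgroup v2 sample, not real measurements.
SAMPLE = """\
usage_usec 8123456
user_usec 6000000
system_usec 2123456
nr_periods 4000
nr_throttled 1000
throttled_usec 52000000
"""

if __name__ == "__main__":
    print(f"{throttled_fraction(SAMPLE):.0%} of periods throttled")
```

The counters are cumulative since the container started, so for the current throttling rate take two readings a minute apart and apply the same ratio to the differences.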

Step 5: Check Application Logs

bash
# Recent error logs
kubectl logs -l app=my-service -n production --since=15m --tail=200 | grep -i "error\|exception\|fatal\|panic" | sort | uniq -c | sort -rn | head -20

# Look for specific patterns (--tail=-1 is required: with a label selector,
# kubectl logs returns only the last 10 lines per pod by default)
kubectl logs -l app=my-service -n production --since=15m --tail=-1 | grep -i "timeout\|connection refused\|circuit.open\|rate.limit" | tail -20

# Check for database contention patterns
kubectl logs -l app=my-service -n production --since=15m --tail=-1 | grep -i "slow query\|deadlock\|lock timeout" | tail -20
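
Counting identical lines works poorly when every log line carries its own timestamp, because no two lines are ever identical. One possible approach is to strip the timestamp and collapse digit runs before counting. The sketch below assumes lines begin with an ISO-8601 timestamp; the regular expressions and the name `top_errors` are assumptions to adapt to your log format.

```python
import re
from collections import Counter

# Assumption: each line starts with a timestamp such as 2024-05-01T12:00:01Z.
TIMESTAMP = re.compile(
    r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?\s*"
)
DIGITS = re.compile(r"\d+")
ERROR = re.compile(r"error|exception|fatal|panic", re.IGNORECASE)


def top_errors(lines, limit=20):
    """Group error lines by message shape and return the most frequent ones."""
    counts = Counter()
    for line in lines:
        if ERROR.search(line):
            message = TIMESTAMP.sub("", line.strip())
            counts[DIGITS.sub("<n>", message)] += 1
    return counts.most_common(limit)
```

To use it on live logs, pipe `kubectl logs ... --tail=-1` into a script that calls `top_errors(sys.stdin)` and prints each message with its count.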

Mitigation Strategies

Mitigation 1: Circuit Breaker

When to use: A downstream dependency is slow or failing, causing cascading latency and errors in your service.

bash
# If using a circuit breaker library with runtime configuration:

# Option 1: Force open the circuit breaker (stop calling the dependency)
kubectl exec -it $(kubectl get pod -l app=my-service -n production -o jsonpath='{.items[0].metadata.name}') \
  -n production -- curl -X POST http://localhost:8080/admin/circuit-breaker/payment-service/force-open

# Option 2: Update circuit breaker config via ConfigMap
kubectl patch configmap my-service-config -n production --type merge -p '{
  "data": {
    "CIRCUIT_BREAKER_PAYMENT_ENABLED": "true",
    "CIRCUIT_BREAKER_PAYMENT_FAILURE_THRESHOLD": "3",
    "CIRCUIT_BREAKER_PAYMENT_TIMEOUT_MS": "5000"
  }
}'
kubectl rollout restart deployment/my-service -n production

Fallback behaviors to implement:

| Dependency | Fallback | User Experience |
|---|---|---|
| Payment service | Queue for retry, return "processing" | "Your order is being processed" |
| Recommendation engine | Return popular items | Less personalized, but functional |
| Cache (Redis) | Bypass cache, hit DB directly | Slower, but correct |
| Search service | Return "Search temporarily unavailable" | Degraded, but honest |
| Analytics/logging | Drop analytics events silently | No user impact |

Circuit Breaker vs Retry

Retries make sense for transient failures (a network blip, momentary overload). Circuit breakers make sense for sustained failures (the dependency is down, overloaded, or misconfigured). Retrying against a dependency that is consistently failing makes the problem worse: with three attempts per request, for example, the struggling dependency receives up to 3x its normal load. Open the circuit breaker instead.
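
To make the distinction concrete, here is a minimal circuit breaker sketch. It is not the implementation of any particular library: the class and parameter names are hypothetical, the defaults are illustrative, and it is not thread-safe. After `failure_threshold` consecutive failures it rejects calls immediately, and once `reset_timeout_s` has passed it lets a single trial call through (the "half-open" state).

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("dependency call skipped")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers catch `CircuitOpenError` and serve the fallback from the table above, so a failing dependency costs one exception instead of a full timeout per request.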


Mitigation 2: Load Shedding

When to use: The service is overwhelmed by traffic it cannot handle. Rather than everything being slow, deliberately reject some traffic so the rest can be served well.

bash
# Option 1: Rate-limit requests at the ingress (ingress-nginx annotations;
# limit-rps applies per client IP, not to the service as a whole)
kubectl annotate ingress my-service-ingress -n production \
  nginx.ingress.kubernetes.io/limit-rps="50" \
  nginx.ingress.kubernetes.io/limit-burst-multiplier="3" \
  --overwrite

# Option 2: Enable application-level rate limiting
kubectl patch configmap my-service-config -n production --type merge -p '{
  "data": {
    "RATE_LIMIT_ENABLED": "true",
    "RATE_LIMIT_RPS": "100",
    "RATE_LIMIT_BURST": "200"
  }
}'
kubectl rollout restart deployment/my-service -n production

Load shedding priority matrix:

| Priority | Traffic Type | Action During Overload |
|---|---|---|
| Critical | Health checks, authentication | Always serve |
| High | Paid user API calls | Serve with best effort |
| Medium | Free tier API calls | Rate limit aggressively |
| Low | Analytics, webhooks | Drop entirely |
| Background | Batch jobs, reports | Pause completely |

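python
# Example: priority-based load shedding middleware.
# Sketch in the style of a Starlette/FastAPI HTTP middleware (an assumption; adapt
# it to your framework). get_current_load_percentage() and request.state.priority
# are placeholders that your application must provide.
from starlette.responses import Response


async def load_shedding_middleware(request, call_next):
    current_load = get_current_load_percentage()
    # Requests without an assigned priority are treated as medium.
    priority = getattr(request.state, "priority", "medium")

    if current_load > 95:
        # Shed everything except critical
        if priority != "critical":
            return Response(status_code=503, headers={"Retry-After": "30"})

    elif current_load > 80:
        # Shed low and background priority
        if priority in ("low", "background"):
            return Response(status_code=503, headers={"Retry-After": "15"})

    elif current_load > 70:
        # Shed only background
        if priority == "background":
            return Response(status_code=503, headers={"Retry-After": "10"})

    return await call_next(request)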

Mitigation 3: Fallback / Graceful Degradation

When to use: A non-critical feature is causing the entire service to degrade. Disable the feature to restore core functionality.

bash
# Disable specific features via feature flags
kubectl patch configmap my-service-config -n production --type merge -p '{
  "data": {
    "FEATURE_RECOMMENDATIONS": "false",
    "FEATURE_REAL_TIME_ANALYTICS": "false",
    "FEATURE_FULL_TEXT_SEARCH": "false"
  }
}'
kubectl rollout restart deployment/my-service -n production

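# Or if using a feature flag service (LaunchDarkly, Unleash).
# The URL and payload are placeholders; each provider has its own API.
curl -X PATCH https://feature-flags.example.com/api/flags/recommendations \
  -H "Authorization: Bearer $FF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'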

Feature Degradation Matrix

| Feature | Impact of Disabling | Risk | Decision |
|---|---|---|---|
| Recommendations | Less personalized results | Low | Disable during degradation |
| Real-time notifications | Users get notifications late | Low | Disable during degradation |
| Search autocomplete | Users type full queries | Low | Disable during degradation |
| Image processing | Images show as original size | Medium | Disable during degradation |
| Payment processing | Cannot complete purchases | Critical | Never disable; fix root cause |
| Authentication | Users cannot log in | Critical | Never disable; fix root cause |
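
Encoding the matrix as data makes it harder to switch off a critical feature by accident in the middle of an incident. The sketch below is illustrative only: the feature keys and the function name are hypothetical, and the risk levels are copied from the table.

```python
FEATURE_RISK = {
    "recommendations": "low",
    "real_time_notifications": "low",
    "search_autocomplete": "low",
    "image_processing": "medium",
    "payment_processing": "critical",
    "authentication": "critical",
}

# Risk levels that may be disabled, mildest first. "critical" is deliberately absent.
DISABLEABLE = ["low", "medium"]


def features_to_disable(max_risk="low"):
    """List features whose risk is at or below max_risk; never critical ones."""
    if max_risk not in DISABLEABLE:
        raise ValueError(f"refusing to disable features at risk level {max_risk!r}")
    allowed = DISABLEABLE[: DISABLEABLE.index(max_risk) + 1]
    return sorted(name for name, risk in FEATURE_RISK.items() if risk in allowed)
```

Starting with `max_risk="low"` and moving to `"medium"` only if metrics do not recover keeps the user-visible impact as small as possible.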

Mitigation 4: Scale Up

When to use: Traffic has increased beyond current capacity, but the service is healthy — it just needs more instances.

bash
# Check current replica count and HPA status
kubectl get hpa my-service-hpa -n production
kubectl get deployment my-service -n production

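# Manual scale up (if an HPA manages this deployment, it will override the replica
# count on its next sync; in that case raise the HPA bounds below instead)
kubectl scale deployment/my-service -n production --replicas=20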

# Or update HPA max if autoscaler is capped
kubectl patch hpa my-service-hpa -n production --type merge -p '{
  "spec": {
    "maxReplicas": 50
  }
}'

# Verify new pods are starting
kubectl get pods -n production -l app=my-service -w

# Verify new pods are passing health checks
kubectl get endpoints my-service -n production

Scaling Limitations

Scaling up only works if the bottleneck is in the application tier. If the bottleneck is in the database (connection pool, query performance, lock contention), adding more application pods will make it worse by increasing database load. Check database metrics before scaling.
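
A quick budget calculation shows why this matters: every pod opens its own connection pool, so the replica count multiplied by the pool size must stay below the database connection limit. A sketch with hypothetical names and made-up numbers; substitute your own `max_connections` and pool settings.

```python
def max_safe_replicas(db_max_connections: int, pool_size_per_pod: int, reserved: int = 0) -> int:
    """Largest replica count whose combined pools fit within the database limit.

    reserved covers connections kept for admins, migrations, and other services.
    """
    if pool_size_per_pod <= 0:
        raise ValueError("pool_size_per_pod must be positive")
    usable = db_max_connections - reserved
    return max(usable // pool_size_per_pod, 0)


if __name__ == "__main__":
    # Illustrative values: 200 connections allowed, 10 per pod, 20 held back.
    limit = max_safe_replicas(db_max_connections=200, pool_size_per_pod=10, reserved=20)
    print(f"Scale to at most {limit} replicas")
```

With these example numbers the limit is 18, so scaling to 20 replicas would leave some pods unable to obtain connections under load.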


Mitigation 5: Rollback Deployment

When to use: Degradation started immediately after or shortly after a deployment.

bash
# Check deployment history
kubectl rollout history deployment/my-service -n production

# Rollback to previous version
kubectl rollout undo deployment/my-service -n production

# Or rollback to a specific revision
kubectl rollout undo deployment/my-service -n production --to-revision=42

# Monitor the rollback
kubectl rollout status deployment/my-service -n production

# Verify error rates are recovering
# Check dashboard: https://grafana.example.com/d/my-service/overview

Before Rolling Back

  1. Verify the rollback version is safe: kubectl rollout history deployment/my-service -n production --revision=42
  2. Check for database migrations: If the new version included a migration, rolling back the code without rolling back the database may cause errors. Check with the team before rolling back.
  3. Document the rollback: Note the time, the version rolled back from/to, and the reason.

Verification

After applying any mitigation:

bash
# 1. Check error rate is decreasing
#    PromQL: sum(rate(http_requests_total{service="my-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="my-service"}[5m]))

# 2. Check latency is recovering
#    PromQL: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="my-service"}[5m])) by (le))

# 3. Check all pods are healthy
kubectl get pods -n production -l app=my-service

# 4. Run a synthetic health check
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" https://api.example.com/v1/healthz

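# 5. Check no new errors in logs (--tail=-1 because a label selector
#    otherwise limits output to the last 10 lines per pod)
kubectl logs -l app=my-service -n production --since=5m --tail=-1 | grep -ci error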
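
The error-rate query in check 1 divides the growth of the 5xx counter by the growth of the total request counter over the same window. The same arithmetic can be applied by hand to successive counter readings. This is a sketch with hypothetical function names; each sample is an `(errors_total, requests_total)` pair.

```python
def error_ratio(prev, curr):
    """Error ratio between two (errors_total, requests_total) counter samples."""
    d_err = curr[0] - prev[0]
    d_req = curr[1] - prev[1]
    if d_err < 0 or d_req < 0:
        raise ValueError("counter went backwards (pod restart?); take fresh samples")
    return d_err / d_req if d_req else 0.0


def is_recovering(samples):
    """True if the most recent interval has a lower error ratio than the first one."""
    ratios = [error_ratio(a, b) for a, b in zip(samples, samples[1:])]
    return len(ratios) >= 2 and ratios[-1] < ratios[0]
```

At least three samples are needed, since recovery is a comparison between two intervals rather than a single reading.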

Escalation

| Condition | Escalation Target | Channel |
|---|---|---|
| Degradation persists > 15 min after mitigation | Secondary on-call | PagerDuty |
| Error rate > 50% | Engineering Manager | Phone call |
| Customer-reported and not yet mitigated > 30 min | VP Engineering | Phone call |
| Data integrity concerns | Database Team Lead | #db-incidents Slack |
| Security-related degradation | Security On-Call | #security-incidents Slack |
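
Several conditions can hold at once, in which case every matching target should be contacted. The rules can be expressed as a function that returns all applicable rows. This is a sketch; the function and parameter names are hypothetical, and the thresholds are taken from the table.

```python
def escalation_targets(
    minutes_degraded_after_mitigation=None,
    error_rate_pct=0.0,
    customer_reported=False,
    minutes_unmitigated=0,
    data_integrity_concern=False,
    security_related=False,
):
    """Return (target, channel) pairs for every escalation condition that holds."""
    targets = []
    if minutes_degraded_after_mitigation is not None and minutes_degraded_after_mitigation > 15:
        targets.append(("Secondary on-call", "PagerDuty"))
    if error_rate_pct > 50:
        targets.append(("Engineering Manager", "Phone call"))
    if customer_reported and minutes_unmitigated > 30:
        targets.append(("VP Engineering", "Phone call"))
    if data_integrity_concern:
        targets.append(("Database Team Lead", "#db-incidents Slack"))
    if security_related:
        targets.append(("Security On-Call", "#security-incidents Slack"))
    return targets
```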

Escalation Message Template

markdown
**Service Degradation — [Service Name]**

**Severity**: SEV[1/2/3]
**Started**: [HH:MM UTC]
**Duration**: [X minutes]
**Current status**: [Investigating / Mitigating / Monitoring]

**Impact**:
- Error rate: [X%] (normal: [Y%])
- P99 latency: [X ms] (normal: [Y ms])
- Affected users: [estimated count or percentage]

**Root cause hypothesis**: [brief description]
**Mitigation attempted**: [what you tried]
**Help needed**: [specific ask — e.g., "need DBA to check query performance"]

**Dashboard**: [link]
**Incident channel**: #incident-[date]-[name]

Expected Timeline

| Step | Expected Duration | Escalate If |
|---|---|---|
| Impact assessment | 2 minutes | > 5 minutes |
| Systematic diagnosis | 5-10 minutes | > 15 minutes |
| Apply mitigation | 5 minutes | > 10 minutes |
| Verification | 5 minutes | Metrics not improving after 15 min |
| Total | 17-25 minutes | > 30 minutes total |