Skip to content
Unverified — AI-generated content. Help verify this page

Debugging in Production

Debugging in production is fundamentally different from debugging in development. You cannot set breakpoints freely, you cannot restart processes without consequence, and every diagnostic action carries the risk of making things worse. The instrumentation overhead that is acceptable in staging can cause cascading failures under real load.

This page covers the full toolkit: remote debugging, core dump analysis, memory leak detection, flame graphs, CPU profiling, network debugging, and distributed tracing. Every technique is accompanied by concrete commands and real-world interpretation guidance.

Related: Monitoring | Logging | Incident Response | Alerting


The Production Debugging Mindset

Before reaching for any tool, internalize these principles:

PrincipleWhy
Observe, don't mutateYour first actions should collect data, not change state
Minimize blast radiusDebug on one replica, not the entire fleet
Time-box your investigationIf you cannot diagnose in 15 minutes, escalate or mitigate
Preserve evidenceCapture dumps, logs, and metrics BEFORE restarting
Know your baselineYou cannot identify anomalies without knowing what "normal" looks like
Automate post-incidentIf you debugged it manually today, add telemetry so you do not have to next time

Remote Debugging

Node.js: --inspect

Node.js has a built-in debugging protocol accessible via Chrome DevTools or VS Code.

bash
# Start with debugging enabled (do NOT use --inspect in prod without auth)
node --inspect=0.0.0.0:9229 app.js

# Safer: only allow localhost connections
node --inspect=127.0.0.1:9229 app.js

# Signal a running process to enable the debugger
kill -USR1 <pid>  # Linux/macOS

DANGER

Never expose the Node.js inspector on 0.0.0.0 in production without authentication. The inspector protocol allows arbitrary code execution. Use SSH tunneling or a secure proxy to access the debugger remotely:

bash
# SSH tunnel to a production node
ssh -L 9229:localhost:9229 user@production-server

# Then connect Chrome DevTools to localhost:9229

Go: Delve

bash
# Attach to a running Go process
dlv attach <pid>

# Remote debugging (start on server, connect from local)
dlv attach <pid> --headless --listen=:2345 --api-version=2

# Connect from VS Code launch.json
{
  "type": "go",
  "request": "attach",
  "mode": "remote",
  "remotePath": "/app",
  "port": 2345,
  "host": "127.0.0.1"
}

Java: JDWP

bash
# JVM debug agent (use suspend=n so the app does not pause on start)
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 \
  -jar myapp.jar

# In Kubernetes, port-forward to the debug port
kubectl port-forward pod/myapp-abc123 5005:5005

Python: debugpy

python
# Attach debugpy to a running process
import debugpy
debugpy.listen(("0.0.0.0", 5678))
# debugpy.wait_for_client()  # Uncomment to pause until debugger connects

Remote Debugging Comparison

RuntimeToolProtocolOverhead
Node.js--inspectChrome DevTools ProtocolLow when idle, moderate when profiling
GoDelveDAPModerate (pauses goroutines)
JavaJDWPJava Debug Wire ProtocolLow when idle
PythondebugpyDebug Adapter ProtocolModerate
RustGDB/LLDBGDB Remote Serial ProtocolLow

Core Dump Analysis

When a process crashes, a core dump captures the entire memory state at the moment of failure. It is the forensic evidence you need to diagnose crashes that are not reproducible.

Enabling Core Dumps

bash
# Check current limits
ulimit -c

# Enable core dumps (unlimited size)
ulimit -c unlimited

# Set core dump pattern (Linux)
echo '/tmp/core.%e.%p.%t' | sudo tee /proc/sys/kernel/core_pattern

# For containerized environments, set in the Pod spec:
# securityContext:
#   capabilities:
#     add: ["SYS_PTRACE"]

Analyzing with GDB

bash
# Load a core dump
gdb /usr/bin/myapp /tmp/core.myapp.12345.1679000000

# Inside GDB:
(gdb) bt              # Full backtrace
(gdb) bt full          # Backtrace with local variables
(gdb) info threads     # List all threads
(gdb) thread 3         # Switch to thread 3
(gdb) frame 5          # Switch to stack frame 5
(gdb) print my_var     # Inspect variable
(gdb) info registers   # CPU register state

Node.js Core Dumps

bash
# Generate a core dump from a running Node.js process
gcore <pid>

# Or use --abort-on-uncaught-exception
node --abort-on-uncaught-exception app.js

# Analyze with llnode (LLDB plugin for Node.js)
lldb -c core.12345 -- node
(lldb) plugin load /path/to/llnode.so
(lldb) v8 bt          # JavaScript backtrace
(lldb) v8 inspect <addr>  # Inspect a JS object

WARNING

Core dumps contain the entire process memory, which may include secrets, tokens, and PII. Handle them with the same security controls as production data. Never commit core dumps to source control or transfer them over unencrypted channels.

Memory Leak Detection

Memory leaks in production manifest as gradually increasing RSS (Resident Set Size) over hours or days, eventually causing OOM kills or degraded performance from excessive GC pressure.

The Detection Workflow

Node.js: Heap Snapshots

bash
# Signal a running process to write a heap snapshot
kill -USR2 <pid>  # If your app handles USR2 for heapdumps

# Or from code:
# v8.writeHeapSnapshot()    — built-in (Node 12+)
# require('heapdump').writeSnapshot()  — heapdump package
javascript
// Programmatic heap snapshot on memory threshold
import v8 from 'node:v8';
import { memoryUsage } from 'node:process';

setInterval(() => {
  const { heapUsed } = memoryUsage();
  if (heapUsed > 500 * 1024 * 1024) { // 500MB threshold
    const filename = v8.writeHeapSnapshot();
    console.error(`Heap snapshot written to ${filename}`);
  }
}, 60_000);

Load snapshots in Chrome DevTools (Memory tab) and use the "Comparison" view between two snapshots to find objects that are accumulating.

Common Node.js Leak Patterns

PatternCauseFix
Event listener accumulationAdding listeners without removing themremoveListener / once / AbortSignal
Unbounded cachesMap or Object used as cache without evictionUse lru-cache with max size
Closures holding referencesCallbacks retaining large scope chainsMinimize closure scope, nullify references
Uncleared timerssetInterval without clearIntervalClean up in shutdown handlers
Stream backpressure ignoredReadable producing faster than writable consumingHandle drain events properly
Global variable accumulationAppending to arrays/maps on globalAvoid globals; scope data to request lifecycle

Go: pprof

bash
# Enable pprof in your Go application
import _ "net/http/pprof"

# Collect a heap profile
curl -s http://localhost:6060/debug/pprof/heap > heap.prof

# Analyze
go tool pprof heap.prof
(pprof) top 20           # Top memory consumers
(pprof) list MyFunction  # Source-annotated allocation
(pprof) web              # Open interactive graph in browser

# Compare two profiles (find what grew)
go tool pprof -base heap1.prof heap2.prof

Valgrind (C/C++)

bash
# Full memory leak detection
valgrind --leak-check=full --show-leak-kinds=all \
  --track-origins=yes ./myapp

# Output interpretation:
# "definitely lost"  — memory with no pointer to it (real leak)
# "indirectly lost"  — reachable only through a definitely lost block
# "possibly lost"    — ambiguous (interior pointer exists)
# "still reachable"  — not freed at exit but still referenced

Flame Graphs

Flame graphs are the single most effective visualization for understanding where CPU time is spent. Each box represents a function; width represents time (or samples). The x-axis is NOT time — it is sorted alphabetically. The y-axis is stack depth.

Reading a Flame Graph

┌───────────────────────────────────────────────────────────────┐
│                        main()                                  │
├──────────────────────────────┬────────────────────────────────┤
│        handleRequest()       │         processQueue()         │
├──────────────┬───────────────┤────────────────────────────────┤
│  parseJSON() │ validateInput()│     serializeResponse()       │
├──────────────┤               ├────────────────────────────────┤
│ JSON.parse() │               │      JSON.stringify()          │
└──────────────┘               └────────────────────────────────┘

Width = proportion of CPU time
Wide boxes at the TOP = the bottleneck functions (where CPU is actually consumed)
Wide boxes at the BOTTOM = common ancestors (callers, not bottlenecks themselves)

TIP

Look for wide plateaus at the top of the flame graph — those are the functions actually consuming CPU. Wide boxes at the bottom (like main()) are just callers. The actionable insight is always in the widest boxes closest to the top.

Generating Flame Graphs

Node.js (0x)

bash
# Profile and generate flame graph in one command
npx 0x app.js

# Profile a running process (using --inspect)
node --inspect app.js &
# Then use Chrome DevTools CPU profiler → export → convert with speedscope

Linux (perf)

bash
# Record CPU samples for 30 seconds
sudo perf record -F 99 -p <pid> -g -- sleep 30

# Generate flame graph
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg

# Or use the modern alternative:
sudo perf script | inferno-collapse-perf | inferno-flamegraph > flame.svg

Java (async-profiler)

bash
# Attach to a running JVM
./asprof -d 30 -f flame.html <pid>

# CPU + allocation profiling combined
./asprof -e cpu,alloc -d 60 -f combined.html <pid>

Go (pprof)

bash
# Collect 30-second CPU profile
curl -s 'http://localhost:6060/debug/pprof/profile?seconds=30' > cpu.prof

# View as flame graph in browser
go tool pprof -http=:8080 cpu.prof

Flame Graph Tools Comparison

ToolLanguage/RuntimeOverheadOutput
0xNode.jsLowInteractive HTML
perfAny (Linux)Very lowSVG (via scripts)
async-profilerJVMVery lowHTML/SVG/JFR
pprofGoLowInteractive web UI
py-spyPythonLowSVG/speedscope
rbspyRubyLowSVG/speedscope
SpeedscopeAny (viewer)N/AInteractive web viewer

CPU Profiling in Production

Continuous Profiling

Rather than profiling reactively during incidents, continuous profiling collects low-overhead samples all the time:

ToolTypeOverheadLanguages
PyroscopeOpen source~2-5%Go, Java, Python, Ruby, Node.js, Rust
ParcaOpen source~1-3%Any (eBPF-based)
Datadog Continuous ProfilerCommercial~2%Go, Java, Python, Ruby, Node.js
Google Cloud ProfilerCommercial~1%Go, Java, Python, Node.js

Profiling Safely in Production

TechniqueSafety
Sampling profiler (perf, async-profiler)Safe — low overhead, no app modification
Instrumentation profiler (V8 --prof)Risky — high overhead, can affect latency
Heap snapshotRisky — pauses the process briefly, memory spike
Core dumpSafe for analysis, but large file + may contain secrets
eBPFSafest — kernel-level, near-zero overhead

WARNING

Never enable instrumentation-based profiling in production under heavy load. Sampling profilers (which interrupt the process at fixed intervals) add 1-5% overhead. Instrumentation profilers (which hook every function call) can add 10-100x overhead and turn a performance problem into an outage.

Network Debugging

When the problem is between services — timeouts, dropped connections, unexpected latency — you need network-level visibility.

tcpdump

bash
# Capture all traffic on port 443
sudo tcpdump -i eth0 port 443 -w capture.pcap

# Capture traffic to a specific host
sudo tcpdump -i any host api.example.com -nn

# Capture HTTP traffic (unencrypted) with content
sudo tcpdump -i eth0 port 80 -A -s 0

# Capture DNS queries
sudo tcpdump -i any port 53 -nn

# Capture traffic between two pods (Kubernetes)
kubectl debug node/my-node -it --image=nicolaka/netshoot -- \
  tcpdump -i any host 10.0.1.5 -w /tmp/capture.pcap

mtr (My Traceroute)

bash
# Continuous traceroute with packet loss statistics
mtr --report --report-cycles 100 api.example.com

# Output:
# HOST                  Loss%   Snt   Last   Avg  Best  Wrst StDev
# 1. gateway.local       0.0%   100    0.5   0.6   0.3   1.2   0.2
# 2. isp-router.net      0.0%   100    5.2   5.1   4.8   6.3   0.3
# 3. backbone.carrier    2.0%   100   15.3  14.8  14.1  20.3   1.5  # <-- packet loss
# 4. api.example.com     2.0%   100   18.2  17.9  17.1  22.5   1.2

curl for HTTP Debugging

bash
# Detailed timing breakdown
curl -o /dev/null -s -w "\
  DNS:        %{time_namelookup}s\n\
  Connect:    %{time_connect}s\n\
  TLS:        %{time_appconnect}s\n\
  TTFB:       %{time_starttransfer}s\n\
  Total:      %{time_total}s\n\
  HTTP Code:  %{http_code}\n\
  Size:       %{size_download} bytes\n" \
  https://api.example.com/health

# Example output:
#   DNS:        0.015s
#   Connect:    0.045s
#   TLS:        0.120s    <-- slow TLS handshake?
#   TTFB:       0.350s    <-- server processing time
#   Total:      0.380s

Common Network Issues

SymptomLikely CauseInvestigation
Timeouts to one serviceDNS resolution, service downdig, curl -v, service health check
Intermittent connection resetsConnection pool exhaustion, NAT table overflowss -s, conntrack -S, pool metrics
High latency varianceNetwork congestion, GC pauses on remotemtr, remote service profiling
SSL handshake failuresCertificate expiry, TLS version mismatchopenssl s_client, curl -v
Partial responsesMTU issues, proxy bufferingtcpdump, check MTU with ping -M do -s 1472

Debugging Distributed Systems

In microservices architectures, a single user request touches multiple services. Debugging requires correlation across service boundaries.

Correlation IDs

Every request entering your system should receive a unique correlation ID that propagates through all downstream calls:

typescript
// Middleware to generate/propagate correlation IDs
import { randomUUID } from 'node:crypto';

function correlationMiddleware(req, res, next) {
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  req.correlationId = correlationId;
  res.setHeader('x-correlation-id', correlationId);

  // Attach to all outgoing HTTP calls
  // Attach to all log messages
  next();
}
bash
# Search logs by correlation ID across all services
# (assumes structured logging to a centralized system)
kubectl logs -l app=api-gateway --all-containers | \
  jq 'select(.correlationId == "abc-123-def")'

Distributed Tracing

Distributed tracing extends the correlation ID concept with timing, parent-child relationships, and span metadata.

OpenTelemetry Setup

typescript
// tracing.ts — initialize before importing any other module
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'product-service',
});

sdk.start();
typescript
// Manual span creation for custom business logic
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('product-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      const validated = await validateOrder(orderId);
      span.addEvent('order_validated');

      const charged = await chargePayment(validated);
      span.addEvent('payment_charged');

      return charged;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: 2, message: error.message }); // ERROR
      throw error;
    } finally {
      span.end();
    }
  });
}

Tracing Backends

BackendTypeKey Feature
JaegerOpen sourceBuilt-in dependency graph
ZipkinOpen sourceSimple, battle-tested
Tempo (Grafana)Open sourcePairs with Loki + Prometheus
HoneycombCommercialHigh-cardinality queries, BubbleUp analysis
Datadog APMCommercialUnified metrics + traces + logs
AWS X-RayCommercialNative AWS integration

The Three Questions

When debugging a distributed system issue, you need answers to three questions:

QuestionTool
What is happening?Metrics (Prometheus, Grafana)
Why is it happening?Logs (Loki, ELK, CloudWatch)
Where is it happening?Traces (Jaeger, Tempo, Honeycomb)

TIP

The highest-leverage investment in distributed debugging is connecting your metrics, logs, and traces. When you can click from a Grafana alert to the relevant traces, and from a trace span to the corresponding log lines, your mean time to resolution (MTTR) drops dramatically. Grafana's Tempo + Loki + Prometheus stack provides this out of the box.

Production Debugging Checklist

Use this checklist during incidents to ensure you collect sufficient evidence before applying fixes:

StepCommand/ActionPurpose
1. Check metricsGrafana dashboardIdentify the anomaly
2. Check recent deploysgit log --oneline -10Correlate with changes
3. Collect logskubectl logs --since=10mError messages, stack traces
4. Check resource usagetop, htop, kubectl top podsCPU, memory, disk
5. Network connectivitycurl -v, dig, mtrDNS, routing, TLS
6. Take heap snapshotkill -USR2 <pid>If memory is climbing
7. CPU profileperf record -p <pid> -g -- sleep 30If CPU is high
8. Trace a requestJaeger/Tempo UIIf latency is high
9. Check dependenciesHealth endpoints of downstream servicesIf errors are cascading
10. Preserve evidenceCopy dumps, logs, metrics screenshotsPost-incident analysis

Further Reading

"What I cannot create, I do not understand." — Richard Feynman