Production Debugging Without a Debugger
You are forty minutes into an incident. P1 ticket open, on-call engineer paged, Slack exploding. The service is degraded but not down. Attaching a debugger to a live process is out of the question — you'd need to pause threads, risk dropping requests, and most container runtimes make it painful anyway. So you do what experienced engineers actually do: you read the runtime like a crime scene.
This post is about systematic production debugging when attaching a debugger is off the table. It is not a substitute for good observability design — but it is what you fall back on when that design has gaps.
Start With Logs, Not Assumptions
The first instinct during an incident is to form a theory and look for confirmation. Resist it. Read the logs without a hypothesis for the first three minutes.
Structured logs are your baseline. If your service emits JSON, you can query in-flight without redeployment. Tools like jq, lnav, or your logging backend's query language let you slice by field, not just grep for strings.
# Pull the last 10 minutes of logs from a pod, filter for status >= 500
kubectl logs deployment/payment-service --since=10m \
| jq 'select(.status >= 500) | {time, trace_id, path, status, duration_ms, upstream}'
What you want from the first pass:
- Error rate shape — is it a step function (something changed) or a ramp (something is filling up)? A quick check for this is sketched just after this list.
- Blast radius — are all endpoints affected, or only a subset?
- Correlation fields — do the erroring requests share a user segment, a region, or an upstream dependency?
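A quick way to see the shape, assuming your JSON logs carry an ISO-8601 time field (the same field queried above): bucket errors per minute and eyeball the counts.
# Errors per minute: a sudden jump suggests a deploy or config change,
# a steady climb suggests a resource filling up
kubectl logs deployment/payment-service --since=30m \
| jq -r 'select(.status >= 500) | .time[0:16]' \
| sort | uniq -c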
# Count errors by upstream dependency to find the blast source
kubectl logs deployment/payment-service --since=10m \
| jq -r 'select(.level == "error") | .upstream' \
| sort | uniq -c | sort -rn
If your logs are unstructured, you are already paying a tax. Grep can still save you — look for patterns around the timestamp of the first alert, not around the current time. Incidents rarely peak when they start.
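If you are stuck with grep, anchor the search on the time of the first alert. A minimal sketch, assuming plain-text logs at a hypothetical path and a first alert at 17:42:
# Search around the first alert's timestamp, not around "now"
grep '17:42:' /var/log/payment-service/app.log \
| grep -iE 'error|timeout|refused' | head -50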
Dynamic Sampling Without a Code Change
Sometimes you need more signal than logs give you. The trap is redeploying with extra logging — you burn ten minutes and introduce a risk of your own. Use dynamic instrumentation instead.
Java (Arthas)
Arthas is an open-source Java diagnostic tool from Alibaba that attaches to a running JVM with no restart.
# Attach to the running JVM
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar --pid $(pgrep -f payment-service)
# Once inside: watch method args and return values in real time
watch com.example.PaymentProcessor process '{params, returnObj, throwExp}' -n 5 -x 3
The watch command intercepts method invocations at runtime and dumps arguments, return values, and exceptions — without touching source code or restarting the process.
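If watch shows sane arguments but slow responses, Arthas's trace command shows where the time goes inside the call tree. The 200 ms threshold below is an arbitrary example:
# Trace only invocations slower than 200ms; stop after 5 samples
trace com.example.PaymentProcessor process '#cost > 200' -n 5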
Python (py-spy)
py-spy is a sampling profiler for Python that requires no code changes. It reads the target CPython process's memory from the outside, so the overhead on the running application is negligible.
pip install py-spy
# Top-like view of where time is being spent
py-spy top --pid $(pgrep -f gunicorn)
# Record a flamegraph for 30 seconds
py-spy record -o flamegraph.svg --pid $(pgrep -f gunicorn) --duration 30
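For a one-shot snapshot rather than a sampling window, py-spy dump prints the current stack of every thread:
# Dump the current Python stack of each thread in the process
py-spy dump --pid $(pgrep -f gunicorn)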
Node.js (clinic.js / 0x)
0x generates flamegraphs for Node processes, but it wraps the process at launch rather than attaching to a running pid, so it is most useful when you can afford a controlled restart of a single instance:
# Launch one instance under the profiler; a flamegraph is written on exit
npx 0x --output-dir profile -- node server.js
# For a process you cannot restart, SIGUSR1 enables the V8 inspector on port 9229
kill -USR1 $(pgrep -f "node server.js")
Reading System Calls With strace and ltrace
When the problem is below the application layer — slow disk, weird socket behavior, unexpected syscall patterns — strace is the tool. Use it surgically.
# Watch all network-related syscalls on a single PID
strace -f -e trace=network -p $(pgrep -f payment-service) 2>&1 | head -100
# Count syscall frequency to find what's being hammered
strace -f -c -p $(pgrep -f payment-service) &
sleep 30 && kill %1
The -c flag aggregates counts and time per syscall type. If you see futex dominating, you have lock contention. If read and write dominate with tiny byte counts, you likely have a chatty socket protocol or unbuffered I/O.
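Once the counts point at a suspect, the -T flag appends the time spent inside each individual call, which separates many fast calls from a few slow ones:
# Per-call latency for the suspect syscalls
strace -f -T -e trace=read,write -p $(pgrep -f payment-service) 2>&1 | head -50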
ltrace is the shared-library equivalent — it intercepts library calls rather than kernel calls. Useful when diagnosing TLS negotiation overhead or malloc pressure:
ltrace -p $(pgrep -f payment-service) -e 'malloc+free+SSL_*' 2>&1 | tail -50
eBPF: Surgical Observation With Near-Zero Overhead
eBPF programs run inside the kernel, are checked for safety by the in-kernel verifier before loading, and can trace anything from syscalls to TCP retransmits without touching the application.
bpftrace one-liners for incidents:
# Which files is this process opening most?
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == 12345/ { @[str(args->filename)] = count(); }'
# TCP retransmit rate by destination IP
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[ntop(args->daddr)] = count(); }'
# Latency histogram for all read() calls taking > 1ms
bpftrace -e 'tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
$lat = (nsecs - @start[tid]) / 1000;
if ($lat > 1000) { @us = hist($lat); }
delete(@start[tid]);
}'
The bcc toolkit ships higher-level tools:
# Disk I/O latency distribution, broken down per device
biolatency -D 10 1
# TCP connection latency from SYN to ESTABLISHED
tcpconnlat 10 # show connections taking > 10ms
# Trace all new TCP connections to port 5432 (database)
tcpconnect -P 5432
eBPF requires kernel >= 4.9 for most features, and >= 5.8 for BTF-based CO-RE tools. Most cloud providers' managed Kubernetes nodes meet this today.
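A quick capability check before reaching for CO-RE tools:
# Kernel version and BTF availability
uname -r
test -f /sys/kernel/btf/vmlinux && echo "BTF available; CO-RE tools will work"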
Controlled SSH: When You Have No Other Option
Sometimes your observability gaps are large enough that you need to SSH into the instance and look around manually. Do it with discipline.
When you SSH into production, follow this checklist:
# 1. Memory pressure
free -h
cat /proc/meminfo | grep -E 'MemAvailable|Dirty|Writeback'
# 2. CPU steal (if in a VM — steal > 5% means noisy neighbor)
top -bn1 | grep Cpu | awk '{print "steal:", $16}'
# 3. File descriptor exhaustion
ls /proc/$(pgrep -f payment-service)/fd | wc -l
grep 'open files' /proc/$(pgrep -f payment-service)/limits # per-process limit
cat /proc/sys/fs/file-max # system-wide limit
# 4. Network socket states
ss -s # summary
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c # state distribution
# 5. OOM killer activity in the last hour
journalctl -k --since="1 hour ago" | grep -i oom
The key discipline: SSH is read-only investigation. You gather evidence, you do not make changes. If something looks fixable from a shell, raise it, document it, and apply it through your normal change process — even during an incident.
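One cheap way to enforce that discipline is to record the whole session before you start poking around; the output path here is an arbitrary choice:
# Capture everything typed and printed, then attach it to the incident ticket
script -a /tmp/incident-$(date +%s).typescript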
Correlating Across Signals
No single signal tells the whole story. The pattern that matters is the correlation across layers.
The workflow is:
- Logs first — establish blast radius and shape.
- Metrics second — confirm the signal in numbers (error rate, p99 latency, saturation).
- Dynamic sampling third — instrument the live process if logs lack detail.
- Syscall / eBPF fourth — if the problem is below the application layer.
- SSH last — read-only, documented, time-boxed.
Building Incident-Ready Observability
The best production debugging is the kind you never have to do ad-hoc because you anticipated the signals you would need.
After every incident, ask two questions before closing the ticket:
- What would have cut our time-to-diagnose in half? Add that observability before the next sprint ends.
- Was there a signal that existed but that we did not look at? Update your runbook.
Concrete practices:
- Ship trace IDs on every outbound log line. You cannot correlate across services without them.
- Add a sampling rate header to your internal services so you can raise sampling on a specific user segment or endpoint dynamically via feature flag, without redeploying.
- Keep a per-service debugging runbook with the exact jq, kubectl, and bpftrace commands that worked last time.
- Expose FD and goroutine / thread counts as Prometheus metrics. You cannot alert on what you cannot see (a minimal sketch follows this list).
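A minimal sketch of the FD metric, assuming node_exporter's textfile collector is enabled and pointed at /var/lib/node_exporter (both the path and the metric name are assumptions):
# Export the service's open-FD count via the textfile collector
# (metric name payment_service_open_fds is made up; run from cron or a timer)
PID=$(pgrep -f payment-service)
echo "payment_service_open_fds $(ls /proc/$PID/fd | wc -l)" \
> /var/lib/node_exporter/payment_fds.prom.tmp \
&& mv /var/lib/node_exporter/payment_fds.prom.tmp /var/lib/node_exporter/payment_fds.prom # atomic swap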
Key Takeaways
- Logs are your first signal — read them without a hypothesis for the first few minutes, then filter to correlate across error fields.
- Dynamic instrumentation tools (Arthas for JVM, py-spy for Python) let you add observability to a running process without restarting it.
- strace -c quickly reveals pathological syscall patterns; eBPF / bpftrace provides the same with negligible overhead at scale.
- SSH into production is a last resort, must be read-only, and every finding must be documented in the incident ticket.
- The correlation pattern — logs, then metrics, then dynamic sampling, then OS-level — is more reliable than any single tool.
- Every incident is a signal about your observability gap; close it before the next sprint.