Designing Dashboards Engineers Actually Look At
Most engineering dashboards are built for the demo, not for the incident. They have forty panels, three rows of sparklines in colors that mean nothing, and a title row that says "Service Health" without telling you what health means. When something breaks at 2 AM, engineers ignore them and go straight to logs. That is a design failure, not a tool failure.
The dashboard that gets used during an incident answers one question per panel and tells you what to do next. This post covers how to design that dashboard — the four-panel rule, signal selection, panel layout, and the anti-patterns that make dashboards useless.
The Four-Panel Rule
A service dashboard should lead with exactly four panels. Not four rows. Four panels, large, at the top of the page. If you cannot fit the most important signals for a service into four panels, you do not understand what matters most about that service.
The four panels map directly to the RED method for request-based services:
| Panel | Metric | Why |
|---|---|---|
| 1 | Request Rate | Are people using the service? Is volume anomalous? |
| 2 | Error Rate (%) | Is the service returning failures? |
| 3 | Latency (p50 / p95 / p99) | Are successful requests fast enough? |
| 4 | Saturation | Is a resource approaching its limit? |
Saturation is the most frequently omitted panel. Engineers know to watch error rate and latency; they forget that a connection pool at 95% capacity will cause latency to spike within minutes even if error rates are still low.
Selecting Top-of-Funnel Signals
The hardest part of dashboard design is signal selection. Prometheus gives you hundreds of metrics per service. Most of them are diagnostic — useful when you already know the problem. The top-of-funnel panels should show symptoms, not causes.
A symptom is observable by the user. A cause explains why.
- Symptom: http_request_duration_seconds{quantile="0.99"} > 2
- Cause: go_gc_pause_seconds_sum rising
The four-panel dashboard shows symptoms. Everything below that fold is causal — it exists to help you answer "why" once the top panels have told you "what."
Grafana PromQL for the four panels:
```
# Panel 1: Request Rate (per-second, grouped by status class)
sum(rate(http_requests_total{job="payment-service"}[2m])) by (status_class)

# Panel 2: Error Rate (ratio, not raw count)
sum(rate(http_requests_total{job="payment-service", status=~"5.."}[2m]))
  /
sum(rate(http_requests_total{job="payment-service"}[2m]))

# Panel 3: Latency percentiles
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="payment-service"}[5m])) by (le))
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="payment-service"}[5m])) by (le))
# Panel 4: Connection pool saturation (sum over connection states; scalar() assumes a single Postgres target)
sum(pg_stat_activity_count{datname="payments"}) / scalar(pg_settings_max_connections) * 100
```

One mistake engineers make: they use raw counts instead of ratios for error panels. Raw counts make spikes look worse than they are during traffic ramps. Always express errors as a ratio of total requests.
Traffic, Errors, Latency, Saturation — In That Order
The order of the four panels is not arbitrary. It reflects how you read an incident from the outside in.
Traffic first. If request rate dropped to zero, you have a routing or deployment problem — not a service problem. The error rate panel will show 0%, which looks fine. Looking at traffic first prevents this misdirection.
Errors second. Error rate tells you if the service is returning failures. Look at the shape: a step function suggests a deploy or a configuration push. A ramp suggests resource exhaustion or a downstream dependency degrading.
Latency third. Latency can climb before errors do. If p99 climbs from 200ms to 1.8s but errors are still low, you have a queue building up or a slow downstream. You have time, but not much.
Saturation last. Saturation is predictive. A healthy service with CPU at 85% and connection pool at 90% will fail — you just do not know when. The saturation panel catches failures before they happen.
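When you want that prediction on the panel itself, PromQL can project the trend forward. A minimal sketch using predict_linear over the same postgres_exporter metrics as Panel 4; the 30-minute lookback and 30-minute horizon are illustrative choices:

```
# Projected connection count 30 minutes from now, as a % of max_connections.
# The subquery sums connections across states; scalar() assumes one Postgres target.
predict_linear(sum(pg_stat_activity_count{datname="payments"})[30m:1m], 1800)
  / scalar(pg_settings_max_connections) * 100
```

If that projection crosses 100 while current utilization still looks tolerable, you get to page before the failure instead of after it.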
Panel Layout and Visual Grammar
Good Grafana dashboards use a consistent visual grammar so engineers parse them without thinking.
Colors:
- Green = good / below threshold
- Yellow = warning / approaching threshold
- Red = bad / above threshold
Do not use these colors for anything else. Do not use blue for errors or orange for "informational." Every color deviation burns cognitive budget during an incident.
Thresholds in every panel:
Every panel should have at least one threshold line drawn. The threshold is your SLO budget or your known failure point. Without it, a number in isolation is meaningless.
```
// Grafana panel threshold configuration
{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      { "color": "green", "value": null },
      { "color": "yellow", "value": 0.01 },
      { "color": "red", "value": 0.05 }
    ]
  }
}
```

Time ranges:
Set your default dashboard time range to the last 3 hours, not 24 hours. Incidents happen in minutes, not days. Twenty-four-hour views flatten the signal. On a 24-hour graph, a 10-minute spike looks like a thin sliver. On a 3-hour graph, it tells a story.
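In the dashboard JSON model this is a small top-level setting. A sketch; the 30-second refresh interval is an illustrative choice, not a requirement:

```
{
  "time": { "from": "now-3h", "to": "now" },
  "refresh": "30s"
}
```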
Legends:
Every legend should show the current value, not just the label. In Grafana, set Legend: Table and display Last and Max. When you arrive at the dashboard mid-incident, the first thing you want is the current number, not a colored line you have to trace to its current position.
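For Grafana time series panels, that maps to the panel's legend options. A sketch of the relevant fragment; the field names reflect recent Grafana versions and may differ in older ones:

```
"options": {
  "legend": {
    "displayMode": "table",
    "placement": "bottom",
    "calcs": ["lastNotNull", "max"]
  }
}
```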
Upstream Dependency Row
Below the four top panels, add one row per upstream dependency. Each dependency gets two panels: latency and error rate. Nothing else.
```
# Upstream dependency error rate from the client's perspective
sum(rate(http_client_requests_total{job="payment-service", upstream="stripe", status=~"5.."}[2m]))
  /
sum(rate(http_client_requests_total{job="payment-service", upstream="stripe"}[2m]))

# Upstream dependency p99 latency (from client instrumentation)
histogram_quantile(0.99,
  sum(rate(http_client_request_duration_seconds_bucket{job="payment-service", upstream="stripe"}[5m])) by (le)
)
```

The discipline here is measuring from the client's perspective, not the upstream service's own metrics. The upstream may report a 200ms average; you may be experiencing 1.2s because of retry storms or connection overhead on your side.
Anti-Patterns That Kill Dashboards
1. The "just in case" panel. You add a panel because someone asked "can we add X to the dashboard?" The answer is always "yes, if it replaces something less useful." Panels never get removed; they accumulate. After six months, you have a wall of sparklines and no one looks at any of them.
2. Non-percentage error panels. payments_failed: 142 tells you nothing without context. payments_failed / payments_total: 4.2% tells you whether to wake someone up.
3. The single-host panel. Panels that show metrics for a single pod or instance, not the fleet aggregate. During incidents you care about the service's behavior, not one instance. Instance-level panels belong on a separate "debugging" dashboard (see the sketch after this list).
4. No time correlation. Panels with different scrape intervals or different rate() windows on the same dashboard will show spikes at different times for the same event. Pin all panels in a dashboard to the same $__rate_interval Grafana variable.
```
# Use $__rate_interval instead of hardcoded 2m or 5m
rate(http_requests_total{job="payment-service"}[$__rate_interval])
```

5. Dashboard-as-documentation. Dashboards are not the place for architecture diagrams, runbook links, or "this metric means X" text panels. Put that in a runbook. Dashboards are for live signal, not static text.
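To make anti-pattern 3 concrete, the service dashboard aggregates over the fleet while the debugging dashboard scopes to a single instance; $instance below is a hypothetical Grafana template variable, not something defined elsewhere in this post:

```
# Service dashboard: fleet aggregate across every pod
sum(rate(http_requests_total{job="payment-service"}[2m]))

# Debugging dashboard: one instance at a time, via the hypothetical $instance variable
sum(rate(http_requests_total{job="payment-service", instance=~"$instance"}[2m]))
```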
Alerting and Dashboard Alignment
Your alerts and your dashboard should reference the same expressions. If your alert fires on error_rate > 0.05 and your dashboard shows a different calculation, you will spend the first five minutes of every incident confirming that the alert is "real."
Use Prometheus recording rules to pre-compute expensive aggregations and share them between alert rules and Grafana dashboard panels:
```
# Prometheus recording rule
groups:
  - name: payment_service
    rules:
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-service"}[5m]))
```

```
# Dashboard panel uses the same recording rule
job:http_error_rate:ratio5m{job="payment-service"}
```

Now alerts and dashboards show the same number. When the alert fires and you open the dashboard, the panel is already red. No interpretation required.
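The alerting side can then be written against the same series. A sketch of such a rule; the alert name, 5% threshold, and five-minute hold are illustrative, not prescriptive:

```
# Prometheus alerting rule built on the recording rule above
groups:
  - name: payment_service_alerts
    rules:
      - alert: PaymentServiceHighErrorRate
        expr: job:http_error_rate:ratio5m{job="payment-service"} > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "payment-service error ratio above 5% for 5 minutes"
```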
Key Takeaways
- The four-panel rule — Rate, Error Rate, Latency, Saturation — gives every dashboard a consistent, incident-ready entry point.
- Express errors as a ratio of total requests, never raw counts; raw counts misfire during traffic ramps and dips.
- Panel order matters: traffic first (to detect routing failures), errors second, latency third, saturation last (predictive).
- Every panel needs a threshold drawn on it; without a reference line, numbers in isolation are meaningless during an incident.
- Use $__rate_interval uniformly across all panels in a dashboard to prevent timing artifacts from different scrape windows.
- Align alert expressions with dashboard recording rules so the dashboard is already red when the alert fires.