Designing Dashboards Engineers Actually Look At
Most engineering dashboards are built for the demo, not for the incident. They have forty panels, three rows of sparklines in colors that mean nothing, and a title row that says "Service Health" without telling you what health means. When something breaks at 2 AM, engineers ignore them and go straight to logs. That is a design failure, not a tool failure.
The dashboard that gets used during an incident answers one question per panel and tells you what to do next. This post covers how to design that dashboard — the four-panel rule, signal selection, panel layout, and the anti-patterns that make dashboards useless.
The Four-Panel Rule
A service dashboard should lead with exactly four panels. Not four rows. Four panels, large, at the top of the page. If you cannot fit the most important signals for a service into four panels, you do not understand what matters most about that service.
The four panels map directly to the RED method for request-based services:
| Panel | Metric | Why |
|---|---|---|
| 1 | Request Rate | Are people using the service? Is volume anomalous? |
| 2 | Error Rate (%) | Is the service returning failures? |
| 3 | Latency (p50 / p95 / p99) | Are successful requests fast enough? |
| 4 | Saturation | Is a resource approaching its limit? |
Saturation is the most frequently omitted panel. Engineers know to watch error rate and latency; they forget that a connection pool at 95% capacity will cause latency to spike within minutes even if error rates are still low.
Selecting Top-of-Funnel Signals
The hardest part of dashboard design is signal selection. Prometheus gives you hundreds of metrics per service. Most of them are diagnostic — useful when you already know the problem. The top-of-funnel panels should show symptoms, not causes.
A symptom is observable by the user. A cause explains why.
- Symptom: http_request_duration_seconds{quantile="0.99"} > 2
- Cause: go_gc_pause_seconds_sum rising
The four-panel dashboard shows symptoms. Everything below that fold is causal — it exists to help you answer "why" once the top panels have told you "what."
Grafana PromQL for the four panels:
```
# Panel 1: Request Rate (per-second, grouped by status class)
sum(rate(http_requests_total{job="payment-service"}[2m])) by (status_class)

# Panel 2: Error Rate (ratio, not raw count)
sum(rate(http_requests_total{job="payment-service", status=~"5.."}[2m]))
  /
sum(rate(http_requests_total{job="payment-service"}[2m]))

# Panel 3: Latency percentiles
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="payment-service"}[5m])) by (le))
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="payment-service"}[5m])) by (le))
# Panel 4: Connection pool saturation (sum over connection states; scalar() assumes a single Postgres target)
sum(pg_stat_activity_count{datname="payments"}) / scalar(pg_settings_max_connections) * 100
```

One mistake engineers make: they use raw counts instead of ratios for error panels. Raw counts make spikes look worse than they are during traffic ramps. Always express errors as a ratio of total requests.
Traffic, Errors, Latency, Saturation — In That Order
The order of the four panels is not arbitrary. It reflects how you read an incident from the outside in.
Traffic first. If request rate dropped to zero, you have a routing or deployment problem — not a service problem. The error rate panel will show 0%, which looks fine. Looking at traffic first prevents this misdirection.
Errors second. Error rate tells you if the service is returning failures. Look at the shape: a step function suggests a deploy or a configuration push. A ramp suggests resource exhaustion or a downstream dependency degrading.
Latency third. Latency can climb before errors do. If p99 climbs from 200ms to 1.8s but errors are still low, you have a queue building up or a slow downstream. You have time, but not much.
Saturation last. Saturation is predictive. A healthy service with CPU at 85% and connection pool at 90% will fail — you just do not know when. The saturation panel catches failures before they happen.
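When you want that prediction on the panel itself, PromQL can project the trend forward. A minimal sketch using predict_linear over the same postgres_exporter metrics as Panel 4; the 30-minute lookback and 30-minute horizon are illustrative choices:

```
# Projected connection count 30 minutes from now, as a % of max_connections.
# The subquery sums connections across states; scalar() assumes one Postgres target.
predict_linear(sum(pg_stat_activity_count{datname="payments"})[30m:1m], 1800)
  / scalar(pg_settings_max_connections) * 100
```

If that projection crosses 100 while current utilization still looks tolerable, you get to page before the failure instead of after it.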
Panel Layout and Visual Grammar
Good Grafana dashboards use a consistent visual grammar so engineers parse them without thinking.
Colors:
- Green = good / below threshold
- Yellow = warning / approaching threshold
- Red = bad / above threshold
Do not use these colors for anything else. Do not use blue for errors or orange for "informational." Every color deviation burns cognitive budget during an incident.
Thresholds in every panel:
Every panel should have at least one threshold line drawn. The threshold is your SLO budget or your known failure point. Without it, a number in isolation is meaningless.
```
// Grafana panel threshold configuration
{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      { "color": "green", "value": null },
      { "color": "yellow", "value": 0.01 },
      { "color": "red", "value": 0.05 }
    ]
  }
}
```

Time ranges:
Set your default dashboard time range to the last 3 hours, not 24 hours. Incidents happen in minutes, not days. Twenty-four-hour views flatten the signal. On a 24-hour graph, a 10-minute spike looks like a thin sliver. On a 3-hour graph, it tells a story.
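In the dashboard JSON model this is a small top-level setting. A sketch; the 30-second refresh interval is an illustrative choice, not a requirement:

```
{
  "time": { "from": "now-3h", "to": "now" },
  "refresh": "30s"
}
```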
Legends:
Every legend should show the current value, not just the label. In Grafana, set Legend: Table and display Last and Max. When you arrive at the dashboard mid-incident, the first thing you want is the current number, not a colored line you have to trace to its current position.
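For Grafana time series panels, that maps to the panel's legend options. A sketch of the relevant fragment; the field names reflect recent Grafana versions and may differ in older ones:

```
"options": {
  "legend": {
    "displayMode": "table",
    "placement": "bottom",
    "calcs": ["lastNotNull", "max"]
  }
}
```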
Upstream Dependency Row
Below the four top panels, add one row per upstream dependency. Each dependency gets two panels: latency and error rate. Nothing else.
```
# Upstream dependency error rate from the client's perspective
sum(rate(http_client_requests_total{job="payment-service", upstream="stripe", status=~"5.."}[2m]))
  /
sum(rate(http_client_requests_total{job="payment-service", upstream="stripe"}[2m]))

# Upstream dependency p99 latency (from client instrumentation)
histogram_quantile(0.99,
  sum(rate(http_client_request_duration_seconds_bucket{job="payment-service", upstream="stripe"}[5m])) by (le)
)
```

The discipline here is measuring from the client's perspective, not the upstream service's own metrics. The upstream may report a 200ms average; you may be experiencing 1.2s because of retry storms or connection overhead on your side.
Anti-Patterns That Kill Dashboards
1. The "just in case" panel. You add a panel because someone asked "can we add X to the dashboard?" The answer is always "yes, if it replaces something less useful." Panels never get removed; they accumulate. After six months, you have a wall of sparklines and no one looks at any of them.
2. Non-percentage error panels. payments_failed: 142 tells you nothing without context. payments_failed / payments_total: 4.2% tells you whether to wake someone up.
3. The single-host panel. Panels that show metrics for a single pod or instance, not the fleet aggregate. During incidents you care about the service's behavior, not one instance. Instance-level panels belong on a separate "debugging" dashboard (see the sketch after this list).
4. No time correlation. Panels with different scrape intervals or different rate() windows on the same dashboard will show spikes at different times for the same event. Pin all panels in a dashboard to the same $__rate_interval Grafana variable.
```
# Use $__rate_interval instead of hardcoded 2m or 5m
rate(http_requests_total{job="payment-service"}[$__rate_interval])
```

5. Dashboard-as-documentation. Dashboards are not the place for architecture diagrams, runbook links, or "this metric means X" text panels. Put that in a runbook. Dashboards are for live signal, not static text.
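To make anti-pattern 3 concrete, the service dashboard aggregates over the fleet while the debugging dashboard scopes to a single instance; $instance below is a hypothetical Grafana template variable, not something defined elsewhere in this post:

```
# Service dashboard: fleet aggregate across every pod
sum(rate(http_requests_total{job="payment-service"}[2m]))

# Debugging dashboard: one instance at a time, via the hypothetical $instance variable
sum(rate(http_requests_total{job="payment-service", instance=~"$instance"}[2m]))
```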
Alerting and Dashboard Alignment
Your alerts and your dashboard should reference the same expressions. If your alert fires on error_rate > 0.05 and your dashboard shows a different calculation, you will spend the first five minutes of every incident confirming that the alert is "real."
Use Prometheus recording rules to pre-compute expensive aggregations and share them between alert rules and Grafana dashboard panels:
```
# Prometheus recording rule
groups:
  - name: payment_service
    rules:
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-service"}[5m]))
```

```
# Dashboard panel uses the same recording rule
job:http_error_rate:ratio5m{job="payment-service"}
```

Now alerts and dashboards show the same number. When the alert fires and you open the dashboard, the panel is already red. No interpretation required.
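The alerting side can then be written against the same series. A sketch of such a rule; the alert name, 5% threshold, and five-minute hold are illustrative, not prescriptive:

```
# Prometheus alerting rule built on the recording rule above
groups:
  - name: payment_service_alerts
    rules:
      - alert: PaymentServiceHighErrorRate
        expr: job:http_error_rate:ratio5m{job="payment-service"} > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "payment-service error ratio above 5% for 5 minutes"
```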
Key Takeaways
- The four-panel rule — Rate, Error Rate, Latency, Saturation — gives every dashboard a consistent, incident-ready entry point.
- Express errors as a ratio of total requests, never raw counts; raw counts misfire during traffic ramps and dips.
- Panel order matters: traffic first (to detect routing failures), errors second, latency third, saturation last (predictive).
- Every panel needs a threshold drawn on it; without a reference line, numbers in isolation are meaningless during an incident.
- Use $__rate_interval uniformly across all panels in a dashboard to prevent timing artifacts from different scrape windows.
- Align alert expressions with dashboard recording rules so the dashboard is already red when the alert fires.