The Cost of an Outage, Broken Down
The Number That Gets Attention
Engineering teams talk about reliability in the language of nines: 99.9%, 99.95%, 99.99%. The difference between three nines and four nines sounds like a rounding error. In actual revenue and customer trust terms, it is not.
A service that runs at 99.9% availability can be down for up to 8.7 hours per year. At 99.99%, that budget shrinks to 52 minutes. The question that rarely gets asked explicitly is: what does each of those lost minutes cost?
Not in terms of SLA credits. In terms of revenue not processed, customers who churned because of the experience, engineers pulled off roadmap work, and the slower erosion of trust that doesn't show up in any incident ticket but does show up in net retention six months later.
This post is about building that model. Not to make engineers feel bad about incidents, but because a concrete cost model is what gets reliability investment approved. Saying "we had four incidents last quarter" is a status update. Saying "those four incidents cost an estimated $340,000 in direct revenue impact and consumed 620 engineering hours" is a business case.
The Three Cost Buckets
Outage costs fall into three buckets that rarely get totaled together, which is why the real number is almost always larger than leadership estimates.
Hard costs are directly measurable: revenue not processed during the outage window, SLA credits issued to enterprise customers, refunds issued to affected customers, and cloud compute costs that often spike during recovery (auto-scaling during incident response, rollback and redeploy cycles).
Soft costs are real but lagging: customer churn attributable to the incident, increased support volume in the weeks following, sales cycles delayed because a prospect witnessed the outage or found it in status page history, and brand damage that only surfaces in surveys and NPS scores.
Engineering costs are hidden in plain sight: hours spent in incident response, hours spent in postmortem, hours spent on remediation work that displaced roadmap features, and the cognitive tax of being on high alert after an incident that leaves the team slower for days afterward.
Calculating Hard Costs
The hard cost calculation starts with a revenue-per-minute baseline. For most product companies, this is straightforward: take monthly recurring revenue, divide by the minutes in a month, and you have the revenue-at-risk per minute of downtime.
```
Monthly Revenue:       $2,000,000
Minutes in a month:    43,800
Revenue per minute:    $45.66

2-hour outage (120 minutes):
Gross Revenue at Risk: 120 × $45.66 = $5,479
```

This is the ceiling, not the actual impact: not all revenue was going to transact in that specific 120-minute window, and some users will retry after the outage resolves. A practical multiplier for a consumer SaaS product is 40-60% of the gross figure. For a B2B platform where transactions happen during business hours and don't retry, it can be 80-100%.
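In code, the whole adjustment collapses to one multiplier. A minimal sketch, where `impact_factor` is the assumption you calibrate to your own transaction pattern:

```python
def revenue_at_risk(monthly_revenue, outage_minutes, impact_factor):
    """Estimate net revenue lost to an outage.

    impact_factor: fraction of gross revenue-at-risk actually lost
    (roughly 0.4-0.6 for consumer SaaS where users retry,
    0.8-1.0 for business-hours B2B transactions that don't).
    """
    revenue_per_minute = monthly_revenue / 43_800  # minutes in a month
    gross = revenue_per_minute * outage_minutes
    return gross * impact_factor

# The worked example above, at a 50% impact factor
print(f"${revenue_at_risk(2_000_000, 120, 0.5):,.0f}")  # ~$2,740
```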
SLA credits are more precise. If you have enterprise contracts with uptime commitments, you know exactly what you owe for each minute of downtime below the SLA threshold:
```python
def calculate_sla_credits(contracts, downtime_minutes, sla_target_percent):
    """
    Calculate SLA credits owed across enterprise contracts.

    contracts: list of dicts with 'arr' and 'credit_table' keys
    credit_table: dict mapping (min_downtime, max_downtime) in minutes
                  to the credit as a percent of the monthly fee
    """
    annual_minutes = 525_960
    allowed_downtime = annual_minutes * (1 - sla_target_percent / 100)
    # Simplified: assume this incident alone triggers the credit table.
    # Real implementations track YTD downtime per contract against
    # allowed_downtime before issuing credits.
    total_credits = 0
    for contract in contracts:
        # Credits are quoted as a percent of the monthly fee (ARR / 12)
        monthly_fee = contract['arr'] / 12
        for (min_down, max_down), credit_pct in contract['credit_table'].items():
            if min_down <= downtime_minutes < max_down:
                total_credits += monthly_fee * (credit_pct / 100)
                break
    return total_credits

# Example
contracts = [
    {
        'arr': 240_000,  # $240k ARR customer
        'credit_table': {
            (0, 60): 0,               # < 1 hour: no credit
            (60, 240): 10,            # 1-4 hours: 10% of monthly fee
            (240, float('inf')): 25,  # > 4 hours: 25%
        }
    },
    {
        'arr': 480_000,
        'credit_table': {
            (0, 30): 0,
            (30, 120): 10,
            (120, float('inf')): 25,
        }
    },
]

credits = calculate_sla_credits(contracts, downtime_minutes=95, sla_target_percent=99.9)
print(f"SLA credits owed: ${credits:,.0f}")  # $6,000 for these two contracts
```

Add the cloud cost spike. During a major incident, auto-scaling fires, engineers run repeated deploys, and observability systems log at elevated rates. In a 2-hour incident for a mid-size service, it's common to see 3-5x normal cloud spend for those 2 hours, which might be $200-$2,000 depending on your baseline.
Calculating Engineering Costs
Engineering time during incidents is consistently underestimated because the accounting only captures the incident itself. The full accounting:
```
Incident Response:
- IC (incident commander) × 3 hours:              3h
- On-call engineer × 3 hours:                     3h
- Secondary engineers pulled in × 2 people × 2h:  4h
- Comms/support liaison × 2 hours:                2h
Response subtotal: 12 engineer-hours

Postmortem:
- Postmortem preparation (primary engineer):      2h
- Postmortem meeting × 8 people × 1h:             8h
- Postmortem writeup and review:                  3h
Postmortem subtotal: 13 engineer-hours

Remediation (varies widely):
- Investigation and root cause confirmation:      4h
- Fix implementation:                             8h
- Testing and validation:                         4h
- Deployment and monitoring:                      2h
Remediation subtotal: 18 engineer-hours

Indirect:
- Interrupted focus for adjacent engineers:       4h estimated
- On-call stress / next-day productivity loss:    8h estimated
Indirect subtotal: 12 engineer-hours

Total: 55 engineer-hours
At $150/hour fully-loaded cost: $8,250
```

55 engineer-hours for a 2-hour incident is not unusual for a P1 at a company with 20+ engineers. The number that surprises most leadership teams is the remediation and postmortem cost; it's often larger than the incident response itself.
The "indirect" row is the hardest to measure but real. On-call engineers sleep poorly the night after a P1. Engineers who were pulled into the incident take time to get back into flow state on their feature work. These costs are diffuse and don't show up in any system.
The Soft Cost Model
Soft costs are the most important and the most ignored, because they are lagging indicators and attribution is hard.
Customer churn. Not every customer who experiences an outage churns. But some do. The question is what percentage, and over what time horizon. The most useful data point is the churn rate in the 90 days following a major incident versus a baseline 90-day period. If that's not available, a conservative estimate: customers who experienced the outage carry a 0.5-2% increase in churn probability, with the high end applying when this was a repeat incident.
```
Affected customers:                     5,000
Incremental churn probability:          1%
Expected additional churned customers:  50
Average Revenue Per Customer:           $1,200/year
Churn impact (annualized):              50 × $1,200 = $60,000
```

Support volume. Incidents generate support tickets that continue arriving for 2-3 weeks. Each ticket costs engineering and support time. If your average support ticket costs $30 to resolve (including engineering escalation time), and you receive 200 incremental tickets from an incident, that's $6,000.
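A minimal sketch that turns those two line items into one function; the inputs are the assumptions you have to defend, the arithmetic is not:

```python
def soft_cost_estimate(affected_customers, churn_uplift, arpc_annual,
                       incremental_tickets, cost_per_ticket):
    """Churn plus support-volume cost attributable to one incident."""
    churn_cost = affected_customers * churn_uplift * arpc_annual
    support_cost = incremental_tickets * cost_per_ticket
    return churn_cost + support_cost

# The worked numbers above: 5,000 affected, 1% churn uplift,
# $1,200 ARPC, 200 extra tickets at $30 each
print(f"${soft_cost_estimate(5_000, 0.01, 1_200, 200, 30):,.0f}")  # $66,000
```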
Sales impact. This one is uncomfortable but real. Enterprise prospects do due diligence. They check your status page history. If a prospect is evaluating you and sees two P1 incidents in the last 90 days, some percentage of those deals slow or die. This is nearly impossible to quantify precisely, but a rough model: track deals that had a status page review as a noted activity, and look at close rates for deals where the review occurred near an incident.
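One way to operationalize that rough model, assuming your CRM can export deals with a hypothetical `saw_incident` flag (a status-page review logged within some window of a P1):

```python
def close_rate(deals):
    """Fraction of deals won. Each deal: {'won': bool, 'saw_incident': bool}."""
    return sum(d['won'] for d in deals) / len(deals) if deals else 0.0

def sales_impact(deals, avg_deal_value):
    """Rough revenue impact: the close-rate gap between deals whose
    status-page review landed near an incident and those where it didn't."""
    near = [d for d in deals if d['saw_incident']]
    clean = [d for d in deals if not d['saw_incident']]
    gap = close_rate(clean) - close_rate(near)
    return gap * len(near) * avg_deal_value

# Toy export: two clean wins, one near-incident loss, one near-incident win
deals = [
    {'won': True,  'saw_incident': False},
    {'won': True,  'saw_incident': False},
    {'won': False, 'saw_incident': True},
    {'won': True,  'saw_incident': True},
]
print(f"${sales_impact(deals, avg_deal_value=100_000):,.0f}")  # $100,000
```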
Building the Spreadsheet
The format that resonates with leadership is a single-page summary that shows cost per incident, trended over time, alongside the cost of the reliability investment being proposed.
Chart it as bars for incident cost per quarter with a line for reliability investment. The story the chart tells: Q3 2025 was when the team invested $85K in reliability tooling and remediation. Q4 and Q1 costs dropped dramatically. The investment paid for itself in the first quarter after it was made.
That is the chart that gets budget approved for the next reliability sprint.
The row-level detail that supports it:
| Quarter | Incidents | Avg Duration | Hard Cost | Eng Cost | Soft Cost (Est) | Total |
|---|---|---|---|---|---|---|
| Q1 2025 | 3 | 1.2h | $42,000 | $38,000 | $100,000 | $180,000 |
| Q2 2025 | 5 | 2.8h | $78,000 | $62,000 | $180,000 | $320,000 |
| Q3 2025 | 4 | 1.9h | $71,000 | $49,000 | $170,000 | $290,000 |
| Q4 2025 | 1 | 0.8h | $18,000 | $17,000 | $60,000 | $95,000 |
| Q1 2026 | 1 | 0.3h | $8,000 | $12,000 | $20,000 | $40,000 |
Framing for Executives
The engineering frame for reliability is MTTR (mean time to recovery) and error budget. These are correct and useful metrics internally. They are not the frame that drives investment decisions at the VP and C-suite level.
The executive frame is: what did each incident cost, what is the trend, and what does it cost to change that trend? Frame the conversation around three numbers:
Cost of last year's incidents. Total all incidents, apply the model above, produce a dollar figure. This number will be larger than anyone expects. That's the point.
Cost of the proposed reliability investment. Engineering time for chaos engineering, observability tooling, runbook development, on-call improvements. This should be significantly smaller than the incident cost.
Expected reduction in incident frequency and severity. Be conservative. If the data shows that teams with mature reliability practices have 60-70% fewer P1 incidents, claim 40% in your proposal. Undersell and overdeliver.
The conversation becomes: "Last year's incidents cost us an estimated $925,000 in direct and indirect costs. We're proposing a $120,000 investment in reliability tooling and process that we expect to reduce incident costs by 40-50% over the next 12 months. At 40% reduction, that's a $370,000 return on a $120,000 investment."
That is a conversation that engineering leaders can have with CFOs. The MTTR chart is not.
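For completeness, the arithmetic behind that pitch, spelled out so the figures can be swapped for your own:

```python
annual_incident_cost = 925_000   # last year's total, from the model above
investment = 120_000             # proposed reliability spend
expected_reduction = 0.40        # deliberately conservative claim

savings = annual_incident_cost * expected_reduction  # $370,000
net_return = (savings - investment) / investment     # ~2.1x net
print(f"${savings:,.0f} expected savings on ${investment:,.0f} ({net_return:.1f}x net)")
```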
Key Takeaways
- Outage costs have three buckets — hard (revenue, credits, refunds), engineering (response, postmortem, remediation), and soft (churn, support, sales impact) — and the real total is almost always larger than leadership assumes.
- Revenue-per-minute calculations are straightforward: monthly revenue divided by 43,800 gives the ceiling; apply a 40-100% realization factor based on whether your transaction pattern retries or is time-sensitive.
- Engineering time is chronically underestimated because the accounting skips postmortem and remediation work, which often exceeds incident response time.
- Soft costs are lagging and hard to attribute, but churn probability modeling and incremental support volume give defensible estimates that complete the picture.
- The chart that unlocks reliability investment is incident cost versus reliability investment over time — it makes the ROI visible in a language finance understands.
- MTTR and uptime nines are internal engineering metrics; the executive conversation needs dollar figures, trends, and a credible reduction estimate attached to a specific investment proposal.