
The Pre-Mortem: Catching Failure Before It Ships

Ravinder · 11 min read
Engineering · Pre-Mortem · Risk Management · Engineering Management

Post-mortems are popular because they feel productive. The incident happened, you gather the team, you write up what went wrong, you assign action items. The problem is that everything you learn in a post-mortem could have been surfaced before the incident — if you had asked the right question at the right time.

The pre-mortem is that question. The practice is straightforward: before you ship, you ask "Imagine it's three months from now and this feature failed catastrophically. What happened?" The challenge is running it in a way that produces honest answers, not theater. Most pre-mortems fail because they surface only the risks people are comfortable sharing publicly, then generate action items that never get worked on.

This post covers how to run a pre-mortem that actually works — the facilitation structure, the prompts that unlock uncomfortable truths, the risk capture process, and the follow-through that separates the exercise from performance.

Why Pre-Mortems Produce Different Results Than Risk Reviews

A standard risk review asks: "What could go wrong?" A pre-mortem asks: "It went wrong. What happened?"

That shift from hypothetical to retrospective is not semantic. Psychologist Gary Klein's research shows that people identify 30% more causes of failure using prospective hindsight — imagining the failure has already occurred — compared to standard probabilistic risk assessment. The mechanism is that retrospective framing allows people to construct a causal narrative rather than estimate probabilities, and narrative construction is something human brains do better.

The practical difference: in a risk review, engineers mention the risks they consider professionally acceptable to raise. In a pre-mortem, the frame is "this already failed" — which removes the social cost of being "too negative" and makes it easier to voice the concerns everyone already privately holds.

```mermaid
graph TD
    A[Standard Risk Review] -->|Asks: What might fail?| B[Public, politically safe risks]
    B --> C[Mitigation plans for comfortable concerns]
    D[Pre-Mortem] -->|Tells: It failed. Why?| E[Causal narratives, including uncomfortable ones]
    E --> F[Broader risk surface area captured]
    F --> G[Ranked by likelihood and impact]
    G --> H[Mitigations with owners and deadlines]
```

Who to Invite and When to Run It

Who: Everyone who touches the system being launched. That means engineers (backend, frontend, infra), the PM, the QA engineer, the on-call rotation lead, and — if it is a customer-facing feature — a representative from customer support. Customer support sees the failure modes that engineers do not anticipate.

Who not to invite: Senior leadership above the team. Their presence changes the social dynamics. Engineers censor themselves around executives. If leadership wants input, debrief them afterward.

When: One to two weeks before launch, not the day before. You need time to do something about what you find. A pre-mortem the day before launch produces stress, not safety.

Duration: 90 minutes maximum. Longer sessions produce diminishing returns and exhausted participants who stop contributing honestly.

The Facilitation Structure

A pre-mortem has four phases. The facilitator's job is to keep the group from collapsing into solution mode before the failure space is fully mapped.

Phase 1: Set the scene (10 minutes)

The facilitator reads the launch brief. What is shipping? What is the intended user experience? What does success look like? Do not assume everyone has read the PRD — five minutes of shared context prevents 20 minutes of clarifying questions during the session.

Then state the pre-mortem frame explicitly:

"Imagine it is three months after launch. The feature has failed — and failed badly. Users are unhappy, the team is scrambling, leadership is asking what happened. We are not here to prevent this; we are here to explain it. What went wrong?"

Phase 2: Individual silent writing (15 minutes)

Every participant writes their failure scenarios independently, in silence, before any group discussion. This is the most important discipline in the exercise. Group discussion before individual writing produces anchoring — the first person to speak shapes what everyone else says. Silent writing preserves independent thought.

Each participant writes on sticky notes or in a shared doc, following the prompt:

"The feature failed. Describe three to five specific failure scenarios. For each, write what failed, how it became visible to users, and what the first sign of trouble was that we missed."

Phase 3: Round-robin sharing and clustering (45 minutes)

Each participant shares one failure scenario at a time, in rotation, until all scenarios are on the table. The facilitator's rules:

  • No debate during sharing. The only permitted response is a clarifying question.
  • Similar scenarios get clustered, not merged. "Database overload" and "cache miss cascade" are related but distinct failure modes.
  • Every scenario gets a sticky note on the board, even if it sounds unlikely. Unlikely failure modes are often the ones that actually happen.

After all scenarios are shared, the group votes on likelihood and impact using dot voting (two votes per person: one for most likely, one for highest impact). The clustering and voting take about 20 minutes.
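If the votes land in a shared doc, the tally is easy to script. Here is a minimal Python sketch of ranking scenarios by dot votes; the scenario names and the simple vote-sum scoring are illustrative assumptions, not part of the method:

```python
# Minimal sketch: rank pre-mortem scenarios by dot votes.
# Scenario names and the vote-sum scoring are illustrative assumptions.
from collections import Counter

likelihood_votes = Counter({
    "cache miss cascade": 4,
    "NaN model scores": 2,
    "flag misconfiguration": 1,
})
impact_votes = Counter({
    "cache miss cascade": 3,
    "NaN model scores": 3,
    "flag misconfiguration": 1,
})

scenarios = set(likelihood_votes) | set(impact_votes)
# Sum both vote types; the top 4-5 scenarios move on to mitigation planning.
ranked = sorted(
    scenarios,
    key=lambda s: likelihood_votes[s] + impact_votes[s],
    reverse=True,
)
for scenario in ranked:
    print(scenario, likelihood_votes[scenario] + impact_votes[scenario])
```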

Phase 4: Mitigation planning (20 minutes)

The four or five highest-scored scenarios get mitigations. Each mitigation needs:

  1. A specific action (not "improve monitoring" — "add an alert on cache hit rate < 80% for the new recommendation endpoint").
  2. A named owner.
  3. A deadline relative to launch.

No mitigation gets added to a sprint without an owner and a deadline. Ownerless mitigations are aspirations.
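If you generate tickets from a script, the owner-and-deadline rule can be enforced mechanically rather than by discipline. A minimal Python sketch, with illustrative field names:

```python
# Minimal sketch: a mitigation record that cannot be created without
# an owner and a deadline. Field names are illustrative assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Mitigation:
    action: str  # specific and verifiable, not "improve monitoring"
    owner: str   # a named person, not a team
    due: date    # deadline relative to launch

    def __post_init__(self):
        if not self.action.strip():
            raise ValueError("mitigation needs a specific action")
        if not self.owner.strip():
            raise ValueError("ownerless mitigations are aspirations")

# Fails fast instead of producing an unowned action item:
m = Mitigation(
    action="Alert on cache hit rate < 80% for the recommendation endpoint",
    owner="Priya",
    due=date(2025, 12, 22),
)
```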

Prompts That Unlock Uncomfortable Truths

Standard "what could go wrong" prompts produce standard answers. These prompts are designed to break through the polite surface:

The "known unknown" prompt:

"What do we know we do not understand well enough to be confident about? Name the one thing you would want two more weeks to test."

This surfaces the assumptions engineers are least comfortable with — not because they do not know the risk, but because they do not want to slow the launch.

The "first responder" prompt:

"Imagine you are the on-call engineer at 2 AM, six weeks after launch. What does the alert say? What do you do first?"

This makes the failure mode operational and concrete. Engineers who have been on-call know what it feels like. This prompt converts abstract risks into lived scenarios.

The "customer support" prompt (use this with your CS rep):

"What is the most likely reason a customer calls support in the first month after launch? Write three support ticket titles."

Customer support representatives understand real user behavior in ways engineers consistently underestimate. Their answers are often the most grounded in the session.

The "dependency chain" prompt:

"Name every external system this feature depends on. For each, describe what happens to the user experience if that system is degraded by 50%."

This one surfaces dependency assumptions. Engineers often model dependencies as binary (up or down) rather than degraded. A 50% degradation scenario is more realistic and more dangerous than a full outage.
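One way to make the 50% scenario concrete before launch is to inject degradation in a test environment and watch what the user experience does. A minimal Python sketch of that idea; the wrapped function, failure rate, and latency figure are all hypothetical:

```python
# Minimal sketch: wrap a dependency call so a test environment can
# degrade it by a configurable fraction. The wrapped function and the
# latency figure are hypothetical.
import random
import time

def degraded(call, failure_rate=0.5, added_latency_s=2.0):
    """Return a version of `call` degraded by `failure_rate`."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            # Degradation, not outage: slow responses and timeouts are
            # what a 50%-degraded dependency actually looks like.
            time.sleep(added_latency_s)
            raise TimeoutError("dependency degraded")
        return call(*args, **kwargs)
    return wrapper

def fetch_recommendations(user_id):
    # Stand-in for the real third-party call.
    return ["item-1", "item-2"]

# In a load test: what does the user see when half the calls are slow or fail?
fetch_recommendations = degraded(fetch_recommendations, failure_rate=0.5)
```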

The "six months from now" prompt:

"Six months after launch, the team is different — two engineers have moved to other teams, one is on parental leave. A bug is filed against this feature. Who understands the system well enough to fix it?"

This surfaces knowledge concentration risk and documentation debt before it becomes an incident.

```mermaid
sequenceDiagram
    participant F as Facilitator
    participant Eng as Engineers
    participant PM as Product Manager
    participant CS as Customer Support
    F->>Eng: Silent writing: 3-5 failure scenarios each
    F->>PM: Silent writing: 3-5 failure scenarios each
    F->>CS: Silent writing: 3-5 failure scenarios each
    F->>Eng: Share one scenario, no debate
    F->>PM: Share one scenario, no debate
    F->>CS: Share one scenario, no debate
    Note over F: Rotate until all scenarios surfaced
    F->>F: Cluster by failure domain
    F->>Eng: Dot vote: likelihood and impact
    F->>PM: Dot vote: likelihood and impact
    F->>CS: Dot vote: likelihood and impact
    F->>F: Top 4-5 scenarios get mitigations
    F->>Eng: Assign owner + deadline per mitigation
```

Capturing Risks in a Format That Survives

The output of the pre-mortem needs to survive the session. A doc that lives in the meeting notes is a doc no one reads after the launch.

Use a structured risk register. It does not need to be sophisticated:

```markdown
## Pre-Mortem Risk Register: Recommendation Engine v2
**Session date:** 2025-12-15
**Launch date:** 2025-12-29
**Facilitator:** Ravinder

| ID | Failure Scenario | Likelihood | Impact | Mitigation | Owner | Due |
|----|-----------------|------------|--------|------------|-------|-----|
| R1 | Cache miss cascade on cold start floods DB with N+1 queries | High | Critical | Add cache warm-up job pre-launch; alert on query rate > 5k/min | Priya | Dec 22 |
| R2 | ML model returns NaN scores for new user cohort, UI shows blank recommendations | Medium | High | Add model output validation; fallback to rule-based recommendations if NaN rate > 1% | Alex | Dec 24 |
| R3 | Feature flag rollout misconfigured — 100% exposure on day 1 | Medium | High | Add explicit config review step in launch checklist; second pair of eyes required | PM | Dec 28 |
| R4 | On-call engineer unfamiliar with model inference pipeline | High | Medium | Add runbook for common inference failures; do tabletop walkthrough before launch | Ravinder | Dec 26 |
| R5 | Third-party recommendation API rate limit hit during holiday traffic | Low | High | Add circuit breaker with fallback; load test at 2x expected holiday volume | Alex | Dec 25 |
```

The risk register goes into the launch readiness checklist. Each row's mitigation is a checkbox. The launch does not proceed until the checkboxes in the "Critical" and "High" impact rows are completed.
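R2 in the table is a typical pre-mortem mitigation shape: a validation check plus a fallback. A minimal Python sketch of that shape, using the 1% NaN threshold from the table; the function names and model interface are illustrative assumptions:

```python
# Minimal sketch of R2's mitigation: validate model output and fall
# back to rule-based recommendations when NaN scores appear.
# Function names and the model interface are illustrative.
import math

def rule_based_recommendations(user_id):
    # Deliberately boring fallback: popular items, no personalization.
    return ["top-seller-1", "top-seller-2", "top-seller-3"]

def recommend(user_id, model_scores):
    nan_count = sum(1 for s in model_scores.values() if math.isnan(s))
    if model_scores and nan_count / len(model_scores) > 0.01:
        # R2: NaN rate > 1% means the model is misbehaving for this
        # cohort. Serve the fallback instead of blank recommendations.
        return rule_based_recommendations(user_id)
    valid = {item: s for item, s in model_scores.items() if not math.isnan(s)}
    return sorted(valid, key=valid.get, reverse=True)[:3]

# One NaN out of three scores is a 33% NaN rate, so the fallback fires:
print(recommend("u-42", {"a": 0.9, "b": float("nan"), "c": 0.4}))
```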

Follow-Through: Making Mitigations Real

This is where most pre-mortems fail. The session ends, the doc is written, the mitigations are assigned — and then no one follows up. A week later, launch day arrives and three of the five mitigations are incomplete.

The structural fix is to treat pre-mortem mitigations like launch blockers, not "nice to haves." Specifically:

  1. Put mitigations in the sprint immediately. Do not wait for the next planning session. Create the tickets in the session or within 24 hours.
  2. Add the risk register to the launch checklist. The launch is not approved until high and critical mitigations have a completion status.
  3. Check in on mitigations in standup. One sentence per mitigation until it is closed. This keeps them visible.
  4. Do a pre-mortem retrospective after launch. Two weeks after launch, look at what actually happened. Which risks materialized? Which mitigations worked? This feedback loop makes future pre-mortems better.
```mermaid
graph TD
    A[Pre-Mortem Session] --> B[Risk Register Created]
    B --> C[Tickets Created in Sprint Within 24h]
    C --> D[Launch Checklist Updated With Risk Status]
    D --> E{All Critical/High Mitigated?}
    E -->|No| F[Launch Blocked - Escalate]
    E -->|Yes| G[Launch Approved]
    G --> H[Post-Launch Retrospective at 2 weeks]
    H --> I[Update Pre-Mortem Template With Lessons]
    I --> J[Next Pre-Mortem Is More Accurate]
```
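The gate in the diagram is scriptable. Below is a minimal Python sketch that reads the risk register and fails CI while any Critical or High row lacks a completed mitigation. It assumes the table format shown earlier plus an optional trailing Status column; that column is an assumption on top of the register above, and a missing status counts as not done:

```python
# Minimal sketch: a CI gate that fails while any Critical/High risk in
# the register is unmitigated. Assumes the markdown table shown above,
# plus an optional trailing "Status" column that reads "done" when the
# mitigation is complete; a missing status counts as not done.
import sys

def open_blockers(register_path):
    blockers = []
    with open(register_path) as f:
        for line in f:
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            if len(cells) < 7 or cells[0] == "ID" or set(cells[0]) <= {"-"}:
                continue  # skip prose, the header row, and the separator row
            impact = cells[3]
            status = cells[7].lower() if len(cells) > 7 else ""
            if impact in ("Critical", "High") and status != "done":
                blockers.append(cells[0])
    return blockers

if __name__ == "__main__":
    blockers = open_blockers("risk_register.md")
    if blockers:
        print(f"Launch blocked - unmitigated Critical/High risks: {blockers}")
        sys.exit(1)
    print("All Critical/High mitigations closed - launch gate passed")
```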

When Pre-Mortems Do Not Work

Pre-mortems fail in predictable ways. Knowing the failure modes helps you avoid them:

The HIPPO effect. The highest-paid person in the room (HIPPO) speaks first, and everyone else adjusts their views to match. Solution: enforce the silent writing phase. No sharing until every participant has written.

The "we already thought of that" deflection. An engineer raises a risk and the lead says "yeah, we have a plan for that." This shuts down exploration. The facilitator's response: "Great — write the plan in the risk register now, so we can verify it." Known risks still need explicit mitigations documented.

The "too late to change anything" session. A pre-mortem run the day before launch is not a risk management exercise — it is a blame-distribution exercise. If you cannot act on what you find, do not run it. Run it when you still have two weeks.

The polite pre-mortem. If everyone's scenarios are anodyne (network issues, third-party API downtime), the session has failed. Use the "known unknown" prompt to break the politeness ceiling. If you get only surface-level risks in the silent writing phase, give the group five more minutes and explicitly ask for the risks that no one has voiced in public yet.

Key Takeaways

  • Pre-mortems surface 30% more failure causes than standard risk reviews because retrospective framing removes the social cost of "being too negative."
  • Enforce silent individual writing before any group sharing — anchoring from early speakers is the single biggest threat to an honest session.
  • Use structured prompts ("known unknowns," "2 AM on-call," "customer support ticket titles") to break through the polite surface of standard risk discussions.
  • Every mitigation requires a named owner and a deadline; ownerless mitigations are aspirations and will not ship before launch.
  • Treat pre-mortem mitigations as launch blockers for Critical and High risks — add them to the launch checklist and refuse to approve launch until they are closed.
  • Run a post-launch retrospective at two weeks to close the feedback loop: which risks materialized, which mitigations worked, and what the next pre-mortem template should include.