Engineering

Flaky Test Triage as a First-Class Workflow

Ravinder·December 5, 2025·8 min read

EngineeringTestingCI/CDReliability

Flaky Test Triage as a First-Class Workflow

There is a specific moment when an engineering team's relationship with CI breaks. It doesn't happen all at once. It happens incrementally: a test fails, someone reruns the pipeline, it passes, and nobody investigates. Then it happens again. Then it becomes common enough that "just retry" becomes standard advice. By the time the failure rate on main is 15%, nobody believes a red pipeline means anything. They retry, merge anyway, and hope.

I've seen this pattern in teams of every size. The flakiness is never the root cause — it's a symptom of treating test reliability as somebody else's problem. This post is about building a system that treats flaky tests as incidents: detected automatically, owned explicitly, quarantined until fixed or deleted.

Why Flaky Tests Compound

A single flaky test with a 5% failure rate seems harmless. Run it 20 times across 20 PRs in a day and it fires once — a minor annoyance. But at 50 engineers with 5 PRs each, that same test fires 12.5 times per day. Add 10 such tests and you have 125 spurious failures per day. Engineers learn that red means nothing. The psychological cost is worse than the wall-clock cost.

flowchart TD A[Flaky test introduced] --> B[Engineers retry CI] B --> C{Passes on retry?} C -->|Yes| D[Merged, ignored] C -->|No| E[Investigated rarely] D --> F[Retry culture normalizes] F --> G[More flakiness tolerated] G --> H[Engineers distrust CI entirely] H --> I[Real failures missed] I --> J[Production incidents] E --> K[Root cause fixed] K --> L[Test reliability restored]

The loop is self-reinforcing. The only exit is systematic intervention.

Step 1: Detection — You Can't Fix What You Don't Measure

Before quarantine, before ownership, you need reliable detection. "Someone noticed it failed twice" is not detection. Detection means running each test N times on a stable branch and recording the failure rate.

The simplest approach: a nightly job that runs your suite three times in parallel on main and reports any test that fails in at least one run but not all three.

# scripts/detect_flaky.py
import subprocess
import json
from collections import defaultdict
 
RUNS = 3
SUITE_CMD = ["pytest", "--json-report", "--json-report-file=/tmp/results_{run}.json", "-x"]
 
def run_suite(run_id: int) -> dict:
    cmd = [c.replace("{run}", str(run_id)) for c in SUITE_CMD]
    subprocess.run(cmd, capture_output=True)
    with open(f"/tmp/results_{run_id}.json") as f:
        return json.load(f)
 
results = defaultdict(list)
for i in range(RUNS):
    report = run_suite(i)
    for test in report["tests"]:
        results[test["nodeid"]].append(test["outcome"])
 
flaky = {
    test_id: outcomes
    for test_id, outcomes in results.items()
    if len(set(outcomes)) > 1  # mixed pass/fail across runs
}
 
print(f"Detected {len(flaky)} flaky tests")
for test_id, outcomes in flaky.items():
    print(f"  {test_id}: {outcomes}")

More sophisticated: use your CI platform's API to aggregate historical failure data. If a test has passed and failed in the last 50 runs on the same commit SHA, it's flaky. GitHub Actions doesn't expose this directly, but most enterprise CI platforms (BuildKite, CircleCI) do. Alternatively, push test results to a database yourself.

-- Schema for test result tracking
CREATE TABLE test_runs (
    id          BIGSERIAL PRIMARY KEY,
    test_id     TEXT NOT NULL,
    run_id      TEXT NOT NULL,
    commit_sha  TEXT NOT NULL,
    branch      TEXT NOT NULL,
    outcome     TEXT NOT NULL, -- 'passed', 'failed', 'skipped'
    duration_ms INT,
    recorded_at TIMESTAMPTZ DEFAULT NOW()
);
 
CREATE INDEX idx_test_runs_flakiness
    ON test_runs (test_id, commit_sha)
    WHERE branch = 'main';
 
-- Flakiness rate per test over last 30 days
SELECT
    test_id,
    COUNT(*) AS total_runs,
    SUM(CASE WHEN outcome = 'failed' THEN 1 ELSE 0 END) AS failures,
    ROUND(100.0 * SUM(CASE WHEN outcome = 'failed' THEN 1 ELSE 0 END) / COUNT(*), 1) AS failure_rate_pct
FROM test_runs
WHERE recorded_at > NOW() - INTERVAL '30 days'
  AND branch = 'main'
GROUP BY test_id
HAVING SUM(CASE WHEN outcome = 'failed' THEN 1 ELSE 0 END) > 0
ORDER BY failure_rate_pct DESC;

Run this query weekly and you have your flakiness leaderboard.

Step 2: Automatic Quarantine

Once you can detect flaky tests, automate the quarantine decision. The policy: any test with a failure rate above 3% on main over the last 14 days gets quarantined automatically. Quarantine means it runs in a separate non-blocking job — you can still see the results, but it doesn't gate merges.

In pytest, quarantine via a custom marker:

# conftest.py
import pytest
 
def pytest_collection_modifyitems(config, items):
    quarantined = load_quarantine_list()  # from your DB or a YAML file
    for item in items:
        if item.nodeid in quarantined:
            item.add_marker(pytest.mark.quarantine)
 
def pytest_runtest_makereport(item, call):
    if "quarantine" in item.keywords and call.when == "call":
        if call.excinfo:
            # Override failure to warning for quarantined tests
            call.excinfo = None

# .github/workflows/ci.yml
jobs:
  test:
    steps:
      - run: pytest -m "not quarantine" --ci
 
  quarantined-tests:
    steps:
      - run: pytest -m quarantine --ci || true  # non-blocking
      - name: Report quarantine results
        run: python scripts/report_quarantine.py

Store your quarantine list in a YAML file committed to the repo — this makes it visible, reviewable, and auditable:

# .quarantine.yml
quarantined:
  - id: "tests/integration/test_payment.py::test_stripe_webhook_retry"
    reason: "Race condition in webhook timing — tracked in JIRA-4821"
    owner: "payments-team"
    quarantined_at: "2025-11-14"
    deadline: "2025-12-14"
 
  - id: "tests/unit/test_cache.py::test_eviction_under_pressure"
    reason: "Depends on system clock — needs mock"
    owner: "platform-team"
    quarantined_at: "2025-11-20"
    deadline: "2025-12-20"

The deadline field is critical. Without it, quarantine becomes permanent.

Step 3: Ownership Tracking

A quarantined test with no owner is a quarantined test that never gets fixed. Ownership must be assigned at quarantine time — not as a suggestion, but as a blocking requirement for the quarantine PR to merge.

We enforce this with a pre-commit hook that validates .quarantine.yml:

#!/usr/bin/env python3
# .git/hooks/pre-commit (or pre-receive on the server side)
import yaml
import sys
from datetime import datetime, timedelta
 
with open(".quarantine.yml") as f:
    data = yaml.safe_load(f)
 
errors = []
for entry in data.get("quarantined", []):
    if not entry.get("owner"):
        errors.append(f"{entry['id']}: missing owner")
    if not entry.get("reason"):
        errors.append(f"{entry['id']}: missing reason")
    if not entry.get("deadline"):
        errors.append(f"{entry['id']}: missing deadline")
    else:
        deadline = datetime.fromisoformat(entry["deadline"]).date()
        max_deadline = datetime.now().date() + timedelta(days=30)
        if deadline > max_deadline:
            errors.append(f"{entry['id']}: deadline too far out (max 30 days)")
 
if errors:
    print("Quarantine validation failed:")
    for e in errors:
        print(f"  - {e}")
    sys.exit(1)

Ownership assignment also needs a notification mechanism. When a test is quarantined, the owning team's Slack channel gets a message with the test ID, the failure rate, and the deadline. We use a simple webhook:

import requests
 
def notify_owner(entry: dict, failure_rate: float):
    webhook_url = TEAM_WEBHOOKS[entry["owner"]]
    requests.post(webhook_url, json={
        "text": (
            f"Test quarantined: `{entry['id']}`\n"
            f"Failure rate: {failure_rate:.1f}% over 14 days\n"
            f"Reason: {entry['reason']}\n"
            f"Fix deadline: {entry['deadline']}"
        )
    })

Step 4: Deletion Criteria

Quarantine without deletion criteria produces a graveyard. The rule we settled on: a quarantined test that reaches its deadline without being fixed is automatically deleted via a daily job, with a PR opened for team review.

More importantly, some tests should be deleted rather than fixed:

The test covers a code path that no longer exists
The test covers a race condition that is architecturally impossible to test reliably
The test duplicates coverage from another test that is not flaky
The test's assertion is trivially tautological (it tests the mock, not the behavior)

We evaluate these at the quarantine review meeting (30 minutes, every two weeks). Any quarantined test older than 14 days gets discussed: fix it, extend deadline with justification, or delete it. No limbo.

# scripts/prune_quarantine.py — run in CI nightly
import yaml
from datetime import datetime, date
import subprocess
 
with open(".quarantine.yml") as f:
    data = yaml.safe_load(f)
 
today = date.today()
to_remove = []
for entry in data.get("quarantined", []):
    deadline = datetime.fromisoformat(entry["deadline"]).date()
    if deadline < today:
        to_remove.append(entry)
        print(f"Overdue: {entry['id']} (deadline {deadline}, owner {entry['owner']})")
 
if to_remove:
    # Open a PR with the overdue entries flagged
    subprocess.run(["python", "scripts/open_deadline_pr.py",
                    "--entries", ",".join(e["id"] for e in to_remove)])

Step 5: Root-Cause Patterns

After triaging dozens of flaky tests, the same causes appear repeatedly. Know them:

Timing dependencies: Tests that assert state after a fixed sleep() or that race against a real clock. Fix with explicit event waiting or mock the clock entirely.

# Bad
time.sleep(0.5)
assert queue.empty()
 
# Good
queue.join()  # blocks until all items processed
assert queue.empty()

Shared mutable state: Tests that pass in isolation but fail in suite because they share global state — database rows, Redis keys, module-level singletons. Fix with explicit setup/teardown and scoped fixtures.

Port conflicts: Integration tests that bind to fixed ports. Fix with port=0 (OS assigns a free port) and pass the port to the system under test.

Ordering dependencies: Test A creates data that Test B expects. Fix by making each test fully self-sufficient. Tools like pytest-randomly help surface these.

External service calls: Tests that hit real APIs. Fix with VCR cassettes or deterministic stubs. Never let non-hermetic calls into the test suite.

Key Takeaways

Treat flaky tests as incidents with detection, ownership, and resolution SLAs — not as background noise to retry through.
Instrument your CI to record per-test outcomes to a database; you cannot manage flakiness rates you are not measuring.
Automate quarantine at a defined threshold so flaky tests cannot gate merges, but keep results visible so they don't disappear.
Every quarantined test needs an owner and a hard deadline; without both, quarantine becomes a permanent graveyard.
Know the five root-cause patterns — timing, shared state, port conflicts, ordering, and external calls — and resolve the root cause rather than wrapping failures in retries.
Delete tests that cannot be made reliably hermetic; a deleted test causes zero false confidence.