
Trunk-Based Development Without Breaking Prod

Ravinder · 9 min read
Engineering · Trunk-Based Development · Feature Flags · CI/CD

The first time I proposed trunk-based development to a team that had been living on long-lived feature branches, I got the same reaction you'd expect: "So you want us to merge broken code directly to main?" That's the wrong mental model, and it's the reason most TBD migrations fail. Trunk-based development doesn't mean shipping incomplete work — it means decoupling deployment from feature delivery, and that distinction changes everything.

We've been on trunk-based development with a team of 40+ engineers for two years. In that time, we've had exactly one production incident caused by a bad merge to main. This post is about the specific guardrails that make that number possible.

The Core Insight: Deployment Is Not Release

Long-lived feature branches exist because engineers conflate two separate concerns: getting code into production infrastructure, and making a feature available to users. If those two things happen at the same moment, long-lived branches make sense — you need somewhere to accumulate work until the feature is "ready."

Separate them, and the calculus changes completely. Code can land in production continuously. Features become available to users according to a separate, controlled schedule. The mechanism that enables this separation is the feature flag.

graph LR
  A[Code merged to main] --> B[Deployed to production]
  B --> C{Feature Flag?}
  C -->|Flag OFF| D[Code runs, feature hidden]
  C -->|Flag ON for 1%| E[Dark launch / canary]
  C -->|Flag ON for 100%| F[Full rollout]
  D --> E
  E --> F

Once you internalize this model, the question "is this code ready to ship?" becomes two separate questions: "is this code safe to deploy?" (almost always yes, if the flag is off) and "is this feature ready to release?" (that's a product decision, made independently).

Branch Policy: The Non-Negotiables

Trunk-based development requires a branch policy that's enforced mechanically, not socially. Relying on engineers to "just not create long-lived branches" doesn't work. Here's our enforced policy:

Hard rules (CI blocks merge if violated):

  • No branch may be more than 2 days old without a merge to main or explicit extension approval
  • PRs must have at least one passing CI run against the current main commit (rebased, not just green against the branch base)
  • Merge commits are disabled; squash or rebase only

Soft norms (reviewed in retros, not CI-blocked):

  • PRs under 400 lines are preferred
  • Each PR should be deployable independently — it should not require another PR to be merged first to be safe

The 2-day branch lifetime rule is aggressive and was controversial when we introduced it. It forces a discipline: if your change is too big to land in 2 days, you need to break it up. That's a feature, not a bug. Engineers who struggle with this rule almost always have a hidden problem (can't ship incomplete work because flags aren't available for their feature type, or tests take 45 minutes to run and rebasing is painful) that the rule surfaces.
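The 2-day rule is enforceable with a small CI check. Here's a sketch of what that gate can look like; the script, function names, and entry point are illustrative, not our actual tooling. The idea: a branch's age is the age of its oldest commit that isn't already on main.

```python
# Hypothetical CI gate for the 2-day branch lifetime rule (names are
# illustrative, not the post's actual tooling).
import subprocess
import time

MAX_BRANCH_AGE_HOURS = 48  # the 2-day limit


def oldest_unique_commit_age_hours(commit_timestamps: list[int], now: float) -> float:
    """Hours since the oldest of the given unix commit timestamps."""
    if not commit_timestamps:
        return 0.0  # branch has no unique commits; nothing to age-check
    return (now - min(commit_timestamps)) / 3600


def branch_commit_timestamps(base: str = "origin/main", head: str = "HEAD") -> list[int]:
    """Unix timestamps of commits on `head` that are not on `base`."""
    out = subprocess.run(
        ["git", "log", "--format=%ct", f"{base}..{head}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return [int(ts) for ts in out]


def branch_within_limit(base: str = "origin/main", head: str = "HEAD") -> bool:
    """CI entry point: fail the check when the branch exceeds the limit."""
    age = oldest_unique_commit_age_hours(branch_commit_timestamps(base, head), time.time())
    return age <= MAX_BRANCH_AGE_HOURS
```

The explicit-extension escape hatch can be a PR label that skips this check; the point is that extensions are visible and deliberate, not the default.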

# .github/branch-protection.yml
# Applied via Terraform / GitHub Actions
branch_protection:
  main:
    required_status_checks:
      - "ci/build"
      - "ci/test"
      - "ci/security-scan"
    require_up_to_date: true
    required_approving_review_count: 1
    dismiss_stale_reviews: true
    require_linear_history: true
    allow_force_pushes: false
    allow_deletions: false

Feature Flag Discipline

Flags are the heart of TBD, and they're also where teams accumulate the most technical debt. A flag created in January 2024 that's still in the codebase in October 2025 is a liability: it adds conditional paths that complicate testing, it carries cognitive overhead, and it often guards code that was "temporary" and became permanent.

Our flag lifecycle has four phases:

  1. Draft — flag exists in config, code guarded, flag is OFF everywhere
  2. Canary — flag ON for internal users and a small percentage of production traffic
  3. Rollout — flag percentage increasing from 5% → 25% → 100% over days or weeks
  4. Cleanup — flag removed from code, feature runs unconditionally
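The rollout phase depends on deterministic bucketing: a user who's in the rollout at 5% must stay in it at 25% and 100%. A common way to get that property (a sketch, not our actual flag SDK) is to hash the flag name plus user ID into a stable bucket:

```python
# Sketch of deterministic percentage bucketing for gradual rollout.
# Hashing flag name + user ID gives each user a stable bucket in [0, 100),
# so raising the percentage only ever adds users, never reshuffles them.
import hashlib


def in_rollout(flag_name: str, user_id: str, percentage: float) -> bool:
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000 / 100  # stable bucket in [0, 100)
    return bucket < percentage
```

Salting the hash with the flag name matters: without it, the same 5% of users would land in every canary, and they'd absorb all early-rollout risk.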

The cleanup phase is the one teams skip, and it's the one that matters most for long-term codebase health. We track flag age in our internal developer platform. Any flag older than 90 days that hasn't reached 100% rollout generates a weekly nag to the owning team. Any flag older than 180 days gets escalated to an engineering manager.
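The 90-day nag and 180-day escalation reduce to a simple daily job. The flag registry shape and the returned actions below are hypothetical, but they show the shape of the policy:

```python
# Hypothetical daily job implementing the 90-day nag / 180-day escalation.
# The Flag record and action strings are illustrative, not the post's
# actual developer-platform API.
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Flag:
    name: str
    owner_team: str
    created: date
    rollout_percent: float


def staleness_action(flag: Flag, today: date) -> Optional[str]:
    """Return 'nag', 'escalate', or None for a flag."""
    if flag.rollout_percent >= 100.0:
        return None  # fully rolled out; cleanup is tracked separately
    age_days = (today - flag.created).days
    if age_days > 180:
        return "escalate"  # goes to an engineering manager
    if age_days > 90:
        return "nag"       # weekly reminder to the owning team
    return None
```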

// Flag usage with automatic staleness tracking
import { getFlag } from "@arika/flags";
 
export async function renderCheckout(userId: string) {
  // Flag name, owner, and created date are required at flag creation time
  // The SDK warns at startup if a flag is >90 days old
  const useNewCheckoutFlow = await getFlag("new-checkout-flow", {
    userId,
    defaultValue: false,
  });
 
  if (useNewCheckoutFlow) {
    return renderNewCheckout(userId);
  }
  return renderLegacyCheckout(userId);
}

One rule we enforce without exception: flags that affect data writes must be independently rollback-safe. A flag that causes your service to write to a new database table is not rollback-safe if turning the flag off leaves the new table with partial data that the old path doesn't read. This requires thinking about flag design before the code is written, not after.
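One way to satisfy that rule for the new-table example is to dual-write while the flag is on, so the old path's store stays complete and turning the flag off strands nothing. A minimal sketch with in-memory stand-ins (all names hypothetical):

```python
# Rollback-safe write path for a flag that introduces a new table.
# In-memory stand-ins for the two stores; names are hypothetical.
legacy_orders: list[dict] = []
new_orders: list[dict] = []


def record_order(order: dict, write_new_orders_table: bool) -> None:
    """The invariant: the legacy store is ALWAYS written, so flag-off
    never leaves data the old read path can't see."""
    legacy_orders.append(order)      # old path's source of truth
    if write_new_orders_table:
        new_orders.append(order)     # additive; safe to stop at any time
```

The new table may have gaps after a flag-off period, but that's a backfill problem you solve before cutting reads over, not a production incident.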

Dark Launches and Traffic Shadowing

Dark launch means sending real production traffic through new code paths without surfacing results to users. It's the highest-confidence validation technique available before full release, and it's under-used because it requires infrastructure investment upfront.

Our dark launch setup works at the service level: when a flag is in canary phase, we can configure the service to execute both the old and new code paths for the same request and compare outputs. The new path's result is discarded; the old path's result is returned. Differences are logged and aggregated.

# Simplified dark launch comparison middleware
import asyncio
import logging
from typing import Callable, Any
 
logger = logging.getLogger("dark_launch")
 
async def dark_launch_compare(
    control: Callable[..., Any],
    candidate: Callable[..., Any],
    *args,
    **kwargs,
) -> Any:
    """
    Run both control and candidate. Return control result.
    Log any differences between outputs.
    """
    control_result, candidate_result = await asyncio.gather(
        control(*args, **kwargs),
        candidate(*args, **kwargs),
        return_exceptions=True,
    )
 
    # If the control path itself failed, re-raise unchanged: dark launch
    # must never alter the behavior the caller sees.
    if isinstance(control_result, Exception):
        raise control_result

    if isinstance(candidate_result, Exception):
        logger.warning(
            "dark_launch.candidate_error",
            extra={"error": str(candidate_result), "args": str(args)[:200]},
        )
    elif control_result != candidate_result:
        logger.info(
            "dark_launch.divergence",
            extra={
                "control": str(control_result)[:500],
                "candidate": str(candidate_result)[:500],
            },
        )
 
    return control_result

Dark launch data tells you two things before you turn on the flag for real: whether the new code path crashes under production input diversity, and whether it produces different outputs. Divergences are expected at first and are the signal you investigate. When divergences reach zero (or are fully explained), you have high confidence in the new path.
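Turning the logged divergences into a go/no-go decision can be as simple as a windowed rate with a minimum sample count. The window size and thresholds below are illustrative, not our actual values:

```python
# Sketch: gate rollout on the observed dark-launch divergence rate.
# Window size, threshold, and minimum sample count are illustrative.
from collections import deque


class DivergenceGate:
    def __init__(self, window: int = 10_000, max_divergence_rate: float = 0.0):
        self.results = deque(maxlen=window)  # True = outputs diverged
        self.max_rate = max_divergence_rate

    def record(self, diverged: bool) -> None:
        self.results.append(diverged)

    def ready_for_rollout(self, min_samples: int = 1_000) -> bool:
        if len(self.results) < min_samples:
            return False  # not enough production-shaped input yet
        rate = sum(self.results) / len(self.results)
        return rate <= self.max_rate
```

"Fully explained" divergences (say, a timestamp field the comparison should ignore) are better handled by normalizing outputs before comparison than by raising the threshold.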

Rollback Drills: Practice Before You Need It

Most teams that claim to have a rollback strategy have never actually run it under pressure. The drill is what validates the strategy.

We run quarterly rollback drills in a production-mirroring staging environment. The exercise: an engineer introduces a tagged "bad" commit to main, triggers a deploy, then must fully roll back within 15 minutes using only documented procedures.

The 15-minute target is not arbitrary. It corresponds to our maximum tolerable time to restore service via a flag-off rollback. If rolling back a flag takes 20 minutes, we investigate why and fix the tooling. Common culprits: flag propagation lag (our CDN was caching flag evaluations for 10 minutes; we reduced it to 60 seconds), missing runbooks, or a flag SDK that doesn't degrade gracefully when the flag service is unavailable.

sequenceDiagram
  participant On-Call as On-Call Engineer
  participant Flags as Feature Flag Service
  participant Service as Production Service
  participant Monitor as Monitoring
  Monitor->>On-Call: Alert: error rate spike
  On-Call->>Flags: Set flag "new-checkout-flow" to 0%
  Flags-->>Service: Config propagated (< 60s)
  Service-->>Monitor: Error rate recovering
  On-Call->>Monitor: Confirm recovery
  Note over On-Call,Monitor: Total: < 5 minutes with flag rollback

The drill also tests the runbook. If the person running the drill can't complete it by following the runbook alone (without tribal knowledge), the runbook is wrong and needs updating. We treat runbook gaps found in drills the same way we treat production incidents: they get a ticket with an owner and a due date.

PR Size and Incremental Delivery

The single biggest practical obstacle to TBD is the engineer who says "my feature is just too big to break into small PRs." In 90% of cases, this is a skill issue, not a feature complexity issue. Breaking up large changes is a learnable discipline.

Patterns that work:

Strangler fig on the code path: Add the new implementation alongside the old, guarded by a flag. Neither path is removed until the new path is fully validated. Each PR is small because it's additive.

Schema-first migrations: Database schema changes land first, with backward compatibility guaranteed. Application code using the new schema lands in subsequent PRs. No PR is blocked on another.

Parallel running: For critical business logic, run both implementations and compare results (see dark launch above). The comparison harness can land before either implementation is complete.

The one case where breaking up is genuinely hard: large refactors. Renaming a core domain concept across 200 files cannot easily be made incremental. Our approach is to treat mechanical refactors (no behavior change) as a special category: they get a dedicated "refactor PR" that's still squashed to main, but the flag requirement is waived if CI confirms no behavior change via snapshot tests.
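The snapshot check behind that waiver can be simple: pin outputs over a fixed input corpus before the refactor, then assert the refactored code reproduces them exactly. A minimal sketch (the pricing function and corpus are hypothetical):

```python
# Minimal snapshot check for a mechanical, no-behavior-change refactor.
# Serialize outputs over a fixed corpus and compare byte-for-byte.
# The pricing functions and corpus here are hypothetical examples.
import json


def snapshot(fn, inputs: list) -> str:
    """Deterministic serialization of fn's outputs over a fixed corpus."""
    return json.dumps([fn(x) for x in inputs], sort_keys=True)


def price_before(qty: int) -> int:       # implementation on main
    return qty * 100


def price_after(qty: int) -> int:        # refactored implementation
    return qty * 100


CORPUS = [0, 1, 7, 42]
assert snapshot(price_before, CORPUS) == snapshot(price_after, CORPUS)
```

In CI, the "before" snapshot is recorded from the main branch rather than kept in the same file; the refactor PR only has to match it.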

Key Takeaways

  • Trunk-based development works by separating deployment from release — code lands continuously in production, features become available via flags on a separate schedule.
  • Branch lifetime limits must be mechanically enforced, not socially expected; 2-day limits surface hidden tooling and architecture problems.
  • Feature flag lifecycle management is as important as the flags themselves — flag cleanup debt compounds fast and should be tracked with explicit age limits and escalation paths.
  • Dark launches (traffic shadowing with output comparison) provide the highest pre-release confidence and are worth the infrastructure investment for high-risk changes.
  • Rollback drills done quarterly against documented runbooks are the only way to verify that your rollback strategy works before you need it under pressure.
  • Breaking large features into small PRs is a learnable engineering skill, not an inherent limitation of feature complexity — the strangler fig, schema-first, and parallel-running patterns cover most cases.