Migrations: The Playbook That Beats Hope
The migration had been running for six weeks. The team had done everything right on paper: backfill script written, tested in staging, nightly batch jobs humming. Then, on the day they cut over production traffic, they discovered a class of writes that the backfill had never touched — a legacy code path that one team owned and nobody else remembered existed. Rollback took four hours. The data was recoverable, but trust was not.
That story isn't rare. Migrations fail not because engineers are careless but because migrations are fundamentally a distributed systems problem disguised as a database problem. You have two states of truth, two code paths, and a window of time where both must coexist. Hope is not a strategy for that window.
This is the playbook we actually use: phased, instrumented, paranoid.
Why Migrations Always Feel Safe Until They Aren't
The seductive lie of a migration is that it feels like a one-time operation. In reality, a migration creates a period of concurrent state — the old system and the new system both hold authoritative data, and any code path that writes to one but not the other creates silent divergence. The longer the migration window, the more divergence accumulates. The more divergence accumulates, the more painful the cutover.
The second lie is that staging proves correctness. Staging has last month's schema, last quarter's data volume, and none of the weird multi-tenant edge cases your largest customer introduced six months ago. Staging tells you the migration runs. It does not tell you the migration is correct.
The third lie is that rollback is easy. Rolling back a schema change after writes have gone through is almost never as simple as running ALTER TABLE in reverse. If you haven't explicitly designed your rollback path before you start, you don't have one.
The Four Phases
Every migration we run follows four phases regardless of scope. A single-column rename takes an afternoon through these phases. A service-to-service platform migration takes months. The phases don't change; only the duration does.
A fifth phase, cleanup, is also covered, because it is the phase most teams skip, and skipping it is how you accumulate migration debt that haunts you in the next migration.
Phase 1: Prepare
Preparation is about three things: understanding the current state completely, defining success precisely, and building your rollback path before you write a line of migration code.
Audit every writer. This is the step the team in the opening story missed. You need an exhaustive inventory of every code path that touches the data you're migrating. Use your query logs. Use your ORM's query tracker. Use grep. Be paranoid. The legacy code path that nobody remembers is always there.
# Find every callsite that references the old table, with a hit count per file
grep -rn "orders_legacy\|LegacyOrder\|legacy_order" \
    --include="*.py" --include="*.go" --include="*.ts" \
    ./services/ | cut -d: -f1 | sort | uniq -c | sort -rn

Define the migration contract. Write down, before you start, what "done" looks like. This includes:
- Which columns or records are being migrated
- What the accepted data type conversions are
- What happens to records created during the migration window
- What the error tolerance is during shadow reads (we typically use < 0.1%)
Design your rollback explicitly. For each phase, write down what rollback looks like. If rollback from Phase 3 means re-enabling reads from the old system, that needs to be a feature flag flip, not a code deploy.
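One way to keep that contract and rollback plan honest is to capture them as a checked-in artifact rather than a wiki page, so the thresholds and flag names are reviewable. A minimal sketch, assuming a hypothetical MigrationContract structure and flag names chosen purely for illustration:

from dataclasses import dataclass, field

@dataclass
class MigrationContract:
    """Hypothetical checked-in contract for the orders migration."""
    entities: tuple[str, ...] = ("orders",)
    # Accepted type conversions, e.g. legacy_status_code (int) -> status (str enum)
    conversions: dict[str, str] = field(default_factory=dict)
    # Records created during the migration window are covered by dual-write
    in_window_policy: str = "dual_write"
    # Shadow-read divergence tolerance: < 0.1%
    shadow_read_error_tolerance: float = 0.001
    # Rollback per phase must be a flag flip, never a code deploy
    rollback: dict[str, str] = field(default_factory=lambda: {
        "dual_write": "disable flag orders_dual_write_enabled",
        "shadow_reads": "disable flag orders_shadow_reads_enabled",
        "cutover": "set flag orders_read_from_new_system_pct to 0",
    })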
Phase 2: Dual-Write
Dual-write means every write goes to both the old system and the new system, with the old system remaining the source of truth. This is the most important phase and the most dangerous one to rush.
class OrderRepository:
    def create_order(self, order: Order) -> Order:
        # Old system remains authoritative
        saved = self.legacy_db.insert(order)

        # New system gets a best-effort write
        try:
            self.new_db.insert(self._translate(saved))
        except Exception as e:
            # Metric, not exception — old system succeeded
            metrics.increment("migration.dual_write.new_system_failure",
                              tags={"entity": "order"})
            log.warning("dual_write_new_system_failure",
                        order_id=saved.id, error=str(e))
        return saved

The critical discipline here: the new system's write must never block or fail the old system's write. Wrap the new write in a try/except, emit a metric on failure, and move on. If the new system is down, orders still get created.
Track the dual-write lag metric obsessively. If writes are failing to the new system at more than 1%, stop and investigate before proceeding.
# Prometheus metrics for dual-write health
from prometheus_client import Counter, Gauge

dual_write_lag_gauge = Gauge(
    'migration_dual_write_lag_seconds',
    'Seconds between write to old system and new system',
    ['entity_type']
)
dual_write_failure_counter = Counter(
    'migration_dual_write_failures_total',
    'Count of writes that failed on new system during dual-write',
    ['entity_type', 'reason']
)

Backfill happens here. While dual-write runs, you backfill historical data into the new system. Do it in small batches with rate limiting. Always checkpoint your progress so restarts are safe.
import time

def backfill_orders(batch_size: int = 500, delay_ms: int = 50):
    # Resume from the last checkpoint so restarts are safe
    cursor = get_checkpoint("orders_backfill") or 0
    while True:
        batch = legacy_db.query(
            "SELECT * FROM orders WHERE id > %s ORDER BY id LIMIT %s",
            cursor, batch_size
        )
        if not batch:
            break
        for order in batch:
            new_db.upsert(translate_order(order))
        cursor = batch[-1].id
        save_checkpoint("orders_backfill", cursor)
        time.sleep(delay_ms / 1000)  # rate limit to protect the source database

Phase 3: Shadow Reads
Shadow reads are where you verify that the new system's data is actually correct before trusting it with production traffic. For every read from the old system, you also read from the new system and compare the results — but you serve the old result to the caller.
func (r *OrderRepo) GetOrder(ctx context.Context, id string) (*Order, error) {
    old, err := r.legacyDB.GetOrder(ctx, id)
    if err != nil {
        return nil, err
    }

    // Shadow read: async, never blocks the caller. The goroutine uses a fresh
    // context so it is not cancelled when the caller's request context ends.
    go func() {
        shadowCtx := context.Background()
        newOrder, newErr := r.newDB.GetOrder(shadowCtx, id)
        if newErr != nil {
            metrics.Inc("migration.shadow_read.new_system_error")
            return
        }
        if !ordersEqual(old, newOrder) {
            metrics.Inc("migration.shadow_read.divergence")
            log.Warn("shadow_read_divergence",
                "id", id,
                "old", old,
                "new", newOrder,
            )
        } else {
            metrics.Inc("migration.shadow_read.match")
        }
    }()

    return old, nil
}

Run shadow reads for long enough to build confidence. For a high-volume system, this might be a few hours. For systems with weekly batch jobs or monthly billing cycles, it might be weeks. The exit criterion is not "it's been long enough" — it's "the divergence rate has been below our threshold for one full business cycle."
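One way to make that exit criterion concrete is to evaluate it per day across the whole cycle rather than as one aggregate number. A sketch, assuming hypothetical daily match and divergence counts pulled from your metrics store:

DIVERGENCE_THRESHOLD = 0.001  # < 0.1%, matching the Phase 1 contract

def ready_to_cut_over(daily_counts: list[dict]) -> bool:
    """daily_counts: one {"matches": int, "divergences": int} entry per day of the cycle."""
    for day in daily_counts:
        total = day["matches"] + day["divergences"]
        if total == 0:
            # A day with no shadow-read traffic is missing signal, not a pass
            return False
        if day["divergences"] / total >= DIVERGENCE_THRESHOLD:
            return False
    return True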
When you do find divergences, treat each one as a data bug, not a migration bug. Understand the root cause. Was it a missed writer? A timezone conversion issue? A null handling difference? Fix the root cause, not just the symptom.
Phase 4: The Cutover Ceremony
Cutover is a ceremony, not an event. Treat it like a planned production incident: pre-assigned roles, a written runbook, clear abort criteria, and a communication plan.
Your abort criteria should be defined before the ceremony starts:
- Error rate on new system reads > 1%: abort and roll back
- p99 latency on new system > 2x old system: abort and investigate
- Any data corruption detected: immediate halt, incident declared
Use a feature flag to ramp traffic, not a code deploy. A flag flip is reversible in seconds. A deploy takes minutes and introduces its own risk.
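A minimal sketch of what the ramp can look like, assuming a hypothetical flags client, the repositories from the earlier phases, and a deterministic bucketing function so a given order always takes the same path:

import zlib

def bucket(key: str) -> int:
    # Deterministic 0-99 bucket: the same order id always lands in the same bucket
    return zlib.crc32(key.encode("utf-8")) % 100

def get_order(order_id: str) -> Order:
    # The ramp percentage lives in the flag service; aborting the cutover is a
    # flag change back to 0, not a deploy
    ramp_pct = flags.get_int("orders_read_from_new_system_pct", default=0)
    if bucket(order_id) < ramp_pct:
        return new_repo.get_order(order_id)
    return legacy_repo.get_order(order_id)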
Rollback: Design It First, Use It Last
Rollback should feel boring when you execute it. If rollback requires heroics, you haven't designed it properly.
For database migrations, the specific rollback paths are:
- Phase 2 (dual-write): Stop writes to new system, drain async queue, done. Old system never stopped being authoritative.
- Phase 3 (shadow reads): Same as Phase 2 rollback. New system was never serving reads.
- Phase 4 (cutover): Feature flag flip back to old system. New system writes continue briefly via dual-write while you diagnose.
The constraint that makes rollback possible: never remove the old write path until you are fully confident the new system is correct and the old system is in read-only state. Teams that remove the dual-write early because it "seems fine" have no rollback path.
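To make the Phase 2 and Phase 3 rollbacks genuinely a flag flip, the dual-write itself can sit behind a kill switch. A sketch, assuming the same hypothetical flags client and the dual-write repository from Phase 2:

def create_order(self, order: Order) -> Order:
    saved = self.legacy_db.insert(order)  # old system stays authoritative
    # Kill switch: rolling back dual-write means flipping this flag off,
    # while the old write path above stays untouched until cleanup
    if flags.get_bool("orders_dual_write_enabled", default=False):
        try:
            self.new_db.insert(self._translate(saved))
        except Exception:
            metrics.increment("migration.dual_write.new_system_failure",
                              tags={"entity": "order"})
    return saved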
Phase 5: Cleanup
Cleanup is mandatory, not optional. Every column you leave behind, every dead code path you don't delete, every dual-write you don't remove is future maintenance cost and future confusion.
Set a cleanup deadline before you start Phase 4. Ours is 30 days after successful cutover. At that deadline:
- The old system goes read-only, then offline
- The dual-write code is deleted
- The old schema columns are dropped
- The migration feature flags are archived
-- Only after confirmed cutover and monitoring period
ALTER TABLE orders DROP COLUMN legacy_customer_id;
ALTER TABLE orders DROP COLUMN legacy_status_code;
DROP TABLE orders_legacy;

Key Takeaways
- Migrations fail at the edges: the legacy code path nobody remembered, the batch job that runs monthly, the admin tool someone built and forgot to tell you about. Audit every writer before you start.
- Dual-write keeps the old system authoritative; the new system's write must never block the old system's write. Wrap it, meter it, and treat failures as signals, not exceptions.
- Shadow reads are your proof of correctness. Exit criteria should be based on divergence rate over a full business cycle, not elapsed time.
- The cutover ceremony deserves the same preparation as a production incident: pre-assigned roles, a written runbook, explicit abort criteria, and feature-flag-based traffic ramping — not a code deploy.
- Rollback must be designed before Phase 1 starts. If you haven't written down what rollback looks like for each phase, you don't have a rollback plan.
- Cleanup is not optional. Dead write paths, orphaned columns, and retired flags are the debt that makes your next migration harder.