
Schema Migrations on a Billion-Row Table

Ravinder · 9 min read
Data · Schema Migration · Postgres · MySQL · Database

Altering a table with a billion rows is not just a slower version of altering a table with a million rows. The failure modes are qualitatively different. A table lock that would be annoying at a million rows will take down your application for hours at a billion. A migration that works in staging — where you have 50,000 rows — will silently destroy production.

The mistake most teams make is treating schema migrations as a deployment concern handled by their ORM's migration runner. At scale, they are an infrastructure operation that requires planning, tooling, monitoring, and an abort plan.


Why Naive ALTER TABLE Fails

Most databases acquire a table-level lock for schema changes. In Postgres, ADD COLUMN with no default (or a NULL default) has always been a metadata-only change, and since version 11 ADD COLUMN DEFAULT 'some_constant' is too. A volatile default, however, still forces a full table rewrite, which holds an ACCESS EXCLUSIVE lock for the duration. In MySQL, InnoDB online DDL (ALGORITHM=INPLACE) arrived in 5.6 and ALGORITHM=INSTANT in 8.0; before that, most ALTER TABLE statements locked the table for the entire rebuild, and many changes, column type changes in particular, still rebuild the table today.

At a billion rows, a table rewrite can take 4–12 hours depending on row width and disk speed. A 4-hour table lock is an outage.
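The 4–12 hour figure is easy to sanity-check with back-of-envelope arithmetic. A rough sketch (the row width and throughput numbers below are illustrative assumptions, not benchmarks):

```python
def rewrite_hours(rows: int, row_bytes: int, disk_mb_per_sec: float) -> float:
    """Rough lower bound: a table rewrite reads and writes every row once."""
    total_mb = rows * row_bytes / 1_000_000
    return 2 * total_mb / disk_mb_per_sec / 3600  # read pass + write pass

# 1e9 rows of 200 bytes at an effective 25 MB/s
# (index maintenance, WAL, and contention eat most of the raw disk speed)
print(round(rewrite_hours(1_000_000_000, 200, 25), 1))  # → 4.4
```

Even this optimistic estimate lands at the bottom of the 4–12 hour range; real rewrites with several secondary indexes land higher.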

Even "online" DDL has gotchas:

  • Postgres concurrent index builds (CREATE INDEX CONCURRENTLY) do not lock writes but will fail if interrupted, leaving an invalid index behind.
  • MySQL online DDL needs a brief exclusive metadata lock at the start and end of the operation. A long-running transaction holding a metadata lock on the table blocks the DDL — and every query queued behind it — until lock_wait_timeout expires.
  • Both engines can saturate disk I/O during rewrites, degrading query performance even without explicit locks.

The Tool Landscape

For MySQL, two tools dominate:

pt-online-schema-change (pt-osc) — from Percona Toolkit. Creates a shadow table, copies rows in batches, uses triggers to mirror live writes, then atomically swaps tables. Mature, well-understood, production-proven. The catch: triggers add write overhead. On very high write-rate tables (>50K writes/sec), trigger overhead can be significant.

gh-ost — from GitHub. Replicates changes from the MySQL binlog instead of using triggers. Lower write overhead, better control over throttling, and migrations can be paused mid-flight. GitHub open-sourced it after using it on their own production tables. For high-write-rate tables, gh-ost is the better choice.

For Postgres, the story is different. PostgreSQL's DDL capabilities have improved significantly:

  • ADD COLUMN with no default (or DEFAULT NULL) — metadata-only and effectively instant in every supported version.
  • ADD COLUMN DEFAULT <constant> — table rewrite in versions before 11; in Postgres 11+ the constant is stored in the catalog and applied on read, so no rewrite.
  • ALTER COLUMN TYPE — requires a table rewrite unless the new type is binary-coercible from the old (e.g. VARCHAR(50) to VARCHAR(100), or VARCHAR to TEXT). No native online solution for the rewriting cases.
  • Index creation — CREATE INDEX CONCURRENTLY avoids write locks but takes longer.

For complex Postgres migrations, the patterns are: expand-contract, batched backfill scripts, and tools like pgslice for retrofitting table partitioning.


gh-ost on MySQL: The Mechanics

sequenceDiagram
    participant App as Application
    participant MySQL as MySQL Primary
    participant Ghost as gh-ost
    participant Shadow as Shadow Table
    Ghost->>MySQL: Connect to binlog stream
    Ghost->>MySQL: CREATE TABLE _orders_gho LIKE orders
    Ghost->>MySQL: Apply schema change to _orders_gho
    Ghost->>MySQL: BEGIN batch copy (chunk by chunk)
    App->>MySQL: INSERT / UPDATE / DELETE on orders
    MySQL->>Ghost: Binlog events (DML on orders)
    Ghost->>Shadow: Apply binlog events to _orders_gho
    Ghost->>MySQL: Continue batch copy
    Note over Ghost,Shadow: Repeat until copy + binlog = caught up
    Ghost->>MySQL: LOCK TABLE orders WRITE (briefly)
    Ghost->>MySQL: RENAME TABLE orders TO _orders_del, _orders_gho TO orders
    Ghost->>MySQL: UNLOCK
    App->>MySQL: Queries now hit renamed orders

The critical insight: gh-ost reads binlog events and applies them to the shadow table, so the shadow table stays nearly in sync with the source during the copy phase. The final lock window is typically under 2 seconds — just long enough to catch up the final binlog events and rename.
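The copy-plus-replay loop can be sketched in miniature, with dicts standing in for tables and a list standing in for the binlog stream. This is a toy illustration of the mechanics, not gh-ost's actual code:

```python
def ghost_copy(source: dict, binlog: list, chunk_size: int = 2) -> dict:
    """Toy gh-ost loop: chunked batch copy into a shadow table,
    interleaved with replaying binlog events so the shadow stays in sync."""
    shadow = {}
    ids = sorted(source)
    for i in range(0, len(ids), chunk_size):
        for k in ids[i:i + chunk_size]:   # batch copy one chunk
            if k in source:               # row may have been deleted mid-copy
                shadow[k] = source[k]
        while binlog:                     # drain DML that arrived meanwhile
            op, k, v = binlog.pop(0)
            if op == "delete":
                source.pop(k, None)
                shadow.pop(k, None)
            else:                         # insert / update
                source[k] = v
                shadow[k] = v
    return shadow                         # cut-over would now rename the tables

orders = {1: "a", 2: "b", 3: "c", 4: "d"}
events = [("update", 2, "B"), ("delete", 3, None)]  # live writes during the copy
shadow = ghost_copy(orders, events)
assert shadow == orders                   # shadow is fully caught up at cut-over
```

The point the toy makes: because every live write is replayed into the shadow table as it happens, the final rename only has to wait for the last few events, which is why the lock window stays small.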

# gh-ost invocation for adding a column to a billion-row orders table
gh-ost \
  --host=mysql-primary.internal \
  --port=3306 \
  --user=ghost_user \
  --password="${GHOST_PASSWORD}" \
  --database=production \
  --table=orders \
  --alter="ADD COLUMN fulfillment_center_id BIGINT UNSIGNED DEFAULT NULL AFTER warehouse_id, ADD INDEX idx_fulfillment_center (fulfillment_center_id)" \
  --chunk-size=1000 \
  --max-load="Threads_running=50" \
  --critical-load="Threads_running=100" \
  --critical-load-interval-millis=2000 \
  --throttle-control-replicas="replica1.internal,replica2.internal" \
  --max-lag-millis=5000 \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --ok-to-drop-table \
  --execute

Key flags to understand:

  • --max-load: pause copying if Threads_running exceeds 50. This is your primary throttle.
  • --max-lag-millis: pause if replication lag exceeds 5 seconds on any monitored replica. Critical for protecting replica read workloads.
  • --chunk-size: rows copied per batch. 1000 is conservative; you can go to 2000–5000 on fast disks with low contention.
  • Without --execute, gh-ost runs in dry-run mode — verify this first.
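The way those gates compose reduces to a small decision function. A sketch using the thresholds from the invocation above — gh-ost's real behavior is more nuanced (critical-load is re-checked after --critical-load-interval-millis before aborting), so this only captures the shape:

```python
def throttle_decision(threads_running: int, lag_ms: int,
                      max_load: int = 50, critical_load: int = 100,
                      max_lag_ms: int = 5000) -> str:
    """Compose gh-ost's gates: critical-load aborts the migration,
    max-load and replica lag pause the copy, otherwise copying proceeds."""
    if threads_running > critical_load:
        return "abort"      # --critical-load tripped: bail out entirely
    if threads_running > max_load or lag_ms > max_lag_ms:
        return "throttle"   # --max-load or --max-lag-millis tripped: pause
    return "copy"

print(throttle_decision(30, 1200))  # → copy
```

Note the ordering: the abort condition is checked before the throttle conditions, so a server in real trouble is never merely paused.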

Postgres: The Expand-Contract Pattern

Postgres has no gh-ost equivalent. The standard approach for complex schema changes is expand-contract (also called parallel change):

Phase 1: Expand. Add the new column (or table) without removing the old. Both old and new exist simultaneously.

-- Step 1: Add column with no default (instant in Postgres 11+)
ALTER TABLE orders ADD COLUMN fulfillment_center_id BIGINT;
 
-- Step 2: Backfill in batches (see backfill strategy below)
-- Step 3: Add NOT NULL constraint only after backfill is complete
-- Use NOT VALID to skip scanning existing rows — validates only new rows
ALTER TABLE orders ADD CONSTRAINT orders_fulfillment_center_id_not_null
    CHECK (fulfillment_center_id IS NOT NULL) NOT VALID;
 
-- Step 4: Validate the constraint (separate transaction, no lock on writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_fulfillment_center_id_not_null;

The NOT VALID + VALIDATE CONSTRAINT split is the key Postgres technique. NOT VALID adds the constraint instantly (no full-table scan). VALIDATE CONSTRAINT scans existing rows to confirm compliance — it takes time but only holds a SHARE UPDATE EXCLUSIVE lock, which does not block reads or writes.

Phase 2: Migrate. Update application code to write to both old and new columns. Backfill old rows.
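The Phase 2 dual write can be sketched as a single write path that populates both columns, so rows written after the expand never need backfilling. A minimal sketch — the dict stands in for an orders row, and this warehouse_to_center mapping is purely illustrative (in the article it is a SQL function):

```python
def warehouse_to_center(warehouse_id: int) -> int:
    """Illustrative stand-in for the real warehouse -> center mapping."""
    return warehouse_id // 100

def write_order(row: dict, warehouse_id: int) -> dict:
    """Phase-2 dual write: every write populates the old column and the
    new one, so only rows from before the expand need the backfill."""
    row["warehouse_id"] = warehouse_id                                # old column
    row["fulfillment_center_id"] = warehouse_to_center(warehouse_id)  # new column
    return row

order = write_order({}, 4217)
assert order["fulfillment_center_id"] == 42
```

Once the backfill catches up with the oldest rows, reads can cut over to the new column and the dual write becomes a single write again.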

Phase 3: Contract. Once all rows are migrated and application code no longer reads the old column, drop the old column. ALTER TABLE DROP COLUMN in Postgres marks the column as dropped in the catalog and is instant — the space is reclaimed lazily.


Backfill Strategies

Never backfill a billion rows in a single transaction. It will lock the table, consume enormous undo/WAL space, and likely fail partway through.

The correct approach is chunked backfills with cursor-based pagination:

import psycopg2
import time
 
def backfill_fulfillment_center(conn_str: str, batch_size: int = 5000):
    conn = psycopg2.connect(conn_str)
    conn.autocommit = False
 
    with conn.cursor() as cur:
        # Get total count for progress tracking
        cur.execute("SELECT COUNT(*) FROM orders WHERE fulfillment_center_id IS NULL")
        total = cur.fetchone()[0]
        print(f"Rows to backfill: {total:,}")
 
        processed = 0
        last_id = 0
 
        while True:
            with conn:  # transaction per batch
                # Postgres UPDATE has no ORDER BY / LIMIT, so select
                # the batch of ids in a subquery instead
                cur.execute("""
                    UPDATE orders
                    SET fulfillment_center_id = warehouse_to_center(warehouse_id)
                    WHERE id IN (
                        SELECT id FROM orders
                        WHERE id > %s
                          AND fulfillment_center_id IS NULL
                        ORDER BY id
                        LIMIT %s
                    )
                    RETURNING id
                """, (last_id, batch_size))
 
                rows = cur.fetchall()
                if not rows:
                    break
 
                # RETURNING order is not guaranteed; advance the cursor
                # to the highest id touched in this batch
                last_id = max(r[0] for r in rows)
                processed += len(rows)
                print(f"Progress: {processed:,} / {total:,} ({100*processed/total:.1f}%)")
 
            # Throttle: sleep between batches to avoid I/O saturation
            time.sleep(0.1)
 
    conn.close()
    print("Backfill complete.")

Throttle between batches. The 100ms sleep is not optional. Without throttling, the backfill will saturate disk I/O and degrade application performance. Adjust based on your replication lag: if replicas start falling behind, increase the sleep.

Monitor replication lag during backfill. On Postgres, watch pg_stat_replication.write_lag and replay_lag. On MySQL, watch Seconds_Behind_Master. If lag spikes above your threshold, pause the backfill.

Use the primary key as cursor. ID-based cursors are more reliable than OFFSET-based pagination. OFFSET scans from the beginning of the table for each batch — ID-based cursors use the index.
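The cost difference is visible even in a toy model: a keyset cursor seeks straight to the last id, while OFFSET must walk past every earlier row on each batch. A sketch over an in-memory sorted list (bisect plays the role of the index seek):

```python
import bisect

def offset_page(ids: list, offset: int, size: int) -> list:
    """OFFSET pagination: the database scans `offset` rows before the page."""
    return ids[offset:offset + size]

def keyset_page(ids: list, last_id: int, size: int) -> list:
    """Keyset (cursor) pagination: an index seek jumps straight past last_id."""
    start = bisect.bisect_right(ids, last_id)
    return ids[start:start + size]

ids = list(range(1, 101))
# Same page, but keyset_page's work is O(log n) seek + O(size) scan,
# while offset_page's is O(offset + size) scan
assert offset_page(ids, 50, 5) == keyset_page(ids, 50, 5) == [51, 52, 53, 54, 55]
```

On a billion-row table the OFFSET cost grows with every batch, which is why OFFSET-based backfills slow to a crawl near the end while keyset backfills run at constant speed.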


Index Migrations

Adding an index to a billion-row table is a frequent need and a common source of incidents.

In Postgres:

-- Always use CONCURRENTLY for large tables
CREATE INDEX CONCURRENTLY idx_orders_fulfillment_center
    ON orders (fulfillment_center_id)
    WHERE fulfillment_center_id IS NOT NULL;  -- partial index if appropriate

Risks of concurrent index builds:

  • If the build is interrupted (process killed, connection dropped), it leaves an invalid index behind. Find it with SELECT relname FROM pg_class JOIN pg_index ON pg_index.indexrelid = pg_class.oid WHERE NOT indisvalid; (invalid indexes are also marked INVALID in \d output). Drop and rebuild invalid indexes.
  • Long-running transactions can block the final phase of the concurrent build. Check pg_stat_activity for long-running queries before starting.

In MySQL with gh-ost, include index creation in the --alter clause. gh-ost creates the index on the shadow table before the swap, so the primary table never takes a lock for index creation.


The Abort Plan

Every migration needs a tested abort plan. "We can always roll back" is not a plan. Write it down:

  1. gh-ost abort. Kill the gh-ost process. The original table is untouched; the shadow table (_orders_gho) and gh-ost's changelog table (_orders_ghc) are left behind. Clean up: DROP TABLE IF EXISTS _orders_gho; DROP TABLE IF EXISTS _orders_ghc; (the _orders_del table only exists after a completed cut-over).

  2. Backfill abort. Kill the backfill script. The column has partial data (some rows populated, some NULL). Application code must handle NULL for this column until the backfill is restarted.

  3. Constraint abort. If VALIDATE CONSTRAINT is running and causing load, you can kill it. The constraint remains NOT VALID. You can drop the constraint without penalty and retry validation during a quieter period.

  4. Application abort. If the new schema causes application errors post-migration, the fastest rollback is a feature flag disabling the code path that uses the new column — not a database rollback. Database rollbacks are often slower than the original migration.

Document these steps in your runbook before you start the migration. Run a fire drill in staging.


Key Takeaways

  • Naive ALTER TABLE at billion-row scale causes hours-long table locks; treat schema migrations as infrastructure operations, not ORM tasks.
  • For MySQL, gh-ost is the preferred tool for high-write-rate tables because it reads from binlog instead of using triggers, keeping write overhead low.
  • For Postgres, the expand-contract pattern with ADD CONSTRAINT ... NOT VALID followed by VALIDATE CONSTRAINT lets you add constraints incrementally without blocking reads or writes.
  • Always backfill in batches using a primary-key cursor with inter-batch throttling; monitor replication lag and pause if it spikes above your threshold.
  • Create large indexes with CONCURRENTLY in Postgres and check for invalid indexes after any interrupted build.
  • Write and test your abort plan before starting the migration — the fastest rollback in most cases is a feature flag, not a database operation.