
Event Sourcing: The Four Ways It Goes Wrong

Ravinder · 9 min read

Architecture · Event Sourcing · GDPR · Distributed Systems

Event sourcing is the only pattern I know that makes teams dramatically overconfident at the start and dramatically regretful six months later. The initial implementation feels clean: everything is an event, events are immutable, you can replay history. Then production teaches you about schema drift at month 3, a GDPR deletion request at month 5, and a 14-hour projection rebuild at month 6.

None of these problems are unsolvable. But each one is easier to mitigate if you design for it before the event store has a billion rows. Here are the four failure modes and what to do about each.

Failure Mode 1: Schema Drift

Events accumulate in an append-only store. Unlike a relational schema where you run a migration and the old shape disappears, events live forever in their original shape. Six months after launch, your OrderCreated event has evolved through seven incompatible versions, and your projection code contains a switch statement that nobody fully understands.

# What schema drift looks like in practice
def apply_order_created(self, event: dict):
    version = event.get("schema_version", 1)
 
    if version == 1:
        # Original shape: no currency field
        amount = event["amount_cents"]
        currency = "USD"  # assumed
    elif version == 2:
        # Added currency in v2
        amount = event["amount_cents"]
        currency = event["currency"]
    elif version == 3:
        # Renamed amount_cents to total_cents in v3
        amount = event["total_cents"]
        currency = event["currency"]
    elif version == 4:
        # Split into subtotal + tax in v4
        amount = event["subtotal_cents"] + event["tax_cents"]
        currency = event["currency"]
    else:
        raise UnknownEventVersion(version)

This code is a maintenance trap. The right approach is event upcasting: transform old event shapes into the current shape at read time, so projection code only handles one shape.

from typing import Callable
 
EventUpcaster = Callable[[dict], dict]
 
UPCASTERS: dict[tuple[str, int], EventUpcaster] = {
    ("OrderCreated", 1): lambda e: {**e, "currency": "USD", "schema_version": 2},
    ("OrderCreated", 2): lambda e: {
        **{k: v for k, v in e.items() if k != "amount_cents"},
        "total_cents": e["amount_cents"],
        "schema_version": 3,
    },
    ("OrderCreated", 3): lambda e: {
        **{k: v for k, v in e.items() if k != "total_cents"},
        "subtotal_cents": e["total_cents"],
        "tax_cents": 0,  # historical events had no tax breakdown
        "schema_version": 4,
    },
}
 
CURRENT_SCHEMA_VERSION = {"OrderCreated": 4}

class MissingUpcaster(KeyError):
    pass

def upcast_event(event: dict) -> dict:
    """Apply all upcasters in sequence until current version."""
    current_version = CURRENT_SCHEMA_VERSION[event["type"]]
    e = dict(event)
    while e.get("schema_version", 1) < current_version:
        key = (e["type"], e.get("schema_version", 1))
        upcaster = UPCASTERS.get(key)
        if upcaster is None:
            raise MissingUpcaster(key)
        e = upcaster(e)
    return e

Rules for managing schema evolution:

  • Never change the meaning of an existing field. Add fields; do not rename or reuse them.
  • Increment schema_version on every structural change. Make it explicit in the event payload, not inferred.
  • Write an upcaster for every version transition before deploying a schema change.
  • Test upcasters against real historical events from production — do not test only against synthetic fixtures.
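The last rule can be enforced in CI: capture a handful of real events from production, store them as fixtures, and assert that the upcast chain lands every one of them in the current shape. A minimal self-contained sketch (the fixture events and the two-step upcaster chain here are illustrative, not the full chain from above):

```python
# Hypothetical fixtures: recorded production events in their original shapes.
HISTORICAL_FIXTURES = [
    {"type": "OrderCreated", "schema_version": 1, "amount_cents": 100},
    {"type": "OrderCreated", "schema_version": 2, "amount_cents": 250, "currency": "EUR"},
]

UPCASTERS = {
    ("OrderCreated", 1): lambda e: {**e, "currency": "USD", "schema_version": 2},
    ("OrderCreated", 2): lambda e: {
        **{k: v for k, v in e.items() if k != "amount_cents"},
        "total_cents": e["amount_cents"],
        "schema_version": 3,
    },
}

CURRENT_VERSION = 3

def upcast(event: dict) -> dict:
    e = dict(event)
    while e["schema_version"] < CURRENT_VERSION:
        e = UPCASTERS[(e["type"], e["schema_version"])](e)
    return e

def test_all_fixtures_reach_current_shape():
    for fixture in HISTORICAL_FIXTURES:
        e = upcast(fixture)
        assert e["schema_version"] == CURRENT_VERSION
        assert "total_cents" in e and "currency" in e
        assert "amount_cents" not in e  # old field must not leak through
```

A test like this catches the common failure where a new upcaster is deployed but an old production shape was never exercised by synthetic fixtures.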
flowchart LR
    E1["Event v1\n{amount_cents: 100}"]
    E2["Event v2\n{amount_cents: 100,\ncurrency: USD}"]
    E3["Event v3\n{total_cents: 100,\ncurrency: USD}"]
    E4["Event v4\n{subtotal_cents: 100,\ntax_cents: 0,\ncurrency: USD}"]
    E1 -->|"upcast 1→2"| E2
    E2 -->|"upcast 2→3"| E3
    E3 -->|"upcast 3→4"| E4
    E4 --> Projection["Projection\n(handles v4 only)"]
    style Projection fill:#276749,color:#e2e8f0

Failure Mode 2: Projection Rebuild Storms

A projection is derived from replaying all relevant events. When you change the projection logic — a new field, a different aggregation, a bug fix — you must rebuild it by replaying the entire event stream. For a small store, this takes seconds. For a large store, it takes hours.

The failure mode is not slow rebuilds per se. It is a rebuild that competes with the live system for the same database resources, causing production degradation while the rebuild runs. I have seen a rebuild of 50 million events take 14 hours and raise p99 query latency by 40x on the shared database.

Mitigation: rebuild into a shadow store, not in-place.

import time

class ProjectionRebuilder:
    """
    Rebuilds a projection into a shadow table, then swaps atomically.
    Live read traffic is unaffected during rebuild.
    """

    def __init__(self, event_store, read_db):
        self.event_store = event_store
        self.read_db = read_db

    def rebuild(self, projector_class, projection_name: str, batch_size: int = 5000):
        shadow = f"{projection_name}_{int(time.time())}"
        self.read_db.create_table_like(shadow, projection_name)
 
        projector = projector_class(target_table=shadow)
        total = 0
 
        for batch in self.event_store.stream_all(batch_size=batch_size):
            for raw_event in batch:
                event = upcast_event(raw_event)
                if projector.handles(event["type"]):
                    projector.apply(event)
            total += len(batch)
            if total % 100_000 == 0:
                print(f"Rebuilt {total} events...")
 
        # Validate before swap
        self._validate_shadow(shadow, projection_name)
 
        # Atomic rename
        self.read_db.rename_table(projection_name, f"{projection_name}_old")
        self.read_db.rename_table(shadow, projection_name)
        print(f"Swap complete. Old table retained as {projection_name}_old for 24h.")
 
    def _validate_shadow(self, shadow: str, live: str):
        shadow_count = self.read_db.count(shadow)
        live_count = self.read_db.count(live)
        if abs(shadow_count - live_count) / max(live_count, 1) > 0.05:
            raise ValidationError(
                f"Shadow count {shadow_count} diverges from live {live_count} by >5%"
            )

Additional mitigations:

  • Snapshotting. Store periodic aggregate snapshots so replay starts from the snapshot, not event zero. A snapshot every 1,000 events reduces replay time by 3 orders of magnitude for long-lived aggregates.
  • Throttled replay. Rate-limit event reads during rebuild to avoid saturating the event store. 5,000 events/second is usually safe; 500,000/second will starve live traffic.
  • Catch-up mode. After the shadow rebuild finishes, enter a catch-up mode where you apply events that arrived during the rebuild before swapping. Otherwise, the shadow is stale by hours.
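The catch-up loop can be sketched as follows. The `last_position` and `stream_from` methods are hypothetical event-store APIs, not part of the rebuilder above; the idea is to record the stream head before replay, then repeatedly drain the gap until it is small enough to pause briefly and swap:

```python
def rebuild_with_catchup(event_store, apply_batch, swap_tables, lag_threshold: int = 100):
    """Shadow rebuild, then keep applying events that arrived during
    the rebuild until the shadow is nearly caught up, then swap."""
    position = event_store.last_position()          # hypothetical API
    for batch in event_store.stream_from(0, until=position):
        apply_batch(batch)

    # Catch-up: new events kept arriving while history was replaying.
    while True:
        head = event_store.last_position()
        if head - position <= lag_threshold:
            break                                   # close enough to swap
        for batch in event_store.stream_from(position, until=head):
            apply_batch(batch)
        position = head

    swap_tables()
```

The loop converges as long as replay is faster than the live write rate; if it is not, no amount of catch-up will ever close the gap, which is itself a useful signal that the rebuild path is under-provisioned.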

Failure Mode 3: GDPR and the Right to Erasure

GDPR Article 17 requires you to erase personal data on request. An append-only, immutable event store is structurally hostile to erasure. This is not a theoretical concern — regulators have issued fines for "we cannot delete because our architecture is append-only."

There are three approaches, in order of preference:

Approach 1: Crypto-Shredding

Encrypt PII fields in events using a per-subject key. On erasure request, delete the key. The events remain; the PII becomes unreadable ciphertext.

from datetime import datetime, timezone
from cryptography.fernet import Fernet

class CryptoShredder:
    def __init__(self, key_store, audit_log):
        self.key_store = key_store  # e.g., AWS KMS, HashiCorp Vault
        self.audit_log = audit_log  # append-only audit sink

    def encrypt_pii(self, subject_id: str, data: dict, pii_fields: list[str]) -> dict:
        key = self.key_store.get_or_create_key(subject_id)
        f = Fernet(key)
        result = dict(data)
        for field in pii_fields:
            if field in result:
                result[field] = f.encrypt(str(result[field]).encode()).decode()
                result[f"{field}_encrypted"] = True
        return result

    def erase_subject(self, subject_id: str):
        """
        Deletes the encryption key. All PII for this subject
        in all historical events becomes permanently unreadable.
        """
        self.key_store.delete_key(subject_id)
        # Audit the erasure
        self.audit_log.write({
            "type": "GDPR_ERASURE",
            "subject_id": subject_id,
            "erased_at": datetime.now(timezone.utc).isoformat()
        })

Crypto-shredding preserves event integrity (the event exists, the audit trail is intact) while making PII irrecoverable. This satisfies GDPR in most interpretations, though consult your legal team on jurisdiction-specific requirements.

Approach 2: PII Externalization

Do not store PII in events at all. Store a reference (user ID, hashed identifier) in the event, and maintain a separate PII store. On erasure, delete from the PII store.

# Event contains no PII:
{
    "type": "OrderCreated",
    "order_id": "ord_123",
    "customer_ref": "cust_456",  # opaque reference
    "total_cents": 9999
}
 
# PII store (separate, erasable):
# customer_ref → {name, email, address, phone}

Projections that need to display customer name join against the PII store at query time. After erasure, those fields return null or a placeholder.
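A sketch of that query-time join, assuming a simple dict-backed PII store (the store interface and names here are illustrative, not a prescribed API):

```python
# Illustrative PII store: customer_ref → PII dict. Erasure deletes the entry.
PII_STORE: dict[str, dict] = {
    "cust_456": {"name": "Ada Lovelace", "email": "ada@example.com"},
}

def erase_subject(customer_ref: str) -> None:
    PII_STORE.pop(customer_ref, None)

def render_order(projection_row: dict) -> dict:
    """Join a projection row against the PII store at query time."""
    pii = PII_STORE.get(projection_row["customer_ref"])
    return {
        **projection_row,
        # After erasure the join misses and we render a placeholder.
        "customer_name": pii["name"] if pii else "[erased]",
    }
```

The event store is never touched by the erasure; only the side store changes, which is exactly what makes this approach compatible with immutability.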

Approach 3: Event Tombstoning (Last Resort)

Write a tombstone event that signals erasure, and filter PII from event replay after the tombstone:

# Tombstone event
{
    "type": "SubjectDataErased",
    "subject_id": "cust_456",
    "erased_fields": ["email", "name", "address"],
    "erased_at": "2025-06-29T00:00:00Z"
}

Projections check for tombstones and redact fields. This is operationally fragile — every projection must implement redaction — and does not actually remove data from the event store. Use only as a last resort when the other approaches are infeasible.
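Redaction during replay might look like this sketch. The tombstone type and `erased_fields` follow the example above; the assumption that every event carries a `subject_id` field is illustrative:

```python
def redact_stream(events: list[dict]) -> list[dict]:
    """Replay filter: strip tombstoned fields from a subject's events
    before any projection sees them."""
    # First pass: collect erased fields per subject from tombstones.
    erased: dict[str, set] = {}
    for e in events:
        if e["type"] == "SubjectDataErased":
            erased.setdefault(e["subject_id"], set()).update(e["erased_fields"])

    # Second pass: redact matching fields on that subject's events.
    out = []
    for e in events:
        fields = erased.get(e.get("subject_id"), set())
        out.append({k: ("[redacted]" if k in fields else v) for k, v in e.items()})
    return out
```

Note the weakness this makes visible: the filter must run in front of every consumer, and the raw PII still sits in the store underneath it.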

Failure Mode 4: Replay Storms

A replay storm occurs when multiple projections simultaneously rebuild from the event store, overwhelming it with read traffic. This happens predictably after: a major schema change, a large bug fix requiring multiple projection rebuilds, or a new team member who decides to "just rebuild all projections to be safe."

sequenceDiagram
    participant P1 as Projection A Rebuild
    participant P2 as Projection B Rebuild
    participant P3 as Projection C Rebuild
    participant ES as Event Store
    P1->>ES: Stream 50M events
    P2->>ES: Stream 50M events
    P3->>ES: Stream 50M events
    Note over ES: CPU: 100%<br/>Disk I/O: saturated<br/>Live queries: timing out
    ES-->>P1: slow / errors
    ES-->>P2: slow / errors
    ES-->>P3: slow / errors

Mitigations:

Serialize rebuilds. Only one projection rebuilds at a time. Use a distributed lock or a rebuild queue.

class RebuildQueue:
    def __init__(self, redis_client):
        self.redis = redis_client
 
    def enqueue_rebuild(self, projection_name: str, priority: int = 5):
        # Lower score = higher priority: bzpopmin pops the smallest score first.
        self.redis.zadd("rebuild_queue", {projection_name: priority})

    def process_next(self):
        # Blocking pop from queue — one worker, one rebuild at a time
        item = self.redis.bzpopmin("rebuild_queue", timeout=30)
        if item:
            _, projection_name, _ = item  # (queue_key, member, score)
            self.run_rebuild(projection_name.decode())  # e.g., delegate to ProjectionRebuilder

Event store read replicas. Route projection rebuilds to a read replica or a dedicated follower node. Live event writes and current-position reads go to the primary.
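The routing can be as simple as choosing a connection by workload, sketched here with hypothetical `primary` and `replica` connection-like objects:

```python
class RoutedEventStore:
    """Sends bulk replay reads to a replica; everything else to the primary.
    `primary` and `replica` are hypothetical connection-like objects."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def append(self, event: dict):
        return self.primary.append(event)        # writes always hit the primary

    def current_position(self):
        return self.primary.current_position()   # must not lag: read from primary

    def stream_all(self, batch_size: int = 5000):
        # Rebuild traffic tolerates replica lag, since a catch-up pass
        # against the primary runs before the final swap anyway.
        return self.replica.stream_all(batch_size=batch_size)
```

The key invariant is that position tracking and writes never read stale data; only the bulk historical stream, which will be reconciled during catch-up, is allowed to lag.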

Materialized snapshots. For aggregates replayed frequently, store snapshots at regular intervals in a separate snapshot store. Replay starts from the latest snapshot, not event 0.

from dataclasses import dataclass

@dataclass
class AggregateSnapshot:
    aggregate_id: str
    aggregate_type: str
    version: int
    state: dict
    snapshot_at: str
def load_aggregate(event_store, snapshot_store, aggregate_id: str):
    snapshot = snapshot_store.latest(aggregate_id)
    if snapshot:
        # Replay only events after the snapshot version
        events = event_store.load_from_version(
            aggregate_id,
            after_version=snapshot.version
        )
        return replay_from_snapshot(snapshot.state, events)
    else:
        events = event_store.load_all(aggregate_id)
        return replay_from_start(events)

When Event Sourcing Is the Wrong Choice

Event sourcing adds permanent complexity. The benefits — complete audit trail, temporal queries, event-driven projections — are real but narrow. It is the wrong choice when:

  • You do not need the audit trail. If you only care about current state and nobody will ever ask "what was the state at time T," you are paying for a feature you do not use.
  • Your domain has no meaningful events. A CRUD service for static reference data (country codes, tax rates) has no business events worth sourcing.
  • Your team lacks the operational experience. Projection rebuilds, schema upcasting, and GDPR compliance in an event store require genuine expertise. The learning curve is steep and the production failures are painful.
  • You need strong read consistency. Async projections produce eventual consistency. If your reads must reflect writes immediately (financial balances, inventory counts), you will fight the architecture constantly.

The honest answer: most services do not need event sourcing. A relational database with an audit log table (append-only, written by triggers or application code) gives you 80% of the audit benefit with 10% of the complexity.
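That alternative can be as small as writing the audit row in the same transaction as the state change. A minimal sketch of the application-code variant, using a DB-API-style connection (the `orders` and `order_audit` tables and their columns are illustrative):

```python
import json
from datetime import datetime, timezone

def update_order_status(conn, order_id: str, new_status: str, actor: str):
    """Current-state update plus an append-only audit row, atomically."""
    with conn:  # one transaction: both rows commit or neither does
        cur = conn.execute("SELECT status FROM orders WHERE id = ?", (order_id,))
        old_status = cur.fetchone()[0]
        conn.execute(
            "UPDATE orders SET status = ? WHERE id = ?", (new_status, order_id)
        )
        conn.execute(
            "INSERT INTO order_audit (order_id, changed_at, actor, diff) "
            "VALUES (?, ?, ?, ?)",
            (
                order_id,
                datetime.now(timezone.utc).isoformat(),
                actor,
                json.dumps({"status": [old_status, new_status]}),
            ),
        )
```

You lose replayability and temporal queries over full state, but you keep "who changed what, when" — which is the question most audit requirements actually ask.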

Key Takeaways

  • Schema drift is inevitable in any long-lived event store — design upcasters from day one and version every event schema explicitly.
  • Projection rebuilds must write to a shadow table and swap atomically; in-place rebuilds degrade production.
  • GDPR compliance requires crypto-shredding or PII externalization before you have an event store with sensitive data — retrofitting is painful.
  • Replay storms are caused by concurrent rebuilds; serialize them behind a queue and route to read replicas.
  • Snapshotting aggregates at regular version intervals reduces replay time from O(all events) to O(events since snapshot).
  • Event sourcing is the right tool for auditable, event-rich domains — not a default architecture for every service.