Multi-Region Active-Active Without Losing Your Weekends

Ravinder · 11 min read

A team I advised spent six months building a multi-region active-active deployment. The architecture was beautifully designed — two regions, both serving traffic, data replicated asynchronously. They launched it in January. By March they had their first conflict-induced data corruption incident. By April they had their first GDPR violation from a routing bug that briefly stored EU user data in a US region. By May their on-call rotation had doubled in size because no one knew what to do when a region failed.

Multi-region active-active is one of the most difficult operational postures in distributed systems. The theoretical benefits — higher availability, lower latency, regional fault isolation — are real. So are the operational costs. The difference between teams that make it work and teams that lose their weekends over it is usually not the technology they chose. It is whether they had a clear-eyed view of the failure modes before they started.

This post is the honest accounting: what active-active buys you, what conflict resolution actually looks like in practice, how data residency interacts with failover, and the failover drill regimen that keeps the system honest.

Active-Active vs Active-Passive: What You Are Actually Buying

Active-passive is the conservative choice: one primary region serves all writes, a standby region receives replicated data and serves reads (or nothing). Failover means promoting the passive region to primary. Conflict resolution is trivial because only one region ever writes.

Active-active means both (or all) regions accept writes simultaneously. The benefits:

  • Write latency: Users in EU-West get write latency against an EU database, not a US one.
  • Write availability: A US region failure does not block EU writes.
  • Capacity: Write load is distributed across regions.

The costs:

  • Conflict resolution: Two regions can write to the same record concurrently. You must resolve this.
  • Replication lag: Changes written in one region are not immediately visible in another.
  • Operational complexity: Failover, routing, and observability are all harder.
  • Data residency: If your routing is imperfect, data can land in the wrong region.

Before committing to active-active, be honest about whether you actually need write availability across regions. Most applications can tolerate write unavailability for the duration of a regional failover (5-30 minutes) in exchange for dramatically simpler consistency semantics. Active-passive with a fast failover is often the right answer.

Conflict Resolution: LWW vs CRDT

When two regions write to the same record concurrently, you have a conflict. You must resolve it somehow. The two dominant approaches are last-write-wins (LWW) and conflict-free replicated data types (CRDTs).

Last-Write-Wins

LWW is the simplest approach: the write with the higher timestamp wins. Every write carries a timestamp (usually the server time at the writing region). During replication merge, the higher timestamp takes precedence.

from dataclasses import dataclass
from datetime import datetime
 
@dataclass
class Record:
    key: str
    value: dict
    updated_at: datetime
    region: str
 
def merge_lww(local: Record, remote: Record) -> Record:
    """Return the record with the higher timestamp (ties broken by region name so every region converges on the same winner)."""
    if remote.updated_at > local.updated_at:
        return remote
    if remote.updated_at == local.updated_at and remote.region > local.region:
        return remote
    return local

LWW is predictable and simple. It also silently loses data. If two regions update different fields of the same record concurrently, LWW picks one and discards the other. If the application writes are append-style (e.g., a counter increment), LWW produces incorrect results — both regions may increment from the same base value, and LWW keeps one increment rather than applying both.
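
To make the lost-increment failure concrete, here is a minimal sketch using the Record and merge_lww definitions above (the key format and values are hypothetical):

# Both regions read {"views": 100} and increment it concurrently.
eu_write = Record("page:42", {"views": 101}, datetime(2024, 1, 1, 12, 0, 0, 5000), "eu-west")
us_write = Record("page:42", {"views": 101}, datetime(2024, 1, 1, 12, 0, 0, 9000), "us-east")

merged = merge_lww(eu_write, us_write)
assert merged.value["views"] == 101  # the true total is 102; one increment is silently dropped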

LWW is appropriate when:

  • The entity is write-once (or write-rarely) and reads are far more common.
  • The field semantics are "last state wins" (e.g., a user's profile photo URL — you want the most recent value).
  • The conflict rate is low and data loss is acceptable for the conflict cases.

CRDTs

Conflict-free replicated data types are data structures designed so that concurrent updates from multiple replicas can always be merged deterministically, without coordination, producing a consistent result.

The canonical CRDT types:

G-Counter (grow-only counter): Each region has its own counter slot. The global count is the sum of all slots. No conflict possible.

class GCounter:
    def __init__(self, node_id: str, nodes: list[str]):
        self.node_id = node_id
        self.counters: dict[str, int] = {node: 0 for node in nodes}
 
    def increment(self) -> None:
        self.counters[self.node_id] += 1
 
    def value(self) -> int:
        return sum(self.counters.values())
 
    def merge(self, other: "GCounter") -> "GCounter":
        # Merge over the union of node sets in case the replicas disagree on membership.
        all_nodes = set(self.counters) | set(other.counters)
        merged = GCounter(self.node_id, list(all_nodes))
        for node in all_nodes:
            merged.counters[node] = max(
                self.counters.get(node, 0),
                other.counters.get(node, 0),
            )
        return merged

PN-Counter: Two G-Counters (positive and negative). Supports increment and decrement. Value is P.value() - N.value(). Suitable for inventory counts, vote tallies.
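
A minimal PN-Counter sketch built directly on the GCounter above (names are illustrative, not from a specific library):

class PNCounter:
    def __init__(self, node_id: str, nodes: list[str]):
        self.p = GCounter(node_id, nodes)  # increments
        self.n = GCounter(node_id, nodes)  # decrements

    def increment(self) -> None:
        self.p.increment()

    def decrement(self) -> None:
        self.n.increment()

    def value(self) -> int:
        return self.p.value() - self.n.value()

    def merge(self, other: "PNCounter") -> "PNCounter":
        merged = PNCounter(self.p.node_id, list(self.p.counters.keys()))
        merged.p = self.p.merge(other.p)
        merged.n = self.n.merge(other.n)
        return merged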

LWW-Register with vector clocks: An LWW register that uses vector clocks rather than wall-clock timestamps. Avoids clock skew issues. More complex to implement.
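
The heart of that variant is the causal dominance check between two vector clocks; a minimal sketch (the full register also needs per-write clock bookkeeping, omitted here):

def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if clock a has observed everything in clock b (and differs from it)."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys) and a != b

# If neither clock dominates the other, the writes are concurrent and need
# an explicit tie-break or must be surfaced to the application as siblings.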

OR-Set (observed-remove set): A set that supports add and remove without conflict. Add operations are tagged with unique IDs; remove operations reference specific add IDs. Resolves the "add and remove concurrently" conflict by keeping adds that were not specifically observed and removed.
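
A minimal OR-Set sketch (add tags are random UUIDs; tombstone garbage collection is omitted):

import uuid

class ORSet:
    def __init__(self) -> None:
        self.adds: dict[str, set[str]] = {}     # element -> tags from add operations
        self.removes: dict[str, set[str]] = {}  # element -> tags observed and removed

    def add(self, element: str) -> None:
        self.adds.setdefault(element, set()).add(str(uuid.uuid4()))

    def remove(self, element: str) -> None:
        # Only remove the add-tags observed locally; concurrent adds survive.
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element: str) -> bool:
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other: "ORSet") -> None:
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        for element, tags in other.removes.items():
            self.removes.setdefault(element, set()).update(tags)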

CRDTs require careful data modeling. Not every application concept maps cleanly to a CRDT. The practical reality: identify the data types in your system that need multi-region concurrent writes, model those specifically as CRDTs, and use LWW or routing-based write ownership for everything else.

Region-Pinning vs Roaming Users

The cleanest solution to conflict resolution is preventing conflicts by design: pin each user to a home region and route all their writes to that region.

flowchart TD
    User["User Request"] --> Router["Global Load Balancer\n(GeoDNS or Anycast)"]
    Router --> Lookup["User Home Region Lookup\n(edge cache, ~1ms)"]
    Lookup -->|"EU user"| EURegion["EU-West Region\n(authoritative writer)"]
    Lookup -->|"US user"| USRegion["US-East Region\n(authoritative writer)"]
    EURegion -->|"Async replication"| USRegion
    USRegion -->|"Async replication"| EURegion
    style EURegion fill:#4f46e5,color:#fff
    style USRegion fill:#059669,color:#fff

With region-pinning, reads can be served from the local region (with replication lag tolerance), but writes always go to the home region. Conflict rate drops to near zero because a given record only has one authoritative writer.
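
A minimal sketch of what the write-routing decision looks like, assuming a home-region lookup keyed by user ID (the endpoints and table are illustrative):

HOME_REGION: dict[str, str] = {}  # user_id -> home region, populated at signup

REGION_ENDPOINTS = {
    "eu-west-1": "https://api.eu-west-1.example.com",
    "us-east-1": "https://api.us-east-1.example.com",
}

def route_write(user_id: str, request_region: str) -> str:
    """Writes always go to the user's home region; reads can stay local."""
    home = HOME_REGION.get(user_id, request_region)
    return REGION_ENDPOINTS[home]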

The tradeoff: latency asymmetry. A US user whose home region is US-East gets US-latency writes. A US user who is traveling in Asia still gets US-East write latency — a roaming write penalty of 150-200ms RTT. Whether this is acceptable depends on your SLA.

The operational benefit of region-pinning is enormous: your conflict resolution code is almost never exercised, making it much easier to reason about data integrity.

When roaming writes matter: If your product has global-roaming users who need consistent write performance regardless of location (e.g., a real-time collaboration tool), you need true active-active with conflict resolution. Accept the complexity explicitly.

Data Residency and GDPR: The Failure Mode Nobody Plans For

GDPR's rules on international transfers (Chapter V, Article 44 onward) restrict moving EU residents' personal data to jurisdictions outside the EEA unless a safeguard applies, such as an adequacy decision or standard contractual clauses. Multi-region active-active creates a specific class of GDPR violation: a routing bug that causes EU user data to be written to or replicated to a US region without any such safeguard in place.

This is not theoretical. It happens. The failure modes:

  1. Failover routing: During a US region failure, traffic fails over to the EU region — correct. But if the failover also accidentally routes EU user data into the recovering US region during restoration, you have a residency violation.
  2. Async replication of full data: If you replicate all data to all regions for disaster recovery, you are storing EU data in US regions. You need selective replication that excludes personal data from cross-jurisdiction copies.
  3. Backup and export jobs: Analytics pipelines and backup jobs that run globally and pull data from all regions into a central store.

The architectural response:

flowchart LR
    subgraph EU["EU-West (GDPR Jurisdiction)"]
        EUDB["EU Database\n(Personal Data)"]
        EUAnon["EU Analytics DB\n(Anonymized Only)"]
    end
    subgraph US["US-East"]
        USDB["US Database\n(Personal Data)"]
        USAnon["US Analytics DB\n(Anonymized Only)"]
    end
    EUDB -->|"Anonymize & replicate\n(no PII)"| USAnon
    USDB -->|"Anonymize & replicate\n(no PII)"| EUAnon
    EUDB -->|"Encrypted backup\n(stays in EU)"| EUBackup["EU S3 (eu-west-1)"]
    style EUDB fill:#dc2626,color:#fff
    style EUBackup fill:#dc2626,color:#fff

The rules I enforce:

  • Personal data never crosses jurisdiction in unencrypted form.
  • Replication of personal data across jurisdictions requires explicit legal basis — usually a standard contractual clause — documented and reviewed by legal.
  • Failover routing must be constrained by user home region. EU users must always fail over to another EU region, never to US. If no EU region is available, the right answer is a degraded mode or a user-facing error, not a jurisdiction violation (see the sketch after this list).
  • Audit logs of cross-region data flows are a GDPR compliance requirement in practice, even when not explicitly mandated.
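
A minimal sketch of residency-constrained failover selection, assuming a static mapping of jurisdictions to regions (names are illustrative):

JURISDICTION_REGIONS = {
    "EU": ["eu-west-1", "eu-central-1"],
    "US": ["us-east-1", "us-west-2"],
}

def failover_region(user_jurisdiction: str, healthy_regions: set[str]) -> str | None:
    """Pick a healthy region inside the user's jurisdiction, or None.

    None means degrade or return an error; it never means routing the
    user's data to another jurisdiction.
    """
    for region in JURISDICTION_REGIONS.get(user_jurisdiction, []):
        if region in healthy_regions:
            return region
    return None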

Conflict Rate Monitoring

Whatever conflict resolution strategy you choose, you need to know your conflict rate. A conflict rate that is creeping up is a leading indicator of a routing problem, a clock skew problem, or an unexpected write pattern change.

from prometheus_client import Counter, Histogram
import time
 
CONFLICTS_TOTAL = Counter(
    "replication_conflicts_total",
    "Total replication conflicts by type and resolution",
    ["conflict_type", "resolution", "entity_type"],
)
 
CONFLICT_AGE = Histogram(
    "replication_conflict_age_seconds",
    "Age of conflicting writes at resolution time",
    buckets=[0.1, 1, 10, 60, 300, 3600],
)
 
def record_conflict(
    conflict_type: str,
    resolution: str,
    entity_type: str,
    conflict_age_seconds: float,
) -> None:
    CONFLICTS_TOTAL.labels(
        conflict_type=conflict_type,
        resolution=resolution,
        entity_type=entity_type,
    ).inc()
    CONFLICT_AGE.observe(conflict_age_seconds)
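
Wiring this into the LWW merge from earlier might look like the sketch below; the label values and the "type:id" key format are assumptions, and it presumes the caller has already determined that the two writes are concurrent:

def merge_lww_instrumented(local: Record, remote: Record) -> Record:
    winner = merge_lww(local, remote)
    loser = local if winner is remote else remote
    record_conflict(
        conflict_type="concurrent_update",
        resolution=f"lww_{winner.region}",
        entity_type=winner.key.split(":")[0],
        conflict_age_seconds=abs((winner.updated_at - loser.updated_at).total_seconds()),
    )
    return winner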

Alert on:

  • Conflict rate >0.1% of writes (investigate routing).
  • Conflict age >60 seconds (replication lag degradation).
  • Any conflict on entity types that should never conflict (invariant violation).

Failover Drills: The Practice That Makes the Theory Real

A multi-region active-active architecture that has never been tested under failure conditions is a multi-region active-active architecture that will fail in an untested way during an actual incident.

Failover drills need to be scheduled, scripted, and run against production (or a production-equivalent staging environment). The drill cadence I recommend: one full regional failover drill per quarter, plus monthly smaller-scope drills (single availability zone, single service failover).

The standard drill script:

1. Pre-drill verification (15 min)
   - Confirm replication lag is <5s across all regions
   - Confirm all synthetic monitors are green
   - Confirm on-call engineers are staged
 
2. Region isolation (5 min)
   - Update load balancer rules to stop routing to target region
   - Do NOT stop the region — you need to observe its behavior under isolation
 
3. Observation window (20 min)
   - Monitor replication queue depth in surviving regions
   - Confirm traffic is correctly rerouting
   - Confirm data written during failover is landing in correct regions
   - Check GDPR routing constraints are being honored
 
4. Validation (10 min)
   - Run smoke test suite against surviving regions
   - Verify conflict queues are not growing unexpectedly
   - Check that region-pinned users are being correctly served
 
5. Region restoration (10 min)
   - Re-enable routing to isolated region
   - Monitor replication catchup — region should consume the queue
   - Confirm conflict rate does not spike above baseline during catchup
 
6. Post-drill (30 min)
   - Document all findings
   - Record Time-to-Detect, Time-to-Route (how long until traffic shifted)
   - Update runbooks with any gaps found

The metrics you track for each drill:

  • Time-to-Detect (TTD): How long from region isolation until alerts fired.
  • Time-to-Route (TTR): How long until traffic was rerouted to surviving regions.
  • Data written during failover: All writes made during the drill should be recoverable. Verify.
  • Conflict count during restoration: How many conflicts occurred as the isolated region reconnected.

A good TTR is under 60 seconds. If your TTR is 5-10 minutes, you have a routing automation problem that will cost you significantly during a real incident.
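
A small sketch for capturing these per-drill metrics, assuming the timestamps come from your alerting and routing logs (field names are illustrative):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class FailoverDrill:
    isolated_at: datetime          # routing to the target region was cut
    first_alert_at: datetime       # first page fired
    traffic_shifted_at: datetime   # surviving regions absorbing ~all traffic
    conflicts_during_restore: int

    @property
    def ttd_seconds(self) -> float:
        return (self.first_alert_at - self.isolated_at).total_seconds()

    @property
    def ttr_seconds(self) -> float:
        return (self.traffic_shifted_at - self.isolated_at).total_seconds()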

The Replication Lag Budget

Async replication means your regions are not in sync. The lag budget is the maximum acceptable replication lag before you alert and take action.

A practical lag budget framework:

Data Class               | Max Lag    | Action at Breach
------------------------ | ---------- | -------------------------------
User account credentials | 5 seconds  | Alert + investigate
Financial transactions   | 30 seconds | Alert + halt cross-region reads
User profile data        | 2 minutes  | Alert
Content/media            | 5 minutes  | Warn
Analytics events         | 15 minutes | Warn

Different data classes have different consistency requirements. Do not apply a single global lag budget; it will either be too tight for unimportant data (causing alert fatigue) or too loose for critical data (missing real degradation).
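
One way to make the budget explicit so alerting and application code agree on it; a minimal sketch whose class names and thresholds mirror the table above:

LAG_BUDGETS_SECONDS = {
    "user_credentials": 5,
    "financial_transactions": 30,
    "user_profile": 120,
    "content_media": 300,
    "analytics_events": 900,
}

def lag_within_budget(data_class: str, observed_lag_seconds: float) -> bool:
    """True if observed replication lag is within this data class's budget."""
    return observed_lag_seconds <= LAG_BUDGETS_SECONDS[data_class]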

Key Takeaways

  • Active-active is justified by write latency and write availability requirements; if your SLA can tolerate 5-30 minute write unavailability during failover, active-passive with fast failover is dramatically simpler.
  • LWW is appropriate for "latest state wins" semantics; CRDTs are necessary when concurrent updates to the same record must both be reflected (counters, sets, collaborative edits).
  • Region-pinning eliminates most conflicts by design — route all writes for a user to their home region and accept roaming write latency as the tradeoff.
  • GDPR residency constraints must be enforced at the routing layer, not the application layer; a failover routing bug is a compliance incident, not just an availability incident.
  • Conflict rate monitoring is a leading indicator — a rising conflict rate signals routing bugs or clock skew before it manifests as data corruption.
  • Failover drills run quarterly against production (or equivalent) with a scripted procedure and tracked TTD/TTR metrics are the difference between a documented failover posture and an untested assumption.