
Logs, Metrics, Traces — and the Fourth Pillar Nobody Names

Ravinder · 8 min read
Engineering · Observability · Events · Telemetry

A customer opens a support ticket: "My order was charged twice." Your on-call engineer opens Grafana. The metrics show a spike in payment API calls at 14:23 UTC. The traces show two successful charge requests with different trace IDs. The logs show... noise. Structured JSON from five services, all timestamped correctly, none of which answers the question: why did the checkout service submit two charge requests for order ord_9f2c8a?

You eventually piece it together from three different log streams, a Slack message from a deploy notification, and one engineer who remembers seeing this before. It takes two hours. The answer was a retry bug triggered by a specific network timeout on the payment gateway. The information to diagnose it existed. But it wasn't observable — it was reconstructable, which is a different, worse thing.

This is the gap that domain events fill. Not logs. Not metrics. Not traces. Events.

What the Three Pillars Actually Give You

The observability canon is well-established. Logs give you verbose, timestamped text records of what happened inside a process. Metrics give you aggregated numeric measurements over time. Traces give you the causal chain of operations across service boundaries. Each is essential. None of them, individually or together, answers "what business-significant thing happened, to whom, why, and what was the exact state when it happened?"

graph TD
    L[Logs] -->|Answers| L1["What did this process do?\nAt what time?\nWith what error?"]
    M[Metrics] -->|Answers| M1["How many? How fast?\nWhat's trending?\nIs the SLO broken?"]
    T[Traces] -->|Answers| T1["Which services were involved?\nWhere was the latency?\nWhat called what?"]
    E[Events] -->|Answers| E1["What business action occurred?\nTo which entity?\nWith what state before/after?\nWho caused it?"]
    style L fill:#2c5282,color:#fff
    style M fill:#2c5282,color:#fff
    style T fill:#2c5282,color:#fff
    style E fill:#7b341e,color:#fff

The double-charge scenario is clear in this framing. Metrics told us how many charge calls were made. Traces told us which services made them. Logs told us what the service logged. None of that tells us that a specific user's checkout session transitioned from AWAITING_PAYMENT to PAYMENT_SUBMITTED twice — which is the single fact that makes the bug obvious.

What a Domain Event Actually Is

A domain event is a record that something meaningful happened in your domain — not in your infrastructure. It captures the intent, the actor, the affected entity, and the state transition. It is not a log message. It is not a metric increment. It is a structured, immutable fact.

The shape matters. A weak event:

{
  "timestamp": "2025-11-17T14:23:11.842Z",
  "event": "payment_submitted",
  "order_id": "ord_9f2c8a",
  "amount": 4999
}

A strong event:

{
  "event_id": "evt_4a8f2b91-c3d7-4e12-9f0a-1b2c3d4e5f6a",
  "event_type": "order.payment.submitted",
  "event_version": "1",
  "occurred_at": "2025-11-17T14:23:11.842Z",
  "recorded_at": "2025-11-17T14:23:11.901Z",
  "actor": {
    "type": "user",
    "id": "usr_77281",
    "session_id": "sess_f9a2c8"
  },
  "entity": {
    "type": "order",
    "id": "ord_9f2c8a",
    "version": 3
  },
  "context": {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "correlation_id": "checkout_session_2cf81a",
    "environment": "production",
    "service": "checkout-service",
    "service_version": "v2.14.3"
  },
  "payload": {
    "before": {
      "status": "AWAITING_PAYMENT",
      "payment_attempts": 0
    },
    "after": {
      "status": "PAYMENT_SUBMITTED",
      "payment_attempts": 1
    },
    "gateway": "stripe",
    "amount_cents": 4999,
    "currency": "USD",
    "idempotency_key": "idem_checkout_2cf81a_attempt_1"
  }
}

The strong event is self-describing. You can reconstruct the sequence of events for order ord_9f2c8a without touching any other data source. The before/after state makes the double-submit immediately visible: you'd see two events with before.payment_attempts of 0 and 1 respectively — both with after.status: PAYMENT_SUBMITTED.

The Event Schema Contract

Domain events, like API contracts, need versioning and schema enforcement. An event that changes shape silently is a breaking change to every consumer — your analytics pipeline, your audit log, your SIEM, your support tooling.

from pydantic import BaseModel, Field
from datetime import datetime, timezone
from typing import Any, Dict, Literal
import uuid

class EventActor(BaseModel):
    type: Literal["user", "service", "system"]
    id: str
    session_id: str | None = None

class EventEntity(BaseModel):
    type: str
    id: str
    version: int | None = None

class EventContext(BaseModel):
    trace_id: str
    correlation_id: str | None = None
    environment: str
    service: str
    service_version: str

class DomainEvent(BaseModel):
    event_id: str = Field(default_factory=lambda: f"evt_{uuid.uuid4()}")
    event_type: str  # e.g. "order.payment.submitted"
    event_version: str = "1"
    occurred_at: datetime
    # datetime.utcnow() is deprecated; use an explicit timezone-aware timestamp
    recorded_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    actor: EventActor
    entity: EventEntity
    context: EventContext
    payload: Dict[str, Any]

    class Config:
        # Aware datetimes serialize as "+00:00"; swap for the conventional "Z"
        json_encoders = {datetime: lambda v: v.isoformat().replace("+00:00", "Z")}

Version the schema in your schema registry. We use Confluent Schema Registry for Kafka-backed event streams. The contract is enforced at publish time: if your event doesn't match the registered schema for order.payment.submitted@v1, the publish fails. This is the same discipline you'd apply to a REST API — treat an event type as a versioned API contract.
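The enforcement step can be sketched in-process. This is an assumption-laden sketch: a plain dict stands in for the registry, and `PaymentSubmittedV1` is a hypothetical per-type model; a real setup would talk to the Confluent Schema Registry from the producer.

```python
from pydantic import BaseModel, ValidationError

# Sketch: an in-process dict stands in for a real schema registry.
# PaymentSubmittedV1 is a hypothetical per-type schema model.
class PaymentSubmittedV1(BaseModel):
    event_id: str
    event_type: str
    event_version: str
    payload: dict

REGISTRY: dict[str, type[BaseModel]] = {
    "order.payment.submitted@1": PaymentSubmittedV1,
}

def publish(event: dict) -> None:
    """Validate against the registered schema; refuse the publish on mismatch."""
    key = f"{event.get('event_type')}@{event.get('event_version')}"
    schema = REGISTRY.get(key)
    if schema is None:
        raise ValueError(f"no registered schema for {key}")
    try:
        schema(**event)  # constructing the model runs validation
    except ValidationError as exc:
        raise ValueError(f"event rejected at publish time: {exc}") from exc
    # ...hand the validated event to the transport here
```

The failure mode is the point: a malformed event dies loudly in the producer, not silently in a consumer three teams away.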

Emitting Events Without Coupling Your Domain to Infrastructure

A common mistake is coupling event emission to the event transport. If your domain code imports Kafka producers or HTTP clients to emit events, you've created a dependency that makes your domain logic hard to test and hard to change.

Use the outbox pattern: write the event to a local table in the same transaction as your domain write. A separate process reads the outbox and publishes to your event bus.

class OrderService:
    def __init__(self, db: Database, event_store: OutboxWriter):
        self.db = db
        self.event_store = event_store
 
    def submit_payment(self, order_id: str, actor: Actor) -> Order:
        with self.db.transaction() as tx:
            order = tx.get_order_for_update(order_id)
 
            if order.status != OrderStatus.AWAITING_PAYMENT:
                raise InvalidStateTransition(
                    f"Cannot submit payment: order is {order.status}"
                )
 
            order.status = OrderStatus.PAYMENT_SUBMITTED
            order.payment_attempts += 1
            tx.save(order)
 
            # Written in the same transaction — atomicity guaranteed
            self.event_store.write(tx, DomainEvent(
                event_type="order.payment.submitted",
                occurred_at=datetime.now(timezone.utc),
                actor=EventActor(type="user", id=actor.id, session_id=actor.session_id),
                entity=EventEntity(type="order", id=order.id, version=order.version),
                context=build_context(),
                payload={
                    "before": {"status": "AWAITING_PAYMENT", "payment_attempts": order.payment_attempts - 1},
                    "after": {"status": "PAYMENT_SUBMITTED", "payment_attempts": order.payment_attempts},
                    "gateway": order.payment_gateway,
                    "amount_cents": order.total_cents,
                    "idempotency_key": f"idem_checkout_{actor.session_id}_attempt_{order.payment_attempts}",
                }
            ))
 
        return order

The outbox writer never makes a network call. The database transaction is the only I/O. A separate relay process polls the outbox and publishes to Kafka. If Kafka is down, orders still process; events catch up when Kafka recovers.
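A minimal relay sketch, assuming hypothetical `fetch_unpublished`/`mark_published` adapters over the outbox table and a `bus.publish` wrapper around your Kafka producer:

```python
import time

def relay_once(db, bus, batch_size: int = 100) -> int:
    """Drain one batch from the outbox in insertion order; return events shipped."""
    rows = db.fetch_unpublished(limit=batch_size)
    for row in rows:
        # Publish first, mark second: a crash between the two steps re-sends
        # the event on restart. Delivery is at-least-once, so consumers
        # should deduplicate on event_id.
        bus.publish(topic=row["event_type"], key=row["entity_id"], value=row["body"])
        db.mark_published(row["outbox_id"])
    return len(rows)

def run_relay(db, bus, poll_interval_s: float = 0.5) -> None:
    """Daemon loop: poll until the outbox is empty, then back off briefly."""
    while True:
        if relay_once(db, bus) == 0:
            time.sleep(poll_interval_s)
```

Publishing before marking is the deliberate choice here: duplicates are recoverable (consumers dedupe on event_id), but an event marked published and never sent is lost.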

Querying Events: The Patterns That Matter

The power of structured events is in the queries. Here are three patterns we use constantly.

Entity timeline reconstruction: Given an entity ID, reconstruct every state transition in order.

SELECT event_type, occurred_at, payload->'before' AS before, payload->'after' AS after, actor
FROM domain_events
WHERE entity->>'id' = 'ord_9f2c8a'
ORDER BY occurred_at ASC;

Funnel analysis: How many orders that reached PAYMENT_SUBMITTED also reached FULFILLED?

WITH payment_submitted AS (
    SELECT entity->>'id' AS order_id, occurred_at AS submitted_at
    FROM domain_events
    WHERE event_type = 'order.payment.submitted'
      AND occurred_at > now() - interval '7 days'
),
fulfilled AS (
    SELECT entity->>'id' AS order_id, occurred_at AS fulfilled_at
    FROM domain_events
    WHERE event_type = 'order.fulfilled'
      AND occurred_at > now() - interval '7 days'
)
SELECT
    count(ps.order_id) AS submitted,
    count(f.order_id) AS fulfilled,
    round(count(f.order_id)::numeric / nullif(count(ps.order_id), 0) * 100, 2) AS conversion_pct
FROM payment_submitted ps
LEFT JOIN fulfilled f ON ps.order_id = f.order_id;

Anomaly detection: Find orders that emitted order.payment.submitted more than once.

SELECT entity->>'id' AS order_id, count(*) AS submission_count
FROM domain_events
WHERE event_type = 'order.payment.submitted'
GROUP BY entity->>'id'
HAVING count(*) > 1
ORDER BY submission_count DESC;

That last query would have found the double-charge bug in seconds. No log parsing. No trace correlation across services. One SQL query against a structured event store.

Events vs Logs: When to Use Each

This is not an either/or argument. Logs and events are complementary.

Use logs for: process-level debugging, internal state during complex computations, infrastructure-level events (connection pool exhausted, cache miss rate high). Logs are cheap to emit and should be verbose in debug mode.

Use events for: business-significant state transitions, anything an auditor would care about, anything your analytics team queries, anything that crosses a domain boundary. Events are intentional and curated; they represent facts about your domain, not internal implementation details.

The tell: if a non-engineer could read the event and understand what happened, it belongs in your event stream. If it requires infrastructure context to make sense, it belongs in your logs.

Key Takeaways

  • Logs, metrics, and traces answer infrastructure questions. Domain events answer business questions: what happened, to whom, with what state, caused by whom. You need both.
  • A strong event schema includes: a stable event ID, the actor, the entity and its version, before/after state in the payload, a correlation ID, and a trace ID linking to your distributed trace.
  • Use the outbox pattern to emit events transactionally with your domain writes. Never make network calls inside your domain transaction; let a relay process handle the publish.
  • Version your event schemas in a schema registry and enforce them at publish time. An unversioned event type is a breaking change waiting to happen.
  • Structured events unlock query patterns — entity timelines, funnel analysis, anomaly detection — that log parsing cannot match in speed or clarity.
  • The tell for events vs logs: if a non-engineer can read it and understand a business fact, it is an event. If it requires infrastructure context, it is a log line.