Event-Driven Architecture Review: Schema Evolution, Ordering, Replay, and Dead Letters

Events do not make a system resilient by themselves; they move the failure boundary from synchronous calls into contracts, queues, consumers, and time.

Situation

Most teams adopt event-driven architecture for good reasons. Services can publish state changes without knowing every downstream consumer. Slow integrations can run asynchronously. New products can subscribe to existing facts instead of requesting new point-to-point APIs. Cloud platforms make the starting point deceptively simple: create a topic, emit JSON, add consumers, and scale workers horizontally.

The architecture works while event volume is small, schemas are stable, and consumers process messages in near real time. The real test arrives later. A producer changes a field. A consumer needs to rebuild a projection from last month. A payment event arrives before the account event it references. One malformed message is retried thousands of times and blocks useful work behind it.

At that point, the design question is no longer “Should we use events?” It is “What operational contract keeps event-driven systems recoverable when change, delay, and bad data are normal?”

The Problem

The common failure is treating an event bus as a transport layer instead of a durable integration boundary. Transport thinking asks whether a message can be delivered. Architecture thinking asks whether a message can be understood, ordered, replayed, ignored, repaired, or retired without corrupting downstream state.

Four failure modes dominate production reviews.

First, schema evolution breaks consumers silently. JSON makes it easy to add fields, rename fields, widen meanings, or change nullability without a compiler noticing. The producer deploys cleanly; the consumer fails later under traffic.

Second, ordering is often assumed globally but provided locally. Kafka, for example, provides ordering within a partition, not across an entire topic. If two events for the same aggregate land in different partitions, consumers can observe impossible histories.

Third, replay is confused with retry. Retry handles temporary failure. Replay rebuilds state from historical events. A consumer that is safe to retry once may not be safe to replay over six months of data.

Fourth, dead letters become a junk drawer. Teams add a dead letter queue after the first incident, but without classification, ownership, retention, and redrive rules, it becomes an unbounded evidence pile.

The core question: how should an event-driven system define contracts for schema evolution, ordering, replay, and dead letters before the first major recovery event?

The Operating Contract

A durable event architecture needs a control plane around the message flow. The broker moves events. The control plane governs whether those events are valid, how they are partitioned, how they are replayed, and what happens when they cannot be processed.

flowchart TD
    A[producer — domain event] --> B[schema gate — compatibility check]
    B --> C[event log — durable topic]
    C --> D[ordered partition — aggregate key]
    D --> E[consumer — idempotent handler]
    E --> F[projection — derived state]
    E --> G[dead letter queue — classified failure]
    C --> H[replay runner — bounded rebuild]
    H --> E
    G --> I[repair workflow — owner and redrive]
    I --> E

The first rule is that events are facts, not commands. “InvoiceIssued” is safer than “SendInvoiceEmail” because the latter encodes one consumer’s desired action. Facts age better because multiple consumers can interpret them independently.

The second rule is that every event has an envelope. The envelope should include event name, schema version, event id, aggregate id, producer, occurred time, published time, trace id, and idempotency key. The payload carries domain data. Consumers should be able to make routing, ordering, deduplication, and observability decisions from the envelope before parsing business fields.
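
As an illustration, the envelope could be modeled roughly as below. This is a minimal sketch, not a prescribed format: the class name `EventEnvelope`, the field names, and the choice to default the idempotency key to the event id are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
import uuid


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical envelope: metadata a consumer can use before parsing the payload."""
    event_name: str
    schema_version: int
    aggregate_id: str
    producer: str
    occurred_at: datetime
    payload: dict[str, Any]
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    published_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    idempotency_key: str = ""

    def __post_init__(self) -> None:
        # Assumption: when the producer has no business-level key,
        # the event id doubles as the idempotency key.
        if not self.idempotency_key:
            object.__setattr__(self, "idempotency_key", self.event_id)

    def partition_key(self) -> str:
        # Ordering decisions come from the envelope, not the payload.
        return self.aggregate_id


envelope = EventEnvelope(
    event_name="InvoiceIssued",
    schema_version=1,
    aggregate_id="account-42",
    producer="billing-service",
    occurred_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    payload={"invoice_id": "inv-1001", "amount": 120},
)
```

A consumer can route on `event_name`, deduplicate on `idempotency_key`, and order on `partition_key()` without touching `payload`.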

The third rule is schema compatibility at publication time. A schema registry or equivalent validation step should prevent incompatible producer changes from reaching the log. Backward-compatible changes include adding optional fields and preserving existing meanings. Breaking changes include renaming required fields, changing semantic meaning, or removing fields still consumed downstream.
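
A toy version of such a gate is sketched below. It is not the algorithm any real schema registry uses; the schema shape (a dict of field name to `type` and `required`) and the function `breaking_changes` are assumptions for illustration. The sketch also treats a newly required field as breaking, on the assumption that older events already in the log will not carry it.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return the reasons a new schema version would break existing contracts."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            # A rename looks identical to a removal from the consumer's side.
            problems.append(f"field removed or renamed: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
        elif new[name]["required"] and not spec["required"]:
            problems.append(f"field became required: {name}")
    for name, spec in new.items():
        # Assumption: historical events lack this field, so replay would fail.
        if name not in old and spec["required"]:
            problems.append(f"new required field: {name}")
    return problems


v1 = {
    "invoice_id": {"type": "string", "required": True},
    "amount": {"type": "number", "required": True},
}
# Adding an optional field is compatible.
v2_compatible = {**v1, "currency": {"type": "string", "required": False}}
# Renaming a required field is not.
v2_breaking = {
    "invoice_id": {"type": "string", "required": True},
    "total": {"type": "number", "required": True},
}
```

A check like this cannot detect a change in semantic meaning; that part of the contract still needs human review.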

The fourth rule is to partition by the thing that needs ordered history. If account lifecycle events must be processed in order, the partition key is account id. If order matters per shopping cart, use cart id. Do not partition by convenience fields such as region or event type unless those are the real ordering boundary.
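
The effect of key choice can be sketched with a simple hash partitioner. The SHA-256 hash here is a stand-in for whatever hash a real broker client applies, and `partition_for` and `route` are hypothetical helper names.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (illustrative hash choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events: list[dict], key_field: str, num_partitions: int) -> dict[int, list[dict]]:
    """Group events by partition, preserving publish order inside each partition."""
    partitions: dict[int, list[dict]] = {}
    for event in events:
        p = partition_for(event[key_field], num_partitions)
        partitions.setdefault(p, []).append(event)
    return partitions


events = [
    {"account_id": "a1", "seq": 1, "type": "AccountOpened"},
    {"account_id": "a2", "seq": 1, "type": "AccountOpened"},
    {"account_id": "a1", "seq": 2, "type": "PaymentReceived"},
]
by_account = route(events, "account_id", num_partitions=8)
```

Because every event for `a1` hashes to the same partition, a consumer of that partition sees `AccountOpened` before `PaymentReceived`. Routing the same events by `type` would offer no such guarantee.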

The fifth rule is that replay must be designed as a first-class operation. Replays need bounded windows, explicit target consumers, rate limits, idempotent writes, and visibility into side effects. A replay should rebuild projections or repair missed processing; it should not resend customer emails, re-charge cards, or call external systems unless explicitly operating in a side-effecting repair mode.
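
One way to make those constraints explicit is to require a replay request object before anything runs. The sketch below assumes an in-memory list as the event log and a handler that accepts a flag telling it whether side effects are allowed; `ReplayRequest` and `replay` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
import time
from typing import Callable, Iterable


@dataclass(frozen=True)
class ReplayRequest:
    consumer: str                     # explicit target consumer
    start: datetime                   # inclusive lower bound
    end: datetime                     # exclusive upper bound
    max_events_per_second: float      # rate limit to protect live traffic
    allow_side_effects: bool = False  # off unless in repair mode


def replay(
    log: Iterable[dict],
    request: ReplayRequest,
    handler: Callable[[dict, bool], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Feed events inside the window to the handler, throttled; return the count."""
    if request.end <= request.start:
        raise ValueError("replay window must be bounded and non-empty")
    if request.max_events_per_second <= 0:
        raise ValueError("rate limit must be positive")
    delay = 1.0 / request.max_events_per_second
    replayed = 0
    for event in log:
        if not (request.start <= event["occurred_at"] < request.end):
            continue
        # The handler decides what to skip when side effects are disabled.
        handler(event, request.allow_side_effects)
        replayed += 1
        sleep(delay)
    return replayed
```

Passing `sleep` as a parameter is a testing convenience; a production runner would more likely throttle through the broker client or a separate consumer group.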

The sixth rule is that dead letters need a taxonomy. A dead letter caused by invalid schema is different from one caused by missing reference data, timeout, permission failure, or a bug in consumer code. Each class needs an owner, alert threshold, retention period, and redrive policy.
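
A taxonomy like this can live in code as a small lookup table. Everything specific below is an assumption for illustration: the team names, thresholds, retention periods, and the mapping from Python exception types to categories would differ in any real system.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    INVALID_SCHEMA = "invalid_schema"
    MISSING_REFERENCE = "missing_reference"
    TIMEOUT = "timeout"
    PERMISSION = "permission"
    CONSUMER_BUG = "consumer_bug"


@dataclass(frozen=True)
class DeadLetterPolicy:
    owner: str            # team that triages this class
    alert_threshold: int  # messages per hour before paging
    retention_days: int
    redrive: str          # "auto", "manual", or "never"


# Hypothetical policy values; each category must have exactly one entry.
POLICIES = {
    FailureCategory.INVALID_SCHEMA: DeadLetterPolicy("producer-team", 1, 30, "never"),
    FailureCategory.MISSING_REFERENCE: DeadLetterPolicy("consumer-team", 50, 14, "auto"),
    FailureCategory.TIMEOUT: DeadLetterPolicy("consumer-team", 100, 7, "auto"),
    FailureCategory.PERMISSION: DeadLetterPolicy("platform-team", 1, 14, "manual"),
    FailureCategory.CONSUMER_BUG: DeadLetterPolicy("consumer-team", 10, 30, "manual"),
}


def classify(error: Exception) -> FailureCategory:
    """Map an exception to a category (illustrative mapping, most specific first)."""
    if isinstance(error, TimeoutError):
        return FailureCategory.TIMEOUT
    if isinstance(error, PermissionError):
        return FailureCategory.PERMISSION
    if isinstance(error, LookupError):  # e.g. KeyError for a missing account
        return FailureCategory.MISSING_REFERENCE
    if isinstance(error, ValueError):   # e.g. json.JSONDecodeError
        return FailureCategory.INVALID_SCHEMA
    return FailureCategory.CONSUMER_BUG
```

The useful property is that an unclassified failure falls into a category with a manual redrive policy rather than being retried blindly.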

In Practice

Context

The documented pattern across mature event systems is that guarantees are scoped. Apache Kafka documents ordering at the partition level, which means application designers must choose keys that align with the ordering domain. Confluent Schema Registry documents compatibility modes such as backward, forward, and full compatibility, making schema evolution a governance choice rather than an informal convention. Amazon SQS documents dead letter queues as a way to isolate messages that cannot be processed successfully after repeated receives.

The point is not to compare products but to draw one operating lesson from all three: brokers provide primitives, not complete recovery semantics.

Action

A practical review should start with a contract matrix for each event family.

For schema evolution, define the schema owner, compatibility mode, versioning policy, and consumer migration window. Require compatibility checks in CI and again at publish boundaries for high-risk producers.

For ordering, document the aggregate that requires ordered processing and prove the partition key matches it. If workflows require cross-aggregate ordering, make that dependency explicit and consider a coordinator, saga, or database transaction instead of pretending the event bus gives global order.

For replay, separate consumer code paths into pure projection updates and side-effecting actions. Projection handlers should be idempotent and replayable. Side-effecting handlers should persist a decision record before acting and should deduplicate by event id or business idempotency key.
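
The split between the two code paths might look like the sketch below. In-memory dicts and sets stand in for durable stores, and the class names `BalanceProjection` and `InvoiceNotifier` are hypothetical. In a real system the projection write and the applied-event record would need to commit together, and the decision record would have to survive a crash.

```python
from typing import Callable


class BalanceProjection:
    """Pure projection: safe to retry and replay because each event applies once."""

    def __init__(self) -> None:
        self.balances: dict[str, int] = {}
        self.applied: set[str] = set()

    def apply(self, event: dict) -> bool:
        if event["event_id"] in self.applied:
            return False  # already applied; replay is a no-op
        account = event["account_id"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.applied.add(event["event_id"])
        return True


class InvoiceNotifier:
    """Side-effecting handler: records the decision before acting."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send
        self.decisions: dict[str, str] = {}  # idempotency key -> status

    def handle(self, event: dict, side_effects_enabled: bool = True) -> str:
        key = event["idempotency_key"]
        if key in self.decisions:
            return "duplicate"
        if not side_effects_enabled:
            return "suppressed"  # e.g. during a projection-only replay
        self.decisions[key] = "pending"  # decision recorded before the call
        self._send(event)
        self.decisions[key] = "sent"
        return "sent"
```

A record left in `pending` after a crash signals an action whose outcome is unknown, which is a case for the repair workflow rather than an automatic retry.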

For dead letters, require structured failure metadata: exception class, consumer version, event id, schema version, retry count, first failure time, last failure time, and failure category. A dead letter queue without enough metadata is not recoverability; it is delayed debugging.
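
The metadata listed above could be captured in a record like the following. This is a sketch under assumptions: `DeadLetterRecord` and `record_failure` are hypothetical names, the category is passed as a plain string, and `retry_count` counts failed attempts including the first.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DeadLetterRecord:
    event_id: str
    schema_version: int
    consumer_version: str
    exception_class: str
    failure_category: str
    retry_count: int
    first_failure_at: datetime
    last_failure_at: datetime


def record_failure(
    existing: Optional[DeadLetterRecord],
    event: dict,
    error: Exception,
    consumer_version: str,
    failure_category: str,
    now: datetime,
) -> DeadLetterRecord:
    """Create the record on first failure, otherwise update the latest attempt."""
    if existing is None:
        return DeadLetterRecord(
            event_id=event["event_id"],
            schema_version=event["schema_version"],
            consumer_version=consumer_version,
            exception_class=type(error).__name__,
            failure_category=failure_category,
            retry_count=1,
            first_failure_at=now,
            last_failure_at=now,
        )
    # Keep the first failure time; overwrite the fields describing the latest attempt.
    return replace(
        existing,
        consumer_version=consumer_version,
        exception_class=type(error).__name__,
        failure_category=failure_category,
        retry_count=existing.retry_count + 1,
        last_failure_at=now,
    )
```

Keeping `first_failure_at` separate from `last_failure_at` lets an operator tell a message that failed once an hour ago from one that has been failing for a week.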

Result

The result is not that failures disappear. The result is that the blast radius of each failure becomes bounded.

A schema-breaking producer deployment is stopped before publication or isolated to a known version transition. A hot aggregate can still create pressure on one partition, but the ordering rule is visible and intentional. A replay can rebuild a search index without accidentally triggering external side effects. A dead letter spike can be routed to the owning team with enough context to decide whether to redrive, patch, suppress, or migrate.

Learning

The lesson is that event-driven architecture is less about decoupling services than decoupling failure handling. Producers and consumers are only truly decoupled when each side can evolve, pause, replay, and recover without asking the other side to guess what happened.

Where It Breaks

Failure mode	Why it happens	Architectural response
Schema drift	Producers change payloads faster than consumers migrate	Enforce compatibility checks and publish versioned event contracts
False ordering assumptions	Teams assume topic order means business order	Partition by aggregate id and document the ordering boundary
Replay creates duplicate effects	Consumers mix projection writes with external actions	Make handlers idempotent and isolate side effects behind decision records
Dead letters accumulate forever	Messages are isolated but not owned	Classify failures, assign owners, set retention, and define redrive rules
Backfills overwhelm live traffic	Replay competes with production processing	Use bounded replay windows, throttling, and separate consumer groups
Event meanings decay	Old names no longer match business behavior	Treat event semantics as public APIs and deprecate intentionally

What to Do Next

Problem: Your event bus may deliver messages reliably while your system still cannot recover reliably.
Solution: Define an operating contract for schema evolution, ordering, replay, and dead letters around every critical event family.
Proof: Use broker-documented guarantees as constraints: Kafka ordering is partition-scoped, schema compatibility must be enforced deliberately, and dead letter queues only help when failures are classified and owned.
Action: Pick one production event flow and review four artifacts this week: schema compatibility rules, partition key choice, replay procedure, and dead letter ownership.

Rajiv
