System Design Starts With Failure Modes, Not Boxes and Arrows
The first system design question is not “what are the services?” It is “what breaks, how fast does it spread, and what evidence tells us the damage is contained?”
Situation
Most architecture reviews still begin with boxes and arrows. A client calls an API. The API writes to a database. A queue absorbs bursts. A worker processes jobs. A cache makes reads fast. A load balancer spreads traffic.
That drawing is useful, but it is not a design. It is a routing diagram.
A production system is defined less by its happy path than by its behavior under pressure: partial dependency failure, retry storms, hot partitions, schema drift, stale caches, split ownership, noisy neighbors, slow rollbacks, and alerts that arrive after customers have already found the bug.
Cloud systems made this sharper. Teams can assemble infrastructure faster than they can reason about its failure behavior. Managed queues, serverless functions, multi-zone databases, service meshes, and global CDNs reduce operational work, but they also introduce new coupling. The diagram gets cleaner while the runtime gets more asynchronous, more distributed, and harder to inspect.
The senior engineering task is to reverse the order. Start with failure modes. Then choose boxes and arrows that make those failures survivable.
The Problem
A conventional system design interview or review tends to reward component fluency. It asks whether you know when to add a cache, queue, shard, replica, CDN, or read model. That produces architectures that look plausible on a whiteboard and fail in predictable ways in production.
The missing work is operational causality.
If the payment provider times out, do we retry synchronously and hold open user requests? If a worker crashes after charging a card but before updating the order, what record becomes the source of truth? If a cache serves stale authorization data, is the failure merely inconvenient or a security incident? If Kafka lag grows for thirty minutes, do we shed load, degrade features, or silently build an impossible recovery queue?
A box-and-arrow diagram rarely answers those questions because it describes intended communication, not bounded damage.
The core question is: what architecture would we choose if every dependency were assumed to fail partially, slowly, and repeatedly?
Failure-First Architecture
A failure-first design begins by naming the invariants that must survive disorder.
For an order system, the invariant may be: never mark an order paid unless payment is durably recorded. For a collaboration system: never lose accepted edits, even if presence and notifications lag. For a machine learning platform: never serve a model whose lineage, feature schema, and rollback target are unknown.
Once invariants are explicit, the architecture becomes a set of containment decisions.
flowchart TD
A[user request — intent enters system] --> B[command boundary — validate invariant]
B --> C[durable record — source of truth]
C --> D[event stream — asynchronous propagation]
D --> E[read model — optimized query state]
D --> F[side effect worker — external dependency]
F --> G[idempotency store — duplicate suppression]
E --> H[client response — observable state]
C --> I[audit log — recovery evidence]
What this diagram shows: A system design skeleton where the command boundary validates intent before writing a durable record. That record fans out to an event stream, which feeds the read model and side effect workers. The idempotency store prevents duplicate side effects on retry; the audit log provides the recovery evidence needed to reconstruct what happened. Every node is a potential failure boundary.
The important feature of this diagram is not that it has an event stream or a worker. The important feature is where the irreversible decision occurs. The command boundary validates the request. The durable record captures the accepted intent. Everything after that is propagation, projection, or side effect.
That separation changes failure behavior.
If the read model is stale, users may see old state, but the accepted command is not lost. If the worker retries, idempotency prevents duplicate external actions. If the event stream falls behind, operators have a measurable backlog and a replay path. If a deployment corrupts a projection, the durable record and audit log provide the evidence needed to rebuild.
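The duplicate-suppression step can be sketched in a few lines. This is a minimal illustration, not a production design: `IdempotencyStore`, `FakeProvider`, `handle_event`, and the event fields are hypothetical names, and the in-memory dictionary stands in for what would need to be a durable store.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotencyStore:
    """Hypothetical in-memory stand-in for a durable idempotency store."""
    results: dict = field(default_factory=dict)


@dataclass
class FakeProvider:
    """Hypothetical external dependency that counts how often it is called."""
    calls: int = 0

    def charge(self, order_id: str, amount_cents: int) -> str:
        self.calls += 1
        return f"receipt-{order_id}-{self.calls}"


def handle_event(event: dict, store: IdempotencyStore, provider: FakeProvider) -> str:
    """Perform the side effect at most once per idempotency key."""
    key = event["idempotency_key"]
    if key in store.results:
        # A retry or redelivery: return the recorded outcome, do not charge again.
        return store.results[key]
    receipt = provider.charge(event["order_id"], event["amount_cents"])
    store.results[key] = receipt
    return receipt
```

A crash between `charge` and the store write would still leave a gap in this sketch, which is why the key is usually also passed to the external provider when the provider supports it.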
The same reasoning applies to synchronous systems. A request path that depends on five services is not automatically wrong, but it must have explicit budgets. Each dependency needs a timeout, retry policy, fallback behavior, and owner. Otherwise the architecture has quietly converted a downstream brownout into an upstream outage.
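One way to make the budget explicit is to carry a single deadline through the request and derive each dependency timeout from what remains. The sketch below assumes a hypothetical `Deadline` helper; the class name, the `per_call_cap` parameter, and the injectable clock are illustrative choices, not a reference to any particular framework.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request has no time budget left for another call."""


class Deadline:
    """Tracks one overall time budget shared by every call in a request."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, per_call_cap: float) -> float:
        """Timeout for the next dependency call: its own cap or what is left."""
        remaining = self.remaining()
        if remaining <= 0.0:
            # Fail fast instead of starting work the caller can no longer use.
            raise DeadlineExceeded("request budget exhausted")
        return min(per_call_cap, remaining)
```

Each downstream call then uses `deadline.timeout_for(cap)` as its timeout, so a slow first dependency shrinks the time available to the rest instead of silently extending the request.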
Failure-first design asks four questions before adding any component:
- What invariant must remain true?
- What is the smallest durable fact we need to preserve?
- What work can be delayed, retried, or rebuilt?
- What signal proves the system is recovering?
Those questions prevent accidental complexity. They also prevent false simplicity. Sometimes the right answer is a queue. Sometimes it is a transaction. Sometimes it is a single database table with a status column and a carefully designed reconciliation job. The component is secondary. The failure contract is primary.
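The "status column plus reconciliation job" option might look like the following sketch. It uses Python's built-in `sqlite3` purely as a stand-in database; the `orders` schema, the `reconcile` function, and the `provider_lookup` callback are assumptions made for illustration.

```python
import sqlite3

ORDERS_SCHEMA = """
CREATE TABLE orders (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL,        -- 'pending', 'paid', or 'failed'
    updated_at INTEGER NOT NULL  -- seconds, supplied by the caller
)
"""


def reconcile(conn, provider_lookup, stale_after_seconds, now):
    """Resolve orders stuck in 'pending' by asking the provider what happened."""
    rows = conn.execute(
        "SELECT id FROM orders WHERE status = 'pending' AND updated_at <= ?",
        (now - stale_after_seconds,),
    ).fetchall()
    resolved = 0
    for (order_id,) in rows:
        outcome = provider_lookup(order_id)  # 'paid', 'failed', or None if unknown
        if outcome is None:
            continue  # leave it pending; the next run will ask again
        with conn:  # commit each resolution on its own
            cur = conn.execute(
                "UPDATE orders SET status = ?, updated_at = ? "
                "WHERE id = ? AND status = 'pending'",
                (outcome, now, order_id),
            )
            resolved += cur.rowcount
    return resolved
```

The `AND status = 'pending'` guard in the update keeps the job safe to run repeatedly or concurrently with the normal write path.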
In Practice
Context: Amazon’s public writing on retries, timeouts, backoff, and jitter in the Amazon Builders’ Library documents a recurring distributed systems problem: retries are selfish. They help one caller, but when many callers retry at the same time, they can amplify overload on the dependency.
Action: The documented pattern is to set timeouts deliberately, cap retries, use exponential backoff, add jitter, and design APIs to tolerate duplicate requests through idempotency. This is not a product-specific trick. It is a control mechanism for limiting retry synchronization and duplicate side effects.
Result: The operational result is not “the service never fails.” The result is narrower: dependency failure is less likely to become coordinated client pressure, and repeated calls are less likely to create repeated business actions.
Learning: A retry policy is architecture. If it is left to library defaults, the system has still made a decision; it has merely made it implicitly.
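As a rough sketch of that pattern, the following shows capped retries with exponential backoff and a jitter variant often called full jitter. `TransientError`, `backoff_delay`, `call_with_retries`, and the default values are hypothetical; real limits should come from the dependency's measured behavior.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: a uniform delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Call operation, retrying TransientError up to max_attempts total tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries are capped: surface the failure to the caller
            sleep(backoff_delay(attempt))
```

The randomized delay is what breaks up synchronized retries; the cap on attempts is what keeps a failing dependency from receiving unbounded extra load. The operation itself still has to be idempotent for the retries to be safe.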
Context: Google’s Site Reliability Engineering material describes error budgets as a way to connect reliability targets with release velocity. The documented pattern treats reliability as an explicit product constraint rather than an infinite aspiration.
Action: Teams define an acceptable level of unreliability, measure service behavior against that budget, and use budget burn to govern operational decisions. When a service consumes too much of its budget, the next architectural move may be slowing releases, reducing risky changes, or investing in reliability work.
Result: This reframes design tradeoffs. The question stops being “can we make this more reliable?” and becomes “which failure modes are spending the budget, and what change buys it back most directly?”
Learning: Reliability architecture needs an economic model. Without one, teams overbuild low-risk paths and underinvest in the failure modes that actually dominate user pain.
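The arithmetic behind a request-based error budget is small enough to sketch. The function name, the returned fields, and the example numbers below are illustrative assumptions, not figures from the SRE material.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize how much of a request-based error budget has been spent.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return {
        "allowed_failures": allowed,
        "consumed_fraction": consumed,
        "remaining_failures": allowed - failed_requests,
    }


# Example: a 99.9% target over one million requests allows roughly 1,000
# failures, so 250 failures spend about a quarter of the budget.
report = error_budget(0.999, 1_000_000, 250)
```

Breaking `failed_requests` down by failure mode is what turns this number into a design input: it shows which mode is spending the budget.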
Context: PostgreSQL’s transactional behavior provides a different lesson. A transaction gives atomicity inside the database boundary, but it does not automatically make external side effects atomic. Sending an email, charging a card, publishing a message, and committing a row are not one magical unit unless the design creates a durable coordination pattern.
Action: A common documented pattern is the transactional outbox: write business state and an outbound message record in the same database transaction, then have a relay publish the message. Consumers still need idempotency because delivery can repeat.
Result: The system trades immediate side effects for recoverable side effects. If the relay crashes, the outbox row remains. If the publish succeeds but acknowledgement fails, duplicate delivery is handled by the consumer contract.
Learning: Consistency is not a slogan. It is a boundary. Good architecture names where atomicity ends and recovery begins.
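A minimal sketch of the outbox pattern follows. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so the example stays self-contained; the table names, `place_order`, `relay_once`, and the `publish` callback are all hypothetical.

```python
import json
import sqlite3

OUTBOX_SCHEMA = """
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
"""


def place_order(conn, order_id):
    """Write business state and the outbound message in one transaction."""
    with conn:  # both rows commit together, or neither does
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay_once(conn, publish):
    """Publish pending outbox rows in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        publish(json.loads(payload))  # may raise; the row then stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the process dies after `publish` returns but before the row is marked, the next relay run sends the same message again. That is the at-least-once behavior the consumer's idempotency has to absorb.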
Where It Breaks
| Design choice | Failure it contains | New failure it introduces | Verification step |
|---|---|---|---|
| Synchronous service call | Delayed propagation (the caller sees the result immediately) | Cascading latency and dependency coupling | Enforce timeout budgets and trace critical paths |
| Queue between services | Absorbs bursts and dependency outages | Backlog growth and delayed user-visible state | Alert on age of oldest message, not only queue depth |
| Cache | Reduces read load and latency | Stale data and invalidation bugs | Define freshness bounds and test invalidation paths |
| Read replica | Protects primary from query load | Replica lag and inconsistent reads | Expose lag and route invariant-sensitive reads to primary |
| Event-driven projection | Rebuildable query state | Duplicate, missing, or reordered events | Use idempotent consumers and replay tests |
| Multi-region active-active | Regional survivability | Conflict resolution and operational complexity | Run failover drills and validate conflict policy |
The table matters because every resilience mechanism is also a liability. A queue does not remove failure; it changes immediate failure into delayed work. A cache does not make database pressure disappear for free; it trades that pressure for freshness risk. Multi-region deployment does not remove outages; it adds replication, routing, and conflict behavior that must be tested.
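The queue row's verification step, alerting on the age of the oldest message instead of depth alone, can be sketched as below. `backlog_health`, its parameters, and the threshold are hypothetical; a real broker would expose enqueue timestamps through its own metrics.

```python
def backlog_health(enqueued_at_times, now, max_age_seconds):
    """Report queue depth and the age of the oldest waiting message."""
    if not enqueued_at_times:
        return {"depth": 0, "oldest_age": 0.0, "breached": False}
    oldest_age = now - min(enqueued_at_times)
    return {
        "depth": len(enqueued_at_times),
        "oldest_age": oldest_age,
        "breached": oldest_age > max_age_seconds,
    }


# A deep queue that is draining quickly can be healthy...
busy = backlog_health([995.0 + i * 0.001 for i in range(5000)], now=1000.0, max_age_seconds=60)
# ...while a shallow queue with one stuck message is not.
stuck = backlog_health([100.0, 999.0], now=1000.0, max_age_seconds=60)
```

Depth measures how much work is waiting; age measures how long users have been waiting, which is usually the signal that maps to visible delay.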
Architecture maturity is the ability to say which failure you are choosing.
What to Do Next
- Problem: Your current diagram probably shows communication paths, not failure behavior. Re-read it as an outage map: mark every dependency that can be slow, stale, duplicated, unavailable, or inconsistent.
- Solution: Rewrite the design around invariants, durable facts, retry boundaries, idempotency keys, and recovery paths. Add components only when they make a named failure mode easier to contain.
- Proof: Test the failure contracts directly. Kill workers. Delay queues. Force dependency timeouts. Replay events. Corrupt a read model and rebuild it. Measure recovery using user-visible signals, not only infrastructure health.
- Action: In the next architecture review, start with three questions before showing the diagram: what must never happen, what will definitely fail, and how will we know the blast radius is contained?
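The "corrupt a read model and rebuild it" check from the list above can be exercised with a very small harness. The sketch assumes events carry a per-order sequence number; `apply_event`, `rebuild`, and the event shape are hypothetical.

```python
def apply_event(state, event):
    """Idempotent projection step keyed on a per-order sequence number."""
    order = state.setdefault(event["order_id"], {"status": "new", "last_seq": 0})
    if event["seq"] <= order["last_seq"]:
        return  # duplicate or already-applied event: ignore it
    order["status"] = event["status"]
    order["last_seq"] = event["seq"]


def rebuild(events):
    """Reconstruct the read model from the durable event log alone."""
    state = {}
    for event in events:
        apply_event(state, event)
    return state


log = [
    {"order_id": "o1", "seq": 1, "status": "placed"},
    {"order_id": "o1", "seq": 2, "status": "paid"},
    {"order_id": "o2", "seq": 1, "status": "placed"},
]

read_model = rebuild(log)
read_model["o1"]["status"] = "garbage"  # simulate a corrupted projection
read_model = rebuild(log)               # recovery path: replay from the log
```

Replaying the log with duplicated events should produce the same read model as a clean replay, and that equality is exactly the property a replay test asserts.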