The Staff Engineer's System Design Review: Questions That Expose Real Risk
Most system design reviews fail because they admire the proposed architecture instead of attacking the failure path.
Situation
Cloud systems have made it easy to assemble impressive diagrams: managed queues, autoscaling fleets, serverless workers, global databases, feature flags, caches, and observability stacks. The proposal often looks mature before anyone has proven the system can survive production.
A Staff Engineer’s job in design review is not to ask whether the boxes are modern. It is to find the part of the system where a normal fault becomes an operational incident. That usually means pushing past happy-path throughput and asking about recovery, ownership, overload, deletion, replay, migration, and rollback.
The review should change the design before production changes the outage report.
The Problem
Most reviews over-index on steady-state architecture. They ask whether the system can handle 10,000 requests per second, but not what happens when one dependency takes 800 milliseconds longer for twenty minutes. They ask whether events are durable, but not whether the queue can drain after consumers are down for six hours. They ask whether the service is observable, but not whether the alerts distinguish customer impact from internal noise.
The dangerous designs are rarely obviously bad. They are plausible. They use standard components. They pass load tests. They are presented by capable engineers. The risk is hidden in coupling: retries that multiply load, queues that preserve every mistake, caches that turn misses into database storms, migrations that require perfect sequencing, and fallbacks that silently corrupt business meaning.
The core question is not “does this architecture work?” It is: what exact condition makes this architecture stop recovering on its own?
Risk-Led Design Review
A useful review turns broad confidence into specific risk inventory. The Staff Engineer should force the design through five gates: demand, dependency, state, change, and recovery.
```mermaid
flowchart TD
    A[proposal — stated goal] --> B[demand review — load shape]
    B --> C[dependency review — failure budget]
    C --> D[state review — ownership and replay]
    D --> E[change review — migration and rollback]
    E --> F[recovery review — drain and repair]
    F --> G[decision — accept defer or redesign]
    B --> H[question — what spikes first]
    C --> I[question — what waits and retries]
    D --> J[question — what is source of truth]
    E --> K[question — what must be reversible]
    F --> L[question — how does it heal]
```
The demand gate asks how traffic arrives, not just how much arrives. Bursty writes, fan-out reads, scheduled jobs, batch imports, and retry storms create different pressure. Averages hide the incident.
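A minimal sketch of why the average misleads, assuming arrivals are available as integer second timestamps. The function name `load_shape` and the trace are hypothetical, chosen only to show how a single burst disappears into the mean.

```python
from collections import Counter


def load_shape(arrival_seconds):
    """Return (mean_rps, peak_rps) for a list of integer arrival timestamps."""
    if not arrival_seconds:
        return 0.0, 0
    per_second = Counter(arrival_seconds)
    # Window covers the first to the last observed second, inclusive.
    window = max(arrival_seconds) - min(arrival_seconds) + 1
    mean_rps = len(arrival_seconds) / window
    peak_rps = max(per_second.values())
    return mean_rps, peak_rps


# Hypothetical trace: one request per second for 60 seconds,
# plus a burst of 540 extra requests landing in second 30.
trace = list(range(60)) + [30] * 540
mean_rps, peak_rps = load_shape(trace)
print(f"mean={mean_rps:.1f} rps, peak={peak_rps} rps")
```

For this trace the mean is 10 requests per second while the busiest second carries 541. A capacity plan sized to the mean would miss the burst entirely.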
The dependency gate asks what happens when a required service is slow, wrong, or unavailable. Timeouts, retries, concurrency caps, circuit breakers, and fallback behavior should be reviewed as first-class design elements, not library defaults.
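One way to make retries a reviewed design element is a shared retry budget. The sketch below is illustrative only: `RetryBudget`, `call_with_budget`, the 10% ratio, and the choice to retry only on `ConnectionError` are assumptions, not a specific library's API. It also assumes `max_attempts` is at least 1.

```python
class RetryBudget:
    """Permit retries only up to a fixed percentage of requests seen so far."""

    def __init__(self, percent=10):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend_retry(self):
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 <= self.requests * self.percent:
            self.retries += 1
            return True
        return False


def call_with_budget(operation, budget, max_attempts=3):
    """Run operation; retry on ConnectionError only while the budget allows."""
    budget.record_request()
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0 and not budget.try_spend_retry():
            break  # budget exhausted: fail fast instead of adding load
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

Under a total outage, ten requests against a 10% budget produce 11 downstream calls, where three unconditional attempts per request would produce 30. The budget is shared across callers in one process here; a real design would also need to decide whether it is shared across the fleet.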
The state gate asks where truth lives and how it moves. If there are multiple stores, the review must identify which one wins during conflict, replay, duplication, and partial failure. If there is an event stream, the design must explain idempotency and poison-message handling.
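A rough sketch of what "explain idempotency and poison-message handling" can look like for an at-least-once stream. Everything here is hypothetical: the `consume` function, the in-memory `processed` set standing in for a durable deduplication store, the use of `ValueError` to signal an unprocessable payload, and the limit of three deliveries.

```python
from collections import deque


def consume(messages, handler, max_deliveries=3):
    """Process (message_id, payload) pairs; return (applied_ids, dead_letter_ids)."""
    processed = set()    # stand-in for a durable dedup store
    quarantined = set()  # ids already moved to the dead-letter list
    failures = {}
    applied, dead_letter = [], []
    pending = deque(messages)
    while pending:
        message_id, payload = pending.popleft()
        if message_id in processed or message_id in quarantined:
            continue  # duplicate delivery: do not repeat the side effect
        try:
            handler(payload)
        except ValueError:
            failures[message_id] = failures.get(message_id, 0) + 1
            if failures[message_id] >= max_deliveries:
                quarantined.add(message_id)
                dead_letter.append(message_id)  # stop blocking the stream
            else:
                pending.append((message_id, payload))  # redeliver later
            continue
        processed.add(message_id)
        applied.append(message_id)
    return applied, dead_letter
```

The review question this makes concrete is where the deduplication record lives and whether it is written atomically with the side effect. In this sketch it is not, which is exactly the gap a reviewer should probe.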
The change gate asks how the system evolves. Schema changes, backfills, feature launches, model swaps, and regional migrations are failure modes. A design that cannot be safely changed is unfinished.
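As one illustration of a change that stays reversible, the sketch below assumes a JSON message that gains an optional `currency` field. The readers, field names, and the `"USD"` default are hypothetical. The point is only that the old reader tolerates the new message and the new reader tolerates the old one, so either version can be deployed during the rollback window.

```python
import json


def read_order_v1(raw):
    """Old reader: takes only the fields it knows and ignores the rest."""
    data = json.loads(raw)
    return {"order_id": data["order_id"], "amount_cents": data["amount_cents"]}


def read_order_v2(raw):
    """New reader: treats the added field as optional with a default."""
    data = json.loads(raw)
    return {
        "order_id": data["order_id"],
        "amount_cents": data["amount_cents"],
        # Assumed default for messages written before the field existed.
        "currency": data.get("currency", "USD"),
    }
```

A design that instead renamed or removed a required field in one step would fail this check: rolling back the reader would leave it unable to parse messages already written by the new version.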
The recovery gate asks how operators know the system is recovering. The review should require concrete drain metrics, repair tools, runbooks, and rollback triggers. “We will monitor it” is not a recovery plan.
In Practice
Context: Google’s SRE guidance on cascading failures documents a common pattern: overload on one part of a serving system can shift work elsewhere, making the remaining replicas more likely to fail. It also calls out retries, load shifting, health checks, and cache behavior as mechanisms that can unintentionally amplify failure when a system is already stressed. See Google SRE, Addressing Cascading Failures.
Action: In a design review, this becomes a concrete question set: What is the maximum retry fan-out per original request? Are retries budgeted globally or configured per client? Do health checks remove capacity faster than replacement capacity appears? Are cache misses more expensive than cache hits, and can the database survive a cold-cache event?
Result: The design treats overload as a state to control, not a surprise to observe. The architecture should include retry budgets, bounded concurrency, load shedding, and degraded responses where correctness permits them.
Learning: A dependency failure is not isolated if every caller reacts by increasing pressure.
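The retry fan-out question has a simple worst-case answer when retries are configured per client at each layer. The sketch below assumes every layer in a call chain independently makes up to a fixed number of attempts and that every attempt fails through to the layer beneath it. The function name is hypothetical.

```python
import math


def worst_case_attempts(attempts_per_layer):
    """Worst-case calls reaching the deepest dependency per original request.

    attempts_per_layer lists the maximum attempts (first try plus retries)
    made by each layer in the call chain, from the edge inward.
    """
    return math.prod(attempts_per_layer)


# Three layers, each allowed three attempts.
print(worst_case_attempts([3, 3, 3]))  # 27
```

Three layers with three attempts each can turn one user request into 27 calls on the struggling dependency, which is why a per-client setting that looks modest in isolation still needs a global cap.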
Context: Amazon’s Builders’ Library describes queue backlog as a recovery problem, not merely a durability problem. In Avoiding insurmountable queue backlogs, the documented pattern is that overload or downstream failure can create a backlog that a service cannot drain in a reasonable time after the original fault is fixed.
Action: In review, ask for the oldest-message-age metric, not just queue depth. Ask what work should expire, what work should be prioritized, and what work can be dropped or compacted. Ask whether replay produces duplicate side effects. Ask how many consumers are needed to drain six hours of backlog in one hour, and whether the downstream systems can absorb that drain rate.
Result: The design becomes explicit about recovery objectives. Durable queues stop being treated as a universal safety net. They become controlled buffers with aging, prioritization, idempotency, and drain plans.
Learning: A queue can preserve availability during a short fault and still convert a long fault into delayed customer impact.
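The drain question is arithmetic, and it is worth doing in the review. The sketch below assumes consumers were fully stopped during the outage, arrivals continue at the same rate during the drain, and each consumer has a fixed throughput. The rates used are hypothetical.

```python
import math


def drain_plan(arrival_rate, outage_hours, drain_hours, per_consumer_rate):
    """Return (backlog_messages, required_rate, consumers_needed).

    Rates are in messages per second. New arrivals keep coming while the
    backlog drains, so the required rate is backlog drain plus live traffic.
    """
    backlog = arrival_rate * outage_hours * 3600
    required_rate = backlog / (drain_hours * 3600) + arrival_rate
    consumers = math.ceil(required_rate / per_consumer_rate)
    return backlog, required_rate, consumers


# Hypothetical: 500 msg/s arriving, 50 msg/s per consumer,
# six hours of backlog to clear in one hour.
backlog, required_rate, consumers = drain_plan(500, 6, 1, 50)
print(backlog, required_rate, consumers)
```

With these numbers the backlog is 10.8 million messages, the drain needs 3,500 messages per second, and that takes 70 consumers against a steady-state fleet of 10. The downstream systems must absorb seven times normal load for an hour, which is usually the real constraint.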
Context: Netflix’s Hystrix project documented thread and semaphore isolation, circuit breaking, and fallback behavior for distributed service calls. The public project describes Hystrix as a latency and fault tolerance library intended to isolate remote dependency access and stop cascading failure in distributed systems. See Netflix Hystrix.
Action: In review, ask which dependency calls are isolated from each other. If a recommendation service stalls, can checkout still complete? If an analytics write blocks, can the user request finish? If the circuit opens, what does the caller return, and is that response safe for the business workflow?
Result: The architecture separates critical path from optional enrichment. It also makes fallback semantics visible. A fallback is not automatically safe; returning stale prices, stale permissions, or stale inventory can be worse than failing closed.
Learning: Isolation only reduces risk when the fallback preserves the product’s correctness contract.
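A simplified sketch of making fallback semantics explicit at the call site. This is not Hystrix's API: `CircuitBreaker`, `DependencyUnavailable`, and the thresholds are hypothetical, and the breaker is single-threaded. The design choice it illustrates is that a caller must opt in to degradation by supplying a fallback; with none supplied, the call fails closed.

```python
import time


class DependencyUnavailable(Exception):
    """Raised when a call fails and no safe fallback was declared."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation, fallback=None):
        # While open, skip the dependency entirely until reset_after elapses.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return self._degrade(fallback)
        try:
            result = operation()  # closed, or a trial call after the wait
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return self._degrade(fallback)
        self.failures = 0
        self.opened_at = None
        return result

    def _degrade(self, fallback):
        if fallback is None:
            raise DependencyUnavailable("no safe fallback declared")
        return fallback()
```

A recommendations call might pass `fallback=lambda: []`, while a pricing or authorization call passes no fallback and surfaces the failure. That keeps the "what is safe to degrade" decision visible in code review instead of buried in a default.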
Where It Breaks
| Review Question | Risk It Exposes | Weak Answer | Strong Answer |
|---|---|---|---|
| What is the retry budget? | Load amplification | “The client retries three times.” | “Retries are capped per request class and stop when downstream saturation begins.” |
| How does the queue drain? | Delayed recovery | “Workers autoscale.” | “We track oldest age, prioritize urgent work, expire stale work, and cap downstream drain rate.” |
| What is the source of truth? | Divergent state | “Both stores are updated.” | “This store owns truth; the other is rebuilt from events and can lag safely.” |
| What happens during rollback? | Irreversible change | “We redeploy the old version.” | “The schema and messages are backward compatible for the rollback window.” |
| What is safe to degrade? | Incorrect fallback | “We show cached data.” | “Only non-authoritative recommendations degrade; authorization and pricing fail closed.” |
| Who operates repair? | Unowned recovery | “The on-call will handle it.” | “The owning team has a runbook, replay tool, and tested repair path.” |
What to Do Next
- Problem: Design reviews often validate architecture shape while missing the failure path that turns a normal fault into an incident.
- Solution: Review the system through demand, dependency, state, change, and recovery gates. Require bounded behavior for retries, queues, fallbacks, migrations, and repair.
- Proof: Public engineering guidance from Google, Amazon, and Netflix converges on the same operational lesson: overload, backlog, and dependency coupling are architecture risks, not just runtime events.
- Action: For your next review, ask one question first: “What condition prevents this system from recovering automatically?” If the team cannot answer with metrics, limits, ownership, and a tested recovery path, the design is not ready.