System Design Review Checklist for Senior Engineers
Most system designs fail in production for reasons that were visible in review: overloaded dependencies, ambiguous ownership, unsafe retries, unbounded queues, missing rollback paths, and observability that explains symptoms after the blast radius has already expanded.
Situation
Senior engineers are increasingly asked to review systems that are not single services. A checkout flow, ingestion pipeline, feature platform, fraud scorer, or notification engine usually crosses product code, queues, databases, caches, identity, observability, deployment automation, and cloud limits. The design document may describe components correctly and still miss the operational behavior that decides whether the system survives real traffic.
The review therefore cannot stop at boxes and arrows. It has to ask what happens when the write path is slow, when a dependency returns partial errors, when a batch job catches up after downtime, when one tenant becomes noisy, when a deployment must be rolled back, and when the team on call has ten minutes to decide whether to shed traffic or keep retrying.
A senior design review is not a ceremony. It is a controlled attempt to find production failures while they are still cheap.
The Problem
Most checklists are too polite. They ask whether the system is scalable, reliable, secure, and observable. Those are useful words, but they are not review questions. A system is not “scalable” because it uses Kafka, Kubernetes, DynamoDB, Postgres replicas, or a cache. It is scalable only if the design names the bottleneck, bounds the queue, protects the dependency, and explains the recovery behavior.
The common failure is architectural optimism. The design assumes the happy path is representative. It says the service will retry transient failures, but not whether retries are capped, jittered, idempotent, and budgeted. It says data will be eventually consistent, but not which user decision can observe stale state. It says the database can be scaled vertically, but not what happens when an index change locks writes or when a hot partition absorbs the launch.
The review question is not “does the design make sense?” The question is: which operational failure is this architecture choosing, and has the team made that failure bounded, observable, and reversible?
A Review Loop That Finds Failures
A senior engineer should review a design in passes. Each pass should force the author to replace architectural adjectives with operational commitments.
flowchart TD
A[Design review — request intake] --> B[Business invariant — what must remain true]
B --> C[Ownership map — read path and write path]
C --> D[Load model — steady state and surge]
D --> E[Failure model — timeout retry and fallback]
E --> F[Data model — consistency and repair]
F --> G[Release model — migration rollback and flags]
G --> H[Operations model — alerts dashboards and runbooks]
H --> I[Decision — approve revise or reject]
E -->|stress| D
F -->|constraints| C
H -->|evidence| I
Start with the invariant. Every serious system has one or two properties that matter more than everything else: never double charge, never lose an accepted write, never send a customer-visible message before consent is committed, never make authorization depend on a stale cache. If the document cannot name the invariant, the review is premature.
Then map ownership. For each request, identify the service that accepts responsibility, the system of record, the derived stores, and the repair path. Ownership is not the same as code ownership. The owning system is the one that can answer, “what is the truth after a retry, replay, partial failure, or manual correction?”
Next, model load. Ask for expected request rate, burst behavior, fanout, payload size, cardinality, hot keys, queue depth, backfill rate, and tenant isolation. A design without a load model is not architecture; it is a component inventory.
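A load model can start as a few lines of arithmetic. The sketch below is illustrative only: the `LoadModel` class, its field names, and the numbers in the usage note are assumptions, not figures from any real system. It shows two questions a reviewer can ask directly: what rate the downstream dependency sees after fanout, and how long a consumer needs to drain a backlog while new traffic keeps arriving.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadModel:
    """Hypothetical load model; all fields are assumptions for illustration."""
    steady_rps: float    # expected steady-state arrivals per second
    fanout: int          # downstream calls made per inbound request
    consumer_rps: float  # sustained processing capacity per second

    def downstream_rps(self) -> float:
        # The dependency sees inbound traffic multiplied by fanout.
        return self.steady_rps * self.fanout

    def backlog_after_outage(self, outage_seconds: float) -> float:
        # Work keeps arriving while the consumer is down.
        return self.steady_rps * outage_seconds

    def drain_seconds(self, outage_seconds: float) -> float:
        # Only capacity above the arrival rate reduces the backlog.
        headroom = self.consumer_rps - self.steady_rps
        if headroom <= 0:
            return math.inf  # the backlog never drains
        return self.backlog_after_outage(outage_seconds) / headroom
```

With the made-up values `LoadModel(steady_rps=500, fanout=4, consumer_rps=600)`, a ten-minute outage (`drain_seconds(600)`) takes 3000 seconds to clear, because only 100 requests per second of headroom is available for catch-up. That gap between outage length and recovery length is the kind of number a design document should state.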
Then review failure behavior. Every remote call needs a timeout. Every retry needs a cap, backoff, jitter, and idempotency story. Every queue needs a maximum depth, a dead-letter path, and a replay procedure. Every cache needs a miss path and stampede control. Every dependency needs a degraded mode or an explicit decision that the whole product feature fails closed.
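As one illustration of the queue requirement, here is a minimal in-memory sketch of a queue with a maximum depth, an age limit, and a dead-letter path. The `BoundedQueue` class and its parameters are hypothetical; a real broker would expose these as configuration rather than code, and the replay procedure is out of scope here.

```python
from collections import deque


class BoundedQueue:
    """Hypothetical bounded queue: rejects when full, dead-letters stale items."""

    def __init__(self, max_depth: int, max_age_seconds: float):
        self.max_depth = max_depth
        self.max_age_seconds = max_age_seconds
        self._items = deque()   # holds (enqueued_at, item) pairs
        self.dead_letters = []  # expired items kept for inspection or replay

    def offer(self, item, now: float) -> bool:
        # Refuse new work instead of growing without bound.
        if len(self._items) >= self.max_depth:
            return False
        self._items.append((now, item))
        return True

    def poll(self, now: float):
        # Return the next fresh item; route expired ones to the dead-letter list.
        while self._items:
            enqueued_at, item = self._items.popleft()
            if now - enqueued_at > self.max_age_seconds:
                self.dead_letters.append(item)
                continue
            return item
        return None
```

The caller passes `now` explicitly so the age check is easy to test. The design choice worth reviewing is that `offer` returns `False` rather than blocking: the producer has to decide what rejection means, which is exactly the decision the design document should record.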
Data review comes next. Ask which writes are atomic, which reads can be stale, which events can be duplicated, and which records can arrive out of order. Require reconciliation for any workflow where truth crosses service boundaries. “Eventually consistent” is not a design until the document says who observes the inconsistency and how it heals.
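The sketch below shows one way a derived store might tolerate duplicated and out-of-order events, plus a simple reconciliation check against the system of record. It assumes each event carries a unique id and a per-entity version number; `Event`, `Projection`, and `reconcile` are hypothetical names, and the in-memory set stands in for a durable deduplication table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str   # unique per event, used for deduplication
    entity_id: str
    version: int    # assumed to increase monotonically per entity
    state: str


@dataclass
class Projection:
    seen: set = field(default_factory=set)
    rows: dict = field(default_factory=dict)  # entity_id -> (version, state)

    def apply(self, event: Event) -> str:
        if event.event_id in self.seen:
            return "duplicate"  # redelivery: safe to ignore
        self.seen.add(event.event_id)
        current = self.rows.get(event.entity_id)
        if current is not None and current[0] >= event.version:
            return "stale"      # arrived after a newer version was applied
        self.rows[event.entity_id] = (event.version, event.state)
        return "applied"


def reconcile(source_of_truth: dict, projection: Projection) -> list:
    """Return entity ids whose derived row disagrees with the system of record."""
    return sorted(
        entity_id
        for entity_id, row in source_of_truth.items()
        if projection.rows.get(entity_id) != row
    )
```

If version 2 of an order arrives before version 1, the late version 1 is reported as `"stale"` and the row stays at version 2. `reconcile` then gives the repair job a concrete list of entities to fix, which is the "how it heals" half of an eventual-consistency claim.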
Finally, review release and operations. The design needs migration order, backward compatibility, rollback safety, feature flags, alert ownership, dashboards, and runbooks. If rollback requires deleting data, manually editing rows, or coordinating three teams in a live incident, it is not a rollback plan.
In Practice
Context: Amazon’s documented retry guidance treats retries as a load amplifier, not a harmless reliability feature. The Amazon Builders’ Library article on timeouts, retries, and backoff with jitter describes why synchronized retries can worsen overload and why jitter spreads retry traffic over time.
Action: In design review, require retry budgets to be part of the API contract. The author should state which errors are retryable, where retries happen, how many attempts are allowed, whether calls are idempotent, and how clients avoid synchronized retry storms.
Result: The documented pattern is that retries become bounded recovery behavior instead of an accidental denial of service against a dependency already under stress.
Learning: A senior reviewer should reject “we retry on failure” as incomplete. The acceptable design is “we retry this class of failure, with this cap, this backoff, this jitter, this timeout, and this idempotency key.”
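A minimal sketch of what that sentence could look like in code follows. It assumes capped exponential backoff with "full jitter" (each delay drawn uniformly between zero and the current backoff ceiling), an overall deadline, and a single idempotency key reused across attempts. `RetryableError`, `call_with_retries`, and every default value are hypothetical, and the operation is assumed to enforce its own per-call timeout.

```python
import random
import time
import uuid


class RetryableError(Exception):
    """Hypothetical marker for failures the API contract declares retryable."""


def call_with_retries(operation, *, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, deadline_seconds=5.0,
                      sleep=time.sleep, rng=random.random,
                      clock=time.monotonic):
    """Call operation(idempotency_key) with a bounded, jittered retry policy."""
    # One key for all attempts, so the server can deduplicate repeated writes.
    idempotency_key = str(uuid.uuid4())
    started = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(idempotency_key)
        except RetryableError:
            if attempt == max_attempts:
                raise  # cap reached: surface the failure
            # Full jitter: uniform in [0, min(max_delay, base_delay * 2^(attempt-1))).
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = rng() * ceiling
            if clock() - started + delay > deadline_seconds:
                raise  # the retry would overrun the caller's deadline
            sleep(delay)
```

Only `RetryableError` triggers a retry; any other exception propagates immediately, which keeps the retryable class explicit. The `sleep`, `rng`, and `clock` parameters exist so the policy can be tested without waiting, and they make the cap and backoff visible to a reviewer instead of buried in a client library.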
Context: Google’s SRE material on addressing cascading failures treats overload as a system property. It discusses load shedding, queue management, throttling, and graceful degradation as ways to prevent local saturation from becoming global failure.
Action: In review, require every overloaded component to have a deliberate policy: shed, queue, degrade, reject, or isolate. The policy must be tied to a signal such as latency, queue length, CPU saturation, error rate, or dependency health.
Result: The documented pattern is that systems survive overload by preserving the most important work and refusing work they cannot safely complete.
Learning: Capacity is not just how much traffic the system can accept. It is how clearly the system says no before it corrupts latency, exhausts threads, or collapses downstream dependencies.
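One simple way to make a component "say no" is to tie rejection to the number of requests in flight. The sketch below is an assumption-laden illustration, not a pattern taken from the SRE material itself: `InFlightLimiter` and `Overloaded` are hypothetical names, and a production limiter would usually derive its limit from measured latency rather than a constant.

```python
import threading


class Overloaded(Exception):
    """Raised instead of queuing work the service cannot complete in time."""


class InFlightLimiter:
    """Hypothetical admission control: reject once too many requests are active."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.rejected = 0  # export as a metric so shedding is observable
        self._in_flight = 0
        self._lock = threading.Lock()

    def __enter__(self):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                self.rejected += 1
                raise Overloaded("in-flight limit reached")
            self._in_flight += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._in_flight -= 1
        return False  # never swallow the handler's own exceptions
```

A handler would wrap its work in `with limiter:` and translate `Overloaded` into a fast rejection, such as an HTTP 503. The reviewable commitments are the limit itself and the `rejected` counter, which tells the on-call engineer that shedding is happening and how much.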
Context: Netflix has publicly described load shedding at both the gateway and the service level, including prioritized traffic handling, in its technology blog article on service-level prioritized load shedding. The relevant architectural pattern is prioritizing critical requests when capacity is constrained.
Action: In review, classify traffic by business importance before production load forces the decision. Reads that support playback, writes that protect account state, background refreshes, analytics, and experiments should not compete blindly for the same saturated worker pool.
Result: The documented pattern is graceful degradation through prioritization: lower-value work is delayed or dropped so critical user journeys keep enough capacity.
Learning: A design that treats all requests equally often fails the most important request first, because low-value work can be cheaper, more numerous, and easier to retry.
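Prioritized shedding can be sketched as a per-class utilization threshold. Everything below is hypothetical: the `Priority` classes, the threshold values, and the single `utilization` signal are assumptions chosen for illustration, not a description of how Netflix implements it.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0     # e.g. requests on the core user journey
    DEGRADABLE = 1   # e.g. background refreshes
    BEST_EFFORT = 2  # e.g. analytics and experiments


# Hypothetical thresholds: each class is shed once utilization reaches its value.
SHED_AT = {
    Priority.BEST_EFFORT: 0.70,
    Priority.DEGRADABLE: 0.85,
    Priority.CRITICAL: 0.98,
}


def admit(priority: Priority, utilization: float) -> bool:
    """Admit a request only while utilization is below its class threshold."""
    return utilization < SHED_AT[priority]
```

At 75% utilization this policy already rejects best-effort traffic while still admitting critical requests, and at 90% it rejects degradable traffic too. The point for review is that the classification and the thresholds are written down before an incident forces the choice.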
Where It Breaks
| Review Area | Failure Mode | What To Ask |
|---|---|---|
| Ownership | Two services believe they own the same truth | Which system can repair incorrect state without asking another team? |
| Retries | Clients multiply load during dependency failure | Where is the retry budget enforced and how is jitter applied? |
| Queues | Backlog hides an outage until recovery overwhelms storage | What is the max depth, age limit, and replay rate? |
| Caches | Cache miss storms overload the source of truth | How are hot keys, refreshes, and stampedes controlled? |
| Databases | Hot partitions or missing indexes dominate tail latency | What query, key, or tenant becomes the bottleneck first? |
| Consistency | Users observe half-completed workflows | Which states are visible, repairable, and terminal? |
| Deployments | Rollback is blocked by irreversible schema or data changes | What is the exact backward-compatible migration sequence? |
| Observability | Alerts page symptoms without locating ownership | Which dashboard proves the invariant is still true? |
The checklist also breaks when used as a compliance form. A weak review asks every question with equal weight. A strong review follows risk. A stateless internal read API may need intense dependency and latency review but little migration analysis. A payments workflow may deserve most of its scrutiny on idempotency, reconciliation, auditability, and rollback. A machine learning feature store may need review around freshness, backfill safety, cardinality, and training-serving skew.
The goal is not to make every design larger. The goal is to make the chosen architecture honest.
What to Do Next
- Problem: Design reviews often approve diagrams instead of production behavior. Require each review to start with the business invariant and the most likely operational failure.
- Solution: Use passes: ownership, load, failure behavior, data consistency, release safety, and operations. Do not accept generic claims where a bound, policy, or owner is required.
- Proof: Compare the design against documented patterns from AWS retry guidance, Google SRE overload handling, and Netflix prioritized load shedding. These are public examples of architectures shaped around failure, not just component selection.
- Action: Before approval, ask the author to write the incident summary they hope never to send. If the design cannot explain detection, containment, mitigation, repair, and rollback, the review is not done.