Topic

System Design

Architecture reviews, scalability, failure modes, guardrails, distributed systems, reliability boundaries, and production tradeoffs.

50 posts, 49 deep dives

Start Here

Good entry points for this topic before browsing the full archive.

Jun 25, 2022 7 min read

L2 Deep Dive

System Design

System Design Review Checklist for Senior Engineers

Most system designs fail for reasons visible at review time: overloaded dependencies, ambiguous ownership, unsafe retries, unbounded queues, and missing rollback paths — a checklist senior engineers use to surface those risks early.

#architecture #system-design #cloud

May 11, 2022 7 min read

L2 Deep Dive

System Design

Capacity Planning From First Principles: QPS, Fanout, and Hot Keys

Capacity planning fails when teams size for the average request and ignore fanout, hot keys, and bursty traffic — a framework for sizing from QPS, read/write ratios, and peak multipliers before the first incident teaches the lesson.

#architecture #system-design #cloud

May 26, 2022 8 min read

L2 Deep Dive

System Design

Backpressure Design: How Healthy Systems Say No

Healthy systems preserve their ability to recover by refusing work before a failure becomes contagious — how to design backpressure at the queue boundary, connection pool, and API layer so overload stops propagating upstream.

#architecture #failures #cloud

Deep Dives

L2 and L3 posts with architecture, reliability, and tradeoff detail.

Nov 20, 2025 6 min read

L2 Deep Dive

System Design

330 Redundant Data Centers All Failed Simultaneously — Because They Were Identical

Cloudflare's November 2025 outage is a case study in correlated failure. Redundancy protects against independent failures. It does nothing when every node runs the same defective code.

#architecture #failures

Jun 17, 2025 6 min read

L2 Deep Dive

System Design

The End of Single-Signal Alerting: Correlating Metrics, Logs, Traces, Deployments, and Cost

Why paging an engineer solely because CPU hit 85% is an anti-pattern, and how to build correlated alerts that require real operational evidence.

#architecture #failures #system-design

Nov 26, 2024 6 min read

L2 Deep Dive

System Design

The Staff Engineer's System Design Review: Questions That Expose Real Risk

Review questions a staff engineer asks to surface cascade failures, missing fallbacks, state boundaries, and load assumptions that design docs bury.

#architecture #system-design #cloud

Nov 11, 2024 7 min read

L2 Deep Dive

System Design

Designing for Peak Traffic Without Designing for Permanent Waste

Pre-positioned capacity, elastic response, bounded queues, and overload shedding — controls for peak traffic without permanent fleet waste.

#architecture #system-design #cloud

Oct 27, 2024 6 min read

L2 Deep Dive

System Design

Building a Commerce Platform Data Plane: OLTP, Search, Cache, Queue, Warehouse

Ownership boundaries for OLTP, search, cache, queue, and warehouse in a commerce data plane — so no datastore becomes source of truth during an incident.

#architecture #system-design #cloud

Oct 12, 2024 7 min read

L2 Deep Dive

System Design

Managed Database Selection: Operational Burden, Feature Fit, Cost, and Exit Risk

Managed database selection across operational burden, feature fit, cost trajectory, and exit risk — with failure modes the easy adoption story hides.

#architecture #system-design #cloud

All System Design Posts

Apr 8, 2026 2 min read

L1 Field Note

System Design

Why Your Non-Prod Databases Cost as Much as Production

Architectural strategies to eliminate waste in Dev, Test, and Staging database environments.

#failures #architecture

Nov 20, 2025 6 min read

L2 Deep Dive

System Design

330 Redundant Data Centers All Failed Simultaneously — Because They Were Identical

Cloudflare's November 2025 outage is a case study in correlated failure. Redundancy protects against independent failures. It does nothing when every node runs the same defective code.

#architecture #failures

Jun 17, 2025 6 min read

L2 Deep Dive

System Design

The End of Single-Signal Alerting: Correlating Metrics, Logs, Traces, Deployments, and Cost

Why paging an engineer solely because CPU hit 85% is an anti-pattern, and how to build correlated alerts that require real operational evidence.

#architecture #failures #system-design

Nov 26, 2024 6 min read

L2 Deep Dive

System Design

The Staff Engineer's System Design Review: Questions That Expose Real Risk

Review questions a staff engineer asks to surface cascade failures, missing fallbacks, state boundaries, and load assumptions that design docs bury.

#architecture #system-design #cloud

Nov 11, 2024 7 min read

L2 Deep Dive

System Design

Designing for Peak Traffic Without Designing for Permanent Waste

Pre-positioned capacity, elastic response, bounded queues, and overload shedding — controls for peak traffic without permanent fleet waste.

#architecture #system-design #cloud

Oct 27, 2024 6 min read

L2 Deep Dive

System Design

Building a Commerce Platform Data Plane: OLTP, Search, Cache, Queue, Warehouse

Ownership boundaries for OLTP, search, cache, queue, and warehouse in a commerce data plane — so no datastore becomes source of truth during an incident.

#architecture #system-design #cloud

Oct 12, 2024 7 min read

L2 Deep Dive

System Design

Managed Database Selection: Operational Burden, Feature Fit, Cost, and Exit Risk

Managed database selection across operational burden, feature fit, cost trajectory, and exit risk — with failure modes the easy adoption story hides.

#architecture #system-design #cloud

Aug 28, 2024 7 min read

L2 Deep Dive

System Design

Service Decomposition Review: When a New Microservice Creates a Worse Database Problem

Splitting a service without relocating the database boundary creates distributed coordination overhead worse than the monolith the split was meant to fix.

#architecture #system-design #cloud

Aug 13, 2024 7 min read

L3 Reference Guide

System Design

Event-Driven Architecture Review: Schema Evolution, Ordering, Replay, and Dead Letters

The four failure boundaries in event-driven systems: schema evolution contracts, ordering guarantees, consumer replay safety, and dead-letter queue handling.

#architecture #system-design #cloud

Jul 29, 2024 8 min read

L2 Deep Dive

System Design

Database Migration Cutover Workflow: Dual Writes, CDC, Backfill, Freeze, and Rollback

Database migration cutover using dual writes, CDC, backfill, and freeze phases — with rollback boundaries for when 'almost synchronized' is not an operational state.

#architecture #system-design #cloud

Jul 14, 2024 7 min read

L2 Deep Dive

System Design

Cloud Cost Triage Workflow: Compute, Storage, Data Transfer, Logs, and Managed Services

Cloud cost triage across compute, storage, data transfer, logs, and managed services — a repeatable workflow for finding runaway spend before the bill arrives.

#architecture #system-design #cloud

Jun 29, 2024 6 min read

L2 Deep Dive

System Design

Multi-Region Failover Game Day: What to Test Before the Region Is Down

Designing a failover game day that validates DNS cutover, replication lag thresholds, and traffic routing before a real region failure forces the test.

#architecture #system-design #cloud

May 30, 2024 7 min read

L2 Deep Dive

System Design

Queue Backlog Workflow: Producer Spike, Consumer Lag, Poison Messages, and Retry Storms

Producer spikes, consumer lag, poison messages, and retry storms each need a different intervention — the diagnosis order matters as much as the fix.

#architecture #system-design #cloud

May 15, 2024 7 min read

L2 Deep Dive

System Design

Cache Incident Workflow: Hit Rate Collapse, Stampede, TTLs, and Database Protection

Hit-rate collapse, stampede, TTL misconfiguration, and unprotected database load are distinct cache failures that feed one another; this workflow diagnoses each in sequence.

#architecture #system-design #cloud

Apr 30, 2024 7 min read

L2 Deep Dive

System Design

API Gateway Incident Workflow: Auth, Rate Limits, Routing, and Downstream Saturation

API gateway incidents are misdiagnosed when teams treat them as proxy failures instead of control-plane failures with downstream saturation blast radius.

#architecture #system-design #cloud

Mar 31, 2024 7 min read

L2 Deep Dive

System Design

Amazon-Style Commerce Data Architecture: What Public Systems Teach Without Copying Blindly

Cart writability, inventory oversell, order durability, and analytics isolation are the real failure boundaries in commerce data architecture.

#architecture #system-design #cloud

Mar 16, 2024 6 min read

L2 Deep Dive

System Design

Customer Data Boundary: PII, Consent, Encryption, and Regional Residency

PII boundary enforcement breaks when consent, encryption, and regional residency are conventions scattered across services, queues, and warehouses.

#architecture #system-design #cloud

Mar 1, 2024 7 min read

L2 Deep Dive

System Design

Order Analytics Pipeline: OLTP, CDC, Warehouse, and Reconciliation Checks

Order count discrepancies between OLTP and the warehouse often trace to CDC pipeline schema drift redefining what counts as a committed order.

#architecture #system-design #cloud

Feb 15, 2024 8 min read

L2 Deep Dive

System Design

Catalog Sync Workflow: Database, Search Index, CDN, and Cache Invalidation

Propagating a catalog update from database commit through Elasticsearch, CDN edge cache, and application cache without stranding stale reads downstream.

#architecture #system-design #cloud

Jan 31, 2024 7 min read

L2 Deep Dive

System Design

Inventory Consistency Playbook: Reservation, Release, Reconciliation, and Oversell Risk

Reservation, release, and reconciliation for inventory systems where carts, payments, and retries generate conflicting stock counts across writes.

#architecture #system-design #cloud

Jan 1, 2024 8 min read

L2 Deep Dive

System Design

Black Friday Database Readiness: Hot Keys, Connection Pools, Cache Misses, and Queue Depth

Hot key contention, connection pool exhaustion, and cache miss bursts each hit local thresholds before aggregate dashboards show anything alarming.

#architecture #system-design #cloud

Dec 2, 2023 7 min read

L2 Deep Dive

System Design

Search Indexes in Commerce: Why Elasticsearch Is Not the Source of Truth

Elasticsearch is a read index, not a record system — routing writes through it creates catalog drift that surfaces only after orders are placed.

#architecture #system-design #cloud

Nov 2, 2023 7 min read

L2 Deep Dive

System Design

Order State Machines: The Database Model Behind Checkout Reliability

Order state machines prevent checkout duplication by constraining which database transitions are legal — so a paid order cannot be paid twice.

#architecture #system-design #cloud

Oct 3, 2023 6 min read

L2 Deep Dive

System Design

Shopping Cart Storage: Session Cache, Durable Cart, and Recovery Semantics

Session cache versus durable cart: the recovery semantics that determine data survival across session loss, browser closure, and checkout failure.

#architecture #system-design #cloud

Sep 3, 2023 7 min read

L2 Deep Dive

System Design

E-Commerce Databases Are Not One Database: Catalog, Cart, Orders, Inventory, Payments

Catalog, cart, orders, inventory, and payments as five distinct consistency problems — why a shared transaction boundary causes e-commerce system failures.

#architecture #system-design #cloud

Aug 4, 2023 7 min read

L2 Deep Dive

System Design

OCI Disaster Recovery Review: Regions, ADs, Backups, Data Guard, and GoldenGate

OCI disaster recovery gaps that emerge when teams rely on regional failover alone, and how Data Guard and GoldenGate address the database replication tier.

#architecture #system-design #cloud

Jul 20, 2023 7 min read

L2 Deep Dive

System Design

OCI E-Commerce Database Architecture: Autonomous Transaction Processing, GoldenGate, and Object Storage

Isolating the OCI Autonomous Transaction Processing write path from catalog and analytics load using GoldenGate replication and Object Storage offloading.

#architecture #system-design #cloud

Jul 5, 2023 7 min read

L2 Deep Dive

System Design

Exadata Cloud Service: When Hardware Architecture Still Matters

Exadata Cloud Service exposes RDMA interconnects and Smart Scan offload tiers that matter when Oracle workload latency cannot be fixed with software alone.

#architecture #system-design #cloud

Jun 20, 2023 7 min read

L2 Deep Dive

System Design

Oracle Autonomous Database: What It Automates and What It Cannot Know

Oracle Autonomous Database automates patching and scaling, but cannot substitute for query intent, schema decisions, and access patterns the team must own.

#architecture #system-design #cloud

Jun 5, 2023 7 min read

L2 Deep Dive

System Design

OCI Reference Architecture: Load Balancing, OKE, Autonomous Database, Cache, and Queue

How OCI load balancing, OKE, Autonomous Database, cache, and queue layers interact, and why ambiguous cross-service assumptions cause the first failure.

#architecture #system-design #cloud

May 21, 2023 7 min read

L2 Deep Dive

System Design

GCP Multi-Region Architecture: Global Load Balancing, Spanner, Pub/Sub, and Failure Testing

Control plane coupling, Spanner split boundaries, and untested Pub/Sub failover are why GCP multi-region architectures break before the region goes dark.

#architecture #system-design #cloud

Apr 21, 2023 7 min read

L2 Deep Dive

System Design

BigQuery as an Operational Analytics Boundary, Not an OLTP Escape Hatch

Slot contention and multi-second scan latency are the failure modes when BigQuery gets used as the transactional backend of a user-facing service.

#architecture #system-design #cloud

Mar 22, 2023 6 min read

L2 Deep Dive

System Design

Pub/Sub Ordering Keys: The Detail That Decides Your Event Model

Pub/Sub ordering keys control which events serialize together, determining whether failures stall the whole stream or only the affected partition.

#architecture #system-design #cloud

Mar 7, 2023 7 min read

L2 Deep Dive

System Design

Cloud Spanner vs Cloud SQL: The Real Distributed Database Decision

Cloud Spanner vs Cloud SQL turns on failure domain tolerance — whether your SLA survives a regional primary outage, not on scale or throughput alone.

#architecture #system-design #cloud

Feb 5, 2023 7 min read

L2 Deep Dive

System Design

Azure Multi-Region Design: Front Door, Cosmos DB, SQL Failover, and Operational Tradeoffs

Azure multi-region design tradeoffs: Front Door routing, Cosmos DB consistency, and SQL failover group lag — and which failures each bet absorbs.

#architecture #system-design #cloud

Jan 21, 2023 7 min read

L2 Deep Dive

System Design

Azure Database Reliability Review: Failover Groups, Backups, and Geo-Replication

Azure database recovery beyond 'we have backups': failover group cutover, geo-replication lag, and backup restore testing as the real reliability floor.

#architecture #system-design #cloud

Nov 22, 2022 7 min read

L2 Deep Dive

System Design

Azure SQL vs Cosmos DB: The Partition Key Decision

The wrong Azure database choice announces itself when one tenant or region becomes hot enough to make every clean abstraction expensive — how to decide between Azure SQL and Cosmos DB based on access patterns, consistency needs, and operational cost.

#architecture #system-design #cloud

Nov 7, 2022 6 min read

L2 Deep Dive

System Design

Azure Reference Architecture: Front Door, App Service, SQL, Cache, and Service Bus

Azure applications typically fail first at the edges: Front Door configuration, App Service connection pools, SQL failover groups, Redis cache invalidation, and Service Bus backlog — a reference architecture that makes these failure boundaries explicit.

#architecture #system-design #cloud

Oct 23, 2022 8 min read

L2 Deep Dive

System Design

AWS Multi-Region Failover: Route 53, Global Accelerator, Aurora, and DynamoDB Global Tables

AWS multi-region failover fails most often in traffic steering, write promotion, and schema drift — how Route 53, Global Accelerator, Aurora global databases, and DynamoDB global tables behave under a real regional failure.

#architecture #system-design #cloud

Oct 8, 2022 7 min read

L2 Deep Dive

System Design

AWS Database Cost Triage: RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch

Database bills grow when ownership, workload shape, and control loops drift apart — a structured triage approach for RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch spend before it becomes an emergency.

#architecture #system-design #cloud

Sep 23, 2022 7 min read

L2 Deep Dive

System Design

AWS Multi-Account Data Boundary: VPCs, KMS, IAM, and Audit Trails

Most AWS data leaks happen when identity, network, encryption, and audit boundaries are designed as separate controls by separate teams — a multi-account architecture that treats VPCs, KMS, IAM, and CloudTrail as a unified boundary.

#architecture #system-design #cloud

Sep 8, 2022 7 min read

L2 Deep Dive

System Design

AWS E-Commerce Checkout Architecture: SQS, Lambda, Aurora, and DynamoDB

Checkout fails when payment, inventory, order history, and notification are treated as one synchronous request — how to model checkout as one committed decision followed by recoverable asynchronous consequences using SQS, Lambda, Aurora, and DynamoDB.

#architecture #system-design #cloud

Aug 24, 2022 7 min read

L2 Deep Dive

System Design

S3 Event Architectures: Durable, Cheap, and Easy to Misorder

S3 event processing is durable and cheap but the event stream and the bucket tell different stories — how to design S3-driven pipelines around ordering guarantees, duplicate delivery, and eventual consistency without data loss.

#architecture #system-design #cloud

Aug 9, 2022 9 min read

L2 Deep Dive

System Design

Aurora vs RDS: The Operational Difference Engineers Actually Feel

The real difference between Aurora and RDS shows up during storage stall, replica lag, and failover at 03:00 — how the two products behave differently under failure and what those differences mean for operational choice and cost.

#architecture #system-design #cloud

Jun 25, 2022 7 min read

L2 Deep Dive

System Design

System Design Review Checklist for Senior Engineers

#architecture #system-design #cloud

Jun 10, 2022 7 min read

L2 Deep Dive

System Design

Multi-Region Architecture: Latency, Consistency, and Blast Radius

Multi-region is usually a failure-containment project, not a scalability project — and deploying across regions exposes every weak assumption in your data model, write ownership strategy, and cross-region blast-radius planning.

#architecture #system-design #cloud

May 26, 2022 8 min read

L2 Deep Dive

System Design

Backpressure Design: How Healthy Systems Say No

#architecture #failures #cloud

May 11, 2022 7 min read

L2 Deep Dive

System Design

Capacity Planning From First Principles: QPS, Fanout, and Hot Keys

#architecture #system-design #cloud

Feb 25, 2022 8 min read

L2 Deep Dive

System Design

Queues vs Streams: The Decision Engineers Keep Reversing

Queues and streams solve different problems: commands vs events, consume-once delivery vs replay, immediate consumption vs historical processing. Teams that choose without understanding the difference reverse the decision under load.

#architecture #system-design #cloud

Jan 11, 2022 8 min read

L2 Deep Dive

System Design

System Design Starts With Failure Modes, Not Boxes and Arrows

The first system design question is not 'what are the services' — it is 'what breaks, how fast does it spread, and what evidence tells us the damage is contained.' A framework for failure-mode-first design.

#architecture #system-design #cloud
