#system-design

58 posts

May 12, 2026 7 min read

L2 Deep Dive

Agentic SRE Architecture: Skills, Agents, MCP Servers, and Human Approval Loops

The definitive 2026 reference architecture for autonomous database operations, from detection to multi-agent diagnosis to human-in-the-loop remediation.

#ai-engineering #architecture #system-design #cloud

Mar 10, 2026 8 min read

L2 Deep Dive

MCP Server Observability: The New Control Plane for AI + Enterprise Tools

How the Model Context Protocol (MCP) became the networking layer for AI agents, and why monitoring these connections is critical for enterprise security.

#ai-engineering #architecture #system-design #security

Jan 20, 2026 8 min read

L2 Deep Dive

AI Agent Observability: Monitor Tool Calls, Token Spend, Latency, and Failure Loops

Why monitoring autonomous SRE agents requires tracking tool-call hallucinations, context window saturation, and recursive retry loops, rather than just basic CPU metrics.

#ai-engineering #architecture #failures #system-design

Jun 17, 2025 6 min read

L2 Deep Dive

The End of Single-Signal Alerting: Correlating Metrics, Logs, Traces, Deployments, and Cost

Why paging an engineer solely because CPU hit 85% is an anti-pattern, and how to build correlated alerts that require real operational evidence.

#architecture #failures #system-design

Nov 26, 2024 6 min read

L2 Deep Dive

The Staff Engineer's System Design Review: Questions That Expose Real Risk

Review questions a staff engineer asks to surface cascade failures, missing fallbacks, state boundaries, and load assumptions that design docs bury.

#architecture #system-design #cloud

Nov 11, 2024 7 min read

L2 Deep Dive

Designing for Peak Traffic Without Designing for Permanent Waste

Pre-positioned capacity, elastic response, bounded queues, and overload shedding — controls for peak traffic without permanent fleet waste.

#architecture #system-design #cloud

Oct 27, 2024 6 min read

L2 Deep Dive

Building a Commerce Platform Data Plane: OLTP, Search, Cache, Queue, Warehouse

Ownership boundaries for OLTP, search, cache, queue, and warehouse in a commerce data plane, so that no datastore becomes an accidental source of truth during an incident.

#architecture #system-design #cloud

Oct 12, 2024 7 min read

L2 Deep Dive

Managed Database Selection: Operational Burden, Feature Fit, Cost, and Exit Risk

Managed database selection across operational burden, feature fit, cost trajectory, and exit risk — with failure modes the easy adoption story hides.

#architecture #system-design #cloud

Aug 28, 2024 7 min read

L2 Deep Dive

Service Decomposition Review: When a New Microservice Creates a Worse Database Problem

Splitting a service without relocating the database boundary creates distributed coordination overhead worse than the monolith the split was meant to fix.

#architecture #system-design #cloud

Aug 13, 2024 7 min read

L3 Reference Guide

Event-Driven Architecture Review: Schema Evolution, Ordering, Replay, and Dead Letters

The four failure boundaries in event-driven systems: schema evolution contracts, ordering guarantees, consumer replay safety, and dead-letter queue handling.

#architecture #system-design #cloud

Jul 29, 2024 8 min read

L2 Deep Dive

Database Migration Cutover Workflow: Dual Writes, CDC, Backfill, Freeze, and Rollback

Database migration cutover using dual writes, CDC, backfill, and freeze phases, with explicit rollback boundaries, because 'almost synchronized' is not an operational state.

#architecture #system-design #cloud

Jul 14, 2024 7 min read

L2 Deep Dive

Cloud Cost Triage Workflow: Compute, Storage, Data Transfer, Logs, and Managed Services

Cloud cost triage across compute, storage, data transfer, logs, and managed services — a repeatable workflow for finding runaway spend before the bill arrives.

#architecture #system-design #cloud

Jun 29, 2024 6 min read

L2 Deep Dive

Multi-Region Failover Game Day: What to Test Before the Region Is Down

Designing a failover game day that validates DNS cutover, replication lag thresholds, and traffic routing before a real region failure forces the test.

#architecture #system-design #cloud

May 30, 2024 7 min read

L2 Deep Dive

Queue Backlog Workflow: Producer Spike, Consumer Lag, Poison Messages, and Retry Storms

Producer spikes, consumer lag, poison messages, and retry storms each need a different intervention — the diagnosis order matters as much as the fix.

#architecture #system-design #cloud

May 15, 2024 7 min read

L2 Deep Dive

Cache Incident Workflow: Hit Rate Collapse, Stampede, TTLs, and Database Protection

Cache hit-rate collapse, stampede, TTL misconfiguration, and unprotected database load: a workflow for diagnosing each failure in sequence.

#architecture #system-design #cloud

Apr 30, 2024 7 min read

L2 Deep Dive

API Gateway Incident Workflow: Auth, Rate Limits, Routing, and Downstream Saturation

API gateway incidents are misdiagnosed when teams treat them as proxy failures instead of control-plane failures with downstream saturation blast radius.

#architecture #system-design #cloud

Mar 31, 2024 7 min read

L2 Deep Dive

Amazon-Style Commerce Data Architecture: What Public Systems Teach Without Copying Blindly

Cart writability, inventory oversell, order durability, and analytics isolation are the real failure boundaries in commerce data architecture.

#architecture #system-design #cloud

Mar 16, 2024 6 min read

L2 Deep Dive

Customer Data Boundary: PII, Consent, Encryption, and Regional Residency

PII boundary enforcement breaks when consent, encryption, and regional residency are conventions scattered across services, queues, and warehouses.

#architecture #system-design #cloud

Mar 1, 2024 7 min read

L2 Deep Dive

Order Analytics Pipeline: OLTP, CDC, Warehouse, and Reconciliation Checks

Order count discrepancies between OLTP and the warehouse often trace to CDC pipeline schema drift redefining what counts as a committed order.

#architecture #system-design #cloud

Feb 15, 2024 8 min read

L2 Deep Dive

Catalog Sync Workflow: Database, Search Index, CDN, and Cache Invalidation

Propagating a catalog update from database commit through Elasticsearch, CDN edge cache, and application cache without stranding stale reads downstream.

#architecture #system-design #cloud

Jan 31, 2024 7 min read

L2 Deep Dive

Inventory Consistency Playbook: Reservation, Release, Reconciliation, and Oversell Risk

Reservation, release, and reconciliation for inventory systems where carts, payments, and retries generate conflicting stock counts across writes.

#architecture #system-design #cloud

Jan 1, 2024 8 min read

L2 Deep Dive

Black Friday Database Readiness: Hot Keys, Connection Pools, Cache Misses, and Queue Depth

Hot key contention, connection pool exhaustion, and cache miss bursts each hit local thresholds before aggregate dashboards show anything alarming.

#architecture #system-design #cloud

Dec 2, 2023 7 min read

L2 Deep Dive

Search Indexes in Commerce: Why Elasticsearch Is Not the Source of Truth

Elasticsearch is a read index, not a record system — routing writes through it creates catalog drift that surfaces only after orders are placed.

#architecture #system-design #cloud

Nov 2, 2023 7 min read

L2 Deep Dive

Order State Machines: The Database Model Behind Checkout Reliability

Order state machines prevent checkout duplication by constraining which database transitions are legal — so a paid order cannot be paid twice.

#architecture #system-design #cloud

Oct 3, 2023 6 min read

L2 Deep Dive

Shopping Cart Storage: Session Cache, Durable Cart, and Recovery Semantics

Session cache versus durable cart: the recovery semantics that determine data survival across session loss, browser closure, and checkout failure.

#architecture #system-design #cloud

Sep 3, 2023 7 min read

L2 Deep Dive

E-Commerce Databases Are Not One Database: Catalog, Cart, Orders, Inventory, Payments

Catalog, cart, orders, inventory, and payments as five distinct consistency problems — why a shared transaction boundary causes e-commerce system failures.

#architecture #system-design #cloud

Aug 4, 2023 7 min read

L2 Deep Dive

OCI Disaster Recovery Review: Regions, ADs, Backups, Data Guard, and GoldenGate

OCI disaster recovery gaps that emerge when teams rely on regional failover alone, and how Data Guard and GoldenGate address the database replication tier.

#architecture #system-design #cloud

Jul 20, 2023 7 min read

L2 Deep Dive

OCI E-Commerce Database Architecture: Autonomous Transaction Processing, GoldenGate, and Object Storage

Isolating the OCI Autonomous Transaction Processing write path from catalog and analytics load using GoldenGate replication and Object Storage offloading.

#architecture #system-design #cloud

Jul 5, 2023 7 min read

L2 Deep Dive

Exadata Cloud Service: When Hardware Architecture Still Matters

Exadata Cloud Service exposes RDMA interconnects and Smart Scan offload tiers that matter when Oracle workload latency cannot be fixed with software alone.

#architecture #system-design #cloud

Jun 20, 2023 7 min read

L2 Deep Dive

Oracle Autonomous Database: What It Automates and What It Cannot Know

Oracle Autonomous Database automates patching and scaling, but cannot substitute for query intent, schema decisions, and access patterns the team must own.

#architecture #system-design #cloud

Jun 5, 2023 7 min read

L2 Deep Dive

OCI Reference Architecture: Load Balancing, OKE, Autonomous Database, Cache, and Queue

How OCI load balancing, OKE, Autonomous Database, cache, and queue layers interact, and why ambiguous cross-service assumptions cause the first failure.

#architecture #system-design #cloud

May 21, 2023 7 min read

L2 Deep Dive

GCP Multi-Region Architecture: Global Load Balancing, Spanner, Pub/Sub, and Failure Testing

Control plane coupling, Spanner split boundaries, and untested Pub/Sub failover are why GCP multi-region architectures break before the region goes dark.

#architecture #system-design #cloud

May 6, 2023 6 min read

L2 Deep Dive

Cloud & Platform

GCP Database Cost Review: Cloud SQL, Spanner, Bigtable, Memorystore, and BigQuery

Cloud SQL, Spanner, Bigtable, Memorystore, and BigQuery each bill differently — cost overruns trace to applying the wrong model to the wrong workload.

#architecture #system-design #cloud

Apr 21, 2023 7 min read

L2 Deep Dive

BigQuery as an Operational Analytics Boundary, Not an OLTP Escape Hatch

Slot contention and multi-second scan latency are the failure modes when BigQuery gets used as the transactional backend of a user-facing service.

#architecture #system-design #cloud

Mar 22, 2023 6 min read

L2 Deep Dive

Pub/Sub Ordering Keys: The Detail That Decides Your Event Model

Pub/Sub ordering keys control which events serialize together, determining whether failures stall the whole stream or only the affected partition.

#architecture #system-design #cloud

Mar 7, 2023 7 min read

L2 Deep Dive

Cloud Spanner vs Cloud SQL: The Real Distributed Database Decision

Cloud Spanner vs Cloud SQL turns on failure domain tolerance — whether your SLA survives a regional primary outage, not on scale or throughput alone.

#architecture #system-design #cloud

Feb 5, 2023 7 min read

L2 Deep Dive

Azure Multi-Region Design: Front Door, Cosmos DB, SQL Failover, and Operational Tradeoffs

Azure multi-region design tradeoffs: Front Door routing, Cosmos DB consistency, and SQL failover group lag — and which failures each bet absorbs.

#architecture #system-design #cloud

Jan 21, 2023 7 min read

L2 Deep Dive

Azure Database Reliability Review: Failover Groups, Backups, and Geo-Replication

Azure database recovery beyond 'we have backups': failover group cutover, geo-replication lag, and backup restore testing as the real reliability floor.

#architecture #system-design #cloud

Dec 22, 2022 8 min read

L2 Deep Dive

Cloud & Platform

Azure E-Commerce Order Pipeline: Service Bus, Functions, SQL, and Cosmos DB

Azure checkout fails when order acceptance, payment, inventory reservation, and fulfillment are treated as one clean transaction — how Service Bus, Functions, Azure SQL, and Cosmos DB handle the recoverable steps that follow commitment.

#architecture #system-design #cloud

Nov 22, 2022 7 min read

L2 Deep Dive

Azure SQL vs Cosmos DB: The Partition Key Decision

The wrong Azure database choice announces itself when one tenant or region becomes hot enough to make every clean abstraction expensive — how to decide between Azure SQL and Cosmos DB based on access patterns, consistency needs, and operational cost.

#architecture #system-design #cloud

Nov 7, 2022 6 min read

L2 Deep Dive

Azure Reference Architecture: Front Door, App Service, SQL, Cache, and Service Bus

Azure applications typically fail first at the edges: Front Door configuration, App Service connection pools, SQL failover groups, Redis cache invalidation, and Service Bus backlog — a reference architecture that makes these failure boundaries explicit.

#architecture #system-design #cloud

Oct 23, 2022 8 min read

L2 Deep Dive

AWS Multi-Region Failover: Route 53, Global Accelerator, Aurora, and DynamoDB Global Tables

AWS multi-region failover fails most often in traffic steering, write promotion, and schema drift — how Route 53, Global Accelerator, Aurora global databases, and DynamoDB global tables behave under a real regional failure.

#architecture #system-design #cloud

Oct 8, 2022 7 min read

L2 Deep Dive

AWS Database Cost Triage: RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch

Database bills grow when ownership, workload shape, and control loops drift apart — a structured triage approach for RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch spend before it becomes an emergency.

#architecture #system-design #cloud

Sep 23, 2022 7 min read

L2 Deep Dive

AWS Multi-Account Data Boundary: VPCs, KMS, IAM, and Audit Trails

Most AWS data leaks happen when identity, network, encryption, and audit boundaries are designed as separate controls by separate teams — a multi-account architecture that treats VPCs, KMS, IAM, and CloudTrail as a unified boundary.

#architecture #system-design #cloud

Sep 8, 2022 7 min read

L2 Deep Dive

AWS E-Commerce Checkout Architecture: SQS, Lambda, Aurora, and DynamoDB

Checkout fails when payment, inventory, order history, and notification are treated as one synchronous request — how to model checkout as one committed decision followed by recoverable asynchronous consequences using SQS, Lambda, Aurora, and DynamoDB.

#architecture #system-design #cloud

Aug 24, 2022 7 min read

L2 Deep Dive

S3 Event Architectures: Durable, Cheap, and Easy to Misorder

S3 event processing is durable and cheap but the event stream and the bucket tell different stories — how to design S3-driven pipelines around ordering guarantees, duplicate delivery, and eventual consistency without data loss.

#architecture #system-design #cloud

Aug 9, 2022 9 min read

L2 Deep Dive

Aurora vs RDS: The Operational Difference Engineers Actually Feel

The real difference between Aurora and RDS shows up during a storage stall, replica lag, or a failover at 03:00: how the two products behave differently under failure and what those differences mean for operational choice and cost.

#architecture #system-design #cloud

Jun 25, 2022 7 min read

L2 Deep Dive

System Design Review Checklist for Senior Engineers

Most system designs fail for reasons visible at review time: overloaded dependencies, ambiguous ownership, unsafe retries, unbounded queues, and missing rollback paths — a checklist senior engineers use to surface those risks early.

#architecture #system-design #cloud

Jun 10, 2022 7 min read

L2 Deep Dive

Multi-Region Architecture: Latency, Consistency, and Blast Radius

Multi-region is usually a failure-containment project, not a scalability project — and deploying across regions exposes every weak assumption in your data model, write ownership strategy, and cross-region blast-radius planning.

#architecture #system-design #cloud

May 11, 2022 7 min read

L2 Deep Dive

Capacity Planning From First Principles: QPS, Fanout, and Hot Keys

Capacity planning fails when teams size for the average request and ignore fanout, hot keys, and bursty traffic — a framework for sizing from QPS, read/write ratios, and peak multipliers before the first incident teaches the lesson.

#architecture #system-design #cloud

Apr 26, 2022 6 min read

L2 Deep Dive

Engineering Fundamentals

Read-After-Write Consistency: The UX Bug That Becomes a Database Bug

Acknowledging a write before the system knows where the next read will land turns a clean product experience into a staleness bug that looks like data loss — how read-after-write consistency works and where it breaks under replica lag.

#architecture #system-design #cloud

Apr 11, 2022 7 min read

L2 Deep Dive

Engineering Fundamentals

Rate Limiting Is a Product Contract, Not Just a Redis Counter

Rate limiting fails when the platform enforces one behavior while the product promised another to clients. The technical mechanism matters less than treating rate limits as a documented contract with defined scope, limits, and error semantics.

#architecture #system-design #cloud

Mar 27, 2022 7 min read

L2 Deep Dive

Engineering Fundamentals

Consistent Hashing: What It Solves and What It Does Not

Consistent hashing is a damage-control mechanism for cluster membership change, not a general scalability strategy — what it limits during node additions and removals, and the tradeoffs that make it unsuitable as a universal sharding approach.

#architecture #system-design #cloud

Mar 12, 2022 7 min read

L2 Deep Dive

Engineering Fundamentals

Idempotency Keys: The Small Table That Saves Distributed Systems

The most reliable distributed systems depend on an unimpressive table with a unique constraint and a saved response — how idempotency keys prevent double charges, duplicate events, and retry amplification at the database layer.

#architecture #system-design #cloud

Feb 25, 2022 8 min read

L2 Deep Dive

Queues vs Streams: The Decision Engineers Keep Reversing

Queues and streams solve different problems: commands vs events, delete-on-acknowledge delivery vs replay, immediate consumption vs historical processing. Teams that choose without understanding the difference reverse the decision under load.

#architecture #system-design #cloud

Feb 10, 2022 7 min read

L2 Deep Dive

Engineering Fundamentals

Caches Do Not Remove Database Load Unless You Design the Miss Path

A cache is not a shield around the database — it is a second traffic control system whose failure mode is a synchronized stampede back to the database. How to design the miss path so cache failures don't become database incidents.

#architecture #system-design #cloud

Jan 26, 2022 8 min read

L2 Deep Dive

Engineering Fundamentals

Load Balancers: The Hidden State Machine in Front of Your App

A load balancer is not a pipe — it is a distributed state machine making routing and health decisions on stale, partial evidence. Its configuration choices propagate directly into application availability and failure modes.

#architecture #system-design #cloud

Jan 11, 2022 8 min read

L2 Deep Dive

System Design Starts With Failure Modes, Not Boxes and Arrows

The first system design question is not 'what are the services' — it is 'what breaks, how fast does it spread, and what evidence tells us the damage is contained.' A framework for failure-mode-first design.

#architecture #system-design #cloud