System design notes centered on data boundaries, failure modes, consistency, queues, caches, backpressure, capacity, and database-backed product workflows.
98 postsSystem Design
Who This Is For
Backend engineers and senior engineers preparing for system design interviews or designing real systems — focused on the data layer decisions that interviews and design docs routinely get wrong.
What You Will Be Able to Do
Explain CAP theorem in operational terms, not theoretical ones
Choose the right isolation level for a given consistency requirement
Design idempotency and exactly-once semantics across distributed writes
Model queue vs cache vs database tradeoffs with specific failure modes for each
Prerequisites
You write backend code and have some exposure to distributed systems concepts. This series fills in the gaps, not the basics.
How CPU, GPU, and TPU architectures differ in ways that matter for databases and AI workloads — and which compute class to reach for when adding vector search, embedding generation, or GPU-accelerated analytics.
How CPU, GPU, and TPU architectures differ in ways that matter for databases and AI workloads — and which compute class to reach for when adding vector search, embedding generation, or GPU-accelerated analytics.
Without a connection pool, traffic spikes exhaust OS-level resources before a single slow query runs — here is what actually happens and how to fix it.
The skills that make a good cost-aware DBA — measuring usage, finding structural waste, balancing cost against reliability — transfer almost directly to AI workloads. Database engineers are unusually well positioned to own AI cost.
Aurora Global Database delivers sub-second cross-region replication and under-one-minute RTO for disaster recovery — but it is not active-active, and application failover is never automatic.
A practical, DBA-friendly explanation of why modern analytical databases are increasingly using GPUs for scans, joins, aggregations, and AI-adjacent workloads.
How to set database alert thresholds that catch real failures without burning the team on autovacuum noise, checkpoint churn, and replication lag spikes — with specific values for PostgreSQL, MySQL, and Aurora.
How multi-version concurrency control lets readers and writers run without blocking each other — and why misunderstanding it causes table bloat, undo log growth, and stalled vacuums.
A DBA-friendly walkthrough of how modern GPU databases execute large analytical SQL queries using columnar storage, parallel scans, and GPU aggregation.
How to instrument PostgreSQL and MySQL with postgres_exporter and mysqld_exporter, configure Prometheus scrape jobs, and build Grafana panels that surface the metrics that matter — with working PromQL queries.
How to configure Datadog Database Monitoring for PostgreSQL, MySQL, and Aurora — query samples, explain plans, wait event analysis, and the specific Agent settings that make the difference between metric collection and real observability.
Why monitoring autonomous SRE agents requires tracking tool-call hallucinations, context window saturation, and recursive retry loops, rather than just basic CPU metrics.
The first system design question is not 'what are the services' — it is 'what breaks, how fast does it spread, and what evidence tells us the damage is contained.' A framework for failure-mode-first design.
A cache is not a shield around the database — it is a second traffic control system whose failure mode is a synchronized stampede back to the database. How to design the miss path so cache failures don't become database incidents.
Queues and streams solve different problems: commands vs events, at-most-once delivery vs replay, immediate consumption vs historical processing — and teams that choose without understanding the difference reverse the decision under load.
The most reliable distributed systems depend on an unimpressive table with a unique constraint and a saved response — how idempotency keys prevent double charges, duplicate events, and retry amplification at the database layer.
Acknowledging a write before the system knows where the next read will land turns a clean product experience into a staleness bug that looks like data loss — how read-after-write consistency works and where it breaks under replica lag.
Capacity planning fails when teams size for the average request and ignore fanout, hot keys, and bursty traffic — a framework for sizing from QPS, read/write ratios, and peak multipliers before the first incident teaches the lesson.
Healthy systems preserve their ability to recover by refusing work before a failure becomes contagious — how to design backpressure at the queue boundary, connection pool, and API layer so overload stops propagating upstream.
Multi-region is usually a failure-containment project, not a scalability project — and deploying across regions exposes every weak assumption in your data model, write ownership strategy, and cross-region blast-radius planning.
Most system designs fail for reasons visible at review time: overloaded dependencies, ambiguous ownership, unsafe retries, unbounded queues, and missing rollback paths — a checklist senior engineers use to surface those risks early.
The standard AWS web-tier stack works until the first dependency slows down, the cache goes cold, or a queue starts redriving poison messages — the failure modes hidden inside the ALB, ECS, RDS, ElastiCache, and SQS reference architecture.
The two mechanisms databases use to survive crashes — redo brings committed changes forward, undo rolls back uncommitted ones — and why the distinction matters operationally.
Database bills grow when ownership, workload shape, and control loops drift apart — a structured triage approach for RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch spend before it becomes an emergency.
What a checkpoint actually does in PostgreSQL, why dirty page flush matters for recovery time, and what engineers should monitor to avoid checkpoint pressure.
Azure applications typically fail first at the edges: Front Door configuration, App Service connection pools, SQL failover groups, Redis cache invalidation, and Service Bus backlog — a reference architecture that makes these failure boundaries explicit.
Azure database recovery beyond 'we have backups': failover group cutover, geo-replication lag, and backup restore testing as the real reliability floor.
WiredTiger's internal cache is MongoDB's primary memory tier — how to read its metrics, recognize eviction pressure, and size it correctly for your working set.
Control plane coupling, Spanner split boundaries, and untested Pub/Sub failover are why GCP multi-region architectures break before the region goes dark.
How OCI load balancing, OKE, Autonomous Database, cache, and queue layers interact — and why cross-service ambiguity assumptions cause the first failure.
Oracle Autonomous Database automates patching and scaling, but cannot substitute for query intent, schema decisions, and access patterns the team must own.
Isolating the OCI Autonomous Transaction Processing write path from catalog and analytics load using GoldenGate replication and Object Storage offloading.
Catalog, cart, orders, inventory, and payments as five distinct consistency problems — why a shared transaction boundary causes e-commerce system failures.
The decision framework for choosing between a cache, a queue, and a database — including the failure modes that appear when engineers use the wrong one for the job.
Payment idempotency keys and atomic state transitions prevent the double-charge failure where a transaction succeeds while surrounding systems log failure.
Propagating a catalog update from database commit through Elasticsearch, CDN edge cache, and application cache without stranding stale reads downstream.
The difference between read committed, repeatable read, and serializable isolation in operational terms — and why most applications are running with weaker guarantees than engineers assume.
API gateway incidents are misdiagnosed when teams treat them as proxy failures instead of control-plane failures with downstream saturation blast radius.
Granting an autonomous AI agent access to your database breaks every assumption of traditional RBAC. How to secure databases against unpredictable, unbounded AI queries.
Database changes in CI/CD require separate gates for schema migrations, backfills, and expand-contract patterns — not just a shell command before deployment.
Database migration cutover using dual writes, CDC, backfill, and freeze phases — with rollback boundaries for when 'almost synchronized' is not an operational state.
Splitting a service without relocating the database boundary creates distributed coordination overhead worse than the monolith the split was meant to fix.
How to choose between AWS, Azure, GCP, and OCI for database-backed systems by matching managed database failure behavior to your system's dominant recovery requirement.
Queue time, flake rate, lead time, failure domains, and change risk as CI/CD signals that reveal whether a delivery system is becoming safer or just busier.
Ownership boundaries for OLTP, search, cache, queue, and warehouse in a commerce data plane — so no datastore becomes source of truth during an incident.
Prompt-level guardrails fail open when the agent misinterprets context. The only boundary that mechanically rejects destructive SQL is the database — dedicated read-only roles, sanitized view schemas, and a network path that application credentials never touch.
A 2027 cloud database architecture roadmap for teams that can no longer satisfy consistency, latency, residency, and recovery SLOs with a single engine.
Replacing the translation overhead between business questions and SQL queries requires an architecture that bridges LLM intent parsing with strict execution validation and schema retrieval.
Building a database operations agent requires a workflow framework, production observability, and scalable inference — April 2025 shipped open-source solutions for all three layers simultaneously.
May 2025's most-starred new projects solve three specific database team problems: backup restores that are never verified, internal knowledge that can't be retrieved, and AI agents blind to your schema history.
The risk in a natural-language SQL agent is not bad SQL — it is authority compilation: a user sentence becomes a database operation unless the control plane proves, before execution, which role, rows, cost, and columns the query is allowed to touch.
The 2026 automation priorities for SRE, DevOps, and database teams: what to finish, what to stop maintaining manually, and where agent workflows are actually production-ready.
Three November 2025 open-source releases eliminate manual work from three engineering reliability tasks — multi-database backup verification, self-hosted log and trace collection, and SQL static analysis in CI pipelines.
Database repositories contain hidden rules human reviewers know: never add a blocking index at peak hours, never widen IAM without owner approval. Agent review surfaces these violations before merge — without displacing the human judgment that set the rules.
Three May 2026 breakout projects close the gaps that stop database teams from moving schema changes, query assistance, and operational workflows to AI: declarative Postgres migrations, local LLM inference, and a full agent platform.
When to choose Azure Flexible Server vs Citus for PostgreSQL on Azure — failover behavior, connection pooling, and the workload shapes where each architecture wins and breaks.
How Cassandra's commit log, Memtable, and SSTable pipeline works, why write amplification is the dominant operational cost, and how compaction strategy selection changes it.
Aurora cost hides in places the console doesn't foreground — I/O charges, oversized writers and readers, replica sprawl, and storage. A structured way to find and reduce each without hurting reliability.
A practitioner walkthrough of the review method: what to look at, in what order, how to quantify an opportunity honestly, and how to turn findings into a prioritized 30/60/90-day plan.
Datadog Database Monitoring can surface enormous detail — and bill for it. The skill is choosing the few signals that answer real cost and reliability questions, and not paying to collect noise nobody acts on.
Infrastructure as Code becomes operationally safe only when the state store has concurrency control, durability, auditability, and documented recovery procedures — treating Terraform backends as production databases, not build artifacts.
Database Terraform modules fail when they hide operational decisions behind convenient defaults — a checklist covering parameter groups, backup policies, encryption, and the boundaries that must never be automated away.
Database provisioning via catalog request and Terraform module: the policy and audit gates that make self-service trustworthy to security and operations.
Python database maintenance jobs that skip lock checks, batch limits, and replication lag awareness will corrupt data or starve live queries under load.