Best Of
The posts worth reading twice
These are the posts that have held up against production reality: concrete failure modes, verified configurations, decision frameworks with tradeoffs, and no fabricated metrics. Start anywhere; each post stands alone.
Database Internals
How PostgreSQL, MySQL, and storage engines actually work — the parts that cause production problems.
- MVCC Explained Like a Database Engineer
  How readers and writers avoid blocking each other, and why misunderstanding this causes table bloat and stalled vacuums.
- WAL Explained for Database Engineers
  Write-ahead logging, LSN ordering, crash recovery, and replication lag: the shared foundation behind most ACID databases.
- PostgreSQL Autovacuum: What Every Engineer Should Know
  Why autovacuum is a capacity problem, not a maintenance task, and how to stop it from surprising you at 3 AM.
- B-Tree vs LSM Tree: The Storage Engine Tradeoff
  The read/write tradeoff that determines which storage engine fits your workload, and why the choice is so costly to reverse.
- Logical Replication vs Physical Replication in PostgreSQL
  Physical replication copies bytes; logical replication copies row changes. Confusing the two causes silent schema drift and failed upgrades.
Production Runbooks
Operational playbooks for when something is already wrong: field notes, not tutorials.
- PostgreSQL Slow Query Triage Workflow
  From symptom to fix: how to isolate slow queries, read execution plans, and decide between an index, a rewrite, or a configuration change.
- MySQL Slow Query Playbook: From Slow Log to Fix
  How to read the slow query log, use EXPLAIN, and find the right fix, with a decision tree that covers the common cases.
- Schema Deployment Risk Checklist
  The 12 checks that catch most migration failures before they reach production, with specific PostgreSQL and MySQL behavior.
- PostgreSQL 18 Replication Upgrade Opportunities
  What parallel apply and pg_createsubscriber change about your upgrade plan, and the surprises that happen in the first week.
AI Engineering
Building agents and AI systems that are safe for production — authority boundaries, evaluation, and database access patterns.
- The Agent Should Not Have Your App Credentials
  How to expose PostgreSQL to AI coding agents with database-enforced read-only access, project-scoped MCP config, and secrets outside Git.
- Natural Language SQL Agents Need Database Guardrails
  The SQL agent control plane that makes prompt-to-query workflows safe: role design, SQL parsing, planner checks, and audit records.
- Runtime Boundaries for Agentic App Builders
  How to contain what an agent can read, write, call, and spend, before the demo becomes a production system.
- Evals Are the New Unit Tests for Agents
  Why agents need evaluation harnesses, not just test coverage, and how to build a regression suite that catches model behavior drift.
Architecture Decisions
Decision frameworks for the calls that are hard to reverse — cloud, database, and distributed system choices.
- System Design Starts With Failure Modes, Not Boxes and Arrows
  Why drawing the happy path first produces a diagram that looks correct but fails badly, and how to lead with failure instead.
- Queues vs Streams: The Decision Engineers Keep Reversing
  The one question that determines which is right: is this data a disposable unit of work, or a durable fact that future systems must reinterpret?
- Aurora vs RDS: The Operational Difference Engineers Actually Feel
  What actually differs at 3 AM during a storage stall or failover, and which failure boundary you are buying with each choice.
- Idempotency Keys: The Small Table That Saves Distributed Systems
  How a single lookup table prevents duplicate charges, duplicate emails, and duplicate state, and the implementation detail most systems get wrong once.
- AWS vs Azure vs GCP vs OCI: Database Decision Framework
  How to pick a cloud provider by matching its managed database failure behavior to your system's dominant recovery requirement, not by benchmark.
Infrastructure & Platform
Terraform, GitOps, and platform engineering — with the operational edge cases that tutorials skip.
- Terraform State Is a Production Dependency
  Why the state file deserves the same access controls, backup policy, and incident runbook as your database, and what happens when it doesn't get them.
- Terraform Workspaces vs Separate State
  The isolation boundary question: when workspace selection is enough, and when production needs its own backend, credentials, and approval path.
- Terraform Plan Review: What Senior Engineers Look For
  The 11 things a plan reviewer checks that engineers new to Terraform consistently miss, written as a repeatable checklist.
- Service Catalogs Are Not Portals. They Are Control Planes.
  Why the platform team keeps rebuilding its catalog, and the architectural shift that makes it a control plane instead of a glorified wiki.
Security & Encryption
Encryption decisions, key management, and credential isolation — the parts that have permanent consequences.
- pgcrypto vs KMS vs HSM: Decision Framework
  Where key material should actually live: a decision matrix for choosing between database-layer, cloud-service, and hardware-backed encryption.
- Automation Readiness Review: Inputs, State, Permissions, Rollback, and Audit
  The five dimensions that distinguish safe automation from automation that works until it doesn't, and how to verify each before production.