Best Of

The posts worth reading twice

These are the posts that have stood up to production reality: concrete failure modes, verified configurations, decision frameworks with tradeoffs, and no fabricated metrics. Start anywhere; each post stands alone.

Database Internals

How PostgreSQL, MySQL, and storage engines actually work — the parts that cause production problems.

MVCC Explained Like a Database Engineer
How readers and writers avoid blocking each other, and why misunderstanding it leads to table bloat and stalled vacuums.

WAL Explained for Database Engineers
Write-ahead logging, LSN ordering, crash recovery, and replication lag: the shared foundation behind most ACID databases.

PostgreSQL Autovacuum: What Every Engineer Should Know
Why autovacuum is a capacity problem, not a maintenance task, and how to stop it from surprising you at 3 AM.

B-Tree vs LSM Tree: The Storage Engine Tradeoff
The read/write tradeoff that determines which storage engine fits your workload, and why the choice is so hard to reverse.

Logical Replication vs Physical Replication in PostgreSQL
Physical replication copies bytes; logical replication copies row changes. Confusing the two causes silent schema drift and failed upgrades.

Production Runbooks

Operational playbooks for when something is already wrong: field notes, not tutorials.

PostgreSQL Slow Query Triage Workflow
From symptom to fix: how to isolate slow queries, read execution plans, and choose among an index, a query rewrite, and a configuration change.

MySQL Slow Query Playbook: From Slow Log to Fix
How to read the slow query log, use EXPLAIN, and find the right fix, with a decision tree that covers the common cases.

Schema Deployment Risk Checklist
The 12 checks that catch most migration failures before they reach production, with PostgreSQL- and MySQL-specific behavior.

PostgreSQL 18 Replication Upgrade Opportunities
What parallel apply and pg_createsubscriber change about your upgrade plan, and the surprises that happen in the first week.

AI Engineering

Building agents and AI systems that are safe for production — authority boundaries, evaluation, and database access patterns.

The Agent Should Not Have Your App Credentials
How to expose PostgreSQL to AI coding agents with database-enforced read-only access, project-scoped MCP config, and secrets kept outside Git.

Natural Language SQL Agents Need Database Guardrails
The SQL agent control plane that makes prompt-to-query workflows safe: role design, SQL parsing, planner checks, and audit records.

Runtime Boundaries for Agentic App Builders
How to contain what an agent can read, write, call, and spend, before the demo becomes a production system.

Evals Are the New Unit Tests for Agents
Why agents need evaluation harnesses, not just test coverage, and how to build a regression suite that catches model behavior drift.

Architecture Decisions

Decision frameworks for the calls that are hard to reverse — cloud, database, and distributed system choices.

System Design Starts With Failure Modes, Not Boxes and Arrows
Why drawing the happy path first produces a diagram that looks correct but fails badly, and how to lead with failure instead.

Queues vs Streams: The Decision Engineers Keep Reversing
The one question that determines which is right: is this data a disposable unit of work, or a durable fact that future systems must reinterpret?

Aurora vs RDS: The Operational Difference Engineers Actually Feel
What actually differs at 3 AM during a storage stall or failover, and which failure boundary you are buying with each choice.

Idempotency Keys: The Small Table That Saves Distributed Systems
How a single lookup table prevents duplicate charges, duplicate emails, and duplicate state, and the implementation most systems get wrong once.

AWS vs Azure vs GCP vs OCI: Database Decision Framework
How to pick a cloud provider by matching its managed database failure behavior to your system's dominant recovery requirement, rather than by benchmark.

Infrastructure & Platform

Terraform, GitOps, and platform engineering — with the operational edge cases that tutorials skip.

Terraform State Is a Production Dependency
Why the state file deserves the same access controls, backup policy, and incident runbook as your database, and what happens when it doesn't get them.

Terraform Workspaces vs Separate State
The isolation boundary question: when workspace selection is enough, and when production needs its own backend, credentials, and approval path.

Terraform Plan Review: What Senior Engineers Look For
The 11 things a plan reviewer checks that engineers new to Terraform consistently miss, written as a repeatable checklist.

Service Catalogs Are Not Portals. They Are Control Planes.
Why the platform team keeps rebuilding its catalog, and the architectural shift that makes it a control plane instead of a glorified wiki.

Security & Encryption

Encryption decisions, key management, and credential isolation — the parts that have permanent consequences.

pgcrypto vs KMS vs HSM: Decision Framework
Where key material should actually live: the decision matrix for choosing among database-layer, cloud-service, and hardware encryption.

Automation Readiness Review: Inputs, State, Permissions, Rollback, and Audit
The five dimensions that distinguish safe automation from automation that works until it doesn't, and how to verify each before production.