#checklist

PostgreSQL 18 Replication Upgrade Opportunities

What changes in replication when upgrading from PostgreSQL 14–16 to PostgreSQL 18: parallel apply, pg_createsubscriber, and improved conflict visibility.

#databases #architecture #checklist

Sep 25, 2025 6 min read

L2 Deep Dive

PostgreSQL 18: Features DB Engineers Should Watch

PostgreSQL 18 introduces asynchronous I/O, parallel logical apply, and improved conflict visibility: the changes operators need to understand before upgrading.

Sep 13, 2025 10 min read

L3 Reference Guide

Autovacuum Is a Capacity Problem, Not a Maintenance Task

PostgreSQL vacuum failures often start with blocked cleanup, table bloat, and weak lock observability during peak load.

#databases #failures #checklist

Jul 5, 2025 9 min read

L2 Deep Dive

When Autovacuum Becomes a Backpressure Signal

PostgreSQL vacuum stalls are often symptoms of lock pressure, table bloat, and missing operational visibility.

#databases #failures #checklist

Jun 25, 2025 9 min read

L2 Deep Dive

Parallel AI Agents Need an Operating Model

Running many coding agents only works when git isolation, shared memory, permissions, hooks, and verification are designed as a system.

May 12, 2025 7 min read

L3 Reference Guide

MongoDB Queryable Encryption Architecture Review

A pre-go-live architecture review for MongoDB Queryable Encryption — key management, field classification, query type constraints, driver requirements, and key rotation.

#databases #architecture #checklist

Mar 1, 2025 6 min read

L2 Deep Dive

#ai-engineering #checklist #architecture

Evaluate AI Agents by Completed Work, Not Token Price

Production AI agent selection should measure quality, retries, tokens, latency, and verification cost per completed task.

Dec 20, 2024 6 min read

L2 Deep Dive

Remote Agents Need Deployment, Permissions, and Feedback Loops

Codex mobile turns local agents into remote workflows, but production value depends on deployment, access control, and observability.

#ai-engineering #cloud #checklist

Dec 12, 2024 10 min read

L3 Reference Guide

Prompt Architecture Needs Load Boundaries

The default AI coding setup loads everything into one always-on instruction file. The production alternative is a layered architecture — project memory, task skills, commands, and MCP servers each with a defined load boundary — so context bloat and stale policy stop reaching the model on every turn.

Nov 19, 2024 5 min read

L2 Deep Dive

Engineering Fundamentals

Cost Observability: Build Dashboards That Show Waste Before Finance Finds It

How to expand monitoring beyond uptime by building dashboards that expose underutilized RDS instances, EBS io2 waste, and backup retention drift.

#cloud #architecture #checklist

Oct 24, 2024 6 min read

L2 Deep Dive

PostgreSQL 16/17 Features That Matter to Operators

Which PostgreSQL 16 and 17 changes operators actually need to prepare for: logical replication improvements, vacuum visibility, connection limits, and monitoring additions that change on-call behavior.

Oct 15, 2024 4 min read

L1 Field Note

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

How to position Prometheus and Grafana as the open-source baseline for teams that cannot send every byte of database telemetry to managed services.

#databases #architecture #failures #checklist

Oct 14, 2024 8 min read

L2 Deep Dive

Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup

How to configure Datadog Database Monitoring for PostgreSQL, MySQL, and Aurora — query samples, explain plans, wait event analysis, and the specific Agent settings that make the difference between metric collection and real observability.

Sep 12, 2024 7 min read

L3 Reference Guide

Cloud Architecture Review Checklist for Database-Backed Applications

Review checklist for database-backed cloud applications: connection saturation, migration locking, retry amplification, and region dependency failures.

#architecture #cloud #databases #failures

Sep 9, 2024 6 min read

L2 Deep Dive

Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup

How to instrument PostgreSQL and MySQL with postgres_exporter and mysqld_exporter, configure Prometheus scrape jobs, and build Grafana panels that surface the metrics that matter — with working PromQL queries.

Aug 12, 2024 8 min read

L2 Deep Dive

Database Alert Design: Thresholds That Fire on Real Problems

How to set database alert thresholds that catch real failures without burning the team on autovacuum noise, checkpoint churn, and replication lag spikes — with specific values for PostgreSQL, MySQL, and Aurora.

Jul 22, 2024 8 min read

L2 Deep Dive

MySQL and Aurora Monitoring: The Dashboard That Catches Problems Before Users Do

The seven MySQL and Aurora metric groups that matter for production operations — threads, replication lag, InnoDB buffer pool, slow queries, connections, locks, and disk — with exact SQL, CloudWatch metrics, and alert thresholds.

Jul 8, 2024 7 min read

L2 Deep Dive

PostgreSQL Monitoring: The Dashboard That Surfaces Problems Before Users Do

The eight PostgreSQL metric groups that matter for production operations — queries, connections, replication lag, autovacuum, locks, cache pressure, checkpoint behavior, and bloat — with exact SQL and alert thresholds.

Jun 8, 2024 6 min read

L2 Deep Dive

Runtime Boundaries for Agentic App Builders

A hosted AI app generator fails when the mobile chat becomes the platform — API keys end up in binaries, execution state blurs with chat, and previews break without artifact handoff. The control-plane architecture that keeps these concerns separated.

Jun 4, 2024 4 min read

L1 Field Note

The Database Observability Baseline: What Every DBA Dashboard Must Show

Before you can adopt AI-assisted triage, your database dashboard needs a foundation built on saturation, locking, and lag metrics.

#databases #architecture #failures #checklist

May 27, 2024 7 min read

L2 Deep Dive

AI Agents Need a Control Plane, Not More Interfaces

Production AI agents work best when coding, files, tools, and knowledge workflows share one governed execution model.

May 20, 2024 5 min read

L1 Field Note

Database Security Review for AI Access

Granting an autonomous AI agent access to your database breaks every assumption of traditional RBAC. How to secure databases against unpredictable, unbounded AI queries.

#ai-engineering #databases #checklist

May 16, 2024 6 min read

L2 Deep Dive

Use Coding Agents as a Toolchain, Not a Vendor Bet

A production-minded workflow for running Cursor and Aider together without locking engineering practice to one agent.

May 7, 2024 5 min read

L1 Field Note

MySQL 8.4 LTS: What DBAs Should Check Before Upgrade

MySQL 8.4 is the first long-term support release in the 8.x line, and it brings five breaking changes that require verification before any production upgrade.

Apr 8, 2024 7 min read

L2 Deep Dive

MongoDB Version Upgrade Risk Review

A systematic runbook for assessing MongoDB version upgrade risk — FCV, driver compatibility, deprecated operators, and rollback paths before any production cutover.

#databases #checklist #architecture

Apr 2, 2024 6 min read

L2 Deep Dive

#architecture #ai-engineering #failures #checklist

Durable State for Long-Running LLM Coding Sessions

A practical workflow for separating planning from execution, checkpointing progress in GitHub issues, and resuming multi-phase LLM implementation without context collapse.

Mar 27, 2024 9 min read

L2 Deep Dive

#ai-engineering #architecture #checklist #failures

From Chat to Agents: Designing Goal-to-Result Systems for Real Work

Chat is request-response; agents are task systems that plan, call tools, iterate, and stop when done. The minimum architecture — loop, tools, bounded memory, stopping conditions — required to make the transition from chat reliable.

Mar 20, 2024 7 min read

L2 Deep Dive

#ai-engineering #architecture #failures #checklist

Why Long-Running AI Coding Sessions Fail

A practical control plane for keeping AI coding sessions on track: separate planning from execution, validate deterministically, reset context aggressively, and isolate parallel work.

Mar 18, 2024 10 min read

L3 Reference Guide

Index Debt Review: How to Find Bad, Missing, and Duplicate Indexes

A SQL-driven audit workflow for identifying unused, duplicate, bloated, and missing indexes in PostgreSQL before they drain write performance and storage.

Mar 12, 2024 8 min read

L2 Deep Dive

Cloud & Platform

Internal Developer Platform Reference Architecture: Catalog, IaC, CI/CD, Policy, and Observability

Reference architecture for an IDP as a control plane—connecting service catalog, IaC, CI/CD pipelines, policy enforcement, and observability feedback.

#architecture #cloud #checklist

Feb 26, 2024 10 min read

L3 Reference Guide

PostgreSQL Statistics Drift Workflow

When the query planner gets row estimates wrong, queries regress silently. This runbook diagnoses statistics drift and restores accurate plans.

Jul 17, 2023 10 min read

L3 Reference Guide

Logical Replication Failure Workflow

A diagnostic runbook for logical replication lag, apply worker failures, replication conflicts, and schema drift between publisher and subscriber.

Jun 26, 2023 13 min read

L3 Reference Guide

Schema Deployment Risk Checklist

Assessing lock type, table size, reversibility, and rollback plan before every schema migration — a structured checklist for zero-downtime deployments.

#databases #checklist #architecture

Jun 5, 2023 10 min read

L3 Reference Guide

Cloud Database Cost Triage: Storage, IOPS, CPU, Replicas

A structured runbook for identifying which cost dimension is driving your AWS RDS or Aurora bill before making any changes.

#databases #cloud #checklist

May 15, 2023 11 min read

L3 Reference Guide

Database Backup Validation Workflow

A repeatable runbook for proving that your database backups are actually restorable — with exact commands, decision tree, and automation patterns.

Apr 3, 2023 10 min read

L3 Reference Guide

PostgreSQL Connection Storm Runbook

Diagnosing and resolving connection exhaustion in PostgreSQL: too many clients, idle-in-transaction accumulation, and the case for connection pooling.

Mar 6, 2023 8 min read

L2 Deep Dive

Aurora MySQL Writer CPU Spike Workflow

A systematic runbook for diagnosing Aurora MySQL writer CPU spikes — from Performance Insights through lock contention, long transactions, and read offload.

#databases #cloud #checklist #failures

Feb 6, 2023 8 min read

L2 Deep Dive

MySQL Replication Lag Decision Tree

A systematic runbook for diagnosing MySQL replication lag — from initial SHOW REPLICA STATUS to parallel apply, long transactions, and relay log space.

Jan 16, 2023 9 min read

L2 Deep Dive

PostgreSQL Autovacuum Failure Workflow

A step-by-step runbook for diagnosing and resolving autovacuum failures: dead tuple accumulation, bloat, and transaction ID wraparound risk.

Nov 14, 2022 5 min read

L2 Deep Dive

Backups Are Not Recovery: The DBA Rule Everyone Learns Late

A backup file proves you captured data. Recovery is the process of producing a running, consistent database on a different system inside your RTO. They are not the same thing, and confusing them is how incidents get worse.

#databases #failures #checklist

Sep 26, 2022 7 min read

L2 Deep Dive

MongoDB Query Performance Workflow

A systematic runbook for diagnosing slow MongoDB queries — from explain output through COLLSCAN, index selectivity, in-memory sort, and WiredTiger cache pressure.

Jun 6, 2022 5 min read

L1 Field Note

MySQL EXPLAIN: Reading the Plan Without Guessing

How to read MySQL EXPLAIN output systematically — type column, key column, rows estimate, and Extra flags — so you stop adding indexes blindly.

May 23, 2022 11 min read

L3 Reference Guide

MySQL Slow Query Playbook: From Slow Log to Fix

A repeatable workflow for diagnosing MySQL slow queries — from enabling the slow log through reading EXPLAIN output to committing a safe fix.

Apr 11, 2022 5 min read

L1 Field Note

PostgreSQL Autovacuum: What Every Engineer Should Know

Autovacuum is not optional maintenance — it is the mechanism that prevents table bloat and transaction ID wraparound from taking your database offline.

Mar 21, 2022 12 min read

L3 Reference Guide

PostgreSQL Slow Query Triage Workflow

A structured runbook for diagnosing slow query root causes in PostgreSQL — missing indexes, stale statistics, lock contention, and I/O saturation — in the order that wastes the least time.