Databases

Deep Dives

L2 and L3 posts with architecture, reliability, and tradeoff detail.

May 28, 2026 17 min read

L3 Reference Guide

Per-App Postgres on Kubernetes Changes the Failure Boundary

How CloudNativePG, GitOps, and external secrets make per-application Postgres viable without hiding the operational cost.

May 25, 2026 6 min read

L2 Deep Dive

Azure Database for PostgreSQL: Flexible Server vs Hyperscale (Citus) Architecture Decision

When to choose Azure Flexible Server vs Citus for PostgreSQL on Azure — failover behavior, connection pooling, and the workload shapes where each architecture wins and breaks.

May 25, 2026 7 min read

L2 Deep Dive

Cassandra Write Path Fundamentals for Database Engineers

How Cassandra's commit log, memtable, and SSTable pipeline works, why write amplification is the dominant operational cost, and how compaction strategy selection changes it.

May 25, 2026 6 min read

L2 Deep Dive

GCP AlloyDB vs Cloud SQL for PostgreSQL: When to Upgrade

When Cloud SQL's managed PostgreSQL hits its limits and AlloyDB's columnar cache and HTAP architecture become worth the migration complexity and cost jump.

May 24, 2026 9 min read

L2 Deep Dive

The Stack for AI-Accelerated Database Operations Is Now Open Source

Three May 2026 breakout projects close the gaps that stop database teams from moving schema changes, query assistance, and operational workflows to AI: declarative Postgres migrations, local LLM inference, and a full agent platform.

May 16, 2026 6 min read

L2 Deep Dive

Stop Writing Ad-Hoc Queries: Build a Skill Backbone for Your DB Engineering Workflows

How to codify repetitive DB tasks into testable, reusable Claude skills that produce consistent SQL, runbooks, and migration outputs instead of one-off chat prompts.

Latest in Databases

Jun 15, 2026 4 min read

L1 Field Note

Datadog DBM: What Database Teams Should Actually Monitor

Datadog Database Monitoring can surface enormous detail — and bill for it. The skill is choosing the few signals that answer real cost and reliability questions, and not paying to collect noise nobody acts on.

#databases #observability #cost #postgresql

Jun 13, 2026 4 min read

L1 Field Note

Why Database Engineers Should Care About AI Cost Engineering

The skills that make a good cost-aware DBA — measuring usage, finding structural waste, balancing cost against reliability — transfer almost directly to AI workloads. Database engineers are unusually well positioned to own AI cost.

#ai #cost #databases #career

Jun 12, 2026 4 min read

L1 Field Note

How to Run a Database Cost & Reliability Review

A practitioner walkthrough of the review method: what to look at, in what order, how to quantify an opportunity honestly, and how to turn findings into a prioritized 30/60/90-day plan.

#databases #cost #reliability #postgresql

Jun 11, 2026 3 min read

L1 Field Note

Aurora Cost Optimization: The Hidden Database Bill

Aurora cost hides in places the console doesn't foreground — I/O charges, oversized writers and readers, replica sprawl, and storage. A structured way to find and reduce each without hurting reliability.

#databases #cloud #cost #aurora

Jun 10, 2026 3 min read

L1 Field Note

PostgreSQL Bloat, Index Waste, and Cloud Cost

Table and index bloat and unused indexes are well-known Postgres problems — and direct cloud-cost problems: wasted storage, write amplification, and extra I/O. How to measure both with read-only queries and remediate safely.

#postgresql #databases #cost #performance

May 28, 2026 17 min read

L3 Reference Guide

Per-App Postgres on Kubernetes Changes the Failure Boundary

How CloudNativePG, GitOps, and external secrets make per-application Postgres viable without hiding the operational cost.

All Databases Posts

Jun 15, 2026 4 min read

L1 Field Note

Datadog DBM: What Database Teams Should Actually Monitor

#databases #observability #cost #postgresql

Jun 13, 2026 4 min read

L1 Field Note

Why Database Engineers Should Care About AI Cost Engineering

#ai #cost #databases #career

Jun 12, 2026 4 min read

L1 Field Note

How to Run a Database Cost & Reliability Review

A practitioner walkthrough of the review method: what to look at, in what order, how to quantify an opportunity honestly, and how to turn findings into a prioritized 30/60/90-day plan.

#databases #cost #reliability #postgresql

Jun 11, 2026 3 min read

L1 Field Note

Aurora Cost Optimization: The Hidden Database Bill

#databases #cloud #cost #aurora

Jun 10, 2026 3 min read

L1 Field Note

PostgreSQL Bloat, Index Waste, and Cloud Cost

#postgresql #databases #cost #performance

May 28, 2026 17 min read

L3 Reference Guide

Per-App Postgres on Kubernetes Changes the Failure Boundary

How CloudNativePG, GitOps, and external secrets make per-application Postgres viable without hiding the operational cost.

May 25, 2026 6 min read

L2 Deep Dive

Azure Database for PostgreSQL: Flexible Server vs Hyperscale (Citus) Architecture Decision

When to choose Azure Flexible Server vs Citus for PostgreSQL on Azure — failover behavior, connection pooling, and the workload shapes where each architecture wins and breaks.

May 25, 2026 7 min read

L2 Deep Dive

Cassandra Write Path Fundamentals for Database Engineers

How Cassandra's commit log, memtable, and SSTable pipeline works, why write amplification is the dominant operational cost, and how compaction strategy selection changes it.

May 25, 2026 6 min read

L2 Deep Dive

GCP AlloyDB vs Cloud SQL for PostgreSQL: When to Upgrade

When Cloud SQL's managed PostgreSQL hits its limits and AlloyDB's columnar cache and HTAP architecture become worth the migration complexity and cost jump.

May 24, 2026 9 min read

L2 Deep Dive

The Stack for AI-Accelerated Database Operations Is Now Open Source

May 16, 2026 6 min read

L2 Deep Dive

Stop Writing Ad-Hoc Queries: Build a Skill Backbone for Your DB Engineering Workflows

How to codify repetitive DB tasks into testable, reusable Claude skills that produce consistent SQL, runbooks, and migration outputs instead of one-off chat prompts.

Apr 22, 2026 7 min read

L2 Deep Dive

Top GitHub Breakouts: March 2026 — Agent Adaptation and Production-Scale Vector Search

The second wave of March 2026 breakouts: an agent that learns from every conversation, a Rust vector index that outperforms FAISS at a fraction of the memory, and a Kubernetes-native agent control plane.

Apr 16, 2026 2 min read

L1 Field Note

SQL Server to PostgreSQL Migration Cost Defense Checklist

A pragmatic checklist to defend the business case for migrating away from Microsoft SQL Server.

#checklist #databases

Mar 25, 2026 2 min read

L1 Field Note

Oracle Cloud BYOL: True Cost Analysis Beyond the Headline Rate

Understanding the financial nuances, OCPU conversions, and hidden costs of bringing your Oracle licenses to OCI.

#databases #cloud

Mar 11, 2026 2 min read

L1 Field Note

Oracle to Aurora PostgreSQL: License Cost Elimination in Practice

The engineering reality and ROI of migrating from Oracle to Amazon Aurora PostgreSQL.

Mar 4, 2026 2 min read

L1 Field Note

AWS RDS Oracle and SQL Server: The License Cost Nobody Talks About

Why the default License-Included model on AWS RDS is a financial trap for enterprise database workloads.

#databases #cloud #failures

Feb 25, 2026 2 min read

L2 Deep Dive

Azure Hybrid Benefit for SQL Server: The Exact Math

A deep dive into the cost savings and mechanics of applying Azure Hybrid Benefit to SQL Server deployments.

Feb 18, 2026 2 min read

L1 Field Note

Azure Synapse Cost Optimization: DWU Right-Sizing, Serverless, and Hybrid Benefit

How to reduce your Azure Synapse compute bill by right-sizing dedicated pools and offloading to serverless.

Jan 30, 2026 4 min read

L1 Field Note

Database Runbooks as Agent Contracts

A reference operating model for turning human database runbooks into machine-usable agent contracts.

#databases #ai-engineering #architecture #checklist

Jan 23, 2026 4 min read

L1 Field Note

Repo-Embedded Skills for Database Teams

Why database teams should store agent instructions, runbook contracts, and review policies in the repository instead of in memory.

Jan 20, 2026 4 min read

L1 Field Note

Agentic Code Review for Database Repositories

Database repositories contain hidden rules human reviewers know: never add a blocking index at peak hours, never widen IAM without owner approval. Agent review surfaces these violations before merge — without displacing the human judgment that set the rules.

Dec 20, 2025 8 min read

L2 Deep Dive

Automated Reliability Across the Stack: Database Backups, Platform Observability, and SQL Quality (November 2025)

Three November 2025 open-source releases eliminate manual work from three engineering reliability tasks — multi-database backup verification, self-hosted log and trace collection, and SQL static analysis in CI pipelines.

Oct 25, 2025 11 min read

L3 Reference Guide

Torn Page Protection Belongs Off the Foreground Path

A PostgreSQL kernel experiment shows why moving torn-page protection from WAL to background flush can change write latency.

Oct 7, 2025 13 min read

L2 Deep Dive

PostgreSQL 18 Replication Upgrade Opportunities

What changes in replication when upgrading from PostgreSQL 14–16 to PostgreSQL 18: parallel apply, pg_createsubscriber, and surfaced conflict visibility.

#databases #architecture #checklist

Sep 27, 2025 7 min read

L2 Deep Dive

Top GitHub Breakouts: August 2025 — Part II

The highest-starred new open-source projects in August 2025 where AI takes over cloud operations, infrastructure provisioning, and production Postgres coding.

#ai-engineering #cloud #databases

Sep 25, 2025 6 min read

L2 Deep Dive

PostgreSQL 18: Features DB Engineers Should Watch

PostgreSQL 18 changes how the server handles I/O and replication: asynchronous I/O, parallel logical apply, and improved conflict visibility are what operators need to understand before upgrading.

Sep 13, 2025 10 min read

L3 Reference Guide

Autovacuum Is a Capacity Problem, Not a Maintenance Task

PostgreSQL vacuum failures often start with blocked cleanup, table bloat, and weak lock observability during peak load.

#databases #failures #checklist

Sep 6, 2025 7 min read

L2 Deep Dive

Top GitHub Breakouts: August 2025 — Part I

The gap between AI prototype and production system is routing tables, deployment YAML, and observability scaffolding. August 2025's top breakouts targeted exactly the code engineers keep rewriting: model routing logic, agent deployment manifests, and PostgreSQL diagnostics.

#ai-engineering #architecture #databases

Aug 30, 2025 12 min read

L3 Reference Guide

The Semantics AI Misses When Porting Storage Designs

Why a PostgreSQL double write buffer prototype failed despite compiling, and what it reveals about AI-assisted systems design.

#databases #ai-engineering #failures

Jul 26, 2025 19 min read

L3 Reference Guide

Natural Language SQL Agents Need Database Guardrails

The risk in a natural-language SQL agent is not bad SQL — it is authority compilation: a user sentence becomes a database operation unless the control plane proves, before execution, which role, rows, cost, and columns the query is allowed to touch.

Jul 12, 2025 8 min read

L2 Deep Dive

Covering Indexes Are Not Enough Without Visibility

PostgreSQL index-only scans only stay fast when covering indexes and visibility map maintenance work together.

Jul 5, 2025 9 min read

L2 Deep Dive

When Autovacuum Becomes a Backpressure Signal

PostgreSQL vacuum stalls are often symptoms of lock pressure, table bloat, and missing operational visibility.

#databases #failures #checklist

Jun 22, 2025 8 min read

L2 Deep Dive

Top GitHub Breakouts: May 2025 — Operational Baseline in a Config File

Three May 2025 open-source projects replace multi-tool assembly in document ingestion, deployment governance, and PostgreSQL backup with single-binary or configuration-first alternatives.

Jun 14, 2025 9 min read

L2 Deep Dive

Three Open-Source Tools Filling the Gaps in Database Operations (May 2025)

May 2025's most-starred new projects solve three specific database team problems: backup restores that are never verified, internal knowledge that can't be retrieved, and AI agents blind to your schema history.

May 12, 2025 7 min read

L3 Reference Guide

MongoDB Queryable Encryption Architecture Review

A pre-go-live architecture review for MongoDB Queryable Encryption — key management, field classification, query type constraints, driver requirements, and key rotation.

#databases #architecture #checklist

Apr 26, 2025 8 min read

L2 Deep Dive

Per-Application Postgres on Kubernetes Is an Isolation Strategy

How CloudNativePG, GitOps, and External Secrets turn Postgres-on-Kubernetes into an operational isolation pattern.

Apr 15, 2025 14 min read

L3 Reference Guide

GitHub Breakouts: Q1 2025 — The Quarter's Top Productivity Shifts

Six high-traction open-source projects from Q1 2025 converged on eliminating the manual integration layer between AI assistants and production systems across databases, platform operations, and developer tooling.

#ai-engineering #architecture #databases #cloud

Apr 8, 2025 7 min read

L2 Deep Dive

Python Automation Framework for DB and Cloud Ops: Architecture and Failure Model

DB and cloud automation fails when partial failures leave the database, cloud account, and ticketing system describing different operation states.

#architecture #cloud #databases

Mar 1, 2025 9 min read

L2 Deep Dive

Natural Language SQL Agents Need Guardrails Before Orchestration

How Postgres chat agents turn intent into SQL, and why production systems need schema controls, validation, and auditability.

Feb 22, 2025 8 min read

L2 Deep Dive

Double Write Buffers Fail at the I/O Boundary

Why porting InnoDB’s double write buffer to PostgreSQL breaks on buffered I/O, fsync semantics, and background writer design.

#databases #ai-engineering #failures

Dec 11, 2024 7 min read

L2 Deep Dive

The 2027 Cloud Database Architecture Roadmap

A 2027 cloud database architecture roadmap for teams that can no longer satisfy consistency, latency, residency, and recovery SLOs with a single engine.

Oct 24, 2024 6 min read

L2 Deep Dive

PostgreSQL 16/17 Features That Matter to Operators

Which PostgreSQL 16 and 17 changes operators actually need to prepare for: logical replication improvements, vacuum visibility, connection limits, and monitoring additions that change on-call behavior.

Oct 15, 2024 6 min read

L2 Deep Dive

MongoDB 8.0: Why Queryable Encryption Matters

MongoDB Queryable Encryption stores and queries sensitive fields in encrypted form — what it enables, how it differs from standard FLE, and where the query type constraints bite.

Oct 15, 2024 4 min read

L1 Field Note

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

How to position Prometheus and Grafana as the open-source baseline for teams that cannot send every byte of database telemetry to managed services.

#databases #architecture #failures #checklist

Oct 14, 2024 8 min read

L2 Deep Dive

Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup

How to configure Datadog Database Monitoring for PostgreSQL, MySQL, and Aurora — query samples, explain plans, wait event analysis, and the specific Agent settings that make the difference between metric collection and real observability.

Sep 17, 2024 6 min read

L2 Deep Dive

Cassandra Observability: Compaction, Tombstones, Repair, Latency, and Hot Partitions

Why generic server monitoring fails for Apache Cassandra, and how to track the true operational signals of a distributed masterless database.

Sep 12, 2024 7 min read

L3 Reference Guide

Cloud Architecture Review Checklist for Database-Backed Applications

Review checklist for database-backed cloud applications: connection saturation, migration locking, retry amplification, and region dependency failures.

#architecture #cloud #databases #failures

Sep 9, 2024 6 min read

L2 Deep Dive

Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup

How to instrument PostgreSQL and MySQL with postgres_exporter and mysqld_exporter, configure Prometheus scrape jobs, and build Grafana panels that surface the metrics that matter — with working PromQL queries.

Aug 26, 2024 5 min read

L1 Field Note

Why pgcrypto Is Not a Full Key Management Strategy

PostgreSQL's pgcrypto is a cryptographic function library, not a key management system. Treating it as one guarantees your encryption keys will eventually leak.

#databases #security #failures

Aug 20, 2024 5 min read

L2 Deep Dive

PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans

Monitoring PostgreSQL requires looking past the operating system and into the internal bookkeeping of MVCC, autovacuum, and replication streams.

Aug 12, 2024 8 min read

L2 Deep Dive

Database Alert Design: Thresholds That Fire on Real Problems

How to set database alert thresholds that catch real failures without burning the team on autovacuum noise, checkpoint churn, and replication lag spikes — with specific values for PostgreSQL, MySQL, and Aurora.

Aug 5, 2024 6 min read

L2 Deep Dive

Database Encryption: TDE, Column Encryption, pgcrypto, KMS

Why Transparent Data Encryption ticks compliance boxes but fails against compromised credentials, and how to push encryption boundaries up the stack.

#databases #architecture #security

Jul 22, 2024 8 min read

L2 Deep Dive

MySQL and Aurora Monitoring: The Dashboard That Catches Problems Before Users Do

The seven MySQL and Aurora metric groups that matter for production operations — threads, replication lag, InnoDB buffer pool, slow queries, connections, locks, and disk — with exact SQL, CloudWatch metrics, and alert thresholds.

Jul 16, 2024 5 min read

L2 Deep Dive

CloudWatch Database Insights for Aurora and RDS: The New AWS Monitoring Center

How to use CloudWatch and Performance Insights to root-cause Aurora and RDS incidents without deploying third-party agents.

Jul 16, 2024 7 min read

L2 Deep Dive

Database Changes in CI/CD: Migrations, Backfills, Expand-Contract, and Verification

Database changes in CI/CD require separate gates for schema migrations, backfills, and expand-contract patterns — not just a shell command before deployment.

Jul 8, 2024 7 min read

L2 Deep Dive

PostgreSQL Monitoring: The Dashboard That Surfaces Problems Before Users Do

The eight PostgreSQL metric groups that matter for production operations — queries, connections, replication lag, autovacuum, locks, cache pressure, checkpoint behavior, and bloat — with exact SQL and alert thresholds.

Jun 14, 2024 7 min read

L2 Deep Dive

Search Index Drift Workflow: Rebuilds, Dual Writes, CDC, and User-Visible Staleness

Search index drift is a truth-management failure: when to rebuild vs. dual-write vs. CDC, and how to bound user-visible staleness.

Jun 4, 2024 4 min read

L1 Field Note

The Database Observability Baseline: What Every DBA Dashboard Must Show

Before you can adopt AI-assisted triage, your database dashboard needs a foundation built on saturation, locking, and lag metrics.

#databases #architecture #failures #checklist

Jun 3, 2024 7 min read

L2 Deep Dive

pgvector Basics: Embeddings Inside PostgreSQL

How pgvector adds vector storage and similarity search to PostgreSQL, what the three distance operators do, and the index you must create before you hit 100K rows.

#databases #vector-db #ai-engineering

May 23, 2024 9 min read

L2 Deep Dive

Top GitHub Breakouts: March 2025 (Part 2)

Three March 2025 open-source projects that eliminate the iteration pauses engineers manually bridge: research review loops, vector index calibration, and agent provisioning YAML.

#ai-engineering #architecture #databases

May 16, 2024 6 min read

L2 Deep Dive

Vectorless RAG Patterns for Database Knowledge Systems

How tree-based retrieval can improve DB runbooks, schema docs, and incident knowledge over chunked vector search.

#databases #vector-db #ai-engineering

May 13, 2024 6 min read

L2 Deep Dive

Redis Licensing and Valkey: What Engineers Should Know

In March 2024, Redis Ltd. announced that Redis 7.4 and later would ship under the source-available RSALv2 and SSPLv1 licenses instead of BSD. Here is what that means for your deployment, and what Valkey is.

May 7, 2024 5 min read

L1 Field Note

MySQL 8.4 LTS: What DBAs Should Check Before Upgrade

MySQL 8.4 is the first long-term support release in the 8.x line — five breaking changes that require verification before any production upgrade.

Apr 15, 2024 6 min read

L2 Deep Dive

Shopify-Style Multi-Tenant Commerce Databases: Isolation, Sharding, and Operational Controls

Shopify-style per-merchant sharding prevents one large tenant from turning shared commerce database infrastructure into a shared outage.

Apr 8, 2024 7 min read

L2 Deep Dive

MongoDB Version Upgrade Risk Review

A systematic runbook for assessing MongoDB version upgrade risk — FCV, driver compatibility, deprecated operators, and rollback paths before any production cutover.

#databases #checklist #architecture

Mar 18, 2024 10 min read

L3 Reference Guide

Index Debt Review: How to Find Bad, Missing, and Duplicate Indexes

A SQL-driven audit workflow for identifying unused, duplicate, bloated, and missing indexes in PostgreSQL before they drain write performance and storage.

Mar 11, 2024 6 min read

L2 Deep Dive

Aurora Serverless v2: Good Fit, Bad Fit

Aurora Serverless v2 scales ACUs up and down rather than scaling to zero: understand the cost floor, scale-up lag, and workload fit before you commit to it for production OLTP.

Mar 6, 2024 4 min read

L1 Field Note

Vector Search on GPU Databases

A DBA-friendly explanation of how vector search works, why GPUs help, and where vector retrieval fits inside modern database and AI systems.

#databases #gpu #vector-search #retrieval

Mar 5, 2024 5 min read

L1 Field Note

How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database

A DBA-friendly walkthrough of how modern GPU databases execute large analytical SQL queries using columnar storage, parallel scans, and GPU aggregation.

#databases #architecture #ai-engineering

Mar 4, 2024 5 min read

L1 Field Note

Why Databases Are Moving Toward GPU Execution Engines

A practical, DBA-friendly explanation of why modern analytical databases are increasingly using GPUs for scans, joins, aggregations, and AI-adjacent workloads.

#databases #architecture #ai-engineering

Feb 26, 2024 10 min read

L3 Reference Guide

PostgreSQL Statistics Drift Workflow

When the query planner gets row estimates wrong, queries regress silently. This runbook diagnoses statistics drift and restores accurate plans.

Feb 19, 2024 5 min read

L1 Field Note

Aurora Global Database: What It Solves and What It Does Not

Aurora Global Database delivers sub-second cross-region replication and under-one-minute RTO for disaster recovery — but it is not active-active, and application failover is never automatic.

Oct 2, 2023 5 min read

L1 Field Note

Why SELECT * Still Hurts Production Systems

SELECT * causes four distinct problems that compound at scale: it prevents covering index usage, transfers unnecessary data, breaks application code silently, and defeats column pruning in analytical systems.

Sep 18, 2023 6 min read

L2 Deep Dive

Product Catalog Modeling: Relational, Document, Search Index, or All Three

Modeling a product catalog across relational, document, and search-index layers: where each fits and why a single schema fails all three workloads.

Aug 21, 2023 6 min read

L2 Deep Dive

Partitioning Is Not a Performance Feature by Default

PostgreSQL declarative partitioning only speeds up queries when the partition key appears in the WHERE clause — without it, you get the overhead of many tables with none of the pruning benefit.

Aug 19, 2023 8 min read

L2 Deep Dive

OCI for Oracle-Heavy Enterprises: Migration Pattern, Risk Boundary, and Cost Model

OCI migration risk model for Oracle-heavy enterprises — where the lift-and-shift boundary shifts from the database tier into dependent application contracts.

Jul 31, 2023 6 min read

L2 Deep Dive

Deadlocks vs Blocking: The Difference Engineers Miss

Blocking and deadlocks are two distinct failure modes that require opposite responses — confusing them leads to retry logic that doesn't help and investigations that point at the wrong cause.

Jul 17, 2023 10 min read

L3 Reference Guide

Logical Replication Failure Workflow

A diagnostic runbook for logical replication lag, apply worker failures, replication conflicts, and schema drift between publisher and subscriber.

Jul 10, 2023 6 min read

L2 Deep Dive

Database Connection Pooling: Why Apps Kill Databases

Without a connection pool, traffic spikes exhaust OS-level resources before a single slow query runs — here is what actually happens and how to fix it.

Jun 26, 2023 13 min read

L3 Reference Guide

Schema Deployment Risk Checklist

Assessing lock type, table size, reversibility, and rollback plan before every schema migration — a structured checklist for zero-downtime deployments.

#databases #checklist #architecture

Jun 5, 2023 10 min read

L3 Reference Guide

Cloud Database Cost Triage: Storage, IOPS, CPU, Replicas

A structured runbook for identifying which cost dimension is driving your AWS RDS or Aurora bill before making any changes.

#databases #cloud #checklist

May 29, 2023 5 min read

L1 Field Note

MySQL Binlog Format: Row vs Statement vs Mixed

Choosing the wrong MySQL binary log format silently breaks replication or bloats the binlog — this is the decision tree for picking the right one.

#databases

May 15, 2023 11 min read

L3 Reference Guide

Database Backup Validation Workflow

A repeatable runbook for proving that your database backups are actually restorable — with exact commands, decision tree, and automation patterns.

May 8, 2023 6 min read

L2 Deep Dive

Logical Replication vs Physical Replication in PostgreSQL

Physical replication copies bytes; logical replication copies row changes — and confusing the two causes silent schema drift, sequence divergence, and failed zero-downtime upgrades.

Apr 17, 2023 5 min read

L1 Field Note

Read Replicas Are Not Free Scale

Read replicas add read throughput, but they do not reduce write load, do not eliminate replication lag, and silently serve stale data under write bursts. Understanding those constraints before you add replicas is the step engineers skip.

Apr 3, 2023 10 min read

L3 Reference Guide

PostgreSQL Connection Storm Runbook

Diagnosing and resolving connection exhaustion in PostgreSQL: too many clients, idle-in-transaction accumulation, and the case for connection pooling.

Mar 13, 2023 5 min read

L1 Field Note

MongoDB WiredTiger Cache: Practical Basics

WiredTiger's internal cache is MongoDB's primary memory tier — how to read its metrics, recognize eviction pressure, and size it correctly for your working set.

#databases

Mar 6, 2023 8 min read

L2 Deep Dive

Aurora MySQL Writer CPU Spike Workflow

A systematic runbook for diagnosing Aurora MySQL writer CPU spikes — from Performance Insights through lock contention, long transactions, and read offload.

#databases #cloud #checklist #failures

Feb 6, 2023 8 min read

L2 Deep Dive

MySQL Replication Lag Decision Tree

A systematic runbook for diagnosing MySQL replication lag — from initial SHOW REPLICA STATUS to parallel apply, long transactions, and relay log space.

Jan 30, 2023 5 min read

L1 Field Note

MySQL Cardinality and Index Selectivity

MySQL ignores an index when the optimizer estimates a full scan is cheaper — which happens when cardinality is too low, statistics are stale, or the query shape doesn't match index selectivity. How to diagnose which problem it is and what to do about each.

Jan 16, 2023 9 min read

L2 Deep Dive

PostgreSQL Autovacuum Failure Workflow

A step-by-step runbook for diagnosing and resolving autovacuum failures: dead tuple accumulation, bloat, and transaction ID wraparound risk.

Jan 9, 2023 5 min read

L1 Field Note

PostgreSQL Statistics: Why the Optimizer Gets It Wrong

PostgreSQL's query planner depends entirely on per-column statistics that go stale after bulk loads — here is what that means for query plan quality and how to fix it.

Nov 14, 2022 5 min read

L2 Deep Dive

Backups Are Not Recovery: The DBA Rule Everyone Learns Late

A backup file proves you captured data. Recovery is the process of producing a running, consistent database on a different system inside your RTO. They are not the same thing, and confusing them is how incidents get worse.

#databases #failures #checklist

Oct 10, 2022 5 min read

L1 Field Note

Redis Memory Eviction Policies Explained

Redis has eight eviction policies and a maxmemory limit. The policy you pick determines whether your cache degrades safely or silently erodes your hit rate under load.

Sep 26, 2022 7 min read

L2 Deep Dive

MongoDB Query Performance Workflow

A systematic runbook for diagnosing slow MongoDB queries — from explain output through COLLSCAN, index selectivity, in-memory sort, and WiredTiger cache pressure.

Sep 12, 2022 5 min read

L1 Field Note

MongoDB Index Basics: Why Your Query Became Slow

MongoDB's default behavior is a full collection scan when no index supports the query. Here is what you need to know about single-field, compound, and multikey indexes before your collection grows past 10K documents.

Jul 25, 2022 7 min read

L2 Deep Dive

DynamoDB Single-Table Design: When It Works and When It Hurts

Single-table design in DynamoDB is an operational bet that access patterns are stable enough to encode into partition and sort keys — when the approach pays off, and when evolving query requirements turn it into a migration project.

Jun 6, 2022 5 min read

L1 Field Note

MySQL EXPLAIN: Reading the Plan Without Guessing

How to read MySQL EXPLAIN output systematically — type column, key column, rows estimate, and Extra flags — so you stop adding indexes blindly.

May 23, 2022 11 min read

L3 Reference Guide

MySQL Slow Query Playbook: From Slow Log to Fix

A repeatable workflow for diagnosing MySQL slow queries — from enabling the slow log through reading EXPLAIN output to committing a safe fix.

May 9, 2022 5 min read

L1 Field Note

MySQL InnoDB Buffer Pool: The First Thing to Check

The InnoDB buffer pool hit ratio and size are the first metrics to verify on any MySQL server: a default 128MB pool on a 32GB machine pushes most reads to disk once the working set outgrows it.

#databases

Apr 11, 2022 5 min read

L1 Field Note

PostgreSQL Autovacuum: What Every Engineer Should Know

Autovacuum is not optional maintenance — it is the mechanism that prevents table bloat and transaction ID wraparound from taking your database offline.

Mar 21, 2022 12 min read

L3 Reference Guide

PostgreSQL Slow Query Triage Workflow

A structured runbook for diagnosing slow query root causes in PostgreSQL — missing indexes, stale statistics, lock contention, and I/O saturation — in the order that wastes the least time.