#databases

Why Database Engineers Should Care About AI Cost Engineering

The skills that make a good cost-aware DBA — measuring usage, finding structural waste, balancing cost against reliability — transfer almost directly to AI workloads. Database engineers are unusually well positioned to own AI cost.

#ai #cost #databases #career

Jun 12, 2026 4 min read

L1 Field Note

How to Run a Database Cost & Reliability Review

A practitioner walkthrough of the review method: what to look at, in what order, how to quantify an opportunity honestly, and how to turn findings into a prioritized 30/60/90-day plan.

#databases #cost #reliability #postgresql

Jun 11, 2026 3 min read

L1 Field Note

Aurora Cost Optimization: The Hidden Database Bill

Aurora cost hides in places the console doesn't foreground — I/O charges, oversized writers and readers, replica sprawl, and storage. A structured way to find and reduce each without hurting reliability.

#databases #cloud #cost #aurora

Jun 10, 2026 3 min read

L1 Field Note

PostgreSQL Bloat, Index Waste, and Cloud Cost

Table and index bloat and unused indexes are well-known Postgres problems — and direct cloud-cost problems: wasted storage, write amplification, and extra I/O. How to measure both with read-only queries and remediate safely.

#postgresql #databases #cost #performance

May 28, 2026 17 min read

L3 Reference Guide

Per-App Postgres on Kubernetes Changes the Failure Boundary

How CloudNativePG, GitOps, and external secrets make per-application Postgres viable without hiding the operational cost.

May 25, 2026 6 min read

L2 Deep Dive

Azure Database for PostgreSQL: Flexible Server vs Hyperscale (Citus) Architecture Decision

When to choose Azure Flexible Server vs Citus for PostgreSQL on Azure — failover behavior, connection pooling, and the workload shapes where each architecture wins and breaks.

May 25, 2026 7 min read

L2 Deep Dive

Cassandra Write Path Fundamentals for Database Engineers

How Cassandra's commit log, memtable, and SSTable pipeline works, why write amplification is the dominant operational cost, and how compaction strategy selection changes it.

May 25, 2026 6 min read

L2 Deep Dive

GCP AlloyDB vs Cloud SQL for PostgreSQL: When to Upgrade

When Cloud SQL's managed PostgreSQL hits its limits and AlloyDB's columnar cache and HTAP architecture become worth the migration complexity and cost jump.

May 24, 2026 9 min read

L2 Deep Dive

The Stack for AI-Accelerated Database Operations Is Now Open Source

Three May 2026 breakout projects close the gaps that stop database teams from moving schema changes, query assistance, and operational workflows to AI: declarative Postgres migrations, local LLM inference, and a full agent platform.

May 22, 2026 8 min read

L2 Deep Dive

Top GitHub Breakouts: April 2026 — Production Agent Infrastructure

The highest-starred new open-source projects in April 2026 targeting production-scale AI agent memory, protocol enforcement, and Postgres environment management — what breaks when agents leave single-developer scope.

#ai-engineering #databases #cloud

May 16, 2026 6 min read

L2 Deep Dive

Stop Writing Ad-Hoc Queries: Build a Skill Backbone for Your DB Engineering Workflows

How to codify repetitive DB tasks into testable, reusable Claude skills that produce consistent SQL, runbooks, and migration outputs instead of one-off chat prompts.

May 8, 2026 7 min read

L2 Deep Dive

Top GitHub Breakouts: April 2026 — Part I

The highest-starred new open-source projects in April 2026 relevant to database engineering, infrastructure, and AI tooling — focused on eliminating manual context re-injection across system design, platform automation, and AI memory.

Apr 22, 2026 7 min read

L2 Deep Dive

Top GitHub Breakouts: March 2026 — Agent Adaptation and Production-Scale Vector Search

The second wave of March 2026 breakouts: an agent that learns from every conversation, a Rust vector index that outperforms FAISS at a fraction of the memory, and a Kubernetes-native agent control plane.

Apr 16, 2026 2 min read

L1 Field Note

SQL Server to PostgreSQL Migration Cost Defense Checklist

A pragmatic checklist to defend the business case for migrating away from Microsoft SQL Server.

#checklist #databases

Apr 15, 2026 14 min read

L3 Reference Guide

GitHub Breakouts: Q1 2026 — The Quarter's Top Productivity Shifts

Six open-source projects from Q1 2026 that converged on eliminating the manual scaffolding between AI agents and production infrastructure: context management, local cloud testing, and vector retrieval.

Mar 25, 2026 2 min read

L1 Field Note

Oracle Cloud BYOL: True Cost Analysis Beyond the Headline Rate

Understanding the financial nuances, OCPU conversions, and hidden costs of bringing your Oracle licenses to OCI.

#databases #cloud

Mar 11, 2026 2 min read

L1 Field Note

Oracle to Aurora PostgreSQL: License Cost Elimination in Practice

The engineering reality and ROI of migrating from Oracle to Amazon Aurora PostgreSQL.

Mar 7, 2026 7 min read

L2 Deep Dive

Top GitHub Breakouts: February 2026 — Part I

The highest-starred new open-source projects in February 2026 — eliminating the context tax that slows AI-assisted code review, infrastructure generation, and database operations.

Mar 4, 2026 2 min read

L1 Field Note

AWS RDS Oracle and SQL Server: The License Cost Nobody Talks About

Why the default License-Included model on AWS RDS is a financial trap for enterprise database workloads.

#databases #cloud #failures

Feb 25, 2026 2 min read

L2 Deep Dive

Azure Hybrid Benefit for SQL Server: The Exact Math

A deep dive into the cost savings and mechanics of applying Azure Hybrid Benefit to SQL Server deployments.

Feb 24, 2026 4 min read

L1 Field Note

Programmatic Tool Calling for DB Automation

A reference pattern for keeping large database outputs out of model context by using scripts that summarize evidence before the agent sees it.

Feb 18, 2026 2 min read

L1 Field Note

Azure Synapse Cost Optimization: DWU Right-Sizing, Serverless, and Hybrid Benefit

How to reduce your Azure Synapse compute bill by right-sizing dedicated pools and offloading to serverless.

Feb 11, 2026 2 min read

L1 Field Note

Database Licensing Cost Across AWS, Azure, GCP, and OCI

A framework for managing commercial database licensing costs across the four major cloud providers.

Feb 4, 2026 3 min read

L1 Field Note

Cloud Database Cost Engineering: How to Reduce Database, Data Warehouse, and Licensing Spend Across Azure, AWS, GCP, and OCI

A comprehensive framework for reining in cloud database costs, focusing on licensing, right-sizing, and architectural tradeoffs.

#databases #cloud #architecture #checklist

Jan 30, 2026 4 min read

L1 Field Note

#databases #ai-engineering #architecture #checklist

Database Runbooks as Agent Contracts

A reference operating model for turning human database runbooks into machine-usable agent contracts.

Jan 28, 2026 16 min read

L3 Reference Guide

GitHub Year in Review: 2025 — What Open Source Changed in the Engineering Stack

Nine breakout repos across four themes — MCP protocol adoption, agent memory infrastructure, AI-native platform ops, and database automation — that eliminated the hand-built glue code between AI agents and production systems.

Jan 23, 2026 4 min read

L1 Field Note

Repo-Embedded Skills for Database Teams

Why database teams should store agent instructions, runbook contracts, and review policies in the repository instead of in memory.

Jan 20, 2026 4 min read

L1 Field Note

Agentic Code Review for Database Repositories

Database repositories contain hidden rules human reviewers know: never add a blocking index at peak hours, never widen IAM without owner approval. Agent review surfaces these violations before merge — without displacing the human judgment that set the rules.

Jan 15, 2026 14 min read

L3 Reference Guide

GitHub Breakouts: Q4 2025 — The Quarter's Top Productivity Shifts

Six open-source projects that collectively delivered the missing infrastructure layer for production AI agents: secure sandboxes, deployment platforms, persistent memory, token-efficient encoding, and AI-native storage.

Jan 5, 2026 6 min read

L2 Deep Dive

Agent Loop Anatomy for DB and Cloud Engineers

A practical mental model for how coding agents plan, call tools, observe results, and complete infrastructure work without treating the model response as the whole system.

Dec 20, 2025 8 min read

L2 Deep Dive

Automated Reliability Across the Stack: Database Backups, Platform Observability, and SQL Quality (November 2025)

Three November 2025 open-source releases eliminate manual work from three engineering reliability tasks — multi-database backup verification, self-hosted log and trace collection, and SQL static analysis in CI pipelines.

Nov 22, 2025 8 min read

L2 Deep Dive

Top GitHub Breakouts: October 2025 (Part 2)

October's memory and retrieval breakouts: a structured agent memory framework with benchmarks, a self-hosted cognitive memory engine, and sub-10ms semantic search without a vector database cluster.

Nov 8, 2025 7 min read

L2 Deep Dive

Top GitHub Breakouts: October 2025 (Part 1)

Three October breakouts targeting LLM prompt verbosity, parallel agent orchestration, and fragmented hybrid search stacks — all reducing coordination overhead in AI engineering.

Oct 25, 2025 11 min read

L3 Reference Guide

Torn Page Protection Belongs Off the Foreground Path

A PostgreSQL kernel experiment shows why moving torn-page protection from WAL to background flush can change write latency.

Oct 15, 2025 14 min read

L3 Reference Guide

GitHub Breakouts: Q3 2025 — The Quarter's Top Productivity Shifts

Six open-source tools from Q3 2025 that closed the infrastructure gaps blocking AI agents in production: persistent memory, intelligent model routing, and natural language database access.

Oct 7, 2025 13 min read

L2 Deep Dive

PostgreSQL 18 Replication Upgrade Opportunities

What changes in replication when upgrading from PostgreSQL 14–16 to PostgreSQL 18: parallel apply, pg_createsubscriber, and surfaced conflict visibility.

#databases #architecture #checklist

Sep 27, 2025 7 min read

L2 Deep Dive

Top GitHub Breakouts: August 2025 — Part II

The highest-starred new open-source projects in August 2025 where AI takes over cloud operations, infrastructure provisioning, and production Postgres coding.

#ai-engineering #cloud #databases

Sep 25, 2025 6 min read

L2 Deep Dive

PostgreSQL 18: Features DB Engineers Should Watch

PostgreSQL 18 introduces fundamental changes to the storage engine and to replication: asynchronous I/O, parallel logical apply, and improved conflict visibility are what operators need to understand before upgrading.

Sep 13, 2025 10 min read

L3 Reference Guide

Autovacuum Is a Capacity Problem, Not a Maintenance Task

PostgreSQL vacuum failures often start with blocked cleanup, table bloat, and weak lock observability during peak load.

#databases #failures #checklist

Sep 6, 2025 7 min read

L2 Deep Dive

Top GitHub Breakouts: August 2025 — Part I

The gap between AI prototype and production system is routing tables, deployment YAML, and observability scaffolding. August 2025's top breakouts targeted exactly the code engineers keep rewriting: model routing logic, agent deployment manifests, and PostgreSQL diagnostics.

Aug 30, 2025 12 min read

L3 Reference Guide

The Semantics AI Misses When Porting Storage Designs

Why a PostgreSQL double write buffer prototype failed despite compiling, and what it reveals about AI-assisted systems design.

#databases #ai-engineering #failures

Jul 26, 2025 19 min read

L3 Reference Guide

Natural Language SQL Agents Need Database Guardrails

The risk in a natural-language SQL agent is not bad SQL — it is authority compilation: a user sentence becomes a database operation unless the control plane proves, before execution, which role, rows, cost, and columns the query is allowed to touch.

Jul 15, 2025 14 min read

L3 Reference Guide

GitHub Breakouts: Q2 2025 — The Quarter's Top Productivity Shifts

Six Q2 2025 open-source breakouts that closed the gap between AI agents and engineering infrastructure across system design, platform operations, and database tooling.

Jul 12, 2025 8 min read

L2 Deep Dive

Covering Indexes Are Not Enough Without Visibility

PostgreSQL index-only scans only stay fast when covering indexes and visibility map maintenance work together.

Jul 5, 2025 9 min read

L2 Deep Dive

When Autovacuum Becomes a Backpressure Signal

PostgreSQL vacuum stalls are often symptoms of lock pressure, table bloat, and missing operational visibility.

#databases #failures #checklist

Jun 22, 2025 8 min read

L2 Deep Dive

Top GitHub Breakouts: May 2025 — Operational Baseline in a Config File

Three May 2025 open-source projects replace multi-tool assembly in document ingestion, deployment governance, and PostgreSQL backup with single-binary or configuration-first alternatives.

Jun 14, 2025 9 min read

L2 Deep Dive

Three Open-Source Tools Filling the Gaps in Database Operations (May 2025)

May 2025's most-starred new projects solve three specific database team problems: backup restores that are never verified, internal knowledge that can't be retrieved, and AI agents blind to your schema history.

May 12, 2025 7 min read

L3 Reference Guide

MongoDB Queryable Encryption Architecture Review

A pre-go-live architecture review for MongoDB Queryable Encryption — key management, field classification, query type constraints, driver requirements, and key rotation.

#databases #architecture #checklist

May 3, 2025 6 min read

L2 Deep Dive

The Architecture of Natural Language Database Interfaces

Removing the translation overhead between business questions and SQL queries requires an architecture that bridges LLM intent parsing with strict execution validation and schema retrieval.

Apr 26, 2025 8 min read

L2 Deep Dive

Per-Application Postgres on Kubernetes Is an Isolation Strategy

How CloudNativePG, GitOps, and External Secrets turn Postgres-on-Kubernetes into an operational isolation pattern.

Apr 15, 2025 14 min read

L3 Reference Guide

GitHub Breakouts: Q1 2025 — The Quarter's Top Productivity Shifts

Six high-traction open-source projects from Q1 2025 converged on eliminating the manual integration layer between AI assistants and production systems across databases, platform operations, and developer tooling.

Apr 8, 2025 7 min read

L2 Deep Dive

Python Automation Framework for DB and Cloud Ops: Architecture and Failure Model

DB and cloud automation fails when partial failures leave the database, cloud account, and ticketing system describing different operation states.

#architecture #cloud #databases

Mar 1, 2025 9 min read

L2 Deep Dive

Natural Language SQL Agents Need Guardrails Before Orchestration

How Postgres chat agents turn intent into SQL, and why production systems need schema controls, validation, and auditability.

Feb 22, 2025 8 min read

L2 Deep Dive

Double Write Buffers Fail at the I/O Boundary

Why porting InnoDB’s double write buffer to PostgreSQL breaks on buffered I/O, fsync semantics, and background writer design.

#databases #ai-engineering #failures

Jan 28, 2025 23 min read

L3 Reference Guide

GitHub Year in Review: 2024 — What Open Source Changed in the Engineering Stack

Nine breakout repositories across three themes — agents that operated computers, RAG that grew a graph spine, and databases that finally spoke natively to LLMs — define what actually shifted in the engineering stack in 2024.

Dec 11, 2024 7 min read

L2 Deep Dive

The 2027 Cloud Database Architecture Roadmap

A 2027 cloud database architecture roadmap for teams that can no longer satisfy consistency, latency, residency, and recovery SLOs with a single engine.

Dec 10, 2024 10 min read

L3 Reference Guide

AI Agents Need Database Guardrails Below the Prompt

Prompt-level guardrails fail open when the agent misinterprets context. The only boundary that mechanically rejects destructive SQL is the database — dedicated read-only roles, sanitized view schemas, and a network path that application credentials never touch.

#ai-engineering #databases #failures

Dec 2, 2024 12 min read

L1 Field Note

The Agent Should Not Have Your App Credentials

Giving an AI coding agent your application's Postgres credentials is the default mistake — the agent inherits every permission the app has. Database-enforced read-only roles, replica routing, query limits, and project-scoped MCP config are the alternative that actually fails closed.

#ai-engineering #databases #failures

Oct 24, 2024 6 min read

L2 Deep Dive

PostgreSQL 16/17 Features That Matter to Operators

Which PostgreSQL 16 and 17 changes operators actually need to prepare for: logical replication improvements, vacuum visibility, connection limits, and monitoring additions that change on-call behavior.

Oct 15, 2024 6 min read

L2 Deep Dive

MongoDB 8.0: Why Queryable Encryption Matters

MongoDB Queryable Encryption stores and queries sensitive fields in encrypted form — what it enables, how it differs from standard FLE, and where the query type constraints bite.

Oct 15, 2024 4 min read

L1 Field Note

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

How to position Prometheus and Grafana as the open-source baseline for teams that cannot send every byte of database telemetry to managed services.

#databases #architecture #failures #checklist

Oct 14, 2024 8 min read

L2 Deep Dive

Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup

How to configure Datadog Database Monitoring for PostgreSQL, MySQL, and Aurora — query samples, explain plans, wait event analysis, and the specific Agent settings that make the difference between metric collection and real observability.

Sep 27, 2024 9 min read

L3 Reference Guide

AWS vs Azure vs GCP vs OCI for Database-Backed Systems: Decision Framework

How to choose between AWS, Azure, GCP, and OCI for database-backed systems by matching managed database failure behavior to your system's dominant recovery requirement.

#architecture #cloud #databases

Sep 17, 2024 6 min read

L2 Deep Dive

Cassandra Observability: Compaction, Tombstones, Repair, Latency, and Hot Partitions

Why generic server monitoring fails for Apache Cassandra, and how to track the true operational signals of a distributed masterless database.

Sep 12, 2024 7 min read

L3 Reference Guide

Cloud Architecture Review Checklist for Database-Backed Applications

Review checklist for database-backed cloud applications: connection saturation, migration locking, retry amplification, and region dependency failures.

#architecture #cloud #databases #failures

Sep 9, 2024 6 min read

L2 Deep Dive

Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup

How to instrument PostgreSQL and MySQL with postgres_exporter and mysqld_exporter, configure Prometheus scrape jobs, and build Grafana panels that surface the metrics that matter — with working PromQL queries.

Aug 26, 2024 5 min read

L1 Field Note

Why pgcrypto Is Not a Full Key Management Strategy

PostgreSQL's pgcrypto is a cryptographic function library, not a key management system. Treating it as one guarantees your encryption keys will eventually leak.

#databases #security #failures

Aug 20, 2024 5 min read

L2 Deep Dive

PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans

Monitoring PostgreSQL requires looking past the operating system and into the internal bookkeeping of MVCC, autovacuum, and replication streams.

Aug 12, 2024 8 min read

L2 Deep Dive

Database Alert Design: Thresholds That Fire on Real Problems

How to set database alert thresholds that catch real failures without burning the team on autovacuum noise, checkpoint churn, and replication lag spikes — with specific values for PostgreSQL, MySQL, and Aurora.

Aug 5, 2024 6 min read

L2 Deep Dive

Database Encryption: TDE, Column Encryption, pgcrypto, KMS

Why Transparent Data Encryption ticks compliance boxes but fails against compromised credentials, and how to push encryption boundaries up the stack.

#databases #architecture #security

Jul 22, 2024 8 min read

L2 Deep Dive

MySQL and Aurora Monitoring: The Dashboard That Catches Problems Before Users Do

The seven MySQL and Aurora metric groups that matter for production operations — threads, replication lag, InnoDB buffer pool, slow queries, connections, locks, and disk — with exact SQL, CloudWatch metrics, and alert thresholds.

Jul 16, 2024 5 min read

L2 Deep Dive

CloudWatch Database Insights for Aurora and RDS: The New AWS Monitoring Center

How to use CloudWatch and Performance Insights to root-cause Aurora and RDS incidents without deploying third-party agents.

Jul 16, 2024 7 min read

L2 Deep Dive

Database Changes in CI/CD: Migrations, Backfills, Expand-Contract, and Verification

Database changes in CI/CD require separate gates for schema migrations, backfills, and expand-contract patterns — not just a shell command before deployment.

Jul 8, 2024 7 min read

L2 Deep Dive

PostgreSQL Monitoring: The Dashboard That Surfaces Problems Before Users Do

The eight PostgreSQL metric groups that matter for production operations — queries, connections, replication lag, autovacuum, locks, cache pressure, checkpoint behavior, and bloat — with exact SQL and alert thresholds.

Jun 14, 2024 7 min read

L2 Deep Dive

Search Index Drift Workflow: Rebuilds, Dual Writes, CDC, and User-Visible Staleness

Search index drift is a truth-management failure: when to rebuild vs. dual-write vs. CDC, and how to bound user-visible staleness.

Jun 4, 2024 4 min read

L1 Field Note

The Database Observability Baseline: What Every DBA Dashboard Must Show

Before you can adopt AI-assisted triage, your database dashboard needs a foundation built on saturation, locking, and lag metrics.

#databases #architecture #failures #checklist

Jun 3, 2024 7 min read

L2 Deep Dive

pgvector Basics: Embeddings Inside PostgreSQL

How pgvector adds vector storage and similarity search to PostgreSQL, what the three distance operators do, and the index you must create before you hit 100K rows.

#databases #vector-db #ai-engineering

May 23, 2024 9 min read

L2 Deep Dive

Top GitHub Breakouts: March 2025 (Part 2)

Three March 2025 open-source projects that eliminate the iteration pauses engineers manually bridge — research review loops, vector index calibration, and agent provisioning YAML.

May 20, 2024 5 min read

L1 Field Note

Database Security Review for AI Access

Granting an autonomous AI agent access to your database breaks every assumption of traditional RBAC. How to secure databases against unpredictable, unbounded AI queries.

#ai-engineering #databases #checklist

May 16, 2024 6 min read

L2 Deep Dive

Vectorless RAG Patterns for Database Knowledge Systems

How tree-based retrieval can improve DB runbooks, schema docs, and incident knowledge over chunked vector search.

#databases #vector-db #ai-engineering

May 13, 2024 6 min read

L2 Deep Dive

Redis Licensing and Valkey: What Engineers Should Know

In March 2024, Redis Ltd changed Redis 7.4+ to a non-OSS license. Here is what that actually means for your deployment — and what Valkey is.

May 7, 2024 5 min read

L1 Field Note

MySQL 8.4 LTS: What DBAs Should Check Before Upgrade

MySQL 8.4 is the first long-term support release in the 8.x line — five breaking changes that require verification before any production upgrade.

Apr 15, 2024 6 min read

L2 Deep Dive

Shopify-Style Multi-Tenant Commerce Databases: Isolation, Sharding, and Operational Controls

Shopify-style per-merchant sharding prevents one large tenant from turning shared commerce database infrastructure into a shared outage.

Apr 8, 2024 7 min read

L2 Deep Dive

MongoDB Version Upgrade Risk Review

A systematic runbook for assessing MongoDB version upgrade risk — FCV, driver compatibility, deprecated operators, and rollback paths before any production cutover.

#databases #checklist #architecture

Mar 18, 2024 10 min read

L3 Reference Guide

Index Debt Review: How to Find Bad, Missing, and Duplicate Indexes

A SQL-driven audit workflow for identifying unused, duplicate, bloated, and missing indexes in PostgreSQL before they drain write performance and storage.

Mar 12, 2024 4 min read

L1 Field Note

Consistency Models Your Application Actually Needs

The difference between read committed, repeatable read, and serializable isolation in operational terms — and why most applications are running with weaker guarantees than engineers assume.

Mar 11, 2024 6 min read

L2 Deep Dive

Aurora Serverless v2: Good Fit, Bad Fit

Aurora Serverless v2 scales capacity in ACUs but does not scale to zero. Understand the cost floor, scale-up lag, and workload fit before you commit to it for production OLTP.

Mar 6, 2024 4 min read

L1 Field Note

Vector Search on GPU Databases

A DBA-friendly explanation of how vector search works, why GPUs help, and where vector retrieval fits inside modern database and AI systems.

#databases #gpu #vector-search #retrieval

Mar 5, 2024 5 min read

L1 Field Note

#databases #architecture #ai-engineering

How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database

A DBA-friendly walkthrough of how modern GPU databases execute large analytical SQL queries using columnar storage, parallel scans, and GPU aggregation.

Mar 4, 2024 5 min read

L1 Field Note

#databases #architecture #ai-engineering

Why Databases Are Moving Toward GPU Execution Engines

A practical, DBA-friendly explanation of why modern analytical databases are increasingly using GPUs for scans, joins, aggregations, and AI-adjacent workloads.

Mar 3, 2024 5 min read

L1 Field Note

SIMD vs SIMT Explained for Database Engineers

A DBA-friendly explanation of SIMD and SIMT using query execution, vectorized processing, and GPU mental models instead of hardware jargon.

#databases #cpu #gpu #performance

Mar 2, 2024 5 min read

L1 Field Note

#databases #architecture #ai-engineering

CPU vs GPU vs TPU Explained for Database Engineers

How CPU, GPU, and TPU architectures differ in ways that matter for databases and AI workloads — and which compute class to reach for when adding vector search, embedding generation, or GPU-accelerated analytics.

Feb 26, 2024 10 min read

L3 Reference Guide

PostgreSQL Statistics Drift Workflow

When the query planner gets row estimates wrong, queries regress silently. This runbook diagnoses statistics drift and restores accurate plans.

Feb 19, 2024 5 min read

L1 Field Note

Aurora Global Database: What It Solves and What It Does Not

Aurora Global Database delivers sub-second cross-region replication and under-one-minute RTO for disaster recovery — but it is not active-active, and application failover is never automatic.

Jan 9, 2024 4 min read

L1 Field Note

#databases #fundamentals #architecture

CAP Theorem in Operational Terms

What the CAP theorem actually says about distributed database tradeoffs, why the CP vs AP framing is more useful than the theory, and what it means for your system when the network fails.

Nov 14, 2023 4 min read

L1 Field Note

#databases #fundamentals #architecture

Caches, Queues, and Databases: When to Use Each

The decision framework for choosing between a cache, a queue, and a database — including the failure modes that appear when engineers use the wrong one for the job.

Oct 2, 2023 5 min read

L1 Field Note

Why SELECT * Still Hurts Production Systems

SELECT * causes four distinct problems that compound at scale: it prevents covering index usage, transfers unnecessary data, breaks application code silently, and defeats column pruning in analytical systems.

Sep 18, 2023 6 min read

L2 Deep Dive

Product Catalog Modeling: Relational, Document, Search Index, or All Three

Modeling a product catalog across relational, document, and search-index layers: where each fits and why a single schema fails all three workloads.

Sep 12, 2023 4 min read

L1 Field Note

Cardinality Estimation: Why the Query Planner Gets It Wrong

How PostgreSQL estimates row counts, why those estimates are wrong for correlated columns and skewed distributions, and what engineers can do when the planner picks a bad plan.

Aug 21, 2023 6 min read

L2 Deep Dive

Partitioning Is Not a Performance Feature by Default

PostgreSQL declarative partitioning only speeds up queries when the partition key appears in the WHERE clause — without it, you get the overhead of many tables with none of the pruning benefit.

Aug 19, 2023 8 min read

L2 Deep Dive

OCI for Oracle-Heavy Enterprises: Migration Pattern, Risk Boundary, and Cost Model

OCI migration risk model for Oracle-heavy enterprises — where the lift-and-shift boundary shifts from the database tier into dependent application contracts.

Jul 31, 2023 6 min read

L2 Deep Dive

Deadlocks vs Blocking: The Difference Engineers Miss

Blocking and deadlocks are two distinct failure modes that require opposite responses — confusing them leads to retry logic that doesn't help and investigations that point at the wrong cause.

Jul 17, 2023 10 min read

L3 Reference Guide

Logical Replication Failure Workflow

A diagnostic runbook for logical replication lag, apply worker failures, replication conflicts, and schema drift between publisher and subscriber.

Jul 11, 2023 4 min read

L1 Field Note

Index Selectivity: Why Cardinality Changes Everything

Why a low-cardinality index is often worse than no index, how the query planner uses selectivity estimates, and when to build a partial index instead.

Jul 10, 2023 6 min read

L2 Deep Dive

Database Connection Pooling: Why Apps Kill Databases

Without a connection pool, traffic spikes exhaust OS-level resources before a single slow query runs — here is what actually happens and how to fix it.

Jun 26, 2023 13 min read

L3 Reference Guide

Schema Deployment Risk Checklist

Assessing lock type, table size, reversibility, and rollback plan before every schema migration — a structured checklist for zero-downtime deployments.

#databases #checklist #architecture

Jun 5, 2023 10 min read

L3 Reference Guide

Cloud Database Cost Triage: Storage, IOPS, CPU, Replicas

A structured runbook for identifying which cost dimension is driving your AWS RDS or Aurora bill before making any changes.

#databases #cloud #checklist

May 29, 2023 5 min read

L1 Field Note

MySQL Binlog Format: Row vs Statement vs Mixed

Choosing the wrong MySQL binary log format silently breaks replication or bloats the binlog — this is the decision tree for picking the right one.

#databases

May 15, 2023 11 min read

L3 Reference Guide

Database Backup Validation Workflow

A repeatable runbook for proving that your database backups are actually restorable — with exact commands, decision tree, and automation patterns.

May 9, 2023 5 min read

L1 Field Note

Reading a Query Plan Without Getting Lost

How to read PostgreSQL EXPLAIN output, what seq scan vs index scan actually means in practice, and the three numbers that matter most in any query plan.

May 8, 2023 6 min read

L2 Deep Dive

Logical Replication vs Physical Replication in PostgreSQL

Physical replication copies bytes; logical replication copies row changes — and confusing the two causes silent schema drift, sequence divergence, and failed zero-downtime upgrades.

Apr 17, 2023 5 min read

L1 Field Note

Read Replicas Are Not Free Scale

Read replicas add read throughput, but they do not reduce write load, do not eliminate replication lag, and silently serve stale data under write bursts. Understanding those constraints before you add replicas is the step engineers skip.

Apr 6, 2023 7 min read

L2 Deep Dive

GCP E-Commerce Inventory Architecture: Spanner, Pub/Sub, Dataflow, and BigQuery

Spanner prevents inventory oversells under concurrent checkouts; Pub/Sub and Dataflow push stock events to BigQuery without blocking reservation writes.

Apr 3, 2023 10 min read

L3 Reference Guide

PostgreSQL Connection Storm Runbook

Diagnosing and resolving connection exhaustion in PostgreSQL: too many clients, idle-in-transaction accumulation, and the case for connection pooling.

Mar 14, 2023 4 min read

L1 Field Note

Connection Pooling Explained

Why PostgreSQL connections are expensive, what a connection pool actually does, and the difference between session mode, transaction mode, and statement mode in PgBouncer.

Mar 13, 2023 5 min read

L1 Field Note

MongoDB WiredTiger Cache: Practical Basics

WiredTiger's internal cache is MongoDB's primary memory tier — how to read its metrics, recognize eviction pressure, and size it correctly for your working set.

#databases

Mar 6, 2023 8 min read

L2 Deep Dive

Aurora MySQL Writer CPU Spike Workflow

A systematic runbook for diagnosing Aurora MySQL writer CPU spikes — from Performance Insights through lock contention, long transactions, and read offload.

#databases #cloud #checklist #failures

Feb 20, 2023 7 min read

L2 Deep Dive

GCP Reference Architecture: Cloud Run, Load Balancing, Cloud SQL, Memorystore, and Pub/Sub

Cloud Run autoscales compute, but Cloud SQL connection limits, Memorystore eviction, and Pub/Sub backpressure are where capacity planning actually lives.

#architecture #cloud #databases

Feb 6, 2023 8 min read

L2 Deep Dive

MySQL Replication Lag Decision Tree

A systematic runbook for diagnosing MySQL replication lag, from the initial SHOW REPLICA STATUS check through parallel apply, long transactions, and relay log space.

Jan 30, 2023 5 min read

L1 Field Note

MySQL Cardinality and Index Selectivity

MySQL ignores an index when the optimizer estimates a full scan is cheaper — which happens when cardinality is too low, statistics are stale, or the query shape doesn't match index selectivity. How to diagnose which problem it is and what to do about each.

Jan 16, 2023 9 min read

L2 Deep Dive

PostgreSQL Autovacuum Failure Workflow

A step-by-step runbook for diagnosing and resolving autovacuum failures: dead tuple accumulation, bloat, and transaction ID wraparound risk.

Jan 10, 2023 4 min read

L1 Field Note

Replication Lag Explained

What replication lag actually measures in PostgreSQL, the three distinct lag components that most monitoring tools conflate, and which one matters for your RPO.

Jan 9, 2023 5 min read

L1 Field Note

PostgreSQL Statistics: Why the Optimizer Gets It Wrong

PostgreSQL's query planner relies on per-column statistics that can go stale after bulk loads. Here is what that means for query plan quality and how to fix it.

Nov 14, 2022 5 min read

L2 Deep Dive

Backups Are Not Recovery: The DBA Rule Everyone Learns Late

A backup file proves you captured data. Recovery is the process of producing a running, consistent database on a different system inside your RTO. They are not the same thing, and confusing them is how incidents get worse.

#databases #failures #checklist

Oct 11, 2022 4 min read

L1 Field Note

Checkpoint and Flush: What Your Database Does Before It Can Rest

What a checkpoint actually does in PostgreSQL, why dirty page flush matters for recovery time, and what engineers should monitor to avoid checkpoint pressure.

Oct 10, 2022 5 min read

L1 Field Note

Redis Memory Eviction Policies Explained

Redis has eight eviction policies and a maxmemory limit. The policy you pick determines whether your cache degrades gracefully or silently loses hit rate under load.

Sep 26, 2022 7 min read

L2 Deep Dive

MongoDB Query Performance Workflow

A systematic runbook for diagnosing slow MongoDB queries — from explain output through COLLSCAN, index selectivity, in-memory sort, and WiredTiger cache pressure.

Sep 12, 2022 5 min read

L1 Field Note

MongoDB Index Basics: Why Your Query Became Slow

MongoDB's default behavior is a full collection scan when no index supports the query. Here is what you need to know about single-field, compound, and multikey indexes before your collection grows past 10K documents.

Aug 9, 2022 4 min read

L1 Field Note

Redo vs Undo: How Databases Recover from Crashes

The two mechanisms databases use to survive crashes — redo brings committed changes forward, undo rolls back uncommitted ones — and why the distinction matters operationally.

Jul 25, 2022 7 min read

L2 Deep Dive

DynamoDB Single-Table Design: When It Works and When It Hurts

Single-table design in DynamoDB is an operational bet that access patterns are stable enough to encode into partition and sort keys — when the approach pays off, and when evolving query requirements turn it into a migration project.

Jun 14, 2022 4 min read

L1 Field Note

B-tree vs LSM Tree: The Storage Engine Tradeoff

Why PostgreSQL and MySQL use B-trees while Cassandra and RocksDB use LSM trees: the read/write tradeoff that determines which storage engine fits your workload.

#databases #fundamentals #architecture

Jun 6, 2022 5 min read

L1 Field Note

MySQL EXPLAIN: Reading the Plan Without Guessing

How to read MySQL EXPLAIN output systematically — type column, key column, rows estimate, and Extra flags — so you stop adding indexes blindly.

May 23, 2022 11 min read

L3 Reference Guide

MySQL Slow Query Playbook: From Slow Log to Fix

A repeatable workflow for diagnosing MySQL slow queries — from enabling the slow log through reading EXPLAIN output to committing a safe fix.

May 9, 2022 5 min read

L1 Field Note

MySQL InnoDB Buffer Pool: The First Thing to Check

The InnoDB buffer pool hit ratio and size are the first metrics to verify on any MySQL server: a default 128MB pool on a 32GB machine forces most reads to disk.

#databases

Apr 11, 2022 5 min read

L1 Field Note

PostgreSQL Autovacuum: What Every Engineer Should Know

Autovacuum is not optional maintenance — it is the mechanism that prevents table bloat and transaction ID wraparound from taking your database offline.

Mar 21, 2022 12 min read

L3 Reference Guide

PostgreSQL Slow Query Triage Workflow

A structured runbook for diagnosing slow query root causes in PostgreSQL — missing indexes, stale statistics, lock contention, and I/O saturation — in the order that wastes the least time.

Mar 15, 2022 4 min read

L1 Field Note

WAL Explained for Database Engineers

What write-ahead logging is, why most ACID databases rely on it, and what engineers need to know about LSN ordering, crash recovery, and replication lag.

Feb 14, 2022 5 min read

L1 Field Note

MVCC Explained Like a Database Engineer

How multi-version concurrency control lets readers and writers run without blocking each other — and why misunderstanding it causes table bloat, undo log growth, and stalled vacuums.