Field Notes

All Field Notes Posts

Jun 15, 2026 4 min read

L1 Field Note

Datadog DBM: What Database Teams Should Actually Monitor

Datadog Database Monitoring can surface enormous detail, and bill for it. The skill is choosing the few signals that answer real cost and reliability questions, and not paying to collect noise nobody acts on.

#databases #observability #cost #postgresql

Jun 14, 2026 4 min read

L1 Field Note

AI Token Cost Is the New Cloud Bill

Token spend behaves differently from compute and storage: it scales with usage and prompt design. Treating it like an engineering cost line, the way you treat a database bill, is how you bring it under control.

#ai #cost #cloud #finops

Jun 13, 2026 4 min read

L1 Field Note

Why Database Engineers Should Care About AI Cost Engineering

The skills that make a good cost-aware DBA (measuring usage, finding structural waste, balancing cost against reliability) transfer almost directly to AI workloads. Database engineers are unusually well positioned to own AI cost.

#ai #cost #databases #career

Jun 12, 2026 4 min read

L1 Field Note

How to Run a Database Cost & Reliability Review

A practitioner walkthrough of the review method: what to look at, in what order, how to quantify an opportunity honestly, and how to turn findings into a prioritized 30/60/90-day plan.

#databases #cost #reliability #postgresql

Jun 11, 2026 3 min read

L1 Field Note

Aurora Cost Optimization: The Hidden Database Bill

Aurora cost hides in places the console doesn't foreground: I/O charges, oversized writers and readers, replica sprawl, and storage. A structured way to find and reduce each without hurting reliability.

#databases #cloud #cost #aurora

Jun 10, 2026 3 min read

L1 Field Note

PostgreSQL Bloat, Index Waste, and Cloud Cost

Table bloat, index bloat, and unused indexes are well-known Postgres problems, and they are also direct cloud-cost problems: wasted storage, write amplification, and extra I/O. How to measure both with read-only queries and remediate safely.

#postgresql #databases #cost #performance

Apr 29, 2026 4 min read

L1 Field Note

AI Coding Assistant ROI: When $200/Developer/Month Is Cheap — and When It Is Waste

Why treating AI assistant seats like standard SaaS licenses obscures their true infrastructure cost profile, and how to measure ROI using cloud compute parallels.

#ai-engineering #cloud #architecture #failures

Apr 22, 2026 4 min read

L1 Field Note

Token Budgeting for Engineering Teams: Daily, Weekly, Monthly Controls by Developer and Repository

How to implement token quotas, chargebacks, and spend controls for AI engineering teams, drawing parallels from cloud database cost management.

#cloud #ai-engineering #architecture

Apr 16, 2026 2 min read

L1 Field Note

SQL Server to PostgreSQL Migration Cost Defense Checklist

A pragmatic checklist to defend the business case for migrating away from Microsoft SQL Server.

#checklist #databases

Apr 15, 2026 5 min read

L1 Field Note

AI Cost Observability Dashboard: LangSmith vs Helicone

How to build an AI FinOps dashboard and choose between proxy-based and instrumentation-based observability.

Apr 8, 2026 2 min read

L1 Field Note

System Design

Why Your Non-Prod Databases Cost as Much as Production

Architectural strategies to eliminate waste in Dev, Test, and Staging database environments.

#failures #architecture

Apr 8, 2026 4 min read

L1 Field Note

Why Agentic AI Costs Explode: Context Size, Tool Calls, MCP Servers, Repo Size, and Retry Loops

Agentic AI systems can quietly accumulate massive API bills due to compounding context windows, retry loops, and unconstrained workspace parsing.

#ai-engineering #architecture #cloud #failures

Apr 1, 2026 2 min read

L1 Field Note

The Math Behind Database Reserved Instances: When to Wait

Why committing to 3-year database reserved instances too early locks in architectural waste.

#cloud #architecture

Apr 1, 2026 5 min read

L1 Field Note

Codex Credits and Cost Controls for Business Teams

Practical strategies for managing OpenAI Codex API consumption, workspace credits, and governance across your organization.

#ai-engineering #cloud

Mar 25, 2026 2 min read

L1 Field Note

Oracle Cloud BYOL: True Cost Analysis Beyond the Headline Rate

Understanding the financial nuances, OCPU conversions, and hidden costs of bringing your Oracle licenses to OCI.

#databases #cloud

Mar 18, 2026 2 min read

L1 Field Note

BigQuery Cost Optimization: On-Demand vs Slot Commitments

How to stop runaway BigQuery costs by analyzing query scans, enforcing partitions, and moving to capacity-based pricing.

#cloud #architecture #checklist

Mar 18, 2026 3 min read

L1 Field Note

The New AI FinOps Model: Seat Cost vs Token Cost vs Agent Runtime Cost

Why traditional SaaS spend models fail for agentic AI, and how platform teams are treating LLM compute like database provisioned IOPS.

#ai-engineering #cloud #architecture #failures

Mar 11, 2026 2 min read

L1 Field Note

Oracle to Aurora PostgreSQL: License Cost Elimination in Practice

The engineering reality and ROI of migrating from Oracle to Amazon Aurora PostgreSQL.

Mar 4, 2026 2 min read

L1 Field Note

AWS RDS Oracle and SQL Server: The License Cost Nobody Talks About

Why the default License-Included model on AWS RDS is a financial trap for enterprise database workloads.

#databases #cloud #failures

Feb 27, 2026 4 min read

L1 Field Note

Context Anxiety and Harness Decay

Why agent harnesses become stale when they overfit today's model weaknesses instead of stable execution contracts.

#ai-engineering #architecture #failures

Feb 24, 2026 4 min read

L1 Field Note

Programmatic Tool Calling for DB Automation

A reference pattern for keeping large database outputs out of model context by using scripts that summarize evidence before the agent sees it.

#databases #ai-engineering #architecture

Feb 20, 2026 4 min read

L1 Field Note

Tool Search vs Loading Every MCP Tool

Why production agents need discoverable tools and context budgets instead of one giant always-loaded MCP surface.

#ai-engineering #architecture #cloud

Feb 18, 2026 2 min read

L1 Field Note

Azure Synapse Cost Optimization: DWU Right-Sizing, Serverless, and Hybrid Benefit

How to reduce your Azure Synapse compute bill by right-sizing dedicated pools and offloading to serverless.

Feb 17, 2026 4 min read

L1 Field Note

Token-Efficient Tool Use

How to design agent tool surfaces that preserve context budget for reasoning instead of wasting it on tool metadata and raw output.

Feb 13, 2026 4 min read

L1 Field Note

Application Legibility for Agents

A reference architecture for making logs, metrics, test output, schemas, and deployment history readable by coding agents.

#ai-engineering #architecture #cloud

Feb 11, 2026 2 min read

L1 Field Note

Database Licensing Cost Across AWS, Azure, GCP, and OCI

A framework for managing commercial database licensing costs across the four major cloud providers.

Feb 6, 2026 4 min read

L1 Field Note

Agent-to-Agent Review Loops

A practical review pattern where one agent creates a change and specialized agents review risk, rollback, security, and observability.

Feb 4, 2026 3 min read

L1 Field Note

Cloud Database Cost Engineering: How to Reduce Database, Data Warehouse, and Licensing Spend Across Azure, AWS, GCP, and OCI

A comprehensive framework for reining in cloud database costs, focusing on licensing, right-sizing, and architectural tradeoffs.

#databases #cloud #architecture #checklist

Feb 3, 2026 4 min read

L1 Field Note

Harness Engineering: The 2026 Breakthrough Concept

Why the real engineering surface around agents is the harness of tools, scripts, context, review, and telemetry.

Jan 30, 2026 4 min read

L1 Field Note

Database Runbooks as Agent Contracts

A reference operating model for turning human database runbooks into machine-usable agent contracts.

#databases #ai-engineering #architecture #checklist

Jan 27, 2026 4 min read

L1 Field Note

The New Engineer Role: Implementer to Orchestrator

Why agentic coding shifts senior engineering work toward decomposition, verification, and operating-model design.

Jan 23, 2026 4 min read

L1 Field Note

Repo-Embedded Skills for Database Teams

Why database teams should store agent instructions, runbook contracts, and review policies in the repository instead of in memory.

#ai-engineering #databases #architecture

Jan 20, 2026 4 min read

L1 Field Note

Agentic Code Review for Database Repositories

Database repositories contain hidden rules human reviewers know: never add a blocking index at peak hours, never widen IAM without owner approval. Agent review surfaces these violations before merge, without displacing the human judgment that set the rules.

#ai-engineering #databases #architecture

Jan 16, 2026 4 min read

L1 Field Note

Agent Autonomy Ladder: Manual, Confirm, Auto-Approve, Supervised

A governance model for deciding which database and cloud agent actions require approval and which can run automatically.

Jan 12, 2026 4 min read

L1 Field Note

Outcome-Based Agent Evaluation vs Transcript Review

A field note on why agent evaluation should measure verified state changes instead of polished reasoning traces.

Jan 9, 2026 5 min read

L1 Field Note

Evals Are the New Unit Tests for Agents

Why database and cloud teams need agent eval harnesses that grade outcomes, not persuasive transcripts.

Oct 21, 2025 4 min read

L1 Field Note

Alert Fatigue Engineering: How to Build Fewer, Better, Actionable Alerts

A dashboard is not observability, and an alert without a specific action is just operational debt masquerading as monitoring.

#failures #checklist #architecture

Dec 2, 2024 12 min read

L1 Field Note

The Agent Should Not Have Your App Credentials

Giving an AI coding agent your application's Postgres credentials is the default mistake — the agent inherits every permission the app has. Database-enforced read-only roles, replica routing, query limits, and project-scoped MCP config are the alternative that actually fails closed.

#ai-engineering #databases #failures

Oct 15, 2024 4 min read

L1 Field Note

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

How to position Prometheus and Grafana as the open-source baseline for teams that cannot send every byte of database telemetry to managed services.

#databases #architecture #failures #checklist

Aug 26, 2024 5 min read

L1 Field Note

Why pgcrypto Is Not a Full Key Management Strategy

PostgreSQL's pgcrypto is a cryptographic function library, not a key management system. Treating it as one guarantees your encryption keys will eventually leak.

#databases #security #failures

Jun 4, 2024 4 min read

L1 Field Note

The Database Observability Baseline: What Every DBA Dashboard Must Show

Before you can adopt AI-assisted triage, your database dashboard needs a foundation built on saturation, locking, and lag metrics.

#databases #architecture #failures #checklist

May 20, 2024 5 min read

L1 Field Note

Database Security Review for AI Access

Granting an autonomous AI agent access to your database breaks every assumption of traditional RBAC. How to secure databases against unpredictable, unbounded AI queries.

#ai-engineering #databases #checklist

May 7, 2024 5 min read

L1 Field Note

MySQL 8.4 LTS: What DBAs Should Check Before Upgrade

MySQL 8.4 is the first long-term support release in the 8.x line. Five breaking changes require verification before any production upgrade.

#databases #checklist

Mar 12, 2024 4 min read

L1 Field Note

Consistency Models Your Application Actually Needs

The difference between read committed, repeatable read, and serializable isolation in operational terms — and why most applications are running with weaker guarantees than engineers assume.

Mar 6, 2024 4 min read

L1 Field Note

Vector Search on GPU Databases

A DBA-friendly explanation of how vector search works, why GPUs help, and where vector retrieval fits inside modern database and AI systems.

#databases #gpu #vector-search #retrieval

Mar 5, 2024 5 min read

L1 Field Note

How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database

A DBA-friendly walkthrough of how modern GPU databases execute large analytical SQL queries using columnar storage, parallel scans, and GPU aggregation.

#databases #architecture #ai-engineering

Mar 4, 2024 5 min read

L1 Field Note

Why Databases Are Moving Toward GPU Execution Engines

A practical, DBA-friendly explanation of why modern analytical databases are increasingly using GPUs for scans, joins, aggregations, and AI-adjacent workloads.

#databases #architecture #ai-engineering

Mar 3, 2024 5 min read

L1 Field Note

SIMD vs SIMT Explained for Database Engineers

A DBA-friendly explanation of SIMD and SIMT using query execution, vectorized processing, and GPU mental models instead of hardware jargon.

#databases #cpu #gpu #performance

Mar 2, 2024 5 min read

L1 Field Note

CPU vs GPU vs TPU Explained for Database Engineers

How CPU, GPU, and TPU architectures differ in ways that matter for databases and AI workloads, and which compute class to reach for when adding vector search, embedding generation, or GPU-accelerated analytics.

#databases #architecture #ai-engineering

Feb 19, 2024 5 min read

L1 Field Note

Aurora Global Database: What It Solves and What It Does Not

Aurora Global Database delivers sub-second cross-region replication and under-one-minute RTO for disaster recovery — but it is not active-active, and application failover is never automatic.

Jan 9, 2024 4 min read

L1 Field Note

CAP Theorem in Operational Terms

What CAP theorem actually says about distributed database tradeoffs, why the CP vs AP framing is more useful than the theory, and what it means for your system when the network fails.

#databases #fundamentals #architecture

Nov 14, 2023 4 min read

L1 Field Note

Caches, Queues, and Databases: When to Use Each

The decision framework for choosing between a cache, a queue, and a database, including the failure modes that appear when engineers use the wrong one for the job.

#databases #fundamentals #architecture

Oct 2, 2023 5 min read

L1 Field Note

Why SELECT * Still Hurts Production Systems

SELECT * causes four distinct problems that compound at scale: it prevents covering index usage, transfers unnecessary data, breaks application code silently, and defeats column pruning in analytical systems.

Sep 12, 2023 4 min read

L1 Field Note

Cardinality Estimation: Why the Query Planner Gets It Wrong

How PostgreSQL estimates row counts, why those estimates are wrong for correlated columns and skewed distributions, and what engineers can do when the planner picks a bad plan.

Jul 11, 2023 4 min read

L1 Field Note

Index Selectivity: Why Cardinality Changes Everything

Why a low-cardinality index is often worse than no index, how the query planner uses selectivity estimates, and when to build a partial index instead.

May 29, 2023 5 min read

L1 Field Note

MySQL Binlog Format: Row vs Statement vs Mixed

Choosing the wrong MySQL binary log format silently breaks replication or bloats the binlog — this is the decision tree for picking the right one.

#databases

May 9, 2023 5 min read

L1 Field Note

Reading a Query Plan Without Getting Lost

How to read PostgreSQL EXPLAIN output, what seq scan vs index scan actually means in practice, and the three numbers that matter most in any query plan.

Apr 17, 2023 5 min read

L1 Field Note

Read Replicas Are Not Free Scale

Read replicas add read throughput, but they do not reduce write load, do not eliminate replication lag, and can silently serve stale data under write bursts. Understanding those constraints before adding replicas is the step engineers skip.

#databases #architecture #failures

Mar 14, 2023 4 min read

L1 Field Note

Connection Pooling Explained

Why PostgreSQL connections are expensive, what a connection pool actually does, and the difference between session mode, transaction mode, and statement mode in PgBouncer.

Mar 13, 2023 5 min read

L1 Field Note

MongoDB WiredTiger Cache: Practical Basics

WiredTiger's internal cache is MongoDB's primary memory tier — how to read its metrics, recognize eviction pressure, and size it correctly for your working set.

#databases

Jan 30, 2023 5 min read

L1 Field Note

MySQL Cardinality and Index Selectivity

MySQL ignores an index when the optimizer estimates a full scan is cheaper — which happens when cardinality is too low, statistics are stale, or the query shape doesn't match index selectivity. How to diagnose which problem it is and what to do about each.

#databases #architecture #failures

Jan 10, 2023 4 min read

L1 Field Note

Replication Lag Explained

What replication lag actually measures in PostgreSQL, the three distinct lag components that most monitoring tools conflate, and which one matters for your RPO.

Jan 9, 2023 5 min read

L1 Field Note

PostgreSQL Statistics: Why the Optimizer Gets It Wrong

PostgreSQL's query planner depends entirely on per-column statistics that go stale after bulk loads — here is what that means for query plan quality and how to fix it.

Oct 11, 2022 4 min read

L1 Field Note

Checkpoint and Flush: What Your Database Does Before It Can Rest

What a checkpoint actually does in PostgreSQL, why dirty page flush matters for recovery time, and what engineers should monitor to avoid checkpoint pressure.

Oct 10, 2022 5 min read

L1 Field Note

Redis Memory Eviction Policies Explained

Redis has eight eviction policies and a maxmemory limit. The policy you pick determines whether your cache degrades safely or silently loses hit rate under load.

Sep 12, 2022 5 min read

L1 Field Note

MongoDB Index Basics: Why Your Query Became Slow

MongoDB's default behavior is a full collection scan when no index supports the query. Here is what you need to know about single-field, compound, and multikey indexes before your collection grows past 10K documents.

Aug 9, 2022 4 min read

L1 Field Note

Redo vs Undo: How Databases Recover from Crashes

The two mechanisms databases use to survive crashes — redo brings committed changes forward, undo rolls back uncommitted ones — and why the distinction matters operationally.

Jun 14, 2022 4 min read

L1 Field Note

B-tree vs LSM Tree: The Storage Engine Tradeoff

Why PostgreSQL and MySQL use B-trees while Cassandra and RocksDB use LSM trees: the read/write tradeoff that determines which storage engine fits your workload.

#databases #fundamentals #architecture

Jun 6, 2022 5 min read

L1 Field Note

MySQL EXPLAIN: Reading the Plan Without Guessing

How to read MySQL EXPLAIN output systematically — type column, key column, rows estimate, and Extra flags — so you stop adding indexes blindly.

#databases #checklist

May 9, 2022 5 min read

L1 Field Note

MySQL InnoDB Buffer Pool: The First Thing to Check

The InnoDB buffer pool hit ratio and size are the first metrics to verify on any MySQL server: a default 128MB pool on a 32GB machine can send most reads to disk.

#databases

Apr 11, 2022 5 min read

L1 Field Note

PostgreSQL Autovacuum: What Every Engineer Should Know

Autovacuum is not optional maintenance — it is the mechanism that prevents table bloat and transaction ID wraparound from taking your database offline.

#databases #checklist

Mar 15, 2022 4 min read

L1 Field Note

WAL Explained for Database Engineers

What write-ahead logging is, why most ACID databases use it, and what engineers need to know about LSN ordering, crash recovery, and replication lag.

Feb 14, 2022 5 min read

L1 Field Note