Observability 2024–2026

Three years of production observability: from building the first database dashboard, through AI-assisted incident triage, to autonomous SRE agents that observe themselves.

15 posts AI Engineering

Who This Is For

Platform engineers, DBAs, and SREs who own observability end to end, from setting up the first Prometheus exporter to deciding whether to let an agent execute remediation without human approval.

What You Will Be Able to Do

Build a database observability baseline that surfaces saturation before users notice
Instrument AI-assisted triage so alert noise collapses into a single root-cause hypothesis
Govern telemetry cost so the observability bill doesn't exceed the infrastructure bill
Monitor AI agents, MCP servers, and autonomous SRE pipelines using purpose-built telemetry

Prerequisites

Production experience with at least one database and one alerting system. No AI background required — the 2026 posts treat agents as distributed systems, not magic.

1 Foundation — 2024

What every production observability stack must measure: database dashboards, per-engine instrumentation, open-source tooling, and cost visibility.

Jun 4, 2024 4 min read

L1 Field Note

Databases

The Database Observability Baseline: What Every DBA Dashboard Must Show

Before you can adopt AI-assisted triage, your database dashboard needs a foundation built on saturation, locking, and lag metrics.

#databases #architecture #failures #checklist

Jul 16, 2024 5 min read

L2 Deep Dive

Databases

CloudWatch Database Insights for Aurora and RDS: The New AWS Monitoring Center

How to use CloudWatch and Performance Insights to root-cause Aurora and RDS incidents without deploying third-party agents.

#databases #cloud #architecture

Aug 20, 2024 5 min read

L2 Deep Dive

Databases

PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans

Monitoring PostgreSQL requires looking past the operating system and into the internal bookkeeping of MVCC, autovacuum, and replication streams.

#databases #architecture #failures

Sep 17, 2024 6 min read

L2 Deep Dive

Databases

Cassandra Observability: Compaction, Tombstones, Repair, Latency, and Hot Partitions

Why generic server monitoring fails for Apache Cassandra, and how to track the true operational signals of a distributed masterless database.

#databases #architecture #failures

Oct 15, 2024 4 min read

L1 Field Note

Databases

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

How to position Prometheus and Grafana as the open-source baseline for teams that cannot send every byte of database telemetry to managed services.

#databases #architecture #failures #checklist

Nov 19, 2024 5 min read

L2 Deep Dive

Engineering Fundamentals

Cost Observability: Build Dashboards That Show Waste Before Finance Finds It

How to expand monitoring beyond uptime by building dashboards that expose underutilized RDS instances, EBS io2 waste, and backup retention drift.

#cloud #architecture #checklist

2 AI-Assisted Operations — 2025 H1

When generative AI enters the on-call loop: compressing 500 alerts into a root-cause hypothesis, commercial AI SRE tooling, and the end of single-signal alerting.

Feb 18, 2025 5 min read

L2 Deep Dive

AI Engineering

AI-Assisted Incident Triage: From Alert Noise to Root-Cause Hypotheses

How generative AI tools like CloudWatch Investigations shift the operational burden from reading raw dashboards to validating machine-generated hypotheses.

#ai-engineering #failures #cloud

Apr 15, 2025 5 min read

L2 Deep Dive

AI Engineering

Datadog Bits AI SRE: What an AI On-Call Teammate Changes for DBAs

How autonomous AI agents like Bits AI SRE are shifting the database incident workflow from manual dashboard hunting to conversational investigation.

#ai-engineering #cloud #architecture

Jun 17, 2025 6 min read

L2 Deep Dive

System Design

The End of Single-Signal Alerting: Correlating Metrics, Logs, Traces, Deployments, and Cost

Why paging an engineer solely because CPU hit 85% is an anti-pattern, and how to build correlated alerts that require real operational evidence.

#architecture #failures #system-design

3 Cost Governance & Alert Discipline — 2025 H2

Scaling observability without breaking the budget: tying cloud cost to workload and team, eliminating alert fatigue, and governing telemetry spend at the pipeline level.

Aug 19, 2025 5 min read

L2 Deep Dive

AI Engineering

FinOps Observability: Tie Cloud Cost to Workload, Team, Product, and Customer

How to connect engineering telemetry with cost telemetry to achieve granular cloud unit economics using FinOps principles and the FOCUS (FinOps Open Cost and Usage Specification) billing-data standard.

#cloud #architecture #ai-engineering

Oct 21, 2025 4 min read

L1 Field Note

Engineering Fundamentals

Alert Fatigue Engineering: How to Build Fewer, Better, Actionable Alerts

A dashboard is not observability, and an alert without a specific action is just operational debt masquerading as monitoring.

#failures #checklist #architecture

Dec 9, 2025 6 min read

L2 Deep Dive

AI Engineering

Telemetry Cost Control: Why Observability Data Itself Needs Governance

If you log everything and monitor every dimension, your observability bill will eventually exceed your database infrastructure bill. Here is how to fix it.

#cloud #architecture #ai-engineering

4 Agentic SRE — 2026

Autonomous operations: monitoring the agents themselves, observing MCP control planes, and the reference architecture for safe human-in-the-loop remediation.

Jan 20, 2026 8 min read

L2 Deep Dive

AI Engineering

AI Agent Observability: Monitor Tool Calls, Token Spend, Latency, and Failure Loops

Why monitoring autonomous SRE agents requires tracking tool-call hallucinations, context-window saturation, and recursive retry loops, not CPU metrics alone.

#ai-engineering #architecture #failures #system-design

Mar 10, 2026 8 min read

L2 Deep Dive

AI Engineering

MCP Server Observability: The New Control Plane for AI + Enterprise Tools

How the Model Context Protocol (MCP) became the networking layer for AI agents, and why monitoring these connections is critical for enterprise security.

#ai-engineering #architecture #system-design #security

May 12, 2026 7 min read

L2 Deep Dive

AI Engineering

Agentic SRE Architecture: Skills, Agents, MCP Servers, and Human Approval Loops

The definitive 2026 reference architecture for autonomous database operations, from detection to multi-agent diagnosis to human-in-the-loop remediation.

#ai-engineering #architecture #system-design #cloud