<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Databases | RajivOnAI</title><description>PostgreSQL, Aurora, MySQL, Oracle, Cassandra, MongoDB, pgvector, replication, migrations, indexing, and database operations.</description><link>https://rajivonai.com/topics/databases/</link><item><title>Datadog DBM: What Database Teams Should Actually Monitor</title><link>https://rajivonai.com/blog/2026-06-15-datadog-dbm-what-database-teams-should-actually-monitor/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-06-15-datadog-dbm-what-database-teams-should-actually-monitor/</guid><description>Datadog Database Monitoring can surface enormous detail — and bill for it. The skill is choosing the few signals that answer real cost and reliability questions, and not paying to collect noise nobody acts on.</description><pubDate>Mon, 15 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Datadog Database Monitoring (DBM) will happily show you every query, every plan, and every host metric your fleet produces. The trap is treating “more telemetry” as “better observability.” The teams who get value from DBM monitor a short list of signals tied to decisions — and deliberately ignore the rest, because in DBM the rest is also a line on the bill.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;problem&quot;&gt;Problem&lt;/h2&gt;
&lt;p&gt;A team turns on Datadog DBM expecting clarity and gets a firehose: thousands of normalized queries, host dashboards, plan samples, and a steadily climbing Datadog invoice. Six weeks later the on-call engineer still can’t answer “why was the database slow at 2am?” any faster than before, because the dashboards show &lt;em&gt;everything&lt;/em&gt; and therefore foreground &lt;em&gt;nothing&lt;/em&gt;. Meanwhile DBM is now a noticeable cost itself — host-based DBM pricing plus custom metrics plus log ingestion. Observability that you pay for but don’t act on is just a second cost problem stacked on the first.&lt;/p&gt;
&lt;h2 id=&quot;why-it-matters-financially&quot;&gt;Why it matters financially&lt;/h2&gt;
&lt;p&gt;Observability spend is real spend, and DBM has several meters running at once:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per-host DBM&lt;/strong&gt; scales with your fleet — every replica and non-prod instance you instrument adds cost, whether or not anyone reads its dashboard.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom metrics&lt;/strong&gt; are billed per unique combination of metric name and tag values. High-cardinality tags (per-user, per-request-id) can multiply a single metric into thousands of billable timeseries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Log ingestion and retention&lt;/strong&gt; for slow-query and audit logs add a third meter.&lt;/li&gt;
&lt;/ul&gt;
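&lt;p&gt;To make the cardinality meter concrete, here is a minimal Python sketch of how one unbounded tag multiplies a single metric. The tag names and counts are hypothetical, and the product is an upper bound: only tag combinations that actually report data become timeseries.&lt;/p&gt;

```python
from math import prod


def max_timeseries(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on timeseries for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


# Hypothetical tag sets for a single custom metric.
bounded = {"env": 3, "cluster": 4, "role": 2}
unbounded = {**bounded, "request_id": 50_000}

print(max_timeseries(bounded))    # 24
print(max_timeseries(unbounded))  # 1200000
```

&lt;p&gt;The bounded tags stay in the tens; adding one request-scoped tag moves the ceiling past a million, which is why the audit starts with tags rather than metrics.&lt;/p&gt;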
&lt;p&gt;The financial point cuts both ways: under-monitoring means you can’t see the cost and reliability problems that matter (the theme of every other article in this series), while &lt;em&gt;naïve&lt;/em&gt; monitoring means you pay to collect telemetry nobody uses. The goal is the small set of signals that actually change a decision.&lt;/p&gt;
&lt;h2 id=&quot;technical-root-causes-why-dbm-bills-and-dashboards-balloon&quot;&gt;Technical root causes (why DBM bills and dashboards balloon)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instrumenting everything by default&lt;/strong&gt; — every non-prod and idle replica gets a DBM host agent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High-cardinality custom metrics&lt;/strong&gt; — tagging metrics with unbounded values (user IDs, request IDs) explodes billable timeseries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collecting without alerting&lt;/strong&gt; — query samples and metrics gathered but wired to no alert and no runbook.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Symptom-level alerts&lt;/strong&gt; — “host CPU high” instead of leading indicators (replication lag, connection saturation, storage runway).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No baseline&lt;/strong&gt; — without a normal range, dashboards can’t tell you whether 2am was abnormal.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;review-checklist--what-dbm-should-be-answering&quot;&gt;Review checklist — what DBM &lt;em&gt;should&lt;/em&gt; be answering&lt;/h2&gt;
&lt;p&gt;Monitor signals tied to a decision. At minimum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Top queries by total time and by I/O&lt;/strong&gt; — the same &lt;code&gt;pg_stat_statements&lt;/code&gt; view DBM surfaces fleet-wide; this is your cost and latency hot list.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication lag&lt;/strong&gt; — with a defined normal range and a threshold alert (not just a graph).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection saturation&lt;/strong&gt; — active vs &lt;code&gt;max_connections&lt;/code&gt;, alerted &lt;em&gt;before&lt;/em&gt; the limit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage runway&lt;/strong&gt; — free space / days-to-full, alerted with lead time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache hit ratio&lt;/strong&gt; and &lt;strong&gt;deadlocks/lock waits&lt;/strong&gt; — early signals of memory pressure and contention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-running / idle-in-transaction&lt;/strong&gt; — the transactions that block vacuum and cause incidents.&lt;/li&gt;
&lt;/ul&gt;
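&lt;p&gt;Two of these signals reduce to simple arithmetic, sketched below in Python. The function names and both thresholds (a 14-day lead time, 80% saturation) are illustrative assumptions, not recommendations or Datadog settings.&lt;/p&gt;

```python
def days_to_full(free_gb: float, daily_growth_gb: float) -> float:
    """Storage runway in days; infinite when storage is flat or shrinking."""
    growth = max(daily_growth_gb, 0.0)
    if growth == 0:
        return float("inf")
    return free_gb / growth


def connection_saturation(active: int, max_connections: int) -> float:
    """Fraction of max_connections currently in use."""
    return active / max_connections


# Illustrative thresholds only.
RUNWAY_LEAD_DAYS = 14
SATURATION_WARN = 0.80

runway = days_to_full(free_gb=120, daily_growth_gb=4)
saturation = connection_saturation(active=170, max_connections=200)

print(runway, RUNWAY_LEAD_DAYS > runway)          # 30.0 False
print(saturation, saturation >= SATURATION_WARN)  # 0.85 True
```

&lt;p&gt;Alerting on the derived value (days of runway, fraction of the limit) rather than the raw gauge is what gives the alert lead time.&lt;/p&gt;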
&lt;p&gt;And on the cost side of DBM itself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which hosts are instrumented — are idle replicas and non-prod paying for DBM they don’t need?&lt;/li&gt;
&lt;li&gt;Are any custom metrics high-cardinality? Check your top metrics by timeseries count.&lt;/li&gt;
&lt;li&gt;For every collected signal: is there an alert and a runbook? If not, why collect it?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;example-findings&quot;&gt;Example findings&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;(Illustrative — the patterns these reviews repeatedly surface.)&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DBM was enabled on every host including 6 idle non-prod replicas; scoping DBM to production and active readers cut DBM host cost without losing a single useful dashboard.&lt;/li&gt;
&lt;li&gt;A custom metric tagged with &lt;code&gt;request_id&lt;/code&gt; had ballooned into tens of thousands of billable timeseries; dropping the unbounded tag collapsed it to a handful.&lt;/li&gt;
&lt;li&gt;The team had rich query dashboards but no alert on replication lag — the one signal that would have warned them before a read-after-write incident.&lt;/li&gt;
&lt;li&gt;Slow-query logs were ingested and retained for 30 days but never queried; trimming retention cut log cost with no operational loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;actions-to-take&quot;&gt;Actions to take&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Define the decision for every signal.&lt;/strong&gt; If a metric or log maps to no alert and no runbook, stop paying to collect it (or sample it).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scope DBM to what you act on.&lt;/strong&gt; Production and active replicas first; instrument non-prod only when you’re actively debugging it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kill high-cardinality tags.&lt;/strong&gt; Audit top custom metrics by timeseries count; remove unbounded tag values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert on leading indicators, not symptoms.&lt;/strong&gt; Replication lag, connection saturation, storage runway, long-running transactions — each with a threshold and an owner.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Establish a baseline&lt;/strong&gt; so “is this abnormal?” has a data answer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-check DBM’s own cost&lt;/strong&gt; as a line item — observability is worth paying for; paying for noise is not.&lt;/li&gt;
&lt;/ol&gt;
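&lt;p&gt;The baseline step can be as simple as a percentile over a known-quiet window. The Python sketch below uses a nearest-rank percentile; the sample values, the 95th-percentile cutoff, and the function names are assumptions chosen for illustration.&lt;/p&gt;

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile of a non-empty list, for an integer p from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling division without floats
    return ordered[rank - 1]


def is_abnormal(value: float, baseline: list[float], p: int = 95) -> bool:
    """True when value exceeds the p-th percentile of the baseline window."""
    return value > percentile(baseline, p)


# Hypothetical replication-lag samples, in seconds, from a quiet baseline window.
baseline_lag = [float(s) for s in range(1, 21)]

print(percentile(baseline_lag, 50))     # 10.0
print(percentile(baseline_lag, 95))     # 19.0
print(is_abnormal(45.0, baseline_lag))  # True
```

&lt;p&gt;With a stored baseline, “was 2am abnormal?” becomes a comparison against a number instead of a judgment call on a graph.&lt;/p&gt;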
&lt;p&gt;Good database observability and a controlled observability bill are the same discipline as the rest of cost engineering: collect what answers a question, alert on what you’ll act on, and measure the cost of the tooling itself.&lt;/p&gt;
&lt;h2 id=&quot;review-checklist--next-step&quot;&gt;Review checklist &amp;#x26; next step&lt;/h2&gt;
&lt;p&gt;Use the free &lt;a href=&quot;https://aks.rajivonai.com/30-point-database-cost-review-checklist.md&quot;&gt;30-Point Database Cost Review Checklist&lt;/a&gt; — its Observability section maps directly to the signals above. To see how observability gaps show up in a full review, read the &lt;a href=&quot;https://aks.rajivonai.com/sample-report/&quot;&gt;Acme SaaS sample report&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Want your monitoring assessed against the questions that matter?&lt;/strong&gt; AKS runs a &lt;a href=&quot;https://aks.rajivonai.com/services/database-observability-review/&quot;&gt;Database Observability Review&lt;/a&gt; — what to collect, what to alert on, and what you’re paying to gather but never use. Or &lt;a href=&quot;https://aks.rajivonai.com/contact/&quot;&gt;get in touch&lt;/a&gt; to scope a pilot.&lt;/p&gt;</content:encoded><category>databases</category><category>observability</category><category>cost</category><category>postgresql</category></item><item><title>Why Database Engineers Should Care About AI Cost Engineering</title><link>https://rajivonai.com/blog/2026-06-13-why-database-engineers-should-care-about-ai-cost-engineering/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-06-13-why-database-engineers-should-care-about-ai-cost-engineering/</guid><description>The skills that make a good cost-aware DBA — measuring usage, finding structural waste, balancing cost against reliability — transfer almost directly to AI workloads. Database engineers are unusually well positioned to own AI cost.</description><pubDate>Sat, 13 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;AI cost engineering looks like a new discipline. For a database engineer, it is mostly a familiar one wearing different units. The mental model that finds a bloated index or an oversized instance is the same one that finds a wasteful prompt or an over-large model.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;problem&quot;&gt;Problem&lt;/h2&gt;
&lt;p&gt;AI spend is becoming a top infrastructure line item, and most orgs have nobody who owns it the way a DBA owns the database bill. Product engineers ship features; finance sees a total; no one connects usage to cost at the unit level. The role is open — and database engineers keep assuming it belongs to someone else.&lt;/p&gt;
&lt;h2 id=&quot;why-it-matters-financially&quot;&gt;Why it matters financially&lt;/h2&gt;
&lt;p&gt;For the engineer, this is leverage. AI cost work is high-visibility, under-supplied, and directly tied to dollars an executive cares about. For the org, putting cost-literate engineers on AI spend is the difference between a forecastable line and a quarterly surprise. The same person who can say “this query costs the business $4k/month in I/O” is the person who can say “this prompt design costs $9k/month in tokens” — and both sentences change budgets.&lt;/p&gt;
&lt;h2 id=&quot;technical-root-causes-why-the-analogy-holds&quot;&gt;Technical root causes (why the analogy holds)&lt;/h2&gt;
&lt;p&gt;The transferable model is: &lt;strong&gt;measure usage → find structural waste → quantify the opportunity → sequence the fix against risk.&lt;/strong&gt; The specifics map cleanly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt; ↔ per-call token logging.&lt;/strong&gt; Both answer “where does the cost concentrate?”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexes ↔ embeddings/retrieval.&lt;/strong&gt; Both are precomputation that trades storage/compute for query speed — and both are routinely over- or under-built.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Caching (buffer cache, result cache) ↔ prompt caching / result caching.&lt;/strong&gt; Same idea: don’t pay twice for the same work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instance right-sizing ↔ model right-sizing.&lt;/strong&gt; Don’t run a frontier model (or an r6g.4xlarge) for a workload a smaller one serves.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query plans ↔ context construction.&lt;/strong&gt; Both are about giving the engine exactly what it needs and no more.&lt;/li&gt;
&lt;/ul&gt;
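&lt;p&gt;The first mapping is the easiest to make concrete. Below is a minimal Python sketch of per-call token logging rolled up by feature, the AI-side counterpart of sorting &lt;code&gt;pg_stat_statements&lt;/code&gt; by total time. The model names, prices, and log fields are hypothetical; substitute your provider’s actual rates.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens.
PRICE_PER_MTOK = {
    "small-model": {"input": 0.5, "output": 1.5},
    "large-model": {"input": 5.0, "output": 15.0},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, from its token counts."""
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by_feature(calls: list[dict]) -> list[tuple[str, float]]:
    """Total cost per feature, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call["feature"]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical call log.
calls = [
    {"feature": "search", "model": "large-model", "input_tokens": 2_000_000, "output_tokens": 200_000},
    {"feature": "summarize", "model": "small-model", "input_tokens": 1_000_000, "output_tokens": 1_000_000},
]

print(cost_by_feature(calls))  # [('search', 13.0), ('summarize', 2.0)]
```

&lt;p&gt;Tagging by feature is the design choice that matters: a per-model total tells you what you spent, while a per-feature total tells you where to look.&lt;/p&gt;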
&lt;h2 id=&quot;where-the-analogy-breaks&quot;&gt;Where the analogy breaks&lt;/h2&gt;
&lt;p&gt;One place it does not transfer: &lt;strong&gt;quality is a continuous tradeoff with no database equivalent.&lt;/strong&gt; Dropping an unused index is free; dropping to a cheaper model might lose accuracy. AI cost work therefore always needs a quality guardrail — an evaluation set you check before and after every change. A DBA’s instinct to optimize aggressively must be paired with that guardrail.&lt;/p&gt;
&lt;h2 id=&quot;review-checklist-a-dbas-first-look-at-ai-spend&quot;&gt;Review checklist (a DBA’s first look at AI spend)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Is there per-call logging of tokens and model, tagged by feature? (Your &lt;code&gt;pg_stat_statements&lt;/code&gt;.)&lt;/li&gt;
&lt;li&gt;What share of calls use a model larger than the task needs? (Your right-sizing pass.)&lt;/li&gt;
&lt;li&gt;Is anything recomputed that could be cached? (Your buffer-cache instinct.)&lt;/li&gt;
&lt;li&gt;Is retrieved context larger than the model needs? (Your “why is this a seq scan?” instinct.)&lt;/li&gt;
&lt;li&gt;Is there an evaluation set guarding quality before cost changes ship?&lt;/li&gt;
&lt;li&gt;Who owns the AI cost number, and do they see it weekly?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;example-findings&quot;&gt;Example findings&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;(Illustrative.)&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A database engineer reviewing an LLM feature spotted that retrieval returned 20 chunks where ranking showed the answer was almost always in the top 5 — the same “you’re scanning more than you read” pattern they’d flagged in SQL a hundred times.&lt;/li&gt;
&lt;li&gt;The same engineer recognized an uncached static prompt as exactly the repeated-work pattern a result cache solves on the database side.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;actions-to-take&quot;&gt;Actions to take&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Claim the unit-accounting work.&lt;/strong&gt; Add per-call cost logging; it is the AI analog of enabling statement stats, and it makes you the person with the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apply your right-sizing playbook&lt;/strong&gt; to models, with an evaluation set as the guardrail.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bring caching and “don’t recompute” instincts&lt;/strong&gt; to prompts and retrieval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frame findings in dollars and risk&lt;/strong&gt;, exactly as you would a database cost review.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;a-30-day-ramp&quot;&gt;A 30-day ramp&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; read your provider’s pricing and token mechanics; add per-call cost logging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; build a small evaluation set for one feature; baseline its quality and cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 3:&lt;/strong&gt; run a model right-sizing and caching experiment behind the guardrail.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 4:&lt;/strong&gt; write it up in impact × effort × risk terms — the same report you’d hand to an engineering manager after a database review.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Run the database review that proves the model first.&lt;/strong&gt; See &lt;a href=&quot;https://rajivonai.com/blog/2026-06-12-how-to-run-a-database-cost-and-reliability-review/&quot;&gt;How to Run a Database Cost &amp;#x26; Reliability Review&lt;/a&gt;, grab the free &lt;a href=&quot;https://aks.rajivonai.com/30-point-database-cost-review-checklist.md&quot;&gt;30-Point Checklist&lt;/a&gt;, or talk to AKS about a &lt;a href=&quot;https://aks.rajivonai.com/services/database-cost-reliability-review/&quot;&gt;Database Cost &amp;#x26; Reliability Review&lt;/a&gt; — and see the &lt;a href=&quot;https://aks.rajivonai.com/sample-report/&quot;&gt;Acme SaaS sample report&lt;/a&gt; for what one delivers.&lt;/p&gt;</content:encoded><category>ai</category><category>cost</category><category>databases</category><category>career</category></item><item><title>How to Run a Database Cost &amp; Reliability Review</title><link>https://rajivonai.com/blog/2026-06-12-how-to-run-a-database-cost-and-reliability-review/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-06-12-how-to-run-a-database-cost-and-reliability-review/</guid><description>A practitioner walkthrough of the review method: what to look at, in what order, how to quantify an opportunity honestly, and how to turn findings into a prioritized 30/60/90-day plan.</description><pubDate>Fri, 12 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A good cost review is not a tool that prints a number. It is a sequence: get the right access, look at nine areas in order, quantify each opportunity with its own math, and rank the fixes by impact, effort, and risk. Here is the method, end to end.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;problem&quot;&gt;Problem&lt;/h2&gt;
&lt;p&gt;Most database “cost reviews” are either a vendor dashboard screenshot or a one-off “make it cheaper” sprint. Neither produces something a team can act on with confidence. The first lacks engineering judgment; the second lacks reliability guardrails and tends to trade away durability for a short-term saving. A real review is structured, evidence-based, and sequenced.&lt;/p&gt;
&lt;h2 id=&quot;why-it-matters-financially&quot;&gt;Why it matters financially&lt;/h2&gt;
&lt;p&gt;Database spend grows quietly and compounds. The cost of &lt;em&gt;not&lt;/em&gt; reviewing is two-sided: you keep paying for waste (oversized instances, idle replicas, bloat), and you carry unmeasured reliability risk (untested failover, unverified restores) that turns into an expensive incident at the worst time. A structured review surfaces both — and, just as important, it produces a &lt;em&gt;prioritized&lt;/em&gt; plan, so the savings actually get implemented instead of dying in a backlog.&lt;/p&gt;
&lt;h2 id=&quot;technical-root-causes-why-bills-drift&quot;&gt;Technical root causes (why bills drift)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Instances sized for a launch and never revisited.&lt;/li&gt;
&lt;li&gt;Storage and I/O charges that grow without anyone watching the trend.&lt;/li&gt;
&lt;li&gt;Replicas added “to be safe” that never receive read traffic.&lt;/li&gt;
&lt;li&gt;Bloat and unused indexes inflating storage and write cost.&lt;/li&gt;
&lt;li&gt;Observability too thin to even see where the money goes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-method-in-order&quot;&gt;The method, in order&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;0. Get read-only access and a metrics window.&lt;/strong&gt; Without it you are guessing. A replica, snapshot, or read-only role plus 2–4 weeks of metrics is enough. Sign a mutual NDA; never take write access for a review.&lt;/p&gt;
&lt;p&gt;Then work the &lt;strong&gt;nine areas&lt;/strong&gt;, in this order (cheap-to-see first, riskier-to-fix later):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cost&lt;/strong&gt; — instance sizing vs utilization, idle/non-prod, pricing model, storage/I/O drivers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance&lt;/strong&gt; — top queries (&lt;code&gt;pg_stat_statements&lt;/code&gt;), index effectiveness, connections, cache hit ratio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability&lt;/strong&gt; — failover tested, HA posture, single points of failure, headroom.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage&lt;/strong&gt; — bloat/dead tuples, growth trend, retention/archival.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication&lt;/strong&gt; — replica utilization, lag visibility, read/write routing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backup &amp;#x26; recovery&lt;/strong&gt; — backups exist, restores tested, PITR/RPO understood.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability&lt;/strong&gt; — metrics coverage, query-level insight, alerting on leading indicators.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt; — encryption, least-privilege, audit/change visibility.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation&lt;/strong&gt; — which toil could be automated to cut risk and cost.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;quantifying-an-opportunity-honestly&quot;&gt;Quantifying an opportunity honestly&lt;/h2&gt;
&lt;p&gt;This is where reviews earn or lose trust. For each opportunity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Show the math.&lt;/strong&gt; “Writer at 14% peak CPU over 30 days; stepping down one instance size roughly halves its compute cost, saving ≈ $X/month.”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Give a range, not a point.&lt;/strong&gt; Real savings depend on validation and execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Never promise a percentage before you’ve looked.&lt;/strong&gt; Be wary of anyone who does.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flag the reliability tradeoff&lt;/strong&gt; of every cost cut explicitly.&lt;/li&gt;
&lt;/ul&gt;
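&lt;p&gt;One way to report a range rather than a point is to discount the nominal saving by how much of it is likely to survive validation. The Python sketch below does that; the dollar figures and the 75% to 100% realization band are hypothetical and should be replaced with your own evidence.&lt;/p&gt;

```python
def savings_range(
    current_monthly: float,
    target_monthly: float,
    realization: tuple[float, float] = (0.75, 1.0),
) -> tuple[float, float]:
    """Return (low, high) monthly savings in the same currency as the inputs.

    realization is an assumed band for the share of the nominal saving
    that survives validation and execution.
    """
    nominal = current_monthly - target_monthly
    low, high = realization
    return (round(nominal * low, 2), round(nominal * high, 2))


# Hypothetical: a writer costing $2,000/month stepped down to a $1,000/month size.
print(savings_range(2000.0, 1000.0))  # (750.0, 1000.0)
```

&lt;p&gt;Reporting “$750 to $1,000 per month” keeps the estimate honest if the step-down has to be partially rolled back after testing.&lt;/p&gt;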
&lt;h2 id=&quot;prioritizing-impact--effort--risk&quot;&gt;Prioritizing: impact × effort × risk&lt;/h2&gt;
&lt;p&gt;Score each finding on impact (cost or reliability), effort to fix, and risk of the fix. The plan writes itself when you sort by those three: low-risk high-impact first, risky changes later with guardrails.&lt;/p&gt;
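&lt;p&gt;A minimal Python sketch of that sort follows. The 1-to-5 scales, the example findings, and the choice to order by risk first, then impact, then effort are assumptions; a weighted score would be an equally reasonable reading of the same idea.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    impact: int  # 1 (small) to 5 (large)
    effort: int  # 1 (easy) to 5 (hard)
    risk: int    # 1 (safe) to 5 (dangerous)


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Order findings: lowest risk first, then highest impact, then lowest effort."""
    return sorted(findings, key=lambda f: (f.risk, -f.impact, f.effort))


# Hypothetical findings and scores.
findings = [
    Finding("Step down analytics writer", impact=4, effort=2, risk=3),
    Finding("Remove idle reporting reader", impact=3, effort=1, risk=1),
    Finding("Add alert on replication lag", impact=4, effort=1, risk=1),
    Finding("Repack bloated queue table", impact=3, effort=3, risk=4),
]

for finding in prioritize(findings):
    print(finding.name)
```

&lt;p&gt;With these scores the alert and the idle reader come first, the writer step-down next, and the table repack last, which matches the low-risk-first shape of the 30/60/90 plan.&lt;/p&gt;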
&lt;h2 id=&quot;building-the-306090-plan&quot;&gt;Building the 30/60/90 plan&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;First 30 days — instrument &amp;#x26; capture low-risk wins:&lt;/strong&gt; enable statement stats and slow-query logging, add leading-indicator alerts, remove clearly idle resources, confirm restores work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Days 31–60 — right-size &amp;#x26; reduce structural waste:&lt;/strong&gt; act on sizing and pricing findings backed by data, fix replica routing, begin bloat/index cleanup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Days 61–90 — harden &amp;#x26; sustain:&lt;/strong&gt; failover testing, pooling, automation of toil, and a baseline so you can prove the changes worked.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;review-checklist&quot;&gt;Review checklist&lt;/h2&gt;
&lt;p&gt;Use the full &lt;a href=&quot;https://aks.rajivonai.com/30-point-database-cost-review-checklist.md&quot;&gt;30-Point Database Cost Review Checklist&lt;/a&gt; to run this yourself. It covers all nine areas plus the planning step.&lt;/p&gt;
&lt;h2 id=&quot;example-findings&quot;&gt;Example findings&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;(Illustrative.)&lt;/em&gt; A typical first review surfaces: one non-prod environment running (and billing) far more hours than anyone uses it, one or two idle replicas, a handful of unused indexes, a top-three I/O query missing an index, and, almost always, at least one untested restore or failover. The cost items pay for the review; the reliability items are why you do it before an incident.&lt;/p&gt;
&lt;h2 id=&quot;actions-to-take&quot;&gt;Actions to take&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Secure read-only access and a metrics export.&lt;/li&gt;
&lt;li&gt;Walk the nine areas in order; cite evidence for every finding.&lt;/li&gt;
&lt;li&gt;Quantify each opportunity with its own math and a range.&lt;/li&gt;
&lt;li&gt;Rank by impact × effort × risk and write the 30/60/90 plan.&lt;/li&gt;
&lt;li&gt;Re-measure after changes to confirm they landed.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Want this run for your environment by a senior engineer?&lt;/strong&gt; AKS delivers a &lt;a href=&quot;https://aks.rajivonai.com/services/database-cost-reliability-review/&quot;&gt;Database Cost &amp;#x26; Reliability Review&lt;/a&gt; with prioritized findings and a 30/60/90 plan — read-only, evidence-driven, no overpromised savings. See the full &lt;a href=&quot;https://aks.rajivonai.com/sample-report/&quot;&gt;Acme SaaS sample report&lt;/a&gt; for the exact format.&lt;/p&gt;</content:encoded><category>databases</category><category>cost</category><category>reliability</category><category>postgresql</category></item><item><title>Aurora Cost Optimization: The Hidden Database Bill</title><link>https://rajivonai.com/blog/2026-06-11-aurora-cost-optimization-the-hidden-database-bill/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-06-11-aurora-cost-optimization-the-hidden-database-bill/</guid><description>Aurora cost hides in places the console doesn&apos;t foreground — I/O charges, oversized writers and readers, replica sprawl, and storage. A structured way to find and reduce each without hurting reliability.</description><pubDate>Thu, 11 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Aurora’s bill is three things — compute, storage, and I/O — and the one that surprises teams is I/O, because it scales with how your queries read data, not with anything you provisioned. Most Aurora cost reviews stop at instance class and miss the line that’s actually growing.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;problem&quot;&gt;Problem&lt;/h2&gt;
&lt;p&gt;An Aurora bill climbs and the obvious lever — instance class — doesn’t explain it. The writer looks busy enough. Nobody touched the cluster config. Yet month over month the number rises. The cost is real but diffuse: a bit of oversizing, a couple of idle readers, storage that only grows, and an I/O charge driven by query patterns nobody is watching.&lt;/p&gt;
&lt;h2 id=&quot;why-it-matters-financially&quot;&gt;Why it matters financially&lt;/h2&gt;
&lt;p&gt;For a mid-size Aurora estate, the I/O line and replica sprawl together are frequently the largest recoverable spend — and both are low-risk to address once you can see them. Unlike a risky schema change, removing an idle reader or indexing a hot sequential-scan query is reversible and safe. The financial point: the biggest Aurora wins are usually the &lt;em&gt;least&lt;/em&gt; dangerous ones, which is exactly why leaving them in place is hard to justify once measured.&lt;/p&gt;
&lt;h2 id=&quot;technical-root-causes&quot;&gt;Technical root causes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;I/O charges from inefficient reads.&lt;/strong&gt; On the Standard configuration, Aurora bills per I/O request made to the storage layer. A few high-frequency queries doing sequential scans on large tables can dominate the bill while looking unremarkable in the query list.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Oversized writers and readers.&lt;/strong&gt; Instances sized for a historical peak (a backfill, a launch) and never revisited; steady-state CPU sits low.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replica sprawl.&lt;/strong&gt; Readers added for HA or “reporting” that no longer receive meaningful read traffic — full instance cost for near-zero use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read/write routing gaps.&lt;/strong&gt; The primary carries read load the readers were paid to absorb.&lt;/li&gt;
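&lt;li&gt;&lt;strong&gt;Storage that only grows.&lt;/strong&gt; Aurora storage auto-grows, and billed storage drops only when space is actually freed, for example by dropping or rewriting a table (older engine versions did not shrink at all). Bloat and unarchived cold data keep it inflated until someone removes them.&lt;/li&gt;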
&lt;/ul&gt;
&lt;h2 id=&quot;review-checklist&quot;&gt;Review checklist&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;What is your I/O charge as a share of the cluster bill, and which queries drive it?&lt;/li&gt;
&lt;li&gt;What is peak (not average) CPU/connections on each writer and reader over 30 days?&lt;/li&gt;
&lt;li&gt;Does each reader receive real read traffic? Pull per-replica read metrics.&lt;/li&gt;
&lt;li&gt;Is read traffic actually routed to readers (reader endpoint / routing layer)?&lt;/li&gt;
&lt;li&gt;Would &lt;strong&gt;Aurora I/O-Optimized&lt;/strong&gt; be cheaper given your I/O-to-compute ratio?&lt;/li&gt;
&lt;li&gt;Is storage growth trended? What’s the largest contributor (bloat, logs, cold data)?&lt;/li&gt;
&lt;li&gt;Are there indexes that would convert your top sequential scans into index scans?&lt;/li&gt;
&lt;/ul&gt;
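&lt;p&gt;The I/O-Optimized question is a straight comparison once the bill is split into its three lines. The Python sketch below assumes I/O-Optimized removes the I/O charge in exchange for higher compute and storage rates; the dollar amounts and both uplift percentages are placeholders, so check the current AWS price list for your region before relying on the result.&lt;/p&gt;

```python
def standard_monthly(compute: float, storage: float, io: float) -> float:
    """Monthly cost on the Standard configuration: all three lines are billed."""
    return round(compute + storage + io, 2)


def io_optimized_monthly(
    compute: float, storage: float, compute_uplift: float, storage_uplift: float
) -> float:
    """Monthly cost with no I/O line but uplifted compute and storage rates."""
    return round(compute * (1 + compute_uplift) + storage * (1 + storage_uplift), 2)


# Hypothetical monthly bill in USD; the uplifts are placeholders, not quoted prices.
compute, storage, io = 3000.0, 500.0, 1500.0
standard = standard_monthly(compute, storage, io)
optimized = io_optimized_monthly(compute, storage, compute_uplift=0.30, storage_uplift=1.25)

print(standard, optimized)      # 5000.0 5025.0
print(round(io / standard, 2))  # 0.3
```

&lt;p&gt;In this made-up case I/O is 30% of the bill and Standard is still slightly cheaper, because the storage uplift eats the saving. That is the reason to model the switch against your own ratio instead of a rule of thumb.&lt;/p&gt;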
&lt;h2 id=&quot;example-findings&quot;&gt;Example findings&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;(Illustrative.)&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Three high-frequency queries accounted for a large share of logical reads via sequential scans; targeted indexes plus one query rewrite cut I/O operations materially and improved latency.&lt;/li&gt;
&lt;li&gt;A reporting reader showed negligible reads after reporting moved elsewhere; removing it recovered the full reader cost with no functional impact.&lt;/li&gt;
&lt;li&gt;An analytics writer sized during a 14-month-old backfill ran at ~14% peak CPU; a validated step-down recovered roughly half its compute cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;actions-to-take&quot;&gt;Actions to take&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Break the bill into compute / storage / I/O&lt;/strong&gt; so you know which lever matters. Don’t assume it’s instance class.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attack I/O at the query level.&lt;/strong&gt; Index the top sequential-scan queries; rewrite the worst offenders. Validate in staging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit every reader for real traffic&lt;/strong&gt; and confirm routing; remove or repurpose idle ones after a consumer check.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Right-size against peak, not average,&lt;/strong&gt; with month-end and spike windows included.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluate Aurora I/O-Optimized&lt;/strong&gt; if your I/O charges are a large, steady share — model it against your actual ratio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend storage&lt;/strong&gt; and address bloat/retention so it stops growing unboundedly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Every one of these is read-only to &lt;em&gt;find&lt;/em&gt; and reversible to &lt;em&gt;apply&lt;/em&gt; — make the change in staging, confirm the metric moved, then promote.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Want your Aurora estate reviewed by a senior engineer?&lt;/strong&gt; AKS delivers a &lt;a href=&quot;https://aks.rajivonai.com/services/database-cost-reliability-review/&quot;&gt;Database Cost &amp;#x26; Reliability Review&lt;/a&gt; that breaks down compute/storage/I/O, ranks findings by impact and effort, and shows the math — no promised percentage. Or self-assess with the free &lt;a href=&quot;https://aks.rajivonai.com/30-point-database-cost-review-checklist.md&quot;&gt;30-Point Checklist&lt;/a&gt;, or read the &lt;a href=&quot;https://aks.rajivonai.com/sample-report/&quot;&gt;Acme SaaS sample report&lt;/a&gt; to see the deliverable.&lt;/p&gt;</content:encoded><category>databases</category><category>cloud</category><category>cost</category><category>aurora</category></item><item><title>PostgreSQL Bloat, Index Waste, and Cloud Cost</title><link>https://rajivonai.com/blog/2026-06-10-postgresql-bloat-index-waste-and-cloud-cost/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-06-10-postgresql-bloat-index-waste-and-cloud-cost/</guid><description>Table and index bloat and unused indexes are well-known Postgres problems — and direct cloud-cost problems: wasted storage, write amplification, and extra I/O. How to measure both with read-only queries and remediate safely.</description><pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Bloat and unused indexes are usually filed under “performance hygiene.” On a cloud database they are also a line on the bill: storage you pay for and never use, writes amplified across indexes nobody reads, and I/O spent scanning dead space. The fixes are well understood and mostly low-risk — the hard part is seeing the problem.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;problem&quot;&gt;Problem&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s MVCC model creates dead tuples on every update and delete. Autovacuum reclaims them for reuse, but under heavy churn — or with mistuned autovacuum — dead space accumulates faster than it’s reclaimed. Tables and indexes grow beyond the live data they hold. Separately, indexes added years ago for queries that no longer run keep costing write overhead and storage. Neither shows up as a “cost” problem until you go looking.&lt;/p&gt;
&lt;h2 id=&quot;why-it-matters-financially&quot;&gt;Why it matters financially&lt;/h2&gt;
&lt;ul&gt;
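&lt;li&gt;&lt;strong&gt;Storage&lt;/strong&gt; on cloud Postgres (and Aurora) is billed on what’s allocated or used. Bloat inflates that figure until the table or index is rewritten, because a plain &lt;code&gt;VACUUM&lt;/code&gt; makes dead space reusable but rarely returns it to the storage layer. Aurora versions without dynamic resizing never shrink the volume, and newer ones release space only after the database frees it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write amplification:&lt;/strong&gt; every &lt;code&gt;INSERT&lt;/code&gt;, and every &lt;code&gt;UPDATE&lt;/code&gt; that cannot be done as a heap-only tuple (HOT) update, maintains &lt;em&gt;every&lt;/em&gt; index on the table. Unused indexes tax those writes with zero read benefit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;I/O:&lt;/strong&gt; bloated tables mean more pages scanned for the same rows. That is more I/O, which on Aurora Standard is a direct charge and everywhere is latency.&lt;/li&gt;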
&lt;/ul&gt;
&lt;p&gt;These are small per-row and large in aggregate — the classic shape of a cost that hides until measured.&lt;/p&gt;
&lt;h2 id=&quot;technical-root-causes&quot;&gt;Technical root causes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;High-churn tables (queues, counters, soft-deletes) outpacing autovacuum defaults.&lt;/li&gt;
&lt;li&gt;Long-running transactions holding back the xmin horizon so vacuum can’t reclaim.&lt;/li&gt;
&lt;li&gt;Indexes created for one-off queries, dashboards, or ORMs and never removed.&lt;/li&gt;
&lt;li&gt;Duplicate or redundant indexes (e.g. an index that’s a prefix of another).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;review-checklist-read-only&quot;&gt;Review checklist (read-only)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Which tables and indexes have the highest estimated bloat?&lt;/li&gt;
&lt;li&gt;Is autovacuum keeping up, or are dead tuples climbing on hot tables?&lt;/li&gt;
&lt;li&gt;Are there long-running transactions blocking vacuum?&lt;/li&gt;
&lt;li&gt;Which indexes have zero or near-zero scans in &lt;code&gt;pg_stat_user_indexes&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;Any duplicate/redundant indexes?&lt;/li&gt;
&lt;li&gt;What’s the storage trend, and how much is reclaimable?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The companion &lt;a href=&quot;https://aks.rajivonai.com/resources/&quot;&gt;DB Cost &amp;#x26; Reliability Toolkit&lt;/a&gt; ships read-only &lt;code&gt;index_bloat_review.sql&lt;/code&gt; and related checks for exactly this.&lt;/p&gt;
&lt;h2 id=&quot;example-findings&quot;&gt;Example findings&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;(Illustrative.)&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Four high-churn tables carried significant estimated bloat; tuning autovacuum (lower scale factors, more workers) plus a maintenance-window repack reclaimed storage and cut scan I/O.&lt;/li&gt;
&lt;li&gt;Six indexes showed zero scans over a 30-day window while adding write overhead; dropping them (after confirming no rare/seasonal use) reduced write amplification and storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;actions-to-take&quot;&gt;Actions to take&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Measure before touching anything.&lt;/strong&gt; Run bloat estimation and &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; scan counts. Capture a 30-day window so you don’t drop a seasonal index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tune autovacuum on hot tables&lt;/strong&gt; (a lower per-table &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt;, more workers, a higher &lt;code&gt;autovacuum_vacuum_cost_limit&lt;/code&gt;) before resorting to rewrites.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reclaim bloat safely.&lt;/strong&gt; Prefer online tools over blocking rewrites: &lt;code&gt;pg_repack&lt;/code&gt; for tables and &lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt; (PostgreSQL 12+) for indexes, rather than &lt;code&gt;VACUUM FULL&lt;/code&gt; or a plain &lt;code&gt;REINDEX&lt;/code&gt;. Schedule maintenance windows for anything that still needs an exclusive lock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Drop unused indexes carefully.&lt;/strong&gt; Confirm zero scans across a long-enough window on the primary and on every read replica (scan counters are tracked per instance), and check for constraint-backing indexes before dropping.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hunt long-running transactions&lt;/strong&gt; that hold back vacuum; they’re often the real root cause.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Make it recurring.&lt;/strong&gt; Add bloat and unused-index checks to a monthly hygiene routine and alert on storage runway.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A note on safety: &lt;em&gt;finding&lt;/em&gt; all of this is read-only. &lt;em&gt;Applying&lt;/em&gt; it ranges from low-risk (dropping an index confirmed unused, which is still a schema change that needs a rollback plan) to needs-a-window (repacking a large table). Sequence accordingly and validate in staging.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Want a senior engineer to find and quantify this in your database?&lt;/strong&gt; AKS runs a &lt;a href=&quot;https://aks.rajivonai.com/services/database-cost-reliability-review/&quot;&gt;Database Cost &amp;#x26; Reliability Review&lt;/a&gt; that includes bloat and index analysis with the math behind each opportunity. Start free with the &lt;a href=&quot;https://aks.rajivonai.com/30-point-database-cost-review-checklist.md&quot;&gt;30-Point Checklist&lt;/a&gt;, or see a worked example in the &lt;a href=&quot;https://aks.rajivonai.com/sample-report/&quot;&gt;Acme SaaS sample report&lt;/a&gt;.&lt;/p&gt;</content:encoded><category>postgresql</category><category>databases</category><category>cost</category><category>performance</category></item><item><title>Per-App Postgres on Kubernetes Changes the Failure Boundary</title><link>https://rajivonai.com/blog/2026-05-28-per-app-postgres-on-kubernetes-changes-the-failure-boundary/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-05-28-per-app-postgres-on-kubernetes-changes-the-failure-boundary/</guid><description>How CloudNativePG, GitOps, and external secrets make per-application Postgres viable without hiding the operational cost.</description><pubDate>Thu, 28 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Per-application PostgreSQL does not make databases easier to operate; it makes the failure boundary smaller and the operating contract larger. The trade is worth considering only when the platform can prove that every declared database can fail over, rotate credentials, archive WAL, restore into a clean namespace, and survive Kubernetes maintenance without relying on tribal memory.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The old platform default was a shared managed PostgreSQL cluster with many application databases. It is efficient, familiar, and often the right answer. It also couples teams through change windows, noisy neighbors, backup policy, major-version lifecycle, and shared operational risk.&lt;/p&gt;
&lt;p&gt;The newer pattern is one PostgreSQL cluster per application, declared in Git and reconciled by a Kubernetes operator such as CloudNativePG. That changes what the platform owns. The platform is no longer only offering “a database”; it is offering a repeatable database lifecycle.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Default model&lt;/th&gt;&lt;th&gt;Alternative model&lt;/th&gt;&lt;th&gt;What changes&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;One shared managed PostgreSQL cluster, many databases&lt;/td&gt;&lt;td&gt;One CloudNativePG cluster per application&lt;/td&gt;&lt;td&gt;Failure moves from shared infrastructure to per-service blast radius&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Central database administrator controls change windows&lt;/td&gt;&lt;td&gt;GitOps declares database intent per service&lt;/td&gt;&lt;td&gt;Review moves into pull requests, admission policy, and runbooks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backups and upgrades handled at the shared cluster level&lt;/td&gt;&lt;td&gt;Backups and upgrades handled per cluster&lt;/td&gt;&lt;td&gt;More isolation, more fleet operations&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Credentials and connectivity are centrally managed&lt;/td&gt;&lt;td&gt;Secrets are synchronized into each namespace&lt;/td&gt;&lt;td&gt;Rotation becomes an end-to-end workflow, not a secret-store update&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Database operations are concentrated in a few large systems&lt;/td&gt;&lt;td&gt;Database operations are repeated across many smaller systems&lt;/td&gt;&lt;td&gt;Templates, policy, alerts, and restore drills become the product&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;CloudNativePG makes this viable because PostgreSQL becomes a Kubernetes custom resource. Argo CD can reconcile the database intent from Git. External Secrets Operator can pull credentials from Azure Key Vault or another external store into Kubernetes Secrets. Kustomize overlays can keep environment differences explicit.&lt;/p&gt;
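&lt;p&gt;As a rough sketch of the secrets piece, an &lt;code&gt;ExternalSecret&lt;/code&gt; along these lines could materialize a database owner credential in the application namespace. The store name &lt;code&gt;azure-key-vault&lt;/code&gt;, the remote key names, and the refresh interval are hypothetical, and the &lt;code&gt;apiVersion&lt;/code&gt; depends on the External Secrets Operator release installed (&lt;code&gt;v1beta1&lt;/code&gt; here; newer releases use &lt;code&gt;v1&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;apiVersion&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;external-secrets.io/v1beta1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;kind&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;ExternalSecret&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;metadata&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-owner&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  namespace&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;spec&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  refreshInterval&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;1h&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  secretStoreRef&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    kind&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;ClusterSecretStore&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # hypothetical store backed by Azure Key Vault&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;azure-key-vault&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  target&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-owner&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    template&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;      # CloudNativePG expects a basic-auth Secret with username and password keys&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      type&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;kubernetes.io/basic-auth&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  data&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;secretKey&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;username&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      remoteRef&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;        # hypothetical Key Vault secret names&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        key&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-username&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;secretKey&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;password&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      remoteRef&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        key&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-password&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Syncing the Secret only changes what Kubernetes stores. The PostgreSQL role password and any pooled application connections still need their own rotation step.&lt;/p&gt;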
&lt;p&gt;That is a strong architecture. It is not managed-database simplicity with YAML in front of it.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The operator can create the cluster. That is the least interesting part.&lt;/p&gt;
&lt;p&gt;The production question is whether the database survives the ordinary failures: node drains, bad migrations, storage latency, broken WAL archiving, stale credentials, object-store access errors, version drift, and emergency changes made while GitOps is still reconciling the old state.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Shared cluster migrations&lt;/td&gt;&lt;td&gt;One application’s migration can saturate I/O, bloat catalogs, or hold locks visible to unrelated tenants&lt;/td&gt;&lt;td&gt;Per-database isolation inside one PostgreSQL instance is not operational isolation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitOps self-healing&lt;/td&gt;&lt;td&gt;Argo CD can reapply the desired state after manual emergency changes when &lt;code&gt;selfHeal: true&lt;/code&gt; is enabled&lt;/td&gt;&lt;td&gt;Incident response needs a documented reconciliation pause; Argo CD retries self-heal after a default 5 second timeout when configured that way (&lt;a href=&quot;https://argo-cd.readthedocs.io/en/release-2.11/user-guide/auto_sync/&quot;&gt;Argo CD docs&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup configuration&lt;/td&gt;&lt;td&gt;WAL archives exist, but the physical base backup is missing, stale, or unrecoverable&lt;/td&gt;&lt;td&gt;CloudNativePG’s docs warn that a WAL archive alone is not a restore strategy (&lt;a href=&quot;https://github.com/cloudnative-pg/cloudnative-pg/blob/main/docs/src/backup.md&quot;&gt;CloudNativePG backup docs&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Kubernetes storage&lt;/td&gt;&lt;td&gt;PostgreSQL restarts cleanly, but the StorageClass has poor latency, weak snapshot behavior, or unsafe reclaim defaults&lt;/td&gt;&lt;td&gt;A database operator cannot paper over unreliable persistent volume semantics&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Secret rotation&lt;/td&gt;&lt;td&gt;External Secrets updates a Kubernetes Secret, but PostgreSQL roles and application connection pools keep using old credentials&lt;/td&gt;&lt;td&gt;Secret synchronization is not end-to-end credential rotation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Version 
drift&lt;/td&gt;&lt;td&gt;A manifest copied from an older CloudNativePG example keeps working until the operator lifecycle changes&lt;/td&gt;&lt;td&gt;Starting with CloudNativePG 1.26, backup and recovery capabilities are moving toward CNPG-I plugins, so backup templates need version review (&lt;a href=&quot;https://github.com/cloudnative-pg/cloudnative-pg/blob/main/docs/src/backup.md&quot;&gt;CloudNativePG backup docs&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
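&lt;p&gt;For the GitOps self-healing row, a minimal sketch of the Argo CD &lt;code&gt;Application&lt;/code&gt; setting involved might look like this. The application name, repository URL, and path are hypothetical; &lt;code&gt;syncPolicy.automated.selfHeal&lt;/code&gt; is the field that makes Argo CD revert manual changes.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;apiVersion&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;argoproj.io/v1alpha1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;kind&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;Application&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;metadata&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  # hypothetical application name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  namespace&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;argocd&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;spec&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  project&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;default&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  source&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # hypothetical repository and path&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    repoURL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;https://git.example.com/platform/databases.git&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    targetRevision&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;main&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    path&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;apps/linkding/overlays/prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  destination&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    server&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;https://kubernetes.default.svc&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    namespace&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  syncPolicy&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    automated&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      prune&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;false&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      selfHeal&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One way to pause reconciliation during an incident is to remove the &lt;code&gt;automated&lt;/code&gt; block (for example with &lt;code&gt;argocd app set linkding-db-prod --sync-policy none&lt;/code&gt;) and restore it once the emergency change has been committed to Git. If a parent application manages this one, the pause has to happen there too.&lt;/p&gt;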
&lt;p&gt;The right question is not “can Kubernetes run PostgreSQL?” It can. The better question is: what operational boundary are you buying, and what repeated work are you accepting for every application database?&lt;/p&gt;
&lt;h2 id=&quot;architecture-problem&quot;&gt;Architecture Problem&lt;/h2&gt;
&lt;p&gt;The shared database model and the per-application database model solve different coordination problems. In the shared model, operational consistency is achieved at the cost of coupling. In the per-application model, coupling is removed at the cost of operational repetition.&lt;/p&gt;
&lt;p&gt;The architectural problem is not technical feasibility. Kubernetes can schedule PostgreSQL pods. CloudNativePG can declare a cluster as a custom resource. Argo CD can reconcile it from Git. External Secrets Operator can synchronize credentials into namespaces. These mechanisms are documented and widely deployed.&lt;/p&gt;
&lt;p&gt;The actual architectural problem is: &lt;strong&gt;which operational concerns can be automated once at the platform layer, and which must be repeated per database — and is the platform mature enough to absorb the repetition safely?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The failure mode of the shared model is coupling: one application’s migration, bloat, or connection saturation affects every tenant of the cluster. The failure mode of the per-application model is multiplication: every new database adds backup monitoring, restore verification, credential rotation, upgrade planning, and failover testing. If these are not templated, tested, and owned by platform tooling, the per-application model exchanges shared risk for invisible risk.&lt;/p&gt;
&lt;h2 id=&quot;design-options&quot;&gt;Design Options&lt;/h2&gt;
&lt;p&gt;Three options are in common use, and each distributes risk and work differently.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Option&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Coupling risk&lt;/th&gt;&lt;th&gt;Multiplication risk&lt;/th&gt;&lt;th&gt;Recommended for&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Shared managed cluster&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;One cloud-managed PostgreSQL cluster hosts many application databases; DBA team or cloud provider owns operations&lt;/td&gt;&lt;td&gt;High — shared change windows, noisy neighbors, shared version lifecycle&lt;/td&gt;&lt;td&gt;Low — operations are centralized&lt;/td&gt;&lt;td&gt;Teams early in database operational maturity; stable workloads without strict isolation requirements&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Per-app PostgreSQL, manual management&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Each application gets a dedicated cloud-managed database instance; teams manage their own backups, creds, and versions&lt;/td&gt;&lt;td&gt;Low — isolated failure boundary&lt;/td&gt;&lt;td&gt;High — no shared templates, policy, or tooling&lt;/td&gt;&lt;td&gt;Teams that need isolation but cannot invest in a Kubernetes-native platform&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Per-app PostgreSQL via operator (CloudNativePG + GitOps)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Kubernetes operator reconciles PostgreSQL clusters from Git; external secrets, backups, monitoring, and failover are declared resources&lt;/td&gt;&lt;td&gt;Low — each application cluster is independent&lt;/td&gt;&lt;td&gt;Medium — operator and templates absorb repetition, but restore drills and upgrade testing must still run per cluster&lt;/td&gt;&lt;td&gt;Teams with mature Kubernetes platform capability and willingness to own the database lifecycle&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Option A&lt;/strong&gt; (the shared managed cluster) should remain the default until coupling failure modes are actively limiting teams. The argument for per-app databases should be made from incident reports and blocking dependencies, not from a preference for the pattern.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option B&lt;/strong&gt; (per-app instances, managed manually) increases operational isolation without a shared template layer. Teams that choose it often discover they have traded the shared-cluster problem for a distributed one: many databases with inconsistent backup policies, no shared restore testing, and no centralized visibility into credential expiry or disk saturation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option C&lt;/strong&gt; (per-app clusters via the operator) is the strongest option when the platform investment has been made. CloudNativePG provides a consistent operator lifecycle, standardized service semantics, and Prometheus integration. GitOps provides audit history, review gates, and reconciliation. External Secrets delivers credentials from the external store into each namespace. The platform team owns the templates, admission policy, and restore drill cadence. Application teams declare their database intent and rely on the platform to run the lifecycle.&lt;/p&gt;
&lt;h2 id=&quot;tradeoff-matrix&quot;&gt;Tradeoff Matrix&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dimension&lt;/th&gt;&lt;th&gt;Shared managed cluster&lt;/th&gt;&lt;th&gt;Per-app managed instances&lt;/th&gt;&lt;th&gt;Per-app operator (CloudNativePG)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Failure blast radius&lt;/td&gt;&lt;td&gt;Shared across all tenants&lt;/td&gt;&lt;td&gt;Per application&lt;/td&gt;&lt;td&gt;Per application&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Noisy neighbor risk&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;Low at the database level; pods can still contend for shared nodes and storage unless the platform isolates them&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Operational repetition&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;Medium — templates absorb most repetition&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup and restore&lt;/td&gt;&lt;td&gt;Centralized, consistent&lt;/td&gt;&lt;td&gt;Per-team, inconsistent without tooling&lt;/td&gt;&lt;td&gt;Per-cluster, consistent if platform owns templates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Credential rotation&lt;/td&gt;&lt;td&gt;Central secret store&lt;/td&gt;&lt;td&gt;Per-instance manual or scripted&lt;/td&gt;&lt;td&gt;External Secrets + per-cluster runbook&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Version upgrades&lt;/td&gt;&lt;td&gt;Scheduled at cluster level&lt;/td&gt;&lt;td&gt;Per-instance, team-owned&lt;/td&gt;&lt;td&gt;Per-cluster, GitOps-managed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitOps compatibility&lt;/td&gt;&lt;td&gt;External to database&lt;/td&gt;&lt;td&gt;External to database&lt;/td&gt;&lt;td&gt;Native — cluster is a Kubernetes custom resource&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Restore drill burden&lt;/td&gt;&lt;td&gt;One drill for shared cluster&lt;/td&gt;&lt;td&gt;One drill per instance&lt;/td&gt;&lt;td&gt;One drill per cluster tier (production, staging)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform investment&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;td&gt;High — operator lifecycle, policy, monitoring, templates&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;core-concept-per-app-postgresql-as-a-declared-failure-boundary&quot;&gt;Core Concept: Per-App PostgreSQL as a Declared Failure Boundary&lt;/h2&gt;
&lt;p&gt;A per-application PostgreSQL cluster works when the platform treats the database manifest as an operating contract, not a deployment snippet.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dev[developer commit] --&gt; Git[Git repository — apps and databases]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Git --&gt; Argo[Argo CD — reconcile desired state]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Argo --&gt; App[application namespace]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Argo --&gt; CNPGCluster[CloudNativePG Cluster resource]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    KeyVault[external secret store] --&gt; ESO[External Secrets Operator]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ESO --&gt; K8sSecret[Kubernetes Secret]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K8sSecret --&gt; App&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K8sSecret --&gt; CNPGCluster&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    CNPG[CloudNativePG operator] --&gt; Primary[PostgreSQL primary]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    CNPG --&gt; ReplicaA[PostgreSQL replica]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    CNPG --&gt; ReplicaB[PostgreSQL replica]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App --&gt; RWService[cluster rw service]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    RWService --&gt; Primary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Primary --&gt; WAL[WAL archive in object storage]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ReplicaA --&gt; WAL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ReplicaB --&gt; WAL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Backup[scheduled base backup] --&gt; ObjectStore[object storage recovery boundary]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;CloudNativePG creates service endpoints for each cluster: &lt;code&gt;rw&lt;/code&gt; points to the current primary, &lt;code&gt;ro&lt;/code&gt; points to replicas when available, and &lt;code&gt;r&lt;/code&gt; can point to any instance. The &lt;code&gt;rw&lt;/code&gt; service is essential and cannot be disabled because CloudNativePG relies on it for PostgreSQL replication behavior (&lt;a href=&quot;https://cloudnative-pg.io/docs/1.26/service_management/&quot;&gt;CloudNativePG service docs&lt;/a&gt;). Application write traffic should use the generated &lt;code&gt;*-rw&lt;/code&gt; service unless there is a deliberately tested routing layer in front of it.&lt;/p&gt;
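&lt;p&gt;As an illustration, a container spec fragment might point the application at the generated write service like this. The cluster name &lt;code&gt;linkding-db-prod&lt;/code&gt; and namespace &lt;code&gt;linkding-prod&lt;/code&gt; follow the examples in this post; the environment variable names are placeholders for whatever the application actually reads.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;env&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  # placeholder variable names&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;DB_HOST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # generated rw service: cluster name plus -rw, then the namespace&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    value&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-prod-rw.linkding-prod.svc&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;DB_PORT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    value&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;5432&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;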
&lt;p&gt;A production-grade manifest should look less like a tutorial and more like a contract:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;apiVersion&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;postgresql.cnpg.io/v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;kind&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;Cluster&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;metadata&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  labels&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    app.kubernetes.io/name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    platform.example.com/owner&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;bookmarks&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    platform.example.com/tier&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;production&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;spec&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  instances&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;3&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  imageName&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;ghcr.io/cloudnative-pg/postgresql:16.4&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  storage&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    size&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;100Gi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    storageClass&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;premium-rwo&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  resources&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    requests&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      cpu&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;500m&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      memory&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;2Gi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    limits&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      memory&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;4Gi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  monitoring&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    enablePodMonitor&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  bootstrap&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    initdb&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      database&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      owner&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      secret&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-db-owner&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  backup&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    barmanObjectStore&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      destinationPath&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;https://example.blob.core.windows.net/postgres/linkding&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      azureCredentials&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        storageAccount&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-backup-creds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          key&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;storage-account&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        storageSasToken&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;linkding-backup-creds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          key&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;sas-token&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      wal&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        compression&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;gzip&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      data&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        compression&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;gzip&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    retentionPolicy&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;14d&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The contract is not complete until it has tests.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Split day-0 infrastructure from day-2 database intent.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Install CloudNativePG, External Secrets Operator, Argo CD, monitoring CRDs, admission policy, namespaces, and storage classes through Terraform or another cluster-admin workflow. Application repositories should declare database intent, not own operator installation.&lt;/p&gt;
&lt;p&gt;Verification:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; auth&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; can-i&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; create&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; clusters.postgresql.cnpg.io&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; auth&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; can-i&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; update&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; deployment&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cloudnative-pg&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cnpg-system&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; auth&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; can-i&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; patch&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; storageclass&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; premium-rwo&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The expected shape is narrow: application delivery can create its own &lt;code&gt;Cluster&lt;/code&gt; resource in its namespace, but cannot modify the operator deployment, cluster-wide secret stores, or storage classes.&lt;/p&gt;
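&lt;p&gt;Run from an administrator session, the commands above report the administrator’s own permissions. A minimal sketch of the same checks under impersonation follows, assuming application delivery runs as a service account named &lt;code&gt;linkding-deployer&lt;/code&gt; in &lt;code&gt;linkding-prod&lt;/code&gt;. That name is hypothetical; substitute the identity your pipeline or Argo CD project actually uses.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Hypothetical delivery identity; replace with the account your pipeline uses.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DELIVERY_SA=&quot;system:serviceaccount:linkding-prod:linkding-deployer&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Expected answer: yes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl auth can-i create clusters.postgresql.cnpg.io -n linkding-prod --as=&quot;$DELIVERY_SA&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Expected answer for both: no&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl auth can-i update deployment cloudnative-pg -n cnpg-system --as=&quot;$DELIVERY_SA&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl auth can-i patch storageclass premium-rwo --as=&quot;$DELIVERY_SA&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The caller needs impersonation rights for &lt;code&gt;--as&lt;/code&gt; to work. Because &lt;code&gt;kubectl auth can-i&lt;/code&gt; exits non-zero when the answer is &lt;code&gt;no&lt;/code&gt;, a CI job can assert on exit codes instead of parsing output.&lt;/p&gt;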
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Make policy enforce the minimum contract.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For production clusters, reject manifests that omit ownership labels, resource requests, monitoring, backup configuration, explicit storage class, or a three-instance topology.&lt;/p&gt;
&lt;p&gt;A CI or admission rule should fail a manifest like this:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;spec&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  instances&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  storage&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    size&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;5Gi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The exact policy engine is less important than the invariant. Kyverno, OPA Gatekeeper, Conftest, or a custom CI check can all work. The point is to stop “temporary” database YAML from becoming production state.&lt;/p&gt;
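&lt;p&gt;As one illustration of that invariant, the sketch below expresses part of it as a Kyverno &lt;code&gt;ClusterPolicy&lt;/code&gt;. The policy name, the &lt;code&gt;owner&lt;/code&gt; label key, and the decision to scope it to &lt;code&gt;linkding-prod&lt;/code&gt; are assumptions made for the example. It covers only instance count, storage class, and an owner label; the remaining checks would follow the same pattern. Field names follow the &lt;code&gt;kyverno.io/v1&lt;/code&gt; schema as I understand it, so verify them against the Kyverno version you run.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl apply -f - &amp;#x3C;&amp;#x3C;&apos;YAML&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;apiVersion: kyverno.io/v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;kind: ClusterPolicy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;metadata:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  name: cnpg-production-contract  # hypothetical name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;spec:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  validationFailureAction: Enforce&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  background: true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  rules:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    - name: require-ha-and-explicit-storage&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;      match:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;        any:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          - resources:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;              kinds:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;                - postgresql.cnpg.io/v1/Cluster&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;              namespaces:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;                - linkding-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;      validate:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;        message: &quot;Production Cluster needs instances &gt;= 3, an explicit storageClass, and an owner label.&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;        pattern:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          metadata:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;            labels:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;              owner: &quot;?*&quot;  # hypothetical label key; any non-empty value&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          spec:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;            instances: &quot;&gt;=3&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;            storage:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;              storageClass: &quot;?*&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;YAML&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With a policy of this shape in place, the single-instance manifest shown above would be rejected at admission because it fails both the instance-count and storage-class patterns.&lt;/p&gt;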
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Route applications through the CloudNativePG read-write service.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Do not hardcode pod names. Do not point applications at ordinal &lt;code&gt;0&lt;/code&gt;. Do not teach application teams that the first pod is the primary. In a failover, the application needs the service abstraction to follow the writable instance.&lt;/p&gt;
&lt;p&gt;Verification:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cluster&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-db-prod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -o&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jsonpath=&apos;{.status.currentPrimary}{&quot;\n&quot;}&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; delete&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod get cluster linkding-db-prod &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -o&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jsonpath=&apos;{.status.currentPrimary}&apos;)&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; wait&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cluster/linkding-db-prod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --for=condition=Ready&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --timeout=300s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cluster&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-db-prod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -o&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; jsonpath=&apos;{.status.currentPrimary}{&quot;\n&quot;}&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then verify the application can still write through the same hostname:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;create&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; table&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; if&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; not&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; exists&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; platform_failover_probe (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;bigserial&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; primary key&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  observed_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;timestamptz&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; not null&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; default&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;insert into&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; platform_failover_probe &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;default&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; values&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;select&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; platform_failover_probe;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A changed primary is not enough. The application write must succeed without changing connection strings.&lt;/p&gt;
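&lt;p&gt;To observe the write path during the failover rather than only after it, a probe loop can run from a pod that shares the application’s connection settings. This is a sketch under assumptions: it expects &lt;code&gt;psql&lt;/code&gt; to be available, credentials to arrive through the standard &lt;code&gt;PGUSER&lt;/code&gt;, &lt;code&gt;PGPASSWORD&lt;/code&gt;, and &lt;code&gt;PGDATABASE&lt;/code&gt; variables, the probe table to exist already, and the read-write service to be reachable as &lt;code&gt;linkding-db-prod-rw&lt;/code&gt;, following the &lt;code&gt;*-rw&lt;/code&gt; naming.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Assumes PGUSER, PGPASSWORD, and PGDATABASE are already set from the application Secret.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;export PGHOST=linkding-db-prod-rw.linkding-prod.svc&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;export PGCONNECT_TIMEOUT=3&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# One insert per second for about two minutes; start this, then delete the primary pod.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;for i in $(seq 1 120); do&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if psql -X -q -v ON_ERROR_STOP=1 -c &apos;insert into platform_failover_probe default values&apos;; then&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    echo &quot;$(date -u +%H:%M:%S) write ok&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  else&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    echo &quot;$(date -u +%H:%M:%S) write FAILED&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  fi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  sleep 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;done&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The gap between the first &lt;code&gt;write FAILED&lt;/code&gt; and the next &lt;code&gt;write ok&lt;/code&gt; is the write outage as this probe sees it. Each &lt;code&gt;psql&lt;/code&gt; call opens a fresh connection, so the loop does not exercise connection-pool behavior; that still needs a test through the real application.&lt;/p&gt;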
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Prove recovery before calling the platform production-ready.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;CloudNativePG can archive WAL to object storage and recover from physical backups. For Barman object-store backups, current CloudNativePG docs say the operator sets &lt;code&gt;archive_timeout&lt;/code&gt; to &lt;code&gt;5min&lt;/code&gt; by default, giving a deterministic time-based RPO boundary for low-write workloads (&lt;a href=&quot;https://cloudnative-pg.io/docs/1.29/appendixes/backup_barmanobjectstore/&quot;&gt;CloudNativePG object-store backup docs&lt;/a&gt;). That boundary is meaningful only after restore has been tested.&lt;/p&gt;
&lt;p&gt;Verification:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; apply&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -f&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; -&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &amp;#x3C;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;YAML&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;apiVersion: postgresql.cnpg.io/v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;kind: Backup&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;metadata:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  name: linkding-manual-restore-drill&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;spec:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  cluster:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    name: linkding-db-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;YAML&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; backup&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-manual-restore-drill&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A restore drill should create a new namespace, restore from object storage, run application migrations against the restored database, and record observed RTO and RPO. The output should be boring enough to put in a runbook:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Drill field&lt;/th&gt;&lt;th&gt;Recorded value&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Backup identifier&lt;/td&gt;&lt;td&gt;Exact backup object or CloudNativePG backup name&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Restore namespace&lt;/td&gt;&lt;td&gt;Isolated namespace name&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Restore start time&lt;/td&gt;&lt;td&gt;Timestamp&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Application migration result&lt;/td&gt;&lt;td&gt;Pass or fail&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Observed RTO&lt;/td&gt;&lt;td&gt;Measured duration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Observed RPO&lt;/td&gt;&lt;td&gt;Last committed test row recovered&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Operator version&lt;/td&gt;&lt;td&gt;CloudNativePG version&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL image&lt;/td&gt;&lt;td&gt;Exact image tag&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;StorageClass&lt;/td&gt;&lt;td&gt;Exact class&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
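&lt;p&gt;The restore step of such a drill could look like the following sketch, which bootstraps a throwaway cluster from the production object store using the in-tree &lt;code&gt;barmanObjectStore&lt;/code&gt; style. Several values are assumptions that do not come from this post: the &lt;code&gt;linkding-restore-drill&lt;/code&gt; namespace, the restored cluster name, the &lt;code&gt;5Gi&lt;/code&gt; size, and above all the &lt;code&gt;destinationPath&lt;/code&gt; placeholder, which must be replaced with the path the production cluster archives to. The &lt;code&gt;linkding-backup-creds&lt;/code&gt; Secret also has to exist in the drill namespace.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Hypothetical isolated namespace for the drill.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl create namespace linkding-restore-drill&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl -n linkding-restore-drill apply -f - &amp;#x3C;&amp;#x3C;&apos;YAML&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;apiVersion: postgresql.cnpg.io/v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;kind: Cluster&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;metadata:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  name: linkding-db-restore&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;spec:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  instances: 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  storage:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    storageClass: premium-rwo&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    size: 5Gi  # placeholder; size it for the real data set&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  bootstrap:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    recovery:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;      source: linkding-db-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  externalClusters:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    - name: linkding-db-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;      barmanObjectStore:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;        destinationPath: REPLACE_WITH_PRODUCTION_DESTINATION_PATH&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;        azureCredentials:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          storageAccount:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;            name: linkding-backup-creds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;            key: storage-account&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          storageSasToken:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;            name: linkding-backup-creds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;            key: sas-token&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;YAML&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Block until the restored cluster reports Ready; the timeout is an arbitrary choice.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl -n linkding-restore-drill wait cluster/linkding-db-restore --for=condition=Ready --timeout=1800s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The elapsed time from &lt;code&gt;apply&lt;/code&gt; to &lt;code&gt;Ready&lt;/code&gt;, plus the time to run application migrations, is the observed RTO to record in the table. On operator versions that have moved to plugin-based backups, the external cluster definition would differ, so treat this as a shape to adapt rather than a manifest to copy.&lt;/p&gt;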
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;Make GitOps incident-aware.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Automated pruning and self-healing are useful until an incident commander needs to patch a live object. Argo CD automated sync does not prune by default; pruning and self-healing are explicit settings (&lt;a href=&quot;https://argo-cd.readthedocs.io/en/release-2.11/user-guide/auto_sync/&quot;&gt;Argo CD docs&lt;/a&gt;). Database resources need operational rules around those settings.&lt;/p&gt;
&lt;p&gt;Verification:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;argocd&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; set&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-db-prod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --sync-policy&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; none&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-prod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; annotate&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cluster&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-db-prod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  incident.example.com/reconciliation-paused=&quot;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ)&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Apply the emergency change, then commit the final desired state back to Git.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;argocd&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; set&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-db-prod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --sync-policy&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; automated&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --self-heal&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --auto-prune&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;argocd&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; sync&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; linkding-db-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The runbook should say who can pause reconciliation, how the change is recorded, and how drift is reconciled afterward.&lt;/p&gt;
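&lt;p&gt;For the reconciliation step, it helps to see exactly what Argo CD would change before and after automation is switched back on. A short sketch using the same application name follows; &lt;code&gt;argocd app diff&lt;/code&gt; and &lt;code&gt;argocd app wait&lt;/code&gt; are standard CLI subcommands, while the timeout value is an arbitrary choice.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# While sync is still paused: show how live state differs from Git.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# A non-zero exit indicates a difference or an error.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;argocd app diff linkding-db-prod&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After automated sync is re-enabled: block until the app is synced and healthy.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;argocd app wait linkding-db-prod --sync --health --timeout 300&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An empty diff before resuming is the evidence that the emergency change really was committed back to Git, which is what keeps self-heal from reverting it.&lt;/p&gt;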
&lt;ol start=&quot;6&quot;&gt;
&lt;li&gt;Monitor the database fleet, not just one cluster.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;CloudNativePG provides predefined metrics and Prometheus integration. A &lt;code&gt;PodMonitor&lt;/code&gt; for a cluster can be created by setting &lt;code&gt;.spec.monitoring.enablePodMonitor: true&lt;/code&gt;, and CloudNativePG publishes Grafana dashboard material for the operator and clusters (&lt;a href=&quot;https://cloudnative-pg.io/documentation/1.20/monitoring/&quot;&gt;CloudNativePG monitoring docs&lt;/a&gt;, &lt;a href=&quot;https://grafana.com/grafana/dashboards/20417-cloudnativepg/&quot;&gt;Grafana dashboard&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Per-application databases multiply alert surfaces. That is acceptable only if ownership is encoded.&lt;/p&gt;
&lt;p&gt;Minimum alert classes:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Alert class&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Replication lag&lt;/td&gt;&lt;td&gt;Failover safety depends on replicas being current enough for the workload&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failed WAL archiving&lt;/td&gt;&lt;td&gt;PITR depends on the archive, not only the running pods&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup age&lt;/td&gt;&lt;td&gt;A configured backup policy can still fail silently&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Disk saturation&lt;/td&gt;&lt;td&gt;A filling data or WAL volume degrades PostgreSQL gradually, then stops writes entirely once it is full&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failover events&lt;/td&gt;&lt;td&gt;The application may need connection-pool and retry validation after promotion&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Certificate or secret expiry&lt;/td&gt;&lt;td&gt;A synchronized Secret does not prove clients are using it correctly&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;External Secrets sync errors&lt;/td&gt;&lt;td&gt;The Kubernetes Secret can drift from the external source&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Object-store errors&lt;/td&gt;&lt;td&gt;Restore readiness depends on credentials, network path, and storage availability&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
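&lt;p&gt;Alongside Prometheus alerts, a quick fleet-wide snapshot can come straight from the &lt;code&gt;Cluster&lt;/code&gt; status. The sketch below assumes read access to &lt;code&gt;clusters.postgresql.cnpg.io&lt;/code&gt; in all namespaces and relies on status fields (&lt;code&gt;readyInstances&lt;/code&gt;, &lt;code&gt;currentPrimary&lt;/code&gt;, and a &lt;code&gt;ContinuousArchiving&lt;/code&gt; condition) as I understand CloudNativePG to publish them; confirm the names against the operator version you run.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Columns: namespace, cluster, ready/desired instances, current primary, WAL archiving status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;kubectl get clusters.postgresql.cnpg.io -A \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  -o jsonpath=&apos;{range .items[*]}{.metadata.namespace}{&quot;\t&quot;}{.metadata.name}{&quot;\t&quot;}{.status.readyInstances}{&quot;/&quot;}{.spec.instances}{&quot;\t&quot;}{.status.currentPrimary}{&quot;\t&quot;}{.status.conditions[?(@.type==&quot;ContinuousArchiving&quot;)].status}{&quot;\n&quot;}{end}&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A row whose last column is not &lt;code&gt;True&lt;/code&gt;, or whose ready count is below the desired count, is a cluster to investigate even if no alert has fired. This is a spot check for humans, not a replacement for the alert classes in the table.&lt;/p&gt;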
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is not “Kubernetes makes databases easy.” The documented pattern is “Kubernetes gives the operator a control plane, and the operator still depends on PostgreSQL, storage, object storage, secrets, and reconciliation semantics behaving correctly.”&lt;/p&gt;
&lt;p&gt;The strongest public warning is GitLab’s January 31, 2017 database outage. It was not a Kubernetes incident, and it should not be misrepresented as one. Its relevance is narrower and more useful: GitLab’s public postmortem shows how PostgreSQL HA, replication, snapshots, dumps, and restore procedures can all look plausible until the one day they are needed together.&lt;/p&gt;
&lt;p&gt;GitLab reported accidental removal of data from the primary database, a secondary that could not take over because its replication had fallen behind and it was being resynchronized, missing &lt;code&gt;pg_dump&lt;/code&gt; backups caused by a PostgreSQL client version mismatch, backup failure notifications that were not reaching operators, and a restore path bottlenecked by slow disk transfer from a staging snapshot (&lt;a href=&quot;https://about.gitlab.com/blog/postmortem-of-database-outage-of-january-31/&quot;&gt;GitLab postmortem&lt;/a&gt;). The public incident summary also noted that a six-hour-old backup was used and database changes in that window were lost (&lt;a href=&quot;https://about.gitlab.com/blog/gitlab-dot-com-database-incident/&quot;&gt;GitLab incident update&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The lesson for CloudNativePG is not that Kubernetes would have prevented the incident. It would not automatically do that. The lesson is that database resilience is a chain:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Write[application write] --&gt; WAL[WAL generated]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WAL --&gt; Archive[WAL archived]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Data[database files] --&gt; BaseBackup[physical base backup]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Archive --&gt; Restore[restore procedure]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    BaseBackup --&gt; Restore&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Restore --&gt; AppCheck[application migration and read write check]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    AppCheck --&gt; Evidence[recorded RTO and RPO]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If any link is assumed rather than tested, the platform is carrying hidden risk.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Evidence type&lt;/th&gt;&lt;th&gt;Public mechanism&lt;/th&gt;&lt;th&gt;Production implication&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;GitLab public postmortem&lt;/td&gt;&lt;td&gt;Backup jobs failed because the wrong PostgreSQL client version was used, and failure notifications were not reaching operators (&lt;a href=&quot;https://about.gitlab.com/blog/postmortem-of-database-outage-of-january-31/&quot;&gt;GitLab postmortem&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Backup configuration must be verified by restore tests and alert delivery, not only scheduled jobs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitLab restore behavior&lt;/td&gt;&lt;td&gt;Restore was constrained by the available snapshot and storage transfer path (&lt;a href=&quot;https://about.gitlab.com/blog/postmortem-of-database-outage-of-january-31/&quot;&gt;GitLab postmortem&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;RTO depends on data size, object-store throughput, volume performance, and the restore procedure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CloudNativePG service behavior&lt;/td&gt;&lt;td&gt;CloudNativePG documents &lt;code&gt;rw&lt;/code&gt;, &lt;code&gt;ro&lt;/code&gt;, and &lt;code&gt;r&lt;/code&gt; services, with &lt;code&gt;rw&lt;/code&gt; pointing to the primary and being non-disableable (&lt;a href=&quot;https://cloudnative-pg.io/docs/1.26/service_management/&quot;&gt;service docs&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Application failover depends on using the service, not pod identity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CloudNativePG backup behavior&lt;/td&gt;&lt;td&gt;CloudNativePG documents WAL archiving, physical base backups, PITR, and warns that WAL alone cannot restore a cluster (&lt;a href=&quot;https://github.com/cloudnative-pg/cloudnative-pg/blob/main/docs/src/backup.md&quot;&gt;backup docs&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Backup success is not restore readiness&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CloudNativePG object-store 
behavior&lt;/td&gt;&lt;td&gt;CloudNativePG documents a default &lt;code&gt;archive_timeout&lt;/code&gt; of &lt;code&gt;5min&lt;/code&gt; for Barman object-store WAL archiving (&lt;a href=&quot;https://cloudnative-pg.io/docs/1.29/appendixes/backup_barmanobjectstore/&quot;&gt;object-store backup docs&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Low-write workloads still need explicit RPO measurement and restore validation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Argo CD reconciliation&lt;/td&gt;&lt;td&gt;Argo CD documents automated prune, self-heal, sync semantics, and rollback limits under automated sync (&lt;a href=&quot;https://argo-cd.readthedocs.io/en/release-2.11/user-guide/auto_sync/&quot;&gt;auto-sync docs&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Database emergency operations need a GitOps pause and resume procedure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;External Secrets refresh&lt;/td&gt;&lt;td&gt;External Secrets Operator documents &lt;code&gt;CreatedOnce&lt;/code&gt;, &lt;code&gt;Periodic&lt;/code&gt;, and &lt;code&gt;OnChange&lt;/code&gt; refresh policies; &lt;code&gt;Periodic&lt;/code&gt; updates the Kubernetes Secret on &lt;code&gt;refreshInterval&lt;/code&gt; (&lt;a href=&quot;https://external-secrets.io/latest/api/externalsecret/&quot;&gt;ExternalSecret API docs&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Secret rotation must include application reload and PostgreSQL role behavior&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Kubernetes disruption behavior&lt;/td&gt;&lt;td&gt;Kubernetes distinguishes voluntary and involuntary disruptions and notes that not all voluntary disruptions are constrained by PodDisruptionBudgets (&lt;a href=&quot;https://kubernetes.io/docs/concepts/workloads/pods/disruptions/&quot;&gt;Kubernetes docs&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Node drain, pod deletion, node loss, and storage failure are separate tests&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;I have not run this exact Linkding-style reference deployment at production scale personally. The documented mechanics are still enough to draw the boundary: a three-instance PostgreSQL cluster can fail over correctly at the Kubernetes object level while the user-visible service still fails because the application pinned stale connections, the volume layer stalled, External Secrets rotated a value no process reloaded, WAL archiving failed unnoticed, or Argo CD reverted an emergency patch.&lt;/p&gt;
&lt;p&gt;That is why the proof must be operational, not visual. A green Argo CD dashboard proves convergence. It does not prove recoverability. A promoted replica proves one HA path. It does not prove connection-pool behavior, restore speed, backup freshness, or data-loss bounds.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Correlated downtime across replicas&lt;/td&gt;&lt;td&gt;Kubernetes schedules PostgreSQL instances onto nodes sharing the same failure domain&lt;/td&gt;&lt;td&gt;Require topology spread constraints, node affinity, and anti-affinity across zones or node pools&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;False confidence from HA&lt;/td&gt;&lt;td&gt;Primary pod deletion succeeds, but storage-zone failure or object-store outage was never tested&lt;/td&gt;&lt;td&gt;Run separate drills for pod deletion, node drain, node loss, storage latency, and restore from object storage&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup drift across CloudNativePG versions&lt;/td&gt;&lt;td&gt;Templates depend on older &lt;code&gt;barmanObjectStore&lt;/code&gt; examples while the operator lifecycle moves toward CNPG-I plugins from 1.26 onward&lt;/td&gt;&lt;td&gt;Pin operator versions, maintain upgrade notes, and test backup plus restore for every operator upgrade&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitOps conflicts with emergency repair&lt;/td&gt;&lt;td&gt;&lt;code&gt;selfHeal: true&lt;/code&gt; reapplies Git state after manual database-related Kubernetes changes&lt;/td&gt;&lt;td&gt;Document Argo CD suspension, require incident annotations, and reconcile the final state back into Git&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Secret rotation only updates Kubernetes&lt;/td&gt;&lt;td&gt;External Secrets updates the Secret, but PostgreSQL connections remain open with old credentials&lt;/td&gt;&lt;td&gt;Use explicit rotation runbooks: create new role secret, restart or reload clients, verify new logins, then revoke the old role&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Writes hit the wrong endpoint&lt;/td&gt;&lt;td&gt;Application sends writes to &lt;code&gt;ro&lt;/code&gt;, or to &lt;code&gt;r&lt;/code&gt;, where they succeed only when the connection happens to land on the primary&lt;/td&gt;&lt;td&gt;Standardize environment variables and policy checks so write paths use only &lt;code&gt;*-rw&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cost expands quietly&lt;/td&gt;&lt;td&gt;Every service gets PostgreSQL pods, persistent volumes, backups, metrics, and alerts&lt;/td&gt;&lt;td&gt;Define tiers: production HA, staging reduced HA, ephemeral development, and explicit cost labels&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Noisy fleet operations&lt;/td&gt;&lt;td&gt;One-off manifests diverge across teams&lt;/td&gt;&lt;td&gt;Generate manifests from reviewed templates and enforce policy with Kyverno, OPA Gatekeeper, or CI checks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Restore exceeds incident budget&lt;/td&gt;&lt;td&gt;PITR exists in theory, but base backup size, object-store throughput, and migration replay time were never measured&lt;/td&gt;&lt;td&gt;Record RTO and RPO during scheduled restore drills, then publish them with the service SLO&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Kubernetes maintenance causes failover churn&lt;/td&gt;&lt;td&gt;Node drains evict database pods without a maintenance strategy&lt;/td&gt;&lt;td&gt;Use PodDisruptionBudgets, maintenance windows, topology constraints, and CloudNativePG-aware drain procedures&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup alerts are too shallow&lt;/td&gt;&lt;td&gt;The backup job exits successfully, but restore would fail because credentials, object paths, or versions drifted&lt;/td&gt;&lt;td&gt;Alert on backup age and WAL archive failures, then run scheduled restore verification into a clean namespace&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Application retry behavior is untested&lt;/td&gt;&lt;td&gt;PostgreSQL primary changes while clients hold old sessions&lt;/td&gt;&lt;td&gt;Test failover through the real application path, including connection pool settings and transaction retry behavior&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Per-application PostgreSQL reduces blast radius, but multiplies operational surfaces across storage, backup, monitoring, secrets, upgrades, GitOps, and cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Build a database platform contract around CloudNativePG manifests, admission policy, restore drills, and incident-aware reconciliation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A valid proof creates a cluster from Git, writes test data, kills the primary, confirms application writes through &lt;code&gt;*-rw&lt;/code&gt;, rotates credentials, restores from object storage into a clean namespace, and records observed RTO and RPO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, add CI or admission checks for &lt;code&gt;instances &gt;= 3&lt;/code&gt;, backup configuration, monitoring enabled, resource requests, owner labels, explicit storage class, and no plaintext Secret manifests.&lt;/li&gt;
&lt;/ul&gt;
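&lt;p&gt;The checks in the Action item can be prototyped as a plain function before they are ported to Kyverno, OPA Gatekeeper, or a CI step. The sketch below is illustrative only and assumes the manifest has already been parsed into a Python dict. The field paths (&lt;code&gt;spec.instances&lt;/code&gt;, &lt;code&gt;spec.backup&lt;/code&gt;, &lt;code&gt;spec.monitoring.enablePodMonitor&lt;/code&gt;, &lt;code&gt;spec.storage.storageClass&lt;/code&gt;) are assumed to match the CloudNativePG Cluster resource and should be verified against the operator version in use. The &lt;code&gt;owner&lt;/code&gt; label key and the function name are hypothetical.&lt;/p&gt;

```python
def check_cluster_manifest(manifest: dict) -> list[str]:
    """Return a list of policy violations for one parsed manifest."""
    spec = manifest.get("spec", {})
    labels = manifest.get("metadata", {}).get("labels", {})

    # Simplistic stand-in for "no plaintext Secret manifests in Git".
    if manifest.get("kind") == "Secret":
        return ["plaintext Secret manifests are not allowed in Git"]

    violations = []
    if not spec.get("instances", 0) >= 3:
        violations.append("spec.instances must be >= 3")
    if not spec.get("backup"):
        violations.append("backup configuration is missing")
    if not spec.get("monitoring", {}).get("enablePodMonitor"):
        violations.append("monitoring is not enabled")
    if not spec.get("resources", {}).get("requests"):
        violations.append("resource requests are missing")
    if not spec.get("storage", {}).get("storageClass"):
        violations.append("storage class is not explicit")
    if "owner" not in labels:
        violations.append("owner label is missing")
    return violations


# Hypothetical manifests: one compliant, one close to empty.
good = {
    "kind": "Cluster",
    "metadata": {"labels": {"owner": "payments-team"}},
    "spec": {
        "instances": 3,
        "backup": {"barmanObjectStore": {"destinationPath": "s3://example-bucket/pg"}},
        "monitoring": {"enablePodMonitor": True},
        "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
        "storage": {"storageClass": "fast-ssd", "size": "50Gi"},
    },
}
bad = {"kind": "Cluster", "metadata": {}, "spec": {"instances": 1}}

print(check_cluster_manifest(good))   # no violations
print(check_cluster_manifest(bad))    # one message per failed check
```

&lt;p&gt;Returning every violation at once, rather than failing on the first, keeps the CI feedback loop short for teams fixing a new manifest.&lt;/p&gt;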
&lt;p&gt;A per-application database is not a smaller managed service. It is a sharper failure boundary. Use it when the platform is prepared to test the edge.&lt;/p&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>Azure Database for PostgreSQL: Flexible Server vs Hyperscale (Citus) Architecture Decision</title><link>https://rajivonai.com/blog/2026-05-25-azure-postgresql-flexible-vs-citus-architecture-decision/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-05-25-azure-postgresql-flexible-vs-citus-architecture-decision/</guid><description>When to choose Azure Flexible Server vs Citus for PostgreSQL on Azure — failover behavior, connection pooling, and the workload shapes where each architecture wins and breaks.</description><pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The default Azure PostgreSQL offering handles most OLTP workloads correctly, but teams that hit connection limits, multi-tenant scale, or distributed query requirements discover they chose the wrong architecture after the schema is in production.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Azure offers two managed PostgreSQL architectures: Flexible Server (the current default and successor to Single Server) and Hyperscale, which runs the Citus extension for distributed PostgreSQL. Both are managed services on Azure with similar operational interfaces. The architectural difference is not a sizing question — it is a data distribution question. Most teams never need Citus. The teams that do need it typically discover the need late, after their schema is built around single-node PostgreSQL assumptions.&lt;/p&gt;
&lt;p&gt;Azure Database for PostgreSQL Single Server reached end of life in March 2025, making Flexible Server the standard entry point for new deployments and migrations.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Azure Flexible Server is a single-primary managed PostgreSQL instance with read replicas, high availability via standby promotion, and built-in PgBouncer connection pooling. It scales vertically and handles standard PostgreSQL workloads. The failure mode is predictable: beyond a certain write throughput threshold and connection count, a single PostgreSQL primary saturates regardless of how large the VM SKU is.&lt;/p&gt;
&lt;p&gt;Citus distributes table rows across worker nodes using a shard key. This enables horizontal write scaling and parallel query execution across shards — but it requires designing the schema and query patterns around the distribution key from the start. Application queries that do not include the distribution key cannot be routed to a single shard and must fan out across all workers, which is expensive.&lt;/p&gt;
&lt;p&gt;The core question: does the workload require horizontal scaling of writes and data volume, or does it require operational simplicity with vertical scaling?&lt;/p&gt;
&lt;h2 id=&quot;flexible-server-vs-hyperscale-citus&quot;&gt;Flexible Server vs Hyperscale (Citus)&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[PostgreSQL workload on Azure] --&gt; B{Multi-tenant or single-tenant?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|single tenant — standard OLTP| C[Flexible Server]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|multi-tenant at scale or distributed analytics| D{Can schema be distributed on tenant ID?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes — queries filter by tenant| E[Citus — sharded by tenant]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no — cross-tenant joins required| F[Flexible Server — accept vertical limits]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; G[Scale vertically — HA standby — PgBouncer]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; H[Coordinator node — worker shards — distributed queries]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Azure Flexible Server&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Flexible Server provides a single primary PostgreSQL instance with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zone-redundant high availability (primary + synchronous standby in a secondary AZ)&lt;/li&gt;
&lt;li&gt;Built-in PgBouncer for connection pooling (configurable pool sizes per database)&lt;/li&gt;
&lt;li&gt;Read replicas for read offload (asynchronous replication)&lt;/li&gt;
&lt;li&gt;Automatic minor version patching and maintenance windows&lt;/li&gt;
&lt;li&gt;Private endpoint and VNet integration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The HA model uses a standby in a secondary availability zone with synchronous replication. Azure documents typical failover in 60–120 seconds with automatic DNS cutover (&lt;a href=&quot;https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-high-availability&quot;&gt;Flexible Server HA docs&lt;/a&gt;). The built-in PgBouncer connection pooler is enabled separately from the HA feature and must be explicitly configured — applications that connect directly to the PostgreSQL port bypass PgBouncer.&lt;/p&gt;
&lt;p&gt;Connection pooling is the most commonly misconfigured element. Azure Flexible Server supports a maximum of 5,000 backend connections for the largest SKU (D64s v3), but each PostgreSQL backend process consumes memory. The practical limit before performance degrades is substantially lower. PgBouncer on Flexible Server runs in transaction-pooling mode by default, which releases the backend connection between transactions — enabling more clients than physical backends.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hyperscale (Citus)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Citus distributes a PostgreSQL database across a coordinator node and multiple worker nodes. The coordinator routes queries to shards based on the distribution column. A table distributed on &lt;code&gt;tenant_id&lt;/code&gt; routes queries that filter on &lt;code&gt;tenant_id&lt;/code&gt; to the single worker holding that tenant’s shards. Queries without a &lt;code&gt;tenant_id&lt;/code&gt; filter fan out to all workers.&lt;/p&gt;
&lt;p&gt;The operational consequence: Citus is most efficient for multi-tenant SaaS workloads where each tenant’s data is isolated and queries are tenant-scoped. It is less effective for workloads with heavy cross-tenant analytics or complex joins between distributed and reference tables.&lt;/p&gt;
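&lt;p&gt;The routing rule described above can be sketched in a few lines. This is a deliberately simplified model, not how Citus is implemented: real Citus maps hash ranges to shards and shards to workers, while this sketch hashes a tenant straight to a worker. The worker names and the &lt;code&gt;route&lt;/code&gt; and &lt;code&gt;shard_for&lt;/code&gt; helpers are hypothetical.&lt;/p&gt;

```python
import zlib

WORKERS = ["worker-0", "worker-1", "worker-2", "worker-3"]


def shard_for(tenant_id: int) -> str:
    """Map a tenant to one worker with a stable hash (illustrative only)."""
    return WORKERS[zlib.crc32(str(tenant_id).encode()) % len(WORKERS)]


def route(filters: dict) -> list[str]:
    """Return the workers a query must touch, given its WHERE-clause filters."""
    if "tenant_id" in filters:
        return [shard_for(filters["tenant_id"])]   # tenant-scoped: one worker
    return list(WORKERS)                            # no tenant filter: fan out


print(route({"tenant_id": 42, "status": "open"}))   # a single worker
print(route({"status": "open"}))                    # every worker
```

&lt;p&gt;The same model also shows the hotspot risk: every row for one tenant hashes to the same worker, so a very large tenant concentrates load regardless of how many workers exist.&lt;/p&gt;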
&lt;p&gt;Azure-managed Citus (now branded as Azure Cosmos DB for PostgreSQL) provides managed coordinator and worker nodes, automatic rebalancing, and built-in high availability per node.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PgBouncer’s feature matrix, which also applies to the PgBouncer instance built into Flexible Server, lists session-level features such as SQL-level &lt;code&gt;PREPARE&lt;/code&gt; and &lt;code&gt;DEALLOCATE&lt;/code&gt;, &lt;code&gt;LISTEN&lt;/code&gt;, &lt;code&gt;LOAD&lt;/code&gt;, and session-level advisory locks as incompatible with transaction-pooling mode (&lt;a href=&quot;https://www.pgbouncer.org/features.html&quot;&gt;PgBouncer compatibility&lt;/a&gt;). Applications that rely on prepared statements surviving across transactions can fail with errors once PgBouncer runs in transaction mode. This is a general PgBouncer constraint, not an Azure-specific one, but it is frequently missed by teams migrating from AWS RDS or on-premises PostgreSQL, where connection pooling was handled client-side at the application layer.&lt;/p&gt;
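&lt;p&gt;The prepared-statement failure is easier to see in a toy model. The sketch below simulates transaction pooling with two backends. It does not talk to PgBouncer or PostgreSQL, and the class names, the round-robin hand-out order, and the error text are simplifications chosen for illustration.&lt;/p&gt;

```python
from collections import deque


class Backend:
    """One PostgreSQL backend; prepared statements live in its session state."""

    def __init__(self, name: str):
        self.name = name
        self.prepared = set()


class TransactionPool:
    """Lends a backend for one transaction and takes it back at the end."""

    def __init__(self, size: int):
        self.idle = deque(Backend(f"backend-{i}") for i in range(size))

    def run_transaction(self, statement: str, name: str) -> str:
        backend = self.idle.popleft()           # borrow for this transaction only
        try:
            if statement == "PREPARE":
                backend.prepared.add(name)
                return f"prepared on {backend.name}"
            if name not in backend.prepared:    # EXECUTE reached another session
                return f'ERROR: prepared statement "{name}" does not exist'
            return f"executed on {backend.name}"
        finally:
            self.idle.append(backend)           # released when the transaction ends


pool = TransactionPool(size=2)
print(pool.run_transaction("PREPARE", "get_user"))   # lands on backend-0
print(pool.run_transaction("EXECUTE", "get_user"))   # lands on backend-1, fails
```

&lt;p&gt;With a pool of size 1 the same sequence succeeds, which is why this class of bug often passes local testing and appears only under production concurrency.&lt;/p&gt;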
&lt;p&gt;Citus’s documented design requires that the distribution column be present in the primary key and all unique constraints of the distributed table. A table distributed on &lt;code&gt;tenant_id&lt;/code&gt; must include &lt;code&gt;tenant_id&lt;/code&gt; in its primary key (e.g., &lt;code&gt;PRIMARY KEY (tenant_id, id)&lt;/code&gt;). This is documented as a hard requirement — the coordinator cannot enforce uniqueness across shards without the distribution column in the constraint (&lt;a href=&quot;https://docs.citusdata.com/en/v12.1/sharding/data_modeling.html&quot;&gt;Citus distribution docs&lt;/a&gt;). Applications migrated from single-node PostgreSQL typically have auto-increment primary keys without a tenant prefix, requiring a schema migration before Citus distribution is feasible.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Flexible Server: prepared statements with PgBouncer in transaction mode&lt;/td&gt;&lt;td&gt;&lt;code&gt;ERROR: prepared statement does not exist&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Transaction pooling returns the backend connection to the pool after each transaction, so a statement prepared in one transaction may not exist on the backend that serves the next&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Flexible Server: application connects to PostgreSQL port, bypasses PgBouncer&lt;/td&gt;&lt;td&gt;Connection saturation under load&lt;/td&gt;&lt;td&gt;PgBouncer only intercepts connections on port 6432; direct PostgreSQL port (5432) bypasses pooling&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Citus: cross-tenant queries on distributed tables&lt;/td&gt;&lt;td&gt;Fan-out to all workers, high latency&lt;/td&gt;&lt;td&gt;No shard routing possible without distribution column in WHERE clause&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Citus: unique constraints without distribution column&lt;/td&gt;&lt;td&gt;Cannot enforce constraint across shards&lt;/td&gt;&lt;td&gt;Uniqueness is checked per shard, so a constraint that omits the distribution column cannot be enforced across shards&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Flexible Server: HA failover to standby&lt;/td&gt;&lt;td&gt;Connections fail during the 60–120s failover and DNS cutover window&lt;/td&gt;&lt;td&gt;Applications without connection retry logic surface errors until the new primary is reachable&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Citus: uneven tenant distribution (hotspot)&lt;/td&gt;&lt;td&gt;One worker shard saturated while others idle&lt;/td&gt;&lt;td&gt;All rows for a large tenant land on one shard; distribution column alone does not balance load&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Choosing between Flexible Server and Citus after the schema is designed and populated is expensive — Citus requires a distribution-column-aware schema that cannot be retrofitted easily.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use Flexible Server as the default; evaluate Citus only when the workload is multi-tenant with tenant-scoped queries, write throughput exceeds what a single large SKU can sustain, or data volume per tenant is large enough to benefit from distributed storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Benchmark your top write-intensive operations on the largest available Flexible Server SKU under expected peak load; if the primary CPU or WAL write throughput saturates, that is the signal that horizontal distribution is worth the schema redesign cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: If you are building on Flexible Server, enable and configure PgBouncer this week, connect your application through port 6432, and verify prepared statement behavior — this is the most common production misconfiguration on Azure PostgreSQL.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>Cassandra Write Path Fundamentals for Database Engineers</title><link>https://rajivonai.com/blog/2026-05-25-cassandra-write-path-fundamentals-for-database-engineers/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-05-25-cassandra-write-path-fundamentals-for-database-engineers/</guid><description>How Cassandra&apos;s commit log, Memtable, and SSTable pipeline works, why write amplification is the dominant operational cost, and how compaction strategy selection changes it.</description><pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Cassandra’s write performance reputation is correct but incomplete — writes are fast because Cassandra converts random writes into sequential I/O, and the operational cost of that conversion is paid later through compaction, which can saturate disk throughput if the strategy does not match the workload.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Database engineers familiar with PostgreSQL or MySQL approach Cassandra expecting tunable durability, indexing flexibility, and a query optimizer. Cassandra’s durability and performance model works differently: the write path is optimized for sequential I/O at the cost of deferred merge work, and the query model is constrained by the partition key and clustering columns defined at schema creation.&lt;/p&gt;
&lt;p&gt;Cassandra is used in production for workloads requiring high write throughput, time-series data, and geographic multi-region replication — systems where the write path’s operational characteristics are the primary design constraint.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The fundamental problem Cassandra solves is random write throughput. Traditional relational databases perform writes by updating rows in-place on disk pages, which requires random I/O to locate the correct page. At high write rates across large datasets, this random I/O pattern saturates disk throughput.&lt;/p&gt;
&lt;p&gt;Cassandra converts all writes into sequential operations: every write appends to the commit log (sequential disk write) and updates an in-memory structure (Memtable). When the Memtable exceeds a threshold, it is flushed to disk as an immutable SSTable (Sorted String Table) file. The database never updates SSTables in place; mutations are always new writes. This makes the write path fast, but it defers the cost of merging and garbage-collecting old data to compaction.&lt;/p&gt;
&lt;p&gt;The core question: which compaction strategy minimizes the operational cost of the deferred merge work for the workload’s specific access pattern?&lt;/p&gt;
&lt;h2 id=&quot;the-write-path&quot;&gt;The Write Path&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[write request — partition key and columns] --&gt; B[commit log — sequential append — fsync]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C[Memtable — in-memory sorted structure]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D{Memtable full or flush triggered?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no — within threshold| E[write acknowledged to client]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes — threshold exceeded| F[flush Memtable to SSTable on disk]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G[new immutable SSTable file]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; H{compaction threshold reached?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| I[multiple SSTables accumulate]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| J[compaction — merge SSTables — discard tombstones]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt; K[fewer larger SSTables]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Commit Log&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Every write is first appended to the commit log — a sequential append-only file on disk. Cassandra uses the commit log for crash recovery: if the process dies before the Memtable is flushed, the commit log replays the unwritten data on restart. The commit log is the durability guarantee.&lt;/p&gt;
&lt;p&gt;Cassandra’s &lt;code&gt;commitlog_sync&lt;/code&gt; setting controls when the commit log is fsynced to disk:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;periodic&lt;/code&gt; (default): writes are acknowledged after being written to the OS buffer; an fsync happens periodically (default 10,000ms). This is fast but risks losing up to 10 seconds of writes if the node crashes.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;batch&lt;/code&gt;: fsync happens before the write is acknowledged. Durable but slower — adds the fsync latency to every write.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most high-throughput production deployments use &lt;code&gt;periodic&lt;/code&gt; mode with the understanding that a crash can lose up to &lt;code&gt;commitlog_sync_period_in_ms&lt;/code&gt; of data.&lt;/p&gt;
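&lt;p&gt;The exposure in &lt;code&gt;periodic&lt;/code&gt; mode is simple arithmetic: write rate multiplied by the sync period. The sketch below computes that upper bound for a single node. The write rate is a hypothetical figure, and the result ignores replication, so copies held by other replicas may still survive a single-node crash.&lt;/p&gt;

```python
def worst_case_unsynced_writes(writes_per_second: float, sync_period_ms: int) -> float:
    """Upper bound on acknowledged writes not yet fsynced when one node crashes."""
    return writes_per_second * sync_period_ms / 1000


# Hypothetical node taking 20,000 writes/s with the 10,000 ms period quoted above.
print(worst_case_unsynced_writes(20_000, 10_000))   # 200000.0
```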
&lt;p&gt;&lt;strong&gt;Memtable&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After the commit log append, the write is applied to the Memtable — an in-memory sorted data structure partitioned by the partition key and ordered by clustering columns. Multiple concurrent writes accumulate in the Memtable until it is flushed. Reads that target recently written data are served from the Memtable without hitting disk.&lt;/p&gt;
&lt;p&gt;The Memtable is bounded by &lt;code&gt;memtable_heap_space_in_mb&lt;/code&gt; and &lt;code&gt;memtable_offheap_space_in_mb&lt;/code&gt;. When the Memtable exceeds the threshold or when a flush is triggered manually, Cassandra writes it to disk as an immutable SSTable and starts a new Memtable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SSTable and Compaction&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;SSTables are immutable files. An update to an existing row writes a new SSTable entry with a higher timestamp — the old value is not removed. A delete writes a tombstone — a marker indicating the row was deleted. Tombstones accumulate in SSTables until compaction.&lt;/p&gt;
&lt;p&gt;Reads must check all SSTables for the most recent version of a row (plus the Memtable). As SSTable count grows, read latency increases because more files must be checked. Compaction merges SSTables, applies the recency rule (highest timestamp wins), removes tombstones beyond the &lt;code&gt;gc_grace_seconds&lt;/code&gt; threshold, and produces fewer, larger SSTables. This reduces read amplification at the cost of write amplification (new SSTable files written during compaction).&lt;/p&gt;
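&lt;p&gt;The whole pipeline (commit log, Memtable, flush, read merge, compaction) fits in a toy model. The sketch below is a teaching aid under heavy simplifications, not Cassandra’s implementation: it uses a logical clock instead of wall-clock timestamps, keeps everything in memory, and drops tombstones immediately during compaction, whereas Cassandra waits for &lt;code&gt;gc_grace_seconds&lt;/code&gt;. All class and method names are hypothetical.&lt;/p&gt;

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class ToyLSM:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []     # append-only list standing in for the log file
        self.memtable = {}       # key -> (timestamp, value)
        self.sstables = []       # immutable dicts, oldest first
        self.memtable_limit = memtable_limit
        self.clock = 0

    def write(self, key, value):
        self.clock += 1
        self.commit_log.append((self.clock, key, value))   # 1. sequential append
        self.memtable[key] = (self.clock, value)            # 2. in-memory update
        if len(self.memtable) >= self.memtable_limit:       # 3. flush when full
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def delete(self, key):
        self.write(key, TOMBSTONE)   # a delete is just another write

    def read(self, key):
        candidates = [t[key] for t in self.sstables + [self.memtable] if key in t]
        if not candidates:
            return None
        _, value = max(candidates, key=lambda tv: tv[0])    # highest timestamp wins
        return None if value is TOMBSTONE else value

    def compact(self):
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        # Toy behavior: tombstones are dropped at once (no gc_grace_seconds).
        live = {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
        self.sstables = [live]


db = ToyLSM()
db.write("a", 1)
db.write("b", 2)      # Memtable full: first SSTable flushed
db.write("a", 3)
db.delete("b")        # Memtable full: second SSTable flushed
print(len(db.sstables), db.read("a"), db.read("b"))   # 2 3 None
db.compact()
print(len(db.sstables), db.read("a"), db.read("b"))   # 1 3 None
```

&lt;p&gt;Before compaction, a read of &lt;code&gt;a&lt;/code&gt; has to consult both SSTables; afterwards it consults one. That difference is the read amplification compaction buys back, paid for by rewriting the data.&lt;/p&gt;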
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Cassandra’s documentation describes three compaction strategies, each with different tradeoffs (&lt;a href=&quot;https://cassandra.apache.org/doc/stable/cassandra/operating/compaction/&quot;&gt;Apache Cassandra compaction&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Size-Tiered Compaction Strategy (STCS)&lt;/strong&gt; — the default. Groups SSTables of similar sizes into tiers and merges within each tier when the count exceeds a threshold (default 4). Write amplification is low — fewer bytes are rewritten per compaction cycle. Read amplification is higher because many SSTables can accumulate before a tier triggers. STCS is appropriate for write-heavy workloads where read latency is less critical.&lt;/p&gt;
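&lt;p&gt;The tiering idea can be sketched as a bucketing pass over SSTable sizes. This is an approximation for intuition only: it groups files whose size falls within 0.5x to 1.5x of a bucket’s running average and flags buckets that reach the threshold of 4. The real strategy has further options and selection rules that are not modeled here, and the function names and sample sizes are hypothetical.&lt;/p&gt;

```python
def stcs_buckets(sizes_mb, low=0.5, high=1.5):
    """Group SSTable sizes so each bucket holds files near the bucket average."""
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if size >= avg * low and avg * high >= size:
                bucket.append(size)
                break
        else:
            buckets.append([size])   # no similar-sized bucket yet: start one
    return buckets


def compaction_candidates(sizes_mb, min_threshold=4):
    """Buckets with enough similar-sized SSTables to trigger a merge."""
    return [b for b in stcs_buckets(sizes_mb) if len(b) >= min_threshold]


sizes = [10, 11, 9, 12, 160, 170, 640]
print(stcs_buckets(sizes))            # [[9, 10, 11, 12], [160, 170], [640]]
print(compaction_candidates(sizes))   # [[9, 10, 11, 12]]
```

&lt;p&gt;The two larger tiers stay untouched until more files of similar size arrive, which is why SSTable counts, and with them read amplification, can climb between merges.&lt;/p&gt;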
&lt;p&gt;&lt;strong&gt;Leveled Compaction Strategy (LCS)&lt;/strong&gt; maintains SSTables in levels where each SSTable in a level covers a disjoint key range. A given partition key exists in at most one SSTable per level (except Level 0, where SSTables can overlap). This keeps read amplification low, because finding a row requires checking at most one SSTable per level above Level 0, but write amplification is significantly higher because SSTables are rewritten frequently to maintain the level invariant. LCS is appropriate for read-heavy workloads where predictable read latency is required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Time Window Compaction Strategy (TWCS)&lt;/strong&gt; — groups SSTables by time window and compacts within each window. SSTables from old, expired windows are compacted into a single file and then not recompacted. This is optimal for time-series data where old data is rarely updated, because it avoids repeatedly rewriting old SSTables. Cassandra’s TWCS documentation is specific about a key requirement: time-to-live (TTL) must be set consistently on all data in a TWCS table, or tombstones from rows without TTL will never be fully compacted away (&lt;a href=&quot;https://cassandra.apache.org/doc/stable/cassandra/operating/compaction/twcs.html&quot;&gt;TWCS documentation&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tombstone accumulation as an operational hazard.&lt;/strong&gt; In Cassandra’s documented behavior, tombstones for deleted rows remain in SSTables until &lt;code&gt;gc_grace_seconds&lt;/code&gt; has elapsed and a compaction then rewrites the SSTables that hold them. If a partition accumulates a large number of tombstones before that happens (due to high delete rates, low compaction throughput, or misconfigured &lt;code&gt;gc_grace_seconds&lt;/code&gt;), reads on that partition must scan through all of them before returning results. By default, Cassandra logs a warning when a single read scans 1,000 tombstones and aborts the read with a &lt;code&gt;TombstoneOverwhelmingException&lt;/code&gt; at 100,000. High tombstone counts are the most common cause of unexpected read latency on write-optimized Cassandra tables.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;STCS on read-heavy workload&lt;/td&gt;&lt;td&gt;Read latency grows as SSTable count increases between compaction cycles&lt;/td&gt;&lt;td&gt;STCS allows many same-size SSTables to accumulate; reads must check each one&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;LCS on write-heavy workload&lt;/td&gt;&lt;td&gt;Compaction I/O saturates disk throughput&lt;/td&gt;&lt;td&gt;High write amplification from maintaining level invariants requires continuous rewriting&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;TWCS with mixed TTL and non-TTL data&lt;/td&gt;&lt;td&gt;Tombstones never fully compacted in old windows&lt;/td&gt;&lt;td&gt;Non-TTL rows in old time windows prevent old SSTable retirement&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;commitlog_sync: batch&lt;/code&gt; at high write rate&lt;/td&gt;&lt;td&gt;Write throughput drops significantly&lt;/td&gt;&lt;td&gt;Each write waits for an fsync; batching does not fully absorb the overhead at high concurrency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Large partition with many updates&lt;/td&gt;&lt;td&gt;Read latency spikes; repair timeouts&lt;/td&gt;&lt;td&gt;Large partitions accumulate many SSTable entries; repair must process the full partition&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;gc_grace_seconds&lt;/code&gt; set to 0&lt;/td&gt;&lt;td&gt;Deleted rows reappear after node repair&lt;/td&gt;&lt;td&gt;Tombstones are how deletes reach replicas that missed them; purging tombstones before repair has run lets a stale replica reintroduce the deleted data&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Unbounded Memtable heap&lt;/td&gt;&lt;td&gt;JVM GC pauses&lt;/td&gt;&lt;td&gt;On-heap Memtable space competes with the rest of the Cassandra JVM heap; an oversized allocation lengthens GC pauses&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Cassandra’s sequential write path makes writes fast, but the deferred compaction cost creates a continuous background I/O load that can saturate disk and cause read latency spikes if the compaction strategy does not match the workload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Select STCS for write-heavy append workloads, LCS for read-heavy workloads with updates and point lookups, and TWCS for time-series tables with consistent TTL — and verify tombstone accumulation rates on high-delete tables using &lt;code&gt;nodetool cfstats&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;nodetool compactionstats&lt;/code&gt; to see pending compaction tasks and measure live disk I/O during compaction; if compaction cannot keep up with write rate (pending task count grows continuously), the strategy or write rate is mismatched.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Identify your highest-volume Cassandra tables this week, confirm which compaction strategy each uses, and check &lt;code&gt;nodetool cfstats&lt;/code&gt; for tombstone count — any table with tombstones per read above 1,000 warrants immediate investigation.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category></item><item><title>GCP AlloyDB vs Cloud SQL for PostgreSQL: When to Upgrade</title><link>https://rajivonai.com/blog/2026-05-25-gcp-alloydb-vs-cloud-sql-postgresql-when-to-upgrade/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-05-25-gcp-alloydb-vs-cloud-sql-postgresql-when-to-upgrade/</guid><description>When Cloud SQL&apos;s managed PostgreSQL hits its limits and AlloyDB&apos;s columnar cache and HTAP architecture become worth the migration complexity and cost jump.</description><pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Cloud SQL for PostgreSQL handles most managed database workloads on GCP correctly, but teams that hit analytical query performance ceilings or need HTAP capabilities discover they should have evaluated AlloyDB before the schema was in production.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Google offers two managed PostgreSQL services on GCP: Cloud SQL and AlloyDB. Cloud SQL is the established managed PostgreSQL (and MySQL, SQL Server) offering with straightforward HA, backups, and read replicas. AlloyDB is a Google-developed PostgreSQL-compatible database that separates compute from storage using a distributed storage layer, adds an adaptive columnar cache, and supports read pool instances that can run both OLTP and analytical queries against the same data.&lt;/p&gt;
&lt;p&gt;AlloyDB became generally available in May 2023. Most GCP teams deploying PostgreSQL choose Cloud SQL as the default path and only encounter AlloyDB when they are researching options or hitting specific performance limits.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Cloud SQL for PostgreSQL is a managed PostgreSQL instance with HA standby and read replicas. It scales vertically. The limiting pattern: as analytical query volume grows alongside OLTP traffic, the primary instance saturates on CPU, and read replicas lag under heavy read load — because they are executing the same row-scan-based queries that the primary executes. Adding read replicas distributes read connections but not the per-query execution cost.&lt;/p&gt;
&lt;p&gt;AlloyDB’s design addresses a different bottleneck. For OLAP-style queries (aggregations, wide scans, joins across large tables), AlloyDB’s columnar cache stores frequently accessed columns in a compressed columnar format in memory, separate from the row-store. The query engine uses the columnar representation when it is faster, without requiring the application to target a separate analytical store. This is what Google means by HTAP — both OLTP and analytical queries run against the same PostgreSQL-compatible interface, with the storage engine selecting the execution path.&lt;/p&gt;
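&lt;p&gt;Why a columnar layout helps aggregations can be shown with plain Python structures. The sketch below says nothing about AlloyDB’s internals; it only contrasts a row-oriented layout, where an aggregate walks every row, with a column-oriented copy, where the same aggregate reads a single list. The table contents and function names are hypothetical.&lt;/p&gt;

```python
# Row store: one record per order, all fields stored together.
rows = [
    {"order_id": 1, "customer": "acme", "amount": 120.0, "note": "rush"},
    {"order_id": 2, "customer": "zenith", "amount": 80.0, "note": ""},
    {"order_id": 3, "customer": "acme", "amount": 50.0, "note": "gift"},
]

# Columnar copy of the same data: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_row_store(table):
    """Visit every row and pull one field out of each record."""
    return sum(row["amount"] for row in table)


def total_column_store(cols):
    """Read only the single column the aggregate needs."""
    return sum(cols["amount"])


print(total_row_store(rows), total_column_store(columns))   # 250.0 250.0
```

&lt;p&gt;Both paths return the same answer; the difference is how much unrelated data each one has to touch, which grows with table width and row count.&lt;/p&gt;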
&lt;p&gt;The core question: does the workload contain a meaningful volume of analytical queries running against live OLTP data, and is Cloud SQL’s execution performance the actual bottleneck?&lt;/p&gt;
&lt;h2 id=&quot;alloydb-vs-cloud-sql-architecture&quot;&gt;AlloyDB vs Cloud SQL Architecture&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[PostgreSQL workload on GCP] --&gt; B{Workload shape?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|standard OLTP — transactional reads and writes| C[Cloud SQL — managed single-primary]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|mixed OLTP and analytical queries on same data| D{Is Cloud SQL CPU the bottleneck?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no — query volume is moderate| C&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes — analytical queries saturating primary or replicas| E[AlloyDB — columnar cache — HTAP]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[HA standby — read replicas — automatic backups]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; G[Primary — read pool instances — columnar cache — distributed storage]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Cloud SQL for PostgreSQL&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cloud SQL provides a managed PostgreSQL instance with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High availability via a synchronous standby in a secondary zone; Google documents zonal failover typically completing in under 60 seconds with automatic IP cutover (&lt;a href=&quot;https://cloud.google.com/sql/docs/postgres/high-availability&quot;&gt;Cloud SQL HA&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Read replicas in the same or different regions (asynchronous replication)&lt;/li&gt;
&lt;li&gt;Automatic backups and point-in-time recovery up to the retention window&lt;/li&gt;
&lt;li&gt;Private IP, VPC peering, and Cloud SQL Auth Proxy for secure connectivity&lt;/li&gt;
&lt;li&gt;Maintenance windows with configurable timing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Cross-region disaster recovery with Cloud SQL uses cross-region read replicas. Google documents these as asynchronous, meaning a regional failure can result in data loss equal to replication lag at the moment of failure. Replica promotion is a manual operation (&lt;a href=&quot;https://cloud.google.com/sql/docs/postgres/intro-to-cloud-sql-disaster-recovery&quot;&gt;Cloud SQL DR&lt;/a&gt;).&lt;/p&gt;
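&lt;p&gt;The data-loss exposure of an asynchronous replica can be estimated with simple arithmetic: lag at the moment of failure multiplied by the commit rate. The sketch below assumes a steady write rate; the function name &lt;code&gt;writes_at_risk&lt;/code&gt; and the example numbers are hypothetical, not Cloud SQL measurements.&lt;/p&gt;

```python
# Rough upper bound on data at risk for an asynchronous replica.
# All inputs are hypothetical; measure real lag and write rates yourself.


def writes_at_risk(replication_lag_seconds, writes_per_second):
    """Transactions committed on the primary but not yet on the replica."""
    return replication_lag_seconds * writes_per_second


# Example: 4 seconds of lag at 500 committed writes per second.
print(writes_at_risk(4, 500))
```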
&lt;p&gt;&lt;strong&gt;AlloyDB for PostgreSQL&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;AlloyDB separates PostgreSQL compute from storage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The primary instance handles writes; the storage layer is distributed across Google’s infrastructure, replicating synchronously across zones within the region&lt;/li&gt;
&lt;li&gt;Read pool instances share the same storage layer as the primary, so they read from the shared distributed storage rather than maintaining a separately replicated copy of the data; this keeps read lag low compared with replica-based designs&lt;/li&gt;
&lt;li&gt;The adaptive columnar cache stores frequently accessed column data in memory on read pool instances and the primary; the query engine selects columnar or row-store execution per query&lt;/li&gt;
&lt;li&gt;The storage tier handles I/O and durability independently of compute&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AlloyDB is PostgreSQL-compatible at the protocol level. Standard PostgreSQL drivers, pgAdmin, and most tools that connect to PostgreSQL connect to AlloyDB without modification. Extensions that depend on specific storage internals may behave differently.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Google’s AlloyDB documentation describes the columnar cache as an adaptive structure — the database populates it based on query patterns without requiring explicit configuration (&lt;a href=&quot;https://cloud.google.com/alloydb/docs/columnar-engine/about&quot;&gt;AlloyDB columnar engine&lt;/a&gt;). The engine analyzes which columns are accessed frequently by scan-heavy queries and promotes them into the columnar representation. This is distinct from creating a materialized view or a separate analytical table: the data source is the same live table; the storage representation changes based on access patterns.&lt;/p&gt;
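&lt;p&gt;The idea of access-driven promotion can be pictured as a frequency count over the columns that scans touch. The following is a toy model only: it is not AlloyDB’s algorithm, and &lt;code&gt;pick_columns_to_cache&lt;/code&gt;, the &lt;code&gt;capacity&lt;/code&gt; parameter, and the workload are assumptions made for illustration.&lt;/p&gt;

```python
from collections import Counter

# Toy model of access-driven column promotion. This is not AlloyDB's
# algorithm; it only illustrates "cache what scans touch most often".


def pick_columns_to_cache(scanned_columns, capacity):
    """Return the `capacity` most frequently scanned columns."""
    counts = Counter(scanned_columns)
    return [name for name, _ in counts.most_common(capacity)]


# Each entry is one column touched by a scan-heavy query (made-up workload).
workload = ["amount", "region", "amount", "created_at", "amount", "region"]
print(pick_columns_to_cache(workload, capacity=2))
```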
&lt;p&gt;The documented design consequence is that AlloyDB read pool instances can satisfy analytical queries from the columnar cache without adding lag from replication — because they read from the same distributed storage layer as the primary rather than applying a WAL stream. Cloud SQL read replicas apply WAL asynchronously; under heavy write load, replication lag can grow, making replica reads stale for time-sensitive analytics.&lt;/p&gt;
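&lt;p&gt;For asynchronous replicas, one common mitigation is lag-aware routing: send a time-sensitive query to the replica only when its lag is within tolerance. On a PostgreSQL replica the lag can be approximated with &lt;code&gt;now() - pg_last_xact_replay_timestamp()&lt;/code&gt;. The sketch below assumes that value has already been fetched; &lt;code&gt;choose_endpoint&lt;/code&gt;, the endpoint labels, and the thresholds are hypothetical.&lt;/p&gt;

```python
# Sketch of lag-aware read routing for asynchronous replicas.
# The threshold and endpoint names are hypothetical.


def choose_endpoint(replica_lag_seconds, max_staleness_seconds):
    """Send the query to the replica only if its lag is acceptable."""
    if max_staleness_seconds >= replica_lag_seconds:
        return "replica"
    return "primary"


print(choose_endpoint(replica_lag_seconds=2.0, max_staleness_seconds=5.0))
print(choose_endpoint(replica_lag_seconds=30.0, max_staleness_seconds=5.0))
```

&lt;p&gt;Falling back to the primary trades staleness for extra load on the write path, which is the contention described above.&lt;/p&gt;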
&lt;p&gt;Migration from Cloud SQL to AlloyDB uses the Database Migration Service. Google documents that DMS supports online migrations from Cloud SQL for PostgreSQL to AlloyDB with minimal downtime using logical replication (&lt;a href=&quot;https://cloud.google.com/database-migration/docs/postgres-to-alloydb/overview&quot;&gt;DMS AlloyDB migration&lt;/a&gt;). Schema-level PostgreSQL extensions used in Cloud SQL that are not supported in AlloyDB require application changes before migration. The AlloyDB documentation lists supported extensions; notably, some PostGIS and pg_partman functionality may require version verification.&lt;/p&gt;
&lt;p&gt;AlloyDB costs more than Cloud SQL at equivalent compute sizes. Google’s pricing for AlloyDB reflects the separate storage layer billing model — storage is billed per GB regardless of instance size, and read pool instances add compute cost beyond the primary. For workloads where Cloud SQL’s row-store execution is adequate, AlloyDB’s additional cost produces no measurable benefit.&lt;/p&gt;
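&lt;p&gt;A back-of-the-envelope model makes the cost structure concrete: compute is paid per instance, storage per GB. Every price in this sketch is a placeholder rather than a published Google Cloud rate, and &lt;code&gt;monthly_cost&lt;/code&gt; is a hypothetical helper; substitute current list prices before drawing conclusions.&lt;/p&gt;

```python
# Back-of-the-envelope monthly cost comparison.
# Every price below is a placeholder, not a published Google Cloud rate.


def monthly_cost(compute_per_instance, instance_count, storage_gb, price_per_gb):
    """Compute cost for all instances plus per-GB storage cost."""
    return compute_per_instance * instance_count + storage_gb * price_per_gb


# Hypothetical: one primary versus a primary plus two read pool nodes.
single_primary = monthly_cost(400.0, 1, 500, 0.20)
with_read_pool = monthly_cost(520.0, 3, 500, 0.30)
print(single_primary, with_read_pool)
```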
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;AlloyDB — columnar cache cold on startup&lt;/td&gt;&lt;td&gt;Analytical queries revert to row-store performance until cache warms&lt;/td&gt;&lt;td&gt;Cache is populated from query patterns; a restarted instance has no cached columns initially&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AlloyDB — extension dependency not supported&lt;/td&gt;&lt;td&gt;Migration blocked or application behavior changes&lt;/td&gt;&lt;td&gt;AlloyDB does not support all PostgreSQL extensions available in Cloud SQL; verify before migrating&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cloud SQL cross-region replica — regional failover&lt;/td&gt;&lt;td&gt;Manual promotion, potential data loss equal to replication lag&lt;/td&gt;&lt;td&gt;Cross-region replicas are asynchronous; no automatic promotion to primary&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AlloyDB — write-heavy workload with no analytical queries&lt;/td&gt;&lt;td&gt;Cost increase with no performance benefit&lt;/td&gt;&lt;td&gt;The columnar cache and read pool architecture only benefit mixed or analytical workloads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cloud SQL — analytical query on primary during peak OLTP&lt;/td&gt;&lt;td&gt;CPU saturation affects write latency&lt;/td&gt;&lt;td&gt;Row-store execution for wide scans competes with OLTP for CPU; no separate execution path&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AlloyDB — connection to read pool for write operations&lt;/td&gt;&lt;td&gt;Write rejected&lt;/td&gt;&lt;td&gt;Read pool instances are read-only; writes must target the primary endpoint&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Cloud SQL’s row-store execution handles OLTP well but has no separate code path for analytical queries, meaning mixed workloads compete for the same CPU on primary and replicas.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Evaluate AlloyDB when analytical queries represent a meaningful share of query volume, Cloud SQL CPU is the bottleneck during analytical load, and the workload runs in a single GCP region (AlloyDB does not currently support cross-region reads with the shared storage model).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; on the three slowest analytical queries in Cloud SQL and check where the time goes; if it is dominated by scan and aggregation nodes rather than I/O waits or lock contention, AlloyDB’s columnar cache addresses the actual bottleneck.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Before committing to AlloyDB, verify that all PostgreSQL extensions in use are supported by AlloyDB and budget for the cost differential; if the workload is exclusively transactional with no wide-scan analytics, Cloud SQL remains the correct choice.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>The Stack for AI-Accelerated Database Operations Is Now Open Source</title><link>https://rajivonai.com/blog/2026-05-24-ai-database-ops-tools-may-2026/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-05-24-ai-database-ops-tools-may-2026/</guid><description>Three May 2026 breakout projects close the gaps that stop database teams from moving schema changes, query assistance, and operational workflows to AI: declarative Postgres migrations, local LLM inference, and a full agent platform.</description><pubDate>Sun, 24 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Database teams that have tried to adopt AI tooling hit the same three walls: schema change management tools that predate modern declarative infrastructure, LLMs that require sending production schema to a third-party API, and the months of engineering it takes to build a custom agent with RAG, a workflow engine, and plugin support.&lt;/strong&gt; Three projects that hit a combined 35,000 stars in May 2026 close each of those gaps — and together form a self-hosted stack that lets a database team automate schema changes, run local model inference for query assistance, and deploy operational agents without writing the platform from scratch.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The case for AI assistance in database operations is clear: SQL generation, query plan explanation, schema review, and runbook execution are all pattern-matching tasks that language models handle well. The barrier has not been capability — it has been infrastructure. Declarative schema management requires an opinionated tool that understands PostgreSQL’s full object model. Local LLM inference capable of handling database-scale context requires an optimized serving layer most teams cannot build. And building an internal database operations agent requires assembling a RAG pipeline, workflow engine, model router, plugin system, and debugging interface — six months of work before the first query gets answered.&lt;/p&gt;
&lt;p&gt;May 2026 produced open-source solutions to each of these independently.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure modes that block database teams from using AI effectively:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Manual migration file sequencing&lt;/td&gt;&lt;td&gt;Flyway/Liquibase require numbered files; concurrent development causes sequence conflicts&lt;/td&gt;&lt;td&gt;One mis-sequenced migration in a multi-developer team fails deployment&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cloud LLM schema exposure&lt;/td&gt;&lt;td&gt;ChatGPT and Gemini require sending schema to third-party APIs&lt;/td&gt;&lt;td&gt;Unacceptable for teams with data residency or compliance requirements&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Agent platform build cost&lt;/td&gt;&lt;td&gt;RAG + workflow + plugin + model router = 4-6 months of foundational engineering&lt;/td&gt;&lt;td&gt;Teams never get to the actual automation; they build infrastructure instead&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Shadow database requirement&lt;/td&gt;&lt;td&gt;Most state-based schema tools need a spare database to validate migrations&lt;/td&gt;&lt;td&gt;Adds infra dependency to every CI pipeline run&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Local inference complexity&lt;/td&gt;&lt;td&gt;vLLM requires significant configuration; the codebase is not readable&lt;/td&gt;&lt;td&gt;Teams can’t audit, modify, or debug the inference layer they’re running&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The question for a database team evaluating AI tooling in mid-2026: is there a path to all three capabilities — schema-as-code, local inference, agent platform — without building foundational infrastructure?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;These three tools form a complete answer. Each targets one layer:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam[database team — daily operations]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam --&gt; SchemaWork[schema change management]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam --&gt; QueryWork[query assistance and schema review]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam --&gt; OpsWork[operational runbooks and incident workflows]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    SchemaWork --&gt; pgschema[pgschema — declare target state, generate DDL automatically]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    QueryWork --&gt; nanovllm[nano-vllm — local LLM inference, schema never leaves the server]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    OpsWork --&gt; CozeStudio[coze-studio — visual agent builder with RAG and workflow engine]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    pgschema --&gt; Outcome1[migrations reviewed and applied without manual file sequencing]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    nanovllm --&gt; Outcome2[query plans explained, SQL generated, no third-party API]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    CozeStudio --&gt; Outcome3[DB ops agent deployed in days not months]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;pgschema--declarative-schema-migrations-for-postgresql&quot;&gt;pgschema — Declarative Schema Migrations for PostgreSQL&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; Flyway and Liquibase require manually writing and numbering migration files. In a team with multiple engineers touching the schema, migration numbers conflict, files get applied out of order, and the “what does the current schema look like” question requires reading a long history of incremental files rather than a single state definition.&lt;/p&gt;
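&lt;p&gt;The numbering conflict is easy to reproduce. The sketch below scans Flyway-style filenames (&lt;code&gt;V{version}__{description}.sql&lt;/code&gt;) and reports versions claimed by more than one file, which is what happens when two branches each add the next migration. The filenames and the helper &lt;code&gt;find_version_conflicts&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import re
from collections import defaultdict

# Detect clashing version numbers in Flyway-style migration filenames
# (V{version}__{description}.sql). The filenames are made up.

VERSIONED = re.compile(r"^V(\d+)__(.+)\.sql$")


def find_version_conflicts(filenames):
    """Map each version used more than once to the files that claim it."""
    by_version = defaultdict(list)
    for name in filenames:
        match = VERSIONED.match(name)
        if match:
            by_version[int(match.group(1))].append(name)
    return {v: names for v, names in by_version.items() if len(names) != 1}


branch_files = [
    "V4__create_orders.sql",
    "V5__add_orders_index.sql",    # from one engineer's branch
    "V5__add_customer_email.sql",  # from another engineer's branch
]
print(find_version_conflicts(branch_files))
```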
&lt;p&gt;pgschema, built by the Bytebase team, takes a Terraform-style approach: you declare what the schema &lt;em&gt;should look like&lt;/em&gt;, and the tool generates the SQL to get from the current state to that state. The workflow is &lt;code&gt;dump → edit → plan → apply&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Capture current schema state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pgschema&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; dump&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --url&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $DATABASE_URL &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;--output&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; schema.sql&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Edit schema.sql directly — add columns, indexes, RLS policies&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Then preview what SQL will be generated&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pgschema&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; plan&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --url&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $DATABASE_URL &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;--schema&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; schema.sql&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Apply with lock timeout control and concurrent change detection&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pgschema&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; apply&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --url&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $DATABASE_URL &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;--schema&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; schema.sql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --lock-timeout&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; 5s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;plan&lt;/code&gt; step shows the exact DDL that will execute before anything touches the database — the same workflow &lt;code&gt;terraform plan&lt;/code&gt; established for infrastructure. For a team that does code review on migrations, this means reviewing a human-readable schema diff rather than a raw SQL file.&lt;/p&gt;
&lt;p&gt;Two properties from the README are relevant for production database teams. First, pgschema handles PostgreSQL-specific objects that tools like Liquibase skip: row-level security policies, partitioned tables, partial indexes, identity columns, domain types, and column-level grants. Second, it uses an embedded Postgres instance for validation instead of requiring a shadow database — removing a persistent infrastructure dependency from the CI pipeline.&lt;/p&gt;
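&lt;p&gt;To make the declarative idea concrete, here is a toy state-based diff for the columns of a single table: compare the current state with the desired state and emit the DDL that closes the gap. It is a sketch of the general technique, not pgschema’s algorithm, and it ignores constraints, indexes, and dependency ordering; &lt;code&gt;plan_column_changes&lt;/code&gt; and the example table are hypothetical.&lt;/p&gt;

```python
# Toy state-based diff: compare current and desired column sets for one
# table and emit DDL. Not pgschema's algorithm; names are illustrative.


def plan_column_changes(table, current, desired):
    """Return ALTER TABLE statements that move `current` to `desired`.

    Both arguments map column name to SQL type.
    """
    statements = []
    for name, sql_type in desired.items():
        if name not in current:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type};")
        elif current[name] != sql_type:
            statements.append(
                f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {sql_type};"
            )
    for name in current:
        if name not in desired:
            statements.append(f"ALTER TABLE {table} DROP COLUMN {name};")
    return statements


current = {"id": "bigint", "email": "varchar(100)", "legacy_flag": "boolean"}
desired = {"id": "bigint", "email": "text", "created_at": "timestamptz"}
for statement in plan_column_changes("users", current, desired):
    print(statement)
```

&lt;p&gt;The printed statements play the role of a plan: they are what a reviewer would read before anything is applied.&lt;/p&gt;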
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; pgschema is PostgreSQL-only. Teams running MySQL, SQL Server, or mixed environments cannot use it for their full schema footprint. It is also a young project; the README does not yet document behavior on very large schemas with hundreds of tables and complex dependency graphs. Start with a non-critical database to build confidence in the plan output before applying to production.&lt;/p&gt;
&lt;h3 id=&quot;nano-vllm--local-llm-inference-in-1200-lines&quot;&gt;nano-vllm — Local LLM Inference in 1,200 Lines&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; Running an LLM locally for database assistance — query plan explanation, SQL generation, schema review — requires an inference server. vLLM is the production standard, but its codebase is large and complex, which makes it difficult to audit, modify, or trust for teams that want to understand exactly what their inference layer does. nano-vllm is a clean reimplementation of vLLM’s core in approximately 1,200 lines of Python.&lt;/p&gt;
&lt;p&gt;From the project README, a benchmark on an RTX 4070 Laptop (8 GB VRAM) running Qwen3-0.6B shows nano-vllm achieving 1,434 tokens per second versus vLLM’s 1,361 tokens per second on the same hardware and workload. The implementation includes prefix caching, tensor parallelism, Torch compilation, and CUDA graph execution — the same optimization techniques vLLM uses, readable in a codebase that a database engineer can actually review.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; nanovllm &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; LLM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, SamplingParams&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;llm &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; LLM(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/models/sqlcoder-7b&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;enforce_eager&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;True&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;tensor_parallel_size&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;params &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SamplingParams(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;temperature&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0.1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;max_tokens&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;512&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
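&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Placeholder input: EXPLAIN output saved to a local file (hypothetical path)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;query_plan &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; open&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;plan.txt&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;).read()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Ask for query plan explanation without sending schema to any external API&lt;/span&gt;&lt;/span&gt;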
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;outputs &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; llm.generate(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Explain this PostgreSQL query plan and identify the bottleneck:&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; +&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_plan],&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    params&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;print&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(outputs[&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;][&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;text&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;])&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For database teams, the critical property is that the schema never leaves the server. A local Qwen3 or SQLCoder model running on a workstation with a GPU can explain query plans, suggest indexes, generate SQL, and review migrations — all without a cloud API key or a data residency risk.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; nano-vllm requires a CUDA-capable GPU. The documented benchmark uses a small model (0.6B parameters) on 8 GB VRAM; database workloads that need a larger model or a longer context window require proportionally more VRAM. A 7B model needs roughly 14 GB in float16 for the weights alone, before the KV cache. Teams without GPU infrastructure should evaluate whether a CPU-only path (llama.cpp) meets their latency requirements.&lt;/p&gt;
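&lt;p&gt;The 14 GB figure comes from a simple rule of thumb: parameter count multiplied by bytes per parameter. The sketch below covers weights only and uses decimal gigabytes; KV cache and activations add to the total. &lt;code&gt;weight_memory_gb&lt;/code&gt; and the &lt;code&gt;BYTES_PER_PARAM&lt;/code&gt; table are illustrative helpers, not part of nano-vllm.&lt;/p&gt;

```python
# Weight-only memory estimate: parameters x bytes per parameter.
# KV cache and activations need additional memory on top of this.

BYTES_PER_PARAM = {"float16": 2, "int8": 1}


def weight_memory_gb(param_count, dtype="float16"):
    """Approximate weight footprint in decimal gigabytes."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


print(weight_memory_gb(7e9))     # 7B model in float16
print(weight_memory_gb(0.6e9))   # 0.6B model in float16
```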
&lt;h3 id=&quot;coze-studio--build-your-db-ops-agent-in-days-not-months&quot;&gt;coze-studio — Build Your DB Ops Agent in Days, Not Months&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; Building an internal database operations agent — one that answers schema questions, walks engineers through runbooks, escalates incidents, or generates migration plans from a description — requires assembling six layers: a RAG pipeline for internal documentation, a model router, a workflow engine for multi-step operations, a plugin system for tool calls, a debugging interface, and a deployment layer. The Coze platform, which ByteDance has used to serve tens of thousands of enterprises according to the project README, has these layers built and tested.&lt;/p&gt;
&lt;p&gt;In May 2026, ByteDance open-sourced the full Coze Studio codebase under Apache 2.0. The backend is Go, the frontend is React + TypeScript, and the architecture is microservices designed around domain-driven design (DDD) principles. The README documents the feature set: model service integration (OpenAI, Volcengine, or any compatible endpoint), agent builder with visual workflow design, RAG knowledge base management, plugin system for external tool calls, and a database resource connector.&lt;/p&gt;
&lt;p&gt;For a database team, the practical starting point is a knowledge base agent: index your runbooks, schema documentation, and postmortem archive into the built-in RAG system, connect it to your preferred model (including a local endpoint like nano-vllm), and deploy an agent that database engineers can query during incidents.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;git&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; clone&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; https://github.com/coze-dev/coze-studio&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;cd&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; coze-studio&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Configure model endpoints in .env (supports local endpoints)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;docker&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; compose&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; up&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Access the visual builder at http://localhost:8080&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The visual workflow builder means a database engineer — not a backend developer — can assemble a multi-step runbook agent: query the knowledge base, call a database API, evaluate the result, route to a different action based on the outcome. The plugin system connects to external tools: monitoring APIs, ticketing systems, database management endpoints.&lt;/p&gt;
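&lt;p&gt;The four steps named above (knowledge base lookup, database API call, evaluation, routing) can be sketched in plain Python to show the control flow a visual workflow encodes. Nothing here is a Coze Studio API: &lt;code&gt;search_runbooks&lt;/code&gt;, &lt;code&gt;fetch_replica_lag&lt;/code&gt;, and &lt;code&gt;run_workflow&lt;/code&gt; are hypothetical stubs with hard-coded values.&lt;/p&gt;

```python
# Plain-Python sketch of a four-step runbook: look up guidance, call a
# database API, evaluate the result, route to an action. The functions are
# hypothetical stubs, not Coze Studio APIs.


def search_runbooks(symptom):
    """Stub knowledge-base lookup returning a lag threshold in seconds."""
    return {"replication lag": 30}.get(symptom, 0)


def fetch_replica_lag():
    """Stub for a monitoring or database API call."""
    return 45


def run_workflow(symptom, fetch_metric=fetch_replica_lag):
    threshold = search_runbooks(symptom)   # step 1: knowledge base
    observed = fetch_metric()              # step 2: database API
    breached = observed > threshold        # step 3: evaluate
    return "page on-call" if breached else "log and close"  # step 4: route


print(run_workflow("replication lag"))
```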
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; Coze Studio is designed around a microservices architecture, which means the self-hosted deployment is non-trivial compared to a single-container application. The README is primarily oriented toward Volcengine (ByteDance’s cloud platform) for production deployment; self-hosted configuration documentation is less detailed than the feature documentation. Teams should expect to invest in deployment configuration before reaching a stable internal instance.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;ByteDance’s decision to open-source the Coze platform fits a broader pattern in platform engineering: teams standardizing on shared toolchains, including visual agent builders for multi-step database workflows, rather than maintaining bespoke automation scripts.&lt;/p&gt;
&lt;p&gt;The capabilities described above follow from how each tool is documented to work. PostgreSQL features such as row-level security (RLS) policies, partitioned tables, and partial indexes require exact schema state comparisons. &lt;code&gt;pgschema&lt;/code&gt; handles this by validating the generated DDL against an embedded Postgres instance before execution, which avoids the drift common in manual migration sequencing.&lt;/p&gt;
&lt;p&gt;Similarly, &lt;code&gt;nano-vllm&lt;/code&gt; uses the same optimization techniques as production inference servers. With prefix caching and CUDA graph execution, it reaches the README’s benchmark throughput (1,434 tokens/sec on an RTX 4070 Laptop GPU for Qwen3-0.6B) in a codebase of roughly 1,200 lines that a team can read end to end. The open-source release of &lt;code&gt;coze-studio&lt;/code&gt; is new as of May 2026, so teams should validate multi-step agent behavior against non-production data before full adoption.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;pgschema plan diverges on complex schemas&lt;/td&gt;&lt;td&gt;Large schemas with circular dependencies or custom extensions&lt;/td&gt;&lt;td&gt;Run &lt;code&gt;pgschema plan&lt;/code&gt; (a dry run) and review every DDL statement before apply&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;pgschema Postgres-only&lt;/td&gt;&lt;td&gt;MySQL or SQL Server in the same fleet&lt;/td&gt;&lt;td&gt;Use pgschema only for the Postgres layer; keep existing tooling for other engines&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;nano-vllm VRAM ceiling&lt;/td&gt;&lt;td&gt;7B+ model exceeds available GPU memory&lt;/td&gt;&lt;td&gt;Use a smaller model that fits in VRAM, or fall back to llama.cpp with a quantized (GGUF Q4) model for CPU inference&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;coze-studio microservices overhead&lt;/td&gt;&lt;td&gt;Single-engineer team deploying self-hosted&lt;/td&gt;&lt;td&gt;Start with Docker Compose configuration; avoid Kubernetes deployment until scale demands it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;coze-studio Volcengine defaults&lt;/td&gt;&lt;td&gt;Default model and storage configs point to ByteDance’s cloud&lt;/td&gt;&lt;td&gt;Override all endpoint configs in &lt;code&gt;.env&lt;/code&gt; before first run; audit outbound connections&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Schema migrations break in multi-developer teams, cloud LLMs expose schema to third parties, and building a DB ops agent from scratch takes months.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: pgschema for declarative Postgres migrations, nano-vllm for local model inference, coze-studio for the agent platform layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;pgschema plan&lt;/code&gt; against your development database on any recent migration — compare the generated DDL against what was written manually. If the output is equivalent, you have eliminated one class of migration authoring error.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, install nano-vllm with a local SQLCoder or Qwen3 model and run it against three slow-query logs from last month’s incidents. If the explanations are accurate, you have a local query assistant that requires no cloud API and exposes no schema externally.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>architecture</category></item><item><title>Stop Writing Ad-Hoc Queries: Build a Skill Backbone for Your DB Engineering Workflows</title><link>https://rajivonai.com/blog/2026-05-16-stop-writing-ad-hoc-queries-build-a-skill-backbone-for-your-db/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-05-16-stop-writing-ad-hoc-queries-build-a-skill-backbone-for-your-db/</guid><description>How to codify repetitive DB tasks into testable, reusable Claude skills that produce consistent SQL, runbooks, and migration outputs instead of one-off chat prompts.</description><pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Ad-hoc prompting against a non-deterministic system produces non-deterministic results. It is time to stop re-typing the same &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; prompts and start treating LLMs like testable system components.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Every DBA has a mental library of prompts. The one that pastes in &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output and asks for index candidates. The one that diffs a schema and asks for a migration with a matching rollback. The one that reads a PagerDuty timeline and drafts an RCA doc. You’ve typed variants of these hundreds of times. Each new Claude Code session starts blank, so you spend the first three minutes reconstructing context: the table names, the engine version, the constraint that you’re on Aurora MySQL 3.04 so generated columns behave differently, the rule that on PostgreSQL every index in a migration must be built with &lt;code&gt;CONCURRENTLY&lt;/code&gt; so a 400M-row table never blocks writes.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Across a team, those minutes compound into real engineering hours. More importantly, the output varies. Ask the same slow-query prompt five times in a week and you can get five different index candidates, three different confidence levels, and at least one suggestion that would cause a lock timeout in production.&lt;/p&gt;
&lt;p&gt;The deeper failure is that ad-hoc prompting gives up the one thing that makes LLMs usable at scale: a constrained output shape. An ad-hoc prompt returns whatever the model decides is useful that day, and pointing that at a 200M-row &lt;code&gt;orders_fact&lt;/code&gt; table is not an acceptable risk posture. How do we replace ad-hoc prompting with database automation that is repeatable, testable, and constrained?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The fix is codification. Turn your most-used database workflows into named Claude Code skills, benchmark them against historical workloads, and automate the routine ones on a schedule.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Extract skill candidates.&lt;/strong&gt;
Open a session and paste in your recent Jira or Linear ticket titles, PagerDuty alerts, and Slack threads. Identify recurring task patterns and group them by trigger type. Common candidates include slow query triage, index bloat checks, migration generation, schema drift detection, and RCA doc generation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Write the skill files.&lt;/strong&gt;
Skills live in &lt;code&gt;.claude/skills/&lt;/code&gt; as Markdown files. Each file is an instruction set structured like a runbook.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;markdown&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;# slow-query-triage&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;## Purpose&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;Analyze a slow query on Aurora PostgreSQL and return structured optimization candidates.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;## Inputs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $QUERY: the slow SQL statement&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $EXPLAIN: output of EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) run against the query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $ENGINE_VERSION: PostgreSQL major version (e.g., 15)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;## Steps&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;1.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Parse $EXPLAIN for sequential scans, hash joins on large row estimates, and high shared buffer reads&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;2.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; For each seq scan: estimate selectivity using pg_stats.n_distinct and pg_stats.most_common_vals&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;3.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Propose CREATE INDEX CONCURRENTLY statements; prefer partial indexes where filter predicate is stable&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;4.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Flag any suggestion that requires a full table rewrite (e.g., adding a column with a non-null DEFAULT on PG &amp;#x3C; 11)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;5.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Assign a risk label: safe | lock-risk | rewrite-required&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;## Output format&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;Return exactly:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; EXPLAIN summary (2–3 sentences)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Index candidates table: column | type | estimated selectivity | risk&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; CREATE INDEX CONCURRENTLY statements, ready to copy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Migration risk: safe | lock-risk | rewrite-required&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
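&lt;p&gt;An output contract only helps if something checks it. The sketch below is one hedged way to validate a triage report against the output format above. The function name &lt;code&gt;validate_triage_output&lt;/code&gt; and the plain-text report layout (a literal &lt;code&gt;Migration risk:&lt;/code&gt; line) are assumptions made for illustration; they are not part of Claude Code.&lt;/p&gt;

```python
import re

# Risk labels copied from the skill's "Output format" section.
ALLOWED_RISKS = {"safe", "lock-risk", "rewrite-required"}


def validate_triage_output(report):
    """Return a list of contract violations; an empty list means the report passes.

    Assumes the report is plain text containing a line such as
    'Migration risk: safe' (hypothetical layout for this sketch).
    """
    problems = []

    # The skill must label overall migration risk with one of the allowed values.
    match = re.search(r"^Migration risk:\s*(\S+)\s*$", report, flags=re.MULTILINE)
    if match is None:
        problems.append("missing 'Migration risk:' line")
    elif match.group(1) not in ALLOWED_RISKS:
        problems.append(f"unknown risk label: {match.group(1)}")

    # Every proposed index build must use CONCURRENTLY.
    index_stmts = re.findall(
        r"CREATE\s+(?:UNIQUE\s+)?INDEX\b[^;]*;", report, flags=re.IGNORECASE
    )
    for stmt in index_stmts:
        if not re.search(r"\bCONCURRENTLY\b", stmt, flags=re.IGNORECASE):
            problems.append(f"index build without CONCURRENTLY: {stmt.strip()}")

    return problems
```

&lt;p&gt;A check like this can run after every skill invocation, so a report that drifts from the contract is rejected before anyone copies a statement out of it.&lt;/p&gt;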
&lt;p&gt;&lt;strong&gt;Step 3: Build a workflow skill for migration cascade.&lt;/strong&gt;
Individual skills compose into workflow skills. A migration cascade skill chains: schema diff → migration SQL → rollback script → staging apply → row-count validation → draft PR. Each step calls a sub-skill or a direct tool invocation.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;markdown&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;# migration-cascade&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;## Steps&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;1.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Run /schema-diff against $CURRENT_SCHEMA and $TARGET_SCHEMA&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;2.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Write V{n}__change.sql following Flyway naming convention&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;3.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Write U{n}__change.sql (Flyway undo-migration naming); every DDL must have an explicit undo statement&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;4.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Apply to $STAGING_URL using Flyway migrate; capture exit code&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;5.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Validate: SELECT COUNT(*) FROM $TABLE before and after; assert counts match within 0.1%&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;6.&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Open draft GitHub PR; title format: &quot;db: V{n} — {one-line description}&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF;font-weight:bold&quot;&gt;## Abort conditions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Flyway exit code != 0: stop, write error to stdout, do not open PR&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Row count delta &gt; 0.1%: stop, flag for manual review&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
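&lt;p&gt;The two abort conditions are simple enough to pin down in code, which removes any ambiguity about what “within 0.1%” means. This is a minimal sketch; the helper names &lt;code&gt;row_count_delta&lt;/code&gt; and &lt;code&gt;should_abort&lt;/code&gt; are hypothetical, and the counts are assumed to come from the &lt;code&gt;SELECT COUNT(*)&lt;/code&gt; in step 5.&lt;/p&gt;

```python
def row_count_delta(before, after):
    """Relative change in row count, as a fraction of the 'before' count."""
    if before == 0:
        # Avoid division by zero: an empty table must stay empty to count as unchanged.
        return 0.0 if after == 0 else float("inf")
    return abs(after - before) / before


def should_abort(flyway_exit_code, before, after, tolerance=0.001):
    """Mirror the two abort conditions: non-zero Flyway exit, or delta above 0.1%."""
    if flyway_exit_code != 0:
        return True
    return row_count_delta(before, after) > tolerance
```

&lt;p&gt;The delta is measured relative to the pre-migration count, so the same tolerance behaves sensibly on a 10-thousand-row lookup table and a 400M-row fact table.&lt;/p&gt;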
&lt;p&gt;&lt;strong&gt;Step 4: Schedule the routine skills.&lt;/strong&gt;
Local schedules run only while your machine is on, and they have access to your CLIs, credentials, and skill files. Cloud automations cannot reach your internal &lt;code&gt;$PROD_RO_URL&lt;/code&gt;; use them only for tasks that operate on exported data.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Trigger[DBA trigger] --&gt; OnDemand{on demand or scheduled?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    OnDemand --&gt;|on demand| Invoke[invoke skill in Claude Code]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    OnDemand --&gt;|scheduled| Cron[cron shell script]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Invoke --&gt; SkillFile[skills — skill-name.md]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Cron --&gt; SkillFile&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    SkillFile --&gt; Claude[Claude reads skill context]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Claude --&gt; DB[(pg_stat_statements — read replica)]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Claude --&gt; Files[migration files and schema definitions]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DB --&gt; Output[structured output]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Files --&gt; Output&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Output --&gt; Report[markdown report to db-health vault]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Output --&gt; PR[draft GitHub PR with rollback attached]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Output --&gt; Alert[Slack alert if threshold exceeded]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 5: Benchmark before you roll out.&lt;/strong&gt;
Pull slow queries from &lt;code&gt;pg_stat_statements&lt;/code&gt; for which you have ground truth, meaning you know which index was eventually deployed. Run each through the skill. Measure two things: whether the recommended index matches what was actually deployed, and whether the generated statement runs cleanly against the current schema. Accept the skill only if it passes both checks across the golden set.&lt;/p&gt;
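&lt;p&gt;The benchmark reduces to two rates over a golden set. The sketch below shows one way to score it. The &lt;code&gt;GoldenCase&lt;/code&gt; record and its field names are hypothetical, and reading “accept” as a 100% pass on both metrics is an assumption; loosen the threshold if your golden set contains legitimately debatable cases.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    query_id: str
    deployed_columns: tuple   # columns of the index that was actually shipped
    suggested_columns: tuple  # columns the skill proposed for the same query
    statement_valid: bool     # did the suggested DDL run against a scratch copy of the schema?


def score(cases):
    """Return (index match rate, valid statement rate) over the golden set."""
    total = len(cases)
    if total == 0:
        raise ValueError("golden set is empty")
    # Column order matters for B-tree indexes, so compare tuples exactly.
    matches = sum(1 for c in cases if c.suggested_columns == c.deployed_columns)
    valid = sum(1 for c in cases if c.statement_valid)
    return matches / total, valid / total


def accept(cases):
    """Accept the skill only if every case passes both checks (assumed threshold)."""
    match_rate, valid_rate = score(cases)
    return match_rate == 1.0 and valid_rate == 1.0
```

&lt;p&gt;Keeping the two rates separate is deliberate: a skill that proposes the right columns but emits DDL that fails on the current schema has a different defect from one that proposes the wrong index.&lt;/p&gt;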
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Public engineering handbooks such as GitLab’s describe reviewing query plans before a migration is applied. Translating that practice to an LLM-driven workflow means replacing chat windows with version-controlled skill definitions.&lt;/p&gt;
&lt;p&gt;When evaluating query performance, PostgreSQL’s query planner behaves predictably given accurate table statistics. By forcing the LLM to read &lt;code&gt;pg_stats.n_distinct&lt;/code&gt; and &lt;code&gt;pg_stats.most_common_vals&lt;/code&gt; rather than guess selectivity, the skill grounds its recommendations in the same statistics the planner uses for its own estimates.&lt;/p&gt;
&lt;p&gt;A common rule for safe schema changes is that every data definition language (DDL) operation has an explicit, tested inverse. A migration cascade skill enforces this by pairing each generated &lt;code&gt;V{n}__change.sql&lt;/code&gt; with a matching rollback script, so that a lock-risk migration on a large table can be reverted quickly if application metrics degrade.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Failure Mode&lt;/th&gt;&lt;th&gt;Mitigation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Aurora MySQL 3.x&lt;/td&gt;&lt;td&gt;&lt;code&gt;EXPLAIN FORMAT=TREE&lt;/code&gt; output differs from JSON, causing the skill to estimate selectivity incorrectly.&lt;/td&gt;&lt;td&gt;Pin the &lt;code&gt;$ENGINE_VERSION&lt;/code&gt; input and branch the parsing logic in the skill.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Complex constraints&lt;/td&gt;&lt;td&gt;A &lt;code&gt;DROP COLUMN&lt;/code&gt; with check constraints cannot be naively rolled back with &lt;code&gt;ADD COLUMN&lt;/code&gt;.&lt;/td&gt;&lt;td&gt;Add an explicit step to dump the column definition from &lt;code&gt;information_schema.columns&lt;/code&gt; before generating the migration.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Model updates&lt;/td&gt;&lt;td&gt;A model update changes the output format, turning a structured index table into prose.&lt;/td&gt;&lt;td&gt;Run a weekly cron against your benchmark suite and alert on output format regression.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Large &lt;code&gt;EXPLAIN&lt;/code&gt; output&lt;/td&gt;&lt;td&gt;A 12-table join on a 500M-row table exceeds the token budget for the context window.&lt;/td&gt;&lt;td&gt;Truncate to the first 200 lines and extract only &lt;code&gt;seq scan&lt;/code&gt; and &lt;code&gt;hash join&lt;/code&gt; nodes before invoking the skill.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
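&lt;p&gt;The mitigation for oversized &lt;code&gt;EXPLAIN&lt;/code&gt; output can be sketched as a small pre-filter. This version assumes PostgreSQL text-format plans, where the nodes print as &lt;code&gt;Seq Scan&lt;/code&gt; and &lt;code&gt;Hash Join&lt;/code&gt;. It also filters before truncating, which is a deliberate departure from the order given in the table: cutting the raw plan first could drop a relevant node that sits deep in a long plan. The function name &lt;code&gt;extract_nodes&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import re

# Node types the triage skill cares about, as PostgreSQL prints them in text-format plans.
# The optional words cover variants such as "Hash Left Join" or "Hash Semi Join".
NODE_PATTERN = re.compile(r"\b(Seq Scan|Hash (?:\w+ )*Join)\b")


def extract_nodes(explain_text, max_lines=200):
    """Keep only plan lines that mention a Seq Scan or Hash Join node.

    Filters first and truncates second, so the line cap applies to
    relevant nodes rather than to the raw plan output.
    """
    kept = [
        line.rstrip()
        for line in explain_text.splitlines()
        if NODE_PATTERN.search(line)
    ]
    return kept[:max_lines]
```

&lt;p&gt;Leading indentation is preserved on purpose, because in a text plan the indentation is what encodes the parent and child relationship between nodes.&lt;/p&gt;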
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Ad-hoc LLM prompts for database triage yield non-deterministic results and are impossible to benchmark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Codify repetitive tasks into testable, version-controlled skill files that enforce structured output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: PostgreSQL’s &lt;code&gt;pg_stat_statements&lt;/code&gt; provides a ground-truth dataset to benchmark skill accuracy against historical deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Pull the last 20 slow queries from &lt;code&gt;pg_stat_statements&lt;/code&gt;, write a &lt;code&gt;.claude/skills/slow-query-triage.md&lt;/code&gt; file, and measure how often the skill’s suggested index matches historical decisions.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>ai-engineering</category><category>databases</category><category>architecture</category></item><item><title>Top GitHub Breakouts: March 2026 — Agent Adaptation and Production-Scale Vector Search</title><link>https://rajivonai.com/blog/2026-04-22-github-stars-mar-2026/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-04-22-github-stars-mar-2026/</guid><description>The second wave of March 2026 breakouts: an agent that learns from every conversation, a Rust vector index that outperforms FAISS at a fraction of the memory, and a Kubernetes-native agent control plane.</description><pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The production gap in AI deployment — where prototype agents drift over time, vector stores demand too much memory to run locally, and Kubernetes-based agent orchestration requires custom controllers — found three specific answers in March 2026’s second wave of breakout open-source releases.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Teams that have shipped AI prototypes are confronting infrastructure problems that prototypes hide. Agents that work well in demos drift as task scope changes, but the retraining cycles that would correct them are slow and require GPU clusters. A vector store for a 10-million-document corpus needs about 31 GB of RAM in float32, pushing teams toward managed services even when data residency or latency requirements argue against them. Running multiple agent runtimes on Kubernetes requires custom controllers and governance policies that most teams haven’t built. March’s second set of high-starred releases addresses each of these three gaps with a different mechanism.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Manual bottleneck&lt;/th&gt;&lt;th&gt;What it costs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Scheduled retraining cycles to update agent behavior after feedback&lt;/td&gt;&lt;td&gt;Days to weeks between feedback collection and updated agent behavior&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Scripting LoRA fine-tuning pipelines for agent skill improvement&lt;/td&gt;&lt;td&gt;GPU cluster required even for small-scale model adaptation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;Float32 embeddings require 31 GB RAM for a 10M-document FAISS index&lt;/td&gt;&lt;td&gt;Memory cost blocks local or VPC-isolated RAG deployments&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform engineering&lt;/td&gt;&lt;td&gt;Multiple agent runtimes on Kubernetes with separate credential stores and resource quotas&lt;/td&gt;&lt;td&gt;No shared governance layer; security policies enforced inconsistently across runtimes&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Can purpose-built tooling eliminate the manual infrastructure work that separates AI prototypes from production deployments?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[production AI infrastructure gaps] --&gt; B[System Design]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[Platform Engineering]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; D[Databases]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; E[MetaClaw]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[ClawManager]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; G[turbovec]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; H[conversation-driven skill evolution]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; I[K8s-native agent governance]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; J[10M docs at 4 GB — faster than FAISS]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;metaclaw--eliminating-gpu-cluster-requirements-for-agent-adaptation&quot;&gt;MetaClaw — eliminating GPU cluster requirements for agent adaptation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Improving an agent’s behavior after collecting feedback currently requires a scheduled LoRA fine-tuning run, a GPU cluster, and a multi-day cycle between feedback and deployed change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces or accelerates that task&lt;/strong&gt;: According to the project README and technical report (arXiv:2603.17187), MetaClaw runs two learning pathways from every conversation: a skills layer that extracts reusable behaviors immediately after each session, and a scheduled RL training loop (Tinker) that applies LoRA updates without requiring a GPU on the local machine. According to the README changelog, v0.4.1 (April 2026) added incremental memory ingestion that extracts and persists conversation turns every N turns (default 5) instead of only at session end, reducing the mid-session memory blackout window.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;metaclaw&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; setup&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;              # one-time configuration wizard&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;metaclaw&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; start&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;              # auto mode: skills + scheduled RL training&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;metaclaw&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; start&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --mode&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; skills_only&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  # skills only, no RL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
In auto mode, MetaClaw extracts skills from each session and schedules RL training in the background. The &lt;code&gt;skills_only&lt;/code&gt; mode runs adaptation without model updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: The “no GPU required” claim in the README refers to the local machine running the agent — the RL training step (Tinker) runs on scheduled remote compute. Teams with fully air-gapped environments need to evaluate whether Tinker’s compute requirements fit their constraints. The project is in active development (v0.4.1 as of April 2026); RL pipeline behavior may change between releases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;turbovec--eliminating-memory-constraints-in-local-vector-search&quot;&gt;turbovec — eliminating memory constraints in local vector search&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: A RAG deployment over 10 million documents requires either a managed vector service or ~31 GB of RAM for float32 embeddings, adding operational overhead or data-residency constraints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces or accelerates that task&lt;/strong&gt;: According to the project README, turbovec implements Google Research’s TurboQuant algorithm (arXiv:2504.19874) — a data-oblivious quantizer that matches the Shannon lower bound on distortion with zero codebook training. The stated result is that a 10-million-document corpus fits in 4 GB instead of 31 GB, and search runs faster than FAISS IndexPQFastScan by 12–20% on ARM hardware. No training data, no calibration pass, and no managed service are required.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;pip install turbovec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; turbovec &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TurboQuantIndex&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TurboQuantIndex(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;dim&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1536&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;bit_width&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;4&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index.add(vectors)                        &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# no codebook training required&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;scores, indices &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; index.search(query, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;k&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index.write(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;my_index.tq&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)               &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# persist to disk&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
For hybrid retrieval with SQL or BM25 pre-filtering:
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; turbovec &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; IdMapIndex&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; IdMapIndex(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;dim&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1536&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;bit_width&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;4&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.add_with_ids(vectors, ids)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Stage 1: external system narrows the candidate set (db is assumed to be a DB-API connection, e.g. sqlite3)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;allowed &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; [row[&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;] &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;for&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; row &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;in&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; db.execute(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;SELECT id FROM docs WHERE updated &gt; ?&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, [cutoff])]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Stage 2: vector search restricted to the allowed ids&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;scores, ids &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx.search(query, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;k&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;allowed_ids&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;allowed)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: TurboQuant quantization introduces approximation. Teams with precision-sensitive requirements (medical, legal) should benchmark recall at their target bit width before switching from float32 FAISS. The 12–20% speed advantage over FAISS IndexPQFastScan is documented for ARM (NEON); x86 results are described in the README as “match-or-beat,” not a guaranteed improvement.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;clawmanager--eliminating-custom-kubernetes-controllers-for-agent-orchestration&quot;&gt;ClawManager — eliminating custom Kubernetes controllers for agent orchestration&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Running multiple AI agent runtimes on Kubernetes currently requires custom controllers, separate credential stores per runtime, and manually enforced governance policies across teams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces or accelerates that task&lt;/strong&gt;: According to the project README, ClawManager is a Kubernetes-native control plane built in Go with a React 19 dashboard. It provides a shared AI Gateway for governed model access across all runtimes (token quotas, model routing, RBAC), a Team Workspace layer for multi-agent collaboration using a shared Redis bus and storage, and a unified Agent Control Plane that provisions, registers, and manages instances across OpenClaw and Hermes runtimes without requiring a separate controller per runtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;: Deploy ClawManager to a Kubernetes cluster, connect agent runtimes via the Agent Control Plane, and configure the AI Gateway — governance policies (token limits, model routing, access control) apply uniformly to all registered runtimes from that point forward. The README changelog notes Hermes runtime integration was added in April 2026.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: ClawManager is built around OpenClaw and Hermes runtimes. Teams using other agent frameworks will not benefit from the runtime integration without additional adapter work. The Team Workspace layer is still an early feature rather than a production-hardened collaboration substrate.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The documented pattern for vector memory (turbovec)&lt;/strong&gt;: A flat float32 index, such as FAISS &lt;code&gt;IndexFlatL2&lt;/code&gt;, stores every vector uncompressed, so memory grows linearly with corpus size: 10 million 768-dimensional vectors at 4 bytes per dimension need about 31 GB. The documented way to reduce this is product quantization (PQ), but traditional PQ needs a training pass over sample data to build its codebooks. TurboQuant’s approach replaces that data-dependent calibration with a data-oblivious rotation (Fast Walsh-Hadamard Transform), so the memory reduction is fixed by the chosen bit width and no training pass is required.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The documented pattern for remote fine-tuning (MetaClaw)&lt;/strong&gt;: The standard behavior for parameter-efficient fine-tuning (PEFT) using LoRA involves freezing base model weights and training rank-decomposition matrices on a GPU cluster. By decoupling inference (local) from the RL update loop (remote), architectures like MetaClaw follow the established pattern of asynchronous gradient updates, avoiding local VRAM exhaustion while still allowing the agent to pull updated LoRA adapters on schedule.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The documented pattern for multi-agent governance (ClawManager)&lt;/strong&gt;: On Kubernetes, isolated agent runtimes behave like shadow IT if they manage their own LLM API keys. The documented pattern for governance, seen in platforms like Cloudflare AI Gateway or Kong, is to force all outbound inference requests through a centralized proxy. ClawManager enforces this by registering an Envoy-like gateway as a Kubernetes mutating webhook, so that pods routed through the gateway cannot bypass token quotas or RBAC policies; a runtime that calls models directly remains a governance gap.&lt;/li&gt;
&lt;/ul&gt;
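&lt;p&gt;The memory figures in the turbovec item above are plain arithmetic. A minimal sketch, assuming codes are packed at exactly the stated number of bits per dimension and ignoring any per-vector overhead an index adds; &lt;code&gt;index_bytes&lt;/code&gt; is an illustrative helper, not part of turbovec or FAISS:&lt;/p&gt;

```python
def index_bytes(num_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw storage for num_vectors vectors of dim dimensions.

    Assumes each dimension is stored in exactly bits_per_dim bits and
    ignores per-vector overhead such as ids or norms.
    """
    total_bits = num_vectors * dim * bits_per_dim
    return total_bits // 8


GB = 10**9  # decimal gigabytes, matching the ~31 GB figure in the text

float32_bytes = index_bytes(10_000_000, 768, 32)  # flat float32 index
four_bit_bytes = index_bytes(10_000_000, 768, 4)  # 4-bit quantized codes

print(f"float32: {float32_bytes / GB:.2f} GB")
print(f"4-bit:   {four_bit_bytes / GB:.2f} GB")
print(f"ratio:   {float32_bytes // four_bit_bytes}x")
```

&lt;p&gt;At 1536 dimensions, as in the &lt;code&gt;IdMapIndex&lt;/code&gt; example earlier, both figures double while the ratio stays the same.&lt;/p&gt;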
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;MetaClaw RL loop accumulates wrong skills&lt;/td&gt;&lt;td&gt;Low-quality feedback sessions contaminate the training set&lt;/td&gt;&lt;td&gt;Implement session quality scoring before feeding sessions into the RL loop&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;turbovec recall degrades at low bit width&lt;/td&gt;&lt;td&gt;&lt;code&gt;bit_width=4&lt;/code&gt; loses precision for dense or high-dimensional embedding spaces&lt;/td&gt;&lt;td&gt;Benchmark recall at target bit width against float32 baseline before migrating&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ClawManager governance gap&lt;/td&gt;&lt;td&gt;Agent runtime bypasses the AI Gateway&lt;/td&gt;&lt;td&gt;Route all model calls through the Gateway before deploying non-integrated runtimes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MetaClaw and turbovec used together&lt;/td&gt;&lt;td&gt;MetaClaw’s evolving skills change the embedding distribution over time&lt;/td&gt;&lt;td&gt;Re-index turbovec periodically to align with the current embedding model’s output space&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ClawManager Team Workspace at scale&lt;/td&gt;&lt;td&gt;Redis bus becomes a bottleneck under high agent message volume&lt;/td&gt;&lt;td&gt;Benchmark bus throughput early; plan for Redis Cluster before agent count reaches dozens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ClawManager with non-OpenClaw runtimes&lt;/td&gt;&lt;td&gt;Framework-specific provisioning steps not implemented&lt;/td&gt;&lt;td&gt;Build a ClawManager adapter or wait for official integration support&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
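&lt;p&gt;The recall benchmark recommended in the table can be sketched without any particular index library. This assumes you already have, for each query, the top-k ids from an exact float32 search and from the quantized index; &lt;code&gt;recall_at_k&lt;/code&gt;, &lt;code&gt;mean_recall&lt;/code&gt;, and the sample ids are hypothetical:&lt;/p&gt;

```python
from typing import Sequence


def recall_at_k(exact_ids: Sequence[int], approx_ids: Sequence[int], k: int) -> float:
    """Fraction of the exact top-k neighbours that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


def mean_recall(exact: Sequence[Sequence[int]], approx: Sequence[Sequence[int]], k: int) -> float:
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(e, a, k) for e, a in zip(exact, approx)]
    return sum(scores) / len(scores)


# Made-up results for two queries, for illustration only.
exact_results = [[1, 2, 3, 4], [10, 11, 12, 13]]
approx_results = [[1, 2, 4, 9], [10, 11, 12, 13]]

print(mean_recall(exact_results, approx_results, k=4))
```

&lt;p&gt;Run the same comparison at each candidate bit width and pick the lowest one whose mean recall still meets your requirement.&lt;/p&gt;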
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Agent behavior drifts without retraining infrastructure, vector memory is too expensive to keep local, and Kubernetes agent deployments lack shared governance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use MetaClaw for conversation-driven agent adaptation without a GPU cluster, turbovec for memory-efficient local vector search, and ClawManager for governed Kubernetes-native agent orchestration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After &lt;code&gt;pip install turbovec&lt;/code&gt; and indexing an existing embedding corpus, compare RAM usage to the float32 baseline. The documented 31 GB → 4 GB reduction applies to 10 million 768-dimensional vectors at 4 bits per dimension; a roughly 8x drop on your own corpus is the first signal that quantization is working at the expected compression ratio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;pip install turbovec&lt;/code&gt; and index your existing embedding corpus this week; compare memory footprint, search latency, and recall against your current FAISS baseline before committing to a migration.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>ai-engineering</category><category>databases</category><category>architecture</category></item><item><title>SQL Server to PostgreSQL Migration Cost Defense Checklist</title><link>https://rajivonai.com/blog/2026-04-16-sql-server-to-postgresql-migration-checklist/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-04-16-sql-server-to-postgresql-migration-checklist/</guid><description>A pragmatic checklist to defend the business case for migrating away from Microsoft SQL Server.</description><pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Migrating off SQL Server is rarely a technical decision—it is a financial defense mechanism against escalating licensing audits.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Microsoft’s transition from core-based perpetual licensing to subscription models, combined with aggressive Software Assurance renewals, is forcing engineering leaders to justify their SQL Server footprint.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Proposing a migration to PostgreSQL is easy; executing it is hard. The business case often falls apart because the one-time engineering cost to rewrite T-SQL stored procedures exceeds the 3-year license savings. How do you build a defensible migration strategy that CFOs will approve and engineers can actually deliver?&lt;/p&gt;
&lt;h2 id=&quot;the-migration-defense-checklist&quot;&gt;The Migration Defense Checklist&lt;/h2&gt;
&lt;h3 id=&quot;1-the-licensing-baseline&quot;&gt;1. The Licensing Baseline&lt;/h3&gt;
&lt;ul class=&quot;contains-task-list&quot;&gt;
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Calculate current annual SQL Server Enterprise/Standard costs.&lt;/li&gt;
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Factor in the upcoming Software Assurance renewal increase (typically 10-15%).&lt;/li&gt;
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Audit Azure Hybrid Benefit eligibility—if you are moving to Azure, staying on SQL Server might actually be cheaper in the short term.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;2-the-technical-assessment&quot;&gt;2. The Technical Assessment&lt;/h3&gt;
&lt;ul class=&quot;contains-task-list&quot;&gt;
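&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Run AWS SCT to produce a conversion assessment report. (Microsoft’s Data Migration Assistant targets SQL Server and Azure SQL, so it does not assess a move to PostgreSQL.)&lt;/li&gt;
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Identify all instances of &lt;code&gt;CROSS APPLY&lt;/code&gt;, &lt;code&gt;MERGE&lt;/code&gt;, and CLR integrations. &lt;code&gt;CROSS APPLY&lt;/code&gt; maps to &lt;code&gt;LATERAL&lt;/code&gt; joins, &lt;code&gt;MERGE&lt;/code&gt; exists only in PostgreSQL 15 and later and differs from the T-SQL form, and CLR code has no direct equivalent, so each needs a manual rewrite or review.&lt;/li&gt;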
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Quantify the reliance on SQL Server Agent jobs (these must be migrated to &lt;code&gt;pg_cron&lt;/code&gt; or external orchestrators like Airflow).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;3-the-refactoring-estimate&quot;&gt;3. The Refactoring Estimate&lt;/h3&gt;
&lt;ul class=&quot;contains-task-list&quot;&gt;
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Categorize databases into Tier 1 (Heavy T-SQL/Legacy) and Tier 2 (Simple CRUD/ORM-driven).&lt;/li&gt;
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Estimate engineering months required to migrate Tier 2 databases.&lt;/li&gt;
&lt;li class=&quot;task-list-item&quot;&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Exclude Tier 1 databases from the initial business case—migrating them first will kill the project’s momentum.&lt;/li&gt;
&lt;/ul&gt;
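&lt;p&gt;The Tier 2 estimate feeds a simple payback check against the license savings from section 1. A minimal sketch; the function name and every number in it are hypothetical placeholders to replace with your own baseline:&lt;/p&gt;

```python
def payback_months(engineer_months: float, cost_per_engineer_month: float,
                   annual_license_savings: float) -> float:
    """Months of license savings needed to repay the one-time migration cost."""
    if not annual_license_savings > 0:
        raise ValueError("annual_license_savings must be positive")
    one_time_cost = engineer_months * cost_per_engineer_month
    monthly_savings = annual_license_savings / 12
    return one_time_cost / monthly_savings


# Hypothetical Tier 2 wave: 18 engineer-months at 20,000 each,
# retiring licenses that cost 240,000 per year.
months = payback_months(18, 20_000, 240_000)
three_year_net = 3 * 240_000 - 18 * 20_000

print(f"payback: {months:.0f} months, 3-year net: {three_year_net:,}")
```

&lt;p&gt;If the payback period runs past the 3-year window used in the business case, the wave contains too much Tier 1 work.&lt;/p&gt;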
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is to focus on avoiding future licensing purchases rather than replacing deeply entrenched legacy systems immediately. Target new microservices and simple, high-read databases for the first wave of PostgreSQL adoption.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Area&lt;/th&gt;&lt;th&gt;Risk&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;ORM Compatibility&lt;/td&gt;&lt;td&gt;The SQL Server provider for Entity Framework (EF) generates SQL Server-specific queries. Switching the EF provider to PostgreSQL often exposes subtle behavioral differences in case sensitivity and transaction handling.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Linked Servers&lt;/td&gt;&lt;td&gt;SQL Server estates often rely on Linked Servers for cross-database queries. PostgreSQL uses Foreign Data Wrappers (FDW), which have different performance profiles for large joins.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: SQL Server migrations stall because the technical debt of T-SQL outweighs license savings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use this checklist to target low-complexity databases first and build momentum.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Phased migrations (Tier 2 first) show a faster ROI and build team muscle memory for PostgreSQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Try our &lt;a href=&quot;https://rajivonai.com/tools/migration-readiness&quot;&gt;Open-Source DB Migration Readiness&lt;/a&gt; tool to score your schema compatibility.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>checklist</category><category>databases</category></item><item><title>Oracle Cloud BYOL: True Cost Analysis Beyond the Headline Rate</title><link>https://rajivonai.com/blog/2026-03-25-oracle-cloud-byol-true-cost/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-03-25-oracle-cloud-byol-true-cost/</guid><description>Understanding the financial nuances, OCPU conversions, and hidden costs of bringing your Oracle licenses to OCI.</description><pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Oracle Cloud Infrastructure (OCI) advertises the most aggressive pricing for Oracle Database workloads, but the true cost relies heavily on your existing contract structure.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;An enterprise wants to migrate their on-premises Oracle Exadata workloads to the cloud. They are comparing AWS RDS for Oracle against Oracle Cloud Infrastructure (OCI) Exadata Database Service.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;OCI’s headline compute rates are significantly lower than AWS, and Oracle’s licensing policies heavily favor OCI (where one Processor License covers two OCPUs, compared to AWS, where it covers only two hyper-threaded vCPUs, the equivalent of one OCPU). However, the Bring Your Own License (BYOL) math on OCI is complex, factoring in un-allocated support costs and mandatory cloud management fees. How do you calculate the actual TCO?&lt;/p&gt;
&lt;h2 id=&quot;the-oci-byol-reality&quot;&gt;The OCI BYOL Reality&lt;/h2&gt;
&lt;p&gt;When you bring your licenses to OCI via BYOL, you stop paying for the “License Included” markup, but you continue to pay your annual on-premises support bill.
Furthermore, OCI PaaS offerings (like Base Database Service or Exadata Cloud Service) require you to pay a baseline OCPU rate that covers the cloud automation, backup infrastructure, and management plane.&lt;/p&gt;
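&lt;p&gt;The two cost components described above can be put side by side. A minimal sketch; the function names are illustrative and the rates are placeholders, not Oracle list prices:&lt;/p&gt;

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_byol_cost(ocpus: int, byol_rate_per_ocpu_hour: float,
                     annual_support_bill: float) -> float:
    """BYOL: pay the baseline OCPU rate and keep paying on-premises support."""
    return ocpus * byol_rate_per_ocpu_hour * HOURS_PER_YEAR + annual_support_bill


def annual_license_included_cost(ocpus: int, li_rate_per_ocpu_hour: float) -> float:
    """License Included: the hourly rate already contains the license."""
    return ocpus * li_rate_per_ocpu_hour * HOURS_PER_YEAR


# Placeholder inputs for illustration only.
byol = annual_byol_cost(ocpus=16, byol_rate_per_ocpu_hour=0.25, annual_support_bill=80_000)
included = annual_license_included_cost(ocpus=16, li_rate_per_ocpu_hour=1.00)

print(f"BYOL:             {byol:,.0f}")
print(f"License Included: {included:,.0f}")
```

&lt;p&gt;The support bill belongs in the BYOL column; leaving it out makes BYOL look cheaper than it is.&lt;/p&gt;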
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is that OCI provides the lowest TCO for workloads that &lt;em&gt;must&lt;/em&gt; remain on Oracle (due to deep PL/SQL dependencies or vendor application requirements). By leveraging BYOL on OCI, customers avoid the “Authorized Cloud Environment” core-factor penalties that Oracle applies to AWS and Azure.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Tradeoff&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;ULA Expiration&lt;/td&gt;&lt;td&gt;If your Unlimited License Agreement (ULA) is expiring, declaring your usage and moving to OCI BYOL requires strict audit compliance. If you over-provision OCPUs in the cloud, you will trigger a massive true-up bill.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multi-Cloud Networking&lt;/td&gt;&lt;td&gt;If the rest of your application stack lives in AWS, moving the database to OCI introduces latency and egress costs. You must factor in the cost of private connectivity between the two clouds, such as OCI FastConnect paired with AWS Direct Connect.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Comparing Oracle database costs across AWS and OCI is apples-to-oranges due to licensing penalties.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Model the exact core counts using Oracle’s Cloud Licensing Policy document.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: OCI BYOL consistently models cheaper for heavy Oracle workloads, provided egress and latency constraints are managed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Request a Cloud Database Cost Review to build a custom multi-cloud ROI model for your Exadata footprint.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category></item><item><title>Oracle to Aurora PostgreSQL: License Cost Elimination in Practice</title><link>https://rajivonai.com/blog/2026-03-11-aurora-postgresql-migration-cost-savings/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-03-11-aurora-postgresql-migration-cost-savings/</guid><description>The engineering reality and ROI of migrating from Oracle to Amazon Aurora PostgreSQL.</description><pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Eliminating commercial database licensing is the holy grail of cloud cost optimization, but the migration path is heavily guarded by proprietary PL/SQL.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;A platform team is mandated by the CFO to exit their Oracle Enterprise Agreement due to a 20% year-over-year increase in support and maintenance costs.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;They decide to migrate to Amazon Aurora PostgreSQL. While tools like the AWS Schema Conversion Tool (SCT) and Database Migration Service (DMS) handle the raw table structures and data movement, they fail on complex stored procedures, hierarchical queries (&lt;code&gt;CONNECT BY&lt;/code&gt;), and Oracle-specific XML processing. How do you accurately model the ROI when the migration requires thousands of hours of manual rewrite?&lt;/p&gt;
&lt;h2 id=&quot;the-migration-investment-framework&quot;&gt;The Migration Investment Framework&lt;/h2&gt;
&lt;p&gt;To calculate the true ROI of an Oracle exit, you must factor in the migration cost.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Assessment&lt;/strong&gt;: Run SCT to generate an automated conversion report. Identify the “red” items (manual rewrite required).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Estimation&lt;/strong&gt;: Assign an engineering hour cost to every manual rewrite item.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modeling&lt;/strong&gt;: Compare the 5-year TCO of staying on Oracle (including annual support increases) against the Aurora compute cost plus the one-time migration engineering cost.&lt;/li&gt;
&lt;/ol&gt;
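&lt;p&gt;The three steps above reduce to a small model. A minimal sketch, assuming support grows by a fixed percentage each year and the rewrite is a one-time cost; the function names and all inputs except the 20% increase from the Situation section are hypothetical:&lt;/p&gt;

```python
def oracle_tco(annual_support: float, yearly_increase: float, years: int) -> float:
    """Total support spend when the bill grows by yearly_increase every year."""
    return sum(annual_support * (1 + yearly_increase) ** year for year in range(years))


def aurora_tco(annual_compute: float, rewrite_hours: float,
               hourly_rate: float, years: int) -> float:
    """Aurora compute over the period plus the one-time manual rewrite cost."""
    return annual_compute * years + rewrite_hours * hourly_rate


# Hypothetical inputs: 1,000,000 support bill growing 20% a year,
# 300,000 a year of Aurora compute, 4,000 rewrite hours at 150 per hour.
stay = oracle_tco(1_000_000, 0.20, 5)
move = aurora_tco(300_000, 4_000, 150, 5)

print(f"stay on Oracle: {stay:,.0f}")
print(f"move to Aurora: {move:,.0f}")
```

&lt;p&gt;The rewrite hours come from the “red” items in the SCT report, so the model is only as good as the estimate attached to each of them.&lt;/p&gt;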
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern for successful Oracle exits involves establishing a “strangler fig” architecture. Rather than a massive big-bang cutover, teams replicate data to Aurora using DMS, point read-only workloads to PostgreSQL first, and slowly refactor the write-path APIs away from PL/SQL into the application layer.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Phase&lt;/th&gt;&lt;th&gt;Tradeoff&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Schema Conversion&lt;/td&gt;&lt;td&gt;SCT is optimistic. It will claim 95% automated conversion, but the remaining 5% of code often contains the core business logic.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Performance Tuning&lt;/td&gt;&lt;td&gt;Aurora PostgreSQL handles concurrency differently than Oracle RAC. Queries that were fast on Oracle may require significant index tuning or architectural changes (like removing sequence bottlenecks) on PostgreSQL.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Oracle licensing costs are unsustainable, but migration engineering costs are opaque.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Execute a strict schema assessment and build a 5-year TCO model that includes manual refactoring time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Organizations that treat the migration as an application refactoring project (moving logic out of the database) achieve a faster ROI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Model your break-even point using our &lt;a href=&quot;https://rajivonai.com/tools/oracle-migration-savings-calculator/&quot;&gt;Oracle to PostgreSQL Migration Savings Calculator&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>AWS RDS Oracle and SQL Server: The License Cost Nobody Talks About</title><link>https://rajivonai.com/blog/2026-03-04-aws-rds-oracle-sql-server-license-cost/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-03-04-aws-rds-oracle-sql-server-license-cost/</guid><description>Why the default License-Included model on AWS RDS is a financial trap for enterprise database workloads.</description><pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The ease of provisioning a commercial database on AWS RDS masks a massive premium that compounds hourly.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Teams migrating quickly to the cloud often use AWS RDS for their existing Oracle or SQL Server workloads. During the provisioning wizard, they accept the default “License Included” pricing model to avoid the bureaucratic hassle of license procurement.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;“License Included” pricing bundles the compute cost with the software license cost. However, AWS applies a significant markup. For SQL Server Enterprise, the license component of the RDS hourly rate can exceed the cost of the underlying EC2 compute by 3x to 5x. For Oracle, RDS sells only Standard Edition 2 as License Included; Enterprise Edition is BYOL-only.&lt;/p&gt;
&lt;h2 id=&quot;the-bring-your-own-license-byol-alternative&quot;&gt;The Bring Your Own License (BYOL) Alternative&lt;/h2&gt;
&lt;p&gt;AWS offers a BYOL model, but it comes with stringent requirements. For Oracle, you must adhere to Oracle’s cloud licensing policy, under which the on-premises core factor table does not apply. For SQL Server, standard RDS is License Included only, so bringing your own licenses means running on EC2, and Microsoft’s licensing terms often require EC2 Dedicated Hosts to fully realize the value of your Software Assurance.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;A documented pattern among enterprise migrations is that running commercial engines on RDS License Included is financially unsustainable at scale. Organizations that perform a licensing audit before migration often discover they can leverage existing Enterprise Agreements via BYOL, cutting their RDS spend drastically.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Strategy&lt;/th&gt;&lt;th&gt;Tradeoff&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;EC2 Dedicated Hosts&lt;/td&gt;&lt;td&gt;Reduces SQL Server licensing costs but shifts the burden of high availability, patching, and backups back to your DBA team, eliminating the benefits of RDS.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Oracle Core Factor&lt;/td&gt;&lt;td&gt;Oracle does not recognize AWS hyper-threading as equivalent to physical cores, meaning you often need to purchase twice as many licenses to cover the same vCPU footprint.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
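&lt;p&gt;The “twice as many licenses” effect in the Oracle row can be worked through. A sketch under stated assumptions: a 0.5 core factor for on-premises x86 cores, and two hyper-threaded vCPUs counted as one Processor license on AWS with no core factor applied. Both rules and the function names are assumptions to check against Oracle’s current licensing documents before relying on the result:&lt;/p&gt;

```python
import math


def licenses_on_prem(physical_cores: int, core_factor: float = 0.5) -> int:
    """On-premises: physical cores multiplied by the core factor, rounded up."""
    return math.ceil(physical_cores * core_factor)


def licenses_on_aws(vcpus: int, hyper_threading: bool = True) -> int:
    """AWS (assumed rule): 2 vCPUs per license with hyper-threading, else 1 vCPU per license."""
    return math.ceil(vcpus / 2) if hyper_threading else vcpus


# A 16-core on-premises server exposes 32 hardware threads,
# which corresponds to a 32-vCPU instance on AWS.
on_prem = licenses_on_prem(16)
aws = licenses_on_aws(32)

print(f"on-prem: {on_prem} licenses, AWS: {aws} licenses")
```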
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: RDS License Included pricing is punitively expensive for enterprise databases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Audit existing licenses and evaluate BYOL on RDS or EC2 Dedicated Hosts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: BYOL architectures routinely save 40-50% on AWS commercial database bills.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Compare your potential savings using our &lt;a href=&quot;https://rajivonai.com/tools/sql-server-license-calculator/&quot;&gt;SQL Server Cloud Licensing Calculator&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>failures</category></item><item><title>Azure Hybrid Benefit for SQL Server: The Exact Math</title><link>https://rajivonai.com/blog/2026-02-25-azure-hybrid-benefit-database-guide/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-02-25-azure-hybrid-benefit-database-guide/</guid><description>A deep dive into the cost savings and mechanics of applying Azure Hybrid Benefit to SQL Server deployments.</description><pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Defaulting to License-Included pricing on Azure means you might be paying twice for SQL Server licenses you already own.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Companies migrating from on-premises datacenters to Azure often carry large Enterprise Agreements with active Software Assurance (SA) for SQL Server.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Cloud migration teams frequently provision Azure SQL Database or Managed Instances using the default “License-Included” tier. This ignores existing on-premises licenses, resulting in massive and unnecessary OPEX. How do you accurately model the break-even math for Azure Hybrid Benefit (AHB)?&lt;/p&gt;
&lt;h2 id=&quot;the-mechanics-of-ahb&quot;&gt;The Mechanics of AHB&lt;/h2&gt;
&lt;p&gt;Azure Hybrid Benefit allows you to use your existing SQL Server licenses with active SA to pay a reduced “base rate” (compute-only) for SQL Server on Azure VMs, Azure SQL Database, and Azure SQL Managed Instance.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern for AHB adoption involves auditing your SA inventory, converting older DTU-based databases to the vCore model (which supports AHB), and applying the licenses. One Enterprise Edition core license typically covers four General Purpose vCores or one Business Critical vCore.&lt;/p&gt;
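&lt;p&gt;The coverage ratios above translate directly into a license count. A minimal sketch using only the Enterprise Edition ratios stated in the paragraph; the helper name and the tier keys are illustrative, and Standard Edition ratios are deliberately left out:&lt;/p&gt;

```python
import math

# vCores covered by one Enterprise Edition core license, per service tier.
VCORES_PER_EE_CORE = {"general_purpose": 4, "business_critical": 1}


def ee_core_licenses_needed(vcores: int, tier: str) -> int:
    """Enterprise Edition core licenses with active SA needed to cover vcores."""
    return math.ceil(vcores / VCORES_PER_EE_CORE[tier])


print(ee_core_licenses_needed(16, "general_purpose"))
print(ee_core_licenses_needed(16, "business_critical"))
```

&lt;p&gt;The same 16 vCores consume four times as many licenses on Business Critical, which is why the tier choice belongs in the SA inventory audit.&lt;/p&gt;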
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Tradeoff&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;New SA Purchase&lt;/td&gt;&lt;td&gt;Buying new SA solely to use AHB requires factoring the upfront cost against the annualized savings. Break-even is usually 7-10 months.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;DTU Model&lt;/td&gt;&lt;td&gt;Legacy DTU-based Azure SQL databases do not support AHB. You must migrate to the vCore model first.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Paying retail license rates on Azure despite owning SQL Server SA.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Convert to vCore models and apply Azure Hybrid Benefit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: AHB can meaningfully reduce SQL Server costs; Microsoft cites up to roughly 55% for qualifying configurations, but realized savings vary — model your own EA and workload rather than assuming a fixed percentage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Try our &lt;a href=&quot;https://rajivonai.com/tools/sql-server-license-calculator/&quot;&gt;SQL Server Cloud Licensing Calculator&lt;/a&gt; to compare your License-Included costs against AHB modeled costs. Request a Cloud Database Cost Review if you need help navigating your EA.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>Azure Synapse Cost Optimization: DWU Right-Sizing, Serverless, and Hybrid Benefit</title><link>https://rajivonai.com/blog/2026-02-18-azure-synapse-cost-optimization/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-02-18-azure-synapse-cost-optimization/</guid><description>How to reduce your Azure Synapse compute bill by right-sizing dedicated pools and offloading to serverless.</description><pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Many data warehouse deployments are oversized for their 95th percentile workload, silently burning budget on idle compute capacity.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Data engineering teams often provision Azure Synapse dedicated SQL pools to handle peak quarter-end load, but leave them running at that size 24/7.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Synapse dedicated pools charge by the Data Warehouse Unit (DWU) hour. When ad-hoc analyst queries compete with SLA-bound ETL jobs on the same oversized pool, costs spiral. How do you optimize Synapse performance without paying for idle DWUs?&lt;/p&gt;
&lt;h2 id=&quot;synapse-optimization-strategy&quot;&gt;Synapse Optimization Strategy&lt;/h2&gt;
&lt;p&gt;Cost reduction in Synapse relies on three primary levers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;DWU Right-Sizing&lt;/strong&gt;: Audit peak vs provisioned DWU. Most pools are 4-10x oversized.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Serverless Offload&lt;/strong&gt;: Move ad-hoc and exploratory queries to Synapse Serverless SQL pools, where you pay per TB scanned, not per hour.&lt;/li&gt;
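&lt;li&gt;&lt;strong&gt;Auto-Pause Schedules&lt;/strong&gt;: Pause non-prod pools during nights and weekends. Dedicated SQL pools have no built-in auto-pause, so schedule pause and resume through automation such as Azure Automation runbooks or pipelines.&lt;/li&gt;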
&lt;/ol&gt;
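&lt;p&gt;The combined effect of the three levers can be estimated with simple arithmetic. A minimal sketch for a single non-prod pool; the function names are illustrative, both prices are hypothetical placeholders rather than Azure list prices, and storage charges (which continue while a pool is paused) are ignored:&lt;/p&gt;

```python
def dedicated_pool_cost(dwu: int, price_per_100_dwu_hour: float, hours_running: float) -> float:
    """Dedicated pool compute: billed per DWU-hour while the pool is running."""
    return dwu / 100 * price_per_100_dwu_hour * hours_running


def serverless_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Serverless SQL: billed per TB of data scanned."""
    return tb_scanned * price_per_tb


PRICE_PER_100_DWU_HOUR = 1.50  # hypothetical rate
PRICE_PER_TB = 5.00            # hypothetical rate

# Before: a DW1000c pool left running all month (about 730 hours).
before = dedicated_pool_cost(1000, PRICE_PER_100_DWU_HOUR, 730)

# After: a DW500c pool running 12 hours on 22 weekdays,
# with ad-hoc queries scanning 20 TB on serverless.
after = (dedicated_pool_cost(500, PRICE_PER_100_DWU_HOUR, 12 * 22)
         + serverless_cost(20, PRICE_PER_TB))

print(f"before: {before:,.0f}  after: {after:,.0f}")
```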
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is to isolate ETL workloads on dedicated pools (right-sized for the specific data integration window) while pointing BI tools and analysts to serverless endpoints. Additionally, applying Azure Hybrid Benefit to the underlying SQL Server licenses (if available) can significantly reduce the baseline compute cost.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Optimization&lt;/th&gt;&lt;th&gt;Tradeoff&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Serverless SQL&lt;/td&gt;&lt;td&gt;Unoptimized queries without partition pruning can scan massive amounts of data, leading to unexpected per-TB charges.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Auto-Pause&lt;/td&gt;&lt;td&gt;Resuming a paused pool takes time and clears the cache, potentially causing the first queries to run slower.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Synapse dedicated pools are expensive when left running at peak capacity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Right-size DWUs, offload ad-hoc queries to serverless, and pause non-prod environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Dedicated pool compute is billed per DWU-hour, so halving either the provisioned DWUs or the hours a pool stays online halves that part of the bill.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Use our &lt;a href=&quot;https://rajivonai.com/tools/azure-synapse-cost-calculator/&quot;&gt;Azure Synapse Cost Optimizer&lt;/a&gt; to estimate your monthly savings. Request a Cloud Database Cost Review for a deeper analysis.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>Database Runbooks as Agent Contracts</title><link>https://rajivonai.com/blog/2026-01-30-database-runbooks-as-agent-contracts/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-01-30-database-runbooks-as-agent-contracts/</guid><description>A reference operating model for turning human database runbooks into machine-usable agent contracts.</description><pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A runbook that depends on human intuition is not ready for an agent.&lt;/strong&gt; Most database runbooks were written for experienced operators. They say check replication lag, inspect locks, validate backup health, or apply the standard rollback. A human knows which command to use, which output is suspicious, and when to stop.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The pattern matters for database, cloud, and platform teams because agents do not operate in a vacuum. They inherit repository rules, tool permissions, deployment workflows, incident history, and the quality of the evidence available to them.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Operating layer&lt;/th&gt;&lt;th&gt;Default approach&lt;/th&gt;&lt;th&gt;Better alternative&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Context&lt;/td&gt;&lt;td&gt;Rely on a long prompt or chat history&lt;/td&gt;&lt;td&gt;Give the agent task-specific evidence and rules&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tooling&lt;/td&gt;&lt;td&gt;Expose broad tools and inspect later&lt;/td&gt;&lt;td&gt;Expose narrow tools with clear approval boundaries&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Verification&lt;/td&gt;&lt;td&gt;Read the final answer&lt;/td&gt;&lt;td&gt;Check the artifact, trace, and final state&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Agents need the missing contract. Without exact inputs, commands, expected outputs, thresholds, and stop conditions, the agent fills gaps with inference. That is not acceptable for production databases.&lt;/p&gt;
&lt;p&gt;The practical question is not whether an agent can produce a convincing response. The question is whether the engineering system around that response makes the work observable, reversible, and reviewable.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Weak boundary&lt;/td&gt;&lt;td&gt;Agent authority is broader than the task&lt;/td&gt;&lt;td&gt;A diagnostic run can become an unsafe change&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Missing evidence&lt;/td&gt;&lt;td&gt;The agent cannot cite the state it used&lt;/td&gt;&lt;td&gt;Review becomes opinion instead of verification&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No lifecycle&lt;/td&gt;&lt;td&gt;The workflow ends at a message&lt;/td&gt;&lt;td&gt;Ownership, audit, cleanup, and rollback disappear&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;runbook-contract-architecture&quot;&gt;Runbook Contract Architecture&lt;/h2&gt;
&lt;p&gt;Convert each runbook into a contract with five parts: trigger, allowed tools, required observations, decision table, and completion proof.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[task request — bounded intent] --&gt; B[runbook contract architecture — controls]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C[tool execution — evidence collected]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D[verification — final state checked]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E[human handoff — audit retained]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define the operating boundary.&lt;/strong&gt;&lt;br&gt;
Write down the task class, allowed tools, environment, data class, and approval mode before the agent runs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Shape the evidence.&lt;/strong&gt;&lt;br&gt;
Return compact observations instead of raw dumps. The agent should see enough to reason, but not so much that context is wasted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Require proof of completion.&lt;/strong&gt;&lt;br&gt;
Completion should be an artifact or state check: a passing test, a reviewed plan, a valid rollback, a trace, or a linked ticket.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each operational workflow, define what the agent may read, what it may draft, what requires approval, and which evidence must be attached to the final answer.&lt;/p&gt;
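&lt;p&gt;One way to make the five contract parts checkable is to give them a fixed shape. The sketch below is an assumption about how such a contract might be represented: the &lt;code&gt;RunbookContract&lt;/code&gt; class, its field names, and the &lt;code&gt;run_readonly_sql&lt;/code&gt; tool name are hypothetical, chosen to mirror the parts listed above.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookContract:
    """Hypothetical shape for a runbook contract; field names mirror the five parts."""
    trigger: str                      # condition that starts the runbook
    allowed_tools: tuple              # tools the agent may call, nothing else
    required_observations: tuple      # evidence that must be collected
    decision_table: dict              # observation outcome -> next action
    completion_proof: str             # artifact or state check that ends the run

    def missing_parts(self) -> list:
        """Return the names of contract parts that were left empty."""
        return [name for name, value in vars(self).items() if not value]


replication_lag = RunbookContract(
    trigger="replica lag alert fires",
    allowed_tools=("run_readonly_sql",),
    required_observations=("replica_lag_seconds", "replication_state"),
    decision_table={"lag_within_threshold": "close", "lag_over_threshold": "escalate"},
    completion_proof="",  # deliberately left empty to show the check
)

print(replication_lag.missing_parts())  # ['completion_proof']
```

&lt;p&gt;A check like &lt;code&gt;missing_parts&lt;/code&gt; can run in CI so that a runbook with an empty part is rejected before an agent is ever pointed at it.&lt;/p&gt;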
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Context: OpenAI’s Codex loop shows that tool outputs become future prompt context. A runbook therefore shapes not only the current action but the next reasoning step. Source: &lt;a href=&quot;https://openai.com/index/unrolling-the-codex-agent-loop/&quot;&gt;OpenAI, Unrolling the Codex agent loop&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Action: For each operational workflow, define what the agent may read, what it may draft, what requires approval, and which evidence must be attached to the final answer.&lt;/p&gt;
&lt;p&gt;Result: A contract runbook can be tested in an eval harness against historical incidents before it is used in production.&lt;/p&gt;
&lt;p&gt;Learning: Convert each runbook into a contract with five parts: trigger, allowed tools, required observations, decision table, and completion proof.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Ambiguous command&lt;/td&gt;&lt;td&gt;Runbook says check lag without naming query&lt;/td&gt;&lt;td&gt;Provide exact SQL or script&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Hidden threshold&lt;/td&gt;&lt;td&gt;Only humans know what value is bad&lt;/td&gt;&lt;td&gt;Write thresholds and escalation rules&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No abort path&lt;/td&gt;&lt;td&gt;Agent continues after unexpected output&lt;/td&gt;&lt;td&gt;Define stop conditions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No completion proof&lt;/td&gt;&lt;td&gt;Agent summarizes instead of verifying&lt;/td&gt;&lt;td&gt;Require evidence artifact and owner handoff&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
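&lt;p&gt;The fixes for hidden thresholds and missing abort paths can be sketched as an explicit decision function. The threshold values and action strings here are assumed for illustration; real values belong in the runbook itself.&lt;/p&gt;

```python
from typing import Optional

# Assumed thresholds for illustration; real values belong in the runbook itself.
WARN_LAG_SECONDS = 30
ESCALATE_LAG_SECONDS = 300


def decide(lag_seconds: Optional[float]) -> str:
    """Map one lag observation to an action, stopping on anything unexpected."""
    # Stop condition: missing or negative output is not something to reason around.
    if lag_seconds is None or 0 > lag_seconds:
        return "abort: unexpected output, hand off to owner"
    if lag_seconds >= ESCALATE_LAG_SECONDS:
        return "escalate: page on-call with observation attached"
    if lag_seconds >= WARN_LAG_SECONDS:
        return "recheck: observe again before acting"
    return "ok: record observation as completion proof"


for sample in (4.0, 45.0, 900.0, None):
    print(sample, "->", decide(sample))
```

&lt;p&gt;Writing the table as code forces every branch, including the abort branch, to exist before the agent runs, rather than being inferred at execution time.&lt;/p&gt;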
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Agents need the missing contract. Without exact inputs, commands, expected outputs, thresholds, and stop conditions, the agent fills gaps with inference. That is not acceptable for production databases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Convert each runbook into a contract with five parts: trigger, allowed tools, required observations, decision table, and completion proof.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A contract runbook can be tested in an eval harness against historical incidents before it is used in production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Pick the replication-lag runbook and rewrite it as trigger, inputs, commands, thresholds, abort conditions, and proof of completion.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The teams that get value from agents will not be the teams with the longest prompts. They will be the teams that turn agent work into a controlled engineering workflow.&lt;/p&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>architecture</category><category>checklist</category></item><item><title>Repo-Embedded Skills for Database Teams</title><link>https://rajivonai.com/blog/2026-01-23-repo-embedded-skills-for-database-teams/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-01-23-repo-embedded-skills-for-database-teams/</guid><description>Why database teams should store agent instructions, runbook contracts, and review policies in the repository instead of in memory.</description><pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If the rule matters during review, it belongs in the repository where the agent can read it.&lt;/strong&gt; Database teams carry a lot of implicit knowledge: which tables are too large for blocking DDL, which accounts are break-glass only, which dashboards prove a rollout is safe, and which rollback path is acceptable for each schema change.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The pattern matters for database, cloud, and platform teams because agents do not operate in a vacuum. They inherit repository rules, tool permissions, deployment workflows, incident history, and the quality of the evidence available to them.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Operating layer&lt;/th&gt;&lt;th&gt;Default approach&lt;/th&gt;&lt;th&gt;Better alternative&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Context&lt;/td&gt;&lt;td&gt;Rely on a long prompt or chat history&lt;/td&gt;&lt;td&gt;Give the agent task-specific evidence and rules&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tooling&lt;/td&gt;&lt;td&gt;Expose broad tools and inspect later&lt;/td&gt;&lt;td&gt;Expose narrow tools with clear approval boundaries&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Verification&lt;/td&gt;&lt;td&gt;Read the final answer&lt;/td&gt;&lt;td&gt;Check the artifact, trace, and final state&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Implicit knowledge does not survive agent execution. If the agent cannot read the rule, it cannot reliably follow it. Prompting the rule by hand in every session creates drift and makes review impossible.&lt;/p&gt;
&lt;p&gt;The practical question is not whether an agent can produce a convincing response. The question is whether the engineering system around that response makes the work observable, reversible, and reviewable.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Weak boundary&lt;/td&gt;&lt;td&gt;Agent authority is broader than the task&lt;/td&gt;&lt;td&gt;A diagnostic run can become an unsafe change&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Missing evidence&lt;/td&gt;&lt;td&gt;The agent cannot cite the state it used&lt;/td&gt;&lt;td&gt;Review becomes opinion instead of verification&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No lifecycle&lt;/td&gt;&lt;td&gt;The workflow ends at a message&lt;/td&gt;&lt;td&gt;Ownership, audit, cleanup, and rollback disappear&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;repository-skill-backbone&quot;&gt;Repository Skill Backbone&lt;/h2&gt;
&lt;p&gt;Store skills beside the code: migration review rules, incident triage steps, Terraform plan review guidance, test commands, and abort conditions.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[task request — bounded intent] --&gt; B[repository skill backbone — controls]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C[tool execution — evidence collected]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D[verification — final state checked]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E[human handoff — audit retained]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define the operating boundary.&lt;/strong&gt;&lt;br&gt;
Write down the task class, allowed tools, environment, data class, and approval mode before the agent runs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Shape the evidence.&lt;/strong&gt;&lt;br&gt;
Return compact observations instead of raw dumps. The agent should see enough to reason, but not so much that context is wasted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Require proof of completion.&lt;/strong&gt;&lt;br&gt;
Completion should be an artifact or state check: a passing test, a reviewed plan, a valid rollback, a trace, or a linked ticket.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Create a &lt;code&gt;skills&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; layer that tells the agent how this repository works, which scripts are authoritative, and what proof is required before it can claim completion.&lt;/p&gt;
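&lt;p&gt;A repo-local guide is only useful if it stays complete, so it can be linted like any other file. The following is a minimal sketch: the required section names and the example guide text (including the &lt;code&gt;scripts/migrate.sh&lt;/code&gt; path) are hypothetical, not a standard &lt;code&gt;AGENTS.md&lt;/code&gt; layout.&lt;/p&gt;

```python
import re

# Hypothetical section names, taken from the kinds of rules a migration guide needs.
REQUIRED_SECTIONS = (
    "Allowed commands",
    "Rollback requirements",
    "Lock-risk rules",
    "Proof of completion",
)


def missing_sections(guide_text: str) -> list:
    """Return required sections that have no matching markdown heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+[ \t]+(.+)$", guide_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


example_guide = """# Migration guide
## Allowed commands
Use scripts/migrate.sh only.
## Rollback requirements
Every migration ships with a tested down script.
"""

print(missing_sections(example_guide))  # ['Lock-risk rules', 'Proof of completion']
```

&lt;p&gt;Run as a CI step, a check like this turns a missing rule into a failed build instead of a gap the agent fills with inference.&lt;/p&gt;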
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Context: OpenAI’s harness engineering discussion emphasizes repository skills, local scripts, and environment-specific guidance as part of the system around Codex. That makes repo-local instructions part of engineering infrastructure. Source: &lt;a href=&quot;https://openai.com/index/harness-engineering/&quot;&gt;OpenAI, Harness engineering&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Action: Create a &lt;code&gt;skills&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; layer that tells the agent how this repository works, which scripts are authoritative, and what proof is required before it can claim completion.&lt;/p&gt;
&lt;p&gt;Result: When the rule is versioned, every change to the agent operating model can be reviewed like code.&lt;/p&gt;
&lt;p&gt;Learning: Store skills beside the code: migration review rules, incident triage steps, Terraform plan review guidance, test commands, and abort conditions.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Tribal policy&lt;/td&gt;&lt;td&gt;Only senior engineers know the rule&lt;/td&gt;&lt;td&gt;Move rules into repo-local instructions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Stale prompts&lt;/td&gt;&lt;td&gt;Different users paste different guidance&lt;/td&gt;&lt;td&gt;Version shared skills with the code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Script ignorance&lt;/td&gt;&lt;td&gt;Agent invents commands instead of using local scripts&lt;/td&gt;&lt;td&gt;Document canonical scripts and expected outputs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No stop conditions&lt;/td&gt;&lt;td&gt;Agent keeps trying unsafe alternatives&lt;/td&gt;&lt;td&gt;Write explicit abort conditions&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Implicit knowledge does not survive agent execution. If the agent cannot read the rule, it cannot reliably follow it. Prompting the rule by hand in every session creates drift and makes review impossible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Store skills beside the code: migration review rules, incident triage steps, Terraform plan review guidance, test commands, and abort conditions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: When the rule is versioned, every change to the agent operating model can be reviewed like code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Add one repository-local agent guide for migrations: allowed commands, rollback requirements, lock-risk rules, and proof of completion.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The teams that get value from agents will not be the teams with the longest prompts. They will be the teams that turn agent work into a controlled engineering workflow.&lt;/p&gt;</content:encoded><category>ai-engineering</category><category>databases</category><category>architecture</category></item><item><title>Agentic Code Review for Database Repositories</title><link>https://rajivonai.com/blog/2026-01-20-agentic-code-review-for-database-repositories/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-01-20-agentic-code-review-for-database-repositories/</guid><description>Database repositories contain hidden rules human reviewers know: never add a blocking index at peak hours, never widen IAM without owner approval. Agent review surfaces these violations before merge — without displacing the human judgment that set the rules.</description><pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Database code review is no longer just syntax and style; agents can inspect the operational path around the diff.&lt;/strong&gt; A database repository usually contains more than SQL. It has Flyway or Liquibase migrations, Terraform modules, shell scripts, backup jobs, dashboards, and runbooks. Human reviewers know the hidden rules: never add the blocking index in peak hours, never widen IAM without owner approval, never merge a migration without rollback.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The pattern matters for database, cloud, and platform teams because agents do not operate in a vacuum. They inherit repository rules, tool permissions, deployment workflows, incident history, and the quality of the evidence available to them.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Operating layer&lt;/th&gt;&lt;th&gt;Default approach&lt;/th&gt;&lt;th&gt;Better alternative&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Context&lt;/td&gt;&lt;td&gt;Rely on a long prompt or chat history&lt;/td&gt;&lt;td&gt;Give the agent task-specific evidence and rules&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tooling&lt;/td&gt;&lt;td&gt;Expose broad tools and inspect later&lt;/td&gt;&lt;td&gt;Expose narrow tools with clear approval boundaries&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Verification&lt;/td&gt;&lt;td&gt;Read the final answer&lt;/td&gt;&lt;td&gt;Check the artifact, trace, and final state&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Generic linters cannot reason across that repository. They can catch formatting, but not whether a migration conflicts with the rollback playbook or whether a Terraform change breaks the service catalog contract.&lt;/p&gt;
&lt;p&gt;The practical question is not whether an agent can produce a convincing response. The question is whether the engineering system around that response makes the work observable, reversible, and reviewable.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Weak boundary&lt;/td&gt;&lt;td&gt;Agent authority is broader than the task&lt;/td&gt;&lt;td&gt;A diagnostic run can become an unsafe change&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Missing evidence&lt;/td&gt;&lt;td&gt;The agent cannot cite the state it used&lt;/td&gt;&lt;td&gt;Review becomes opinion instead of verification&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No lifecycle&lt;/td&gt;&lt;td&gt;The workflow ends at a message&lt;/td&gt;&lt;td&gt;Ownership, audit, cleanup, and rollback disappear&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;agentic-repository-review&quot;&gt;Agentic Repository Review&lt;/h2&gt;
&lt;p&gt;Give the review agent repository rules, migration policy, operational runbooks, and read-only access to test commands. Its job is to produce review findings with evidence, not to approve the change.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[task request — bounded intent] --&gt; B[agentic repository review — controls]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C[tool execution — evidence collected]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D[verification — final state checked]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E[human handoff — audit retained]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define the operating boundary.&lt;/strong&gt;&lt;br&gt;
Write down the task class, allowed tools, environment, data class, and approval mode before the agent runs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Shape the evidence.&lt;/strong&gt;&lt;br&gt;
Return compact observations instead of raw dumps. The agent should see enough to reason, but not so much that context is wasted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Require proof of completion.&lt;/strong&gt;&lt;br&gt;
Completion should be an artifact or state check: a passing test, a reviewed plan, a valid rollback, a trace, or a linked ticket.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Split review into specialized checks: SQL lock risk, rollback completeness, Terraform blast radius, observability coverage, and deployment sequencing.&lt;/p&gt;
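&lt;p&gt;One of the specialized checks, SQL lock risk, can be sketched as a small function that returns findings with a file and line attached. This is a simplified illustration assuming PostgreSQL migrations, where &lt;code&gt;CREATE INDEX&lt;/code&gt; without &lt;code&gt;CONCURRENTLY&lt;/code&gt; blocks writes to the table. The &lt;code&gt;Finding&lt;/code&gt; class and the file path are hypothetical, and the line-by-line regex will miss statements split across lines.&lt;/p&gt;

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    check: str
    file: str
    line: int
    message: str


# CREATE [UNIQUE] INDEX whose next token is not CONCURRENTLY (PostgreSQL syntax).
BLOCKING_INDEX = re.compile(
    r"\bCREATE\s+(UNIQUE\s+)?INDEX\s+(?!CONCURRENTLY\b)\S", re.IGNORECASE
)


def check_lock_risk(path: str, sql: str) -> list:
    """One specialized check: flag index builds that would block writes."""
    findings = []
    for number, text in enumerate(sql.splitlines(), start=1):
        if BLOCKING_INDEX.search(text):
            findings.append(
                Finding("lock-risk", path, number,
                        "CREATE INDEX without CONCURRENTLY blocks writes")
            )
    return findings


migration = """CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX CONCURRENTLY idx_orders_date ON orders (created_at);
"""

for finding in check_lock_risk("migrations/V42__add_indexes.sql", migration):
    print(f"{finding.file}:{finding.line} [{finding.check}] {finding.message}")
```

&lt;p&gt;Each of the other checks (rollback completeness, Terraform blast radius, observability coverage, deployment sequencing) could follow the same shape: a narrow function that returns cited findings and never edits the change itself.&lt;/p&gt;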
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Context: OpenAI’s public Datadog Codex example frames agent review as system-level review rather than only local code suggestions. That is the right lens for database repositories. Source: &lt;a href=&quot;https://openai.com/index/datadog/&quot;&gt;OpenAI, Datadog uses Codex for system-level code review&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Action: Split review into specialized checks: SQL lock risk, rollback completeness, Terraform blast radius, observability coverage, and deployment sequencing.&lt;/p&gt;
&lt;p&gt;Result: A useful agent review cites the exact file, command, or policy that supports the finding. If it cannot cite evidence, the finding should be downgraded to a question.&lt;/p&gt;
&lt;p&gt;Learning: Give the review agent repository rules, migration policy, operational runbooks, and read-only access to test commands. Its job is to produce review findings with evidence, not to approve the change.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Style-only review&lt;/td&gt;&lt;td&gt;Agent comments on names but misses lock risk&lt;/td&gt;&lt;td&gt;Give it operational policies and migration examples&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Unbounded suggestions&lt;/td&gt;&lt;td&gt;Agent rewrites unrelated code&lt;/td&gt;&lt;td&gt;Require findings first, patches only after approval&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No evidence&lt;/td&gt;&lt;td&gt;Comments are plausible but uncited&lt;/td&gt;&lt;td&gt;Require file path, command output, or policy citation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Human bypass&lt;/td&gt;&lt;td&gt;Agent approval becomes social proof&lt;/td&gt;&lt;td&gt;Keep human owner as final approver&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
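&lt;p&gt;The fix for uncited comments can be enforced mechanically before a review is posted. The sketch below assumes a hypothetical &lt;code&gt;ReviewComment&lt;/code&gt; structure; it simply downgrades any finding without evidence to a question.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewComment:
    message: str
    evidence: Optional[str] = None  # file path, command output, or policy citation
    kind: str = "finding"


def enforce_evidence(comments: list) -> list:
    """Downgrade any uncited finding to a question for the human owner."""
    result = []
    for comment in comments:
        if comment.kind == "finding" and not comment.evidence:
            comment = ReviewComment(comment.message, None, "question")
        result.append(comment)
    return result


draft = [
    ReviewComment("Index build blocks writes", evidence="migrations/V42.sql:1"),
    ReviewComment("Rollback script may be incomplete"),
]

for comment in enforce_evidence(draft):
    print(comment.kind, "-", comment.message)
```

&lt;p&gt;Keeping the downgrade outside the agent means the rule holds even when the model is confident, and the human owner still makes the final call.&lt;/p&gt;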
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Generic linters cannot reason across a database repository. They can catch formatting, but not whether a migration conflicts with the rollback playbook or whether a Terraform change breaks the service catalog contract.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Give the review agent repository rules, migration policy, operational runbooks, and read-only access to test commands. Its job is to produce review findings with evidence, not to approve the change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A useful agent review cites the exact file, command, or policy that supports the finding. If it cannot cite evidence, the finding should be downgraded to a question.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Create a review checklist for one DB repo with five agent checks: lock risk, rollback, deploy order, observability, and Terraform blast radius.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The teams that get value from agents will not be the teams with the longest prompts. They will be the teams that turn agent work into a controlled engineering workflow.&lt;/p&gt;</content:encoded><category>ai-engineering</category><category>databases</category><category>architecture</category></item><item><title>Automated Reliability Across the Stack: Database Backups, Platform Observability, and SQL Quality (November 2025)</title><link>https://rajivonai.com/blog/2025-12-20-database-reliability-observability-sql-nov-2025/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-12-20-database-reliability-observability-sql-nov-2025/</guid><description>Three November 2025 open-source releases eliminate manual work from three engineering reliability tasks — multi-database backup verification, self-hosted log and trace collection, and SQL static analysis in CI pipelines.</description><pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Database teams running production systems still spend significant time on three tasks that should not require human attention: manually verifying that backup restores work before an incident forces the test, triage of logs and traces from platform services, and SQL code review that catches — or misses — the specific patterns that cause production incidents. Three November 2025 open-source releases automate each of these, covering backup verification across seven database engines, self-hosted observability backed by your choice of storage, and SQL static analysis with 272 production-focused rules.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The operational layer around production databases and platform services has a persistent gap: teams implement the primary infrastructure correctly and leave the reliability infrastructure to manual processes. Backup jobs run but restores are tested once at setup and never again. Observability requires either paying Datadog rates or running an ELK stack that needs its own operational attention. SQL quality gates rely on human code review — which scales poorly as schema complexity grows. All three of these gaps have open-source answers now.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Manual bottleneck&lt;/th&gt;&lt;th&gt;What it costs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;Backup pipelines verify checksums but never test actual restores&lt;/td&gt;&lt;td&gt;Teams discover restore failures during incidents, not before&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform engineering&lt;/td&gt;&lt;td&gt;Unified logs, traces, and metrics require a managed service or months of ELK configuration&lt;/td&gt;&lt;td&gt;Observability budgets consume engineering time for setup and maintenance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;SQL quality review relies on code reviewers knowing which patterns — implicit casts, unbounded scans, missing indexes — cause production incidents&lt;/td&gt;&lt;td&gt;Incidents caused by anti-patterns that a static rule would catch at commit time&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;MySQL, PostgreSQL, MongoDB, Redis each require separate backup tools in mixed environments&lt;/td&gt;&lt;td&gt;Four tools, four retention policies, four notification configs, four failure modes to monitor&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Can these three operational gaps be closed with self-hosted open-source tooling that doesn’t require managed service accounts or custom platform engineering?&lt;/p&gt;
&lt;h2 id=&quot;automated-operational-reliability-across-the-engineering-stack&quot;&gt;Automated Operational Reliability Across the Engineering Stack&lt;/h2&gt;
&lt;p&gt;These three tools each eliminate a category of manual operational work:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    OpsTeam[engineering team — operational reliability]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    OpsTeam --&gt; BackupOps[databases — backup restore never verified after initial setup]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    OpsTeam --&gt; ObsOps[platform — logs and traces requiring managed service or ELK overhead]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    OpsTeam --&gt; SQLOps[system design — SQL quality depending on reviewer knowledge]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    BackupOps --&gt; databasement[databasement — multi-DB backup with automated restore verification]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ObsOps --&gt; logtide[logtide — self-hosted observability on TimescaleDB or ClickHouse]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    SQLOps --&gt; slowql[slowql — 272-rule SQL static analyzer in CI pipelines]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    databasement --&gt; Out1[restore failures caught in scheduled runs, not during incidents]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    logtide --&gt; Out2[logs and traces on your infrastructure with sub-100ms query target]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    slowql --&gt; Out3[SQL anti-patterns blocked at merge time, not found in production]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;databasement--multi-database-backup-with-automated-restore-verification&quot;&gt;databasement — Multi-Database Backup with Automated Restore Verification&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The productivity problem it solves:&lt;/strong&gt; Database teams running mixed environments (PostgreSQL for OLTP, MongoDB for documents, Redis for cache) manage a separate backup tool for each engine, and most of those pipelines verify checksums rather than testing an actual restore. databasement manages backups for all of its supported engines from one interface and automates the restore verification step.&lt;/p&gt;
&lt;p&gt;According to the project README, databasement supports MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, MongoDB, SQLite, and Redis from a single web UI. Storage destinations include S3-compatible storage (AWS S3, MinIO, and compatible endpoints), local filesystem, and remote servers via SFTP/FTP. SSH tunnel support allows connecting to databases in private networks through bastion hosts using password or key-based authentication.&lt;/p&gt;
&lt;p&gt;Per the README, retention policies support both simple time-based rotation (in days) and GFS (grandfather-father-son) rotation. Compression options include gzip and zstd (documented as 20-40% better compression), and archives can be AES-256 encrypted. The project also exposes a REST API and an MCP server, so backup scheduling and status queries can be driven from AI agents and CI pipeline automation.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;docker&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; run&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -p&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; 8080:8080&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -v&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /data/databasement:/app/storage&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -e&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; APP_KEY=your-32-char-key&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  davidcrty/databasement:latest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Access at http://localhost:8080&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Add database servers, configure schedules, enable restore verification per backup job&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The cross-server restore feature documented in the README allows restoring from a production backup to a staging instance — enabling RTO testing without touching production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; For databases in the hundreds of gigabytes, full restore verification per backup cycle may not complete within maintenance windows. The README does not publish restore verification timing benchmarks by database engine and size. Teams should measure restore time for their largest databases before scheduling nightly verification — weekly full restore verification with daily backup-only runs is a reasonable starting point for large datasets.&lt;/p&gt;
&lt;h3 id=&quot;logtide--self-hosted-observability-without-the-elk-overhead&quot;&gt;logtide — Self-Hosted Observability Without the ELK Overhead&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The productivity problem it solves:&lt;/strong&gt; Unified collection of logs, traces, and metrics on your own infrastructure has historically meant either paying for Datadog or spending weeks configuring the Elasticsearch + Logstash + Kibana stack and then maintaining it. logtide is a self-hosted observability platform with pluggable storage that can be brought up in Docker in under five minutes.&lt;/p&gt;
&lt;p&gt;According to the project README, logtide (v0.9.4, stable alpha) provides logs, traces, and metrics in a single interface with built-in security detection. The storage backend is configurable: TimescaleDB for standard deployments, ClickHouse for high-volume scenarios, or MongoDB for flexible document storage. The README documents a sub-100ms query performance target, PII masking for GDPR compliance, and a native Sigma Rules engine for real-time threat detection.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;services&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  logtide&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    image&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;logtide/backend:latest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    environment&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      DB_ENGINE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;timescaledb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      DB_HOST&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;timescaledb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    ports&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;4000:4000&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  timescaledb&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    image&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;timescale/timescaledb:latest-pg16&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For platform teams choosing the TimescaleDB backend: observability data becomes queryable with standard SQL tools — the same &lt;code&gt;psql&lt;/code&gt; and query tooling used for application databases applies directly to log and trace data. Teams on ClickHouse for analytics already have the right infrastructure for the high-scale storage option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; logtide is in “stable alpha” per the README. Artifact Hub and Docker Hub listings are published, but the release cadence signals that the project is still under active development. Teams should not move primary production observability off an established system without first evaluating that alpha-stage stability against their own requirements. Writing custom Sigma rules beyond the built-in set also requires familiarity with the Sigma format.&lt;/p&gt;
&lt;h3 id=&quot;slowql--sql-anti-patterns-caught-at-commit-time&quot;&gt;slowql — SQL Anti-Patterns Caught at Commit Time&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The productivity problem it solves:&lt;/strong&gt; SQL code review depends on reviewers knowing which patterns cause production incidents: missing indexes on join columns, implicit type casts that prevent index use, unbounded scans, N+1 query patterns, security vulnerabilities, and compliance violations. slowql encodes 272 rules covering these patterns and runs them offline in any CI pipeline, catching problems before they reach production.&lt;/p&gt;
&lt;p&gt;According to the project README, slowql is a “production-focused offline SQL static analyzer” covering performance, security, reliability, compliance, cost, and code quality categories. It ships as a Python package, Docker image, and VS Code extension. The README describes it as “completely offline” — no SQL leaves the developer’s machine during analysis. It supports CI pipeline integration via standard exit codes and JSON output format.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pip&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; install&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; slowql&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Analyze migration files before merge&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;slowql&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; analyze&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --path&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ./db/migrations/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --rules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; all&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# CI integration — fails on critical violations&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;slowql&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; analyze&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --path&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ./db/migrations/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --format&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; json&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --fail-on&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; critical&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For engineering teams using GitHub Actions or GitLab CI, adding slowql as a blocking check on pull requests catches structural SQL problems the same way a linter catches code style issues — at the point where the cost of fixing them is lowest.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; slowql is a static analyzer: it evaluates SQL text without executing queries against actual data. Performance problems caused by data distribution (a query that is fast on development data but slow on production table sizes) cannot be detected by static analysis. slowql catches structural anti-patterns; it does not replace query plan analysis and runtime monitoring for load-dependent performance problems. Teams should use it to gate structural quality and pair it with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; review for performance-critical queries.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;All descriptions above are grounded in the project READMEs. Items to verify:&lt;/p&gt;
&lt;p&gt;databasement’s cross-server restore is documented in the README feature list. The restore verification implementation — specifically how data integrity is confirmed after restore, not just that the restore process completed without error — should be reviewed in the project documentation before treating it as the primary RTO validation method.&lt;/p&gt;
&lt;p&gt;logtide’s sub-100ms query performance target is stated as a design goal in the README, not a published benchmark across workload types. Teams should benchmark their own event volume and query patterns on the storage backend they intend to run before replacing an existing observability system.&lt;/p&gt;
&lt;p&gt;slowql’s 272-rule count is documented in the project README. Rule coverage breakdown by SQL dialect (PostgreSQL vs. MySQL vs. others) is not detailed in the README summary — teams should verify that rules relevant to their primary database engine are represented before using it as a blocking CI gate.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;databasement restore verification timeout&lt;/td&gt;&lt;td&gt;Databases over 100 GB with narrow maintenance windows&lt;/td&gt;&lt;td&gt;Run weekly full restore verification; use backup-only jobs daily for large databases&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;databasement engine version mismatch&lt;/td&gt;&lt;td&gt;Backup from one major version, restore on another&lt;/td&gt;&lt;td&gt;Pin database engine version in backup configuration; test cross-version restores in staging&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;logtide alpha stability&lt;/td&gt;&lt;td&gt;Breaking configuration changes between 0.9.x releases&lt;/td&gt;&lt;td&gt;Pin to a specific image tag; review the changelog before upgrading&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;slowql false positives&lt;/td&gt;&lt;td&gt;Rules triggering on patterns valid in the team’s SQL dialect&lt;/td&gt;&lt;td&gt;Start with &lt;code&gt;--rules performance,security&lt;/code&gt;; expand to additional categories incrementally&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;slowql runtime gap&lt;/td&gt;&lt;td&gt;Queries fast on dev data but slow on production row counts&lt;/td&gt;&lt;td&gt;Pair slowql with mandatory &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; review for queries touching large tables&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Backup restore is untested until an incident, platform observability requires managed service costs or ELK complexity, and SQL quality depends on reviewer knowledge that doesn’t scale with schema growth.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: databasement for multi-engine backup with automated restore verification, logtide for self-hosted observability backed by TimescaleDB or ClickHouse, slowql for SQL static analysis as a CI pipeline gate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Add &lt;code&gt;slowql analyze --path ./db/migrations --fail-on critical&lt;/code&gt; to your CI pipeline and run it against existing migration history. Count how many files trigger a rule. Any result is a pattern that code review missed and that now has an automated gate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, deploy databasement against your staging environment and run one scheduled backup with cross-server restore verification enabled. The first restore failure you catch before an incident is direct evidence of value for expanding it to production.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>architecture</category></item><item><title>Torn Page Protection Belongs Off the Foreground Path</title><link>https://rajivonai.com/blog/2025-10-25-torn-page-protection-belongs-off-the-foreground-path/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-10-25-torn-page-protection-belongs-off-the-foreground-path/</guid><description>A PostgreSQL kernel experiment shows why moving torn-page protection from WAL to background flush can change write latency.</description><pubDate>Sat, 25 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The expensive part of torn-page protection is not the extra write; it is where the extra write lands: PostgreSQL’s Full Page Write puts the copy on the foreground Write-Ahead Log path, while InnoDB’s Doublewrite Buffer moves the copy into the background flush path.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Database durability still lives below the abstraction line most application engineers prefer to ignore. That works until a write-heavy system hits checkpoint pressure, latency doubles, and the answer is not a missing index but an 8 KB page being protected from a 4 KB failure.&lt;/p&gt;
&lt;p&gt;PostgreSQL protects against torn pages with &lt;strong&gt;Full Page Write (FPW)&lt;/strong&gt;: after each checkpoint, the first modification of a data page writes the entire page image into the &lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt;. MySQL’s InnoDB protects against the same class of failure with a &lt;strong&gt;Doublewrite Buffer (DWB)&lt;/strong&gt;: dirty pages are first written to a dedicated area, synced, then written to their final data-file locations.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Design&lt;/th&gt;&lt;th&gt;Protection copy lives in&lt;/th&gt;&lt;th&gt;Request path impact&lt;/th&gt;&lt;th&gt;Recovery behavior&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL FPW&lt;/td&gt;&lt;td&gt;WAL stream&lt;/td&gt;&lt;td&gt;The first post-checkpoint dirtying of each page expands foreground WAL&lt;/td&gt;&lt;td&gt;Recovery restores the full page image from WAL, then replays later WAL records&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;InnoDB DWB&lt;/td&gt;&lt;td&gt;Doublewrite files&lt;/td&gt;&lt;td&gt;Dirty-page copy is paid by flush machinery, not directly by SQL execution&lt;/td&gt;&lt;td&gt;Recovery repairs torn data pages from the doublewrite copy&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Atomic-write storage&lt;/td&gt;&lt;td&gt;Storage layer&lt;/td&gt;&lt;td&gt;Database may avoid software copy only if the whole stack actually guarantees page atomicity&lt;/td&gt;&lt;td&gt;Recovery depends on the storage contract being true&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;PostgreSQL’s own documentation says &lt;code&gt;full_page_writes&lt;/code&gt; writes the entire disk page to WAL on first modification after checkpoint and warns that turning it off can cause unrecoverable or silent corruption after failure. The MySQL 8.4 manual describes InnoDB’s doublewrite buffer as a storage area written before final data-file placement and notes that the large sequential write usually avoids doubling I/O operations one-for-one. See the PostgreSQL WAL settings documentation and MySQL InnoDB doublewrite documentation for the baseline behavior: &lt;a href=&quot;https://www.postgresql.org/docs/current/runtime-config-wal.html&quot;&gt;PostgreSQL &lt;code&gt;full_page_writes&lt;/code&gt;&lt;/a&gt;, &lt;a href=&quot;https://dev.mysql.com/doc/refman/8.4/en/innodb-doublewrite-buffer.html&quot;&gt;MySQL 8.4 Doublewrite Buffer&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A torn page is not a logical transaction problem. It is a physical write atomicity problem. PostgreSQL pages are normally 8 KB; MySQL InnoDB pages are commonly 16 KB; operating systems and devices often expose smaller practical atomic write units such as 4 KB sectors. If power loss or kernel failure interrupts a database page write, recovery may find a page that is half old and half new.&lt;/p&gt;
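&lt;p&gt;The half-old, half-new outcome can be illustrated with a small sketch. It is a toy model, not PostgreSQL code: it assumes an 8 KB page, a 4 KB atomic write unit, and a hypothetical helper &lt;code&gt;torn_write&lt;/code&gt; that stops after a given number of sectors have reached disk.&lt;/p&gt;

```python
PAGE_SIZE = 8192      # PostgreSQL default page size in bytes
SECTOR_SIZE = 4096    # assumed atomic write unit of the device


def torn_write(old_page, new_page, sectors_written):
    """Return the on-disk image if only the first `sectors_written` sectors landed.

    Hypothetical helper for illustration: sectors that were written hold new
    bytes, the remainder still holds the previous page contents.
    """
    cut = sectors_written * SECTOR_SIZE
    return new_page[:cut] + old_page[cut:]


old_page = bytes([0xAA]) * PAGE_SIZE   # page version already on disk
new_page = bytes([0xBB]) * PAGE_SIZE   # page version being written

# Power is lost after one of the two sectors has been written.
on_disk = torn_write(old_page, new_page, sectors_written=1)

is_old = on_disk == old_page
is_new = on_disk == new_page
print("matches old version:", is_old)
print("matches new version:", is_new)
```

&lt;p&gt;The resulting image matches neither version of the page, which is why recovery needs a known-good copy from somewhere else before it can apply redo records.&lt;/p&gt;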
&lt;p&gt;That matters because PostgreSQL WAL records are usually physiological: they identify a physical page, then describe a logical change inside it. If the page cannot be parsed after a crash, the redo record may not have a sane object to apply to. The PostgreSQL wiki explains the problem directly: recovery needs a readable page with valid structure before logical page changes can be replayed. &lt;a href=&quot;https://wiki.postgresql.org/wiki/Full_page_writes&quot;&gt;PostgreSQL wiki: Full page writes&lt;/a&gt;&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;First dirty page after checkpoint in PostgreSQL 16, 17, or 18&lt;/td&gt;&lt;td&gt;The WAL record may include an 8 KB full page image instead of only the logical change&lt;/td&gt;&lt;td&gt;Write-heavy workloads see WAL volume jump immediately after checkpoint&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;checkpoint_timeout&lt;/code&gt; too low, such as the documented minimum of 30 seconds&lt;/td&gt;&lt;td&gt;Pages become “first dirty after checkpoint” more often&lt;/td&gt;&lt;td&gt;Lower recovery distance increases foreground WAL amplification&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;max_wal_size&lt;/code&gt; too low under write load&lt;/td&gt;&lt;td&gt;PostgreSQL triggers size-driven checkpoints earlier than the time schedule&lt;/td&gt;&lt;td&gt;A workload can enter a loop of checkpoint, FPW surge, WAL growth, checkpoint&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;wal_compression=off&lt;/code&gt; with highly compressible page images&lt;/td&gt;&lt;td&gt;Full page images are stored without compression&lt;/td&gt;&lt;td&gt;The cost lands on WAL bandwidth rather than CPU; enabling compression reduces WAL volume but adds CPU on WAL insert and replay&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data checksums enabled&lt;/td&gt;&lt;td&gt;The first hint-bit-only change to a page after a checkpoint is also WAL-logged as a full page image, which adds WAL pressure&lt;/td&gt;&lt;td&gt;Checksums detect corruption; they do not remove the need for torn-page protection&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Benchmark with &lt;code&gt;full_page_writes=off&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Throughput improves while the system is no longer protected against the same crash class&lt;/td&gt;&lt;td&gt;This is a measurement mode, not a production durability design&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;PostgreSQL checkpoints are started by &lt;code&gt;checkpoint_timeout&lt;/code&gt; or when &lt;code&gt;max_wal_size&lt;/code&gt; is about to be exceeded. That means FPW makes checkpoint frequency a durability-performance coupling: shorter intervals reduce crash-recovery distance but increase the rate at which pages become eligible for full-page images again.&lt;/p&gt;
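&lt;p&gt;A rough back-of-the-envelope model makes the coupling concrete. The sketch below is an upper-bound estimate under stated assumptions, not a measurement: it assumes a fixed hot set of pages that are all modified at least once per checkpoint interval, uncompressed 8 KB images, and it ignores WAL record headers and the space PostgreSQL saves by omitting the unused hole in a page. The function name and the 100,000-page hot set are illustrative.&lt;/p&gt;

```python
PAGE_SIZE = 8192  # bytes, PostgreSQL default block size


def fpi_bytes_per_hour(hot_pages, checkpoint_interval_s):
    """Upper-bound estimate of uncompressed full-page-image bytes per hour.

    Assumes every page in the hot set is modified at least once in every
    checkpoint interval, so each page emits one full page image per cycle.
    """
    checkpoints_per_hour = 3600 / checkpoint_interval_s
    return hot_pages * PAGE_SIZE * checkpoints_per_hour


# Compare a 30 second, 5 minute, and 15 minute checkpoint interval.
for interval in (30, 300, 900):
    gib = fpi_bytes_per_hour(hot_pages=100_000, checkpoint_interval_s=interval) / 2**30
    print(f"checkpoint every {interval:4d}s: about {gib:.1f} GiB of page images per hour")
```

&lt;p&gt;Under these assumptions the page-image volume scales inversely with the checkpoint interval, which is the trade described above: a shorter recovery distance is paid for in foreground WAL volume.&lt;/p&gt;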
&lt;p&gt;The core question is not whether FPW or DWB performs “two writes.” The question is whether the durability copy blocks the foreground commit path, or whether the system can batch it behind dirty-page flushing without weakening crash recovery.&lt;/p&gt;
&lt;h2 id=&quot;move-torn-page-copies-off-the-foreground-path&quot;&gt;Move Torn-Page Copies Off the Foreground Path&lt;/h2&gt;
&lt;p&gt;The right architecture is not “turn off full-page writes and hope the storage behaves.” The right architecture is to separate two responsibilities that FPW intentionally combines: WAL should preserve transaction order, while the torn-page protection copy should be paid by the page-flush path.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    SQL[SQL transaction] --&gt; Buffer[shared buffer page dirtied]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Buffer --&gt; WAL[WAL foreground path — logical record]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Buffer --&gt; Checkpoint[checkpoint boundary]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Checkpoint --&gt; FPW[PostgreSQL FPW — first dirty page image in WAL]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Buffer --&gt; Flusher[background dirty page flusher]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Flusher --&gt; DWB[Doublewrite area — sequential page copies]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DWB --&gt; Sync[fsync doublewrite area]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Sync --&gt; DataFiles[scatter write final data files]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    FPW --&gt; Recovery[crash recovery — restore page then replay WAL]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DataFiles --&gt; Recovery&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DWB --&gt; Recovery&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The important distinction is scheduling. FPW pays the copy at WAL insertion time for the first page modification after checkpoint. DWB pays the copy when dirty pages leave the buffer pool. Both protect against torn pages; they do not put the pressure on the same queue.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Keep WAL responsible for transaction ordering, not page-copy transport.&lt;/p&gt;
&lt;p&gt;In PostgreSQL, WAL must be flushed before dirty data pages reach durable storage. That ordering is non-negotiable. A DWB prototype should not weaken WAL-before-data; it should remove full page images from the normal WAL record path only when the doublewrite mechanism can guarantee a complete repair copy before final page placement.&lt;/p&gt;
&lt;p&gt;Verification: crash after WAL flush but before final data-file write; recovery must replay WAL without reading an unrecoverable torn page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Insert a doublewrite stage into the dirty-page flush path.&lt;/p&gt;
&lt;p&gt;The flush path should write dirty buffers into a sequential doublewrite area, force that area durable, then write the same pages to their final relation files. The doublewrite area needs enough metadata to map page identity back to relation fork and block number after restart.&lt;/p&gt;
&lt;p&gt;Verification: force a partial final data-file page write and confirm restart repairs it from the doublewrite copy before normal redo continues.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Preserve checkpoint semantics explicitly.&lt;/p&gt;
&lt;p&gt;A checkpoint cannot simply assume pages are safe because they were scheduled for writeback. It needs a durable boundary: either the final page reached storage intact, or the doublewrite copy did. Otherwise the checkpoint can advertise a recovery point that depends on a page image which exists only in kernel cache.&lt;/p&gt;
&lt;p&gt;Verification: kill the postmaster during checkpoint completion, restart, and verify that checkpoint redo location never advances past unprotected dirty pages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Measure WAL bytes, data-file bytes, fsync latency, and tail latency separately.&lt;/p&gt;
&lt;p&gt;A DWB design can reduce foreground WAL pressure while increasing background writeback pressure. That is a good trade only if latency-critical SQL stops waiting and the background system does not fall behind. Use &lt;code&gt;pg_current_wal_lsn()&lt;/code&gt; deltas, the &lt;code&gt;wal_fpi&lt;/code&gt; and &lt;code&gt;wal_bytes&lt;/code&gt; counters in &lt;code&gt;pg_stat_wal&lt;/code&gt;, &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; (with checkpoint counters in &lt;code&gt;pg_stat_checkpointer&lt;/code&gt; from PostgreSQL 17), &lt;code&gt;pg_stat_io&lt;/code&gt; in PostgreSQL 16 and later, filesystem writeback metrics, and storage latency histograms.&lt;/p&gt;
&lt;p&gt;Verification: compare p50, p95, and p99 transaction latency across &lt;code&gt;checkpoint_timeout&lt;/code&gt;, &lt;code&gt;max_wal_size&lt;/code&gt;, and &lt;code&gt;shared_buffers&lt;/code&gt;, not only aggregate transactions per second.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Treat AI-assisted kernel work as scaffolding, not proof.&lt;/p&gt;
&lt;p&gt;Zongzhi Chen’s 2026 experiment reported a PostgreSQL prototype where Claude Code helped replace FPW with a DWB-style mechanism, with DWB outperforming FPW in an I/O-bound pgbench workload. That is interesting engineering signal, especially because the patch touches real storage-engine paths. It is not enough to declare the design production-safe. Storage bugs are excellent at passing normal tests and failing only when the machine dies at precisely the wrong time. See the source experiment here: &lt;a href=&quot;https://medium.com/@baotiao/in-2026-can-ai-modify-database-kernel-code-c7c88cb43389&quot;&gt;Zongzhi Chen, 2026&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Verification: run crash-restart loops with forced partial writes, checksum validation, logical consistency checks, and comparisons against a known-good source.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
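&lt;p&gt;To make the WAL-measurement step concrete, here is a minimal sketch for a test instance. It assumes &lt;code&gt;psql&lt;/code&gt; (for &lt;code&gt;\gset&lt;/code&gt;), PostgreSQL 14 or later (for &lt;code&gt;pg_stat_wal&lt;/code&gt;), and a role allowed to run &lt;code&gt;CHECKPOINT&lt;/code&gt;. The variable names are illustrative, and the workload step is a placeholder for whatever fixed write workload you use.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sketch only: run in psql against a test instance, not production.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Start from a fresh checkpoint so first-touch full page images are counted.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CHECKPOINT;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Mark the starting point (psql stores the columns as variables).&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_current_wal_lsn() AS lsn_before,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       wal_fpi              AS fpi_before,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       wal_bytes            AS bytes_before&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_wal \gset&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ... run the same fixed write workload here (placeholder) ...&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- WAL generated since the mark, and the number of full page images written.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :&apos;lsn_before&apos;)) AS wal_since_mark,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       wal_fpi - :fpi_before                     AS full_page_images,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_size_pretty(wal_bytes - :bytes_before) AS wal_bytes_counted&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_wal;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Repeating the measurement with different &lt;code&gt;checkpoint_timeout&lt;/code&gt; or &lt;code&gt;wal_compression&lt;/code&gt; settings shows how much of the WAL volume is checkpoint-coupled. The &lt;code&gt;pg_stat_wal&lt;/code&gt; counters are cumulative and reported with some delay, so treat small deltas as approximate.&lt;/p&gt;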
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented PostgreSQL pattern is that FPW is checkpoint-coupled. The PostgreSQL documentation states that the first modification of a page after a checkpoint writes the full page image to WAL, and that increasing the checkpoint interval parameters can reduce that cost. That is not an implementation footnote; it is the operational reason write latency often worsens right after a checkpoint under write-heavy load.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Documented behavior&lt;/th&gt;&lt;th&gt;Production implication&lt;/th&gt;&lt;th&gt;Validation signal&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;full_page_writes=on&lt;/code&gt; is the default in PostgreSQL and protects against partially completed page writes&lt;/td&gt;&lt;td&gt;Disabling it for throughput changes the crash-safety contract&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW full_page_writes;&lt;/code&gt; must be treated as a durability check, not a tuning curiosity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Full page images occur on first page modification after checkpoint&lt;/td&gt;&lt;td&gt;Checkpoint cadence directly affects WAL amplification&lt;/td&gt;&lt;td&gt;WAL growth should be measured before and after &lt;code&gt;CHECKPOINT&lt;/code&gt; under the same write workload&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;wal_compression&lt;/code&gt; can compress full page images with &lt;code&gt;pglz&lt;/code&gt;, &lt;code&gt;lz4&lt;/code&gt;, or &lt;code&gt;zstd&lt;/code&gt; when compiled in&lt;/td&gt;&lt;td&gt;Compression shifts cost from WAL bandwidth to CPU and replay decompression&lt;/td&gt;&lt;td&gt;Compare WAL bytes and CPU saturation with each compression method&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pg_checksums&lt;/code&gt; can verify checksums offline when checksums are enabled&lt;/td&gt;&lt;td&gt;Checksums detect page corruption; they do not repair missing torn-page protection by themselves&lt;/td&gt;&lt;td&gt;Restart, stop cleanly, run &lt;code&gt;pg_checksums --check&lt;/code&gt; against the cluster&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;InnoDB DWB writes pages to doublewrite files before final placement&lt;/td&gt;&lt;td&gt;InnoDB pays an extra page-copy step outside the user transaction’s immediate WAL insert path&lt;/td&gt;&lt;td&gt;Monitor page cleaner activity, doublewrite files, fsync latency, and data-file 
writeback&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The documented InnoDB pattern is different. The MySQL 8.4 documentation says InnoDB writes flushed buffer-pool pages to doublewrite storage before writing them to their final data files, and crash recovery can use the doublewrite copy if the final page write was interrupted. The same documentation also says data is written twice, but not necessarily at twice the I/O operation cost, because the doublewrite write is a large sequential chunk with a single &lt;code&gt;fsync()&lt;/code&gt; in normal configurations.&lt;/p&gt;
&lt;p&gt;That distinction is the architecture lesson. Equal total bytes do not imply equal user-visible latency. A foreground WAL write competes with commit progress. A background doublewrite stage competes with page flushing, eviction, checkpoint completion, and storage bandwidth. Both queues can saturate; they fail differently.&lt;/p&gt;
&lt;p&gt;The source experiment’s reported pgbench numbers are consistent with this mechanism. In the reported write-only 128-thread result, FPW-on delivered 14,857 transactions per second, while the DWB prototype delivered 33,814 transactions per second. The interesting result is not “DWB is 2.3x faster” as a universal claim. The interesting result is that moving the copy away from foreground WAL changed where the bottleneck surfaced.&lt;/p&gt;
&lt;p&gt;For production builders, the deeper lesson is about validation. A storage-engine change is not proven by a five-minute pgbench run. It needs a crash matrix.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Test class&lt;/th&gt;&lt;th&gt;What it proves&lt;/th&gt;&lt;th&gt;Minimum bar&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Forced partial final-page write&lt;/td&gt;&lt;td&gt;DWB can repair a torn data page&lt;/td&gt;&lt;td&gt;Inject half-page writes and confirm recovery restores the page&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Crash after doublewrite sync before final scatter write&lt;/td&gt;&lt;td&gt;Durable repair copy exists before final placement&lt;/td&gt;&lt;td&gt;Restart must complete without checksum failure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Crash during doublewrite write&lt;/td&gt;&lt;td&gt;Recovery ignores incomplete doublewrite entries&lt;/td&gt;&lt;td&gt;Restart must not restore from a corrupt doublewrite slot&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Checkpoint boundary crash&lt;/td&gt;&lt;td&gt;Recovery point is not advanced beyond protected pages&lt;/td&gt;&lt;td&gt;Repeated kill during checkpoint must preserve logical contents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica and backup interaction&lt;/td&gt;&lt;td&gt;WAL stream remains sufficient for replicas and point-in-time recovery expectations&lt;/td&gt;&lt;td&gt;Physical replica, base backup, and restore tests must pass&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Device diversity&lt;/td&gt;&lt;td&gt;Sequential-write assumptions hold on real storage&lt;/td&gt;&lt;td&gt;Test local NVMe, network-attached block storage, and throttled cloud volumes&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;I have not run this PostgreSQL DWB prototype at scale personally. The documented failure mode is clear anyway: if a DWB design acknowledges a checkpoint or allows final data-file writes before the repair copy is durable, it can create a database that looks faster until the first badly timed crash. That is the least charming kind of benchmark.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Doublewrite area becomes the new bottleneck&lt;/td&gt;&lt;td&gt;High dirty-page churn with &lt;code&gt;shared_buffers&lt;/code&gt; large enough to delay eviction, then sudden checkpoint pressure&lt;/td&gt;&lt;td&gt;Size the doublewrite area for flush bursts; track fsync latency and dirty buffer age&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Recovery restores the wrong page version&lt;/td&gt;&lt;td&gt;Doublewrite metadata does not encode relation identity, fork, block number, and page LSN safely&lt;/td&gt;&lt;td&gt;Treat DWB metadata as recovery-critical; checksum the slot header and page body&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Checkpoint completes too early&lt;/td&gt;&lt;td&gt;Prototype marks pages safe after scheduling writeback instead of after durable doublewrite or durable final write&lt;/td&gt;&lt;td&gt;Checkpoint accounting must wait for a durable protection point&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cloud block storage reorders or stalls writes&lt;/td&gt;&lt;td&gt;Network-attached volumes with variable latency and opaque cache behavior&lt;/td&gt;&lt;td&gt;Test under the actual storage class; do not extrapolate from local NVMe&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;WAL compression already solves enough of the pain&lt;/td&gt;&lt;td&gt;PostgreSQL workload has compressible full page images and CPU headroom&lt;/td&gt;&lt;td&gt;Benchmark &lt;code&gt;wal_compression=zstd&lt;/code&gt; or &lt;code&gt;lz4&lt;/code&gt; before changing storage architecture&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Full-page images help replica recovery behavior&lt;/td&gt;&lt;td&gt;Large working sets where WAL page images reduce random data-page reads during replay&lt;/td&gt;&lt;td&gt;Measure replica replay lag and recovery prefetch behavior, not only primary 
throughput&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;DWB increases write amplification under cold churn&lt;/td&gt;&lt;td&gt;Workload dirties pages once and evicts them without repeated updates&lt;/td&gt;&lt;td&gt;Compare physical bytes written per committed transaction across FPW and DWB&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AI-generated kernel patch misses crash edge cases&lt;/td&gt;&lt;td&gt;Normal regression tests pass because they rarely interrupt I/O at durability boundaries&lt;/td&gt;&lt;td&gt;Add fault injection, checksum validation, crash loops, and page-level corruption tests&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Treating all durability writes as equivalent hides the queue that users actually wait on.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Keep transaction ordering in WAL, but move torn-page repair copies to a durable background flush mechanism when the storage engine can prove the ordering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A credible result is not one pgbench chart; it is lower foreground WAL amplification plus successful crash recovery across forced partial writes and checkpoint-boundary failures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, measure your PostgreSQL WAL growth around &lt;code&gt;CHECKPOINT&lt;/code&gt; with &lt;code&gt;full_page_writes=on&lt;/code&gt;, test &lt;code&gt;wal_compression&lt;/code&gt;, and record p95 commit latency alongside &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; and &lt;code&gt;pg_stat_io&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A storage engine is allowed to be faster only after it has earned the right to crash badly and come back boring.&lt;/p&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>architecture</category></item><item><title>PostgreSQL 18 Replication Upgrade Opportunities</title><link>https://rajivonai.com/blog/2025-04-21-postgresql-18-replication-upgrade-opportunities/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-04-21-postgresql-18-replication-upgrade-opportunities/</guid><description>What changes in replication when upgrading from PostgreSQL 14–16 to PostgreSQL 18: parallel apply, pg_createsubscriber, and surfaced conflict visibility.</description><pubDate>Tue, 07 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;PostgreSQL 18 ships with replication changes that are improvements in normal operation and surprises in the first week after upgrade.&lt;/strong&gt; Parallel logical apply, the &lt;code&gt;pg_createsubscriber --all&lt;/code&gt; utility, and better conflict logging each change the operational model for replication in ways that require preparation — not because they are dangerous, but because they surface behavior that was previously invisible. Planning the upgrade without understanding these changes means discovering them at 2 AM.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This post was originally written during the PostgreSQL 18 beta 1 period. It has been updated to confirm behavior against the final release (September 25, 2025). The &lt;code&gt;conflict_resolution&lt;/code&gt; parameter and &lt;code&gt;pg_createsubscriber --all&lt;/code&gt; behavior described here reflect the GA release.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;p&gt;Upgrading to PostgreSQL 18 introduces critical changes to logical replication that alter default concurrency and conflict visibility. While these represent architectural improvements, they will break applications that assume sequential logical apply and will trigger alerts for previously silent replication conflicts. Engineering leaders must ensure teams audit their current logical replication topology, explicitly test parallel apply ordering assumptions, and tune monitoring to handle the new structured conflict logging before upgrading production environments.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Teams on PostgreSQL 14, 15, or 16 are increasingly evaluating an upgrade to PostgreSQL 18. The database engine improvements — parallel query enhancements, improved statistics, and JSON improvements — are the typical headline justifications. Replication is often assessed as “nothing major changed” until someone runs the upgrade in staging and discovers that the conflict logging they had silenced for years is now surfacing in a new format that breaks their monitoring.&lt;/p&gt;
&lt;p&gt;The three replication areas that actually change in PostgreSQL 18 and require deliberate assessment:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parallel logical apply&lt;/strong&gt; (available since PostgreSQL 16 through the subscription option &lt;code&gt;streaming = parallel&lt;/code&gt;, with &lt;code&gt;max_parallel_apply_workers_per_subscription&lt;/code&gt; defaulting to 2; PostgreSQL 18 makes &lt;code&gt;parallel&lt;/code&gt; the default &lt;code&gt;streaming&lt;/code&gt; mode for newly created subscriptions): the subscriber can hand large in-progress transactions to parallel apply workers instead of waiting for the publisher to commit them. This improves throughput and lag significantly for write-heavy publishers, but changes are now applied by several workers concurrently, so any application that assumes a single sequential apply stream should have that assumption tested before the upgrade.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;pg_createsubscriber --all&lt;/code&gt;&lt;/strong&gt;: &lt;code&gt;pg_createsubscriber&lt;/code&gt;, the command-line utility introduced in PostgreSQL 17 that converts a physical streaming standby into a logical replication subscriber, gains an &lt;code&gt;--all&lt;/code&gt; option in PostgreSQL 18 that sets up subscriptions for every database on the source server in one run. Teams with physical standbys used for read scaling can convert them to logical subscribers without tearing down and rebuilding the standby. This is an opportunity for teams that want subscriber-level table filtering or cross-version replication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Improved conflict logging&lt;/strong&gt;: PostgreSQL 18 surfaces logical replication conflicts with more detail in the server log, including the specific row values involved. Previously, many conflicts were logged at a level that was easy to suppress; now they appear at &lt;code&gt;LOG&lt;/code&gt; or &lt;code&gt;ERROR&lt;/code&gt; level, depending on the conflict type, with structured detail. If you had suppressed replication conflict alerts because the volume was too noisy, PostgreSQL 18 will make them reappear prominently.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The current approach to PostgreSQL major version upgrades often treats replication as a transparent layer that will simply resume functioning once the engine is upgraded. However, this approach breaks when upgrading to PostgreSQL 18 because the default concurrency model for logical replication fundamentally shifts.&lt;/p&gt;
&lt;p&gt;When a team upgrades a logical subscriber to PostgreSQL 18 without preparation, subscriptions created under the new defaults use &lt;code&gt;streaming = parallel&lt;/code&gt; with up to &lt;code&gt;max_parallel_apply_workers_per_subscription = 2&lt;/code&gt; parallel apply workers. If the downstream application relies on strict sequential ordering of independent transactions, for example when building derived state or feeding an event-driven architecture, that assumption has to be validated under parallel apply before it shows up as subtle data anomalies. Concurrently, the new verbose conflict logging can produce large volumes of log entries and alerts for conflicts that were previously ignored, overwhelming observability pipelines.&lt;/p&gt;
&lt;p&gt;How can engineering teams proactively identify and manage these replication changes before they cause data anomalies and alert fatigue in production?&lt;/p&gt;
&lt;h2 id=&quot;upgrade-readiness-framework&quot;&gt;Upgrade Readiness Framework&lt;/h2&gt;
&lt;p&gt;To navigate these changes, teams should follow a structured diagnostic and remediation process.&lt;/p&gt;
&lt;h3 id=&quot;symptoms-and-signals&quot;&gt;Symptoms and Signals&lt;/h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Current replication lag baseline&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_replication.replay_lag&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Establish before upgrade to detect regression&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Existing logical subscriptions&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_subscription&lt;/code&gt; on subscribers&lt;/td&gt;&lt;td&gt;Will be affected by parallel apply default&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replication conflict errors in current logs&lt;/td&gt;&lt;td&gt;&lt;code&gt;postgresql.log&lt;/code&gt; grep for &lt;code&gt;conflict in logical replication&lt;/code&gt;&lt;/td&gt;&lt;td&gt;These will become more visible in PG18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Physical standbys that could become logical&lt;/td&gt;&lt;td&gt;Infrastructure inventory&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_createsubscriber --all&lt;/code&gt; conversion opportunity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Current &lt;code&gt;max_wal_senders&lt;/code&gt; and &lt;code&gt;max_replication_slots&lt;/code&gt; values&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW max_wal_senders; SHOW max_replication_slots;&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Table synchronization adds temporary connections and slots; parallel apply workers draw on the subscriber’s &lt;code&gt;max_logical_replication_workers&lt;/code&gt; pool&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h3 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Identify current replication type and topology&lt;/strong&gt; — establish what you have before planning what changes:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check physical standbys (streaming replication)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; client_addr, application_name, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, sent_lsn, replay_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       replay_lag &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; lag_estimate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check logical subscriptions (run on subscriber)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; subname, subenabled, subconninfo, subpublications&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_subscription;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check logical publishers (run on publisher)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pubname, puballtables, pubinsert, pubupdate, pubdelete&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_publication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This establishes your current topology. Physical standbys and logical subscribers are upgraded differently: physical standbys follow the primary’s upgrade path, whereas logical subscribers can stay on an older version while the publisher moves to PG18, which is one of the benefits of logical replication.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Measure current replication lag baseline&lt;/strong&gt; — capture before upgrade to detect regressions:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On publisher: physical replication lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  client_addr,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  write_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  flush_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  replay_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replay_lag &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NULLS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LAST&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On subscriber: time-based lag for logical replication&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  subname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  received_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_msg_send_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_msg_receipt_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  latest_end_time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_subscription;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Record these baseline values. After the upgrade, the same queries run against the upgraded instance should show stable or improved lag. If lag increases after upgrade, parallel apply worker count or worker connection limits may need tuning.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check for existing logical replication subscriptions&lt;/strong&gt; — these require the most careful upgrade planning:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On subscriber: full subscription inventory&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;subname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;subenabled&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;srrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;srsubstate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_subscription s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_subscription_rel r &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;srsubid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;oid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;subname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;srsubstate&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check current parallel apply setting (PostgreSQL 16+)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW max_parallel_apply_workers_per_subscription;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your subscribers are on PostgreSQL 16 or 17, &lt;code&gt;max_parallel_apply_workers_per_subscription&lt;/code&gt; may already be set. If subscribers are on PostgreSQL 14 or 15, this parameter does not exist yet — it becomes relevant when the subscriber is upgraded to 18.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Audit current conflict handling&lt;/strong&gt; — understand what conflicts are already happening silently:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Search the current PostgreSQL log for existing replication conflicts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;conflict in logical replication&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/log/postgresql/postgresql.log&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Get the distinct conflict types&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;conflict in logical replication&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/log/postgresql/postgresql.log&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -oP&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;conflict on \w+&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; sort&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; uniq&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; sort&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -rn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you find zero conflicts in the log, either your replication is clean or conflicts are being logged at a level you are not capturing. After upgrading to PostgreSQL 18, conflict errors will be more prominently logged. Knowing the baseline before upgrade means you can distinguish “this is a new problem” from “this was always happening.”&lt;/p&gt;
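&lt;p&gt;Log grepping can be complemented with the cumulative statistics views. The following is a minimal sketch, assuming the subscriber runs PostgreSQL 15 or later, where &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt; exists. It counts apply and table-sync errors per subscription, which is a useful baseline even when the log level hides individual conflicts.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber. Counters are cumulative since stats_reset.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT subname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       apply_error_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       sync_error_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       stats_reset&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_subscription_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY apply_error_count DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;PostgreSQL 18 is understood to add per-conflict-type counters to this view. Check &lt;code&gt;\d pg_stat_subscription_stats&lt;/code&gt; on your own build rather than relying on column names from this post.&lt;/p&gt;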
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check &lt;code&gt;max_wal_senders&lt;/code&gt;, &lt;code&gt;max_replication_slots&lt;/code&gt;, and worker-pool headroom&lt;/strong&gt;: parallel apply and table synchronization both consume background workers:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW max_wal_senders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW max_replication_slots;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Current usage&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; active_wal_senders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; active_slots &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_replication_slots &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; active;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
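&lt;p&gt;Parallel apply workers run on the subscriber and are drawn from its &lt;code&gt;max_logical_replication_workers&lt;/code&gt; pool; they receive changes from the subscription’s leader apply worker and do not open their own &lt;code&gt;walsender&lt;/code&gt; connections. On the publisher, each subscription holds one wal sender for its apply worker, plus temporary ones for table synchronization workers during the initial copy. If a subscriber hosts 5 subscriptions with &lt;code&gt;max_parallel_apply_workers_per_subscription = 2&lt;/code&gt;, it needs room for &lt;code&gt;5 * (1 + 2) = 15&lt;/code&gt; logical replication workers, while the publisher needs at least 5 wal senders for logical replication. Ensure &lt;code&gt;max_wal_senders&lt;/code&gt; is sized to accommodate this plus physical standbys.&lt;/p&gt;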
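&lt;p&gt;A minimal sketch that turns the raw counts above into headroom figures on the publisher. It assumes the session can read &lt;code&gt;pg_stat_replication&lt;/code&gt; and &lt;code&gt;pg_replication_slots&lt;/code&gt;; the aliases &lt;code&gt;free_wal_senders&lt;/code&gt; and &lt;code&gt;free_replication_slots&lt;/code&gt; are illustrative names, not PostgreSQL identifiers:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Publisher-side headroom (aliases are illustrative)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT current_setting(&apos;max_wal_senders&apos;)::int&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         - (SELECT count(*) FROM pg_stat_replication) AS free_wal_senders,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       current_setting(&apos;max_replication_slots&apos;)::int&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         - (SELECT count(*) FROM pg_replication_slots) AS free_replication_slots;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The slot count deliberately includes inactive slots, because they still count against &lt;code&gt;max_replication_slots&lt;/code&gt;.&lt;/p&gt;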
&lt;h3 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h3&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Planning PG18 upgrade] --&gt; B{Using logical replication?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| C{Parallel apply already enabled?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|yes — PG16 or 17| D[Test apply ordering assumptions in staging]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|no — PG14 or 15| E[Set max_parallel_apply_workers_per_subscription to 0 initially after upgrade]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; F[Enable incrementally after validation]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no — physical only| G{Physical standbys present?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|yes| H{Convert any to logical?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| I[Test pg_createsubscriber in staging first]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| J[Physical replication — minimal changes in PG18]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; K{Conflict log volume change after upgrade?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|yes — more conflicts visible| L[Review and resolve — do not suppress]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|no| M[Validate lag baseline matches pre-upgrade]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Staged parallel apply enablement&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After upgrading the subscriber to PostgreSQL 18, start with parallel apply disabled, validate behavior, then enable incrementally:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Disable parallel apply immediately after upgrade (server-level setting on the subscriber)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SYSTEM &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; max_parallel_apply_workers_per_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify subscriber is applying correctly with zero parallel workers&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; subname, received_lsn, latest_end_lsn, latest_end_time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_subscription;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After 48 hours of stable operation, enable with 1 worker&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SYSTEM &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; max_parallel_apply_workers_per_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- If stable for another 48 hours, increase to default&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SYSTEM &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; max_parallel_apply_workers_per_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_reload_conf();&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The risk of parallel apply is not data corruption — PostgreSQL ensures causally-related transactions are applied in order. The risk is application code that assumes a specific apply order between causally-independent transactions and uses that assumption to build derived state.&lt;/p&gt;
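&lt;p&gt;To confirm what is actually running at each stage, a sketch like the following counts logical replication workers per subscription. It assumes a PostgreSQL 17 or later subscriber, where &lt;code&gt;pg_stat_subscription&lt;/code&gt; exposes &lt;code&gt;worker_type&lt;/code&gt;; the alias &lt;code&gt;workers&lt;/code&gt; is illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Running logical replication workers per subscription and worker type&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT subname, worker_type, count(*) AS workers&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE pid IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY subname, worker_type&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY subname, worker_type;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parallel apply workers are started on demand, so an idle subscription may show only its &lt;code&gt;apply&lt;/code&gt; row even when parallel apply is enabled.&lt;/p&gt;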
&lt;p&gt;&lt;strong&gt;Option 2 — Convert physical standby with &lt;code&gt;pg_createsubscriber&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PostgreSQL 18 includes &lt;code&gt;pg_createsubscriber&lt;/code&gt; with an &lt;code&gt;--all&lt;/code&gt; flag that converts an existing physical standby to a logical subscriber in one operation:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Stop the standby (required — it cannot be running during conversion)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_ctl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stop&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -D&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/lib/postgresql/standby_data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Convert to logical subscriber&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# (run as postgres user, connecting to publisher)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_createsubscriber&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --pgdata=/var/lib/postgresql/standby_data&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --publisher-server=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;host=publisher port=5432 dbname=mydb&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --all&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Start the converted subscriber&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_ctl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; start&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -D&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/lib/postgresql/standby_data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Verify subscription is running&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;SELECT subname, subenabled FROM pg_subscription;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
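&lt;p&gt;The &lt;code&gt;--all&lt;/code&gt; flag creates one subscription per database on the standby (template databases and databases that do not allow connections are skipped), each fed by a publication equivalent to &lt;code&gt;FOR ALL TABLES&lt;/code&gt;. Subscription, publication, and replication slot names are generated automatically, so &lt;code&gt;--all&lt;/code&gt; cannot be combined with &lt;code&gt;--database&lt;/code&gt;, &lt;code&gt;--publication&lt;/code&gt;, &lt;code&gt;--replication-slot&lt;/code&gt;, or &lt;code&gt;--subscription&lt;/code&gt;. Per the PostgreSQL 18 beta documentation, the standby must be on the same major version as the publisher for the conversion to succeed.&lt;/p&gt;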
&lt;p&gt;This is an opportunity if you have read replicas that are underutilized as physical standbys and would benefit from logical replication’s filtering and cross-version upgrade flexibility.&lt;/p&gt;
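&lt;p&gt;After the converted node is started, a sketch like the one below can confirm that it has left recovery and list the subscriptions created in each database. It assumes a role that can read &lt;code&gt;pg_subscription&lt;/code&gt;; the table aliases &lt;code&gt;s&lt;/code&gt; and &lt;code&gt;d&lt;/code&gt; are illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Expect false: a converted node is no longer a physical standby&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_is_in_recovery();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- pg_subscription is a shared catalog, so one query covers every database&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT d.datname, s.subname, s.subenabled&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_subscription AS s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_database AS d ON d.oid = s.subdbid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY d.datname;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;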
&lt;p&gt;&lt;strong&gt;Option 3 — Conflict monitoring setup for PG18 log format&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PostgreSQL 18 logs replication conflicts with structured detail. Update any log parsing or alerting to match the new format:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# New PG18 conflict log format includes row values:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# ERROR:  conflict detected on relation &quot;public.orders&quot;: conflict=insert_exists&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;#         Key (id)=(12345); existing local tuple (12345, &apos;pending&apos;, ...);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;#         remote tuple (12345, &apos;shipped&apos;, ...); ...&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Update log monitoring to capture conflict type&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -E&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;conflict=(insert_exists|update_missing|delete_missing)&apos;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  /var/log/postgresql/postgresql.log&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  awk&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;{print $NF}&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; sort&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; uniq&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Set up a per-conflict-type count alert in your monitoring tool&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Alert threshold: &gt; 10 conflicts per hour of any type&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;PostgreSQL 18 detects and logs these conflicts and counts them per subscription in &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt;, but it does not resolve them automatically. Resolution still requires manual intervention: fix the conflicting row on the subscriber, or skip the offending transaction with &lt;code&gt;ALTER SUBSCRIPTION ... SKIP (lsn = ...)&lt;/code&gt;.&lt;/p&gt;
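&lt;p&gt;Log parsing can be complemented by the cumulative conflict counters that PostgreSQL 18 adds to &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt;. A minimal sketch, assuming a PostgreSQL 18 subscriber and showing only three of the available conflict columns:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Cumulative apply errors and conflict counts per subscription (PostgreSQL 18)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT subname, apply_error_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       confl_insert_exists, confl_update_missing, confl_delete_missing&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_subscription_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY subname;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the counters are cumulative, an alert should compare successive samples rather than the absolute value.&lt;/p&gt;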
&lt;h3 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h3&gt;
&lt;ol&gt;
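&lt;li&gt;&lt;strong&gt;Parallel apply&lt;/strong&gt;: disable immediately by setting &lt;code&gt;max_parallel_apply_workers_per_subscription = 0&lt;/code&gt; on the subscriber (&lt;code&gt;ALTER SYSTEM&lt;/code&gt; followed by &lt;code&gt;pg_reload_conf()&lt;/code&gt;), or per subscription with &lt;code&gt;ALTER SUBSCRIPTION ... SET (streaming = on)&lt;/code&gt;. No data loss, and reversible at any time.&lt;/li&gt;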
&lt;li&gt;&lt;strong&gt;&lt;code&gt;pg_createsubscriber&lt;/code&gt; conversion&lt;/strong&gt;: not directly reversible — once converted to a logical subscriber, restoring to a physical standby requires rebuilding the standby from the primary with &lt;code&gt;pg_basebackup&lt;/code&gt;. Keep a snapshot of the standby data directory before conversion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL 18 upgrade&lt;/strong&gt;: major version downgrades require restoring from a pre-upgrade backup. The upgrade itself does not change replication topology; the changes are in behavior. Pre-upgrade backup is the only rollback path.&lt;/li&gt;
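&lt;li&gt;&lt;strong&gt;Conflict skipping&lt;/strong&gt;: &lt;code&gt;ALTER SUBSCRIPTION ... SKIP (lsn = ...)&lt;/code&gt; needs no restart, but a transaction that has been skipped cannot be replayed afterwards. Record the skipped LSN and reconcile the affected rows manually; &lt;code&gt;SKIP (lsn = NONE)&lt;/code&gt; clears a pending skip that has not yet been consumed.&lt;/li&gt;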
&lt;/ol&gt;
&lt;h3 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h3&gt;
&lt;p&gt;A pre-upgrade validation script that automates the topology, lag, headroom, and conflict-baseline checks and prints the values to review:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;#!/bin/bash&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# PostgreSQL 18 replication upgrade readiness check&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;PSQL&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;psql -tAc&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;=== Replication Upgrade Readiness Check ===&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Check 1: Replication topology&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;--- Logical subscriptions:&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$PSQL &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;SELECT count(*) FROM pg_subscription WHERE subenabled;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Check 2: Current lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;--- Max replay lag (physical):&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$PSQL &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;SELECT max(replay_lag) FROM pg_stat_replication;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Check 3: Parallel apply headroom&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;MAX_WS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$($PSQL &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;SHOW max_wal_senders;&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ACTIVE_WS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$($PSQL &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;SELECT count(*) FROM pg_stat_replication;&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SUB_COUNT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$($PSQL &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;SELECT count(*) FROM pg_subscription;&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;NEEDED_WS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$((&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;ACTIVE_WS&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; SUB_COUNT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 3&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;))  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# conservative: 3 workers per sub&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;--- max_wal_senders: &lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$MAX_WS&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;, current active: &lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$ACTIVE_WS&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;, needed with parallel: &lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$NEEDED_WS&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Check 4: Existing conflict count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;--- Conflict count in last 7 days of logs:&quot;&lt;/span&gt;&lt;/span&gt;
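&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;conflict in logical replication&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/log/postgresql/postgresql.log&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; 2&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;/dev/null&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; [ &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;$?&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; -eq&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ]&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;log file not readable&quot;&lt;/span&gt;&lt;/span&gt;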
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;=== Done ===&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run this against production before the upgrade window and again 24 hours after the upgrade to confirm stable behavior.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is that PostgreSQL 18 changes the default concurrency of logical replication apply. Parallel apply itself was introduced in PostgreSQL 16 as the &lt;code&gt;streaming = parallel&lt;/code&gt; subscription option; the PostgreSQL 18 release notes change the &lt;code&gt;CREATE SUBSCRIPTION&lt;/code&gt; default for &lt;code&gt;streaming&lt;/code&gt; from &lt;code&gt;off&lt;/code&gt; to &lt;code&gt;parallel&lt;/code&gt;, with the worker count capped by &lt;code&gt;max_parallel_apply_workers_per_subscription&lt;/code&gt; (default 2). The parallel apply documentation explicitly notes that causally-related transactions (transactions where one depends on the other’s committed state) are always applied in order, but independent concurrent transactions may be applied in a different order than they were committed on the publisher.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;pg_createsubscriber&lt;/code&gt; utility was introduced in PostgreSQL 17 and is extended in PostgreSQL 18 with the &lt;code&gt;--all&lt;/code&gt; flag. The documented behavior is that it stops WAL recovery on the standby, promotes it to standalone, creates the necessary publication on the publisher, and sets up the logical subscription — all in one operation. The beta documentation notes that the standby must have been a synchronous or asynchronous physical standby that was fully caught up at the time of conversion.&lt;/p&gt;
&lt;h2 id=&quot;tradeoff-matrix&quot;&gt;Tradeoff Matrix&lt;/h2&gt;
&lt;p&gt;Three distinct upgrade paths. Each is appropriate for a different team posture — the wrong choice for your application topology creates the failure modes in the table below.&lt;/p&gt;

































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Upgrade path&lt;/th&gt;&lt;th&gt;Sequential apply guarantee&lt;/th&gt;&lt;th&gt;Ops complexity&lt;/th&gt;&lt;th&gt;Standby topology change&lt;/th&gt;&lt;th&gt;When to choose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Disable parallel apply&lt;/strong&gt;: set &lt;code&gt;max_parallel_apply_workers_per_subscription = 0&lt;/code&gt; after upgrade&lt;/td&gt;&lt;td&gt;Preserved fully&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;Any application with causal ordering assumptions; start here for every upgrade&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Enable parallel apply incrementally&lt;/strong&gt;: 0 → 1 → 2 workers over 96 hours&lt;/td&gt;&lt;td&gt;Relaxed for causally-independent txns only&lt;/td&gt;&lt;td&gt;Medium (requires apply-order audit)&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;Event-driven consumers that tolerate out-of-order independent writes; high-write publishers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Convert standby to logical&lt;/strong&gt;: run &lt;code&gt;pg_createsubscriber --all&lt;/code&gt;&lt;/td&gt;&lt;td&gt;N/A (logical replication model)&lt;/td&gt;&lt;td&gt;High (topology change, irreversible without rebuild)&lt;/td&gt;&lt;td&gt;Physical standby becomes logical subscriber&lt;/td&gt;&lt;td&gt;Teams needing table-level filtering, cross-version replication, or subscriber-level write access&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Choosing parallel apply without an ordering audit is the highest-risk option — it silently changes the consistency model of your subscriber for any application that reads derived state across independent tables.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application reads stale data from subscriber&lt;/td&gt;&lt;td&gt;Parallel apply changes apply order for independent transactions&lt;/td&gt;&lt;td&gt;Audit application for causal ordering assumptions; add explicit ordering via sequence or timestamp&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Parallel apply workers cannot be launched&lt;/td&gt;&lt;td&gt;Subscriptions × (1 + parallel workers) exceeds &lt;code&gt;max_logical_replication_workers&lt;/code&gt; on the subscriber&lt;/td&gt;&lt;td&gt;Raise &lt;code&gt;max_logical_replication_workers&lt;/code&gt; (and &lt;code&gt;max_worker_processes&lt;/code&gt; if needed) before enabling parallel apply&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Conflict log volume overwhelms monitoring&lt;/td&gt;&lt;td&gt;PG18 surfaces previously-silent conflicts at ERROR level&lt;/td&gt;&lt;td&gt;Triage and resolve conflicts; do not suppress them, because they represent real data divergence&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pg_createsubscriber&lt;/code&gt; fails mid-conversion&lt;/td&gt;&lt;td&gt;Standby still active or primary unreachable during conversion&lt;/td&gt;&lt;td&gt;Stop standby completely before running; verify publisher connectivity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Conflicting transactions skipped as routine&lt;/td&gt;&lt;td&gt;&lt;code&gt;ALTER SUBSCRIPTION ... SKIP&lt;/code&gt; discards the whole remote transaction, so the subscriber diverges permanently&lt;/td&gt;&lt;td&gt;Skip only after inspecting the logged conflict; investigate and fix the root cause&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: PostgreSQL 18 enables parallel logical apply by default and surfaces replication conflicts at a higher log level — both are improvements that can cause operational surprises if not prepared for before the upgrade.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code&gt;max_parallel_apply_workers_per_subscription = 0&lt;/code&gt; immediately after upgrading logical replication subscribers, validate behavior, then enable incrementally after confirming application ordering assumptions hold.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After upgrade, replication lag should match or improve versus the pre-upgrade baseline, and &lt;code&gt;pg_stat_subscription.received_lsn&lt;/code&gt; should advance continuously.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run the five pre-upgrade checks against your production database this week. Record baseline lag values and conflict log counts so you have a comparison point for post-upgrade validation.&lt;/li&gt;
&lt;/ul&gt;
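&lt;p&gt;The baseline and proof values mentioned above can be captured with a sketch like the following, run once before the upgrade and again afterwards. The aliases &lt;code&gt;replay_lag_bytes&lt;/code&gt; and &lt;code&gt;since_last_report&lt;/code&gt; are illustrative, and the first query is meant for the publisher while the second is meant for the subscriber:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the publisher: bytes each replica still has to replay&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT application_name, state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the subscriber: time since each running worker last reported progress&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT subname, received_lsn, latest_end_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - latest_end_time AS since_last_report&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE pid IS NOT NULL;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;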
&lt;h3 id=&quot;checklist&quot;&gt;Checklist&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Identify replication topology — physical standbys, logical subscribers, or both&lt;/li&gt;
&lt;li&gt;Record baseline replication lag from &lt;code&gt;pg_stat_replication&lt;/code&gt; and &lt;code&gt;pg_stat_subscription&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Check current &lt;code&gt;max_wal_senders&lt;/code&gt; — calculate headroom with parallel apply workers added&lt;/li&gt;
&lt;li&gt;Count existing replication conflicts in current logs — establish baseline before upgrade&lt;/li&gt;
&lt;li&gt;Check for logical subscriptions on PostgreSQL 14 or 15 — plan subscriber upgrade path&lt;/li&gt;
&lt;li&gt;Test upgrade procedure in staging with production data volume — including parallel apply enabled&lt;/li&gt;
&lt;li&gt;After upgrade: immediately set &lt;code&gt;max_parallel_apply_workers_per_subscription = 0&lt;/code&gt; on all subscribers&lt;/li&gt;
&lt;li&gt;Run for 48 hours at zero parallel workers — confirm lag is stable and no new conflicts&lt;/li&gt;
&lt;li&gt;Enable parallel apply with 1 worker — monitor for 48 hours&lt;/li&gt;
&lt;li&gt;Increase to default 2 workers — monitor lag and conflict log for another 48 hours&lt;/li&gt;
&lt;/ol&gt;</content:encoded><category>databases</category><category>architecture</category><category>checklist</category></item><item><title>Top GitHub Breakouts: August 2025 — Part II</title><link>https://rajivonai.com/blog/2025-09-27-github-stars-aug-2025/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-09-27-github-stars-aug-2025/</guid><description>The highest-starred new open-source projects in August 2025 where AI takes over cloud operations, infrastructure provisioning, and production Postgres coding.</description><pubDate>Sat, 27 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The last generation of AI tooling told engineers what was wrong. August 2025’s second wave goes further — cloud agents that provision infrastructure from a description, AI that translates natural language into AWS operations, and an MCP server that teaches coding agents what production Postgres actually looks like. The gap being closed is not information; it is execution.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;AI-assisted operations have followed a familiar arc: first came dashboards, then query-answering chatbots, then recommendation engines. Each layer added latency between the diagnosis and the fix. The bottleneck was always the same: a human in the loop who had to translate the AI’s output into a real action.&lt;/p&gt;
&lt;p&gt;The tools gaining traction in August 2025 skip the translation step. They connect AI models directly to execution paths — a cloud CLI that generates and applies infrastructure plans, an agent that owns the AWS state machine, and a Postgres MCP server that gives coding agents the context they need to generate correct production SQL without a DBA in the loop.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Manual bottleneck&lt;/th&gt;&lt;th&gt;What it costs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Translating a verbal infrastructure description into provider-specific CLI commands&lt;/td&gt;&lt;td&gt;30–60 minutes of lookup, flag-checking, and dry-runs per change&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform engineering&lt;/td&gt;&lt;td&gt;Context-switching between AWS console, Terraform state, and incident context during an outage&lt;/td&gt;&lt;td&gt;Slow incident response; cognitive overhead on the most critical path&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform engineering&lt;/td&gt;&lt;td&gt;Writing Terraform or CloudFormation for each new AWS resource type added to a service&lt;/td&gt;&lt;td&gt;Weeks of IaC work before a new service reaches production&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;Providing AI coding agents with enough Postgres context to generate production-safe SQL&lt;/td&gt;&lt;td&gt;Agents that generate syntactically valid but operationally wrong queries (missing indexes, wrong isolation levels, no error handling)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Can AI tooling take over the execution step without requiring engineers to review every generated action in a separate review cycle?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Human describes intent in plain language] --&gt; B[Cloud infrastructure request]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[AWS provisioning request]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; D[Production Postgres code request]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; E[bgdnvk — Clanker CLI]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[VersusControl — AI Infrastructure Agent]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; G[timescale — Tiger CLI and MCP]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; H[Inspect and generate infra plans]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; I[Natural language to AWS operations]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; J[Context-aware Postgres code generation]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;bgdnvkclanker--cloud-infrastructure-questions-and-plan-generation-from-the-terminal&quot;&gt;bgdnvk/clanker — cloud infrastructure questions and plan generation from the terminal&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Engineers asking “what is deployed in this environment?” have to query multiple AWS/GCP/Cloudflare APIs manually; generating a change plan means writing CLI commands or Terraform from scratch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces that task&lt;/strong&gt;: The README describes Clanker as the CLI powering “the first AI DevOps IDE for agents and humans.” It supports two flows: an inspect flow (“ask questions about your infra”) and a maker/deploy flow (“generate or apply infrastructure and deploy plans”). It connects to your existing AWS CLI profiles — not raw keys — and uses OpenAI, Gemini, or Cohere as the reasoning backend. The ask-questions flow queries live infrastructure state; the maker flow generates plans the engineer can review before applying.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;: Install via Homebrew (&lt;code&gt;brew tap clankercloud/tap &amp;#x26;&amp;#x26; brew install clanker&lt;/code&gt;) or from source. Run &lt;code&gt;clanker config init&lt;/code&gt; to wire in your cloud credentials and AI provider. Then: &lt;code&gt;clanker ask &quot;what EC2 instances are running in production?&quot;&lt;/code&gt; for inspection, or trigger the maker flow to generate a deployment plan from a description. The README notes that AWS CLI v2 is required, because v1 does not support the &lt;code&gt;--no-cli-pager&lt;/code&gt; flag.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: Clanker is in active early development — the README links to docs.clankercloud.ai for full feature coverage, which signals the CLI surface is still shifting. The maker/deploy flow generates plans for review, not autonomous applies; teams expecting zero-touch automation will still have an approval step.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;versuscontrolai-infrastructure-agent--natural-language-to-aws-operations-with-state-tracking&quot;&gt;VersusControl/ai-infrastructure-agent — natural language to AWS operations with state tracking&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Provisioning an EC2 instance with a matching security group requires knowing the specific CLI flags, correct CIDR notation, and order-of-operations across multiple &lt;code&gt;aws&lt;/code&gt; subcommands.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces that task&lt;/strong&gt;: The README describes an agent that translates a natural language request like “Create an EC2 instance for hosting an Apache Server with a dedicated security group that allows inbound HTTP and SSH traffic” into a sequenced set of AWS API calls, while maintaining a Terraform-like state file to track what it has provisioned. It supports OpenAI GPT, Google Gemini, Anthropic Claude, AWS Bedrock Nova, and Ollama as the reasoning layer, and includes a web dashboard with built-in conflict detection and dry-run mode.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;: The agent maintains state and performs conflict detection before executing, which means it can identify when a requested resource would overlap with existing infrastructure. Current resource support per the README: VPC, EC2, security groups, Autoscaling Groups, and ALB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: The README explicitly labels this “a proof-of-concept implementation” that is “not intended for production use.” This is worth taking seriously — the state management approach is described as “Terraform-like” but the codebase is in active development. The honest use case right now is evaluation and learning, not replacing Terraform in a production pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;timescaletiger-cli--mcp-server-that-teaches-ai-coding-agents-production-postgres&quot;&gt;timescale/tiger-cli — MCP server that teaches AI coding agents production Postgres&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: AI coding agents generating SQL or application database code lack the context to know whether their output is operationally safe — correct index usage, right transaction isolation level, appropriate use of connection pooling, error handling patterns for production Postgres.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces that task&lt;/strong&gt;: Tiger CLI is the interface for Timescale’s managed Postgres service (Tiger Cloud), and the README describes a built-in MCP server (&lt;code&gt;tiger mcp install&lt;/code&gt;) designed to give AI assistants the production Postgres context they need. The project description calls this “context engineering” — the MCP server surfaces live schema information, service configuration, and connection parameters so coding agents can generate SQL that matches the actual production environment rather than a generic Postgres assumption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;: Install via &lt;code&gt;curl -fsSL https://cli.tigerdata.com | sh&lt;/code&gt;, authenticate with &lt;code&gt;tiger auth login&lt;/code&gt;, and run &lt;code&gt;tiger mcp install&lt;/code&gt; to register the MCP server with your AI assistant. From that point, the assistant has access to service metadata, connection strings, and schema context. The CLI also handles full service lifecycle: &lt;code&gt;tiger service create&lt;/code&gt;, &lt;code&gt;tiger db connect&lt;/code&gt;, &lt;code&gt;tiger service logs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: Tiger CLI is tightly coupled to Tiger Cloud — the MCP server’s value comes from live access to a managed Timescale instance. Teams running self-hosted Postgres won’t get the same context richness without a separate MCP layer pointed at their own cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The pattern across all three tools is to tie AI execution to existing identity and live operational state. Tiger CLI’s MCP server surfaces live schema information, service configuration, and connection parameters to agents, because an agent working from generic Postgres assumptions cannot know which indexes exist or which isolation level the workload needs. Clanker relies on the user’s existing AWS CLI profiles rather than new API keys, so it inherits existing IAM boundaries. The AI Infrastructure Agent tracks what it has provisioned in a Terraform-like state file, which suggests that natural-language tooling still needs a record of managed resources to reconcile against before it can modify cloud infrastructure safely.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Clanker maker flow generates incorrect plan for multi-region resources&lt;/td&gt;&lt;td&gt;AI model lacks region-specific context in the prompt&lt;/td&gt;&lt;td&gt;Add region and account context explicitly in the request; review plans before applying&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AI Infrastructure Agent state drifts from actual AWS state&lt;/td&gt;&lt;td&gt;Manual changes outside the agent between runs&lt;/td&gt;&lt;td&gt;Treat the agent’s state file as the source of truth; avoid manual console changes on agent-managed resources&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tiger CLI MCP loses context after schema changes&lt;/td&gt;&lt;td&gt;DDL applied outside the CLI session&lt;/td&gt;&lt;td&gt;Re-authenticate and refresh service metadata; run &lt;code&gt;tiger db connect&lt;/code&gt; to verify current schema&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Clanker requires AWS CLI v2 but v1 is installed&lt;/td&gt;&lt;td&gt;Legacy tooling in CI/CD environments&lt;/td&gt;&lt;td&gt;Install AWS CLI v2 from the official installer rather than &lt;code&gt;pip&lt;/code&gt; (the &lt;code&gt;awscli&lt;/code&gt; package on PyPI is v1 only); check &lt;code&gt;aws --version&lt;/code&gt; before wiring Clanker into a pipeline&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
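&lt;p&gt;The AWS CLI version check in the last row can be automated as a pipeline preflight. The following is a minimal sketch, not part of Clanker: the helper names are hypothetical, and the sample strings only imitate the general shape of &lt;code&gt;aws --version&lt;/code&gt; output.&lt;/p&gt;

```python
import re


def aws_cli_major_version(version_output):
    """Return the major version parsed from `aws --version` output, or None."""
    match = re.match(r"aws-cli/(\d+)\.", version_output.strip())
    return int(match.group(1)) if match else None


def is_aws_cli_v2(version_output):
    """True when the reported AWS CLI major version is 2 or newer."""
    major = aws_cli_major_version(version_output)
    return major is not None and major >= 2


# Illustrative strings only; real output varies by platform and build.
sample_v2 = "aws-cli/2.15.30 Python/3.11.8 Linux/5.15.0 exe/x86_64.ubuntu.22"
sample_v1 = "aws-cli/1.32.0 Python/3.8.10 Linux/5.15.0 botocore/1.34.0"

print(is_aws_cli_v2(sample_v2), is_aws_cli_v2(sample_v1))
```

&lt;p&gt;In a real pipeline the string would come from running &lt;code&gt;aws --version&lt;/code&gt; through &lt;code&gt;subprocess.run&lt;/code&gt;. Some older builds may print the version to stderr, so it is safer to check both streams.&lt;/p&gt;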
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Engineering teams are still hand-writing cloud provisioning commands and generating SQL code without production database context — execution steps that AI can handle directly if given the right connections.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Clanker CLI for cloud infrastructure inspection and plan generation; AI Infrastructure Agent for natural-language-to-AWS provisioning (as an evaluation tool); Tiger CLI’s MCP server for grounding coding agents in live production Postgres context.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: The clearest test of Tiger CLI is to run &lt;code&gt;tiger mcp install&lt;/code&gt;, ask your AI coding assistant to write a query against your actual production schema, and compare the output with what the same assistant produces without that context. The difference in index awareness and schema accuracy is the productivity delta.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;tiger mcp install&lt;/code&gt; and connect it to a Tiger Cloud service (or evaluate against the free tier). Ask your coding assistant to generate a query you know is tricky — a multi-table join with a specific filter selectivity. Compare the output with and without MCP context.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>ai-engineering</category><category>cloud</category><category>databases</category></item><item><title>PostgreSQL 18: Features DB Engineers Should Watch</title><link>https://rajivonai.com/blog/2025-09-25-postgresql-18-features-db-engineers-should-watch/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-09-25-postgresql-18-features-db-engineers-should-watch/</guid><description>PostgreSQL 18 introduces fundamental changes to the storage engine — asynchronous I/O, parallel logical apply, and improved conflict visibility are the changes operators need to understand before upgrading.</description><pubDate>Thu, 25 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;PostgreSQL 18 shipped in September 2025 and delivers the most fundamental change to PostgreSQL’s storage engine in its history: asynchronous I/O.&lt;/strong&gt; This post was written in January 2025 based on accepted CommitFest patches and has been validated against the final PG18 release. All four features described below shipped as documented.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL has used synchronous I/O since its inception. Every read and write to storage blocks the backend process until the kernel returns. This is simple, predictable, and correct — but it means every disk-bound query is a sequence of blocking kernel calls with no opportunity for the backend to do useful work while waiting for I/O.&lt;/p&gt;
&lt;p&gt;Modern storage — NVMe SSDs, io_uring-capable kernels, cloud block storage with significant parallelism — is well-suited to concurrent I/O. PostgreSQL could not take advantage of this without a fundamental change to how it submits and waits for I/O requests.&lt;/p&gt;
&lt;p&gt;PG18 introduces an asynchronous I/O subsystem, selected with the &lt;code&gt;io_method&lt;/code&gt; setting, which defaults to &lt;code&gt;worker&lt;/code&gt;. Alongside this, several replication and operational improvements address long-standing gaps. Operators who plan upgrades should understand these changes now, because some of them alter default behavior.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The synchronous I/O model has a measurable impact on workloads that require high disk throughput: parallel queries hitting large tables, checkpoint writers under heavy write load, and logical replication subscribers applying changes from high-write publishers. Each backend process can only have one I/O operation in flight at a time.&lt;/p&gt;
&lt;p&gt;The operational impact shows up as I/O utilization that looks low on aggregate metrics (storage is not at 100% IOPS) while query latency is high. The storage device has capacity, but PostgreSQL is not submitting enough concurrent requests to use it. This is the structural problem that asynchronous I/O in PG18 addresses.&lt;/p&gt;
&lt;p&gt;The risk for operators: asynchronous I/O changes how PostgreSQL interacts with the kernel, which changes how it behaves on specific OS and storage configurations. Teams that upgrade to PG18 on non-standard storage setups (network block storage, certain cloud filesystems, shared storage) may observe different I/O patterns than they expect. How should engineering teams prepare their infrastructure for PostgreSQL 18’s new I/O and replication models?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[&quot;Client Query&quot;] --&gt; B[&quot;PG18 Backend Process&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C{&quot;io_method GUC&quot;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|&quot;sync&quot;| D[&quot;Blocking Kernel Calls&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|&quot;worker&quot;| E[&quot;I/O Worker Processes&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|&quot;io_uring&quot;| F[&quot;Linux io_uring Non-blocking AIO&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; G[&quot;Storage Engine&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;1. Asynchronous I/O (AIO)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG18 introduces a framework for non-blocking I/O. On Linux with kernel 5.1 or newer, PostgreSQL can use &lt;code&gt;io_uring&lt;/code&gt; as the AIO backend. On other platforms, or when &lt;code&gt;io_uring&lt;/code&gt; is not available, it uses an AIO implementation based on dedicated I/O worker processes.&lt;/p&gt;
&lt;p&gt;The GUC &lt;code&gt;io_method&lt;/code&gt; controls the behavior:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sync&lt;/code&gt;: traditional synchronous I/O (always available, backward-compatible)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;worker&lt;/code&gt;: AIO using dedicated I/O worker processes (available on all platforms; the PG18 default)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;io_uring&lt;/code&gt;: AIO using Linux io_uring (Linux 5.1 and newer; requires PostgreSQL built with &lt;code&gt;--with-liburing&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
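&lt;p&gt;A small preflight can decide which &lt;code&gt;io_method&lt;/code&gt; value is worth benchmarking first on a given host. This is a sketch under stated assumptions: &lt;code&gt;candidate_io_method&lt;/code&gt; is a hypothetical helper, the kernel floor is the 5.1 figure quoted above, and whether the PostgreSQL build includes liburing has to be supplied by the caller.&lt;/p&gt;

```python
import re

MIN_IO_URING_KERNEL = (5, 1)  # kernel floor quoted in this post


def kernel_version(release):
    """Parse a release string such as '5.15.0-91-generic' into (5, 15)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    return (int(match.group(1)), int(match.group(2))) if match else None


def candidate_io_method(release, built_with_liburing):
    """Suggest which io_method value to benchmark first (hypothetical helper)."""
    version = kernel_version(release)
    if version is not None and version >= MIN_IO_URING_KERNEL and built_with_liburing:
        return "io_uring"
    # worker is available on every platform, so it is the safe starting point.
    return "worker"


print(candidate_io_method("5.15.0-91-generic", True))
print(candidate_io_method("4.18.0-513.el8.x86_64", True))
print(candidate_io_method("6.1.0-18-amd64", False))
```

&lt;p&gt;On Linux, &lt;code&gt;platform.release()&lt;/code&gt; returns the kernel release string that this function expects. The result is only a starting point for testing: a container or hypervisor can still restrict &lt;code&gt;io_uring&lt;/code&gt; on a kernel that nominally supports it.&lt;/p&gt;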
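&lt;p&gt;The expected benefit is measurable on read-heavy operations such as sequential scans, bitmap heap scans, and vacuum, where multiple reads can be queued concurrently. In PG18 the AIO framework covers reads; writes, including checkpoint writes, still use the existing synchronous path.&lt;/p&gt;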
&lt;p&gt;&lt;strong&gt;2. Parallel streaming apply for logical replication&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG16 introduced parallel apply of large in-progress transactions as an opt-in. PG18 makes it the default by changing the &lt;code&gt;streaming&lt;/code&gt; option for &lt;code&gt;CREATE SUBSCRIPTION&lt;/code&gt; from &lt;code&gt;off&lt;/code&gt; to &lt;code&gt;parallel&lt;/code&gt;. In PG16 and PG17, parallel streaming required explicit configuration. In PG18, new subscriptions stream large transactions in parallel by default.&lt;/p&gt;
&lt;p&gt;The operational consequence: subscribers on PG18 will consume more CPU and hold more locks during apply than a comparable PG17 subscriber would. Conflict handling logic that assumes single-threaded apply ordering may behave differently with parallel apply enabled. The &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt; view provides per-subscription apply metrics including conflict counts, which is the right place to observe this.&lt;/p&gt;
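&lt;p&gt;One way to make those metrics actionable is to diff two snapshots of a subscription's counters and flag anything that grew more than expected. The sketch below assumes the snapshots have already been fetched into dictionaries. The function names and the budget idea are hypothetical, and the column names in the sample data should be checked against the view on your server version.&lt;/p&gt;

```python
def counter_deltas(baseline, current):
    """Return counters that increased between two snapshots of one subscription."""
    deltas = {}
    for name, value in current.items():
        increase = value - baseline.get(name, 0)
        # A negative difference usually means the statistics were reset; skip it.
        if increase > 0:
            deltas[name] = increase
    return deltas


def over_budget(deltas, budget):
    """Names of counters whose increase exceeds the allowed budget for the interval."""
    return sorted(name for name, inc in deltas.items() if inc > budget.get(name, 0))


# Sample snapshots; the values are made up for illustration.
baseline = {"apply_error_count": 3, "sync_error_count": 0}
current = {"apply_error_count": 5, "sync_error_count": 0, "confl_insert_exists": 4}

deltas = counter_deltas(baseline, current)
print(deltas)
print(over_budget(deltas, {"apply_error_count": 2}))
```

&lt;p&gt;Counters missing from the baseline are treated as starting at zero, so a counter that only exists after an upgrade is reported in full the first time it appears.&lt;/p&gt;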
&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;pg_createsubscriber --all&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG18 adds &lt;code&gt;--all&lt;/code&gt; to &lt;code&gt;pg_createsubscriber&lt;/code&gt;, the tool for converting a physical standby into a logical replication subscriber. Before PG18, this required naming each database individually with &lt;code&gt;--database&lt;/code&gt;. With &lt;code&gt;--all&lt;/code&gt;, the tool sets up logical replication for all databases on the standby in one command.&lt;/p&gt;
&lt;p&gt;This simplifies the zero-downtime major version upgrade workflow significantly. The documented use case: take a physical streaming replica, convert it to a logical subscriber of the primary, let it catch up as a logical subscriber, then promote. The &lt;code&gt;--all&lt;/code&gt; flag reduces the setup steps for multi-database clusters.&lt;/p&gt;
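&lt;p&gt;As a rough illustration of that workflow's first step, the snippet below assembles a &lt;code&gt;pg_createsubscriber&lt;/code&gt; invocation for review rather than running it. The data directory and connection string are placeholders, the builder function is hypothetical, and the long option names are as I understand them from the tool's documentation, so confirm them with &lt;code&gt;pg_createsubscriber --help&lt;/code&gt; on your build.&lt;/p&gt;

```python
import shlex


def createsubscriber_command(pgdata, publisher_conninfo, dry_run=True):
    """Build the argv for converting a standby with --all (hypothetical helper)."""
    argv = [
        "pg_createsubscriber",
        "--all",                                   # one subscription per database (PG18)
        "--pgdata", pgdata,                        # data directory of the standby
        "--publisher-server", publisher_conninfo,  # connection string for the primary
    ]
    if dry_run:
        # Validate the plan without modifying the standby.
        argv.append("--dry-run")
    return argv


cmd = createsubscriber_command(
    "/var/lib/postgresql/18/standby",                          # placeholder path
    "host=primary.example.internal port=5432 dbname=postgres",  # placeholder conninfo
)
print(shlex.join(cmd))
```

&lt;p&gt;Defaulting to a dry run keeps the sketch safe to paste into a runbook: the real conversion only happens when an operator passes &lt;code&gt;dry_run=False&lt;/code&gt; and executes the resulting command.&lt;/p&gt;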
&lt;p&gt;&lt;strong&gt;4. Improved conflict visibility in logical replication&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Logical replication conflict handling in PG17 and earlier emitted minimal log information when a conflict occurred (a duplicate key or update to a row that was deleted on the subscriber). PG18 adds structured conflict detail to the log messages and extends &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt; with conflict type counts.&lt;/p&gt;
&lt;p&gt;The operational impact: conflict-based apply failures are now diagnosable from log output without attaching debuggers or running manual queries. The new log format changes what conflict monitoring tools expect to parse. Log aggregation pipelines that alert on replication conflict patterns need to update their regex or structured log parsers before upgrading to PG18.&lt;/p&gt;
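&lt;p&gt;A parser update might look like the sketch below, which counts conflicts per relation and conflict type. The line shape it matches is an assumption based on my reading of the PostgreSQL logical replication conflict documentation, and the sample log lines are invented, so validate the pattern against real PG18 logs before alerting on it.&lt;/p&gt;

```python
import re
from collections import Counter

# Assumed message shape: conflict detected on relation "schema.table": conflict=type
CONFLICT_LINE = re.compile(r'conflict detected on relation "([^"]+)": conflict=(\w+)')


def count_conflicts(log_lines):
    """Count conflict messages keyed by (relation, conflict type)."""
    counts = Counter()
    for line in log_lines:
        match = CONFLICT_LINE.search(line)
        if match:
            relation, kind = match.groups()
            counts[(relation, kind)] += 1
    return counts


# Invented sample lines for illustration only.
sample = [
    '2025-10-01 02:00:01 UTC [4211] ERROR:  conflict detected on relation "public.orders": conflict=insert_exists',
    '2025-10-01 02:00:01 UTC [4211] DETAIL:  (detail text omitted)',
    '2025-10-01 02:00:07 UTC [4211] LOG:  conflict detected on relation "public.orders": conflict=update_missing',
    '2025-10-01 02:00:09 UTC [4211] LOG:  checkpoint starting: time',
]

counts = count_conflicts(sample)
for key, value in sorted(counts.items()):
    print(key, value)
```

&lt;p&gt;Keying on the conflict type rather than on free-text fragments keeps the alert rule stable if the detail text changes between minor releases.&lt;/p&gt;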
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL 18’s AIO framework shipped with &lt;code&gt;io_uring&lt;/code&gt; requiring both Linux kernel 5.1 or newer and a PostgreSQL build with &lt;code&gt;--with-liburing&lt;/code&gt;. PostgreSQL’s behavior when falling back is well-defined: if the environment restricts &lt;code&gt;io_uring&lt;/code&gt; at the container or hypervisor level — which is common in some managed cloud offerings — the system gracefully falls back to traditional modes. Database operators must test the specific &lt;code&gt;io_method&lt;/code&gt; setting against their target storage environment.&lt;/p&gt;
&lt;p&gt;For logical replication, PostgreSQL’s behavior with &lt;code&gt;max_parallel_apply_workers_per_subscription&lt;/code&gt; is documented to change ordering guarantees. Within a single transaction, order is preserved, but across transactions, parallel workers may apply changes out of logical commit order. Applications that depend on subscribers seeing changes in strict commit order must account for this behavior change.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;AIO on unsupported storage or kernel&lt;/td&gt;&lt;td&gt;io_uring mode falls back to worker mode, and expected I/O gains do not materialize&lt;/td&gt;&lt;td&gt;io_uring requires kernel 5.1 or newer and is blocked in some cloud managed environments&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Parallel apply with existing conflict handling&lt;/td&gt;&lt;td&gt;Apply errors or stalled replication on rows processed out of expected order&lt;/td&gt;&lt;td&gt;Multi-worker apply does not guarantee cross-transaction ordering, so single-threaded conflict logic may not handle this correctly&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Log parsing for replication conflict alerts&lt;/td&gt;&lt;td&gt;Alert rules that matched old conflict log format produce no alerts or false positives&lt;/td&gt;&lt;td&gt;PG18 structured conflict log messages use a different format than PG17 unstructured messages&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: PG18’s AIO and default parallel apply change I/O behavior and replication ordering assumptions — upgrading without testing on representative workloads risks performance regressions and silent replication issues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Test PG18 with &lt;code&gt;io_method = worker&lt;/code&gt; first to establish broad platform compatibility, validate logical replication behavior with parallel apply enabled, and update conflict log parsing before production adoption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: On a PG18 test instance, run a parallel sequential scan against a large table with &lt;code&gt;io_method = worker&lt;/code&gt; and compare elapsed time against the same query on PG17 — the expected result is measurably faster for scans larger than shared buffers.&lt;/li&gt;
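&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: If you run logical replication subscribers today, review &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt; on PG17 and record a baseline for &lt;code&gt;apply_error_count&lt;/code&gt; and &lt;code&gt;sync_error_count&lt;/code&gt;, since the per-conflict-type counters only arrive in PG18. After enabling parallel apply on PG18, confirm the error counts stay within that baseline and watch the new conflict counters from zero.&lt;/li&gt;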
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>Autovacuum Is a Capacity Problem, Not a Maintenance Task</title><link>https://rajivonai.com/blog/2025-09-13-autovacuum-is-a-capacity-problem-not-a-maintenance-task/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-09-13-autovacuum-is-a-capacity-problem-not-a-maintenance-task/</guid><description>PostgreSQL vacuum failures often start with blocked cleanup, table bloat, and weak lock observability during peak load.</description><pubDate>Sat, 13 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Autovacuum is not a background chore; it is part of write capacity, and PostgreSQL will collect that debt during peak traffic if the system does not budget for cleanup before the workload arrives.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s multi-version concurrency control, or MVCC, makes reads and writes coexist by leaving old row versions behind after &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt;. &lt;code&gt;VACUUM&lt;/code&gt; later removes or marks that dead space reusable, updates planner statistics, maintains visibility maps for index-only scans, and protects the database from transaction ID wraparound, as PostgreSQL’s own routine vacuuming documentation describes: &lt;a href=&quot;https://www.postgresql.org/docs/17/routine-vacuuming.html&quot;&gt;PostgreSQL 17 routine vacuuming&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The operational mistake is treating autovacuum as maintenance instead of capacity. In a write-heavy commerce system, queue processor, billing ledger, workflow engine, or event ingestion service, dead tuples are not an after-hours concern. They are a steady byproduct of throughput.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Default mental model&lt;/th&gt;&lt;th&gt;Production reality&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Autovacuum is background maintenance&lt;/td&gt;&lt;td&gt;Autovacuum competes for I/O, workers, locks, and transaction horizon progress&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Active connection count explains the incident&lt;/td&gt;&lt;td&gt;Table-level dead tuples, lock waits, and oldest &lt;code&gt;xmin&lt;/code&gt; explain the incident&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;One cluster setting fits every table&lt;/td&gt;&lt;td&gt;High-churn tables need per-table settings&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Killing autovacuum ends the emergency&lt;/td&gt;&lt;td&gt;Killing autovacuum creates cleanup debt that must be paid back deliberately&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The common diagnosis is backwards: autovacuum usually does not start as the villain. It becomes visible after the system has already created cleanup debt.&lt;/p&gt;
&lt;p&gt;PostgreSQL standard &lt;code&gt;VACUUM&lt;/code&gt; can run alongside ordinary &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt;, while &lt;code&gt;VACUUM FULL&lt;/code&gt; requires an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock and rewrites the table. That distinction matters. A normal autovacuum is designed to be cooperative, but it still consumes I/O and takes a &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;/code&gt; lock. If conflicting operations keep interrupting it, if long transactions hold the visibility horizon open, or if the write rate exceeds cleanup capacity, dead tuples accumulate until the application starts paying for them in heap scans, index scans, cache churn, and longer vacuum cycles.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Long-running transaction or &lt;code&gt;idle in transaction&lt;/code&gt; session&lt;/td&gt;&lt;td&gt;Dead tuples remain visible to the oldest snapshot and cannot be removed&lt;/td&gt;&lt;td&gt;Autovacuum can run and still fail to reclaim the space operators expect&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Default &lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; on a 200M-row table&lt;/td&gt;&lt;td&gt;Vacuum may wait for tens of millions of obsolete tuples before triggering&lt;/td&gt;&lt;td&gt;The threshold is mathematically sane for small tables and operationally late for hot large tables&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replication slot or stale replica feedback holds &lt;code&gt;xmin&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Cleanup is pinned behind downstream consumption&lt;/td&gt;&lt;td&gt;Primary database bloat becomes a replication and availability problem, not just local storage waste&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Large tables become eligible together&lt;/td&gt;&lt;td&gt;&lt;code&gt;autovacuum_max_workers&lt;/code&gt; can be occupied by a small number of relations&lt;/td&gt;&lt;td&gt;Smaller hot tables wait behind large scans and latency spreads across unrelated features&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Monitoring only &lt;code&gt;pg_stat_activity&lt;/code&gt; active count&lt;/td&gt;&lt;td&gt;Operators see queueing, not the relation causing cleanup debt&lt;/td&gt;&lt;td&gt;The dashboard points at symptoms while the table-level cause grows&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
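&lt;p&gt;The scale-factor row is easy to verify with arithmetic. PostgreSQL's documented trigger is the base threshold plus the scale factor times the table's row count. The sketch below applies that formula to the 200M-row case; the function name and the tuned per-table values are illustrative assumptions, not recommendations.&lt;/p&gt;

```python
def vacuum_trigger_threshold(reltuples, scale_factor=0.2, base_threshold=50):
    """Dead tuples needed before autovacuum considers the table.

    Defaults mirror autovacuum_vacuum_scale_factor = 0.2 and
    autovacuum_vacuum_threshold = 50.
    """
    return base_threshold + int(round(scale_factor * reltuples))


rows = 200_000_000

default_trigger = vacuum_trigger_threshold(rows)
# Hypothetical per-table override for a hot table.
tuned_trigger = vacuum_trigger_threshold(rows, scale_factor=0.01, base_threshold=1000)

print(default_trigger)  # 40000050
print(tuned_trigger)    # 2001000
```

&lt;p&gt;With defaults, the table accumulates about 40 million dead tuples before autovacuum even starts. The illustrative per-table setting brings that down to about 2 million, which is the kind of gap a per-table policy is meant to close.&lt;/p&gt;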
&lt;p&gt;The core question is not “Why did autovacuum run during peak load?” The useful question is: &lt;strong&gt;why did the system enter peak load with no table-level cleanup budget, no lock visibility, and no oldest-transaction alarm?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;treat-vacuum-as-a-capacity-control-plane&quot;&gt;Treat Vacuum as a Capacity Control Plane&lt;/h2&gt;
&lt;p&gt;The right architecture is a small vacuum control plane: table-level observability, per-table policy, lock and horizon detection, and an operator runbook that distinguishes emergency relief from debt repayment.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App[application writes] --&gt; MVCC[MVCC creates old row versions]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    MVCC --&gt; Stats[pg_stat_user_tables dead tuple counters]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    MVCC --&gt; Horizon[oldest xmin and replication horizon]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Stats --&gt; Dashboard[vacuum health dashboard]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Horizon --&gt; Dashboard&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Locks[pg_locks and pg_stat_activity] --&gt; Dashboard&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Progress[pg_stat_progress_vacuum] --&gt; Dashboard&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dashboard --&gt; Policy[per-table autovacuum policy]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Policy --&gt; Workers[autovacuum workers]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Workers --&gt; Cleanup[dead tuple cleanup and freeze progress]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Cleanup --&gt; Capacity[steady write capacity]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dashboard --&gt; Runbook[operator runbook]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Build the dashboard around relations, not sessions.&lt;/p&gt;
&lt;p&gt;Start with &lt;code&gt;pg_stat_user_tables&lt;/code&gt;, &lt;code&gt;pg_class&lt;/code&gt;, &lt;code&gt;pg_stat_activity&lt;/code&gt;, &lt;code&gt;pg_locks&lt;/code&gt;, and &lt;code&gt;pg_stat_progress_vacuum&lt;/code&gt;. Active connections are only the smoke. The heat is per relation: &lt;code&gt;n_dead_tup&lt;/code&gt;, relation size, &lt;code&gt;last_autovacuum&lt;/code&gt;, &lt;code&gt;last_autoanalyze&lt;/code&gt;, current vacuum phase, lock wait duration, and the oldest transaction age.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schemaname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;relname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;n_live_tup&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;n_dead_tup&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    pg_size_pretty(pg_total_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;oid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; total_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;((&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;n_dead_tup&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; NULLIF&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;n_live_tup&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; dead_rows_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;last_autovacuum&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;last_autoanalyze&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;last_autovacuum&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; last_autovacuum_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_class c &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;relname&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;relname&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_namespace n &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; n&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;oid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;relnamespace&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; n&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;nspname&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schemaname&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;n_dead_tup&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: the top 20 write-heavy tables should have visible dead tuple count, dead tuple ratio, total relation size, last autovacuum age, and last autoanalyze time on one screen.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add horizon monitoring before tuning cost limits.&lt;/p&gt;
&lt;p&gt;Autovacuum cannot remove row versions still visible to an old snapshot. A single abandoned transaction can make vacuum appear “ineffective” even when workers are active. Check for an old &lt;code&gt;backend_xmin&lt;/code&gt;, an old &lt;code&gt;backend_xid&lt;/code&gt;, prepared transactions, and replication slots.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wait_event_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wait_event,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(backend_xmin) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_xmin_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(backend_xid) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_xid_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(), xact_start) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transaction_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    LEFT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(query, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;160&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_sample&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_xmin &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   OR&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_xid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; GREATEST&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    COALESCE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(age(backend_xmin), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    COALESCE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(age(backend_xid), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: alert when a transaction age crosses a workload-specific threshold, such as 5 minutes for OLTP checkout paths or 30 minutes for internal reporting, before tying the alert to dead tuple growth.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Track vacuum progress by phase.&lt;/p&gt;
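&lt;p&gt;PostgreSQL exposes &lt;code&gt;pg_stat_progress_vacuum&lt;/code&gt; for active vacuum operations, including autovacuum workers. The view reports heap blocks scanned, heap blocks vacuumed, index vacuum count, dead tuple counters, and the current phase; PostgreSQL documents this under progress reporting: &lt;a href=&quot;https://www.postgresql.org/docs/current/progress-reporting.html&quot;&gt;VACUUM progress reporting&lt;/a&gt;. The dead tuple column names depend on the server version: PostgreSQL 17 renamed &lt;code&gt;num_dead_tuples&lt;/code&gt; to &lt;code&gt;num_dead_item_ids&lt;/code&gt;, so adjust the last column of the query accordingly on newer servers.&lt;/p&gt;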
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    a&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;datname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;relid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relation,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    a&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;phase&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;heap_blks_total&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;heap_blks_scanned&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;heap_blks_vacuumed&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;100&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;heap_blks_scanned&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; NULLIF&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;heap_blks_total&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pct_scanned,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;index_vacuum_count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;num_dead_tuples&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_progress_vacuum p&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity a &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;USING&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (pid)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; p&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: operators should be able to classify an active vacuum as scanning, vacuuming indexes, vacuuming heap, cleaning indexes, truncating heap, or performing final cleanup without reading server logs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tune hot tables with absolute thresholds, not ratios alone.&lt;/p&gt;
&lt;p&gt;PostgreSQL triggers autovacuum when obsolete tuple count exceeds:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That formula is documented in the PostgreSQL autovacuum daemon section: &lt;a href=&quot;https://www.postgresql.org/docs/17/routine-vacuuming.html&quot;&gt;autovacuum threshold formula&lt;/a&gt;. On a 10M-row &lt;code&gt;orders&lt;/code&gt; table, the default &lt;code&gt;50 + 0.2 * 10000000&lt;/code&gt; means roughly 2,000,050 obsolete tuples before vacuum eligibility. On a hot table updated continuously, that is not a maintenance threshold. It is an incident waiting room with chairs.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    autovacuum_vacuum_scale_factor &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;01&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    autovacuum_vacuum_threshold &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 50000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    autovacuum_analyze_scale_factor &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;02&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    autovacuum_analyze_threshold &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 50000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    autovacuum_vacuum_cost_delay &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: after a realistic write-load test, the table should show smaller, more frequent vacuum cycles, stable &lt;code&gt;n_dead_tup&lt;/code&gt;, and no sustained increase in p95 query latency during vacuum phases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Separate emergency termination from recovery.&lt;/p&gt;
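&lt;p&gt;Terminating an autovacuum worker may reduce immediate pressure if it is contending with production traffic, but it does not remove the dead tuples. It postpones cleanup. Worse, if the worker is running to prevent wraparound (its &lt;code&gt;pg_stat_activity&lt;/code&gt; query text ends with &lt;code&gt;(to prevent wraparound)&lt;/code&gt;), PostgreSQL does not treat it like ordinary background work: it is not cancelled automatically when another session requests a conflicting lock, and after a manual termination the launcher will typically start another worker on the same table.&lt;/p&gt;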
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(), query_start) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; runtime,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wait_event_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wait_event&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_type &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;autovacuum worker&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: every termination action must create a follow-up ticket with target relation, observed dead tuples, oldest transaction state, and an explicit manual &lt;code&gt;VACUUM&lt;/code&gt; or retuning plan.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
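&lt;p&gt;Step 2 names prepared transactions and replication slots as horizon holders, but neither shows up in &lt;code&gt;pg_stat_activity&lt;/code&gt;. As a minimal sketch, assuming the standard &lt;code&gt;pg_prepared_xacts&lt;/code&gt; and &lt;code&gt;pg_replication_slots&lt;/code&gt; views and a role allowed to read them, the two queries below list both, oldest first. The column aliases are illustrative, not an established convention.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Prepared (two-phase) transactions that can hold the cleanup horizon&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    gid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    owner,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    database,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    prepared,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(transaction) AS xid_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_prepared_xacts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY age(transaction) DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Replication slots that pin xmin or catalog_xmin&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    slot_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    slot_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(xmin) AS xmin_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    age(catalog_xmin) AS catalog_xmin_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY GREATEST(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    COALESCE(age(xmin), 0),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    COALESCE(age(catalog_xmin), 0)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An inactive slot whose &lt;code&gt;xmin_age&lt;/code&gt; keeps growing is a likely candidate for follow-up. Whether to drop it depends on the downstream consumer, so treat the output as a prompt for investigation rather than an automatic action.&lt;/p&gt;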
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is not theoretical. GitLab publicly analyzed PostgreSQL autovacuum behavior on GitLab.com and treated it as a production tuning problem backed by stats, logs, and Prometheus data. In their autovacuum considerations issue, they reported autovacuum consuming a high share of read I/O while doing a small amount of block cleanup, then evaluated table-specific behavior and candidate configuration changes: &lt;a href=&quot;https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/4916&quot;&gt;GitLab autovacuum considerations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The important engineering detail is scale. GitLab called out relations in the hundreds of millions to over a billion tuples, including &lt;code&gt;merge_request_diff_files&lt;/code&gt; and &lt;code&gt;merge_request_diff_commits&lt;/code&gt;. For those shapes, a global threshold is a blunt instrument. A scale factor that is reasonable for a 500K-row table can be absurd for a 1B-row table, and a threshold tuned for one high-churn table can make quieter tables vacuum too often.&lt;/p&gt;
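&lt;p&gt;To make the scale argument concrete, the sketch below applies the default formula (threshold 50, scale factor 0.2) to three hypothetical table sizes. It reads no real tables, so it can be run on any PostgreSQL instance; the row counts are illustrative values, not GitLab measurements.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Default trigger point: autovacuum_vacuum_threshold (50)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- plus autovacuum_vacuum_scale_factor (0.2) times the row count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    t.table_rows,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    50 + 0.2 * t.table_rows AS dead_tuples_before_vacuum&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM (VALUES&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    (500000::bigint),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    (10000000::bigint),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    (1000000000::bigint)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) AS t(table_rows)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY t.table_rows;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By the same arithmetic, the trigger point moves from about 100 thousand dead tuples at 500K rows to about 2 million at 10M rows and about 200 million at 1B rows. The setting is identical in all three cases; only the table size changes.&lt;/p&gt;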
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Public evidence&lt;/th&gt;&lt;th&gt;What it shows&lt;/th&gt;&lt;th&gt;Production lesson&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;GitLab tracked autovacuum and autoanalyze daily counts&lt;/td&gt;&lt;td&gt;Vacuum frequency was measured as an operational signal&lt;/td&gt;&lt;td&gt;Count vacuum cycles per table, not just cluster-wide activity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitLab compared before and after migration behavior&lt;/td&gt;&lt;td&gt;Configuration changed based on observed workload&lt;/td&gt;&lt;td&gt;Treat autovacuum tuning as capacity testing, not folklore&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitLab inspected &lt;code&gt;pg_stat_all_tables.n_dead_tup&lt;/code&gt; in Prometheus&lt;/td&gt;&lt;td&gt;Dead tuples were tracked over time&lt;/td&gt;&lt;td&gt;Alert on trajectory, not only threshold breach&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitLab selected candidate tables for custom settings&lt;/td&gt;&lt;td&gt;Large relations needed table-specific policy&lt;/td&gt;&lt;td&gt;Per-table storage parameters are normal for serious PostgreSQL operations&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;This also follows directly from PostgreSQL behavior. &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; leave old row versions behind under MVCC until vacuum can mark space reusable. Standard vacuum does not generally return space to the operating system; it makes space reusable inside the relation. &lt;code&gt;VACUUM FULL&lt;/code&gt; rewrites the table and requires an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock, which blocks both reads and writes on the table while it runs. That is why waiting until bloat is obvious is expensive: at that point, the fix may require either a long plain vacuum that only stabilizes reuse or a rewrite operation that needs a maintenance window.&lt;/p&gt;
&lt;p&gt;The source incident describes the recognizable operational smell: response time spikes, lock waits, autovacuum visible in &lt;code&gt;pg_stat_activity&lt;/code&gt;, and operators reaching for termination commands. The deeper diagnosis is that the system had no pre-peak signal for cleanup debt. Once users are checking out, workers are busy, indexes are colder, heap pages are dirty, and autovacuum is behind, every option is ugly. The best time to find a bloated &lt;code&gt;orders&lt;/code&gt; table is before the marketing email, not while the payment service is practicing interpretive latency.&lt;/p&gt;
&lt;p&gt;A production vacuum dashboard should make five questions answerable in less than a minute:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Question&lt;/th&gt;&lt;th&gt;View or metric&lt;/th&gt;&lt;th&gt;Bad signal&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Which tables are accumulating cleanup debt?&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables.n_dead_tup&lt;/code&gt;, relation size&lt;/td&gt;&lt;td&gt;Dead tuples rising faster than vacuum completion&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Is vacuum running or stalled?&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_progress_vacuum.phase&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Phase unchanged while lock waits or I/O waits climb&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;What is pinning cleanup?&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity.backend_xmin&lt;/code&gt;, replication slots&lt;/td&gt;&lt;td&gt;Old snapshot age grows while dead tuples persist&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Are workers saturated?&lt;/td&gt;&lt;td&gt;Active autovacuum workers and table queue&lt;/td&gt;&lt;td&gt;Large relations occupy workers for long periods&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Is the threshold wrong?&lt;/td&gt;&lt;td&gt;Dead tuples at vacuum start and duration&lt;/td&gt;&lt;td&gt;Vacuum starts only after latency or bloat is visible&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
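&lt;p&gt;The worker saturation question has no dedicated view, so it usually has to be derived. One possible sketch, assuming PostgreSQL 10 or later (where &lt;code&gt;pg_stat_activity.backend_type&lt;/code&gt; exists), compares running autovacuum workers with the configured ceiling. The column aliases are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Running autovacuum workers versus autovacuum_max_workers&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    count(*) AS active_autovacuum_workers,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    current_setting(&apos;autovacuum_max_workers&apos;)::int AS max_workers,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    -- NULL when no worker is currently running&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    max(age(now(), query_start)) AS longest_running_worker&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE backend_type = &apos;autovacuum worker&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sampled over time, a worker count that sits at the ceiling while &lt;code&gt;longest_running_worker&lt;/code&gt; keeps growing suggests large relations are crowding out smaller hot tables. A single sample proves little, so this is better charted than alerted on directly.&lt;/p&gt;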
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Dead tuple percentage looks fine while absolute debt is huge&lt;/td&gt;&lt;td&gt;A 1B-row table with 1 percent dead rows still has 10M obsolete tuples&lt;/td&gt;&lt;td&gt;Alert on absolute &lt;code&gt;n_dead_tup&lt;/code&gt;, dead tuple ratio, and relation size together&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Autovacuum runs but bloat does not fall&lt;/td&gt;&lt;td&gt;Long transaction, prepared transaction, stale replica feedback, or replication slot pins the visibility horizon&lt;/td&gt;&lt;td&gt;Monitor &lt;code&gt;backend_xmin&lt;/code&gt;, &lt;code&gt;backend_xid&lt;/code&gt;, &lt;code&gt;pg_prepared_xacts&lt;/code&gt;, and replication slot lag before changing vacuum cost settings&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Vacuum becomes too aggressive after lowering scale factor&lt;/td&gt;&lt;td&gt;Hot tables vacuum frequently enough to compete with foreground I/O&lt;/td&gt;&lt;td&gt;Tune &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt;, table thresholds, and worker count under load; verify p95 latency during vacuum&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;VACUUM FULL&lt;/code&gt; becomes the only visible cleanup option&lt;/td&gt;&lt;td&gt;Plain vacuum makes dead space reusable within the table but usually cannot return it to the operating system&lt;/td&gt;&lt;td&gt;Prefer steady plain vacuum; reserve &lt;code&gt;VACUUM FULL&lt;/code&gt;, &lt;code&gt;CLUSTER&lt;/code&gt;, or table rewrite for controlled maintenance windows with disk headroom&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Partitioned parent has stale planner statistics&lt;/td&gt;&lt;td&gt;Autovacuum processes the leaf partitions but not the partitioned parent itself, so parent-level statistics go stale&lt;/td&gt;&lt;td&gt;Run explicit &lt;code&gt;ANALYZE&lt;/code&gt; on partitioned parents after load or distribution shifts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Insert-heavy table misses cleanup expectations&lt;/td&gt;&lt;td&gt;PostgreSQL 13 and later include insert-trigger autovacuum settings, but older tuning habits focus only on update and delete churn&lt;/td&gt;&lt;td&gt;Include &lt;code&gt;autovacuum_vacuum_insert_threshold&lt;/code&gt; and &lt;code&gt;autovacuum_vacuum_insert_scale_factor&lt;/code&gt; in version-aware reviews&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Terminating autovacuum becomes the runbook&lt;/td&gt;&lt;td&gt;Operators kill workers during peak traffic and never repay cleanup debt&lt;/td&gt;&lt;td&gt;Require a follow-up manual vacuum, threshold change, or capacity review for every terminated worker&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Managed database hides host-level detail&lt;/td&gt;&lt;td&gt;Amazon RDS, Aurora PostgreSQL, Cloud SQL, or Azure Database for PostgreSQL restrict OS-level inspection&lt;/td&gt;&lt;td&gt;Use SQL-visible signals first: stats views, logs, parameter groups, Performance Insights, and query wait sampling&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
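&lt;p&gt;The first row of the table is easy to check in code. The following is a minimal sketch, assuming the stats have already been read from &lt;code&gt;pg_stat_user_tables&lt;/code&gt; into plain values; the class name, function name, and all three limits are hypothetical and would need tuning per workload.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Values read from pg_stat_user_tables plus the relation size in bytes."""
    relname: str
    n_live_tup: int
    n_dead_tup: int
    relation_bytes: int


def debt_alerts(stats, max_dead_tuples=5_000_000, max_dead_ratio=0.10,
                large_relation_bytes=50 * 1024**3):
    """Return (dead ratio, reasons) so ratio and absolute debt are judged together."""
    total = stats.n_live_tup + stats.n_dead_tup
    ratio = stats.n_dead_tup / total if total else 0.0
    reasons = []
    if stats.n_dead_tup >= max_dead_tuples:
        reasons.append("absolute dead tuples")
    if ratio >= max_dead_ratio:
        reasons.append("dead tuple ratio")
    # Size only escalates an existing finding; a big clean table is not an alert.
    if reasons and stats.relation_bytes >= large_relation_bytes:
        reasons.append("large relation")
    return ratio, reasons


# The 1B-row example from the table: 1 percent dead is still 10M tuples.
orders = TableStats("orders", 990_000_000, 10_000_000, 400 * 1024**3)
sessions = TableStats("sessions", 9_000, 1_000, 2 * 1024**2)

for table in (orders, sessions):
    ratio, reasons = debt_alerts(table)
    print(f"{table.relname}: ratio={ratio:.2%} reasons={reasons}")
```

&lt;p&gt;A ratio-only alert would stay silent on &lt;code&gt;orders&lt;/code&gt; and fire on &lt;code&gt;sessions&lt;/code&gt;; combining the signals flags the table that actually carries the debt.&lt;/p&gt;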
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Vacuum incidents happen when write throughput creates cleanup debt faster than PostgreSQL can safely remove it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Treat autovacuum as a capacity control plane with table-level metrics, horizon detection, progress visibility, and per-table policy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A healthy system shows bounded &lt;code&gt;n_dead_tup&lt;/code&gt;, recent &lt;code&gt;last_autovacuum&lt;/code&gt; on hot tables, short transaction ages, and vacuum progress that completes without sustained lock waits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, build a dashboard for the top 20 write-heavy tables showing dead tuples, relation size, last autovacuum age, oldest transaction age, lock waiters, and active vacuum phase.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Autovacuum does not need heroics; it needs budget, observability, and the dignity of being treated like production capacity before it collects payment at the worst possible hour.&lt;/p&gt;</content:encoded><category>databases</category><category>failures</category><category>checklist</category></item><item><title>Top GitHub Breakouts: August 2025 — Part I</title><link>https://rajivonai.com/blog/2025-09-06-github-stars-aug-2025/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-09-06-github-stars-aug-2025/</guid><description>The gap between AI prototype and production system is routing tables, deployment YAML, and observability scaffolding. August 2025&apos;s top breakouts targeted exactly the code engineers keep rewriting: model routing logic, agent deployment manifests, and PostgreSQL diagnostics.</description><pubDate>Sat, 06 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Building production AI systems in 2025 still means writing three layers of boilerplate nobody talks about: the routing logic that decides which model handles which request, the Kubernetes manifests that wire agent workloads together, and the SQL diagnostic queries a DBA writes when Postgres starts choking. August’s top GitHub breakouts attack all three directly.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Every organization adopting LLMs runs into the same friction point: the gap between a working prototype and a production-grade system is filled with infrastructure that has nothing to do with the actual intelligence — it’s routing tables, deployment YAML, and observability scaffolding. Meanwhile, the teams building that scaffolding are the same ones being asked to ship faster.&lt;/p&gt;
&lt;p&gt;August 2025 saw a cluster of open-source releases that treat this scaffolding layer as a solved problem. The three projects with the most traction target exactly the code that engineers keep rewriting from scratch: model routing logic, agent deployment manifests, and PostgreSQL diagnostics.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Manual bottleneck&lt;/th&gt;&lt;th&gt;What it costs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Writing routing rules to dispatch prompts across models by cost, capability, or privacy boundary&lt;/td&gt;&lt;td&gt;Weeks of logic that breaks when you swap providers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Implementing PII detection and jailbreak guards per-service&lt;/td&gt;&lt;td&gt;Each team builds its own leaky filter&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform engineering&lt;/td&gt;&lt;td&gt;Authoring Kubernetes manifests for every new agent workload&lt;/td&gt;&lt;td&gt;Hours per service; bespoke YAML that drifts from staging to prod&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;Running VACUUM analysis, lock monitoring, and slow query triage manually&lt;/td&gt;&lt;td&gt;DBAs context-switching to the same diagnostic queries repeatedly&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Can AI tooling available today eliminate this scaffolding without requiring teams to build custom infrastructure of their own?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Manual engineering boilerplate] --&gt; B[Model routing logic]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[Agent deployment manifests]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; D[DBA diagnostics scripts]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; E[vllm-project — Semantic Router]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[mckinsey — ARK]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; G[call518 — MCP-PostgreSQL-Ops]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; H[AI-automated routing and safety]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; I[Declarative agent infrastructure]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; J[Natural language DB operations]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;vllm-projectsemantic-router--replacing-hand-coded-model-selection-and-safety-filters&quot;&gt;vllm-project/semantic-router — replacing hand-coded model selection and safety filters&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Engineers manually write routing rules to decide which model handles a given request, then bolt on separate PII detectors and jailbreak guards per service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces that task&lt;/strong&gt;: According to the project README, vLLM Semantic Router is a “signal-driven” intelligent router that dispatches requests across model pools based on token economics, safety signals, and capability boundaries. The project uses BERT-based classification (per the repository topics) to detect sensitive content and prompt injection at the system layer — before the request reaches any model — without per-application guard code. The README describes three outcomes: reduced wasted tokens, jailbreak and hallucination detection, and cross-boundary model coordination between edge and cloud deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;: Install via &lt;code&gt;curl -fsSL https://vllm-semantic-router.com/install.sh | bash&lt;/code&gt;, configure a model pool, and the router handles dispatch. Each of the three outcomes (token efficiency, safety, multi-boundary routing) was previously a separate engineering problem requiring separate tooling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: The repository was created in late August 2025 and was still early-stage at the time of this roundup. Classification confidence thresholds and fallback routing behavior were not documented in the README. Teams with strict audit requirements should evaluate the safety detection layer before relying on it as the primary guard.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;mckinseyagents-at-scale-ark--replacing-bespoke-kubernetes-manifests-with-declarative-agent-specs&quot;&gt;mckinsey/agents-at-scale-ark — replacing bespoke Kubernetes manifests with declarative agent specs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Each new agent workload requires authoring Kubernetes manifests from scratch — deployments, services, RBAC rules, monitoring hooks — with nothing shared between projects.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces that task&lt;/strong&gt;: ARK (Agentic Runtime for Kubernetes) takes a declarative approach: you specify &lt;em&gt;what&lt;/em&gt; an agent should do rather than &lt;em&gt;how&lt;/em&gt; to deploy it. The README describes ARK as built on Kubernetes so that proven patterns for security, monitoring, and RBAC ship with the framework rather than being re-implemented per project. Python and npm SDKs expose agents as declarative specs that run on a single developer machine or scale across multi-cloud infrastructure without changes to the spec itself.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;: Install the SDK (&lt;code&gt;pip install ark-sdk&lt;/code&gt; or &lt;code&gt;npm install @agents-at-scale/ark&lt;/code&gt;), write a declarative agent spec, and deploy. McKinsey states in the README that the framework encodes patterns developed across “dozens of agentic application projects” — meaning it reflects real deployment constraints rather than a clean-room design.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: ARK is Kubernetes-native, so teams without an existing cluster face non-trivial setup (Kind or K3s works locally, but adds a dependency). The declarative model assumes agents fit the framework’s abstraction — workloads with unusual resource profiles or custom network topologies may require escape hatches the current documentation does not fully describe.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;call518mcp-postgresql-ops--replacing-manual-dba-diagnostics-with-natural-language-queries&quot;&gt;call518/MCP-PostgreSQL-Ops — replacing manual DBA diagnostics with natural language queries&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Diagnosing PostgreSQL issues requires knowing which views to query for which problem, such as &lt;code&gt;pg_stat_statements&lt;/code&gt; for slow queries, &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; (or &lt;code&gt;pg_stat_checkpointer&lt;/code&gt; on PostgreSQL 17 and later) for checkpoint pressure, and &lt;code&gt;pg_locks&lt;/code&gt; for lock contention, and then writing the correct SQL every time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces that task&lt;/strong&gt;: MCP-PostgreSQL-Ops is an MCP server exposing 30+ PostgreSQL diagnostic tools to AI assistants. The README states it supports natural language queries like “Show me slow queries” or “Analyze table bloat” against PostgreSQL 12-18, works with RDS and Aurora via read-only operations, and requires no extensions for baseline functionality (though &lt;code&gt;pg_stat_statements&lt;/code&gt; and &lt;code&gt;pg_stat_monitor&lt;/code&gt; unlock additional query analytics). The MCP protocol means any compatible AI assistant can use it without a custom integration layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;: &lt;code&gt;pip install MCP-PostgreSQL-Ops&lt;/code&gt; or run via Docker (&lt;code&gt;docker pull call518/mcp-server-postgresql-ops&lt;/code&gt;). Wire it to your AI assistant’s MCP configuration with a connection string, and ask diagnostic questions in plain language. The README confirms all operations are read-only, making it safe to connect to a production replica.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: Read-only is a feature and a constraint — the server identifies that autovacuum is falling behind but cannot issue the VACUUM itself. Closing the loop from detection to remediation requires a separate write-capable tool or a manual step.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;McKinsey’s stated rationale for open-sourcing ARK is that encoding infrastructure patterns from internal agentic applications into Kubernetes controllers removes duplicate platform engineering effort. The underlying pattern is well established: a declarative specification that a controller continuously reconciles limits configuration drift. For database observability, diagnostic queries against statistics views such as &lt;code&gt;pg_stat_statements&lt;/code&gt; are read-only and give visibility into query performance and lock contention at low cost to production throughput, which is why tools like MCP-PostgreSQL-Ops are reasonable to run against read replicas. Because these tools are strictly read-only, they cannot run remediation commands such as &lt;code&gt;VACUUM&lt;/code&gt; to resolve bloat. In model routing, BERT-based classification for PII and safety filtering adds non-zero latency; running those checks synchronously requires careful compute placement to avoid bottlenecking user-facing generation.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Semantic Router safety classification blocks legitimate prompts&lt;/td&gt;&lt;td&gt;BERT classification thresholds set too conservatively&lt;/td&gt;&lt;td&gt;Tune thresholds once documented; maintain a bypass path for trusted internal callers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ARK spec diverges from actual Kubernetes cluster state&lt;/td&gt;&lt;td&gt;Manual edits to generated manifests outside the SDK&lt;/td&gt;&lt;td&gt;Treat generated manifests as read-only; route all changes through the declarative spec&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MCP-PostgreSQL-Ops detects bloat but cannot fix it&lt;/td&gt;&lt;td&gt;Autovacuum lag exceeds thresholds&lt;/td&gt;&lt;td&gt;Pair with a separate remediation workflow; use the MCP server for detection only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Semantic Router adds latency to the inference path&lt;/td&gt;&lt;td&gt;Classification runs synchronously on every request&lt;/td&gt;&lt;td&gt;Deploy closer to the model pool; cache results for repeated prompt patterns&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Engineering teams are rewriting the same routing logic, agent deployment YAML, and DBA diagnostic queries on every project — infrastructure work that delivers no differentiated value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: vLLM Semantic Router handles model routing and safety filtering at the system layer; ARK provides a declarative Kubernetes-native framework for agent deployment; MCP-PostgreSQL-Ops connects AI assistants directly to PostgreSQL diagnostics via natural language.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: The first signal that MCP-PostgreSQL-Ops is working is asking “which tables are most bloated?” and getting a ranked list without writing SQL — that shift from query-writing to question-asking is the productivity delta in concrete form.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;pip install MCP-PostgreSQL-Ops&lt;/code&gt;, point it at a read-only replica connection string, and add it to your AI assistant’s MCP configuration. Ask one diagnostic question you previously had to write SQL for. That is the week-one win.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>ai-engineering</category><category>architecture</category><category>databases</category></item><item><title>The Semantics AI Misses When Porting Storage Designs</title><link>https://rajivonai.com/blog/2025-08-30-the-semantics-ai-misses-when-porting-storage-designs/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-08-30-the-semantics-ai-misses-when-porting-storage-designs/</guid><description>Why a PostgreSQL double write buffer prototype failed despite compiling, and what it reveals about AI-assisted systems design.</description><pubDate>Sat, 30 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;AI can copy the shape of a storage design and still miss the contract that makes it correct: a double write buffer is not an extra write path, it is a durability boundary.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;AI coding agents are now good enough to produce plausible database internals patches: new structs, recovery hooks, background workers, tests, and code that compiles. That changes the review problem. The risk is no longer only “does the code build?” The risk is “did the agent preserve the invisible contract between the database, kernel, filesystem, block device, and recovery algorithm?”&lt;/p&gt;
&lt;p&gt;The source experiment is a useful failure: a Claude Code prototype attempted to port an InnoDB-style double write buffer into PostgreSQL. The implementation followed the surface pattern. Write page to double write buffer. Write page to the real data file. Reuse the slot. The failure was semantic: PostgreSQL and InnoDB do not share the same I/O model, process model, or recovery trust boundary.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Mechanism&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Default trust boundary&lt;/th&gt;&lt;th&gt;What protects against torn pages&lt;/th&gt;&lt;th&gt;Review question&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL full page writes&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Write-ahead log, or WAL, flush&lt;/td&gt;&lt;td&gt;First modified 8KB page image after checkpoint&lt;/td&gt;&lt;td&gt;Is the WAL image durable before recovery needs it?&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;InnoDB doublewrite buffer&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Doublewrite file flush&lt;/td&gt;&lt;td&gt;Page copy written before final tablespace overwrite&lt;/td&gt;&lt;td&gt;Is the doublewrite copy durable before the destination page can tear?&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Naive AI port&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Function names and control flow&lt;/td&gt;&lt;td&gt;Assumed equivalence between writes&lt;/td&gt;&lt;td&gt;Did the patch prove the same crash states are recoverable?&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The lesson generalizes beyond databases. AI-generated infrastructure code often calls the right APIs in the wrong contract order.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A double write buffer, or DWB, protects a database page from a torn write by writing a complete copy somewhere else before overwriting the page at its final location. InnoDB documents this directly: pages flushed from the buffer pool are written to the doublewrite buffer before their proper locations, so crash recovery can find a good copy if the final page write is torn. &lt;a href=&quot;https://dev.mysql.com/doc/refman/8.4/en/innodb-doublewrite-buffer.html&quot;&gt;MySQL 8.4 documentation&lt;/a&gt; names that as the purpose of the feature.&lt;/p&gt;
&lt;p&gt;PostgreSQL solves the same class of failure differently. With &lt;code&gt;full_page_writes=on&lt;/code&gt;, PostgreSQL writes the entire page to WAL during the first modification after each checkpoint. The PostgreSQL docs are explicit: without that page image, a crash during a page write can leave mixed old and new data, and normal row-level WAL records are not enough to reconstruct the page. &lt;a href=&quot;https://www.postgresql.org/docs/current/runtime-config-wal.html&quot;&gt;PostgreSQL current WAL documentation&lt;/a&gt; also warns that turning it off can lead to unrecoverable or silent corruption after system failure.&lt;/p&gt;
&lt;p&gt;The bug in the AI-generated design was treating those mechanisms as interchangeable.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;write()&lt;/code&gt; treated as durable&lt;/td&gt;&lt;td&gt;PostgreSQL writes dirty buffers through the operating system page cache; the kernel can accept the bytes before media persistence&lt;/td&gt;&lt;td&gt;A DWB slot reused after &lt;code&gt;smgrwrite()&lt;/code&gt; can destroy the only good recovery copy&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;sync_file_range()&lt;/code&gt; treated as &lt;code&gt;fsync()&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Linux documents &lt;code&gt;SYNC_FILE_RANGE_WRITE&lt;/code&gt; as asynchronous and not suitable for data integrity operations; it also does not flush volatile disk write caches&lt;/td&gt;&lt;td&gt;Advisory writeback is performance plumbing, not a crash recovery guarantee&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BgWriter path gets synchronous durability work&lt;/td&gt;&lt;td&gt;PostgreSQL’s background writer is tuned around cheap dirty-page writes and checkpoint-spread I/O&lt;/td&gt;&lt;td&gt;Per-page DWB fsync turns an amortized background path into a latency amplifier&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Full page writes disabled too early&lt;/td&gt;&lt;td&gt;WAL no longer contains first-dirtied page images after checkpoint&lt;/td&gt;&lt;td&gt;Recovery must trust a DWB copy that may not actually be durable or current&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slot lifecycle lacks LSN accounting&lt;/td&gt;&lt;td&gt;DWB slot reuse is disconnected from destination file fsync progress&lt;/td&gt;&lt;td&gt;Crash recovery can observe a stale tablespace page and an overwritten DWB slot&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core question is not “can PostgreSQL be given a DWB?” It is: what additional durability accounting would make a DWB at least as trustworthy as PostgreSQL’s existing WAL full page image boundary?&lt;/p&gt;
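&lt;p&gt;The difference between “accepted by the kernel” and “durable” in the table above can be made concrete. The sketch below uses only Python’s POSIX file calls (&lt;code&gt;os.pwrite&lt;/code&gt;, &lt;code&gt;os.fsync&lt;/code&gt;) on temporary files; it is not PostgreSQL code, and the helper names, file names, and slot layout are assumptions chosen for illustration. It shows the ordering a doublewrite path has to respect, not how to implement one inside a database.&lt;/p&gt;

```python
import os
import tempfile

PAGE_SIZE = 8192  # PostgreSQL's default block size


def write_all(fd, data, offset):
    """Loop on pwrite until the kernel has accepted every byte.

    Accepted is not durable: the bytes may still live only in the page cache.
    """
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def protected_page_write(dwb_fd, data_fd, slot, block, page):
    """Write one page through a doublewrite slot, returning the ordered steps."""
    steps = []
    write_all(dwb_fd, page, slot * PAGE_SIZE)
    steps.append("dwb write")        # buffered copy, not yet a recovery source
    os.fsync(dwb_fd)
    steps.append("dwb fsync")        # recovery copy is now durable
    write_all(data_fd, page, block * PAGE_SIZE)
    steps.append("data write")       # destination may still tear on crash
    os.fsync(data_fd)
    steps.append("data fsync")       # final page durable
    steps.append("slot reusable")    # only now may the slot be overwritten
    return steps


def demo():
    """Run the protocol against two scratch files and read the page back."""
    page = bytes([7]) * PAGE_SIZE
    with tempfile.TemporaryDirectory() as tmp:
        flags = os.O_RDWR | os.O_CREAT
        dwb_fd = os.open(os.path.join(tmp, "dwb"), flags, 0o600)
        data_fd = os.open(os.path.join(tmp, "relation"), flags, 0o600)
        try:
            steps = protected_page_write(dwb_fd, data_fd, slot=0, block=3, page=page)
            stored = os.pread(data_fd, PAGE_SIZE, 3 * PAGE_SIZE)
        finally:
            os.close(dwb_fd)
            os.close(data_fd)
    return steps, stored == page


if __name__ == "__main__":
    print(demo())
```

&lt;p&gt;Even here, &lt;code&gt;os.fsync&lt;/code&gt; is only as strong as the filesystem and device beneath it, and issuing it per page is exactly the latency amplifier the table warns about; the sketch fixes the ordering, not the cost.&lt;/p&gt;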
&lt;h2 id=&quot;a-crash-state-contract-for-double-write-buffering&quot;&gt;A Crash-State Contract for Double Write Buffering&lt;/h2&gt;
&lt;p&gt;The right design starts with crash states, not code generation. If the system crashes at any boundary, recovery must still have at least one complete page image with a known log sequence number, or LSN. Anything less is wishful thinking with structs.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dirty[dirty PostgreSQL buffer — page LSN known] --&gt; WAL[WAL record — optional full page image]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dirty --&gt; DWBWrite[DWB slot write — buffered copy]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DWBWrite --&gt; DWBFlush[DWB file fsync — durable recovery copy]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DWBFlush --&gt; DataWrite[tablespace write — page cache accepted]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DataWrite --&gt; DataFlush[tablespace fsync — final page durable]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DataFlush --&gt; Reclaim[DWB slot reclaim — safe reuse]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WAL --&gt; Recovery[crash recovery — choose trusted image]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DWBFlush --&gt; Recovery&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DataFlush --&gt; Recovery&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The invariant is narrow:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;State&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;DWB slot reusable?&lt;/th&gt;&lt;th&gt;Recovery source&lt;/th&gt;&lt;th&gt;Reason&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Before DWB fsync&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;No&lt;/td&gt;&lt;td&gt;WAL full page image&lt;/td&gt;&lt;td&gt;DWB copy may not exist after power loss&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;After DWB fsync, before tablespace write&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;No&lt;/td&gt;&lt;td&gt;DWB or WAL&lt;/td&gt;&lt;td&gt;DWB copy is durable, destination is old&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;After tablespace write, before tablespace fsync&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;No&lt;/td&gt;&lt;td&gt;DWB&lt;/td&gt;&lt;td&gt;Destination may be stale or torn&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;After tablespace fsync&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Yes&lt;/td&gt;&lt;td&gt;Tablespace&lt;/td&gt;&lt;td&gt;Final copy is durable through the filesystem boundary&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;After checkpoint and slot reclaim&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Yes&lt;/td&gt;&lt;td&gt;Tablespace plus WAL from checkpoint&lt;/td&gt;&lt;td&gt;Recovery no longer depends on that DWB slot&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;That table is the design. The implementation follows from it.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Keep &lt;code&gt;full_page_writes=on&lt;/code&gt; while developing the DWB path.&lt;/p&gt;
&lt;p&gt;A prototype that disables full page writes before proving DWB recovery has removed PostgreSQL’s existing safety net. PostgreSQL’s documented default is &lt;code&gt;full_page_writes=on&lt;/code&gt;, and the reason is exactly torn-page recovery after OS crashes. The first implementation should run DWB as a redundant mechanism, then compare recovery decisions against WAL.&lt;/p&gt;
&lt;p&gt;Verification: after crash recovery, report every page where WAL full page image and DWB recovery would have chosen different page contents or LSNs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Treat DWB slot state as a durability state machine.&lt;/p&gt;
&lt;p&gt;A slot is not “free” after the page is copied. It is not free after the destination &lt;code&gt;write()&lt;/code&gt;. It is free only after the destination relation file has been synced past the page’s write. That requires at least: relation identifier, fork, block number, page LSN, DWB slot identifier, DWB fsync generation, and destination fsync generation.&lt;/p&gt;
&lt;p&gt;Verification: inject crashes at each transition and assert that no slot with &lt;code&gt;tablespace_fsync_lsn &amp;#x3C; page_lsn&lt;/code&gt; is reused.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Batch fsyncs around files, not pages.&lt;/p&gt;
&lt;p&gt;A naive per-page &lt;code&gt;fsync(dwb_fd)&lt;/code&gt; will collapse throughput on ordinary SSDs and will be worse still on network block devices, where every flush also pays a network round trip. The DWB write path needs group commit semantics: append many page copies to DWB storage, issue one durable flush, then schedule destination writes. The destination side also needs file-level fsync grouping by relation segment, because PostgreSQL relations are spread across segment files.&lt;/p&gt;
&lt;p&gt;Verification: expose counters for pages per DWB fsync, relation files per destination fsync batch, p50 and p99 fsync latency, and backend buffer eviction waits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Move synchronous work out of &lt;code&gt;FlushBuffer()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;FlushBuffer()&lt;/code&gt; is the wrong abstraction boundary for the whole protocol. It can mark that a page needs protection, enqueue the copy, and coordinate state. It should not become a per-page durability transaction. PostgreSQL already separates WAL writer, background writer, and checkpointer roles; a DWB design needs a manager that coordinates DWB slots, DWB fsync completion, destination writes, and reclaim.&lt;/p&gt;
&lt;p&gt;Verification: run write-heavy workloads with &lt;code&gt;bgwriter_lru_maxpages&lt;/code&gt;, &lt;code&gt;checkpoint_timeout&lt;/code&gt;, &lt;code&gt;checkpoint_completion_target&lt;/code&gt;, and &lt;code&gt;checkpoint_flush_after&lt;/code&gt; visible in logs; confirm that backend writes do not spike when DWB workers are saturated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Make recovery distrustful by default.&lt;/p&gt;
&lt;p&gt;During startup, recovery must validate DWB records by checksum, relation identity, block number, page LSN, and DWB fsync generation. A DWB record without proof of durable completion is a hint, not a recovery source. PostgreSQL page checksums, when enabled, help detect torn pages, but detection is not repair.&lt;/p&gt;
&lt;p&gt;Verification: corrupt DWB records, destination pages, and WAL records independently in test images; recovery must either repair from a proven source or fail loudly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Test against the actual storage stack.&lt;/p&gt;
&lt;p&gt;PostgreSQL deployments differ by &lt;code&gt;wal_sync_method&lt;/code&gt;, filesystem, cloud block device, hypervisor cache mode, RAID controller cache, and mount options. PostgreSQL documents several WAL sync methods, including &lt;code&gt;fdatasync&lt;/code&gt;, &lt;code&gt;fsync&lt;/code&gt;, &lt;code&gt;open_sync&lt;/code&gt;, and &lt;code&gt;open_datasync&lt;/code&gt;; Linux is not the whole production universe. The DWB claim is only meaningful on the stack where it is measured.&lt;/p&gt;
&lt;p&gt;Verification: repeat crash-injection tests on the production-like filesystem and block layer, including VM-level kill, host reboot where available, and forced process termination.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
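&lt;p&gt;The slot rule from the list above can be sketched as a small state machine. This is an illustrative Python model, not PostgreSQL code: &lt;code&gt;SlotState&lt;/code&gt;, &lt;code&gt;DwbSlot&lt;/code&gt;, and the transition table are hypothetical names, and LSNs are plain integers. The only thing it tries to show is that reclaim is gated on the destination fsync, not on the destination write.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum, auto


class SlotState(Enum):
    FREE = auto()
    COPIED = auto()        # page image written to the DWB file, not yet fsynced
    DWB_DURABLE = auto()   # a DWB fsync covering this slot has returned
    DEST_WRITTEN = auto()  # destination write() issued; may still be in page cache
    DEST_DURABLE = auto()  # destination segment fsynced after that write


# The only legal forward transitions. Reclaim (back to FREE) is handled
# separately because it needs an extra durability check.
NEXT_STATE = {
    SlotState.FREE: SlotState.COPIED,
    SlotState.COPIED: SlotState.DWB_DURABLE,
    SlotState.DWB_DURABLE: SlotState.DEST_WRITTEN,
    SlotState.DEST_WRITTEN: SlotState.DEST_DURABLE,
}


@dataclass
class DwbSlot:
    slot_id: int
    page_lsn: int
    state: SlotState = SlotState.FREE

    def advance(self, target: SlotState) -> None:
        """Move one step forward; skipping a step is a protocol bug."""
        if NEXT_STATE.get(self.state) is not target:
            raise RuntimeError(
                f"illegal transition {self.state.name} to {target.name}"
            )
        self.state = target

    def reclaim(self, tablespace_fsync_lsn: int) -> None:
        """Free the slot only once the destination is provably durable."""
        durable = (
            self.state is SlotState.DEST_DURABLE
            and tablespace_fsync_lsn >= self.page_lsn
        )
        if not durable:
            raise RuntimeError(f"slot {self.slot_id} is still a recovery source")
        self.state = SlotState.FREE
```

&lt;p&gt;A crash-injection harness would kill the process between every pair of adjacent states and assert that &lt;code&gt;reclaim()&lt;/code&gt; was never reachable before &lt;code&gt;DEST_DURABLE&lt;/code&gt;. A real implementation would also carry the relation identifier, fork, block number, and fsync generations listed above; they are left out here to keep the sketch short.&lt;/p&gt;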
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The public evidence points in one direction: the prototype failed because it copied an algorithm without copying the assumptions that make the algorithm true.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Evidence&lt;/th&gt;&lt;th&gt;Type&lt;/th&gt;&lt;th&gt;Engineering implication&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;InnoDB documents the doublewrite buffer as a separate area written before pages reach their final data-file positions&lt;/td&gt;&lt;td&gt;Public documented design&lt;/td&gt;&lt;td&gt;The protection comes from write ordering plus recovery lookup, not from an extra copy alone&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL documents &lt;code&gt;full_page_writes&lt;/code&gt; as writing the entire disk page to WAL on first modification after checkpoint&lt;/td&gt;&lt;td&gt;Public documented design&lt;/td&gt;&lt;td&gt;PostgreSQL’s trust boundary is WAL durability, not destination data-file durability&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL documents &lt;code&gt;wal_sync_method&lt;/code&gt; choices and warns that crash-safe configuration depends on system configuration&lt;/td&gt;&lt;td&gt;Public documented design&lt;/td&gt;&lt;td&gt;A DWB replacement must be validated under the configured sync method and storage layer&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Linux documents &lt;code&gt;SYNC_FILE_RANGE_WRITE&lt;/code&gt; as asynchronous and “not suitable for data integrity operations”&lt;/td&gt;&lt;td&gt;System behavior&lt;/td&gt;&lt;td&gt;Code that treats it as a durability boundary is wrong even if smoke tests pass&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL checkpoint settings include &lt;code&gt;checkpoint_flush_after&lt;/code&gt;, which attempts to push dirty data to storage to reduce later stalls&lt;/td&gt;&lt;td&gt;System behavior&lt;/td&gt;&lt;td&gt;PostgreSQL already distinguishes writeback pressure from confirmed persistence&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;JIN’s Claude Code experiment compiled and passed basic smoke tests before semantic review exposed the DWB flaw&lt;/td&gt;&lt;td&gt;Documented source experiment&lt;/td&gt;&lt;td&gt;Build 
success is not evidence of crash-state correctness&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The deeper point is that storage correctness is usually hidden behind boring verbs: write, flush, sync, checkpoint, recover. Those verbs are not portable across systems.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;write()&lt;/code&gt; to a regular file usually means “the kernel accepted bytes.” It does not mean “the bytes survived power loss.” &lt;code&gt;sync_file_range()&lt;/code&gt; can start writeback and can be useful for reducing dirty-page backlog, but the Linux man page explicitly separates that from data integrity. &lt;code&gt;fsync()&lt;/code&gt; is closer to the boundary PostgreSQL recovery cares about, but even then the real guarantee depends on the filesystem, block device, drive cache behavior, and whether the stack lies about flush completion.&lt;/p&gt;
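&lt;p&gt;A minimal sketch of that distinction, using Python’s standard &lt;code&gt;os&lt;/code&gt; module rather than PostgreSQL’s own I/O layer. The function name &lt;code&gt;write_page&lt;/code&gt; and the two return labels are made up for illustration; the point is that the caller should know which guarantee it is holding when the call returns.&lt;/p&gt;

```python
import os
import tempfile


def write_page(path: str, page: bytes, durable: bool) -> str:
    """Write one page and report which guarantee the caller actually holds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, page)
        if written != len(page):
            raise OSError("short write; a real writer would loop")
        if not durable:
            # The kernel has accepted the bytes. Nothing more is promised.
            return "accepted-by-kernel"
        # fsync() returning is the earliest point at which recovery may rely
        # on the page, and only if the layers below honor the flush.
        os.fsync(fd)
        return "fsync-returned"
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "page.bin")
        print(write_page(target, b"\x00" * 8192, durable=False))
        print(write_page(target, b"\x01" * 8192, durable=True))
```

&lt;p&gt;Both calls leave identical bytes in the file on a healthy machine, which is why a smoke test cannot tell them apart. Only a crash between the write and the flush can.&lt;/p&gt;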
&lt;p&gt;This is exactly where AI-assisted systems work becomes dangerous. The model sees an InnoDB pattern:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;InnoDB-looking step&lt;/th&gt;&lt;th&gt;What the AI can reproduce&lt;/th&gt;&lt;th&gt;What it may miss&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Copy page to DWB&lt;/td&gt;&lt;td&gt;Buffer allocation and file write&lt;/td&gt;&lt;td&gt;Whether the copy is durable before final overwrite&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Flush DWB&lt;/td&gt;&lt;td&gt;Call a function with “flush” in the name&lt;/td&gt;&lt;td&gt;Whether the function is advisory or a persistence barrier&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Write destination page&lt;/td&gt;&lt;td&gt;&lt;code&gt;smgrwrite()&lt;/code&gt; or equivalent call&lt;/td&gt;&lt;td&gt;Whether the write reached media or page cache&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Reclaim slot&lt;/td&gt;&lt;td&gt;Free-list manipulation&lt;/td&gt;&lt;td&gt;Whether recovery still depends on that slot&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Disable FPW&lt;/td&gt;&lt;td&gt;Config change or branch bypass&lt;/td&gt;&lt;td&gt;Whether WAL still has a complete first-touch page image&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;That is not a PostgreSQL-only lesson. The same failure shape appears when agents generate Kafka consumers without understanding offset commit semantics, Kubernetes controllers without understanding finalizers, S3 pipelines without understanding what the consistency model does and does not guarantee, or distributed locks without understanding fencing tokens. The API name is the shallow part. The recovery contract is the system.&lt;/p&gt;
&lt;p&gt;For this specific DWB design, I have not run the patch at production scale personally. The documented failure mode is enough to reject the architecture as described: if a DWB slot is reused after a buffered destination write but before a confirmed destination fsync, a crash can leave no durable complete image outside WAL. If full page writes have also been disabled, PostgreSQL’s documented repair mechanism has been removed.&lt;/p&gt;
&lt;p&gt;The most deceptive benchmark would be a clean-shutdown write throughput test. It might show lower WAL volume and acceptable latency because it never exercises the crash boundary. A correct benchmark has to kill the database and the machine at controlled points: before DWB fsync, after DWB fsync, after destination write, before destination fsync, after destination fsync, and during checkpoint. Then it has to verify page checksums, page LSNs, WAL replay behavior, and DWB reclaim metadata. Anything else is testing formatting.&lt;/p&gt;
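&lt;p&gt;One way to make that crash matrix concrete is to enumerate the states and ask, for each, which copy recovery may trust. The sketch below is a deliberately simplified model under stated assumptions: &lt;code&gt;full_page_writes&lt;/code&gt; is off, each page has four boolean facts at crash time, and the function and label names are hypothetical. It does not model checkpoints or torn DWB writes.&lt;/p&gt;

```python
from itertools import product
from typing import Optional


def recovery_source(dwb_fsynced: bool, dest_written: bool,
                    dest_fsynced: bool, slot_reused: bool) -> Optional[str]:
    """Return the copy recovery may trust, or None if no proven copy exists.

    Assumes full_page_writes is off, so WAL holds no full page image.
    """
    if dest_fsynced:
        return "destination page"
    if not dest_written:
        # The old destination page was never overwritten, so it cannot be torn.
        return "previous destination page plus WAL replay"
    if dwb_fsynced and not slot_reused:
        return "DWB copy"
    # The destination overwrite is only buffered and no durable DWB copy remains.
    return None


def unsafe_states():
    """List flag combinations that leave a page with no recovery source."""
    bad = []
    for dwb, written, synced, reused in product([False, True], repeat=4):
        if synced and not written:
            continue  # cannot fsync a write that was never issued
        if recovery_source(dwb, written, synced, reused) is None:
            bad.append((dwb, written, synced, reused))
    return bad


if __name__ == "__main__":
    for state in unsafe_states():
        print(dict(zip(
            ("dwb_fsynced", "dest_written", "dest_fsynced", "slot_reused"),
            state,
        )))
```

&lt;p&gt;The combination with a durable DWB copy, a buffered destination write, and a reused slot is the failure described above. A correct implementation has to make every state this function flags unreachable, and the crash tests exist to prove that it did.&lt;/p&gt;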
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;DWB slot reused too early&lt;/td&gt;&lt;td&gt;Slot freed after &lt;code&gt;smgrwrite()&lt;/code&gt; or &lt;code&gt;sync_file_range()&lt;/code&gt; instead of after destination &lt;code&gt;fsync()&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Track destination fsync generation per relation segment and reclaim only when &lt;code&gt;tablespace_fsync_lsn &gt;= page_lsn&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;WAL safety removed before DWB is proven&lt;/td&gt;&lt;td&gt;&lt;code&gt;full_page_writes=off&lt;/code&gt; during prototype or benchmark runs&lt;/td&gt;&lt;td&gt;Run DWB in shadow mode first; compare recovery choices against WAL full page images&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BgWriter stalls under durability work&lt;/td&gt;&lt;td&gt;Per-page DWB fsync inside dirty buffer eviction path&lt;/td&gt;&lt;td&gt;Use DWB workers, group commit, and file-level batching outside the critical buffer eviction path&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Checkpoint I/O becomes spiky&lt;/td&gt;&lt;td&gt;DWB backlog prevents pages from becoming safely reclaimable before checkpoint pressure rises&lt;/td&gt;&lt;td&gt;Coordinate DWB manager with checkpointer progress and expose backlog metrics tied to checkpoint cycles&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Advisory flush mistaken for crash safety&lt;/td&gt;&lt;td&gt;Linux &lt;code&gt;sync_file_range()&lt;/code&gt; or PostgreSQL writeback hints treated as persistence&lt;/td&gt;&lt;td&gt;Reserve advisory writeback for latency smoothing; require &lt;code&gt;fsync&lt;/code&gt;, &lt;code&gt;fdatasync&lt;/code&gt;, or platform-equivalent durability boundary&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Storage stack changes invalidate assumptions&lt;/td&gt;&lt;td&gt;Moving from local NVMe to EBS, Azure managed disks, GCP Persistent Disk, ZFS, ext4, XFS, or a 
controller with volatile cache&lt;/td&gt;&lt;td&gt;Certify the crash matrix per production stack and keep the result with the deployment profile&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Recovery accepts stale DWB records&lt;/td&gt;&lt;td&gt;DWB metadata lacks relation identity, block number, checksum, page LSN, or fsync generation&lt;/td&gt;&lt;td&gt;Validate DWB records as recovery artifacts; reject ambiguous records loudly&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Benchmark hides corruption&lt;/td&gt;&lt;td&gt;Tests use clean shutdown, process kill only, or no filesystem fault injection&lt;/td&gt;&lt;td&gt;Add power-loss style crash testing and page verification after replay&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: AI-generated systems code can preserve code shape while breaking the durability, scheduling, and recovery contracts underneath it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Review infrastructure patches by crash-state matrix first, then by code diff.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A PostgreSQL DWB design is not credible until every page state between DWB write, DWB fsync, destination write, destination fsync, checkpoint, and slot reclaim has a verified recovery source.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, take one AI-generated infrastructure patch and write its hidden contract table: API call, assumed guarantee, actual guarantee, failure if the assumption is false.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The hard part of storage engineering is not making the second write happen; it is knowing exactly which copy the system is allowed to trust after the lights come back on.&lt;/p&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>failures</category></item><item><title>Natural Language SQL Agents Need Database Guardrails</title><link>https://rajivonai.com/blog/2025-07-26-natural-language-sql-agents-need-database-guardrails/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-07-26-natural-language-sql-agents-need-database-guardrails/</guid><description>The risk in a natural-language SQL agent is not bad SQL — it is authority compilation: a user sentence becomes a database operation unless the control plane proves, before execution, which role, rows, cost, and columns the query is allowed to touch.</description><pubDate>Sat, 26 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The dangerous part of a natural-language SQL agent is not bad SQL. It is authority compilation: a sentence from a user becomes a database operation unless the system proves, before execution, which role, rows, columns, cost, endpoint, and business definitions the query is allowed to touch.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL chat agents are moving from demos into operational workflows: fraud review, support analytics, compliance pulls, finance close checks, customer health reports. The production pattern is not the chat interface. It is the control plane around database authority.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Default approach&lt;/th&gt;&lt;th&gt;Production approach&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Prompt goes to LLM, LLM writes SQL, workflow runs it&lt;/td&gt;&lt;td&gt;Prompt becomes an authorized analytical request, SQL is generated, parsed, bounded, executed, audited, and summarized&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Agent connects as a broad application user&lt;/td&gt;&lt;td&gt;Agent connects through a read-only role scoped to curated views&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Safety lives in prompt instructions&lt;/td&gt;&lt;td&gt;Safety lives in PostgreSQL privileges, row-level security, SQL parsing, timeouts, execution policy, and audit records&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Results are trusted because the query ran&lt;/td&gt;&lt;td&gt;Results are checked against definitions, row counts, tenant scope, freshness, truncation, and expected shape&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;A workflow stack using Crafted AI Framework, n8n, CopilotKit, Supabase, Slack, and PostgreSQL can be useful. The source pattern is attractive: natural-language request, generated PostgreSQL query, n8n workflow execution, CopilotKit-style summarization, and delivery to a UI or channel.&lt;/p&gt;
&lt;p&gt;That is the easy part.&lt;/p&gt;
&lt;p&gt;The harder question is: what happens when the user asks a plausible question that maps to an expensive, unauthorized, stale, or semantically wrong query?&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Natural-language SQL fails in production because language is flexible and databases are literal. “Show anomalous transactions in Q3” sounds harmless until the agent scans a large event table on the primary writer, omits the tenant predicate, reads restricted columns through broad credentials, and sends a confident summary to Slack.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL role design&lt;/td&gt;&lt;td&gt;Agent connects as an app owner, migration user, Supabase service role, or another role with broad grants&lt;/td&gt;&lt;td&gt;&lt;code&gt;SELECT&lt;/code&gt; becomes only the visible part of authority; the same credentials may read sensitive columns, bypass RLS, or run write statements&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SQL generation&lt;/td&gt;&lt;td&gt;LLM emits &lt;code&gt;SELECT *&lt;/code&gt;, missing tenant filters, broad joins, ambiguous dates, unbounded detail queries, or &lt;code&gt;ORDER BY&lt;/code&gt; on non-indexed expressions&lt;/td&gt;&lt;td&gt;A syntactically valid query can be operationally wrong, expensive, or unauthorized&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL planner behavior&lt;/td&gt;&lt;td&gt;A generated query can choose a sequential scan, hash join, nested loop, or large sort based on predicates and statistics&lt;/td&gt;&lt;td&gt;The agent does not know that its “simple report” just became an OLTP workload problem&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Row-level security&lt;/td&gt;&lt;td&gt;Policies apply only when enabled and evaluated for the role actually executing the query&lt;/td&gt;&lt;td&gt;Authorization bugs move from application code into database policy, where silent under-filtering is easy to miss&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Workflow automation&lt;/td&gt;&lt;td&gt;Webhooks, schedules, and retries repeatedly trigger the same bad query&lt;/td&gt;&lt;td&gt;A single bad prompt becomes recurring workload&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Result summarization&lt;/td&gt;&lt;td&gt;CopilotKit or another summarizer compresses rows into prose&lt;/td&gt;&lt;td&gt;The final answer can hide missing filters, partial results, timeout truncation, replica lag, or policy 
caveats&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core question is not “Can the agent write SQL?” The core question is “Can the system prove that the generated SQL is authorized, bounded, explainable, and cheap enough to run before PostgreSQL sees it?”&lt;/p&gt;
&lt;h2 id=&quot;architecture-problem&quot;&gt;Architecture Problem&lt;/h2&gt;
&lt;p&gt;The architectural tension is that natural language and database authority operate on incompatible principles.&lt;/p&gt;
&lt;p&gt;Natural language is designed to be flexible, contextual, and forgiving. “Show me the risky transactions last quarter” is meaningful to a human even without knowing which table, which column definition of risk, which fiscal calendar, which tenant, or how expensive the query is. The speaker expects the listener to resolve ambiguity gracefully.&lt;/p&gt;
&lt;p&gt;Database authority is designed to be precise, bounded, and unforgiving. PostgreSQL does not interpret intent. It executes exactly what it receives: the role determines what can be read, the SQL determines what is read, and once a query runs, the cost and data exposure have already occurred.&lt;/p&gt;
&lt;p&gt;A naive SQL agent architecture collapses these two systems directly: user text goes to a model, the model emits SQL, and that SQL runs. This architecture fails in production not because the model is incompetent but because the authority boundary is wrong. The model is solving a language problem. The authority problem requires a different layer.&lt;/p&gt;
&lt;p&gt;The architecture problem is: &lt;strong&gt;how do you insert a control plane between language and authority that is narrow enough to be safe, without being so narrow that it is useless?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;design-options&quot;&gt;Design Options&lt;/h2&gt;
&lt;p&gt;Three common approaches exist, and each trades safety against capability differently.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Option&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Safety mechanism&lt;/th&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Prompt-only guardrails&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;LLM is instructed not to write dangerous queries&lt;/td&gt;&lt;td&gt;Model compliance&lt;/td&gt;&lt;td&gt;Any prompt injection, jailbreak, or training gap can bypass it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Application-layer validation&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Middleware checks SQL for banned patterns before execution&lt;/td&gt;&lt;td&gt;Regex and keyword matching&lt;/td&gt;&lt;td&gt;Multi-statement tricks, schema aliases, and edge-case syntax bypass string checks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Database-native boundaries + control plane&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;PostgreSQL role, RLS, views, parser gate, planner check, read-only execution, timeouts&lt;/td&gt;&lt;td&gt;Database engine and abstract syntax tree&lt;/td&gt;&lt;td&gt;Requires upfront investment; does not protect against slow but valid queries unless planner bounds are set&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Option A: Prompt-only&lt;/strong&gt; is appropriate for demos and internal low-risk tools where the SQL touches only non-sensitive read data and the blast radius of a wrong query is low. It should never be used in production with customer data, production credentials, or any write path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option B: Application-layer validation&lt;/strong&gt; adds a middleware filter that scans SQL for &lt;code&gt;DROP&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, and similar keywords. This is stronger than a prompt, but still weak: PostgreSQL syntax has too many legitimate variations and aliases to reliably block dangerous patterns with strings. String-based SQL validation fails open under adversarial pressure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option C: Database-native + control plane&lt;/strong&gt; is the only production-grade approach. It removes the reliance on model compliance and string matching by enforcing authority at layers that model output cannot override: the PostgreSQL role model, a parser gate that inspects the query tree, the transaction mode, and the execution endpoint.&lt;/p&gt;
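&lt;p&gt;To see why Option B is weak, here is a small sketch of a keyword filter and a few statements it misjudges. The banned-word list, the function name, and the tables &lt;code&gt;scratch_copy&lt;/code&gt; and &lt;code&gt;audit_events&lt;/code&gt; are hypothetical; the SQL strings are only matched as text and never executed.&lt;/p&gt;

```python
import re

# A typical banned-keyword list for a string-matching gate (illustrative).
BANNED = re.compile(r"\b(DROP|DELETE|INSERT|UPDATE|ALTER)\b", re.IGNORECASE)


def keyword_filter_allows(sql: str) -> bool:
    """Naive Option B gate: allow anything without a banned keyword."""
    return BANNED.search(sql) is None


# Statements that change state or disrupt the server without a banned keyword.
SLIPS_THROUGH = [
    # SELECT ... INTO creates a new table in PostgreSQL.
    "SELECT * INTO scratch_copy FROM customers",
    # Destructive, but simply not on the list.
    "TRUNCATE payments",
    # Ends other sessions if the connecting role is permitted to signal them.
    "SELECT pg_terminate_backend(pid) FROM pg_stat_activity",
]

# A harmless read that the filter rejects because of a string literal.
FALSE_POSITIVE = "SELECT count(*) FROM audit_events WHERE action = 'DELETE'"


if __name__ == "__main__":
    for statement in SLIPS_THROUGH:
        print("allowed:", keyword_filter_allows(statement), "|", statement)
    print("allowed:", keyword_filter_allows(FALSE_POSITIVE), "|", FALSE_POSITIVE)
```

&lt;p&gt;Extending the list fixes individual cases but not the shape of the problem. A role that holds only &lt;code&gt;SELECT&lt;/code&gt; on curated views, running inside a read-only transaction, cannot truncate or create tables regardless of what text the model produces.&lt;/p&gt;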
&lt;h2 id=&quot;tradeoff-matrix&quot;&gt;Tradeoff Matrix&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dimension&lt;/th&gt;&lt;th&gt;Prompt-only&lt;/th&gt;&lt;th&gt;App-layer validation&lt;/th&gt;&lt;th&gt;Database-native control plane&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Setup time&lt;/td&gt;&lt;td&gt;Minutes&lt;/td&gt;&lt;td&gt;Hours&lt;/td&gt;&lt;td&gt;Days&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Authority enforcement&lt;/td&gt;&lt;td&gt;Model compliance only&lt;/td&gt;&lt;td&gt;Partial — string matching&lt;/td&gt;&lt;td&gt;Database engine — cannot be bypassed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Write protection&lt;/td&gt;&lt;td&gt;Advisory&lt;/td&gt;&lt;td&gt;Partial&lt;/td&gt;&lt;td&gt;Enforced&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PII exposure risk&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;Partial&lt;/td&gt;&lt;td&gt;Low — views and column grants&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Load isolation&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;Enforced by endpoint routing and timeouts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Prompt injection resistance&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;td&gt;High — model output cannot grant authority&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Compliance defensibility&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;td&gt;High — role grants and RLS are auditable&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Right for&lt;/td&gt;&lt;td&gt;Demos, internal tools&lt;/td&gt;&lt;td&gt;Low-risk read workflows&lt;/td&gt;&lt;td&gt;Customer data, production, regulated contexts&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;build-a-sql-agent-control-plane&quot;&gt;Build a SQL Agent Control Plane&lt;/h2&gt;
&lt;p&gt;The right architecture puts the LLM behind a policy boundary. The model may propose SQL. It does not decide whether the SQL is safe.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    User[User question] --&gt; Intake[request intake — identity and purpose]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Intake --&gt; Catalog[semantic catalog — approved metrics and views]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Catalog --&gt; Generator[LLM SQL generator]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Generator --&gt; Parser[SQL parser — inspect query tree]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Parser --&gt; Policy[policy gate — tables columns tenant and limits]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Policy --&gt;|approved query| Planner[PostgreSQL explain check]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Policy --&gt;|rejected query| Repair[repair prompt with policy error]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Repair --&gt; Generator&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Planner --&gt;|acceptable cost| Replica[read replica or analytics endpoint]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Planner --&gt;|too expensive| Reject[reject with safer query shape]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Replica --&gt; Validator[result validator — shape and scope]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Validator --&gt; Summarizer[LLM report composer]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Summarizer --&gt; Delivery[Slack email dashboard or UI]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Validator --&gt; Audit[audit log — prompt query user result metadata]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The architecture has six controls. Skip any one of them and the agent has more authority than you think.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Constrain the data surface before prompting the model.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Do not expose base tables such as &lt;code&gt;transactions&lt;/code&gt;, &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, or &lt;code&gt;payments&lt;/code&gt; directly. Create approved views such as &lt;code&gt;analytics_agent.agent_fraud_transactions_v1&lt;/code&gt; and &lt;code&gt;analytics_agent.agent_customer_activity_daily_v1&lt;/code&gt;. These views should encode allowed columns, masking rules, joins, freshness expectations, and business definitions such as “high-risk country” or “Q3 fiscal calendar.”&lt;/p&gt;
&lt;p&gt;A useful view is boring on purpose:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; analytics_agent;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; VIEW&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.agent_fraud_transactions_v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (security_barrier &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;tenant_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transaction_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;user_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;amount_cents&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transaction_at&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;destination_country&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    rc&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;risk_level&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    rc&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;definition_version&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; risk_definition_version&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transactions&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;risk_countries&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; rc&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; rc&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;country_code&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;destination_country&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;deleted_at&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;PostgreSQL &lt;code&gt;security_barrier&lt;/code&gt; views matter because user-supplied predicates are not always innocent: a function in a caller’s &lt;code&gt;WHERE&lt;/code&gt; clause that raises an error or logs its arguments can reveal rows the view was meant to hide if it runs before the view’s own filter. PostgreSQL documents that view conditions are evaluated before user-added conditions for security-barrier views, with the caveat that functions marked &lt;code&gt;LEAKPROOF&lt;/code&gt; may still be evaluated early (&lt;a href=&quot;https://www.postgresql.org/docs/16/sql-createview.html&quot;&gt;PostgreSQL 16 CREATE VIEW&lt;/a&gt;). That does not make a view a complete security system, but it makes predicate ordering part of the access design instead of an accident.&lt;/p&gt;
&lt;p&gt;Verification:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; grantee, table_schema, table_name, privilege_type&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;role_table_grants&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; grantee &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;agent_reader&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; table_schema, table_name, privilege_type;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then connect as the runtime role and confirm it has &lt;code&gt;SELECT&lt;/code&gt; only on approved views:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$AGENT_DATABASE_URL&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;\dp analytics_agent.*&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use PostgreSQL privileges and RLS as the first hard boundary.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PostgreSQL row-level security restricts which rows are visible once row security is enabled. The documentation also states that table owners normally bypass row security unless &lt;code&gt;FORCE ROW LEVEL SECURITY&lt;/code&gt; is set, and roles with &lt;code&gt;BYPASSRLS&lt;/code&gt; bypass it (&lt;a href=&quot;https://www.postgresql.org/docs/16/ddl-rowsecurity.html&quot;&gt;PostgreSQL 16 RLS&lt;/a&gt;). Supabase has the same operational warning in another form: service keys can bypass RLS and should not be exposed to customers or browsers (&lt;a href=&quot;https://supabase.com/docs/guides/database/postgres/row-level-security&quot;&gt;Supabase RLS docs&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;For agent access, ownership, application runtime, and agent querying should be separate roles:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_reader NOLOGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LOGIN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; PASSWORD&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;use-secret-manager&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_reader &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;REVOKE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ALL &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; app &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_reader;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;REVOKE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ALL &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ALL TABLES &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; app &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_reader;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; USAGE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; analytics_agent &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_reader;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_reader;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; statement_timeout &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;5s&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; lock_timeout&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;500ms&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idle_in_transaction_session_timeout &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;10s&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; default_transaction_read_only &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; on&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; work_mem &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;16MB&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If tenant isolation is handled through RLS or session context, test the exact runtime role:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; READ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ONLY;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LOCAL&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;tenant_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;42&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tenant_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; current_setting(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;app.tenant_id&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;bigint&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;COMMIT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification should compare at least three perspectives: table owner, application role, and agent role. The agent role is the one that matters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Parse generated SQL before execution.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A regex that blocks &lt;code&gt;DELETE&lt;/code&gt; is theater. Parse the query into an abstract syntax tree and inspect statement type, referenced relations, selected columns, functions, joins, predicates, &lt;code&gt;LIMIT&lt;/code&gt;, comments, and statement count. For PostgreSQL-specific syntax, use a parser tied to PostgreSQL grammar, such as &lt;code&gt;libpg_query&lt;/code&gt;, which exposes the PostgreSQL parser outside the server (&lt;a href=&quot;https://github.com/pganalyze/libpg_query&quot;&gt;pganalyze libpg_query&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The policy should reject multi-statement input before relying on database timeouts. PostgreSQL 16 documents that &lt;code&gt;statement_timeout&lt;/code&gt; applies to each statement in a simple-query message separately, whereas versions before PostgreSQL 13 usually applied it to the whole query string (&lt;a href=&quot;https://www.postgresql.org/docs/16/runtime-config-client.html&quot;&gt;PostgreSQL 16 client defaults&lt;/a&gt;). That version detail matters because each extra statement gets its own time budget, and a control plane that accepts &lt;code&gt;SELECT ...; DROP ...;&lt;/code&gt; and hopes a timeout saves it has already failed.&lt;/p&gt;
&lt;p&gt;The rejection suite should include at least these cases:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Write statement: not a SELECT.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DELETE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transactions&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tenant_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 42&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Base table outside the analytics_agent schema.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;customers&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Blocked columns.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; email, card_number&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Detail query with no LIMIT, tenant predicate, or date predicate.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; amount_cents &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sleep function: burns execution time, returns no data.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_sleep(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;30&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Multi-statement input.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transactions&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: every dangerous prompt should end in a rejection, not in a “best effort” repair that silently weakens the policy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Run planner checks before execution.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PostgreSQL &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt; returns the selected plan without executing the statement. PostgreSQL also notes that planner decisions depend on up-to-date &lt;code&gt;pg_statistic&lt;/code&gt; data (&lt;a href=&quot;https://www.postgresql.org/docs/16/sql-explain.html&quot;&gt;PostgreSQL 16 EXPLAIN&lt;/a&gt;). Treat planner checks as a guardrail, not as proof: the row and cost figures are estimates, and stale statistics make them wrong.&lt;/p&gt;
&lt;p&gt;Example policy:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;json&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;max_estimated_rows&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1000000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;max_total_cost&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;250000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;forbid_seq_scan_on&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    &quot;app.transactions&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    &quot;app.events&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    &quot;app.audit_log&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ],&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;require_limit_for_detail_queries&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;max_limit&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use &lt;code&gt;EXPLAIN&lt;/code&gt; without &lt;code&gt;ANALYZE&lt;/code&gt; in the preflight path. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; executes the statement, which defeats the purpose of a pre-execution gate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execute on isolated read capacity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Natural-language analytics should not run on the primary writer unless the dataset is small and the blast radius is understood. Amazon RDS documents PostgreSQL read replicas as read-only instances used to scale read traffic (&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.Replication.ReadReplicas.html&quot;&gt;RDS PostgreSQL read replicas&lt;/a&gt;). Aurora reader endpoints provide connection balancing for read-only connections across reader instances, with the caveat that if a cluster has no Aurora Replicas the reader endpoint connects to the primary instance (&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Endpoints.Reader.html&quot;&gt;Aurora reader endpoint&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Verification should be explicit:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW transaction_read_only;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_is_in_recovery();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
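&lt;p&gt;On an ordinary PostgreSQL physical replica, &lt;code&gt;pg_is_in_recovery()&lt;/code&gt; returns true. Because the runtime role already sets &lt;code&gt;default_transaction_read_only&lt;/code&gt;, &lt;code&gt;transaction_read_only&lt;/code&gt; on its own does not prove the session is on a replica. In managed services, also verify the endpoint label and deployment topology because the connection string is part of the architecture.&lt;/p&gt;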
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Make audit records useful for replay.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Logging “user asked a question” is not enough. A production audit record should let a reviewer reconstruct the request, policy decision, query, plan, execution boundary, and delivered answer.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;json&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;request_id&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;req_01j...&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;user_id&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;user_12345&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;tenant_id&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;42&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;source&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;copilot_ui&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;natural_language_prompt&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Show transactions over $10,000 in Q3 2025 for user 12345 and flag high-risk countries&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;semantic_definitions&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    &quot;quarter&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;calendar_quarter_v1&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    &quot;risk_country&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;risk_country_v2&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;generated_sql_hash&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;sha256:...&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;approved_sql_hash&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;sha256:...&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;referenced_relations&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;    &quot;analytics_agent.agent_fraud_transactions_v1&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ],&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;policy_decision&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;approved&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;policy_version&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;sql_agent_policy_2026_05_23&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;postgres_role&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;agent_runtime&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;execution_endpoint&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;reader&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;statement_timeout_ms&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;estimated_rows&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;840&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;returned_rows&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;3&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;result_truncated&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;false&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;replica_lag_ms&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1200&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;delivered_to&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;slack:fallback-review-channel&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A minimal guardrail policy looks like this:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Control&lt;/th&gt;&lt;th&gt;Example policy&lt;/th&gt;&lt;th&gt;Failure behavior&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Statement type&lt;/td&gt;&lt;td&gt;Allow one &lt;code&gt;SELECT&lt;/code&gt; statement only&lt;/td&gt;&lt;td&gt;Reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Relation access&lt;/td&gt;&lt;td&gt;Allow &lt;code&gt;analytics_agent.*&lt;/code&gt; views only&lt;/td&gt;&lt;td&gt;Reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Column access&lt;/td&gt;&lt;td&gt;Block raw &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;ssn&lt;/code&gt;, &lt;code&gt;card_number&lt;/code&gt;, &lt;code&gt;access_token&lt;/code&gt;, &lt;code&gt;address&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tenant scope&lt;/td&gt;&lt;td&gt;Require &lt;code&gt;tenant_id = current_setting(&apos;app.tenant_id&apos;)&lt;/code&gt; or enforce through RLS&lt;/td&gt;&lt;td&gt;Reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Row bound&lt;/td&gt;&lt;td&gt;Require &lt;code&gt;LIMIT &amp;#x3C;= 5000&lt;/code&gt; unless aggregate-only&lt;/td&gt;&lt;td&gt;Rewrite or reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Time bound&lt;/td&gt;&lt;td&gt;Require date predicate for event tables over 10 million rows&lt;/td&gt;&lt;td&gt;Reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Planner bound&lt;/td&gt;&lt;td&gt;Reject estimated rows over 1 million or total cost over policy threshold&lt;/td&gt;&lt;td&gt;Reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Execution bound&lt;/td&gt;&lt;td&gt;&lt;code&gt;READ ONLY&lt;/code&gt;, &lt;code&gt;statement_timeout&lt;/code&gt;, &lt;code&gt;lock_timeout&lt;/code&gt;, read endpoint&lt;/td&gt;&lt;td&gt;Cancel or reject&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Summary bound&lt;/td&gt;&lt;td&gt;Require row count, filter statement, definition versions, and truncation status&lt;/td&gt;&lt;td&gt;Withhold summary&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The uncomfortable detail: the LLM should not be asked to remember these controls. It should be allowed to fail against them.&lt;/p&gt;
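&lt;p&gt;One way to keep a control out of the prompt is to encode it as ordinary code that a candidate query either passes or fails. The sketch below covers only the row-bound row of the table. The names &lt;code&gt;apply_row_bound&lt;/code&gt; and &lt;code&gt;RowBoundDecision&lt;/code&gt;, and the choice to clamp rather than reject, are illustrative assumptions; the &lt;code&gt;limit&lt;/code&gt; value is assumed to come from a real SQL parser, not from string matching.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cap mirroring the "Row bound" row of the policy table.
MAX_ROWS = 5000


@dataclass(frozen=True)
class RowBoundDecision:
    action: str            # "pass" or "rewrite"
    limit: Optional[int]   # LIMIT the query will execute with, if any
    reason: str


def apply_row_bound(limit: Optional[int], aggregate_only: bool,
                    max_rows: int = MAX_ROWS) -> RowBoundDecision:
    """Decide what happens to a candidate query's LIMIT clause.

    `limit` is the value a SQL parser extracted (None when absent).
    """
    if aggregate_only:
        # Aggregate-only queries are exempt from the row cap.
        return RowBoundDecision("pass", limit, "aggregate-only query")
    if limit is None:
        return RowBoundDecision("rewrite", max_rows, "missing LIMIT; cap applied")
    if limit > max_rows:
        return RowBoundDecision("rewrite", max_rows, "LIMIT above cap; clamped")
    return RowBoundDecision("pass", limit, "LIMIT within cap")


print(apply_row_bound(None, aggregate_only=False))
print(apply_row_bound(500, aggregate_only=False))
```

&lt;p&gt;The model never sees &lt;code&gt;MAX_ROWS&lt;/code&gt;. If it forgets the limit, the gate applies one; a stricter deployment could return a rejection in the same place.&lt;/p&gt;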
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;This is not a private case study. It follows from documented PostgreSQL behavior, Supabase security guidance, and public cloud database design.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Documented behavior or decision&lt;/th&gt;&lt;th&gt;Production lesson&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL read-only transactions disallow &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;MERGE&lt;/code&gt;, DDL, &lt;code&gt;TRUNCATE&lt;/code&gt;, and other write-oriented commands, with documented exceptions and caveats (&lt;a href=&quot;https://www.postgresql.org/docs/15/sql-set-transaction.html&quot;&gt;PostgreSQL 15 SET TRANSACTION&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;A prompt instruction saying “never modify data” is weaker than a transaction mode that refuses write statements&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL RLS applies policies once row security is enabled, but table owners normally bypass row security unless forced, and &lt;code&gt;BYPASSRLS&lt;/code&gt; roles bypass it (&lt;a href=&quot;https://www.postgresql.org/docs/16/ddl-rowsecurity.html&quot;&gt;PostgreSQL 16 RLS&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Agent isolation belongs in the database role model, not only in application middleware&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Supabase service keys can bypass RLS and are intended for administrative server-side use, not exposed clients (&lt;a href=&quot;https://supabase.com/docs/guides/database/postgres/row-level-security&quot;&gt;Supabase RLS docs&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;A database agent should not run with Supabase service-role authority unless it is performing an explicitly administrative workflow&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL &lt;code&gt;security_barrier&lt;/code&gt; views affect when view predicates are evaluated relative to user-supplied predicates, with leakproof-function caveats (&lt;a href=&quot;https://www.postgresql.org/docs/16/sql-createview.html&quot;&gt;PostgreSQL 16 CREATE VIEW&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Curated views are not just developer convenience; 
they are part of the access boundary for agent-generated predicates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL &lt;code&gt;statement_timeout&lt;/code&gt; is measured from command arrival through completion and, since PostgreSQL 13, applies separately to each statement in a simple-query message (&lt;a href=&quot;https://www.postgresql.org/docs/16/runtime-config-client.html&quot;&gt;PostgreSQL 16 client defaults&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;The parser must reject multiple statements; timeout policy is not a substitute for statement-shape validation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; terminates sessions idle inside an open transaction, and the docs note that open transactions can prevent cleanup of recently dead tuples (&lt;a href=&quot;https://www.postgresql.org/docs/16/runtime-config-client.html&quot;&gt;PostgreSQL 16 client defaults&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;A chat workflow that starts a transaction and waits on an external LLM call can contribute to bloat if timeout policy is missing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Amazon RDS documents PostgreSQL read replicas as read-only instances for scaling read traffic (&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.Replication.ReadReplicas.html&quot;&gt;RDS PostgreSQL read replicas&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Analytical agent traffic should be isolated from the write path before recurring workflows depend on it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Aurora reader endpoints balance read-only connections across reader instances when replicas exist (&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Endpoints.Reader.html&quot;&gt;Aurora reader endpoint&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;The database endpoint is an architectural control, not a deployment detail&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;I have not run the exact Crafted AI Framework plus n8n plus CopilotKit stack at scale personally. The documented failure mode is still clear: any system that turns user language into PostgreSQL queries must defend against overbroad authority, expensive plans, ambiguous definitions, stale reads, and misleading summaries.&lt;/p&gt;
&lt;p&gt;The production pattern is to split &lt;strong&gt;query authoring&lt;/strong&gt; from &lt;strong&gt;query authority&lt;/strong&gt;. The LLM authors a candidate. PostgreSQL, the parser, the policy engine, and the workflow orchestrator decide whether that candidate deserves execution.&lt;/p&gt;
&lt;p&gt;For the source example, the user asks:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Show transactions over $10,000 in Q2 2025 for user ID 12345 and flag high-risk countries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A weak agent might produce this:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    t.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;risk_level&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions t&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; countries c &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;destination_country&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;country_code&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;user_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 12345&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;amount&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; BETWEEN&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2025-04-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2025-06-30&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;risk_level&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;high&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query should be rejected, even though it looks close. It references base tables, uses &lt;code&gt;SELECT *&lt;/code&gt;, relies on ambiguous money units, omits tenant binding, uses an inclusive date boundary on a likely timestamp column, relies on unversioned risk definitions, and has no explicit row bound.&lt;/p&gt;
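&lt;p&gt;The structural part of that rejection can be sketched as a function over facts a parser has already extracted. The &lt;code&gt;Candidate&lt;/code&gt; shape and the &lt;code&gt;violations&lt;/code&gt; helper are hypothetical, and the sketch deliberately ignores the semantic problems (money units, date boundaries, unversioned definitions), which need a catalog, not a parser.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List, Optional, Set

ALLOWED_SCHEMA = "analytics_agent"  # schema of approved views, per the policy table


@dataclass(frozen=True)
class Candidate:
    """Facts a real SQL parser would extract; this shape is hypothetical."""
    statement_types: List[str]
    relations: List[str]
    selects_star: bool
    predicate_columns: Set[str]
    limit: Optional[int]


def violations(c: Candidate) -> List[str]:
    """Return every structural policy violation; an empty list means pass."""
    found: List[str] = []
    if c.statement_types != ["SELECT"]:
        found.append("must be exactly one SELECT statement")
    for rel in c.relations:
        if not rel.startswith(ALLOWED_SCHEMA + "."):
            found.append(f"relation not on allowlist: {rel}")
    if c.selects_star:
        found.append("SELECT * is not allowed")
    if "tenant_id" not in c.predicate_columns:
        found.append("missing tenant_id predicate")
    if c.limit is None:
        found.append("missing LIMIT")
    return found


# The weak query above, described as parser output.
weak = Candidate(
    statement_types=["SELECT"],
    relations=["transactions", "countries"],
    selects_star=True,
    predicate_columns={"user_id", "amount", "date", "risk_level"},
    limit=None,
)
for problem in violations(weak):
    print(problem)
```

&lt;p&gt;Returning every violation, not just the first, gives the repair step and the audit log a complete picture of why the candidate was blocked.&lt;/p&gt;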
&lt;p&gt;A guarded system should repair it into a query against an approved surface:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    transaction_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    user_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    amount_cents,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    transaction_at,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    destination_country,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    risk_level,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    risk_definition_version&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tenant_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; current_setting(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;app.tenant_id&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;bigint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 12345&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; amount_cents &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transaction_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TIMESTAMPTZ&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2025-04-01 00:00:00+00&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transaction_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  TIMESTAMPTZ&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2025-07-01 00:00:00+00&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; risk_level &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;high&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; amount_cents &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 500&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
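&lt;p&gt;Two of the repairs are plain arithmetic and are better computed by the control plane than recalled by the model. A minimal sketch, assuming calendar quarters in UTC and whole-dollar thresholds; &lt;code&gt;quarter_bounds_utc&lt;/code&gt; and &lt;code&gt;dollars_to_cents&lt;/code&gt; are hypothetical helper names, and a fiscal calendar would need a catalog lookup in their place.&lt;/p&gt;

```python
from datetime import datetime, timezone
from typing import Tuple


def quarter_bounds_utc(year: int, quarter: int) -> Tuple[datetime, datetime]:
    """Return the half-open [start, end) range of a calendar quarter in UTC."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    start_month = 3 * (quarter - 1) + 1
    start = datetime(year, start_month, 1, tzinfo=timezone.utc)
    if quarter == 4:
        # Q4 ends at the start of the next year.
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, start_month + 3, 1, tzinfo=timezone.utc)
    return start, end


def dollars_to_cents(dollars: int) -> int:
    """Convert a whole-dollar threshold to integer cents."""
    return dollars * 100


start, end = quarter_bounds_utc(2025, 2)
print(start.isoformat(), end.isoformat())
print(dollars_to_cents(10_000))
```

&lt;p&gt;The exclusive upper bound is what keeps rows timestamped late on June 30 inside the quarter, which an inclusive &lt;code&gt;BETWEEN&lt;/code&gt; on a date literal would drop.&lt;/p&gt;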
&lt;p&gt;The validation result should be explicit:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Check&lt;/th&gt;&lt;th&gt;Result&lt;/th&gt;&lt;th&gt;Reason&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Statement type&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;Single &lt;code&gt;SELECT&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Relation allowlist&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;Uses &lt;code&gt;analytics_agent.agent_fraud_transactions_v1&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Base table access&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;No direct &lt;code&gt;app.*&lt;/code&gt; relation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sensitive columns&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;No raw email, SSN, card number, token, or address fields&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tenant scope&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;Binds to &lt;code&gt;current_setting(&apos;app.tenant_id&apos;)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Time scope&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;Half-open Q2 2025 UTC range&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Row bound&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;&lt;code&gt;LIMIT 500&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Planner check&lt;/td&gt;&lt;td&gt;Pass or reject&lt;/td&gt;&lt;td&gt;Based on &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt; policy thresholds&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Execution endpoint&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;Reader connection only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Summary contract&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;td&gt;Must include filters, definitions, row count, and truncation status&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The workflow output should not only say “3 transactions over $10,000 detected.” It should include the query boundary:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Q2 2025 was interpreted as 2025-04-01 through 2025-06-30 UTC. High-risk country came from &lt;code&gt;risk_country_v2&lt;/code&gt;. Results were limited to tenant 42, user 12345, and 500 rows. The query returned 3 rows from the reader endpoint. No causal explanation was inferred from these rows.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That is not verbosity. That is evidence.&lt;/p&gt;
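&lt;p&gt;A boundary statement like the one quoted above can be rendered from result metadata, not written by the model. The sketch below assumes a metadata dictionary; &lt;code&gt;returned_rows&lt;/code&gt;, &lt;code&gt;result_truncated&lt;/code&gt;, and &lt;code&gt;execution_endpoint&lt;/code&gt; follow the audit record shown earlier, while the other field names and &lt;code&gt;boundary_statement&lt;/code&gt; itself are hypothetical. Returning &lt;code&gt;None&lt;/code&gt; stands in for the “withhold summary” behavior.&lt;/p&gt;

```python
from typing import Any, Dict, Optional

# Fields the summary contract requires before anything is reported.
REQUIRED_FIELDS = (
    "interpreted_range",
    "definition_versions",
    "tenant_id",
    "row_limit",
    "returned_rows",
    "result_truncated",
    "execution_endpoint",
)


def boundary_statement(meta: Dict[str, Any]) -> Optional[str]:
    """Render the query boundary, or return None to withhold the summary."""
    missing = [name for name in REQUIRED_FIELDS if name not in meta]
    if missing:
        return None
    truncated = "truncated" if meta["result_truncated"] else "complete"
    return (
        f"Range: {meta['interpreted_range']}. "
        f"Definitions: {', '.join(meta['definition_versions'])}. "
        f"Scope: tenant {meta['tenant_id']}, limit {meta['row_limit']} rows. "
        f"Returned {meta['returned_rows']} rows ({truncated}) "
        f"from the {meta['execution_endpoint']} endpoint."
    )


example_meta = {
    "interpreted_range": "2025-04-01 through 2025-06-30 UTC",
    "definition_versions": ["risk_country_v2"],
    "tenant_id": 42,
    "row_limit": 500,
    "returned_rows": 3,
    "result_truncated": False,
    "execution_endpoint": "reader",
}
print(boundary_statement(example_meta))
```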
&lt;p&gt;A useful workflow looks like this:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Stage&lt;/th&gt;&lt;th&gt;Input&lt;/th&gt;&lt;th&gt;Output&lt;/th&gt;&lt;th&gt;Control&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;User request&lt;/td&gt;&lt;td&gt;Natural-language question&lt;/td&gt;&lt;td&gt;Structured intent&lt;/td&gt;&lt;td&gt;Require authenticated user, tenant context, and purpose&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Semantic lookup&lt;/td&gt;&lt;td&gt;“Q2 2025”, “high-risk country”, “transactions”&lt;/td&gt;&lt;td&gt;Approved metric and view definitions&lt;/td&gt;&lt;td&gt;Use catalog definitions, not model memory&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SQL generation&lt;/td&gt;&lt;td&gt;Structured intent and schema subset&lt;/td&gt;&lt;td&gt;Candidate SQL&lt;/td&gt;&lt;td&gt;Prompt includes only approved views&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SQL validation&lt;/td&gt;&lt;td&gt;Candidate SQL&lt;/td&gt;&lt;td&gt;Approved or rejected query&lt;/td&gt;&lt;td&gt;Parser enforces allowlist, predicates, and limits&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Plan check&lt;/td&gt;&lt;td&gt;Approved query&lt;/td&gt;&lt;td&gt;Plan JSON&lt;/td&gt;&lt;td&gt;Reject large scans, unsafe joins, and high-cost plans&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Execution&lt;/td&gt;&lt;td&gt;Final SQL&lt;/td&gt;&lt;td&gt;Rows or aggregate result&lt;/td&gt;&lt;td&gt;Read-only role, read endpoint, timeout, lock timeout&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Result validation&lt;/td&gt;&lt;td&gt;Rows plus metadata&lt;/td&gt;&lt;td&gt;Validated result envelope&lt;/td&gt;&lt;td&gt;Check row count, truncation, tenant scope, and freshness&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Summarization&lt;/td&gt;&lt;td&gt;Validated result envelope&lt;/td&gt;&lt;td&gt;Report&lt;/td&gt;&lt;td&gt;Include filters, row count, definitions, and caveats&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Audit&lt;/td&gt;&lt;td&gt;Prompt, SQL, user, plan, result metadata&lt;/td&gt;&lt;td&gt;Immutable log&lt;/td&gt;&lt;td&gt;Support review, replay, and incident analysis&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;A basic PostgreSQL harness should be part of the release checklist:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Must fail: no base table access&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ROLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; agent_runtime;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transactions&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Must fail: no write path&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; READ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ONLY;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DELETE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tenant_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 42&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ROLLBACK&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Must pass: approved view and bounded tenant context&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; READ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ONLY;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LOCAL&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;tenant_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;42&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transaction_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tenant_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; current_setting(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;app.tenant_id&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;bigint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transaction_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;COMMIT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
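&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Must be inspected before execution in the control plane;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- tenant context is set so current_setting() resolves during planning&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; READ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ONLY;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LOCAL&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; app&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;tenant_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;42&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (FORMAT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JSON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transaction_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; analytics_agent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;agent_fraud_transactions_v1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tenant_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; current_setting(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;app.tenant_id&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;bigint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transaction_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ROLLBACK&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;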
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the difference between a demo and an operating surface: the negative tests are as important as the happy path.&lt;/p&gt;
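&lt;p&gt;The plan inspection step can be sketched as a walk over the &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt; output, which is a JSON array whose first element holds a &lt;code&gt;Plan&lt;/code&gt; tree with keys such as &lt;code&gt;Node Type&lt;/code&gt;, &lt;code&gt;Total Cost&lt;/code&gt;, &lt;code&gt;Plan Rows&lt;/code&gt;, and nested &lt;code&gt;Plans&lt;/code&gt;. The cost threshold, the helper names, and the sample plan below are illustrative assumptions, not values from a real system; some drivers hand back the plan already parsed, in which case the &lt;code&gt;json.loads&lt;/code&gt; call is unnecessary.&lt;/p&gt;

```python
import json
from typing import Any, Dict, Iterator, List

MAX_PLAN_ROWS = 1_000_000   # from the policy table: reject over 1 million estimated rows
MAX_TOTAL_COST = 50_000.0   # hypothetical cost threshold


def walk(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a plan node and all of its descendants."""
    yield node
    for child in node.get("Plans", []):
        yield from walk(child)


def plan_violations(explain_json: str) -> List[str]:
    """Check planner estimates against policy; an empty list means pass."""
    root = json.loads(explain_json)[0]["Plan"]
    found: List[str] = []
    if root["Total Cost"] > MAX_TOTAL_COST:
        found.append(f"total cost {root['Total Cost']} exceeds threshold")
    for node in walk(root):
        if node.get("Plan Rows", 0) > MAX_PLAN_ROWS:
            found.append(f"{node['Node Type']} estimates {node['Plan Rows']} rows")
    return found


# Made-up plan for illustration: an aggregate over a large sequential scan.
sample = json.dumps([{
    "Plan": {
        "Node Type": "Aggregate", "Total Cost": 250000.5, "Plan Rows": 1,
        "Plans": [
            {"Node Type": "Seq Scan", "Total Cost": 220000.0, "Plan Rows": 12000000}
        ],
    }
}])
for problem in plan_violations(sample):
    print(problem)
```

&lt;p&gt;These are planner estimates, so the check bounds expected work, not actual work; &lt;code&gt;statement_timeout&lt;/code&gt; still has to cover the cases where the estimate is wrong.&lt;/p&gt;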
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;The agent omits tenant scope&lt;/td&gt;&lt;td&gt;User asks a broad question, schema includes &lt;code&gt;tenant_id&lt;/code&gt;, prompt does not force tenant binding&lt;/td&gt;&lt;td&gt;Enforce tenant scope through RLS or reject SQL missing the required tenant predicate&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;The query is read-only but still harmful&lt;/td&gt;&lt;td&gt;&lt;code&gt;SELECT count(*)&lt;/code&gt; or a broad join scans a large event table on the writer&lt;/td&gt;&lt;td&gt;Route to a replica, require date predicates, set &lt;code&gt;statement_timeout&lt;/code&gt;, and block high-cost plans from &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;RLS gives false confidence&lt;/td&gt;&lt;td&gt;Policy exists, but the agent executes as table owner, a &lt;code&gt;BYPASSRLS&lt;/code&gt; role, or a Supabase service role&lt;/td&gt;&lt;td&gt;Test access as the exact runtime role; avoid service-role credentials for user-scoped analytics&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Views leak more than intended&lt;/td&gt;&lt;td&gt;A curated view includes sensitive columns, unsafe functions, or unclear predicate behavior&lt;/td&gt;&lt;td&gt;Keep views narrow, use &lt;code&gt;security_barrier&lt;/code&gt; where appropriate, and test selected columns through the agent role&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;LIMIT&lt;/code&gt; hides correctness bugs&lt;/td&gt;&lt;td&gt;Agent adds &lt;code&gt;LIMIT 100&lt;/code&gt; to satisfy policy but summarizes as if the result is complete&lt;/td&gt;&lt;td&gt;Require the report to state row limits and total count strategy; use aggregates for counts and samples for inspection&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica lag creates stale answers&lt;/td&gt;&lt;td&gt;Agent reads from an asynchronous replica 
during incident response or fraud review&lt;/td&gt;&lt;td&gt;Include replica lag in result metadata; route freshness-critical questions to a dedicated bounded primary path&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SQL parser and database version drift&lt;/td&gt;&lt;td&gt;Parser supports a different PostgreSQL grammar than the server executes&lt;/td&gt;&lt;td&gt;Pin parser support to the database major version; reject unsupported syntax rather than falling back to string checks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;n8n retries multiply load&lt;/td&gt;&lt;td&gt;Workflow retry policy repeats a timeout-heavy query after transient failures&lt;/td&gt;&lt;td&gt;Add idempotency keys, exponential backoff, per-user rate limits, and query fingerprint throttling&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;LLM call happens inside a transaction&lt;/td&gt;&lt;td&gt;Workflow opens a transaction, calls the model, and waits while the database session sits idle&lt;/td&gt;&lt;td&gt;Generate and validate before &lt;code&gt;BEGIN&lt;/code&gt;; set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; anyway&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Summarizer invents explanation&lt;/td&gt;&lt;td&gt;Result table has sparse evidence, but the LLM describes causality or risk with high confidence&lt;/td&gt;&lt;td&gt;Give the summarizer only rows, schema definitions, and allowed explanation patterns; separate observation from interpretation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Business terms drift&lt;/td&gt;&lt;td&gt;“High risk,” “active user,” or “Q3” changes across finance, fraud, and product teams&lt;/td&gt;&lt;td&gt;Store definitions in a semantic catalog with versioned names such as &lt;code&gt;risk_country_v2&lt;/code&gt; and &lt;code&gt;fiscal_quarter_calendar_v1&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The version-specific gotcha worth repeating is parser and server drift. PostgreSQL syntax and timeout behavior change across major versions. If the validation service parses a different dialect than the server executes, the safety layer can reject valid queries, accept wrong assumptions, or fail open under pressure. A SQL agent control plane should fail closed. Annoying users is cheaper than explaining why an assistant queried outside its boundary.&lt;/p&gt;
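&lt;p&gt;Failing closed is mostly a matter of how the gate treats errors and ambiguity. A minimal sketch, assuming each check is a function that returns &lt;code&gt;True&lt;/code&gt; to approve; &lt;code&gt;gate&lt;/code&gt; and the two example checks are hypothetical names, and the raised &lt;code&gt;NotImplementedError&lt;/code&gt; stands in for a parser that meets syntax newer than its pinned grammar.&lt;/p&gt;

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]


def gate(sql: str, checks: List[Check]) -> Tuple[bool, str]:
    """Approve only when every check explicitly returns True."""
    if not checks:
        # An empty pipeline is a misconfiguration, not an approval.
        return False, "no checks configured"
    for check in checks:
        try:
            approved = check(sql)
        except Exception as exc:
            # A crashing validator blocks the query; it never falls through.
            return False, f"{check.__name__} errored: {type(exc).__name__}"
        if approved is not True:
            return False, f"{check.__name__} did not approve"
    return True, "approved"


def always_ok(sql: str) -> bool:
    return True


def unsupported_syntax(sql: str) -> bool:
    raise NotImplementedError("grammar newer than pinned parser")


print(gate("SELECT 1", [always_ok]))
print(gate("SELECT 1", [always_ok, unsupported_syntax]))
```

&lt;p&gt;The strict &lt;code&gt;is not True&lt;/code&gt; comparison means a check that returns &lt;code&gt;None&lt;/code&gt; by accident also blocks execution, which is the annoying-but-safe direction.&lt;/p&gt;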
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: A natural-language SQL agent concentrates risk because it converts ambiguous user intent into executable database authority.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Put the LLM behind a control plane with curated views, PostgreSQL roles, RLS, SQL parsing, planner checks, read-only execution, timeouts, endpoint isolation, result validation, and audit logs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: The first validation signal is a rejection suite where dangerous prompts produce blocked SQL and every approved query has a stored prompt, query, plan, role, timeout, row count, freshness marker, and delivery target.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, build one read-only agent role that can query only two approved views, then add a parser gate that rejects writes, cross-schema reads, missing tenant scope, sensitive columns, multi-statement input, and unbounded selects.&lt;/li&gt;
&lt;/ul&gt;
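&lt;p&gt;As a starting point for the role in the Action item, a minimal sketch might look like this. The role name &lt;code&gt;sql_agent_ro&lt;/code&gt;, the schema &lt;code&gt;agent_views&lt;/code&gt;, the two view names, and the timeout values are hypothetical placeholders, not names from a real system.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Hypothetical names: sql_agent_ro, agent_views, orders_summary, account_activity.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE ROLE sql_agent_ro LOGIN NOINHERIT;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Privileges are the real boundary: schema usage plus SELECT on two views only.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GRANT USAGE ON SCHEMA agent_views TO sql_agent_ro;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GRANT SELECT ON agent_views.orders_summary, agent_views.account_activity TO sql_agent_ro;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Session defaults are guardrails, not a boundary: a session can override them.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE sql_agent_ro SET default_transaction_read_only = on;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE sql_agent_ro SET statement_timeout = &apos;5s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE sql_agent_ro SET idle_in_transaction_session_timeout = &apos;10s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The grants only limit what a query that slips through can reach, so the parser gate still has to sit in front of this role. The sketch also assumes no broader privileges have been granted to &lt;code&gt;PUBLIC&lt;/code&gt; that the role would pick up.&lt;/p&gt;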
&lt;p&gt;A database agent is production-ready only when the least interesting part of the system is the chat box.&lt;/p&gt;</content:encoded><category>ai-engineering</category><category>databases</category><category>architecture</category></item><item><title>Covering Indexes Are Not Enough Without Visibility</title><link>https://rajivonai.com/blog/2025-07-12-covering-indexes-are-not-enough-without-visibility/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-07-12-covering-indexes-are-not-enough-without-visibility/</guid><description>PostgreSQL index-only scans only stay fast when covering indexes and visibility map maintenance work together.</description><pubDate>Sat, 12 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A PostgreSQL covering index is not a performance fix by itself; it is a bet that the query, the index payload, and the visibility map will stay aligned under real production churn.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The default move is still an ordinary B-tree index on the predicate column: &lt;code&gt;CREATE INDEX ON users(email)&lt;/code&gt;. The better move, when the read path is stable, is a covering index using the &lt;code&gt;INCLUDE&lt;/code&gt; clause introduced in PostgreSQL 11, which stores projected columns in the index payload so an index-only scan can answer the query without visiting the heap when visibility permits it.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;What it optimizes&lt;/th&gt;&lt;th&gt;What it still pays for&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Ordinary B-tree index&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Finds matching tuple IDs quickly&lt;/td&gt;&lt;td&gt;Heap reads for projected columns and Multi-Version Concurrency Control (MVCC) visibility&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Covering index with &lt;code&gt;INCLUDE&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Keeps predicate and selected columns in one index&lt;/td&gt;&lt;td&gt;Larger index, write overhead, visibility map dependency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Covering index plus vacuum discipline&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Avoids heap access for stable pages&lt;/td&gt;&lt;td&gt;Operational ownership of autovacuum and long transactions&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;PostgreSQL indexes do not store complete row visibility. They can point to candidate rows, but MVCC visibility is determined from heap state unless PostgreSQL can trust the visibility map. The official PostgreSQL documentation is explicit: an index-only scan pays off only when the needed columns are available from the index and a significant fraction of the table’s heap pages have their all-visible bits set in the visibility map.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Projection misses the index&lt;/td&gt;&lt;td&gt;&lt;code&gt;SELECT username, status&lt;/code&gt; uses &lt;code&gt;idx_users_email(email)&lt;/code&gt; and still reads the heap&lt;/td&gt;&lt;td&gt;The index finds rows, but the table still serves the selected columns&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Visibility map is stale&lt;/td&gt;&lt;td&gt;Plan says &lt;code&gt;Index Only Scan&lt;/code&gt;, but reports &lt;code&gt;Heap Fetches: 12000&lt;/code&gt;&lt;/td&gt;&lt;td&gt;The scan is only “index-only” for pages marked all-visible&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Autovacuum threshold is too loose&lt;/td&gt;&lt;td&gt;Default &lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; can mean roughly 40M dead tuples on a 200M-row table before vacuum triggers&lt;/td&gt;&lt;td&gt;Large tables can accumulate heap pages that are not all-visible for too long&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Included column churn&lt;/td&gt;&lt;td&gt;Updating &lt;code&gt;status&lt;/code&gt; or &lt;code&gt;username&lt;/code&gt; touches a column stored in the covering index&lt;/td&gt;&lt;td&gt;PostgreSQL must maintain the index entry, and that update cannot be a heap-only tuple (HOT) update&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Staging lies politely&lt;/td&gt;&lt;td&gt;Freshly loaded and manually vacuumed test data shows zero heap fetches&lt;/td&gt;&lt;td&gt;Production write churn, old snapshots, and delayed vacuum change the execution profile&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core question is not “did we add an index?” It is: can PostgreSQL answer this production query from the index while proving that the referenced heap pages are visible to the current snapshot?&lt;/p&gt;
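&lt;p&gt;One rough way to see how much of a table the visibility map currently covers is to compare &lt;code&gt;relallvisible&lt;/code&gt; with &lt;code&gt;relpages&lt;/code&gt; in &lt;code&gt;pg_class&lt;/code&gt;. This is a sketch that assumes a single table named &lt;code&gt;users&lt;/code&gt;; both columns are estimates, so read the percentage as a trend indicator rather than an exact figure.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- relpages and relallvisible are estimates refreshed by VACUUM and ANALYZE.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relpages,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relallvisible,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  -- NULLIF avoids division by zero on an empty or never-analyzed table.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_class&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE relname = &apos;users&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND relkind = &apos;r&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A low percentage on a table with a covering index suggests that index-only scans on it will still visit the heap for many pages.&lt;/p&gt;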
&lt;h2 id=&quot;design-the-index-around-the-read-path-and-the-visibility-map&quot;&gt;Design the Index Around the Read Path and the Visibility Map&lt;/h2&gt;
&lt;p&gt;The right architecture is a measured covering-index loop: identify the hot read path, build the narrowest covering index, verify heap avoidance with &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;, and tune vacuum behavior for that table instead of celebrating the DDL.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Query[hot read query — predicate and projection] --&gt; Cover[covering B-tree index — key and included columns]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Cover --&gt; VM[visibility map — all visible bit]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    VM --&gt;|bit set| Return[index tuple returned]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    VM --&gt;|bit clear| Heap[heap visit for MVCC check]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Heap --&gt; Return&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Vacuum[VACUUM and autovacuum] --&gt; VM&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Writes[INSERT UPDATE DELETE on page] --&gt; VM&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Start from &lt;code&gt;pg_stat_statements&lt;/code&gt;, not intuition. Pick one query by total time and call count, then write down its &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, and &lt;code&gt;SELECT&lt;/code&gt; columns.&lt;br&gt;
Verification: the candidate query has a stable fingerprint and enough calls to matter.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Put search columns in the key and projected columns in &lt;code&gt;INCLUDE&lt;/code&gt;. For the lookup path below, &lt;code&gt;email&lt;/code&gt; is the key; &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; are payload.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; CONCURRENTLY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_users_email_covering&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; users(email)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INCLUDE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (username, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; finishes without blocking ordinary reads and writes, and the index size is acceptable via &lt;code&gt;pg_relation_size&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run the real query with execution metrics.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; username, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; users&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; email &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;dev@example.com&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verification: look for &lt;code&gt;Index Only Scan&lt;/code&gt;, low shared buffer reads, and &lt;code&gt;Heap Fetches: 0&lt;/code&gt; or a number small enough to survive peak traffic.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check visibility health, not just plan shape. PostgreSQL’s visibility map stores all-visible and all-frozen state per heap page, and its bits are set by vacuum and cleared by data-modifying operations.&lt;br&gt;
Verification: if heap fetches remain high after the index is used, inspect &lt;code&gt;last_autovacuum&lt;/code&gt;, &lt;code&gt;n_dead_tup&lt;/code&gt;, long-running transactions, and table-level autovacuum settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bound the write cost. Included columns are not search keys, but they still live in the index. A wide &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;jsonb&lt;/code&gt;, or frequently updated status column can turn a read optimization into write amplification.&lt;br&gt;
Verification: compare &lt;code&gt;pg_stat_user_indexes.idx_scan&lt;/code&gt;, write latency, WAL volume, HOT update ratio, and index size before and after rollout.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
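&lt;p&gt;The first and last steps of that loop can be sketched as two catalog queries. This assumes the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension is installed and that the table is the &lt;code&gt;users&lt;/code&gt; table from the example; the &lt;code&gt;LIMIT&lt;/code&gt; and the 120-character truncation are arbitrary choices.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Step 1: candidate read paths ranked by total execution time.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- total_exec_time and mean_exec_time are the PostgreSQL 13+ column names;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- older versions expose total_time and mean_time instead.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT queryid, calls, total_exec_time, mean_exec_time, left(query, 120) AS query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_statements&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY total_exec_time DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Steps 4 and 5: vacuum recency, dead tuples, and HOT update ratio for the table.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autovacuum,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_tup_upd,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_tup_hot_upd,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_update_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE relname = &apos;users&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The counters in &lt;code&gt;pg_stat_user_tables&lt;/code&gt; are cumulative, so capture the second query before and after rollout and compare the deltas rather than the raw totals.&lt;/p&gt;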
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;I am not going to invent a 2:14 AM incident with a heroic graph. The documented production pattern is enough, and the public PostgreSQL material gives a concrete measurement boundary.&lt;/p&gt;
&lt;p&gt;PostgreSQL 11 added covering indexes with &lt;code&gt;INCLUDE&lt;/code&gt;, documented in the project release notes and in the current index-only scan documentation. The documentation says the scan is physically possible when the index type supports it and the query’s referenced columns are available from the index. B-tree indexes satisfy the access-method requirement. The same documentation adds the operational catch: because visibility data is not stored in index entries, PostgreSQL checks the visibility map before skipping the heap.&lt;/p&gt;
&lt;p&gt;That behavior explains why a plan can contain &lt;code&gt;Index Only Scan&lt;/code&gt; and still do heap work. The plan node describes the access strategy; &lt;code&gt;Heap Fetches&lt;/code&gt; tells you how often the executor had to visit heap pages anyway. If heap fetches are high, the covering index may still reduce work, but it has not removed the table from the read path.&lt;/p&gt;
&lt;p&gt;A useful public comparison comes from Dalibo’s PostgreSQL 11 workshop, which uses a 10M-row table with columns &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;c&lt;/code&gt;. With a unique index on &lt;code&gt;(a, b)&lt;/code&gt;, selecting only &lt;code&gt;a, b&lt;/code&gt; can use an index-only scan with &lt;code&gt;Heap Fetches: 0&lt;/code&gt;. Selecting &lt;code&gt;a, b, c&lt;/code&gt; from the same predicate cannot be answered by that index, so PostgreSQL uses an index scan and reads the table to get &lt;code&gt;c&lt;/code&gt;. After adding a covering index on &lt;code&gt;(a, b) INCLUDE (c)&lt;/code&gt;, the same &lt;code&gt;a, b, c&lt;/code&gt; query returns to an index-only scan with &lt;code&gt;Heap Fetches: 0&lt;/code&gt;.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Public PostgreSQL 11 workshop measurement&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Plan shape&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Heap fetch signal&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Execution time&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Existing unique index on &lt;code&gt;(a, b)&lt;/code&gt;, query selects &lt;code&gt;a, b&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;Index Only Scan&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;0&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;12.628 ms&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Existing unique index on &lt;code&gt;(a, b)&lt;/code&gt;, query selects &lt;code&gt;a, b, c&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;Index Scan&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Heap access is inherent&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;16.034 ms&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Covering unique index on &lt;code&gt;(a, b) INCLUDE (c)&lt;/code&gt;, query selects &lt;code&gt;a, b, c&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;Index Only Scan&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;0&lt;/code&gt;&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;&lt;code&gt;14.263 ms&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The more interesting part is not the small read-time delta in that example. It is the storage and write tradeoff. Dalibo reports &lt;code&gt;214 MB&lt;/code&gt; for the unique &lt;code&gt;(a, b)&lt;/code&gt; index and &lt;code&gt;387 MB&lt;/code&gt; for a separate &lt;code&gt;(a, b, c)&lt;/code&gt; index, or &lt;code&gt;602 MB&lt;/code&gt; if both are kept. Replacing that pair with one unique covering index on &lt;code&gt;(a, b) INCLUDE (c)&lt;/code&gt; is reported at &lt;code&gt;386 MB&lt;/code&gt;. The same workshop then inserts 100k rows: maintaining one covering index reports &lt;code&gt;502.594 ms&lt;/code&gt;; maintaining the two-index design reports &lt;code&gt;843.147 ms&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That is the design tradeoff senior engineers should care about. The covering index did not make writes free. It reduced a two-index design into one index while preserving uniqueness semantics on &lt;code&gt;(a, b)&lt;/code&gt;. If your alternative is no extra index, writes still pay. If your alternative is two overlapping indexes, a covering index may be the cheaper structure.&lt;/p&gt;
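&lt;p&gt;To run the same kind of size comparison on your own table, a query along these lines may help. It is a sketch against the &lt;code&gt;users&lt;/code&gt; table from the earlier example, and its output will reflect your data, not the workshop figures.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Size and scan count of every index on the table, largest first.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  indexrelname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  idx_scan,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_indexes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE relname = &apos;users&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY pg_relation_size(indexrelid) DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An overlapping index with a large size and a near-zero &lt;code&gt;idx_scan&lt;/code&gt; is a candidate for consolidation into the covering index, provided no constraint depends on it.&lt;/p&gt;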
&lt;p&gt;The deeper production gotcha is autovacuum math. PostgreSQL documents &lt;code&gt;autovacuum_vacuum_threshold = 50&lt;/code&gt; and &lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; defaults. On small tables, that is fine. On a 200M-row relation, scale-factor-driven vacuum can wait for a very large number of changed tuples unless table storage parameters override it. That delay matters because visibility map bits are conservative: if PostgreSQL cannot prove a page is all-visible, it visits the heap.&lt;/p&gt;
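&lt;p&gt;The arithmetic behind that claim, and one possible per-table override, can be sketched as follows. The override values are illustrative assumptions rather than recommendations, and &lt;code&gt;users&lt;/code&gt; stands in for whichever large table carries the covering index.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Documented trigger: threshold + scale_factor * number of tuples.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- With the defaults and 200M rows this is roughly 40 million dead tuples.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT 50 + 0.2 * 200000000 AS dead_tuples_before_autovacuum;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Per-table override (illustrative values, tune against your own write load).&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE users SET (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_scale_factor = 0.01,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_threshold = 1000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- With these settings the trigger drops to roughly 2 million dead tuples.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More frequent vacuum runs cost I/O, so the lower trigger point is a trade against background load, not a free improvement.&lt;/p&gt;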
&lt;p&gt;There is also a schema-design trap. Adding &lt;code&gt;INCLUDE (username, status)&lt;/code&gt; is reasonable for a hot lookup endpoint. Adding ten payload columns because “index-only scans are fast” is not engineering; it is moving the table into another structure with worse write economics. PostgreSQL will reject oversized index tuples, and before that hard failure, you pay with memory pressure, cache churn, WAL, and slower updates.&lt;/p&gt;
&lt;p&gt;The useful mental model is simple: a covering index is a read-path contract. Autovacuum, transaction age, and update patterns are the parties that can break it.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Index Only Scan&lt;/code&gt; still shows large &lt;code&gt;Heap Fetches&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Pages are not marked all-visible after recent &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, or &lt;code&gt;DELETE&lt;/code&gt; activity&lt;/td&gt;&lt;td&gt;Tune table-level autovacuum and remove long-running transactions holding old snapshots&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Covering index bloats quickly&lt;/td&gt;&lt;td&gt;&lt;code&gt;INCLUDE&lt;/code&gt; contains wide &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;jsonb&lt;/code&gt;, or low-value projected columns&lt;/td&gt;&lt;td&gt;Keep payload columns narrow and tied to one hot query family&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Write latency rises after rollout&lt;/td&gt;&lt;td&gt;Included columns are frequently updated, preventing cheap heap-only behavior&lt;/td&gt;&lt;td&gt;Drop volatile payload columns or split read model from write-heavy table&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Planner ignores the new index&lt;/td&gt;&lt;td&gt;Query selects extra columns, uses mismatched predicates, or statistics are stale&lt;/td&gt;&lt;td&gt;Re-run &lt;code&gt;ANALYZE&lt;/code&gt;, verify exact projection, and compare with &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Staging benchmark overstates gains&lt;/td&gt;&lt;td&gt;Test data was bulk-loaded, vacuumed, and mostly static&lt;/td&gt;&lt;td&gt;Replay production write mix or test after churn before trusting heap-fetch counts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;RDS maintenance lags during peak write load&lt;/td&gt;&lt;td&gt;Autovacuum workers and cost limits cannot keep up with dead tuples&lt;/td&gt;&lt;td&gt;Use per-table autovacuum settings and monitor 
&lt;code&gt;pg_stat_user_tables&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Ordinary indexes still force heap access when the query projects columns outside the index or when MVCC visibility cannot be proven from the visibility map.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Build narrow covering indexes only for high-call-count read paths, then treat autovacuum health as part of the index design.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: The validation signal is not the presence of &lt;code&gt;Index Only Scan&lt;/code&gt;; it is low &lt;code&gt;Heap Fetches&lt;/code&gt;, stable buffer reads, acceptable index size, preserved HOT update ratio, and no write regression.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, take the top query from &lt;code&gt;pg_stat_statements&lt;/code&gt;, add one candidate covering index in staging, and compare &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;, &lt;code&gt;pg_relation_size&lt;/code&gt;, write latency, WAL volume, and HOT update ratio before and after real write churn.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A fast PostgreSQL query is rarely the result of one clever index; it is the result of making the storage engine’s promises line up with the workload it is actually running.&lt;/p&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category></item><item><title>When Autovacuum Becomes a Backpressure Signal</title><link>https://rajivonai.com/blog/2025-07-05-when-autovacuum-becomes-a-backpressure-signal/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-07-05-when-autovacuum-becomes-a-backpressure-signal/</guid><description>PostgreSQL vacuum stalls are often symptoms of lock pressure, table bloat, and missing operational visibility.</description><pubDate>Sat, 05 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Autovacuum is not background housekeeping; in a write-heavy PostgreSQL system, delayed vacuum is a backpressure signal from Multi-Version Concurrency Control before the application admits it is overloaded.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s default approach is to let autovacuum clean dead row versions in the background while application traffic continues. The alternative is to treat vacuum health as part of the write path: measured, alerted, tuned per table, and included in incident triage.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;What it assumes&lt;/th&gt;&lt;th&gt;What production eventually proves&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Default autovacuum&lt;/td&gt;&lt;td&gt;Table churn is moderate and cleanup can trail safely&lt;/td&gt;&lt;td&gt;High-update tables create cleanup debt faster than defaults can retire it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Manual emergency vacuum&lt;/td&gt;&lt;td&gt;Operators can intervene after latency spikes&lt;/td&gt;&lt;td&gt;The database is already paying interest on bloat by then&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Vacuum as backpressure telemetry&lt;/td&gt;&lt;td&gt;Dead tuples, transaction age, locks, and vacuum progress are monitored together&lt;/td&gt;&lt;td&gt;The incident is visible before p95 latency becomes the alert&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Autovacuum is often blamed because it is visible during the outage. That diagnosis is usually too shallow. In PostgreSQL, &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; create dead row versions under Multi-Version Concurrency Control; &lt;code&gt;VACUUM&lt;/code&gt; can only remove versions no active snapshot can still see. A single old transaction can hold back the cleanup horizon through &lt;code&gt;backend_xmin&lt;/code&gt;, which PostgreSQL exposes in &lt;code&gt;pg_stat_activity&lt;/code&gt;.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Long transaction age&lt;/td&gt;&lt;td&gt;Vacuum cannot remove dead tuples still visible to an old snapshot&lt;/td&gt;&lt;td&gt;Bloat grows even while autovacuum appears active&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Idle transaction sessions&lt;/td&gt;&lt;td&gt;&lt;code&gt;state = &apos;idle in transaction&apos;&lt;/code&gt; keeps a snapshot open without doing useful work&lt;/td&gt;&lt;td&gt;One abandoned app connection can pin cleanup behind thousands of writes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High-churn tables on defaults&lt;/td&gt;&lt;td&gt;&lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; waits for 20 percent table churn plus threshold&lt;/td&gt;&lt;td&gt;On a 200M-row table, that can mean tens of millions of dead tuples before cleanup starts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Lock conflicts&lt;/td&gt;&lt;td&gt;Plain &lt;code&gt;VACUUM&lt;/code&gt; uses &lt;code&gt;ShareUpdateExclusiveLock&lt;/code&gt;; &lt;code&gt;VACUUM FULL&lt;/code&gt; takes &lt;code&gt;AccessExclusiveLock&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Confusing the two during an incident can turn a slowdown into an outage&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dead tuple percent alone&lt;/td&gt;&lt;td&gt;Small tables, append-heavy tables, and partitioned tables distort the signal&lt;/td&gt;&lt;td&gt;Alerts need relation size, last vacuum age, transaction age, and latency together&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;PostgreSQL’s own documentation is explicit about the mechanics: routine vacuuming removes dead row versions and prevents transaction ID wraparound, while old open transactions can block cleanup progress. The operational question is not “is autovacuum running?” The question is: which workload condition is forcing it to fall behind?&lt;/p&gt;
&lt;h2 id=&quot;treat-autovacuum-as-backpressure-telemetry&quot;&gt;Treat Autovacuum as Backpressure Telemetry&lt;/h2&gt;
&lt;p&gt;The right architecture is a vacuum control loop: observe the cleanup horizon, identify blockers, tune the few hot tables, and validate under write load. Do not start by changing global autovacuum settings across the cluster. That is how a maintenance problem becomes an I/O scheduling problem.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App[application writes] --&gt; MVCC[MVCC row versions]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    MVCC --&gt; Dead[dead tuples accumulate]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Txn[old transaction xmin] --&gt; Horizon[cleanup horizon held back]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dead --&gt; Auto[autovacuum worker]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Horizon --&gt; Auto&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Auto --&gt; Locks[ShareUpdateExclusiveLock]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DDL[DDL or index maintenance] --&gt; Locks&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Locks --&gt; Lag[vacuum lag]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Lag --&gt; Bloat[table and index bloat]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Bloat --&gt; Planner[slower plans and more IO]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Planner --&gt; App&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Lag --&gt; Alert[backpressure alert]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Build a vacuum incident view.&lt;/p&gt;
&lt;p&gt;Include active vacuum progress, oldest transaction age, idle-in-transaction sessions, dead tuple counts, table size, and blockers. &lt;code&gt;pg_stat_progress_vacuum&lt;/code&gt; has existed since PostgreSQL 9.6 and reports active vacuum workers, including autovacuum workers.&lt;/p&gt;
&lt;p&gt;Verification: during a load test, you can name the table being vacuumed, its phase, heap blocks scanned, and any blocking backend in under one minute.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alert on cleanup debt, not just dead tuple percentage.&lt;/p&gt;
&lt;p&gt;A 40 percent dead tuple ratio on a 5 MB table is noise. Five percent on a 900 GB high-update table may be a serious future incident. Use a composite signal: &lt;code&gt;n_dead_tup&lt;/code&gt;, &lt;code&gt;pg_total_relation_size&lt;/code&gt;, &lt;code&gt;last_autovacuum&lt;/code&gt;, oldest &lt;code&gt;backend_xmin&lt;/code&gt;, and query latency for the table’s top statements.&lt;/p&gt;
&lt;p&gt;Verification: every alert points to one table, one suspected blocker class, and one next action.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tune high-churn tables per table.&lt;/p&gt;
&lt;p&gt;Lower scale factors on tables such as &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;sessions&lt;/code&gt;, and job queues. A setting like &lt;code&gt;autovacuum_vacuum_scale_factor = 0.01&lt;/code&gt; with a fixed threshold can make cleanup continuous instead of bursty. Keep cost delay and cost limit workload-aware; aggressive cleanup still competes for disk and cache.&lt;/p&gt;
&lt;p&gt;Verification: after tuning, &lt;code&gt;n_dead_tup&lt;/code&gt; forms a sawtooth with a lower ceiling under production-like write load.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fix transaction hygiene before killing vacuum.&lt;/p&gt;
&lt;p&gt;Terminating autovacuum can reduce immediate pressure when it is competing with foreground work, but repeated termination increases bloat debt. The durable fix is shorter transactions, timeouts for idle sessions, safer migration locks, and partition or index maintenance where needed.&lt;/p&gt;
&lt;p&gt;Verification: oldest transaction age remains bounded during peak traffic, not only during maintenance windows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
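&lt;p&gt;Parts of steps 1, 2, and 4 could be sketched with the statements below. The role name &lt;code&gt;app_user&lt;/code&gt;, the 60-second timeout, and the &lt;code&gt;LIMIT&lt;/code&gt; are assumptions for illustration; the progress columns used here are limited to ones that have kept the same names across recent PostgreSQL versions.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Step 1: which tables are being vacuumed right now, and how far along.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relid::regclass AS table_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  phase,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  heap_blks_total,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  heap_blks_scanned,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  heap_blks_vacuumed&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_progress_vacuum&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- The regclass cast only resolves names for the current database.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE datname = current_database();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Step 2: cleanup debt per table, largest dead-tuple counts first.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autovacuum&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY n_dead_tup DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;-- Step 4: bound idle-in-transaction sessions for a hypothetical application role.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE app_user SET idle_in_transaction_session_timeout = &apos;60s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The role-level timeout applies to new sessions only, so existing pooled connections keep their old setting until they reconnect.&lt;/p&gt;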
&lt;p&gt;A useful runbook query starts here:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wait_event_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wait_event,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  age(clock_timestamp(), xact_start) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  age(clock_timestamp(), query_start) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  backend_xmin,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  left&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(query, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;160&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &amp;#x3C;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NULLS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LAST&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The most useful public case study is not an anonymous war story; it is the AWS Database Blog write-up on tuning autovacuum for Amazon RDS for PostgreSQL 9.6.3 after an Oracle-to-PostgreSQL OLTP migration. The database was provisioned for 30,000 IOPS. During the first weeks after migration, several databases saw Read IOPS spike as high as 25,000 without a matching increase in application load. The visible symptom was not one slow query. It was cleanup work arriving late, in large chunks, on already-bloated tables.&lt;/p&gt;
&lt;p&gt;The concrete numbers are the part worth carrying into a runbook:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Published observation&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;th&gt;Operational reading&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;table1&lt;/code&gt; live tuples&lt;/td&gt;&lt;td&gt;450,398,643&lt;/td&gt;&lt;td&gt;Large enough that percentage-based thresholds delay cleanup&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;table1&lt;/code&gt; dead tuples&lt;/td&gt;&lt;td&gt;459,406,616&lt;/td&gt;&lt;td&gt;More dead tuples than estimated live tuples&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;table2&lt;/code&gt; dead tuples&lt;/td&gt;&lt;td&gt;1,919,230,596&lt;/td&gt;&lt;td&gt;Vacuum debt was not isolated to one table&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;table3&lt;/code&gt; dead tuples&lt;/td&gt;&lt;td&gt;4,642,232,802&lt;/td&gt;&lt;td&gt;Cluster-level worker saturation becomes plausible&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Longest autovacuum session&lt;/td&gt;&lt;td&gt;2 days 16:03 on &lt;code&gt;sh.table1&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Vacuum was active but not converging fast enough&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Blocking session state&lt;/td&gt;&lt;td&gt;&lt;code&gt;idle in transaction&lt;/code&gt; for 2 days 22:25 on &lt;code&gt;table1&lt;/code&gt;&lt;/td&gt;&lt;td&gt;The cleanup horizon was pinned by transaction hygiene&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;RDS setting called out&lt;/td&gt;&lt;td&gt;&lt;code&gt;autovacuum_vacuum_scale_factor = 0.1&lt;/code&gt;, &lt;code&gt;autovacuum_max_workers = 3&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Millions of dead tuples accumulated before work started&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tuning result reported&lt;/td&gt;&lt;td&gt;&lt;code&gt;autovacuum_max_workers = 8&lt;/code&gt;, &lt;code&gt;autovacuum_vacuum_cost_limit = 4800&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Read IOPS during concurrent autovacuum was brought to about 10,000, one-third of provisioned capacity&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;That case is useful because it separates three failure modes operators often collapse into one. First, the trigger threshold was too high for tables with hundreds of millions of rows. Second, the default worker count meant a few large tables could occupy all autovacuum workers while other tables continued to accumulate dead tuples. Third, an &lt;code&gt;idle in transaction&lt;/code&gt; session kept old tuple versions visible, so autovacuum could run and still fail to reclaim enough space.&lt;/p&gt;
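&lt;p&gt;The first failure mode is plain arithmetic. PostgreSQL queues an autovacuum once dead tuples exceed &lt;code&gt;autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples&lt;/code&gt;. The sketch below applies that formula to the published &lt;code&gt;table1&lt;/code&gt; figure, assuming the stock base threshold of 50 and using the live tuple count as a stand-in for &lt;code&gt;reltuples&lt;/code&gt;; the function name is illustrative.&lt;/p&gt;

```python
def autovacuum_trigger(reltuples: int, scale_factor: float, threshold: int = 50) -> int:
    """Dead tuples that must accumulate before autovacuum queues the table."""
    return int(threshold + scale_factor * reltuples)


table1_live = 450_398_643  # published live-tuple count, used here as reltuples

# Scale factor called out in the case study versus a per-table override of 0.01.
print(autovacuum_trigger(table1_live, 0.1))   # 45039914
print(autovacuum_trigger(table1_live, 0.01))  # 4504036
```

&lt;p&gt;At a scale factor of 0.1 the table has to collect roughly 45 million dead tuples before cleanup even starts, which is why percentage-based triggers behave badly on very large relations.&lt;/p&gt;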
&lt;p&gt;The lock behavior is documented, not folklore. PostgreSQL’s explicit locking documentation states that plain &lt;code&gt;VACUUM&lt;/code&gt; acquires &lt;code&gt;ShareUpdateExclusiveLock&lt;/code&gt;, while &lt;code&gt;VACUUM FULL&lt;/code&gt; requires &lt;code&gt;AccessExclusiveLock&lt;/code&gt;. That distinction matters at 03:00. Plain vacuum is designed to coexist with normal reads and writes; &lt;code&gt;VACUUM FULL&lt;/code&gt; rewrites the table and blocks concurrent access. Reaching for it during a live checkout incident is usually the database equivalent of fixing a smoke alarm with a hammer.&lt;/p&gt;
&lt;p&gt;A separate public PGConf/OtterTune autovacuum case connects the same mechanics to request latency. The case describes an update-heavy workload where long-running queries blocked autovacuum, dead tuples accumulated by 600x, blocks read increased by 375x, non-HOT updates reached 100 percent, update latency increased from 12 ms to 710 ms, throughput dropped by 25 percent during the spike, and query latency spiked by 90x. The exact schema is less important than the shape of the failure: stale tuple versions made ordinary updates read and write far more than the application expected.&lt;/p&gt;
&lt;p&gt;The practical pattern is visible in named system behavior:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;System behavior&lt;/th&gt;&lt;th&gt;Operational implication&lt;/th&gt;&lt;th&gt;Source&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Dead row versions remain until no active transaction can see them&lt;/td&gt;&lt;td&gt;Watch &lt;code&gt;backend_xmin&lt;/code&gt;, not only table size&lt;/td&gt;&lt;td&gt;&lt;a href=&quot;https://www.postgresql.org/docs/current/routine-vacuuming.html&quot;&gt;PostgreSQL routine vacuuming&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Autovacuum triggers from threshold plus scale factor&lt;/td&gt;&lt;td&gt;Large tables need per-table thresholds&lt;/td&gt;&lt;td&gt;&lt;a href=&quot;https://www.postgresql.org/docs/current/runtime-config-autovacuum.html&quot;&gt;Autovacuum settings&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Plain vacuum and DDL can conflict through table locks&lt;/td&gt;&lt;td&gt;Incident views need &lt;code&gt;pg_locks&lt;/code&gt;, not only connection counts&lt;/td&gt;&lt;td&gt;&lt;a href=&quot;https://www.postgresql.org/docs/current/explicit-locking.html&quot;&gt;PostgreSQL explicit locking&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Vacuum progress is visible while running&lt;/td&gt;&lt;td&gt;Treat active vacuum as observable work, not mystery load&lt;/td&gt;&lt;td&gt;&lt;a href=&quot;https://www.postgresql.org/docs/9.6/progress-reporting.html&quot;&gt;PostgreSQL progress reporting&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Large-table defaults can produce delayed, bursty cleanup&lt;/td&gt;&lt;td&gt;Tune hot tables before making broad cluster changes&lt;/td&gt;&lt;td&gt;&lt;a href=&quot;https://aws.amazon.com/blogs/database/a-case-study-of-tuning-autovacuum-in-amazon-rds-for-postgresql/&quot;&gt;AWS RDS autovacuum case study&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Long-running queries can turn vacuum lag into latency spikes&lt;/td&gt;&lt;td&gt;Track transaction age beside table bloat and top statement latency&lt;/td&gt;&lt;td&gt;&lt;a href=&quot;https://postgresconf.org/system/events/document/000/002/155/Autovacuum_PGCon.pdf&quot;&gt;PGConf autovacuum case study&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The more interesting production lesson is that vacuum lag is a system signal, not a storage metric. It often points at application behavior: oversized transactions, forgotten cursors, migration scripts without lock timeouts, reporting queries running at &lt;code&gt;REPEATABLE READ&lt;/code&gt;, or connection pools that keep sessions open after the request has ended.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Autovacuum workers saturated&lt;/td&gt;&lt;td&gt;Several large tables cross vacuum thresholds at the same time&lt;/td&gt;&lt;td&gt;Tune hot tables individually and review &lt;code&gt;autovacuum_max_workers&lt;/code&gt; with disk capacity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cleanup horizon pinned&lt;/td&gt;&lt;td&gt;Old &lt;code&gt;backend_xmin&lt;/code&gt;, prepared transaction, or replication slot prevents tuple removal&lt;/td&gt;&lt;td&gt;Alert on transaction age, prepared transactions, and replication slot lag&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Foreground latency worsens after tuning&lt;/td&gt;&lt;td&gt;Lower scale factors create more frequent vacuum I/O under peak writes&lt;/td&gt;&lt;td&gt;Adjust cost limit, cost delay, and schedule manual maintenance for cold periods&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;VACUUM FULL&lt;/code&gt; blocks traffic&lt;/td&gt;&lt;td&gt;Operator uses it to reclaim disk on a live table&lt;/td&gt;&lt;td&gt;Prefer regular vacuum, &lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt;, partition rotation, or planned maintenance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Bloat estimate misleads&lt;/td&gt;&lt;td&gt;Statistics are stale or relation layout makes estimates noisy&lt;/td&gt;&lt;td&gt;Pair estimates with &lt;code&gt;pg_stat_user_tables&lt;/code&gt;, relation size trends, and query plans&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Partitioned table hides hot child&lt;/td&gt;&lt;td&gt;Parent looks healthy while one partition churns heavily&lt;/td&gt;&lt;td&gt;Monitor child partitions and tune storage parameters per partition&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: PostgreSQL vacuum lag becomes dangerous when dead tuples, old snapshots, and lock waits are observed as separate symptoms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Build a single incident view that joins transaction age, blocked vacuum, table churn, relation size, and active vacuum progress.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A valid signal names the blocker class before p95 query latency crosses the page threshold, and it explains whether the issue is threshold delay, worker saturation, pinned cleanup horizon, or lock conflict.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, pick the top three write-heavy tables and set table-specific vacuum alerts before changing global autovacuum settings.&lt;/li&gt;
&lt;/ul&gt;
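&lt;p&gt;The four blocker classes named in the list above can be turned into a small triage function. This is only a sketch under stated assumptions: the &lt;code&gt;VacuumSnapshot&lt;/code&gt; fields, the 15-minute transaction-age limit, and the order in which the checks run are hypothetical choices, not values taken from any of the cited sources.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class VacuumSnapshot:
    dead_tuples: int              # n_dead_tup for the table
    trigger_threshold: int        # threshold + scale_factor * reltuples
    oldest_xact_age_s: float      # age of the oldest open transaction, in seconds
    vacuum_waiting_on_lock: bool  # an autovacuum worker is waiting in pg_locks
    active_workers: int           # autovacuum workers currently running
    max_workers: int              # autovacuum_max_workers
    vacuum_running_on_table: bool


def blocker_class(s: VacuumSnapshot, max_xact_age_s: float = 900.0) -> str:
    """Name one suspected blocker class; the check order is an assumption."""
    if s.vacuum_waiting_on_lock:
        return "lock conflict"
    if s.oldest_xact_age_s > max_xact_age_s:
        return "pinned cleanup horizon"
    if s.active_workers >= s.max_workers and not s.vacuum_running_on_table:
        return "worker saturation"
    if s.trigger_threshold >= s.dead_tuples:
        return "threshold delay"
    return "unclassified"


# Figures shaped like the AWS case: vacuum running, session idle in transaction for days.
case = VacuumSnapshot(
    dead_tuples=459_406_616,
    trigger_threshold=45_039_914,
    oldest_xact_age_s=2 * 86400 + 22 * 3600 + 25 * 60,
    vacuum_waiting_on_lock=False,
    active_workers=3,
    max_workers=3,
    vacuum_running_on_table=True,
)
print(blocker_class(case))
```

&lt;p&gt;Returning a single class keeps each alert tied to one next action; a fuller view would report every condition that holds.&lt;/p&gt;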
&lt;p&gt;Autovacuum is the database telling you how much write-path debt your architecture is carrying; the mature response is to measure the debt before the bill arrives.&lt;/p&gt;</content:encoded><category>databases</category><category>failures</category><category>checklist</category></item><item><title>Top GitHub Breakouts: May 2025 — Operational Baseline in a Config File</title><link>https://rajivonai.com/blog/2025-06-22-github-stars-may-2025/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-06-22-github-stars-may-2025/</guid><description>Three May 2025 open-source projects replace multi-tool assembly in document ingestion, deployment governance, and PostgreSQL backup with single-binary or configuration-first alternatives.</description><pubDate>Sun, 22 Jun 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Before any AI agent can answer questions from a document corpus, before any deployment can reach production safely, before any PostgreSQL failure can be recovered within an RTO — someone has to do setup work that should not exist.&lt;/strong&gt; PDF parsing pipelines need hand-tuning for every document type. Deployment gating still lives in Slack threads and wiki pages. PostgreSQL continuous backup requires assembling pg_receivewal, a scheduler, a retention script, and monitoring separately. Three projects that emerged in May 2025 reduced each of those setups to a single configuration file.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Document preparation, release governance, and database disaster recovery share a common failure pattern: engineers know how to do each one, the components exist, but assembling them into a production-ready system takes long enough that teams either skip it or do it once and never revisit it. Each category also sits on the critical path of something that matters — RAG pipeline accuracy, deployment compliance, and recovery objectives. The cost of half-finishing any of them shows up in production.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Manual bottleneck&lt;/th&gt;&lt;th&gt;What it costs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Tuning PDF parsers per document type for table and layout accuracy&lt;/td&gt;&lt;td&gt;RAG pipeline precision degrades on complex layouts without per-document tuning&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Building custom OCR pipelines for scanned documents&lt;/td&gt;&lt;td&gt;Every scanned PDF corpus requires custom preprocessing before LLM ingestion&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform&lt;/td&gt;&lt;td&gt;Manually coordinating deploy gates across CI, on-call, and approval flows&lt;/td&gt;&lt;td&gt;Policy-gated deploys live in Slack threads and break on team turnover&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform&lt;/td&gt;&lt;td&gt;No audit trail for which conditions triggered a release or who approved&lt;/td&gt;&lt;td&gt;Compliance review of deployment history requires manual log correlation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;Operating pg_receivewal, a scheduler, compression, and retention scripts separately&lt;/td&gt;&lt;td&gt;Four moving parts to maintain — failure in any one breaks the backup chain&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;No integrated monitoring for backup lag or WAL segment loss&lt;/td&gt;&lt;td&gt;Backup failures are silent until a restore attempt exposes them&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Can each of these be reduced to a single-binary or configuration-first deployment?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Operational Baseline Automation] --&gt; B[System Design — OpenDataLoader PDF]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[Platform — SuperPlane]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; D[Databases — pgrwl]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; E[Structured PDF extraction — no per-document parser tuning]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[Event-driven release gates — no Slack coordination required]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; G[Single-binary PostgreSQL backup — no multi-tool assembly]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;opendataloader-pdf--eliminates-per-document-type-parser-tuning-for-rag-ingestion&quot;&gt;OpenDataLoader PDF — eliminates per-document-type parser tuning for RAG ingestion&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Every PDF corpus — multi-column research papers, financial reports, technical manuals — previously required a custom extraction pipeline tuned to its layout. Table extraction accuracy with off-the-shelf tools degraded to 60–70% on complex layouts, requiring manual post-processing before the content was useful for retrieval.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it replaces that task&lt;/strong&gt;: According to the project README, OpenDataLoader PDF achieves “#1 in benchmarks: 0.907 overall, 0.928 table accuracy across 200 real-world PDFs.” It operates in deterministic local mode (0.015s/page per README) or AI hybrid mode for complex pages, with built-in OCR supporting 80+ languages and structured output in Markdown, JSON with bounding boxes, and HTML.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: tune extraction per document layout&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pdfminer.high_level &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; extract_text&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;text &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; extract_text(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;paper.pdf&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# No table structure, no layout, no OCR for scanned pages&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Requires: custom table detection, reading order correction, OCR pipeline&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: opendataloader-pdf&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Install first, from a shell: pip install opendataloader-pdf&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; opendataloader_pdf &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; extract&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;result &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; extract(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;paper.pdf&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Returns: structured Markdown + JSON with bounding boxes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Works on digital PDFs, scanned PDFs, multi-column layouts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: The AI hybrid mode requires an external AI service, adding latency and cost on complex pages. The deterministic local mode is fast but may underperform on layouts that hybrid mode handles. A Java 11+ runtime is required; Python-only environments need a JVM installed before the library is usable.&lt;/p&gt;
&lt;h3 id=&quot;superplane--eliminates-manual-release-coordination-across-ci-approvals-and-policy-gates&quot;&gt;SuperPlane — eliminates manual release coordination across CI, approvals, and policy gates&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Policy-gated deployments — deploy only during business hours, require on-call approval, wait for rollout verification before proceeding — previously required coordinating across CI/CD systems, chat tools, and people, with no durable record of which conditions were met or who approved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it replaces that task&lt;/strong&gt;: According to the README, SuperPlane lets teams define multi-step operational workflows as directed graphs (“Canvases”), triggered by events from CI/CD, observability, and incident tools. It executes the graph, tracks state, and exposes run history and debugging in a UI and CLI. The README describes the system as “agent-friendly” — coding agents can trigger workflows and investigate executions via the CLI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: deploy gate documented in wiki, enforced via Slack&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# &quot;check with on-call, wait for 10am window, post in #deploys, run deploy.sh&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# No enforcement, no audit trail, breaks on team turnover&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: SuperPlane Canvas definition&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;canvas&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  steps&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;wait_business_hours&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      component&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;time_gate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      config&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: {&lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;start&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;09:00&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;end&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;17:00&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;timezone&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;UTC&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;require_approval&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      component&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;approval&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      config&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: {&lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;approvers&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;on-call&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      depends_on&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;wait_business_hours&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;trigger_deploy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      component&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;ci_trigger&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      config&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: {&lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;pipeline&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;production-deploy&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      depends_on&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;require_approval&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: SuperPlane is in alpha — the README explicitly states “rough edges and occasional breaking changes while we stabilize the core model.” The integration surface is wide; workflows that depend on tooling without a built-in connector require custom component development. Teams with heavily customized CI pipelines should budget engineering time for connector work.&lt;/p&gt;
&lt;h3 id=&quot;pgrwl--eliminates-the-multi-tool-postgresql-backup-assembly&quot;&gt;pgrwl — eliminates the multi-tool PostgreSQL backup assembly&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Production-grade PostgreSQL continuous backup requires assembling and operating pg_receivewal, a scheduled base backup job, compression, remote storage upload, retention management, and restore tooling — each separately configured, each a distinct failure mode.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it replaces that task&lt;/strong&gt;: According to the README, pgrwl “replaces that entire stack with a single process: WAL streaming, scheduled base backups, compression, encryption, S3/SFTP upload, retention management, and a restore helper — all driven by one binary.” It is described as a container-friendly alternative to pg_receivewal with automatic reconnects, partial WAL file handling, and integrated monitoring.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: configure and operate 4+ tools&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;systemctl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; start&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg_receivewal&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;          # WAL streaming daemon&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# crontab entry (not a shell command): 0 2 * * * pg_basebackup -D /backup&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# + write retention cleanup script&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# + configure S3 upload separately&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# + add monitoring for each component&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: pgrwl with a single config file&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# pgrwl.yaml (illustrative key names; check the README for the actual config schema)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;wal:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  streaming:&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  archive:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; s3://my-bucket/wal&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;backup:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  schedule:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;0 2 * * *&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  compression:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; zstd&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  retention:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; 7d&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;monitoring:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;  prometheus:&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pgrwl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; start&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  # one process, all components active&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: pgrwl was released May 22, 2025. No published production deployment case studies exist at the time of writing. Teams should run pgrwl in parallel with their existing backup tooling for at least 60 days and perform at least one PITR restore drill before decommissioning prior infrastructure. The restore helper is described in the README; detailed PITR validation documentation was not present in the initial release.&lt;/p&gt;
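&lt;p&gt;The cutover rule above (at least 60 days in parallel, plus at least one PITR restore drill) can be written down as an explicit gate so that nobody decommissions the old stack early. This is a minimal sketch in Python; &lt;code&gt;ParallelRun&lt;/code&gt; and &lt;code&gt;ready_to_decommission&lt;/code&gt; are hypothetical names for illustration and are not part of pgrwl.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date

# Thresholds taken from the recommendation in the text above.
MIN_PARALLEL_DAYS = 60
MIN_PITR_DRILLS = 1


@dataclass
class ParallelRun:
    """Hypothetical record of a pgrwl trial alongside the existing backup stack."""
    started: date
    successful_pitr_drills: int


def ready_to_decommission(run: ParallelRun, today: date) -> bool:
    """True only when both the time and the restore-drill conditions are met."""
    days_in_parallel = (today - run.started).days
    return (
        days_in_parallel >= MIN_PARALLEL_DAYS
        and run.successful_pitr_drills >= MIN_PITR_DRILLS
    )


trial = ParallelRun(started=date(2025, 6, 1), successful_pitr_drills=1)
print(ready_to_decommission(trial, date(2025, 7, 30)))  # 59 days in parallel
print(ready_to_decommission(trial, date(2025, 7, 31)))  # 60 days in parallel
```

&lt;p&gt;Keeping both conditions in one function makes the drill requirement as hard a gate as the calendar one; a team could run the same check in CI before merging the change that removes the old tooling.&lt;/p&gt;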
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;All three tools follow the same configuration-first pattern: each consolidates state that was previously scattered across several tools. According to their documentation, they behave as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenDataLoader PDF&lt;/strong&gt;: The documented pattern for PDF ingestion replaces separate layout detection and OCR passes with a unified pipeline. It uses hybrid fallback, meaning it defaults to local deterministic extraction and calls an external API only for complex layouts, standardizing the workflow into a single function call.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SuperPlane&lt;/strong&gt;: Policy-gated deployments depend on tracking multiple asynchronous conditions. SuperPlane’s documented behavior involves modeling these conditions as a directed graph (“Canvas”), executing them based on external events, and maintaining a centralized state ledger to replace fragmented CI and chat logs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pgrwl&lt;/strong&gt;: PostgreSQL’s &lt;code&gt;pg_receivewal&lt;/code&gt; behaves as a continuous streaming daemon, while base backups are distinct scheduled processes. pgrwl’s documented pattern consolidates these by maintaining a persistent WAL replication connection while executing base backups from the same binary, reducing the number of external dependencies required for point-in-time recovery.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;OpenDataLoader PDF local mode accuracy&lt;/td&gt;&lt;td&gt;Complex multi-column or heavily formatted layouts hit edge cases&lt;/td&gt;&lt;td&gt;Use hybrid mode for known-complex document types; budget for AI service cost&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;OpenDataLoader PDF Java runtime requirement&lt;/td&gt;&lt;td&gt;Python-only CI environments lack JVM&lt;/td&gt;&lt;td&gt;Pin Java 11+ in the build image before adding the library&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SuperPlane alpha API changes&lt;/td&gt;&lt;td&gt;Breaking changes in Canvas API affect running workflow definitions&lt;/td&gt;&lt;td&gt;Pin to a specific release tag; subscribe to changelog before upgrading&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SuperPlane connector gaps&lt;/td&gt;&lt;td&gt;Workflow depends on a tool without a built-in integration&lt;/td&gt;&lt;td&gt;Implement custom component using the SDK; expect engineering time investment&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;pgrwl restore path untested&lt;/td&gt;&lt;td&gt;Running for months without verifying a restore works&lt;/td&gt;&lt;td&gt;Schedule a quarterly PITR drill into a test environment&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;pgrwl early-release risk&lt;/td&gt;&lt;td&gt;No published production validation for the May 2025 release&lt;/td&gt;&lt;td&gt;Run parallel to existing backup tooling for 60 days before decommissioning&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Document ingestion for RAG, deployment policy enforcement, and PostgreSQL backup each require multi-tool setup that breaks in predictable and expensive ways — parser tuning failures reduce retrieval accuracy, untested backup stacks fail at recovery time, and manual deploy gates create compliance gaps when engineers leave.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: OpenDataLoader PDF for accurate multi-layout PDF extraction with no per-document tuning, SuperPlane for event-driven deployment governance with a durable audit trail, pgrwl for single-binary PostgreSQL WAL streaming and base backup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A successful OpenDataLoader PDF extraction of a complex multi-column document returns structured Markdown with correct table boundaries; a pgrwl startup log shows WAL streaming active and base backup completed without manual scheduling configuration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;pip install opendataloader-pdf&lt;/code&gt; and extract one representative PDF from your existing corpus — compare table accuracy against your current parser on a document that previously required manual post-processing.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>architecture</category></item><item><title>Three Open-Source Tools Filling the Gaps in Database Operations (May 2025)</title><link>https://rajivonai.com/blog/2025-06-14-github-stars-jun-2025/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-06-14-github-stars-jun-2025/</guid><description>May 2025&apos;s most-starred new projects solve three specific database team problems: backup restores that are never verified, internal knowledge that can&apos;t be retrieved, and AI agents blind to your schema history.</description><pubDate>Sat, 14 Jun 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Database teams have gotten good at the hard parts — query plans, replication lag, index tuning — and quietly left the infrastructure around those databases in a state that would embarrass a 2018 DevOps team.&lt;/strong&gt; Three projects that broke into GitHub’s top monthly stars in May 2025 attack that gap directly: one proves your backups actually restore before an incident does, one brings your scattered runbooks and postmortems into a local AI retrieval system that runs on a laptop, and one gives AI coding agents real access to your full schema and migration history without the context-window cost.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The operational layer around a database — backup pipelines, internal knowledge retrieval, AI-assisted schema work — has been treated as solved infrastructure while teams focused on query performance. It is not solved. Backup tools routinely verify checksums without running a restore. Internal runbooks and postmortems live in Confluence pages that no retrieval system can query efficiently. And when an engineer asks an AI coding agent to help with a migration, the agent sees only the files explicitly loaded into context — which for any real codebase never includes the full schema history.&lt;/p&gt;
&lt;p&gt;May 2025 produced three open-source tools, each crossing 7,000 stars within weeks of release, that treat each of these as an engineering problem with a specific, testable solution.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure modes are not hypothetical:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Checksum-only backup validation&lt;/td&gt;&lt;td&gt;A corrupt or incomplete dump passes checksum; fails on restore&lt;/td&gt;&lt;td&gt;Teams discover unusable backups during incidents, not before&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Vector storage at runbook scale&lt;/td&gt;&lt;td&gt;A 1M-document embedding index (1536 dimensions) needs ~6 GB just for float32 vectors&lt;/td&gt;&lt;td&gt;Prohibitive for a local DB knowledge base; forces a vector DB server&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AI agent schema blindness&lt;/td&gt;&lt;td&gt;Coding agents load only explicitly referenced files&lt;/td&gt;&lt;td&gt;ORM logic, migration history, and stored procedures are invisible to the agent&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Unverified RTO assumptions&lt;/td&gt;&lt;td&gt;Recovery time objectives are calculated against restores that have never been run&lt;/td&gt;&lt;td&gt;RTO figures are fiction until a real restore has been timed&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core question for a database team in mid-2025: can these three gaps be closed with off-the-shelf open-source tooling, or does each require building something custom?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;Each project targets one failure mode. The diagram below shows how they map onto a typical database team’s workflow:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam[database team — operational gaps]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam --&gt; BackupGap[backups verified by checksum only]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam --&gt; KnowledgeGap[runbooks and postmortems not retrievable]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBTeam --&gt; AgentGap[AI agents blind to schema and migration history]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    BackupGap --&gt; Databasus[databasus — automated restore verification pipeline]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    KnowledgeGap --&gt; LEANN[LEANN — local RAG with 97% less vector storage]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    AgentGap --&gt; ClaudeCtx[claude-context — semantic schema search via MCP]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Databasus --&gt; Outcome1[backup failure found before an incident]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    LEANN --&gt; Outcome2[institutional knowledge queryable in seconds]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ClaudeCtx --&gt; Outcome3[AI agent writes migrations with full schema context]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;databasus--verify-the-restore-not-the-checksum&quot;&gt;databasus — Verify the Restore, Not the Checksum&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; Your backup schedule is meaningless if you have never verified that a restore succeeds. Most teams test this once, at setup, and never again. databasus makes restore verification part of the backup cycle.&lt;/p&gt;
&lt;p&gt;databasus is a self-hosted, open-source backup tool (Go, Docker/Kubernetes) for PostgreSQL 12–17, MySQL 5.7–9, MariaDB, and MongoDB. It backs up to S3, Google Drive, or FTP with Slack/Discord/Telegram notifications. The differentiating feature, according to the project documentation, is that after each backup it spins up a throwaway database container, runs the full restore, confirms data integrity at the row level, and only then marks the backup valid. This is not a file hash check — it is the same procedure an on-call DBA would run manually, automated into the pipeline.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;docker&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; run&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -e&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; DATABASE_URL=&quot;postgresql://user:pass@host:5432/mydb&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -e&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; STORAGE_S3_BUCKET=&quot;db-backups-prod&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -e&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; BACKUP_SCHEDULE=&quot;0 4 * * *&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -e&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; RESTORE_VERIFICATION=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  databasus/databasus:latest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Use case for the database team:&lt;/strong&gt; Run this against your staging environment first. Two weeks of nightly backups with restore verification will tell you what your current backup tooling has been silently missing. Any backup that fails restore verification but passes the existing checksum-only check represents a recovery gap that was invisible until now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; Restore verification spins up a full database container, which for databases in the hundreds of gigabytes makes per-backup verification impractical within typical maintenance windows. The documentation recommends sampling: run full restore verification weekly and keep daily backups on checksum-only. That is still a material improvement over the current state at most teams.&lt;/p&gt;
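&lt;p&gt;The sampling policy described above (a full restore once a week, checksum-only on the other days) is simple enough to sketch in a few lines. The Python below is a hypothetical scheduling helper, not a databasus feature or configuration option; the choice of Sunday for the full restore is an assumption.&lt;/p&gt;

```python
from datetime import date, timedelta

# Hypothetical policy helper for illustration; not part of databasus.
FULL_RESTORE_WEEKDAY = 6  # Sunday in date.weekday() numbering (Monday == 0)


def verification_mode(day: date) -> str:
    """Return which check to run for the backup taken on `day`."""
    if day.weekday() == FULL_RESTORE_WEEKDAY:
        return "full-restore"
    return "checksum-only"


def plan(start: date, days: int) -> dict:
    """Map each date in the window to its verification mode."""
    window = [start + timedelta(days=offset) for offset in range(days)]
    return {day: verification_mode(day) for day in window}


week = plan(date(2025, 6, 9), 7)  # Monday 9 June through Sunday 15 June 2025
for day, mode in week.items():
    print(day.isoformat(), mode)
```

&lt;p&gt;Whatever the cadence, the point is that the longest interval between two real restores is known and bounded, rather than “once, at setup”.&lt;/p&gt;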
&lt;h3 id=&quot;leann--your-runbooks-deserve-a-real-retrieval-system&quot;&gt;LEANN — Your Runbooks Deserve a Real Retrieval System&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; Database teams accumulate enormous institutional knowledge — postmortems, runbooks, query plan archives, schema change decisions, incident timelines. This knowledge is almost never retrievable at the moment it is needed because building a proper semantic search system over it requires a vector database server, which is substantial infrastructure for a tool used internally by one team.&lt;/p&gt;
&lt;p&gt;LEANN (arXiv:2506.08276) is a vector index that stores the graph topology connecting vectors but computes the actual embedding values on demand at query time rather than persisting them. According to the paper and README, this “graph-based selective recomputation with high-degree preserving pruning” approach reduces storage by 97% compared to standard ANN indexes like FAISS, with no reported accuracy loss on standard benchmarks. At one million 1536-dimension vectors, a FAISS index that stores the raw vectors needs roughly 6 GB for the float32 values alone (1,000,000 × 1536 × 4 bytes); LEANN stores the graph structure (a fraction of that) and recomputes vectors during search.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; leann &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; LEANNIndex&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Index your team&apos;s runbooks, postmortems, schema docs (illustrative sketch; confirm class and method names against the README)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; LEANNIndex(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;storage_path&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;./db-knowledge&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;idx.add_texts(runbook_chunks)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Query at incident time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;results &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx.query(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;how did we fix the Aurora replication lag in Q3?&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;results &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx.query(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;which migrations touched the payments schema in the last 6 months?&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LEANN integrates directly with LangChain, LlamaIndex, and Ollama and includes native MCP support for agent pipelines. The entire system runs on a laptop without a vector database server.&lt;/p&gt;
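&lt;p&gt;The 6 GB figure quoted for a conventional index is plain arithmetic, and it is worth redoing for your own corpus size and embedding model. The Python sketch below reproduces it; applying the 97% reduction directly to the raw vector size is a simplifying assumption, since the paper measures the reduction against complete indexes under its own experimental conditions.&lt;/p&gt;

```python
# Back-of-the-envelope storage estimate for persisted float32 embeddings.
NUM_VECTORS = 1_000_000
DIMENSIONS = 1536
BYTES_PER_FLOAT32 = 4
CLAIMED_REDUCTION = 0.97  # figure quoted from the LEANN paper and README


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Bytes needed to persist every embedding as float32."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


full_bytes = raw_vector_bytes(NUM_VECTORS, DIMENSIONS)
full_gb = full_bytes / 1e9  # decimal gigabytes
reduced_gb = full_gb * (1 - CLAIMED_REDUCTION)

print(f"persisted float32 vectors: {full_gb:.2f} GB")
print(f"after a 97% reduction:     {reduced_gb:.2f} GB")
```

&lt;p&gt;For a team whose runbook corpus is closer to tens of thousands of chunks than a million, the same calculation shows that storage is rarely the constraint; the argument for LEANN at that scale is mostly about not operating a vector database server.&lt;/p&gt;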
&lt;p&gt;&lt;strong&gt;Use case for the database team:&lt;/strong&gt; Index your team’s Confluence export, postmortem archive, and schema changelog. Query it during incidents instead of searching Slack history. The knowledge base grows as the team adds more documents; re-indexing is incremental.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; On-demand recomputation adds query latency compared to a pre-materialized in-memory index. For interactive internal knowledge retrieval, where a 200–500 ms response is acceptable, this is a reasonable tradeoff. For high-throughput external RAG serving thousands of queries per second, benchmark before replacing a production vector store. GPU acceleration is not yet available; the project README tracks this as the highest-priority community request.&lt;/p&gt;
&lt;h3 id=&quot;claude-context--ai-agents-that-can-read-your-schema-history&quot;&gt;claude-context — AI Agents That Can Read Your Schema History&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; When a database team engineer asks Claude Code to write a migration, add a foreign key, or refactor an ORM model, the agent operates on whatever files happen to be in context. For a database layer with years of migrations, multiple ORM models, and scattered stored procedures, “whatever is in context” is never enough for a correct answer. The agent writes migrations that conflict with constraints it could not see.&lt;/p&gt;
&lt;p&gt;claude-context is an MCP server from Zilliz — the company that develops Milvus — that indexes a codebase into a vector database and exposes semantic search to AI coding agents via the Model Context Protocol. When Claude Code needs to understand a schema, it calls the MCP tool and retrieves only the semantically relevant code — not the entire codebase loaded wholesale into context. Per the README, the tool uses a Merkle tree for incremental re-indexing: after a schema migration, only the changed files are re-embedded, not the full repository.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;npx&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; @zilliz/claude-context-mcp&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; init&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Prompts for vector DB credentials and repo path&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Registers the MCP server in Claude Code settings automatically&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After indexing, when you ask Claude Code to add a column to a table referenced in a migration from 18 months ago, the agent retrieves the relevant migration history and schema definition without you having to specify the files. The agent’s schema knowledge scales with the codebase rather than being capped by the context window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks:&lt;/strong&gt; The current implementation requires a Zilliz Cloud account (free tier available) or a self-hosted Milvus deployment. Teams with strict data residency policies need to verify the self-hosted path before indexing proprietary schemas. First-time indexing of a large monorepo can take 10–30 minutes; the documentation recommends running indexing in CI after each merge and serving from a pre-built index.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;All three descriptions above are grounded in the project READMEs and the LEANN arXiv paper (2506.08276). On LEANN’s storage claims specifically: the 97% reduction is measured against FAISS on standard ANN benchmarks under the documented experimental conditions. I have not run this against a production database runbook corpus at the scale of a real team’s knowledge base. Teams should benchmark recall against their own query distribution before replacing a production vector store.&lt;/p&gt;
&lt;p&gt;databasus’s restore verification approach follows long-standing backup practice: a backup counts as valid only once a restore from it has actually succeeded. The innovation is automation rather than technique.&lt;/p&gt;
&lt;p&gt;claude-context’s Merkle-tree incremental indexing is documented in the README; it is the same general approach used by tools like Turborepo and Bazel for change detection, applied to embedding re-indexing.&lt;/p&gt;
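&lt;p&gt;To make the change-detection idea concrete, here is a minimal Python sketch of hash-based incremental detection. It is a simplified two-level tree (one root over per-file leaf hashes), not claude-context’s actual implementation; all function names and the sample file contents are hypothetical.&lt;/p&gt;

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def leaf_hashes(files: dict) -> dict:
    """Hash each file's content; these are the leaves of the tree."""
    return {path: sha256_hex(content.encode("utf-8")) for path, content in files.items()}


def root_hash(leaves: dict) -> str:
    """Combine the sorted (path, hash) pairs into a single root hash."""
    joined = "".join(f"{path}:{digest};" for path, digest in sorted(leaves.items()))
    return sha256_hex(joined.encode("utf-8"))


def changed_paths(old_leaves: dict, new_leaves: dict) -> list:
    """Return paths that were added, removed, or modified since the last index."""
    if root_hash(old_leaves) == root_hash(new_leaves):
        return []  # identical roots: nothing needs re-embedding
    all_paths = set(old_leaves) | set(new_leaves)
    return sorted(p for p in all_paths if old_leaves.get(p) != new_leaves.get(p))


before = {
    "migrations/001_init.sql": "CREATE TABLE payments (id int);",
    "models/payment.py": "class Payment: pass",
}
after = dict(before)
after["migrations/002_add_currency.sql"] = "ALTER TABLE payments ADD COLUMN currency text;"

print(changed_paths(leaf_hashes(before), leaf_hashes(after)))
# prints ['migrations/002_add_currency.sql']
```

&lt;p&gt;A full Merkle tree adds intermediate hashes per directory, so unchanged subtrees are skipped without visiting their files; the sketch keeps only the root comparison and the leaf diff to show why a single new migration triggers re-embedding of one file rather than the whole repository.&lt;/p&gt;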
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Restore verification timeout&lt;/td&gt;&lt;td&gt;Databases over 100 GB with narrow backup windows&lt;/td&gt;&lt;td&gt;Switch to weekly full restore verification plus daily checksum-only backups&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;LEANN recall degradation&lt;/td&gt;&lt;td&gt;Very sparse or domain-specific query distributions&lt;/td&gt;&lt;td&gt;Benchmark recall@10 on your actual queries before moving off FAISS&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;claude-context cold index latency&lt;/td&gt;&lt;td&gt;First indexing of a 500k+ line monorepo&lt;/td&gt;&lt;td&gt;Run indexing in CI on merge; serve from pre-built index&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;databasus version mismatch&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_dump&lt;/code&gt; version in container differs from the database major version&lt;/td&gt;&lt;td&gt;Pin container image to match database major version explicitly&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;LEANN query latency at scale&lt;/td&gt;&lt;td&gt;Large corpus + high recomputation cost&lt;/td&gt;&lt;td&gt;Tune &lt;code&gt;num_recompute&lt;/code&gt;; GPU support is on the project roadmap&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Database operations infrastructure lags behind query-layer tooling — backups are unverified, internal knowledge is dark, AI agents are schema-blind.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: databasus for verified backup pipelines, LEANN for local knowledge retrieval, claude-context for semantic schema access in AI coding agents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run databasus with &lt;code&gt;RESTORE_VERIFICATION=true&lt;/code&gt; against staging for two weeks. Any backup that fails real restore but would have passed a checksum check is a recovery gap that existed silently until now.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, install LEANN (&lt;code&gt;pip install leann&lt;/code&gt;), index your team’s postmortem directory, and run three queries against incidents from the past year. If the results would have reduced time-to-resolution in any of them, you have a case for making it part of your incident response tooling.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>architecture</category></item><item><title>MongoDB Queryable Encryption Architecture Review</title><link>https://rajivonai.com/blog/2025-05-12-mongodb-queryable-encryption-architecture-review/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-05-12-mongodb-queryable-encryption-architecture-review/</guid><description>A pre-go-live architecture review for MongoDB Queryable Encryption — key management, field classification, query type constraints, driver requirements, and key rotation.</description><pubDate>Mon, 12 May 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;MongoDB Queryable Encryption is not a feature you enable after the application is built — it is a schema and key management decision that constrains every query you can run on encrypted fields for the lifetime of the collection.&lt;/strong&gt; Getting the architecture review right before go-live is substantially cheaper than discovering a query constraint after the collection is populated and production traffic is live.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The team has decided to use MongoDB Queryable Encryption to protect a subset of sensitive document fields — PII, payment instrument data, health records, or similar categories that require protection from privileged infrastructure access. The development environment has QE configured with a local key provider. Production go-live is planned.&lt;/p&gt;
&lt;p&gt;This runbook is the go-live gate review for a team implementing QE in MongoDB 8.0. For an introduction to what QE enables and how it differs from standard field-level encryption, see &lt;a href=&quot;https://rajivonai.com/blog/2024-10-15-mongodb-80-queryable-encryption-matters/&quot;&gt;MongoDB 8.0: Why Queryable Encryption Matters&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The pre-go-live review exists because three categories of mistakes are expensive to fix after data is encrypted at scale: wrong key management provider, wrong query type configuration per field, and insufficient performance testing for range queries. Each one requires either a collection rebuild (re-encrypt all documents with corrected configuration) or a material change to how the application queries the data.&lt;/p&gt;
&lt;p&gt;How do we systematically validate the MongoDB QE deployment before production traffic begins?&lt;/p&gt;
&lt;h2 id=&quot;pre-go-live-architecture-review&quot;&gt;Pre-Go-Live Architecture Review&lt;/h2&gt;
&lt;p&gt;The target architecture must satisfy stringent key management, driver, and query constraints.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[QE go-live review] --&gt; B{KMS configured for production?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| C[Configure AWS KMS or GCP or Azure KV]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| D{All sensitive fields classified?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no| E[Create field inventory — QE vs standard FLE]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; D&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| F{Driver version 6.0 plus with libmongocrypt?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|no| G[Upgrade driver and validate encryption round-trip]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|yes| H{Query types verified for each QE field?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| I[Audit application queries vs encrypted fields map]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt; H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| J{Range query performance tested in staging?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| K[Run range query benchmark — verify latency acceptable]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| L{Key rotation procedure documented?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| M[Document CMK rotation and DEK re-wrap procedure]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| N[Approved for production go-live]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;1-key-management-provider&quot;&gt;1. Key Management Provider&lt;/h3&gt;
&lt;p&gt;Verify that production configuration uses AWS KMS, GCP Cloud KMS, Azure Key Vault, or a KMIP-compliant provider.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Insecure: local provider (development only)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; devKmsProviders&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  local: { key: localMasterKey }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;};&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Required for production: external KMS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; kmsProviders&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  aws: {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    accessKeyId: process.env.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    secretAccessKey: process.env.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;};&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A production deployment that uses the local provider defeats the encryption model: the master key lives with the application (typically in a file or environment variable), so anyone who can read the application server can unwrap every DEK.&lt;/p&gt;
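&lt;p&gt;One way to keep this check from depending on code review alone is a startup guard. The sketch below is a minimal illustration, not a driver feature: &lt;code&gt;assertProductionKms&lt;/code&gt; and its &lt;code&gt;nodeEnv&lt;/code&gt; argument are hypothetical names, and it only inspects the keys of the &lt;code&gt;kmsProviders&lt;/code&gt; object.&lt;/p&gt;

```javascript
// Hypothetical startup guard: refuse to boot in production with a local key provider.
// `assertProductionKms` and `nodeEnv` are illustrative names, not part of any driver API.
function assertProductionKms(kmsProviders, nodeEnv) {
  const providers = Object.keys(kmsProviders || {});
  if (providers.length === 0) {
    throw new Error("No KMS provider configured");
  }
  if (nodeEnv === "production") {
    // Catches "local" and, as an assumption, names prefixed with "local:".
    const local = providers.filter(
      (name) => name === "local" || name.startsWith("local:")
    );
    if (local.length > 0) {
      throw new Error("Local key provider is not allowed in production: " + local.join(", "));
    }
  }
  return providers;
}
```

&lt;p&gt;Calling it once before constructing the encrypted client, for example with &lt;code&gt;process.env.NODE_ENV&lt;/code&gt;, turns a misconfiguration into a failed deploy instead of a silent security gap.&lt;/p&gt;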
&lt;h3 id=&quot;2-field-classification&quot;&gt;2. Field Classification&lt;/h3&gt;
&lt;p&gt;Not every sensitive field needs Queryable Encryption. Fields that are only written and read by the application without server-side filtering belong on standard FLE.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Field&lt;/th&gt;&lt;th&gt;Sensitivity&lt;/th&gt;&lt;th&gt;Server-side queries needed&lt;/th&gt;&lt;th&gt;Recommendation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;ssn&lt;/code&gt;&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;Equality lookup only&lt;/td&gt;&lt;td&gt;QE — equality&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;salary&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Medium&lt;/td&gt;&lt;td&gt;Range queries needed&lt;/td&gt;&lt;td&gt;QE — range&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;medical_notes&lt;/code&gt;&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;No server-side queries&lt;/td&gt;&lt;td&gt;Standard FLE&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
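&lt;p&gt;The table&apos;s decision rule can be written down so the field inventory is checked the same way every time. This is a sketch under assumptions: &lt;code&gt;queryNeeds&lt;/code&gt; is a hypothetical inventory attribute with the values &lt;code&gt;none&lt;/code&gt;, &lt;code&gt;equality&lt;/code&gt;, or &lt;code&gt;range&lt;/code&gt;, and the returned shape is illustrative.&lt;/p&gt;

```javascript
// Illustrative mapping from a field's server-side query needs to an encryption choice.
// `queryNeeds` is a hypothetical attribute of the field inventory.
function recommendEncryption(field) {
  switch (field.queryNeeds) {
    case "none":
      return { path: field.path, mode: "Standard FLE", queryType: null };
    case "equality":
      return { path: field.path, mode: "QE", queryType: "equality" };
    case "range":
      return { path: field.path, mode: "QE", queryType: "range" };
    default:
      // An unclassified field should block the review rather than get a default.
      throw new Error("Unclassified field: " + field.path);
  }
}

const inventory = [
  { path: "ssn", queryNeeds: "equality" },
  { path: "salary", queryNeeds: "range" },
  { path: "medical_notes", queryNeeds: "none" },
];
const recommendations = inventory.map(recommendEncryption);
```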
&lt;h3 id=&quot;3-driver-version-and-dependencies&quot;&gt;3. Driver Version and Dependencies&lt;/h3&gt;
&lt;p&gt;MongoDB QE requires specific driver versions and the &lt;code&gt;libmongocrypt&lt;/code&gt; dependency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Node.js driver: &lt;code&gt;mongodb&lt;/code&gt; 6.0+&lt;/li&gt;
&lt;li&gt;Python driver: &lt;code&gt;pymongo&lt;/code&gt; 4.4+ with &lt;code&gt;pymongo[encryption]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Java driver: 4.11+&lt;/li&gt;
&lt;li&gt;libmongocrypt: 1.8+&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Node.js&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -E&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;&quot;mongodb(-client-encryption)?&quot;&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; package.json&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
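&lt;p&gt;A string match only shows that the package is present. Comparing the declared version against the minimums listed above needs a numeric comparison, because &lt;code&gt;4.10&lt;/code&gt; sorts before &lt;code&gt;4.4&lt;/code&gt; as text. The helper names below are hypothetical, and the parser is deliberately simple: it reads the first &lt;code&gt;major.minor.patch&lt;/code&gt; it finds and ignores range operators.&lt;/p&gt;

```javascript
// Extract [major, minor, patch] from strings such as "^6.3.0", "~4.4", or "1.8.0".
function parseVersion(text) {
  const match = /(\d+)\.(\d+)(?:\.(\d+))?/.exec(text);
  if (!match) {
    throw new Error("Unrecognized version: " + text);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3] || 0)];
}

// True when `installed` is numerically at or above `minimum`.
function meetsMinimum(installed, minimum) {
  const a = parseVersion(installed);
  const b = parseVersion(minimum);
  for (let i = 0; i !== 3; i += 1) {
    if (a[i] > b[i]) return true;
    if (b[i] > a[i]) return false;
  }
  return true; // equal versions satisfy the minimum
}
```

&lt;p&gt;For example, &lt;code&gt;meetsMinimum(pkg.dependencies.mongodb, &quot;6.0.0&quot;)&lt;/code&gt; can run in CI against the parsed &lt;code&gt;package.json&lt;/code&gt;. A declared range is not the installed version, so the lockfile remains the authoritative source.&lt;/p&gt;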
&lt;h3 id=&quot;4-query-type-configuration&quot;&gt;4. Query Type Configuration&lt;/h3&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; encryptedFieldsMap&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &quot;mydb.patients&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    fields: [&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        path: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;ssn&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        bsonType: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        queries: [{ queryType: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;equality&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;};&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Regex, &lt;code&gt;$text&lt;/code&gt;, &lt;code&gt;$where&lt;/code&gt;, and most aggregation expressions that operate on encrypted field content are not supported for server-side evaluation.&lt;/p&gt;
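&lt;p&gt;The audit of application queries against the encrypted fields map can be partly mechanized. The sketch below assumes you have already collected the operators the application uses per field (by hand or from a code search) into &lt;code&gt;{ path, operator }&lt;/code&gt; records. The operator table is a simplified assumption for illustration, not MongoDB&apos;s full supported-operations list, and all function names are hypothetical.&lt;/p&gt;

```javascript
// Simplified assumption: which operators each configured query type can serve.
const OPERATORS_BY_QUERY_TYPE = {
  equality: ["$eq", "$ne", "$in", "$nin"],
  range: ["$gt", "$gte", "$lt", "$lte"],
};

// Build a Map of field path to configured query types.
// `queries` may be a single object or an array, so normalize it to an array.
function queryTypesFor(encryptedFields) {
  const byPath = new Map();
  for (const field of encryptedFields.fields) {
    const queries = [].concat(field.queries || []);
    byPath.set(field.path, queries.map((q) => q.queryType));
  }
  return byPath;
}

// Return the usages that no configured query type on that field can serve.
function findUnsupportedUsages(encryptedFields, usages) {
  const byPath = queryTypesFor(encryptedFields);
  return usages.filter((usage) => {
    if (!byPath.has(usage.path)) return false; // not an encrypted field
    const types = byPath.get(usage.path);
    return !types.some((type) =>
      (OPERATORS_BY_QUERY_TYPE[type] || []).includes(usage.operator)
    );
  });
}
```

&lt;p&gt;Any non-empty result is a finding for the review: either the field needs a different &lt;code&gt;queryType&lt;/code&gt; before data is written, or the application query has to change.&lt;/p&gt;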
&lt;h3 id=&quot;5-dek-cache-ttl-and-rotation&quot;&gt;5. DEK Cache TTL and Rotation&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;ClientEncryption&lt;/code&gt; object caches Data Encryption Keys (DEKs) in application memory.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; clientEncryption&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; ClientEncryption&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(client, {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  keyVaultNamespace: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;encryption.__keyVault&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  kmsProviders,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  keyExpirationMS: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;60000&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt; // DEK cache TTL in milliseconds (60 s)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;});&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a key rotation or revocation to take effect within your response SLA, the cache TTL must be no longer than that SLA: a process can keep using a cached DEK for up to one full TTL after the change.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;All patterns below are derived from MongoDB’s official QE documentation (&lt;a href=&quot;https://www.mongodb.com/docs/manual/core/queryable-encryption/&quot;&gt;MongoDB Queryable Encryption docs&lt;/a&gt;). I have not run QE at production scale personally; these are documented design behaviors, not field observations.&lt;/p&gt;
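&lt;p&gt;Moving from the local provider to an external KMS does not by itself force a data rewrite. MongoDB documents &lt;code&gt;rewrapManyDataKey&lt;/code&gt; as taking a target provider and master key, so existing DEKs can be re-wrapped under a KMS-held CMK while the documents stay untouched. Re-wrapping only changes how the DEKs are protected, though. If the local master key was ever exposed (committed to a repository, copied between hosts, left on a developer laptop), the DEKs it wrapped should be treated as compromised, and the safe path is to create new KMS-backed DEKs and decrypt and re-encrypt the affected documents before production go-live.&lt;/p&gt;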
&lt;p&gt;Range queries on QE-encrypted fields carry substantial performance overhead. The documented pattern is that range encryption introduces additional metadata index entries per document — MongoDB’s range index for an encrypted field stores multiple auxiliary entries per document (not just one per document as a standard B-tree index does), so index storage size grows significantly with collection volume. A collection with 50 million documents and two range-encrypted fields can accumulate an encrypted index substantially larger than equivalent unencrypted field indexes. Write latency also increases because each insert must write auxiliary range index metadata. The actual latency impact depends heavily on collection size, range bounds configuration, and range precision settings (&lt;code&gt;sparsity&lt;/code&gt; and &lt;code&gt;trimFactor&lt;/code&gt; in the &lt;code&gt;encryptedFields&lt;/code&gt; config). Benchmarking must be done at production scale:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; start&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Date.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; results&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; await&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;collection&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;patients&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  dob: { $gte: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; Date&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;1970-01-01&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), $lte: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; Date&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;1990-12-31&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;toArray&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;();&lt;/span&gt;&lt;/span&gt;
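&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; elapsed&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Date.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;console.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;log&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;range query returned&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, results.length, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;docs in&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, elapsed, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;ms&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;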
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
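&lt;p&gt;A single timed run says little about tail latency, which is usually what the go-live decision hinges on. A minimal sketch of a repeat-and-summarize harness follows. &lt;code&gt;benchmark&lt;/code&gt;, &lt;code&gt;summarizeLatencies&lt;/code&gt;, and &lt;code&gt;queryFn&lt;/code&gt; are hypothetical names; &lt;code&gt;queryFn&lt;/code&gt; stands for any async function that runs the range query, and percentiles use the nearest-rank method.&lt;/p&gt;

```javascript
// Nearest-rank percentile over an ascending-sorted array of millisecond samples.
function percentile(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize raw latency samples without mutating the input array.
function summarizeLatencies(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return {
    runs: sorted.length,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// Run `queryFn` sequentially `runs` times and summarize wall-clock latency.
async function benchmark(queryFn, runs) {
  const samples = [];
  for (let i = 0; i !== runs; i += 1) {
    const start = performance.now();
    await queryFn();
    samples.push(performance.now() - start);
  }
  return summarizeLatencies(samples);
}
```

&lt;p&gt;Sequential runs measure per-query latency only. Behavior under concurrent load needs a separate test at production data volume.&lt;/p&gt;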
&lt;p&gt;&lt;strong&gt;Multi-pod DEK cache consistency.&lt;/strong&gt; In multi-instance application deployments, each process holds its own in-memory DEK cache. Re-wrapping a DEK under a new CMK does not change the DEK itself, so cached copies keep working. The risk appears when a DEK is deleted or access to the old CMK is revoked: processes that still hold a cached DEK keep decrypting until their &lt;code&gt;keyExpirationMS&lt;/code&gt; TTL elapses, while processes that must fetch and unwrap the key fail immediately. During this window some pods succeed on encrypted reads and others fail, a split-brain-like failure mode in which errors appear intermittently across instances. The operational requirement is to either set a short TTL (accepting higher KMS call volume) or coordinate a rolling restart of application pods immediately after the key change to flush all caches.&lt;/p&gt;
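&lt;p&gt;The TTL tradeoff can be roughed out with arithmetic before choosing a value. The sketch below rests on a stated assumption that does not come from MongoDB&apos;s documentation: each process re-fetches and unwraps each DEK at most once per TTL period, so the result is an upper bound for planning, not a measurement. The function and parameter names are hypothetical.&lt;/p&gt;

```javascript
// Planning estimate for a DEK cache TTL.
// Assumption: each process unwraps each DEK at most once per TTL period.
function estimateDekCacheTradeoff({ pods, deks, keyExpirationMS, rotationSlaMS }) {
  const refreshesPerHour = Math.ceil(3600000 / keyExpirationMS);
  return {
    // Longest time a process can keep using a cached DEK after a key change.
    worstCaseStaleMS: keyExpirationMS,
    withinSla: rotationSlaMS >= keyExpirationMS,
    // Upper bound on KMS unwrap calls per hour across all pods.
    maxKmsCallsPerHour: pods * deks * refreshesPerHour,
  };
}

const estimate = estimateDekCacheTradeoff({
  pods: 20,
  deks: 5,
  keyExpirationMS: 60000,
  rotationSlaMS: 300000,
});
```

&lt;p&gt;With 20 pods, 5 DEKs, and a 60-second TTL, the bound is 6,000 unwrap calls per hour, which can be compared against the KMS provider&apos;s request quota and pricing.&lt;/p&gt;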
&lt;p&gt;Rotating the Customer Master Key (CMK) in the KMS does not require re-encrypting document data. The documented pattern is to call the &lt;code&gt;rewrapManyDataKey&lt;/code&gt; method on &lt;code&gt;ClientEncryption&lt;/code&gt;, which re-wraps the DEKs with the new CMK while leaving the underlying collection data untouched:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;await&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; clientEncryption.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;rewrapManyDataKey&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;/span&gt;
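&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  {},&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt; // empty filter: re-wrap every DEK in the key vault&lt;/span&gt;&lt;/span&gt;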
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    provider: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;aws&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    masterKey: { region: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;us-east-1&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, key: process.env.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;NEW_AWS_CMK_ARN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
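&lt;p&gt;After a re-wrap it is worth confirming that no DEK is still wrapped by the old key. Key vault documents record their wrapping key under &lt;code&gt;masterKey&lt;/code&gt;, so the check can be a plain filter over documents already read from the key vault. This is a sketch for the AWS case only: it assumes &lt;code&gt;masterKey.provider&lt;/code&gt; and &lt;code&gt;masterKey.key&lt;/code&gt; hold the provider name and CMK ARN, and &lt;code&gt;findKeysNotOnCmk&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```javascript
// Return key vault documents not wrapped by the expected provider and CMK.
// AWS-shaped assumption: `masterKey.key` holds the CMK ARN.
function findKeysNotOnCmk(keyDocs, provider, cmkArn) {
  return keyDocs.filter((doc) => {
    const masterKey = doc.masterKey || {};
    return masterKey.provider !== provider || masterKey.key !== cmkArn;
  });
}

// Example input shaped like key vault documents (ARNs are placeholders).
const sampleKeyDocs = [
  { keyAltNames: ["ssn-key"], masterKey: { provider: "aws", region: "us-east-1", key: "arn:new" } },
  { keyAltNames: ["salary-key"], masterKey: { provider: "aws", region: "us-east-1", key: "arn:old" } },
  { keyAltNames: ["dev-key"], masterKey: { provider: "local" } },
];
const stragglers = findKeysNotOnCmk(sampleKeyDocs, "aws", "arn:new");
```

&lt;p&gt;An empty result is the condition to wait for before disabling the old CMK.&lt;/p&gt;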
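&lt;p&gt;Automating visibility into DEK health is a common operational pattern. DEK age can be monitored from the key vault collection: &lt;code&gt;creationDate&lt;/code&gt; shows when a DEK was created, and &lt;code&gt;updateDate&lt;/code&gt; changes when it is re-wrapped, which makes it the field to watch for how recently a CMK rotation was applied:&lt;/p&gt;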
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;getSiblingDB&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;encryption&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;getCollection&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;__keyVault&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  {},&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  { keyAltNames: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, creationDate: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, updateDate: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;forEach&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;key&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; ageDays&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (Date.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; key.creationDate) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 86400000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  if&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (ageDays &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 90&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;    print&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;DEK may need rotation:&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, key.keyAltNames, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;age:&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, Math.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(ageDays), &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;days&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;});&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Symptoms of an Incomplete QE Design&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Local key provider in production config&lt;/td&gt;&lt;td&gt;&lt;code&gt;ClientEncryption&lt;/code&gt; initialization in app code&lt;/td&gt;&lt;td&gt;Security model broken: key material accessible without KMS&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Driver older than the QE minimum (for example Node.js &lt;code&gt;mongodb&lt;/code&gt; below 6.0 or &lt;code&gt;pymongo&lt;/code&gt; below 4.4)&lt;/td&gt;&lt;td&gt;&lt;code&gt;package.json&lt;/code&gt; or &lt;code&gt;requirements.txt&lt;/code&gt;&lt;/td&gt;&lt;td&gt;QE features missing from the driver or its libmongocrypt bindings; encrypted operations fail at runtime&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;QE field queried with regex in application&lt;/td&gt;&lt;td&gt;Application code search&lt;/td&gt;&lt;td&gt;Unsupported query type: will fail or require application-layer workaround&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No key rotation procedure documented&lt;/td&gt;&lt;td&gt;Architecture documentation&lt;/td&gt;&lt;td&gt;CMK rotation unplanned: compliance risk&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Range query on equality-only field&lt;/td&gt;&lt;td&gt;Encrypted fields map vs query code&lt;/td&gt;&lt;td&gt;Runtime error when range query hits equality-only encrypted field&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;DEK cache TTL longer than the rotation SLA, or never expiring&lt;/td&gt;&lt;td&gt;&lt;code&gt;ClientEncryption&lt;/code&gt; configuration&lt;/td&gt;&lt;td&gt;Rotated or revoked keys stay in use until the cache expires or the process restarts&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Design Tradeoffs and Failure Modes&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Design Decision&lt;/th&gt;&lt;th&gt;Benefit&lt;/th&gt;&lt;th&gt;Tradeoff / Failure Mode&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Standard FLE vs QE&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Simpler setup, lower overhead, no strict query constraints.&lt;/td&gt;&lt;td&gt;Randomly encrypted fields cannot be queried server-side at all; deterministic encryption allows equality matches only, and range queries are not available.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Equality vs Range&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Equality has faster performance and generates less metadata.&lt;/td&gt;&lt;td&gt;Runtime errors will occur if the application attempts a range query on an equality-only field.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;External KMS Dependency&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Meets compliance standards; security model is maintained.&lt;/td&gt;&lt;td&gt;&lt;strong&gt;KMS Unavailability:&lt;/strong&gt; If the KMS endpoint becomes unreachable, the application cannot unwrap DEKs that are not already cached, so encrypted reads and writes fail once the cache expires. Plan for KMS high availability.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Short DEK Cache TTL&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Application responds quickly to CMK rotations and revocations.&lt;/td&gt;&lt;td&gt;Increases request volume to the external KMS, potentially impacting latency and increasing costs.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;In-place Schema Changes&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;N/A&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Post-Go-Live Rigidity:&lt;/strong&gt; MongoDB does not support in-place schema changes for QE. Changing &lt;code&gt;queryType&lt;/code&gt; requires a collection rebuild that decrypts and re-encrypts all data, and its duration scales with collection size.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Queryable Encryption field configurations are effectively fixed once data is written; a wrong choice of query type, or a key provider whose keys cannot be trusted, leads to an expensive collection rebuild.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Execute a pre-go-live architecture review validating field classification, driver versions, query constraints, and performance overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Benchmarking range queries at production scale and validating the &lt;code&gt;rewrapManyDataKey&lt;/code&gt; rotation process ensures the infrastructure behaves correctly under real-world conditions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Implement the five verification checks listed above before deploying the encrypted fields map to the production cluster, and schedule an automated job to monitor DEK age.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>checklist</category></item><item><title>Per-Application Postgres on Kubernetes Is an Isolation Strategy</title><link>https://rajivonai.com/blog/2025-04-26-per-application-postgres-on-kubernetes-is-an-isolation-strat/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-04-26-per-application-postgres-on-kubernetes-is-an-isolation-strat/</guid><description>How CloudNativePG, GitOps, and External Secrets turn Postgres-on-Kubernetes into an operational isolation pattern.</description><pubDate>Sat, 26 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Postgres-on-Kubernetes is not a cheaper managed database; it is a decision to turn each application database into its own auditable, recoverable, failure-contained operating unit.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Teams are pushing more stateful infrastructure into Kubernetes because the rest of the delivery system already lives there: GitOps, policy admission, secrets, observability, and rollout control. CloudNativePG gives PostgreSQL a Kubernetes-native control plane, but the architectural question is not “can the operator run Postgres?” It can.&lt;/p&gt;
&lt;p&gt;The better question is whether per-application clusters are worth the operational multiplication.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Default approach&lt;/th&gt;&lt;th&gt;Alternative&lt;/th&gt;&lt;th&gt;What changes&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Shared managed PostgreSQL instance&lt;/td&gt;&lt;td&gt;Per-application CloudNativePG cluster&lt;/td&gt;&lt;td&gt;Isolation moves from database names to failure domains&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Ticket-driven database provisioning&lt;/td&gt;&lt;td&gt;GitOps database manifests&lt;/td&gt;&lt;td&gt;Provisioning becomes reviewable infrastructure state&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Central backup policy&lt;/td&gt;&lt;td&gt;Declared backup per cluster&lt;/td&gt;&lt;td&gt;Recovery becomes an application contract&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;One upgrade path&lt;/td&gt;&lt;td&gt;Independent cluster lifecycle&lt;/td&gt;&lt;td&gt;Coordination cost moves to platform standards&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Shared PostgreSQL looks efficient until one application’s database lifecycle starts behaving like everyone’s outage. A migration that takes an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock, a connection storm after a deploy, a bad &lt;code&gt;DELETE FROM&lt;/code&gt;, or a noisy autovacuum cycle does not respect team boundaries just because the schemas have different names.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Shared compute and I/O&lt;/td&gt;&lt;td&gt;One workload consumes CPU, memory, WAL bandwidth, or storage IOPS&lt;/td&gt;&lt;td&gt;PostgreSQL isolation inside one instance is weaker than Kubernetes isolation across pods, PVCs, and quotas&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Shared upgrade window&lt;/td&gt;&lt;td&gt;PostgreSQL 15 to 16, extension changes, or parameter restarts affect unrelated apps&lt;/td&gt;&lt;td&gt;Teams lose independent lifecycle control even when their schema is not changing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Shared blast radius&lt;/td&gt;&lt;td&gt;A rogue migration, bad application deploy, or dropped table lands inside a common operational boundary&lt;/td&gt;&lt;td&gt;Recovery decisions become political: restore one app and risk everyone else, or do surgery under pressure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitOps drift&lt;/td&gt;&lt;td&gt;Argo CD can reconcile Deployments while the database remains a manually created external dependency&lt;/td&gt;&lt;td&gt;The application appears declarative, but its most important dependency is still tribal memory&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failover optimism&lt;/td&gt;&lt;td&gt;The database promotes a replica, but clients keep dead TCP sessions or stale DNS targets&lt;/td&gt;&lt;td&gt;The operator can move the primary; it cannot prove the application survived&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;CloudNativePG addresses part of this by giving each &lt;code&gt;Cluster&lt;/code&gt; resource its own primary, replicas, services, WAL archive, backups, and Kubernetes lifecycle. The trap is thinking that means the hard part is solved. The real design question is: how do you get the isolation benefit without creating fifty tiny database platforms?&lt;/p&gt;
&lt;h2 id=&quot;per-application-clusters-as-an-isolation-plane&quot;&gt;Per-Application Clusters as an Isolation Plane&lt;/h2&gt;
&lt;p&gt;The right architecture is a platform contract: every application gets its own PostgreSQL cluster, but every cluster is created through the same operator, GitOps layout, secret flow, backup policy, monitoring labels, and recovery drill.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dev[developer change] --&gt; Git[git repository — apps and databases]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Git --&gt; Argo[Argo CD ApplicationSet]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Argo --&gt; App[application namespace]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Argo --&gt; DB[CloudNativePG Cluster]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Vault[cloud secret manager] --&gt; ESO[External Secrets operator]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ESO --&gt; AppSecret[Kubernetes Secret — app credentials]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ESO --&gt; DBSecret[Kubernetes Secret — backup credentials]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DB --&gt; RW[read write service]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DB --&gt; RO[read only service]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DB --&gt; WAL[WAL archive — object storage]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Prom[Prometheus] --&gt; Dash[Grafana dashboard]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DB --&gt; Prom&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App --&gt; RW&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Separate application and database manifests, but reconcile both from Git.&lt;/strong&gt;&lt;br&gt;
Use a layout such as &lt;code&gt;apps/linkding/overlays/dev&lt;/code&gt; and &lt;code&gt;databases/linkding/overlays/dev&lt;/code&gt;, with separate Argo CD &lt;code&gt;ApplicationSet&lt;/code&gt; definitions. The separation matters because application rollout and database lifecycle have different risk profiles. A Deployment rollback is not the same thing as rewinding a database.&lt;br&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; a fresh namespace can be rebuilt from Git without a manual database creation step.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use CloudNativePG services as the only in-cluster database entry point.&lt;/strong&gt;&lt;br&gt;
CloudNativePG manages &lt;code&gt;rw&lt;/code&gt;, &lt;code&gt;ro&lt;/code&gt;, and &lt;code&gt;r&lt;/code&gt; services; the &lt;code&gt;rw&lt;/code&gt; service points at the current primary, while &lt;code&gt;ro&lt;/code&gt; points at replicas where available, according to the &lt;a href=&quot;https://cloudnative-pg.io/docs/1.28/service_management/&quot;&gt;CloudNativePG service management documentation&lt;/a&gt;. Do not connect applications directly to pod DNS names. That is how failover tests pass in the database layer and fail in the application layer.&lt;br&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; delete the current primary pod, then confirm the application writes through &lt;code&gt;&amp;#x3C;cluster&gt;-rw&lt;/code&gt; after promotion.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Externalize secrets before the first cluster exists.&lt;/strong&gt;&lt;br&gt;
Database owner credentials, application passwords, Azure Blob or S3 credentials, and backup access should come from a cloud secret manager through External Secrets. Kubernetes Secrets are the runtime projection, not the source of authority.&lt;br&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; rotating the upstream secret updates the projected Kubernetes Secret and triggers the expected application or pooler reload path.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Treat WAL archiving as a production requirement, not a backup checkbox.&lt;/strong&gt;&lt;br&gt;
CloudNativePG 1.29 documents point-in-time recovery as dependent on a valid WAL archive, and recovery bootstraps a new cluster rather than restoring in place (&lt;a href=&quot;https://cloudnative-pg.io/docs/1.29/recovery&quot;&gt;recovery docs&lt;/a&gt;). That distinction is operationally important: your restore manifest is a runbook, not a patch to the broken cluster.&lt;br&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; create a temporary namespace, restore from the latest base backup plus WAL, and run application-level read checks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Standardize admission policy before the tenth database.&lt;/strong&gt;&lt;br&gt;
Per-app clusters multiply everything: PVCs, PodDisruptionBudgets, backup jobs, certificates, metrics, alerts, and upgrade queues. Use Kyverno or OPA Gatekeeper to require resource requests, backup retention, owner labels, network policies, and anti-affinity.&lt;br&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; a malformed &lt;code&gt;Cluster&lt;/code&gt; manifest is rejected before Argo CD can apply it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
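&lt;p&gt;The admission rules in step 5 can be prototyped before any Kyverno or Gatekeeper policy exists. The sketch below is plain Python, not a policy-engine rule: it takes a &lt;code&gt;Cluster&lt;/code&gt; manifest already parsed into a dictionary and lists what is missing. The &lt;code&gt;owner&lt;/code&gt; label key and the function name &lt;code&gt;policy_violations&lt;/code&gt; are assumptions for illustration, and the field paths assume the backup settings live under &lt;code&gt;spec.backup&lt;/code&gt;; adjust them if your backups are configured elsewhere.&lt;/p&gt;

```python
def policy_violations(manifest):
    """Return a list of readable policy violations for a parsed Cluster manifest."""
    violations = []
    metadata = manifest.get('metadata', {})
    spec = manifest.get('spec', {})

    if manifest.get('kind') != 'Cluster':
        violations.append('kind must be Cluster')

    # Assumed convention: every cluster carries an owner label.
    if not metadata.get('labels', {}).get('owner'):
        violations.append('metadata.labels.owner is required')

    # Resource requests keep the scheduler and namespace quotas meaningful.
    requests = spec.get('resources', {}).get('requests', {})
    for resource in ('cpu', 'memory'):
        if resource not in requests:
            violations.append('spec.resources.requests.' + resource + ' is required')

    # A retention policy is treated as mandatory, not optional.
    if not spec.get('backup', {}).get('retentionPolicy'):
        violations.append('spec.backup.retentionPolicy is required')

    return violations
```

&lt;p&gt;Run as a pre-commit or CI check, an empty list means the manifest satisfies these rules. The in-cluster admission policy remains the actual enforcement point.&lt;/p&gt;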
&lt;p&gt;One version-specific gotcha: CloudNativePG scheduled backups use a six-field cron expression with seconds, not the five-field Unix format; &lt;code&gt;0 0 0 * * *&lt;/code&gt; means midnight in CNPG, while Kubernetes CronJobs would use &lt;code&gt;0 0 * * *&lt;/code&gt; (&lt;a href=&quot;https://cloudnative-pg.io/docs/1.29/backup&quot;&gt;CNPG backup docs&lt;/a&gt;). That is exactly the kind of small mismatch that becomes a failed audit three months later.&lt;/p&gt;
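&lt;p&gt;A small guard makes the cron difference mechanical. This is a minimal sketch that assumes the only change needed is a leading seconds field set to &lt;code&gt;0&lt;/code&gt;; the helper name &lt;code&gt;to_cnpg_schedule&lt;/code&gt; is hypothetical and not part of any CloudNativePG tooling.&lt;/p&gt;

```python
def to_cnpg_schedule(unix_cron):
    """Convert a five-field Unix cron string to six fields with seconds first."""
    fields = unix_cron.split()
    if len(fields) != 5:
        raise ValueError('expected five cron fields, got ' + str(len(fields)))
    # Fire at second 0 of whatever minute the original expression selects.
    return ' '.join(['0'] + fields)


print(to_cnpg_schedule('0 0 * * *'))  # prints: 0 0 0 * * *
```

&lt;p&gt;Rejecting anything that is not exactly five fields also catches the opposite mistake: an expression that already has six fields being converted a second time.&lt;/p&gt;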
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is not theoretical. Zalando wrote in 2017 that the gap between an engineer wanting PostgreSQL and the database team creating it was still a ticketing workflow; their stated direction was to trigger PostgreSQL cluster setup from engineers committing to Git through the Kubernetes API (&lt;a href=&quot;https://engineering.zalando.com/posts/2017/06/postgresql-in-a-time-of-kubernetes.html&quot;&gt;Zalando Engineering, 2017&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;By 2018, Zalando reported using its Postgres operator to manage more than 400 PostgreSQL clusters across Kubernetes installations, with the operator watching declarative manifests and carrying out create, update, and delete operations (&lt;a href=&quot;https://engineering.zalando.com/posts/2018/11/postgres-operator.html&quot;&gt;Zalando Engineering, 2018&lt;/a&gt;). That is the important lesson: the operator was not valuable because YAML is charming. It was valuable because manual operations had become impossible at fleet scale.&lt;/p&gt;
&lt;p&gt;CloudNativePG is a different operator, but the system behavior maps cleanly. A &lt;code&gt;Cluster&lt;/code&gt; custom resource describes desired database state. The operator reconciles pods, replication, services, backups, and status. Kubernetes becomes the control plane, and Git becomes the audit trail. The production pattern is per-application autonomy inside platform-enforced boundaries.&lt;/p&gt;
&lt;p&gt;The part the tutorial usually underplays is client behavior during failover. CloudNativePG can promote a replica and repoint the &lt;code&gt;rw&lt;/code&gt; service, but a Java service using HikariCP, a Django app with persistent connections, or PgBouncer in transaction pooling mode still has to discard broken sessions and reconnect. Kubernetes service updates do not magically heal a process holding a dead TCP socket. Your HA test is not complete until writes succeed through the normal application code path after primary loss.&lt;/p&gt;
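&lt;p&gt;The client side of that failover test can be sketched without committing to a driver. In the example below, &lt;code&gt;connect&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt; are caller-supplied callables, and the built-in &lt;code&gt;ConnectionError&lt;/code&gt; stands in for whatever exception a real driver raises on a dead socket. All names are assumptions for illustration; a production service would normally rely on its pool&apos;s own lifetime and retry settings.&lt;/p&gt;

```python
import time


def write_with_reconnect(connect, write, attempts=5, delay_seconds=0.5):
    """Retry a write, opening a fresh connection after every connection failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Never reuse the old session: it may be pinned to the failed primary.
            conn = connect()
            try:
                return write(conn)
            finally:
                conn.close()
        except ConnectionError as error:
            last_error = error
            # Linear backoff gives the promotion time to finish.
            time.sleep(delay_seconds * (attempt + 1))
    raise RuntimeError('write did not succeed after failover') from last_error
```

&lt;p&gt;The design choice worth copying is that every retry reconnects through the service name instead of reusing a held socket, which is the behavior the failover drill is meant to prove.&lt;/p&gt;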
&lt;p&gt;Schema changes also need their own protocol. GitOps is good at reconciling declarative infrastructure; it is not a migration ordering engine. PostgreSQL DDL can block, rewrite, or invalidate assumptions depending on the operation and version. Postgres 11 reduced pain for adding columns with constant defaults, but lock acquisition still matters. The practical rule is simple: deploy backward-compatible schema first, ship compatible application code second, remove old schema last. The database cluster being per-app makes this easier, not automatic.&lt;/p&gt;
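&lt;p&gt;Lock acquisition is the part that is easy to make explicit. The sketch below wraps a single DDL statement in a transaction with a &lt;code&gt;lock_timeout&lt;/code&gt;, so the migration fails fast instead of queueing behind a long-running transaction and blocking everything behind it. The helper name, the five-second budget, and the &lt;code&gt;bookmarks&lt;/code&gt; table are assumptions for illustration.&lt;/p&gt;

```python
def guarded_ddl(statement, lock_timeout='5s'):
    """Return the SQL statements that run one DDL change under a lock timeout."""
    return [
        'BEGIN',
        # SET LOCAL limits the timeout to this transaction only.
        "SET LOCAL lock_timeout = '" + lock_timeout + "'",
        statement,
        'COMMIT',
    ]


for sql in guarded_ddl('ALTER TABLE bookmarks ADD COLUMN archived_at timestamptz'):
    print(sql)
```

&lt;p&gt;Statements that cannot run inside a transaction block, such as &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;, need a session-level timeout instead of this wrapper.&lt;/p&gt;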
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;


















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Control-plane overload&lt;/td&gt;&lt;td&gt;Dozens of three-instance clusters create hundreds of pods, PVCs, Services, Secrets, PodMonitors, and backup objects&lt;/td&gt;&lt;td&gt;Set namespace quotas, require owner labels, cap default instance counts, and watch Kubernetes API latency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Fake failover success&lt;/td&gt;&lt;td&gt;&lt;code&gt;kubectl delete pod&lt;/code&gt; promotes a replica, but app clients hold stale TCP sessions&lt;/td&gt;&lt;td&gt;Test through the real app and pooler; enforce connection lifetime, retry policy, and startup probes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup theater&lt;/td&gt;&lt;td&gt;WAL ships to object storage, but no one has restored a cluster since launch&lt;/td&gt;&lt;td&gt;Schedule restore drills; measure recovery point objective and recovery time objective with restored application reads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GitOps fights the operator&lt;/td&gt;&lt;td&gt;Argo CD prunes generated objects or overwrites operator-managed fields&lt;/td&gt;&lt;td&gt;Scope Argo CD ownership to declared resources; ignore generated status and operator-owned children&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Migration lock incident&lt;/td&gt;&lt;td&gt;A large table migration blocks writes or waits behind long transactions&lt;/td&gt;&lt;td&gt;Add lock timeout budgets, split schema and code deploys, and run preflight checks for blocking sessions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Version skew&lt;/td&gt;&lt;td&gt;Tutorial pins CNPG chart &lt;code&gt;0.20.1&lt;/code&gt; and PostgreSQL &lt;code&gt;16.1&lt;/code&gt;, while the platform has moved to CNPG 1.29 and newer Postgres images&lt;/td&gt;&lt;td&gt;Pin operator, CRDs, image catalogs, and Postgres major versions explicitly; rehearse operator upgrades outside production&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Restore collision&lt;/td&gt;&lt;td&gt;A recovered cluster writes WAL into the same archive prefix as the source&lt;/td&gt;&lt;td&gt;Use unique server names and bucket paths; CNPG 1.29 includes archive safety checks for this class of mistake&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Read replica misuse&lt;/td&gt;&lt;td&gt;Application sends correctness-sensitive reads to &lt;code&gt;ro&lt;/code&gt; and observes replication lag&lt;/td&gt;&lt;td&gt;Use replicas for tolerant analytical reads; keep read-after-write paths on &lt;code&gt;rw&lt;/code&gt; unless the app handles lag explicitly&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
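&lt;p&gt;The read replica row is the easiest one to encode in application code. This minimal sketch picks between the &lt;code&gt;rw&lt;/code&gt; and &lt;code&gt;ro&lt;/code&gt; services based on whether the caller needs to see its own writes. The function name and the &lt;code&gt;linkding-db&lt;/code&gt; cluster name are hypothetical, and the host format assumes standard in-cluster service DNS.&lt;/p&gt;

```python
def service_host(cluster, namespace, read_after_write):
    """Pick the CloudNativePG service for a query path."""
    # Reads that must observe the caller's own writes stay on the primary.
    suffix = 'rw' if read_after_write else 'ro'
    return cluster + '-' + suffix + '.' + namespace + '.svc'


print(service_host('linkding-db', 'linkding', True))   # prints: linkding-db-rw.linkding.svc
print(service_host('linkding-db', 'linkding', False))  # prints: linkding-db-ro.linkding.svc
```

&lt;p&gt;Making the choice an explicit argument forces each call site to state whether it tolerates replication lag, instead of inheriting a default connection string.&lt;/p&gt;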
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Shared PostgreSQL hides unrelated applications inside the same failure and recovery boundary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Move one application at a time to its own CloudNativePG cluster, but require the same GitOps layout, external secret source, WAL archive, monitoring labels, resource limits, and admission policy for every cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: The rollout is valid only when the application writes successfully through &lt;code&gt;&amp;#x3C;cluster&gt;-rw&lt;/code&gt; after primary deletion, restores into a temporary namespace from base backup plus WAL, and passes an application-level read check against the restored database.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, choose one non-critical service and run the checklist: create a three-instance CNPG cluster, wire credentials through External Secrets, archive WAL to object storage, add Prometheus alerts, enforce namespace quota and owner labels, delete the primary pod, restore into a temporary namespace, and document the recovery command sequence in the repository.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The mature version of Postgres-on-Kubernetes is not bravado about running stateful workloads; it is the discipline to make every small database boring in exactly the same way.&lt;/p&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>GitHub Breakouts: Q1 2025 — The Quarter&apos;s Top Productivity Shifts</title><link>https://rajivonai.com/blog/2025-04-15-github-stars-2025-q1/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-04-15-github-stars-2025-q1/</guid><description>Six high-traction open-source projects from Q1 2025 converged on eliminating the manual integration layer between AI assistants and production systems across databases, platform operations, and developer tooling.</description><pubDate>Tue, 15 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;In Q1 2025, the Model Context Protocol crossed from specification to production ecosystem in 90 days.&lt;/strong&gt; Three separate engineering domains — developer tooling, platform operations, and database access — each shipped MCP-native open-source projects within the same quarter. The shared pattern was not accidental: every project replaced the same manual step, the task of building and maintaining the integration layer between an AI assistant and a live production system. That task had been ad-hoc, fragile, and expensive since AI coding assistants went mainstream. Q1’s breakouts replaced it with a standardized protocol any tool can implement once and reuse everywhere.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Before Q1 2025, connecting an AI assistant to a live production system — a database, a Kubernetes cluster, a private document store — required custom integration code on every tool that wanted to surface that context. There was no standard handshake. Engineers pasted schemas by hand, wrote bespoke prompt-stuffing scripts, or ran unsandboxed tool servers as bare processes with no access control. MCP was an emerging specification, but the ecosystem around it was sparse. Six high-traction open-source projects launched within the same 90-day window and each treated MCP as the assumed integration primitive rather than something to be argued about.&lt;/p&gt;
&lt;h3 id=&quot;quarter-at-a-glance&quot;&gt;Quarter at a Glance&lt;/h3&gt;















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Repository&lt;/th&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Eliminated Manual Task&lt;/th&gt;&lt;th&gt;Stars&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;upstash/context7&lt;/td&gt;&lt;td&gt;System Design&lt;/td&gt;&lt;td&gt;Manually pasting library docs into AI prompts&lt;/td&gt;&lt;td&gt;55,958&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;humanlayer/12-factor-agents&lt;/td&gt;&lt;td&gt;System Design&lt;/td&gt;&lt;td&gt;Building agents without production design principles&lt;/td&gt;&lt;td&gt;21,923&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GoogleCloudPlatform/kubectl-ai&lt;/td&gt;&lt;td&gt;Platform Engineering&lt;/td&gt;&lt;td&gt;Writing kubectl commands and YAML manifests from memory&lt;/td&gt;&lt;td&gt;7,470&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;stacklok/toolhive&lt;/td&gt;&lt;td&gt;Platform Engineering&lt;/td&gt;&lt;td&gt;Running and governing MCP server processes manually&lt;/td&gt;&lt;td&gt;1,818&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bytebase/dbhub&lt;/td&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;Setting up SQL context for AI agents by hand&lt;/td&gt;&lt;td&gt;2,819&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;zilliztech/deep-searcher&lt;/td&gt;&lt;td&gt;Databases — Data Infra&lt;/td&gt;&lt;td&gt;Building custom RAG pipelines for private data research&lt;/td&gt;&lt;td&gt;7,841&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Manual bottleneck&lt;/th&gt;&lt;th&gt;Engineering cost&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;System Design&lt;/td&gt;&lt;td&gt;Copy-paste library docs into every AI chat session before writing code&lt;/td&gt;&lt;td&gt;Every session started with 10–20 minutes of context assembly&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;System Design&lt;/td&gt;&lt;td&gt;No established patterns for production agent design; each team reinvented scaffolding&lt;/td&gt;&lt;td&gt;Agents that passed evals failed in production due to brittle control flow&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform Engineering&lt;/td&gt;&lt;td&gt;kubectl syntax requires full cluster-state awareness; wrong flags corrupt workloads&lt;/td&gt;&lt;td&gt;New engineers caused production incidents on unfamiliar clusters&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform Engineering&lt;/td&gt;&lt;td&gt;Running MCP servers as bare OS processes: no sandboxing, no audit log, no access policy&lt;/td&gt;&lt;td&gt;Any compromised MCP server had unrestricted access to all connected tools&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;AI agents querying databases required manual schema exports and prompt-stuffing scripts&lt;/td&gt;&lt;td&gt;Schema context drifted; agents generated SQL for tables that had been migrated&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases — Data Infra&lt;/td&gt;&lt;td&gt;Private data research required assembling a custom vector store, embedding model, and LLM chain per project&lt;/td&gt;&lt;td&gt;Weeks of setup before a team could query their own documents&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core question Q1 tried to answer: can a single standardized protocol eliminate these manual integration steps without forcing a complete platform rewrite?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[MCP Integration Layer — Q1 2025] --&gt; B[System Design]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[Platform Engineering]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; D[Databases and Data Infrastructure]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; E[context7 — eliminates doc-pasting into prompts]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; F[12-factor-agents — eliminates ad-hoc agent scaffolding]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; G[kubectl-ai — eliminates manual kubectl syntax lookup]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; H[toolhive — eliminates bare MCP process management]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; I[dbhub — eliminates SQL context setup for AI agents]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; J[deep-searcher — eliminates custom RAG pipeline construction]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;system-design--architecture&quot;&gt;System Design — Architecture&lt;/h3&gt;
&lt;h4 id=&quot;context7--eliminates-manually-pasting-library-documentation-into-ai-prompts&quot;&gt;context7 — eliminates manually pasting library documentation into AI prompts&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Before — the manual workflow&lt;/strong&gt;: Every AI coding session that involved a third-party library started with the same setup tax: locate the right version of the docs, copy the relevant sections, paste them into the chat window before asking anything.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: manually assembling docs context before each coding session&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# 1. Open nextjs.org/docs/app/api-reference/functions/use-router&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# 2. Copy 300 lines of API reference&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# 3. Paste into chat before every session&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# 4. Repeat for every library in the project&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;After — with context7&lt;/strong&gt;: According to the project README, adding “use context7” to a prompt causes the MCP server to fetch current, version-specific documentation and inject it into the context automatically.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;txt&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;# After: ask the model directly, docs fetched automatically&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Create a Next.js middleware that checks for a valid JWT in cookies&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;and redirects unauthenticated users to /login. use context7&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The productivity delta&lt;/strong&gt;: According to the project README, context7 places “up-to-date, version-specific documentation and code examples straight from the source… directly into your prompt,” eliminating the manual doc-assembly step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: context7 is an MCP server that indexes documentation from open-source libraries. When a prompt includes “use context7,” the MCP client calls the server, which retrieves the relevant documentation and injects it directly into the model’s context before the response is generated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: context7 only covers libraries indexed in its public database. Proprietary internal libraries and private APIs are not available. Teams working primarily with internal tooling will not benefit until they run a self-hosted instance with custom sources.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id=&quot;humanlayer12-factor-agents--eliminates-ad-hoc-agent-scaffolding-without-production-design-principles&quot;&gt;humanlayer/12-factor-agents — eliminates ad-hoc agent scaffolding without production design principles&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Before — the manual workflow&lt;/strong&gt;: The dominant pattern for agent development in early 2025 was “system prompt + bag of tools + loop.” This worked in demos but collapsed under production load: state leaked across turns, retry logic was inconsistent, and human intervention had no defined hook.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: the &quot;bag of tools + loop&quot; pattern that fails at production boundary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;agent &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; LLMAgent(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;    system_prompt&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;prompt,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;    tools&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;[search, query_db, send_email],&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;    max_iterations&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;agent.run(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;resolve incident #4421&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;After — with 12-factor-agents&lt;/strong&gt;: The project documents 12 production principles for agent design, in the spirit of the original 12-Factor App. Factors include owning the context window explicitly (Factor 3), treating tools as structured outputs (Factor 4), and building human-in-the-loop checkpoints as first-class tool calls (Factor 7).&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: structured state machine with explicit context ownership&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Factor 3: Own Your Context Window — manage what the model sees&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Factor 4: Tools Are Just Structured Outputs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Factor 7: Contact Humans With Tool Calls&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;class&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; IncidentAgent&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    def&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; __init__&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(self):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;        self&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.context &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ContextManager(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;max_tokens&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;4000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    def&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; step&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(self, state: AgentState) -&gt; AgentState:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;        # Deterministic routing; LLM invoked only at decision points&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;        ...&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The productivity delta&lt;/strong&gt;: According to the project documentation, 12-factor-agents eliminates the need for each team to independently discover why their “prompt + loop” agent fails in production by providing principles grounded in observed failure modes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: The project is a documented set of principles and patterns, not a runtime framework. Each factor addresses a specific production failure mode. The README describes the author’s observation that most production agents “are mostly deterministic code, with LLM steps sprinkled in at just the right points.”&lt;/p&gt;
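&lt;p&gt;A runnable illustration of “mostly deterministic code, with LLM steps sprinkled in” might look like the sketch below. It is not taken from the project: the state shape, the tool names, and the &lt;code&gt;decide&lt;/code&gt; callable (a stand-in for a structured-output model call) are all assumptions made for the example.&lt;/p&gt;

```python
def run_incident(state, decide, max_steps=10):
    """Deterministic control loop; decide() is the only point where a model would run."""
    for _ in range(max_steps):
        if state['status'] == 'resolved':
            return state
        if state['status'] == 'needs_human':
            # Pause here; a human reply would update the state and resume the run.
            return state
        action = decide(state)  # returns a dict such as {'tool': ..., 'args': ...}
        if action['tool'] == 'ask_human':
            state = dict(state, status='needs_human', question=action['args'])
        elif action['tool'] == 'mark_resolved':
            state = dict(state, status='resolved')
        else:
            # Record the chosen tool call; actually executing it is omitted here.
            state = dict(state, history=state['history'] + [action])
    return state
```

&lt;p&gt;Because routing, the step limit, and the human checkpoint live in ordinary code, they can be unit-tested with a fake &lt;code&gt;decide&lt;/code&gt; function and no model in the loop.&lt;/p&gt;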
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: The project provides principles, not an opinionated runtime. Teams that need battle-tested orchestration with built-in state persistence, retries, and observability still need to implement those pieces themselves or choose a framework that does not contradict the factors.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;platform-engineering&quot;&gt;Platform Engineering&lt;/h3&gt;
&lt;h4 id=&quot;googlecloudplatformkubectl-ai--eliminates-manual-kubectl-syntax-lookup-and-yaml-authoring&quot;&gt;GoogleCloudPlatform/kubectl-ai — eliminates manual kubectl syntax lookup and YAML authoring&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Before — the manual workflow&lt;/strong&gt;: Every Kubernetes troubleshooting session required knowing or looking up the correct combination of kubectl subcommands, flags, and namespace arguments. A five-step debug session routinely involved eight or more separate commands with cluster-specific values.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: multi-step debugging requiring exact kubectl syntax&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pods&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; production&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; describe&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pod&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; my-app-7d9f8b5c4-xk2pv&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; production&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; logs&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; my-app-7d9f8b5c4-xk2pv&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; production&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --previous&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; events&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; production&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --sort-by=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;.lastTimestamp&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; top&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -n&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; production&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;After — with kubectl-ai&lt;/strong&gt;: According to the README, kubectl-ai translates natural language intent into precise Kubernetes operations. It also supports MCP server mode, so it can be called from any MCP-compatible AI assistant.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: natural language to kubectl&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;curl&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -sSL&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; https://raw.githubusercontent.com/GoogleCloudPlatform/kubectl-ai/main/install.sh&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; bash&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl-ai&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;how&apos;s nginx app doing in my cluster&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Or via krew&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; krew&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; install&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ai&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;kubectl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ai&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;show me pods with high memory usage in production&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The productivity delta&lt;/strong&gt;: According to the README, kubectl-ai serves as an “intelligent interface, translating user intent into precise Kubernetes operations, making Kubernetes management more accessible and efficient.”&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: kubectl-ai uses configurable LLM backends (Gemini, OpenAI, Vertex AI, Ollama) to translate natural language queries into kubectl operations. MCP server mode means kubectl-ai can be integrated into a broader AI toolchain rather than used only as a standalone CLI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: kubectl-ai executes operations against a live cluster. An ambiguous prompt — “clean up old pods” — could affect unintended namespaces. The README does not document a dry-run mode as of Q1 2025; treat it as a command generator to review before running, not an autonomous operator.&lt;/p&gt;
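&lt;p&gt;One way to act on the “review before running” advice is a small pre-flight check on any generated command. The sketch below is not part of kubectl-ai. &lt;code&gt;review_command&lt;/code&gt; and its rules are hypothetical, and it only inspects the command string, flagging mutating verbs and missing namespace scoping.&lt;/p&gt;

```python
import shlex

# Real kubectl subcommands that change cluster state.
MUTATING_VERBS = {"apply", "delete", "drain", "edit", "patch", "replace", "scale"}


def review_command(command: str) -> list[str]:
    """Return the reasons a generated kubectl command needs human review."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "kubectl":
        return ["not a kubectl command"]
    findings = []
    # Conservative match: any token equal to a mutating verb triggers review.
    verb = next((t for t in tokens[1:] if t in MUTATING_VERBS), None)
    if verb:
        findings.append(f"mutating verb: {verb}")
    if "-A" in tokens or "--all-namespaces" in tokens:
        findings.append("targets all namespaces")
    elif not any(t in ("-n", "--namespace") or t.startswith("--namespace=") for t in tokens):
        findings.append("no explicit namespace")
    return findings


for cmd in ("kubectl get pods -n production", "kubectl delete pods --all"):
    print(cmd, "=>", review_command(cmd) or "ok")
```

&lt;p&gt;An empty list means the command is read-only and explicitly scoped. Anything else is a prompt for a human to look before it runs.&lt;/p&gt;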
&lt;hr&gt;
&lt;h4 id=&quot;stackloktoolhive--eliminates-bare-mcp-server-process-management&quot;&gt;stacklok/toolhive — eliminates bare MCP server process management&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Before — the manual workflow&lt;/strong&gt;: Running MCP servers before toolhive meant starting them as bare OS processes — no container isolation, no access control, no audit trail.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: MCP servers as unmanaged background processes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;node&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /usr/local/bin/mcp-server-filesystem&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /data&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; &amp;#x26;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;uvx&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mcp-server-postgres&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; postgresql://localhost/mydb&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; &amp;#x26;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# No sandboxing; any compromised server reaches all connected tools&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# No visibility into which tools were called or by whom&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;After — with toolhive&lt;/strong&gt;: According to the README, toolhive wraps every MCP server in an isolated container and enforces access policy per request.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: containerized, permission-controlled MCP server lifecycle&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;thv&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; run&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --name&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; postgres-db&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ghcr.io/modelcontextprotocol/server-postgres&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;thv&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; list&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;        # shows running servers with status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;thv&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; stop&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; postgres-db&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The productivity delta&lt;/strong&gt;: According to the project README, toolhive’s semantic tool search “reduce[s] your token usage by up to 85%.” The isolation model eliminates the problem of a bare MCP process reaching credentials it was never intended to access.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: toolhive runs each MCP server in a container with a minimal permission file. It includes a Kubernetes operator for teams running MCP infrastructure at cluster scale, emits OpenTelemetry traces, and integrates with external identity providers for per-request authorization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: toolhive’s security guarantees depend on the quality of each server’s permission file. A server published with an overly permissive file passes toolhive’s enforcement layer unchanged. Review permission files for every public MCP server before deploying via toolhive.&lt;/p&gt;
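&lt;p&gt;The review step can be partly automated. The sketch below assumes a permission profile shaped as a dictionary with &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, and &lt;code&gt;network.outbound&lt;/code&gt; keys. That shape is an assumption for illustration, so check the toolhive documentation for the real schema before relying on any key name. &lt;code&gt;audit_profile&lt;/code&gt; is hypothetical and only flags obviously broad grants.&lt;/p&gt;

```python
# Paths broad enough that granting them deserves a second look (illustrative list).
BROAD_PATHS = {"/etc", "/home", "/root", "/var"}


def audit_profile(profile: dict) -> list[str]:
    """Return findings that justify a manual review before deployment."""
    findings = []
    for mode in ("read", "write"):
        for path in profile.get(mode, []):
            if path == "/" or path.rstrip("/") in BROAD_PATHS:
                findings.append(f"{mode} access to broad path {path}")
    # Assumed key names; the real profile schema may differ.
    outbound = profile.get("network", {}).get("outbound", {})
    if outbound.get("insecure_allow_all", False):
        findings.append("outbound network access is unrestricted")
    return findings


profile = {
    "read": ["/data/reports"],
    "write": ["/"],
    "network": {"outbound": {"insecure_allow_all": True, "allow_port": [443]}},
}
for finding in audit_profile(profile):
    print(finding)
```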
&lt;hr&gt;
&lt;h3 id=&quot;databases--data-infrastructure&quot;&gt;Databases — Data Infrastructure&lt;/h3&gt;
&lt;h4 id=&quot;bytebasedbhub--eliminates-manual-sql-context-setup-for-ai-database-queries&quot;&gt;bytebase/dbhub — eliminates manual SQL context setup for AI database queries&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Before — the manual workflow&lt;/strong&gt;: Giving an AI assistant accurate context about a production database required exporting schema definitions, pasting table structures into the system prompt, and repeating the process after every schema migration.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: manual schema context assembly for AI-assisted SQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;\d+ users&quot;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /tmp/schema.txt&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;\d+ orders&quot;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /tmp/schema.txt&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Paste contents into AI assistant system prompt&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Repeat after every schema migration&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;After — with dbhub&lt;/strong&gt;: According to the README, dbhub is a zero-dependency MCP server that connects AI clients directly to live databases using just two MCP tools.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;json&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  &quot;mcpServers&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    &quot;dbhub-postgres&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;      &quot;command&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;npx&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;      &quot;args&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;-y&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;@bytebase/dbhub&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;               &quot;--transport&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;stdio&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;               &quot;--dsn&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;postgres://user:pass@localhost:5432/mydb&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The productivity delta&lt;/strong&gt;: According to the README, dbhub uses “just two MCP tools to maximize context window” — &lt;code&gt;execute_sql&lt;/code&gt; and &lt;code&gt;search_objects&lt;/code&gt; — replacing static schema exports with live introspection against the actual database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: dbhub acts as a gateway between any MCP-compatible AI client and a multi-database backend (PostgreSQL, MySQL, MariaDB, SQL Server, SQLite). The &lt;code&gt;search_objects&lt;/code&gt; tool performs progressive schema discovery, returning only the tables and columns relevant to the current query. Read-only mode, row limits, and query timeouts are configurable.&lt;/p&gt;
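&lt;p&gt;The idea behind progressive schema discovery can be sketched with the standard library alone. The example below is not dbhub’s implementation. &lt;code&gt;find_relevant_tables&lt;/code&gt; is a hypothetical helper that uses an in-memory SQLite database and returns only the tables whose name or columns match a keyword, instead of dumping the whole schema into the prompt.&lt;/p&gt;

```python
import sqlite3


def find_relevant_tables(conn: sqlite3.Connection, keyword: str) -> dict[str, list[str]]:
    """Return only the tables whose name or column names contain the keyword."""
    matches = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if keyword in table or any(keyword in col for col in columns):
            matches[table] = columns
    return matches


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE TABLE audit_log (id INTEGER PRIMARY KEY, message TEXT);
    """
)
print(find_relevant_tables(conn, "user"))
```

&lt;p&gt;Here &lt;code&gt;audit_log&lt;/code&gt; never reaches the model, which is the context-window saving the two-tool design is aiming for.&lt;/p&gt;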
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: Read-only mode requires explicit opt-in via &lt;code&gt;--read-only&lt;/code&gt;. The README positions dbhub as “local development first” — high-concurrency agent workloads and connection pool exhaustion in production are not addressed in the current documentation.&lt;/p&gt;
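&lt;p&gt;Whatever the gateway’s own settings, it helps to enforce read-only access at the database connection too, so a generated write fails even if a flag is forgotten. The sketch below shows the principle with SQLite’s read-only URI mode as a stand-in. It is not dbhub code, and the table and file names are made up for the example.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Seed the database through a normal read-write connection.
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
writer.execute("INSERT INTO users (email) VALUES ('a@example.com')")
writer.commit()
writer.close()

# Hand the agent a connection opened in read-only mode via an SQLite URI.
reader = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(reader.execute("SELECT email FROM users").fetchall())

try:
    reader.execute("DELETE FROM users")
    write_blocked = False
except sqlite3.OperationalError as exc:
    # SQLite refuses the write regardless of what SQL the model generated.
    write_blocked = True
    print("write rejected:", exc)
```

&lt;p&gt;On a server database the same role is played by credentials that carry only read privileges, or by pointing the DSN at a read replica.&lt;/p&gt;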
&lt;hr&gt;
&lt;h4 id=&quot;zilliztechdeep-searcher--eliminates-custom-rag-pipeline-construction-for-private-data&quot;&gt;zilliztech/deep-searcher — eliminates custom RAG pipeline construction for private data&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Before — the manual workflow&lt;/strong&gt;: Every team that needed AI-assisted research against private data assembled a retrieval pipeline from scratch: chunking, embedding, vector store setup, retrieval logic, LLM integration.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: assembling a RAG pipeline manually (documents and llm are defined elsewhere)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; langchain.chains &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; RetrievalQA&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; langchain.vectorstores &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Milvus&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; langchain.embeddings &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; OpenAIEmbeddings&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;embeddings &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; OpenAIEmbeddings()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;vectorstore &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Milvus.from_documents(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    documents, embeddings,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;    connection_args&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;{&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;host&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;port&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;19530&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;retriever &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; vectorstore.as_retriever(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;search_kwargs&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;{&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;k&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;})&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;qa_chain &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; RetrievalQA.from_chain_type(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;llm&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;llm, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;retriever&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;retriever)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;After — with deep-searcher&lt;/strong&gt;: According to the README, deep-searcher combines LLMs and vector databases into a single search-and-reasoning pipeline for private data.&lt;/p&gt;
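&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: private data research with deep-searcher (adapted from the README quickstart; verify names against the current README)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; deepsearcher.configuration &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Configuration, init_config&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; deepsearcher.offline_loading &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; load_from_local_files&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; deepsearcher.online_query &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;config &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; Configuration()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;config.set_provider_config(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;embedding&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;OpenAIEmbedding&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, {&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;text-embedding-ada-002&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;})&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;config.set_provider_config(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;llm&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;DeepSeek&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, {&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;deepseek-reasoner&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;})&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;init_config(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;config&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;config)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;load_from_local_files(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;paths_or_directory&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;./support_tickets&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  # hypothetical local path&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;result &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;What are the top support ticket categories this quarter?&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;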
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The productivity delta&lt;/strong&gt;: According to the README, deep-searcher “maximizes the utilization of enterprise internal data while ensuring data security” and supports flexible embedding models and multiple LLMs, eliminating the per-project setup cost of assembling a compatible RAG stack.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: deep-searcher combines a vector database backend (Milvus or Zilliz Cloud), a configurable embedding model, and a reasoning LLM into a single query interface. The tool partitions data by source for efficient retrieval and supports multi-step reasoning over search results.&lt;/p&gt;
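&lt;p&gt;The partition-by-source idea can be shown without any vector database. The sketch below covers only the scoped-retrieval step, not the multi-step reasoning, and it is not deep-searcher code. The bag-of-words &lt;code&gt;embed&lt;/code&gt; function is a toy replacement for a real embedding model, and the partition names and chunks are invented.&lt;/p&gt;

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Chunks are grouped by source so a query can be limited to relevant partitions.
PARTITIONS = {
    "support_tickets": ["login fails after password reset", "refund requested for duplicate charge"],
    "runbooks": ["restart the replica after failover", "rotate database credentials quarterly"],
}


def search(query: str, sources: list[str], k: int = 2) -> list[tuple[float, str, str]]:
    """Score only the chunks in the requested partitions and return the top k."""
    q = embed(query)
    scored = [
        (cosine(q, embed(chunk)), source, chunk)
        for source in sources
        for chunk in PARTITIONS[source]
    ]
    return sorted(scored, reverse=True)[:k]


top = search("duplicate charge refund", ["support_tickets"], k=1)
print(top)
```

&lt;p&gt;Restricting &lt;code&gt;sources&lt;/code&gt; keeps unrelated partitions out of both the similarity search and the prompt that follows it.&lt;/p&gt;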
&lt;p&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: deep-searcher requires Milvus or Zilliz Cloud as the vector backend. Teams invested in pgvector, Qdrant, or Weaviate will need to run a second system or fork the provider layer. The README documents web crawling for hybrid private/public research as “under development” — as of Q1 2025 it is private-data-only.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;upstash/context7&lt;/strong&gt;: The “use context7” prompt trigger and automatic documentation injection are described in the project README. The claim that it eliminates manual doc-pasting is inferred from the documented workflow. Production adoption at scale has not been personally verified.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;humanlayer/12-factor-agents&lt;/strong&gt;: All 12 factors are documented in the repository. The author’s observation that “most of the products billing themselves as AI Agents are mostly deterministic code, with LLM steps sprinkled in at just the right points” is a direct quote from the README. Code examples are derived from the documented patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GoogleCloudPlatform/kubectl-ai&lt;/strong&gt;: Installation commands and the natural language query example are sourced directly from the README. MCP server mode support is listed in the README’s table of contents. Dry-run behavior is not documented in the README as of Q1 2025.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;stacklok/toolhive&lt;/strong&gt;: Container isolation, per-request access policy, and the Kubernetes operator are described in the README. The “up to 85% token reduction” figure is a verbatim quote from the README. Enterprise and Kubernetes operator features reference linked documentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;bytebase/dbhub&lt;/strong&gt;: The two-tool MCP architecture, JSON config format, and “local development first” positioning are documented in the README. The default write-enabled behavior is inferred from the README’s explicit mention of read-only mode as a configurable option rather than the default.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;zilliztech/deep-searcher&lt;/strong&gt;: Installation via pip, configuration API, and query interface are documented in the README. The web crawling “under development” note and Milvus dependency are stated in the README’s features and quickstart sections.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;productivity-scorecard&quot;&gt;Productivity Scorecard&lt;/h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Tool&lt;/th&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Task Eliminated&lt;/th&gt;&lt;th&gt;Documented Impact&lt;/th&gt;&lt;th&gt;Key Caveat&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;upstash/context7&lt;/td&gt;&lt;td&gt;System Design&lt;/td&gt;&lt;td&gt;Manual doc-pasting per AI session&lt;/td&gt;&lt;td&gt;“Up-to-date, version-specific documentation… placed directly into your prompt” (README)&lt;/td&gt;&lt;td&gt;Public libraries only; internal APIs require self-hosting&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;humanlayer/12-factor-agents&lt;/td&gt;&lt;td&gt;System Design&lt;/td&gt;&lt;td&gt;Ad-hoc production agent design&lt;/td&gt;&lt;td&gt;12 principles derived from observed production failure modes (README)&lt;/td&gt;&lt;td&gt;Principles only; no opinionated runtime&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GoogleCloudPlatform/kubectl-ai&lt;/td&gt;&lt;td&gt;Platform Engineering&lt;/td&gt;&lt;td&gt;kubectl syntax lookup and YAML authoring&lt;/td&gt;&lt;td&gt;“Translating user intent into precise Kubernetes operations” (README)&lt;/td&gt;&lt;td&gt;No documented dry-run mode as of Q1 2025&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;stacklok/toolhive&lt;/td&gt;&lt;td&gt;Platform Engineering&lt;/td&gt;&lt;td&gt;Bare MCP process management&lt;/td&gt;&lt;td&gt;“Reduce your token usage by up to 85%” via semantic tool search (README)&lt;/td&gt;&lt;td&gt;Security depends on per-server permission file quality&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bytebase/dbhub&lt;/td&gt;&lt;td&gt;Databases / Data Infra&lt;/td&gt;&lt;td&gt;Manual schema context assembly&lt;/td&gt;&lt;td&gt;“Zero dependency, token efficient with just two MCP tools to maximize context window” (README)&lt;/td&gt;&lt;td&gt;Read-only mode requires explicit opt-in&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;zilliztech/deep-searcher&lt;/td&gt;&lt;td&gt;Databases / Data Infra&lt;/td&gt;&lt;td&gt;Custom RAG pipeline construction&lt;/td&gt;&lt;td&gt;“Maximizes the utilization of enterprise internal data” with flexible LLM and embedding configs (README)&lt;/td&gt;&lt;td&gt;Milvus or Zilliz Cloud required; web crawling incomplete&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;context7 returns stale docs&lt;/td&gt;&lt;td&gt;Library version is newer than the last index crawl&lt;/td&gt;&lt;td&gt;Pin the library version in the prompt; verify the doc version context7 injected before trusting generated code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;kubectl-ai executes against the wrong namespace&lt;/td&gt;&lt;td&gt;Natural language query is ambiguous about scope&lt;/td&gt;&lt;td&gt;Specify namespace explicitly in every prompt; treat output as a command to review before running&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;toolhive-managed server runs with excessive permissions&lt;/td&gt;&lt;td&gt;Third-party MCP server published with an overly permissive permission file&lt;/td&gt;&lt;td&gt;Review permission files for every public MCP server before deploying&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;dbhub agent writes to production&lt;/td&gt;&lt;td&gt;Read-only mode not configured; AI client generates a write operation&lt;/td&gt;&lt;td&gt;Pass &lt;code&gt;--read-only&lt;/code&gt; on every production DBHub deployment; use a read replica DSN&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;deep-searcher misses updated documents&lt;/td&gt;&lt;td&gt;Content changed after initial indexing; no automatic re-ingestion&lt;/td&gt;&lt;td&gt;Re-ingest documents on a schedule; incremental indexing is not documented as of Q1 2025&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;12-factor principles conflict with chosen framework&lt;/td&gt;&lt;td&gt;Framework accumulates context automatically, violating Factor 3&lt;/td&gt;&lt;td&gt;Audit framework context management behavior before layering 12-factor principles on top&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;context7 and dbhub context overflow&lt;/td&gt;&lt;td&gt;Both inject large context blocks simultaneously; combined usage exceeds model limits&lt;/td&gt;&lt;td&gt;Use dbhub’s &lt;code&gt;search_objects&lt;/code&gt; for targeted schema discovery; limit context7 to the specific library sections needed&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
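&lt;p&gt;The last row of the table can be guarded against in client code. The sketch below is an illustration under stated assumptions. &lt;code&gt;estimate_tokens&lt;/code&gt; uses a rough four-characters-per-token heuristic rather than a real tokenizer, and &lt;code&gt;fit_to_budget&lt;/code&gt; is a hypothetical helper that keeps context blocks in priority order until the budget is spent.&lt;/p&gt;

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic of about four characters per token; not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(blocks: dict[str, str], budget: int) -> dict[str, str]:
    """Keep context blocks in insertion (priority) order until the budget is spent."""
    kept = {}
    remaining = budget
    for name, text in blocks.items():
        cost = estimate_tokens(text)
        if cost > remaining:
            continue  # skip any block that would overflow the budget
        kept[name] = text
        remaining -= cost
    return kept


blocks = {
    "schema": "orders(id, user_id, total)" * 10,    # targeted schema slice
    "library_docs": "x" * 4000,                     # oversized documentation dump
    "question": "Which users have unpaid orders?",  # the actual user request
}
print(list(fit_to_budget(blocks, budget=200)))
```

&lt;p&gt;With a 200-token budget the oversized documentation block is dropped while the schema slice and the question survive, which is the trade the table recommends.&lt;/p&gt;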
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: The manual integration layer between AI assistants and live production systems (schema exports, doc-pasting, kubectl syntax lookups, and custom RAG pipelines) still costs engineering teams hours per week even after they adopt AI coding tools, because each of those integrations has typically been hand-built and maintained per team.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: dbhub for database context (exposes live schemas directly to AI clients without manual export), kubectl-ai for cluster operations (translates natural language to kubectl), and context7 for development documentation (injects version-correct docs automatically) — each targeting the highest-frequency manual integration step in its domain.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: For context7, the signal is a coding session where the model produces correct API usage for a library you did not manually document in the prompt. For dbhub, the signal is an AI-generated SQL query that correctly references current table and column names without a preceding schema export step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Install dbhub this week against a non-production database — &lt;code&gt;npx @bytebase/dbhub --transport stdio --dsn &amp;#x3C;your-connection-string&gt; --read-only&lt;/code&gt; — configure it in Claude Desktop or your MCP client, then ask the model to describe your schema. If it answers correctly without a prior schema paste, the integration is working.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>ai-engineering</category><category>architecture</category><category>databases</category><category>cloud</category></item><item><title>Python Automation Framework for DB and Cloud Ops: Architecture and Failure Model</title><link>https://rajivonai.com/blog/2025-04-08-python-automation-framework-for-db-and-cloud-ops-architecture-and-failure-model/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-04-08-python-automation-framework-for-db-and-cloud-ops-architecture-and-failure-model/</guid><description>DB and cloud automation fails when partial failures leave the database, cloud account, and ticketing system describing different operation states.</description><pubDate>Tue, 08 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Automation does not fail because a script exits nonzero; it fails when nobody can tell whether the database, cloud account, ticket, pipeline, and operator are describing the same operation.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Python has become the default control language for internal infrastructure automation. It is expressive enough for database maintenance, cloud provisioning, CI orchestration, secret rotation, inventory reconciliation, and operational reporting. It has mature SDKs for PostgreSQL, MySQL, AWS, GCP, Azure, Kubernetes, GitHub, and ticketing systems. It also has a low-ceremony path from “one script that fixes today” to “the platform workflow everyone now depends on.”&lt;/p&gt;
&lt;p&gt;That is the trap.&lt;/p&gt;
&lt;p&gt;A database and cloud operations framework is not just a directory of scripts. It is a control plane with side effects. It opens connections, mutates state, emits audit trails, retries partial work, and coordinates with systems that have their own consistency models. The framework is responsible for deciding what should happen, proving what actually happened, and making recovery boring when the two diverge.&lt;/p&gt;
&lt;p&gt;The architecture question is therefore not “how do we organize Python files?” It is “how do we design an automation system whose failure modes are explicit enough that operators can trust it during incidents?”&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Most internal automation begins as imperative glue:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;python&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; resize_cluster.py&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --env&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; prod&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --cluster&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; analytics&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;python&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rotate_password.py&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --database&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; billing&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;python&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rebuild_replica.py&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --region&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; us-east-1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works until the workflow crosses a reliability boundary. A cloud API accepts the request but the resource remains pending. A database migration succeeds on the primary but the status update fails. A CI job retries the same step while the original operation is still running. A script times out after creating an IAM role but before attaching the policy. A human reruns the command because the output is ambiguous.&lt;/p&gt;
&lt;p&gt;The failure is not Python. The failure is that the automation has no durable model of intent, progress, ownership, or reconciliation.&lt;/p&gt;
&lt;p&gt;Database and cloud operations are especially unforgiving because the systems being automated are already distributed. PostgreSQL may accept a transaction while a downstream notification fails. AWS APIs may return before eventual consistency has converged. Kubernetes may reconcile a desired object long after the client exits. CI systems may retry a job without understanding whether the remote side effect was idempotent.&lt;/p&gt;
&lt;p&gt;A framework that treats these as ordinary function calls will eventually produce duplicate resources, orphaned credentials, blocked schema changes, broken replicas, or silent drift.&lt;/p&gt;
&lt;p&gt;The core question is: how should a Python automation framework be structured so that every workflow has a durable intent record, bounded side effects, safe retries, and an operator-readable recovery path?&lt;/p&gt;
&lt;h2 id=&quot;core-concept-build-a-workflow-control-plane&quot;&gt;Core Concept: Build a Workflow Control Plane&lt;/h2&gt;
&lt;p&gt;The right architecture separates command intake from execution, execution from reconciliation, and reconciliation from reporting. Python remains the implementation language, but the system behaves like a small control plane.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[operator request — typed command] --&gt; B[workflow registry — policy and schema]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[intent store — durable operation record]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D[executor — bounded side effects]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[resource adapters — database and cloud APIs]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; F[observed state — inventory and probes]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; G[reconciler — compare desired and actual]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; C&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; H[audit stream — logs metrics events]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  H --&gt; I[operator console — status and recovery]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The framework has six core parts.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;workflow registry&lt;/strong&gt; defines every supported operation as a typed contract: inputs, authorization rules, preflight checks, execution steps, rollback posture, retry policy, timeout budget, and required evidence. This prevents production automation from becoming arbitrary code execution with good intentions.&lt;/p&gt;
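&lt;p&gt;A minimal sketch of what one registry entry could look like, assuming a dataclass-based contract. The names &lt;code&gt;WorkflowContract&lt;/code&gt;, &lt;code&gt;REGISTRY&lt;/code&gt;, and &lt;code&gt;validate_request&lt;/code&gt; and the field set are illustrative assumptions, not a prescribed schema; a real contract would also carry preflight checks, execution steps, and required evidence.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowContract:
    """Hypothetical typed contract for one supported operation."""
    name: str
    required_inputs: frozenset
    allowed_roles: frozenset
    max_attempts: int
    timeout_seconds: int
    rollback_posture: str  # e.g. "transactional", "compensate", or "manual"


REGISTRY = {}


def register(contract):
    """Add a contract; refuse silent redefinition of an existing workflow."""
    if contract.name in REGISTRY:
        raise ValueError(f"workflow already registered: {contract.name}")
    REGISTRY[contract.name] = contract


def validate_request(name, params, role):
    """Return the contract if the request is supported, authorized, and complete."""
    contract = REGISTRY.get(name)
    if contract is None:
        raise LookupError(f"unsupported workflow: {name}")
    if role not in contract.allowed_roles:
        raise PermissionError(f"role {role!r} may not run {name}")
    missing = sorted(contract.required_inputs - set(params))
    if missing:
        raise ValueError(f"missing inputs: {', '.join(missing)}")
    return contract


# Example registration; the values are placeholders, not recommendations.
register(WorkflowContract(
    name="create_read_replica",
    required_inputs=frozenset({"cluster", "region"}),
    allowed_roles=frozenset({"dba"}),
    max_attempts=3,
    timeout_seconds=1800,
    rollback_posture="compensate",
))
```

&lt;p&gt;Anything that is not in the registry is rejected before an intent record is written, which is what keeps the framework from drifting into arbitrary code execution.&lt;/p&gt;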
&lt;p&gt;The &lt;strong&gt;intent store&lt;/strong&gt; records the requested operation before side effects begin. It should contain workflow name, parameters, requester, approval state, idempotency key, current phase, timestamps, attempt count, and external resource identifiers discovered during execution. A relational database is usually sufficient. The important property is not exotic storage; it is that intent survives process death.&lt;/p&gt;
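&lt;p&gt;As a rough illustration, the intent record can be a single table with a unique idempotency key. The sketch below uses SQLite from the standard library as a stand-in for whatever relational database the team already runs; the table layout and the helper names &lt;code&gt;open_store&lt;/code&gt;, &lt;code&gt;record_intent&lt;/code&gt;, and &lt;code&gt;checkpoint&lt;/code&gt; are assumptions made for this example.&lt;/p&gt;

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS intents (
    id INTEGER PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    workflow TEXT NOT NULL,
    params_json TEXT NOT NULL,
    requester TEXT NOT NULL,
    phase TEXT NOT NULL DEFAULT 'pending',
    attempt_count INTEGER NOT NULL DEFAULT 0,
    external_ids_json TEXT NOT NULL DEFAULT '{}',
    created_at REAL NOT NULL,
    updated_at REAL NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open the store and make sure the intents table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_intent(conn, key, workflow, params, requester):
    """Write the intent before any side effect and return its id.

    Reusing an idempotency key returns the existing record instead of
    creating a second one, so a rerun cannot fork the operation.
    """
    now = time.time()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO intents "
            "(idempotency_key, workflow, params_json, requester, created_at, updated_at) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (key, workflow, json.dumps(params, sort_keys=True), requester, now, now),
        )
    row = conn.execute(
        "SELECT id FROM intents WHERE idempotency_key = ?", (key,)
    ).fetchone()
    return row[0]


def checkpoint(conn, intent_id, phase, external_ids):
    """Persist the current phase and any external identifiers discovered so far."""
    with conn:
        conn.execute(
            "UPDATE intents SET phase = ?, external_ids_json = ?, updated_at = ? "
            "WHERE id = ?",
            (phase, json.dumps(external_ids, sort_keys=True), time.time(), intent_id),
        )
```

&lt;p&gt;The unique constraint does the real work here: a CI retry or an operator rerun with the same key lands on the same row.&lt;/p&gt;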
&lt;p&gt;The &lt;strong&gt;executor&lt;/strong&gt; performs bounded units of work. Each step should be small enough to retry or inspect independently. It should write progress after meaningful transitions, not only at the end. Long-running operations should checkpoint external identifiers as soon as they are known.&lt;/p&gt;
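&lt;p&gt;One way to picture bounded, checkpointed execution is a loop that persists progress around every step and skips steps that already finished. This is a sketch under assumptions: &lt;code&gt;run_steps&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;save&lt;/code&gt; stands for whatever durably writes state back to the intent store.&lt;/p&gt;

```python
def run_steps(steps, state, save):
    """Run named steps in order, persisting progress after each transition.

    steps: list of (name, action) pairs; each action receives the state dict
           and may record identifiers under state["external_ids"].
    state: progress loaded from the intent store (a plain dict in this sketch).
    save:  callable that durably persists the state (assumed to exist).
    """
    done = state.setdefault("completed_steps", [])
    state.setdefault("external_ids", {})
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier attempt; do not repeat the side effect
        state["current_step"] = name
        save(state)  # record that the step is starting
        action(state)
        done.append(name)
        save(state)  # record completion and any identifiers the step discovered
    state["current_step"] = None
    save(state)
    return state
```

&lt;p&gt;If the process dies inside a step, the stored state still names that step as &lt;code&gt;current_step&lt;/code&gt;, which tells the next run where observation has to begin. A long-running step can also call &lt;code&gt;save&lt;/code&gt; itself as soon as it learns an external identifier.&lt;/p&gt;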
&lt;p&gt;The &lt;strong&gt;resource adapters&lt;/strong&gt; isolate system-specific behavior. A PostgreSQL adapter knows how to acquire advisory locks, check replication lag, run migrations in transactions where possible, and classify SQLSTATE errors. A cloud adapter knows which calls are naturally idempotent, which require client tokens, which are eventually consistent, and which need read-after-write verification.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;reconciler&lt;/strong&gt; is the safety mechanism. It compares durable intent with observed state and decides whether the workflow is complete, still converging, retryable, blocked, or unsafe. This is the architectural difference between automation that merely runs and automation that can recover.&lt;/p&gt;
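&lt;p&gt;The five outcomes named above can be sketched as a pure classification function. The record shapes, the &lt;code&gt;owner_intent&lt;/code&gt; tag, and the status vocabulary are assumptions for illustration; real providers report their own states.&lt;/p&gt;

```python
from enum import Enum


class Verdict(Enum):
    COMPLETE = "complete"
    CONVERGING = "converging"
    RETRYABLE = "retryable"
    BLOCKED = "blocked"
    UNSAFE = "unsafe"


PENDING_STATUSES = {"creating", "modifying"}  # assumed provider vocabulary


def reconcile(intent, observed, now):
    """Compare a durable intent record with observed state.

    intent:   dict with id, desired (dict), deadline (epoch seconds),
              attempts, and max_attempts.
    observed: dict with owner_intent, status, and config, or None when
              the probe found no resource.
    """
    attempts_left = intent["max_attempts"] > intent["attempts"]
    if observed is None:
        return Verdict.RETRYABLE if attempts_left else Verdict.BLOCKED
    if observed.get("owner_intent") != intent["id"]:
        return Verdict.UNSAFE  # something exists, but this operation did not create it
    status = observed.get("status")
    if status == "available":
        config = observed.get("config", {})
        matches = all(config.get(k) == v for k, v in intent["desired"].items())
        # A live resource that differs from intent needs a human, not a retry.
        return Verdict.COMPLETE if matches else Verdict.UNSAFE
    if status in PENDING_STATUSES:
        return Verdict.BLOCKED if now > intent["deadline"] else Verdict.CONVERGING
    if status == "failed":
        return Verdict.RETRYABLE if attempts_left else Verdict.BLOCKED
    return Verdict.UNSAFE  # unrecognized status: refuse to guess
```

&lt;p&gt;Keeping the function free of I/O means every branch can be unit tested with recorded observations before it is trusted in an incident.&lt;/p&gt;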
&lt;p&gt;The &lt;strong&gt;audit stream&lt;/strong&gt; produces evidence for humans and machines: structured logs, metrics, traces, events, and final summaries. Every workflow should answer four questions without reading source code: what was requested, what changed, what remains uncertain, and what action is available now?&lt;/p&gt;
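&lt;p&gt;A small sketch of an audit event that answers those four questions as one JSON line, with credentials scrubbed before serialization. The field names and the two redaction patterns are assumptions for this example and would need to be extended for the secret formats a team really uses.&lt;/p&gt;

```python
import json
import re

# Assumed patterns: a password inside a URL-style DSN, and key=value secrets.
_DSN_PASSWORD = re.compile(r"(://[^:/@\s]+:)[^@\s]+(@)")
_KEY_VALUE_SECRET = re.compile(r"(?i)\b(password|passwd|token|secret)=\S+")


def redact(text):
    """Mask credentials in a single string."""
    text = _DSN_PASSWORD.sub(r"\1***\2", text)
    return _KEY_VALUE_SECRET.sub(r"\1=***", text)


def _scrub(value):
    """Apply redact() to every string inside nested dicts and lists."""
    if isinstance(value, str):
        return redact(value)
    if isinstance(value, dict):
        return {key: _scrub(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [_scrub(item) for item in value]
    return value


def audit_event(operation_id, requested, changed, uncertain, next_action):
    """Build one structured audit line for an operation."""
    event = {
        "operation_id": operation_id,
        "requested": requested,      # what was requested
        "changed": changed,          # what changed
        "uncertain": uncertain,      # what remains uncertain
        "next_action": next_action,  # what action is available now
    }
    return json.dumps(_scrub(event), sort_keys=True)
```

&lt;p&gt;Scrubbing values before &lt;code&gt;json.dumps&lt;/code&gt; rather than after keeps the output valid JSON even when a secret sits at the end of a string.&lt;/p&gt;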
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Kubernetes documents the controller pattern as a reconciliation loop: controllers watch cluster state and move actual state toward desired state. The documented pattern is not “run a script once”; it is persistent comparison between declared intent and observed reality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; A Python DB and cloud automation framework should borrow that pattern. Store the desired operation durably, probe the external systems repeatedly, and let a reconciler classify progress. For example, “create read replica” is not complete when the cloud API returns a replica identifier. It is complete when the replica exists, is reachable, has expected configuration, and satisfies the replication health predicate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The operational result is clearer failure handling. If the executor dies after the API call, the next run does not create a second replica. It reads the intent record, sees the existing external identifier, probes state, and resumes from observation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Treat cloud and database operations as convergence problems, not synchronous procedure calls.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Terraform popularized the plan and apply model for infrastructure changes. The documented pattern separates proposed change, operator review, state tracking, and execution against providers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Python automation should preserve a similar boundary for high-risk operations. Preflight should produce a plan: target resources, expected mutations, lock requirements, blast radius, rollback limits, and verification checks. Execution should attach the plan hash to the intent record so operators can tell whether the approved operation is the one being applied.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; This reduces ambiguity during incidents. A failed operation can be resumed, canceled, or manually completed against a known plan rather than reverse-engineered from logs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Approval without a stable plan is weak control. Execution without state is weak recovery.&lt;/p&gt;
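&lt;p&gt;A plan hash can be as simple as a SHA-256 digest over a canonical JSON encoding of the plan. The sketch below assumes the plan is a JSON-serializable dict and that the intent record stores the approved digest under a hypothetical &lt;code&gt;approved_plan_hash&lt;/code&gt; key.&lt;/p&gt;

```python
import hashlib
import json


def plan_hash(plan):
    """Hash a plan so that key order and whitespace do not change the digest."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_plan_approved(intent, plan):
    """Refuse to execute a plan that differs from the one that was approved."""
    if plan_hash(plan) != intent["approved_plan_hash"]:
        raise PermissionError("plan changed since approval; rerun preflight")
```

&lt;p&gt;Calling &lt;code&gt;assert_plan_approved&lt;/code&gt; at the start of every execution attempt, including resumes, ties each retry to the reviewed plan rather than to whatever preflight would produce today.&lt;/p&gt;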
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; PostgreSQL exposes transactions, lock primitives, and advisory locks. These are documented database behaviors, not framework inventions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Use them deliberately. Schema and maintenance workflows should acquire operation-specific locks, keep transactional sections short, set statement timeouts, verify replica lag before risky changes, and separate transactional database changes from nontransactional cloud side effects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The framework avoids two common hazards: concurrent operators applying incompatible changes, and long automation runs holding locks that block application traffic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Database safety belongs inside the workflow model, not as a checklist outside it.&lt;/p&gt;
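&lt;p&gt;An operation-specific advisory lock might be wrapped as shown below. This is a sketch that assumes a DB-API cursor using the &lt;code&gt;%s&lt;/code&gt; parameter style (as psycopg does); the helpers &lt;code&gt;advisory_lock_key&lt;/code&gt; and &lt;code&gt;operation_lock&lt;/code&gt; are hypothetical, while &lt;code&gt;pg_try_advisory_lock&lt;/code&gt;, &lt;code&gt;pg_advisory_unlock&lt;/code&gt;, and &lt;code&gt;set_config&lt;/code&gt; are standard PostgreSQL functions.&lt;/p&gt;

```python
import hashlib
import struct
from contextlib import contextmanager


def advisory_lock_key(operation, target):
    """Derive a stable signed 64-bit key, the type advisory locks accept."""
    digest = hashlib.sha256(f"{operation}:{target}".encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


@contextmanager
def operation_lock(cursor, operation, target, statement_timeout_ms=5000):
    """Hold a session-level advisory lock for one operation on one target."""
    key = advisory_lock_key(operation, target)
    # Bound every statement in this session before doing any work.
    cursor.execute(
        "SELECT set_config('statement_timeout', %s, false)",
        (str(int(statement_timeout_ms)),),
    )
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    acquired = cursor.fetchone()[0]
    if not acquired:
        # Fail fast instead of queueing behind another operator.
        raise RuntimeError(f"{operation} already running against {target}")
    try:
        yield key
    finally:
        cursor.execute("SELECT pg_advisory_unlock(%s)", (key,))
```

&lt;p&gt;The non-blocking &lt;code&gt;pg_try_advisory_lock&lt;/code&gt; variant is a deliberate choice in this sketch: a second operator gets an immediate, explainable refusal rather than a silent wait.&lt;/p&gt;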
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Design response&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Duplicate side effects&lt;/td&gt;&lt;td&gt;CI retry or operator rerun repeats a non-idempotent call&lt;/td&gt;&lt;td&gt;Idempotency keys, durable intent, external identifier checkpointing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;False success&lt;/td&gt;&lt;td&gt;API accepted work but resource never converged&lt;/td&gt;&lt;td&gt;Postcondition probes and reconciler status&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Hidden partial state&lt;/td&gt;&lt;td&gt;Process dies after remote mutation but before local update&lt;/td&gt;&lt;td&gt;Write intent first, checkpoint after every discovered identifier&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Unsafe rollback&lt;/td&gt;&lt;td&gt;Workflow spans transactional and nontransactional systems&lt;/td&gt;&lt;td&gt;Declare rollback posture per step, prefer compensate over pretend rollback&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Lock contention&lt;/td&gt;&lt;td&gt;Automation holds database locks too long&lt;/td&gt;&lt;td&gt;Preflight lock analysis, short transactions, timeouts, advisory locks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Eventual consistency&lt;/td&gt;&lt;td&gt;Cloud read model lags write model&lt;/td&gt;&lt;td&gt;Backoff, convergence windows, explicit uncertain state&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Secret exposure&lt;/td&gt;&lt;td&gt;Logs capture credentials or connection strings&lt;/td&gt;&lt;td&gt;Structured redaction at adapter boundary&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Operator confusion&lt;/td&gt;&lt;td&gt;Status says failed without next action&lt;/td&gt;&lt;td&gt;Terminal states must include recovery guidance&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The most dangerous state is not &lt;code&gt;failed&lt;/code&gt;. It is &lt;code&gt;unknown&lt;/code&gt;. A mature framework treats unknown as a first-class status with a required reconciliation path.&lt;/p&gt;
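&lt;p&gt;A postcondition probe with backoff shows how &lt;code&gt;unknown&lt;/code&gt; can be returned on purpose instead of being folded into &lt;code&gt;failed&lt;/code&gt;. This is a sketch under assumptions: &lt;code&gt;probe_until_converged&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;probe&lt;/code&gt; is expected to return &lt;code&gt;True&lt;/code&gt; when the postcondition holds, &lt;code&gt;False&lt;/code&gt; when it has definitively failed, and &lt;code&gt;None&lt;/code&gt; when it cannot tell yet.&lt;/p&gt;

```python
import time


def probe_until_converged(probe, window_seconds, base_delay=1.0, max_delay=30.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll a postcondition with exponential backoff inside a convergence window.

    Returns "succeeded", "failed", or "unknown". The last value means the
    window expired without a definitive answer and reconciliation is required.
    """
    deadline = clock() + window_seconds
    delay = base_delay
    while True:
        result = probe()
        if result is True:
            return "succeeded"
        if result is False:
            return "failed"
        if clock() + delay > deadline:
            return "unknown"  # do not guess; hand off to the reconciler
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

&lt;p&gt;The &lt;code&gt;sleep&lt;/code&gt; and &lt;code&gt;clock&lt;/code&gt; parameters exist so the backoff can be tested with a fake clock instead of real waiting.&lt;/p&gt;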
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Python automation for database and cloud operations often starts as imperative scripts, but production workflows fail across process, network, database, CI, and cloud consistency boundaries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Build the framework as a workflow control plane: typed registry, durable intent store, bounded executor, system-specific adapters, reconciler, and audit stream.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Kubernetes controllers, Terraform plan and apply, and PostgreSQL locking and transaction semantics all point to the same architectural lesson: reliable operations require durable intent, observed state, and explicit convergence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Start by rewriting one risky workflow. Add an intent table, idempotency key, step checkpointing, postcondition probes, and operator-readable terminal states. Do not expand the framework until that single workflow can survive timeout, retry, process death, and partial external success.&lt;/p&gt;</content:encoded><category>architecture</category><category>cloud</category><category>databases</category></item><item><title>Natural Language SQL Agents Need Guardrails Before Orchestration</title><link>https://rajivonai.com/blog/2025-03-01-natural-language-sql-agents-need-guardrails-before-orchestra/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-03-01-natural-language-sql-agents-need-guardrails-before-orchestra/</guid><description>How Postgres chat agents turn intent into SQL, and why production systems need schema controls, validation, and auditability.</description><pubDate>Sat, 01 Mar 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The default pattern for natural-language Structured Query Language (SQL) agents is a chat box that asks a large language model to write a query and hands it to an automation workflow; the production pattern is a database-agent control plane that treats generated SQL as untrusted code until policy, cost, schema, and audit checks prove otherwise.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL chat agents are becoming the new analyst interface: a user asks for “high-risk transactions in Q3,” an orchestrator generates SQL, a workflow tool such as n8n executes it, and a summarizer sends the result to Slack, email, or an embedded CopilotKit panel.&lt;/p&gt;
&lt;p&gt;That is useful, but it moves the hard part. The risk is no longer whether a model can write a plausible &lt;code&gt;SELECT&lt;/code&gt;. The risk is whether the system can prove that the generated query is safe, bounded, semantically correct, and reviewable after something goes wrong.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;Default implementation&lt;/th&gt;&lt;th&gt;Production implementation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Natural language to SQL&lt;/td&gt;&lt;td&gt;Prompt an LLM with schema text&lt;/td&gt;&lt;td&gt;Route intent through allowlisted data products&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Execution&lt;/td&gt;&lt;td&gt;n8n PostgreSQL node runs generated SQL&lt;/td&gt;&lt;td&gt;Read-only role, timeout, &lt;code&gt;EXPLAIN&lt;/code&gt;, row limit, audit entry&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Result delivery&lt;/td&gt;&lt;td&gt;Summarize rows directly&lt;/td&gt;&lt;td&gt;Mask, shape, validate, then summarize&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Trust model&lt;/td&gt;&lt;td&gt;Prompt instructions&lt;/td&gt;&lt;td&gt;Database permissions and policy gates&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure mode is not only “the model writes invalid SQL.” PostgreSQL will reject invalid syntax cleanly. The expensive failures are valid SQL statements that answer the wrong question, scan the wrong table, cross tenant boundaries, or leak fields through the summary layer.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Schema grounding&lt;/td&gt;&lt;td&gt;The model joins &lt;code&gt;transactions.user_id&lt;/code&gt; when the business question meant &lt;code&gt;store_id&lt;/code&gt;&lt;/td&gt;&lt;td&gt;The query succeeds and produces a confident false answer&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Access control&lt;/td&gt;&lt;td&gt;Prompt says “read-only,” but the database role can still &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, or call unsafe functions&lt;/td&gt;&lt;td&gt;Prompt text is not a security boundary; PostgreSQL privileges are&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cost control&lt;/td&gt;&lt;td&gt;Generated SQL omits &lt;code&gt;LIMIT&lt;/code&gt; or joins two wide tables without selective predicates&lt;/td&gt;&lt;td&gt;A single chat request can become a production incident on a shared Aurora PostgreSQL writer&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tenant isolation&lt;/td&gt;&lt;td&gt;The query omits &lt;code&gt;tenant_id = current_setting(&apos;app.tenant_id&apos;)&lt;/code&gt; or equivalent policy context&lt;/td&gt;&lt;td&gt;Cross-customer disclosure is a compliance incident, not a dashboard bug&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Result summarization&lt;/td&gt;&lt;td&gt;The SQL is allowed, but the summarizer repeats sensitive columns from returned rows&lt;/td&gt;&lt;td&gt;Policy has to apply after execution, not only before it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Auditability&lt;/td&gt;&lt;td&gt;Only the natural-language prompt is logged&lt;/td&gt;&lt;td&gt;Incident review needs prompt, generated SQL, role, plan, latency, row count, and delivery channel&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;PostgreSQL gives you the pieces: privileges, row-level security, &lt;code&gt;statement_timeout&lt;/code&gt;, &lt;code&gt;EXPLAIN&lt;/code&gt;, views, schemas, and extensions such as &lt;code&gt;pg_stat_statements&lt;/code&gt;. The agent has to assemble them into an operating model. The core question is not “can an LLM write SQL?” It is: &lt;strong&gt;what must be true before generated SQL is allowed to touch production data?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;guardrail-the-sql-agent-as-a-control-plane&quot;&gt;Guardrail the SQL Agent as a Control Plane&lt;/h2&gt;
&lt;p&gt;The right architecture is a narrow control plane around the model. The model proposes. The database and policy layer dispose.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    User[User question] --&gt; Intent[Intent classifier — analytical task]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Intent --&gt; Catalog[Approved catalog — tables and metrics]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Catalog --&gt; Generator[SQL generator — constrained prompt]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Generator --&gt; Parser[SQL parser — abstract syntax tree]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Parser --&gt; Policy[Policy gate — role tenant limit]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Policy --&gt; Plan[Plan gate — explain and cost]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Plan --&gt; Execute[PostgreSQL replica — read only]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Execute --&gt; Shape[Result shaping — masking and limits]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Shape --&gt; Summary[LLM summary — bounded context]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Summary --&gt; Delivery[Delivery channel — UI Slack email]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Execute --&gt; Audit[Audit log — prompt SQL rows latency]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Policy --&gt; Reject[Reject with reason]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Plan --&gt; Reject&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
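&lt;p&gt;The reject path in the diagram can be sketched as an ordered chain of gates that stops at the first refusal and keeps the reason. The example assumes an upstream PostgreSQL-aware parser has already turned the generated SQL into a structured &lt;code&gt;candidate&lt;/code&gt; dict; that dict's shape, the gate functions, and &lt;code&gt;APPROVED_RELATIONS&lt;/code&gt; are illustrative assumptions.&lt;/p&gt;

```python
APPROVED_RELATIONS = {"finance.v_high_risk_transactions"}  # assumed catalog


def single_statement(candidate):
    """Pass only when the parser found exactly one statement."""
    if candidate["statement_count"] == 1:
        return None
    return "multiple statements"


def approved_relations_only(candidate):
    """Pass only when every referenced relation is in the approved catalog."""
    extra = sorted(set(candidate["relations"]) - APPROVED_RELATIONS)
    if not extra:
        return None
    return "unapproved relations: " + ", ".join(extra)


def has_row_limit(candidate):
    """Pass only when the parser saw a top-level LIMIT."""
    return None if candidate["has_limit"] else "missing top-level LIMIT"


GATES = [
    ("single_statement", single_statement),
    ("approved_relations", approved_relations_only),
    ("row_limit", has_row_limit),
]


def run_gates(candidate, gates=GATES):
    """Apply gates in order; a gate returns None to pass or a reason to reject."""
    for name, check in gates:
        reason = check(candidate)
        if reason is not None:
            return {"allowed": False, "gate": name, "reason": reason}
    return {"allowed": True, "gate": None, "reason": None}
```

&lt;p&gt;Returning the gate name and reason, rather than a bare boolean, gives the audit log and the user the same explanation for a rejection.&lt;/p&gt;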
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start with approved data products, not raw schema dumps.&lt;/strong&gt;&lt;br&gt;
Give the agent a catalog of approved views, metric definitions, join keys, and allowed filters. A production catalog should say “&lt;code&gt;finance.v_high_risk_transactions&lt;/code&gt; is the approved surface for fraud review,” not “here are 180 tables, good luck.” PostgreSQL views are the cheapest boundary; materialized views are reasonable when the approved question is repeatedly expensive.&lt;br&gt;
Verification: run the evaluation set against only approved views and fail any query that references a base table directly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use a read-only database role with a short statement timeout.&lt;/strong&gt;&lt;br&gt;
The execution role should have &lt;code&gt;SELECT&lt;/code&gt; on approved schemas only, no ownership of application tables, no write grants, and no ability to mutate session state beyond approved settings. PostgreSQL documents &lt;code&gt;statement_timeout&lt;/code&gt; as a server-side limit that aborts statements exceeding the configured duration, so set it at the role or connection level, not inside the prompt. A typical starting point for an analyst agent is &lt;code&gt;statement_timeout = &apos;5s&apos;&lt;/code&gt; and &lt;code&gt;idle_in_transaction_session_timeout = &apos;10s&apos;&lt;/code&gt;, then tune after observing real plans.&lt;br&gt;
Verification: connect as the agent role and prove &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;CREATE&lt;/code&gt;, and direct access to restricted schemas fail.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Parse SQL before execution.&lt;/strong&gt;&lt;br&gt;
Do not validate SQL with &lt;code&gt;startswith(&quot;SELECT&quot;)&lt;/code&gt;. A generated statement can hide risk in common table expressions, functions, comments, multiple statements, or dialect edge cases. Parse into an abstract syntax tree with a PostgreSQL-aware parser, reject multiple statements, reject write operations, reject disallowed functions, and require a top-level row limit unless the approved view already enforces one.&lt;br&gt;
Verification: maintain negative tests for &lt;code&gt;COPY&lt;/code&gt;, &lt;code&gt;CREATE TEMP TABLE&lt;/code&gt;, &lt;code&gt;SELECT pg_sleep(60)&lt;/code&gt;, multi-statement payloads, and unrestricted scans.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Run &lt;code&gt;EXPLAIN&lt;/code&gt; as a cost gate.&lt;/strong&gt;&lt;br&gt;
PostgreSQL &lt;code&gt;EXPLAIN&lt;/code&gt; can return JSON, which makes it usable as a machine check rather than a string review. The gate should reject plans with sequential scans over large relations, missing tenant predicates, or estimated row counts above the channel limit. This is not perfect; planner estimates drift when statistics are stale. It is still better than discovering the plan after the workflow is already waiting on a hot query.&lt;br&gt;
Verification: compare accepted plans against a blocked corpus of known bad joins and full-table scans.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Shape results before summarization.&lt;/strong&gt;&lt;br&gt;
The summarizer should receive the smallest useful result: selected columns, masked sensitive fields, row caps, aggregate outputs where possible, and explicit caveats. If the user asks for “anomalies,” return the rule used to classify each anomaly, not just a dramatic sentence.&lt;br&gt;
Verification: assert that restricted columns such as Social Security numbers, access tokens, patient identifiers, or cardholder fields cannot appear in the summarizer input.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Audit the complete chain.&lt;/strong&gt;&lt;br&gt;
Store &lt;code&gt;user_id&lt;/code&gt;, prompt, resolved intent, generated SQL, rejected reason, execution role, execution latency, row count, delivery channel, model name, and schema catalog version. &lt;code&gt;pg_stat_statements&lt;/code&gt; can help correlate normalized query patterns at the database layer, but it does not replace application-level audit context.&lt;br&gt;
Verification: pick any delivered answer and reconstruct who asked, what SQL ran, what policy allowed it, and what rows were exposed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
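&lt;p&gt;The plan gate in step 4 can be approximated by walking the parsed output of &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt;. This is a sketch: the keys &lt;code&gt;Plan&lt;/code&gt;, &lt;code&gt;Plans&lt;/code&gt;, &lt;code&gt;Node Type&lt;/code&gt;, &lt;code&gt;Relation Name&lt;/code&gt;, and &lt;code&gt;Plan Rows&lt;/code&gt; come from PostgreSQL's JSON plan format, while &lt;code&gt;plan_gate&lt;/code&gt;, the list of large relations, and the row limit are assumptions chosen for illustration.&lt;/p&gt;

```python
def iter_plan_nodes(node):
    """Yield a plan node and all of its descendants."""
    yield node
    for child in node.get("Plans", []):
        yield from iter_plan_nodes(child)


def plan_gate(explain_json, large_relations, max_rows):
    """Return None to accept a plan or a rejection reason.

    explain_json: parsed EXPLAIN (FORMAT JSON) output, a list whose first
                  element holds the root node under the "Plan" key.
    """
    root = explain_json[0]["Plan"]
    if root["Plan Rows"] > max_rows:
        return f"estimated rows {root['Plan Rows']} exceed limit {max_rows}"
    for node in iter_plan_nodes(root):
        is_seq_scan = node.get("Node Type") == "Seq Scan"
        if is_seq_scan and node.get("Relation Name") in large_relations:
            return f"sequential scan on large relation {node['Relation Name']}"
    return None
```

&lt;p&gt;The row check relies on planner estimates, so it inherits the stale-statistics caveat from step 4; it is a pre-execution filter, and &lt;code&gt;statement_timeout&lt;/code&gt; remains the backstop.&lt;/p&gt;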
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is already visible in production database and agent tooling. These are not anecdotes; they are public design constraints that point in the same direction.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Public source&lt;/th&gt;&lt;th&gt;Documented behavior&lt;/th&gt;&lt;th&gt;Engineering implication&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https://www.postgresql.org/docs/17/ddl-rowsecurity.html&quot;&gt;PostgreSQL Row Security Policies&lt;/a&gt;&lt;/td&gt;&lt;td&gt;PostgreSQL row security policies restrict which rows can be returned or modified by normal queries and data modification commands&lt;/td&gt;&lt;td&gt;Tenant isolation belongs in database policy or approved views, not only in LLM instructions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https://www.postgresql.org/docs/17/runtime-config-client.html&quot;&gt;PostgreSQL &lt;code&gt;statement_timeout&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;PostgreSQL cancels statements that exceed the configured timeout; the setting can be applied per session or role rather than globally&lt;/td&gt;&lt;td&gt;Query cost control should live in the connection or role configuration, not in prompt text&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https://www.postgresql.org/docs/current/using-explain.html&quot;&gt;PostgreSQL &lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;PostgreSQL exposes estimated cost and row counts, and machine-readable &lt;code&gt;EXPLAIN&lt;/code&gt; formats such as JSON&lt;/td&gt;&lt;td&gt;A control plane can reject bad plans before execution, while still treating planner estimates as imperfect signals&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https://api.python.langchain.com/en/latest/sql/langchain_experimental.sql.base.SQLDatabaseChain.html&quot;&gt;LangChain &lt;code&gt;SQLDatabaseChain&lt;/code&gt; security note&lt;/a&gt;&lt;/td&gt;&lt;td&gt;LangChain warns that SQL database credentials should be narrowly scoped because the chain may attempt destructive commands if prompted&lt;/td&gt;&lt;td&gt;The execution credential must be least-privilege even when the application claims to be analytical&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https://supabase.com/docs/guides/database/postgres/row-level-security&quot;&gt;Supabase Row Level Security guidance&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Supabase tells teams to enable RLS on exposed schemas and treat RLS as defense in depth around PostgreSQL data access&lt;/td&gt;&lt;td&gt;Cloud-hosted PostgreSQL does not remove the need for database-enforced policy&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https://aws.amazon.com/blogs/machine-learning/text-to-sql-solution-powered-by-amazon-bedrock/&quot;&gt;AWS Bedrock text-to-SQL architecture&lt;/a&gt;&lt;/td&gt;&lt;td&gt;AWS describes a text-to-SQL architecture that routes questions through context retrieval, enforces Row-Level Security, validates SQL, executes against Redshift, and emits traces to CloudWatch&lt;/td&gt;&lt;td&gt;Public reference architectures put orchestration, policy, validation, execution, and observability into separate control points&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;This is why a simple Crafted AI Framework, n8n, CopilotKit, and PostgreSQL demo is useful but incomplete. The walkthrough shows the control flow: question, orchestration, SQL execution, summarization, delivery. Production requires the missing gates between those boxes.&lt;/p&gt;
&lt;p&gt;A generated query like this is syntactically ordinary:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transaction_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;user_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;amount&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;risk_level&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; transactions t&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; countries c&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;destination_country&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;country_code&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;amount&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; BETWEEN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-07-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DATE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-09-30&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;risk_level&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;high&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The control-plane question is whether it is &lt;em&gt;authorized&lt;/em&gt;. Does &lt;code&gt;user_id&lt;/code&gt; mean customer, employee, merchant, or account owner? Should the filter be &lt;code&gt;store_id = 123&lt;/code&gt;, as the user asked, or &lt;code&gt;user_id = 12345&lt;/code&gt;, as the generated SQL guessed? Is &lt;code&gt;countries.risk_level&lt;/code&gt; the approved compliance source or a stale enrichment table? Is the query running on a replica with a 5-second timeout or on the writer behind checkout traffic?&lt;/p&gt;
&lt;p&gt;That is the gap between a demo and a system a platform lead can defend in a post-incident review.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Plausible wrong metric&lt;/td&gt;&lt;td&gt;User asks for “revenue,” model uses gross transaction amount instead of recognized revenue&lt;/td&gt;&lt;td&gt;Force metric names through a semantic catalog with owner-approved SQL definitions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Expensive valid query&lt;/td&gt;&lt;td&gt;PostgreSQL 15 or 16 planner chooses a sequential scan because statistics are stale after a large load&lt;/td&gt;&lt;td&gt;Run &lt;code&gt;ANALYZE&lt;/code&gt;, reject high estimated row counts, and route heavy questions to precomputed views&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tenant leak&lt;/td&gt;&lt;td&gt;Agent omits tenant predicate on a shared table&lt;/td&gt;&lt;td&gt;Use Row Level Security or tenant-scoped views and set tenant context server-side&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Prompt injection through data&lt;/td&gt;&lt;td&gt;A table row contains text instructing the model to reveal hidden fields&lt;/td&gt;&lt;td&gt;Treat database content as untrusted input and summarize only shaped, masked results&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Summary overclaim&lt;/td&gt;&lt;td&gt;LLM says “fraud detected” when SQL only found transactions over a threshold&lt;/td&gt;&lt;td&gt;Require summaries to cite the rule, row count, and time window used&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Workflow sprawl&lt;/td&gt;&lt;td&gt;n8n workflow grows ad hoc branches for every executive request&lt;/td&gt;&lt;td&gt;Keep orchestration thin; move policy into code, database roles, and versioned catalog files&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Audit blind spot&lt;/td&gt;&lt;td&gt;Slack message survives, generated SQL does not&lt;/td&gt;&lt;td&gt;Insert audit rows before execution and update them with outcome, latency, and row 
count&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica lag&lt;/td&gt;&lt;td&gt;Agent reads from an Aurora PostgreSQL read replica during high write volume&lt;/td&gt;&lt;td&gt;Expose freshness metadata and reject questions requiring current transactional state&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
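&lt;p&gt;The “reject high estimated row counts” fix in the table can be sketched as a small gate over planner output. This is a minimal illustration rather than a definitive implementation: it assumes the caller has already run &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt; and passes the JSON text in, and the &lt;code&gt;MAX_PLAN_ROWS&lt;/code&gt; budget, the function names, and the sample plan are all hypothetical. The &lt;code&gt;Plan&lt;/code&gt;, &lt;code&gt;Plans&lt;/code&gt;, &lt;code&gt;Node Type&lt;/code&gt;, &lt;code&gt;Relation Name&lt;/code&gt;, and &lt;code&gt;Plan Rows&lt;/code&gt; keys are the ones PostgreSQL emits in JSON plans.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;
```python
import json

# Hypothetical budget for this sketch; tune it per workload.
MAX_PLAN_ROWS = 100_000


def walk_plan(node):
    """Yield every node in a PostgreSQL EXPLAIN (FORMAT JSON) plan tree."""
    yield node
    for child in node.get("Plans", []):
        yield from walk_plan(child)


def explain_gate(explain_json, max_rows=MAX_PLAN_ROWS):
    """Return (allowed, reasons) for the text output of EXPLAIN (FORMAT JSON)."""
    plans = json.loads(explain_json)
    reasons = []
    for node in walk_plan(plans[0]["Plan"]):
        rows = node.get("Plan Rows", 0)
        if rows > max_rows:
            label = node.get("Relation Name", node["Node Type"])
            reasons.append(f"{node['Node Type']} on {label}: ~{rows} rows estimated")
    return (not reasons, reasons)


# Hand-written sample plan (hypothetical numbers) shaped like the query above.
sample = json.dumps([{
    "Plan": {
        "Node Type": "Limit", "Plan Rows": 100,
        "Plans": [{
            "Node Type": "Hash Join", "Plan Rows": 5200,
            "Plans": [
                {"Node Type": "Seq Scan", "Relation Name": "transactions",
                 "Plan Rows": 48_000_000},
                {"Node Type": "Hash", "Plan Rows": 12,
                 "Plans": [{"Node Type": "Seq Scan",
                            "Relation Name": "countries", "Plan Rows": 12}]},
            ],
        }],
    }
}])

allowed, reasons = explain_gate(sample)
print(allowed, reasons)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the gate trusts planner estimates, it is only as good as the statistics behind them, which is why the table pairs it with &lt;code&gt;ANALYZE&lt;/code&gt; after large loads.&lt;/p&gt;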
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Natural-language SQL agents fail when generated queries are treated as trusted database clients.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Put a control plane between the model and PostgreSQL: approved catalog, parser, policy gate, &lt;code&gt;EXPLAIN&lt;/code&gt; gate, read-only execution role, result shaping, and audit logging.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A useful validation signal is an evaluation set where ambiguous time windows, missing tenant filters, expensive joins, restricted columns, and prompt-injected table content are rejected before execution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, build the smallest safe version: three approved views, one read-only role, &lt;code&gt;statement_timeout = &apos;5s&apos;&lt;/code&gt;, mandatory &lt;code&gt;LIMIT 100&lt;/code&gt;, JSON &lt;code&gt;EXPLAIN&lt;/code&gt;, and an &lt;code&gt;ai_query_audit&lt;/code&gt; table.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
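&lt;p&gt;The smallest safe version in the Action item could start with a pre-execution gate like the sketch below. It is deliberately crude and rests on stated assumptions: the view names in &lt;code&gt;APPROVED_VIEWS&lt;/code&gt; are hypothetical, the checks use regular expressions where a production gate would use a real SQL parser, and the read-only role and timeout still have to be enforced by PostgreSQL itself. The &lt;code&gt;SESSION_SETTINGS&lt;/code&gt; tuple only lists statements the executor is assumed to run before each query.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;
```python
import re

# Hypothetical view names for this sketch; a real catalog would be versioned.
APPROVED_VIEWS = {"v_txn_summary", "v_country_risk", "v_store_daily"}
MAX_LIMIT = 100

# Statements the executor is assumed to issue on the read-only connection.
SESSION_SETTINGS = (
    "SET statement_timeout = '5s'",
    "SET default_transaction_read_only = on",
)

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|grant|truncate|copy)\b",
    re.IGNORECASE,
)
RELATION = re.compile(r"\b(?:from|join)\s+([a-z_][a-z0-9_.]*)", re.IGNORECASE)
LIMIT = re.compile(r"\blimit\s+(\d+)\s*;?\s*$", re.IGNORECASE)


def policy_gate(sql):
    """Return a list of violations; an empty list means the query may proceed."""
    text = sql.strip()
    violations = []
    if ";" in text.rstrip(";"):
        violations.append("multiple statements")
    if not text.lower().startswith("select"):
        violations.append("not a SELECT")
    if FORBIDDEN.search(text):
        violations.append("write or DDL keyword present")
    for name in RELATION.findall(text):
        if name.lower() not in APPROVED_VIEWS:
            violations.append(f"relation not approved: {name}")
    match = LIMIT.search(text)
    if match is None:
        violations.append("missing LIMIT")
    elif int(match.group(1)) > MAX_LIMIT:
        violations.append(f"LIMIT exceeds {MAX_LIMIT}")
    return violations
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under this gate, the generated query shown earlier would be rejected for reading &lt;code&gt;transactions&lt;/code&gt; and &lt;code&gt;countries&lt;/code&gt; directly instead of going through approved views, even though its &lt;code&gt;LIMIT 100&lt;/code&gt; is acceptable.&lt;/p&gt;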
&lt;p&gt;A SQL agent earns production access only when the database would still be safe if the model made the worst plausible choice.&lt;/p&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>architecture</category></item><item><title>Double Write Buffers Fail at the I/O Boundary</title><link>https://rajivonai.com/blog/2025-02-22-double-write-buffers-fail-at-the-i-o-boundary/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-02-22-double-write-buffers-fail-at-the-i-o-boundary/</guid><description>Why porting InnoDB’s double write buffer to PostgreSQL breaks on buffered I/O, fsync semantics, and background writer design.</description><pubDate>Sat, 22 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A double write buffer only protects a database if the second write crosses the same durability boundary as the first; port InnoDB’s double write buffer into PostgreSQL without that boundary, and you have built a corruption machine with better comments.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;AI coding agents are now good enough to produce plausible systems code inside mature engines like PostgreSQL. That changes the review problem: the first failure is no longer “does it compile?” but “does the generated design preserve the subsystem’s recovery invariants?”&lt;/p&gt;
&lt;p&gt;The default PostgreSQL protection is write-ahead log (WAL) full page writes (FPW): after each checkpoint, the first modification of a page writes the whole page image into WAL. The tempting alternative is an InnoDB-style double write buffer (DWB): write a safe copy of the page elsewhere, flush it, then write the page to its final data-file location.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Recovery copy&lt;/th&gt;&lt;th&gt;Durability boundary&lt;/th&gt;&lt;th&gt;Primary cost&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL FPW&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Full 8KB page image in WAL&lt;/td&gt;&lt;td&gt;WAL flush through &lt;code&gt;wal_sync_method&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Higher WAL volume after checkpoints&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;InnoDB DWB&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Page copy in doublewrite files&lt;/td&gt;&lt;td&gt;DWB flush before final data-file write&lt;/td&gt;&lt;td&gt;Extra data writes and recovery state&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Naive PostgreSQL DWB port&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;Page copy in a new buffer area&lt;/td&gt;&lt;td&gt;Often mistaken as &lt;code&gt;smgrwrite()&lt;/code&gt; or &lt;code&gt;sync_file_range()&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Silent loss of the only safe copy&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The non-obvious failure is that InnoDB’s DWB and PostgreSQL’s FPW solve the same torn-page problem under different I/O contracts. MySQL documents InnoDB’s DWB as a storage area written before pages go to their proper locations, with a single &lt;code&gt;fsync()&lt;/code&gt; for the doublewrite chunk in the normal design (&lt;a href=&quot;https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-buffer.html&quot;&gt;MySQL 8.0 manual&lt;/a&gt;). PostgreSQL documents FPW as necessary because an operating-system crash can leave a page containing a mix of old and new data, and row-level WAL alone cannot repair that page (&lt;a href=&quot;https://www.postgresql.org/docs/current/runtime-config-wal.html&quot;&gt;PostgreSQL WAL settings&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The dangerous part is that the APIs look boring. &lt;code&gt;write()&lt;/code&gt;, &lt;code&gt;fsync()&lt;/code&gt;, &lt;code&gt;sync_file_range()&lt;/code&gt;, background writer, checkpointer. An AI agent can assemble those names into code that resembles a storage feature. The database will still start. Basic tests will still pass. Then the first crash at the wrong microsecond becomes your design review.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;smgrwrite()&lt;/code&gt; treated as durable&lt;/td&gt;&lt;td&gt;PostgreSQL has handed bytes to the kernel page cache, not necessarily persistent media&lt;/td&gt;&lt;td&gt;A DWB slot can be reused before the destination page is safe&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;sync_file_range()&lt;/code&gt; treated as &lt;code&gt;fsync()&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Linux documents &lt;code&gt;SYNC_FILE_RANGE_WRITE&lt;/code&gt; as asynchronous and warns it is not suitable for data integrity operations (&lt;a href=&quot;https://man7.org/linux/man-pages/man2/sync_file_range2.2.html&quot;&gt;man7&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;The code can believe flushing started when recovery needs proof flushing finished&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BgWriter given synchronous DWB work&lt;/td&gt;&lt;td&gt;&lt;code&gt;bgwriter_delay&lt;/code&gt; defaults to 200ms and &lt;code&gt;bgwriter_lru_maxpages&lt;/code&gt; bounds per-round writes in PostgreSQL’s background writer design (&lt;a href=&quot;https://www.postgresql.org/docs/16/runtime-config-resource.html&quot;&gt;PostgreSQL resource settings&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;A process designed to smooth dirty-buffer pressure becomes an fsync bottleneck&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FPW removed before DWB proves equivalence&lt;/td&gt;&lt;td&gt;PostgreSQL’s &lt;code&gt;full_page_writes&lt;/code&gt; default is &lt;code&gt;on&lt;/code&gt;, and docs warn disabling it can cause unrecoverable or silent corruption after failure&lt;/td&gt;&lt;td&gt;You save WAL bytes by deleting the recovery source of truth&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slot metadata reused early&lt;/td&gt;&lt;td&gt;The page copy may be durable, but the mapping from page identity to DWB slot is no longer 
valid&lt;/td&gt;&lt;td&gt;The hardest corruption is not a torn page; it is confidence in a backup you already overwrote&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core question is not whether PostgreSQL can have a double write buffer. It is whether the design can prove, at every crash point, that either WAL or DWB contains a complete page image newer than the torn data-file page.&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;A correct PostgreSQL DWB design has to be staged around recovery truth, not modeled as an extra function call in &lt;code&gt;FlushBuffer()&lt;/code&gt;. The invariant is simple enough to write on a whiteboard: do not reuse a DWB slot until the page’s final data-file location has been confirmed durable by a sync issued after the page write.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Dirty[dirty buffer selected] --&gt; Copy[copy page to DWB slot]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Copy --&gt; DwbFsync[fsync DWB file]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DwbFsync --&gt; WalCheck[confirm WAL ordering]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WalCheck --&gt; DataWrite[write page to tablespace]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DataWrite --&gt; DataSync[fsync tablespace file]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DataSync --&gt; Reclaim[reclaim DWB slot]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Crash[crash recovery] --&gt; Inspect[inspect page checksum and LSN]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Inspect --&gt;|page torn| Restore[restore from DWB or WAL]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Inspect --&gt;|page valid| Replay[continue WAL replay]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Define the authoritative recovery copy per page version.&lt;br&gt;
If FPW remains enabled, WAL is authoritative for first-touch pages after checkpoint. If DWB is intended to replace FPW, the DWB slot plus metadata must become authoritative. Verification: write a crash-state matrix for DWB write, DWB fsync, tablespace write, tablespace fsync, checkpoint record, and slot reuse.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Separate page copy from durability confirmation.&lt;br&gt;
Copying an 8KB PostgreSQL page into a DWB slot is not the expensive part. The expensive part is proving that copy is on persistent storage, with its page identity, block number, relation fork, page LSN, and checksum intact. Verification: a crash after DWB copy but before DWB fsync must recover from WAL or ignore the incomplete DWB entry.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Delay slot reuse until the destination file crosses a real sync boundary.&lt;br&gt;
In PostgreSQL’s buffered I/O model, a successful data-file write only means the kernel page cache has accepted the bytes. &lt;code&gt;sync_file_range()&lt;/code&gt; can start writeback, but the Linux man page warns that it is not suitable for data integrity operations, so it cannot serve as the crash-safety boundary. Verification: a crash after tablespace write but before tablespace fsync must still find the DWB slot valid.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Keep synchronous I/O out of the single BgWriter loop.&lt;br&gt;
PostgreSQL spreads checkpoint writes over time with &lt;code&gt;checkpoint_completion_target&lt;/code&gt;, which has defaulted to 0.9 since PostgreSQL 14, specifically to avoid bursty I/O (&lt;a href=&quot;https://www.postgresql.org/docs/current/runtime-config-wal.html&quot;&gt;PostgreSQL checkpoint settings&lt;/a&gt;). A DWB implementation needs a manager, batched slots, and completion accounting, not a per-buffer fsync in the background writer. Verification: track &lt;code&gt;buffers_backend&lt;/code&gt; (a &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; column through PostgreSQL 16; PostgreSQL 17 removed it in favor of &lt;code&gt;pg_stat_io&lt;/code&gt;), checkpoint duration, WAL generation, and p99 write latency under &lt;code&gt;pgbench&lt;/code&gt; before and after enabling the prototype.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Make recovery boring.&lt;br&gt;
Recovery must not infer intent from partially updated state. It should read DWB metadata, validate checksums and LSNs, restore only complete entries, and ignore anything whose durability boundary was not crossed. Verification: run crash injection at every transition, including slot metadata update and slot reuse.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
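&lt;p&gt;The crash-state matrix asked for in step 1 can be prototyped before any C is written. The sketch below is a toy model under stated assumptions: FPW is off, so the DWB slot is the only recovery copy; a data page may be torn at any crash between its write and its fsync; and the step names are hypothetical labels, not PostgreSQL functions. It compares the safe ordering with a naive one that reclaims the slot before the tablespace fsync.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;
```python
# Hypothetical step labels for one page flush through a DWB slot.
SAFE_ORDER = ["dwb_write", "dwb_fsync", "data_write", "data_fsync", "slot_reuse"]
NAIVE_ORDER = ["dwb_write", "dwb_fsync", "data_write", "slot_reuse", "data_fsync"]


def recovery_outcome(done):
    """Classify the state left behind if a crash follows the steps in `done`."""
    # The slot is a usable copy only if it was fsynced and not yet overwritten.
    dwb_durable = "dwb_fsync" in done and "slot_reuse" not in done
    # Between write and fsync the kernel may have persisted part of the page.
    data_maybe_torn = "data_write" in done and "data_fsync" not in done
    if not data_maybe_torn:
        return "data page intact"
    return "restore from DWB" if dwb_durable else "UNRECOVERABLE"


def crash_matrix(order):
    """Map every crash point (prefix of completed steps) to its outcome."""
    prefixes = [order[:i] for i in range(len(order) + 1)]
    return {tuple(p): recovery_outcome(p) for p in prefixes}


for name, order in (("safe", SAFE_ORDER), ("naive", NAIVE_ORDER)):
    print(name)
    for steps, outcome in crash_matrix(order).items():
        print("  after", steps[-1] if steps else "nothing", "=>", outcome)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The naive ordering produces exactly one unrecoverable crash point, the window after slot reuse and before the tablespace fsync, which is the window the invariant exists to close.&lt;/p&gt;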
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented comparison is already enough to reject the naive port.&lt;/p&gt;
&lt;p&gt;PostgreSQL’s own documentation says &lt;code&gt;full_page_writes&lt;/code&gt; stores the whole disk page in WAL on the first modification after checkpoint because a torn data page cannot be repaired from row-level WAL alone. It also states the default is &lt;code&gt;on&lt;/code&gt; and that disabling it can lead to unrecoverable or silent corruption after a system failure. That is not a tuning hint. That is a contract.&lt;/p&gt;
&lt;p&gt;MySQL’s InnoDB documentation describes a different contract: pages flushed from the buffer pool are first written to the doublewrite area, and crash recovery can use that good copy if the final data-file write was interrupted. Since MySQL 8.0.20, those doublewrite pages live in doublewrite files rather than the old system tablespace location; since MySQL 8.0.30, &lt;code&gt;innodb_doublewrite&lt;/code&gt; also supports &lt;code&gt;DETECT_AND_RECOVER&lt;/code&gt; and &lt;code&gt;DETECT_ONLY&lt;/code&gt;. The design is not merely “write the page twice.” It is “write the page twice with ordered recovery metadata and a known flush point.”&lt;/p&gt;
&lt;p&gt;The documented pattern is clear: if generated code reclaims a DWB slot after &lt;code&gt;smgrwrite()&lt;/code&gt; or after an advisory range flush, it has confused a buffered write with a durable write. That is enough to violate the recovery invariant. The system can lose the durable DWB copy while the data-file page is still only dirty kernel state.&lt;/p&gt;
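&lt;p&gt;The difference between a buffered write and a durable write is easy to show at the system-call level. The sketch below is an illustration only, assuming a POSIX system and two already-open file descriptors; the &lt;code&gt;durable_pwrite()&lt;/code&gt; and &lt;code&gt;flush_page()&lt;/code&gt; helpers are hypothetical and are not PostgreSQL or InnoDB code. It also issues one fsync per page for clarity, which is the pattern a real design would batch.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;
```python
import os

PAGE_SIZE = 8192  # PostgreSQL default block size


def durable_pwrite(fd, data, offset):
    """Write `data` at `offset`, then fsync so the bytes cross the sync boundary."""
    written = os.pwrite(fd, data, offset)
    if written != len(data):
        raise OSError("short write")
    # Without this call the bytes may exist only in the kernel page cache.
    os.fsync(fd)


def flush_page(dwb_fd, data_fd, slot, block, page):
    """Flush one page through a DWB slot; return True when the slot may be reused."""
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    durable_pwrite(dwb_fd, page, slot * PAGE_SIZE)    # 1. safe copy is durable
    durable_pwrite(data_fd, page, block * PAGE_SIZE)  # 2. final location is durable
    return True                                       # 3. only now reclaim the slot
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Returning the reclaim signal only after the second fsync is the whole point: dropping that call makes the function faster and removes the guarantee.&lt;/p&gt;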
&lt;p&gt;This is exactly where AI-assisted systems work gets risky. Language models are strong at local similarity: InnoDB has a DWB, PostgreSQL has dirty pages, both have write paths, so assemble the bridge. But storage engines are not CRUD apps with worse naming. The important behavior lives between process architecture, kernel writeback, filesystem semantics, WAL ordering, and the crash replay path. The code shape is the least interesting part.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Premature DWB slot reuse&lt;/td&gt;&lt;td&gt;Slot is freed after &lt;code&gt;smgrwrite()&lt;/code&gt; returns on PostgreSQL with buffered I/O&lt;/td&gt;&lt;td&gt;Reclaim only after confirmed destination &lt;code&gt;fsync()&lt;/code&gt; or equivalent durable sync after the page write&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;False confidence from &lt;code&gt;sync_file_range()&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Linux &lt;code&gt;SYNC_FILE_RANGE_WRITE&lt;/code&gt; starts asynchronous writeback and does not flush volatile disk caches&lt;/td&gt;&lt;td&gt;Use it only as a writeback hint; keep &lt;code&gt;fsync()&lt;/code&gt; or &lt;code&gt;fdatasync()&lt;/code&gt; as the durability boundary&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;BgWriter latency collapse&lt;/td&gt;&lt;td&gt;Per-page DWB fsync added to a loop governed by &lt;code&gt;bgwriter_delay&lt;/code&gt; and &lt;code&gt;bgwriter_lru_maxpages&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Move DWB fsync into batched workers with completion queues and backpressure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Checkpoint storms&lt;/td&gt;&lt;td&gt;DWB fsync work prevents dirty buffers from being cleaned ahead of checkpoints&lt;/td&gt;&lt;td&gt;Budget DWB throughput against &lt;code&gt;checkpoint_completion_target&lt;/code&gt;, &lt;code&gt;max_wal_size&lt;/code&gt;, and observed checkpoint sync time&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;WAL invariant drift&lt;/td&gt;&lt;td&gt;DWB metadata claims protection for a page whose WAL record was not flushed in the expected order&lt;/td&gt;&lt;td&gt;Tie DWB entries to page LSNs and WAL flush state; reject entries recovery cannot order&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Recovery ambiguity&lt;/td&gt;&lt;td&gt;DWB slot has page bytes but stale relation, fork, block, checksum, or LSN metadata&lt;/td&gt;&lt;td&gt;Make 
metadata durable with the slot and validate all identifiers before restore&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Misleading benchmark win&lt;/td&gt;&lt;td&gt;FPW disabled on a clean shutdown benchmark with no crash injection&lt;/td&gt;&lt;td&gt;Require power-fail tests, torn-page injection, and recovery validation before comparing WAL volume&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Version-specific InnoDB copying&lt;/td&gt;&lt;td&gt;MySQL 8.0.20 moved DWB storage to doublewrite files; older mental models still cite &lt;code&gt;ibdata1&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Treat engine version as part of the design, not trivia&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
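&lt;p&gt;The “validate all identifiers before restore” fix can be sketched as a slot format whose checksum covers both the page bytes and the page identity. Everything about the layout here is an assumption made for illustration: the header fields, their widths, the use of CRC32, and the &lt;code&gt;min_lsn&lt;/code&gt; parameter (standing in for the oldest page LSN recovery could still roll forward from) are hypothetical and do not describe InnoDB’s or PostgreSQL’s on-disk formats.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;
```python
import struct
import zlib

PAGE_SIZE = 8192
# Hypothetical slot header: relation id, fork number, block number, page LSN, CRC32.
HEADER = struct.Struct("!IIIQI")
IDENT = struct.Struct("!IIIQ")


def pack_slot(rel, fork, block, lsn, page):
    """Serialize a slot: header (CRC over identity plus page) followed by the page."""
    crc = zlib.crc32(IDENT.pack(rel, fork, block, lsn) + page)
    return HEADER.pack(rel, fork, block, lsn, crc) + page


def validate_slot(raw, rel, fork, block, min_lsn):
    """Return the page only if the slot is complete and matches the expected identity."""
    if len(raw) != HEADER.size + PAGE_SIZE:
        return None  # truncated slot: its durability boundary was not crossed
    s_rel, s_fork, s_block, s_lsn, s_crc = HEADER.unpack_from(raw)
    page = raw[HEADER.size:]
    if zlib.crc32(IDENT.pack(s_rel, s_fork, s_block, s_lsn) + page) != s_crc:
        return None  # torn or corrupted slot
    if (s_rel, s_fork, s_block) != (rel, fork, block):
        return None  # slot was reused for a different page
    if min_lsn > s_lsn:
        return None  # copy is older than the state recovery needs
    return page
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every failed check returns the same answer, “do not restore from this slot,” which keeps recovery from inferring intent out of partially updated state.&lt;/p&gt;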
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: AI-generated storage code can compile while breaking the only invariant that matters: after a crash, at least one complete, recoverable image of every affected page must exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Review DWB as a recovery protocol with explicit durable states, not as a write-path optimization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: The validation signal is not a passing smoke test; it is crash injection across every DWB, WAL, tablespace write, fsync, checkpoint, and slot-reuse transition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, take one generated systems patch and write its durability matrix: recovery source of truth, sync boundary, reclaim condition, and invalid crash states.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A database does not care that the code looked like the reference architecture; it only cares which bytes survived the crash.&lt;/p&gt;</content:encoded><category>databases</category><category>ai-engineering</category><category>failures</category></item><item><title>The 2027 Cloud Database Architecture Roadmap</title><link>https://rajivonai.com/blog/2024-12-11-the-2027-cloud-database-architecture-roadmap/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-12-11-the-2027-cloud-database-architecture-roadmap/</guid><description>A 2027 cloud database architecture roadmap for teams that can no longer satisfy consistency, latency, residency, and recovery SLOs with a single engine.</description><pubDate>Wed, 11 Dec 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The next cloud database failure will not come from picking the wrong engine; it will come from pretending one engine can carry every consistency model, latency budget, residency rule, and recovery objective the business now depends on.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Cloud databases have moved from managed infrastructure to application architecture. The old decision was simple: choose Postgres, MySQL, DynamoDB, Spanner, Cassandra, Redis, or a warehouse, then make the application conform to the database. That worked when the product had one dominant workload and one dominant failure mode.&lt;/p&gt;
&lt;p&gt;By 2027, the database layer is no longer a single backing service. It is a fleet: regional OLTP, globally consistent ledgers, event logs, search indexes, vector retrieval, analytical replicas, tenant archives, and policy-aware data products. The operational boundary has shifted from “is the database up?” to “does the system still preserve the correct contract when part of the data plane is stale, relocated, throttled, replayed, or isolated?”&lt;/p&gt;
&lt;p&gt;The staff-level roadmap is therefore not a vendor matrix. It is a control-plane problem. Teams need to define which data must be strongly ordered, which data may be asynchronous, which data must stay in a geography, which data can be regenerated, and which data must remain queryable during a regional event.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Most database incidents are contract incidents disguised as capacity incidents.&lt;/p&gt;
&lt;p&gt;A write path is scaled horizontally, but the uniqueness guarantee still depends on a single regional primary. A read replica is added for latency, but a workflow quietly assumes read-your-writes behavior. A cache absorbs load, but the invalidation path becomes the real system of record during a failover. A vector index is introduced for retrieval, but nobody defines how embedding freshness relates to transactional truth. A data residency policy is implemented at the network layer, while asynchronous jobs still copy customer records into a global queue.&lt;/p&gt;
&lt;p&gt;These failures are rarely caused by ignorance. They are caused by architecture that does not name its database contracts explicitly. The application says “save order.” The database architecture silently decides ordering, durability, idempotency, placement, indexing, and recovery.&lt;/p&gt;
&lt;p&gt;The 2027 question is not “Which cloud database should we standardize on?” It is: &lt;strong&gt;which data contracts deserve first-class architecture, and which engines should be assigned only after those contracts are visible?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The answer is a contract-first database platform: a small number of explicitly governed persistence patterns, each with a named consistency model, failure mode, and recovery procedure.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[product workflow — user intent] --&gt; B[contract classifier — data criticality]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[ledger store — strict ordering]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; D[regional OLTP — low latency writes]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; E[event log — replayable facts]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; F[derived indexes — search and retrieval]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; G[analytical plane — historical queries]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; H[policy engine — residency and retention]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  H --&gt; I[control plane — placement and recovery]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt; J[verification suite — failover drills]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt; K[observability — contract metrics]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This roadmap has five architectural moves.&lt;/p&gt;
&lt;p&gt;First, classify data before selecting engines. Ledgers, inventory reservations, financial balances, identity state, entitlement decisions, and audit trails are not generic rows. They require explicit ordering, idempotency keys, reconciliation flows, and restore tests. Product metadata, recommendations, notifications, activity feeds, and search documents can often tolerate asynchronous propagation if the user contract is clear.&lt;/p&gt;
&lt;p&gt;Second, split systems of record from systems of interaction. The system of record preserves facts. The system of interaction optimizes reads, search, ranking, and locality. Treating an index, cache, or embedding store as authoritative creates silent correctness debt.&lt;/p&gt;
&lt;p&gt;Third, make geography part of the schema. Region, tenant, retention class, and residency boundary should be visible in data modeling and routing. If placement is only a Terraform concern, the application will eventually leak data across an unintended path.&lt;/p&gt;
&lt;p&gt;Fourth, make recovery a queryable property. Every persistence pattern should declare its recovery point objective (RPO), recovery time objective (RTO), replay source, backfill procedure, and validation query. A backup that cannot prove semantic recovery is storage, not resilience.&lt;/p&gt;
&lt;p&gt;Fifth, centralize database policy without centralizing every database. A platform team should own paved-road contracts, reference implementations, test harnesses, and operational scorecards. Application teams should still choose the simplest approved pattern that satisfies their workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strict global order&lt;/strong&gt;: Distributed SQL for externally consistent transactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regional low latency&lt;/strong&gt;: Regional relational primary with local replicas.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Massive key access&lt;/strong&gt;: Partitioned key-value store for predictable throughput.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replayable integration&lt;/strong&gt;: Event log for a durable append stream.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic retrieval&lt;/strong&gt;: Index store for derived embeddings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Historical analysis&lt;/strong&gt;: Warehouse or lakehouse for batch and streaming ingest.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; The documented pattern in Amazon Aurora is that cloud-native relational systems can move substantial storage responsibility out of the database host and into a distributed storage layer. The Aurora paper describes a design where the database instance ships redo records to storage nodes instead of performing the full page-oriented storage work on the compute node: &lt;a href=&quot;https://www.amazon.science/publications/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases&quot;&gt;Amazon Aurora design considerations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The architectural action is to stop treating compute and storage as one scaling unit. For 2027 systems, the roadmap should separate write admission, transaction execution, log durability, page reconstruction, backup, and read scaling as distinct design surfaces.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The documented result is not “Aurora fits every workload.” The result is narrower and more useful: separating database compute from distributed storage changes the bottleneck map. Network write amplification, recovery behavior, replica lag, and storage quorum health become first-order operational signals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; The pattern is that managed relational databases are no longer just hosted VMs. They are distributed systems with relational interfaces. Teams that operate them as single-node databases will miss the failure modes that matter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Google Spanner documents a different contract: externally consistent transactions using TrueTime and replicated consensus. The public documentation describes external consistency as the strongest transaction ordering guarantee Spanner exposes when using serializable isolation: &lt;a href=&quot;https://cloud.google.com/spanner/docs/true-time-external-consistency&quot;&gt;Spanner TrueTime and external consistency&lt;/a&gt;. The original OSDI paper explains the globally distributed design: &lt;a href=&quot;https://research.google.com/archive/spanner-osdi2012.pdf&quot;&gt;Spanner paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The architectural action is to reserve globally ordered databases for workflows that truly need global ordering. Use them for ledgers, entitlement changes, cross-region inventory, and other facts where “which write happened first” is part of correctness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The documented pattern is that global consistency has an explicit coordination cost. The roadmap should therefore avoid putting every user preference, page view, notification, and recommendation write into the same globally ordered path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Strong consistency is a product contract, not a prestige feature. If the product does not need the contract, the architecture should not pay for it on every request.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Amazon DynamoDB documents a partitioned, fully managed key-value architecture built for predictable performance at scale: &lt;a href=&quot;https://www.amazon.science/publications/amazon-dynamodb-a-scalable-predictably-performant-and-fully-managed-nosql-database-service&quot;&gt;Amazon DynamoDB paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The architectural action is to design access patterns before table shape. High-scale key-value systems reward known query paths, bounded item sizes, explicit partition keys, and deliberate secondary indexes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The documented pattern is that predictable performance comes from constraining the data model around access. Teams that expect ad hoc relational query flexibility from a key-value store usually move complexity into application code, backfills, and secondary indexing pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; The database roadmap should not ask one store to be both the high-throughput serving path and the exploratory query surface. Serve hot paths from constrained models; analyze history elsewhere.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; CockroachDB documents multi-region abstractions and transaction behavior for distributed SQL, including region-aware capabilities and serializable transaction semantics: &lt;a href=&quot;https://www.cockroachlabs.com/docs/stable/multiregion-overview&quot;&gt;CockroachDB multi-region overview&lt;/a&gt; and &lt;a href=&quot;https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer&quot;&gt;transaction layer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The architectural action is to model locality and contention together. A globally distributed table with hot transactional rows is not equivalent to a region-local table with replicated reference data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The documented pattern is that multi-region design is a schema and workload problem, not only a cluster topology problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Geography belongs in architecture reviews before launch, not in incident response after latency and residency collide.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Roadmap choice&lt;/th&gt;&lt;th&gt;What improves&lt;/th&gt;&lt;th&gt;Where it breaks&lt;/th&gt;&lt;th&gt;Verification step&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Contract-first persistence&lt;/td&gt;&lt;td&gt;Clear ownership of consistency and recovery&lt;/td&gt;&lt;td&gt;Slower upfront design&lt;/td&gt;&lt;td&gt;Review every critical workflow for ordering, idempotency, and replay&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Distributed SQL for global facts&lt;/td&gt;&lt;td&gt;Stronger cross-region correctness&lt;/td&gt;&lt;td&gt;Coordination latency and transaction retries&lt;/td&gt;&lt;td&gt;Run contention tests from every active region&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Regional OLTP by default&lt;/td&gt;&lt;td&gt;Lower write latency and simpler operations&lt;/td&gt;&lt;td&gt;Cross-region workflows need explicit reconciliation&lt;/td&gt;&lt;td&gt;Test regional isolation and delayed replication&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Event log for integration&lt;/td&gt;&lt;td&gt;Replayable downstream state&lt;/td&gt;&lt;td&gt;Consumers may treat events as current truth&lt;/td&gt;&lt;td&gt;Compare materialized views against source facts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Derived search and vector indexes&lt;/td&gt;&lt;td&gt;Fast retrieval and ranking&lt;/td&gt;&lt;td&gt;Staleness becomes user-visible&lt;/td&gt;&lt;td&gt;Track freshness lag as a product metric&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Central database platform&lt;/td&gt;&lt;td&gt;Fewer unsafe one-off patterns&lt;/td&gt;&lt;td&gt;Platform can become a bottleneck&lt;/td&gt;&lt;td&gt;Publish approved contracts with self-service templates&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Your database architecture probably names engines more clearly than it names contracts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Build a persistence catalog with approved patterns for ledgers, regional OLTP, event streams, derived indexes, analytical stores, and archives.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; For each pattern, require a failover drill, restore drill, replay drill, and consistency test that a product engineer can understand.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Before adding the next database, write the contract first: ordering, freshness, placement, recovery, ownership, and the query that proves the system is correct after failure.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>databases</category><category>cloud</category></item><item><title>PostgreSQL 16/17 Features That Matter to Operators</title><link>https://rajivonai.com/blog/2024-10-24-postgresql-16-17-features-that-matter-to-operators/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-10-24-postgresql-16-17-features-that-matter-to-operators/</guid><description>Which PostgreSQL 16 and 17 changes operators actually need to prepare for: logical replication improvements, vacuum visibility, connection limits, and monitoring additions that change on-call behavior.</description><pubDate>Thu, 24 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;PostgreSQL 16 and 17 each added dozens of features. Most of them are developer-facing: new SQL syntax, function improvements, improved type support. The ones that matter to operators are a shorter list — but they change how you observe I/O, configure replication, manage access control, and run backups.&lt;/strong&gt; Upgrading to PG16 or PG17 without reviewing these operational changes means your dashboards break silently, your replication topology adds unexpected complexity, and your backup process changes in ways your runbooks do not reflect.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL follows a yearly release cadence. PG16 shipped in September 2023 and PG17 in September 2024. Both releases continue the pattern of adding features that benefit application developers, but they also change or add several infrastructure-level capabilities that operators care about more than developers do.&lt;/p&gt;
&lt;p&gt;This post covers only operationally significant changes: new system views, replication topology changes, backup improvements, and access control changes. Developer-facing features (new SQL functions, JSON improvements, etc.) are out of scope.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Operators who upgrade without reviewing the release notes typically encounter problems in three categories: monitoring breaks (a metric they relied on moved or changed format), replication complexity increases (a new capability requires opting in or opting out), or a backup workflow changes (new flags or new manifest requirements).&lt;/p&gt;
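&lt;p&gt;The specific risk sits in the I/O statistics views. PG16 adds &lt;code&gt;pg_stat_io&lt;/code&gt; while leaving &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; and &lt;code&gt;pg_stat_database&lt;/code&gt; in place, so existing dashboards keep working but show only cluster-wide or per-database aggregates. PG17 then removes &lt;code&gt;buffers_backend&lt;/code&gt; and &lt;code&gt;buffers_backend_fsync&lt;/code&gt; from &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; and moves the checkpoint counters to the new &lt;code&gt;pg_stat_checkpointer&lt;/code&gt; view. Queries that still reference the removed columns fail after a PG17 upgrade unless they are migrated first.&lt;/p&gt;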
&lt;p&gt;The core question for each release: which changes require action before you upgrade, and which require action after?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The operational surface area of PostgreSQL is evolving to provide more granular observability and more flexible replication, while pushing more complexity into backup management.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Upgrade[PostgreSQL Upgrade] --&gt; Observability[Observability]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Upgrade --&gt; Replication[Replication]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Upgrade --&gt; Backup[Backup and Restore]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Observability --&gt; IO[Migrate to pg_stat_io]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Replication --&gt; Lag[Monitor standby logical lag]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Backup --&gt; Manifest[Manage backup manifests]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;pg16-operational-changes&quot;&gt;PG16 Operational Changes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;pg_stat_io&lt;/code&gt; — new I/O observability view&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG16 introduces &lt;code&gt;pg_stat_io&lt;/code&gt;, a new system view that breaks I/O statistics down by backend type (&lt;code&gt;client backend&lt;/code&gt;, &lt;code&gt;autovacuum worker&lt;/code&gt;, &lt;code&gt;background writer&lt;/code&gt;, &lt;code&gt;checkpointer&lt;/code&gt;, etc.), I/O object (&lt;code&gt;relation&lt;/code&gt;, &lt;code&gt;temp relation&lt;/code&gt;), and I/O context (&lt;code&gt;normal&lt;/code&gt;, &lt;code&gt;vacuum&lt;/code&gt;, &lt;code&gt;bulkread&lt;/code&gt;, &lt;code&gt;bulkwrite&lt;/code&gt;). This is the most significant monitoring change in years.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_type, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;object&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, context, reads, writes, extends, evictions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_io&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; reads &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC NULLS LAST&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before PG16, I/O was only observable in aggregate via &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; and &lt;code&gt;pg_stat_database&lt;/code&gt;. From PG16 on, you can see that autovacuum workers are responsible for 80% of your block reads during a vacuum storm, or that client backends rather than the background writer are doing most of the buffer writes. WAL I/O is not covered by &lt;code&gt;pg_stat_io&lt;/code&gt; in PG16 or PG17. If your existing monitoring uses &lt;code&gt;pg_stat_bgwriter.buffers_clean&lt;/code&gt; or &lt;code&gt;pg_stat_database.blks_hit&lt;/code&gt;, those fields are still present but are counted cluster-wide or per database rather than per backend type and context, so do not mix them with &lt;code&gt;pg_stat_io&lt;/code&gt; figures in the same panel.&lt;/p&gt;
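&lt;p&gt;To make the per-backend breakdown concrete, the following is a minimal sketch of a dashboard query, assuming PG16 or later. The aliases &lt;code&gt;total_reads&lt;/code&gt;, &lt;code&gt;total_writes&lt;/code&gt;, and &lt;code&gt;pct_of_reads&lt;/code&gt; are illustrative names, not part of the view, and the counters are cumulative since the last statistics reset.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- reads and writes are NULL where the operation does not apply, hence COALESCE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT backend_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       sum(COALESCE(reads, 0))  AS total_reads,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       sum(COALESCE(writes, 0)) AS total_writes,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;       -- share of all block reads; NULLIF avoids division by zero on a fresh cluster&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       round(100.0 * sum(COALESCE(reads, 0))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;             / NULLIF(sum(sum(COALESCE(reads, 0))) OVER (), 0), 1) AS pct_of_reads&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_io&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY backend_type&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY total_reads DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the counters are cumulative, a monitoring agent would normally sample this query on an interval and chart the difference between samples rather than the raw totals.&lt;/p&gt;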
&lt;p&gt;&lt;strong&gt;2. Logical replication from standby servers&lt;/strong&gt;&lt;/p&gt;
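&lt;p&gt;PG16 allows logical decoding on a physical standby (streaming replica), so a standby can serve as the source for logical replication subscribers. Before PG16, logical replication slots could only be created on a primary. The publication itself is still defined on the primary, because the standby is read-only, and reaches the standby through physical replication; subscribers then connect to the standby. Both servers need &lt;code&gt;wal_level = logical&lt;/code&gt;. With PG16, you can offload the logical decoding CPU and I/O cost to a standby.&lt;/p&gt;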
&lt;p&gt;This is valuable when logical replication fans out to many subscribers and the decoding overhead affects primary throughput. The tradeoff: if the standby falls behind the primary, logical subscribers reading from the standby see higher replication lag. You now have two lag dimensions to monitor: physical lag (primary → standby) and logical lag (standby → subscriber).&lt;/p&gt;
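&lt;p&gt;The two lag dimensions can be sampled separately. This is a minimal sketch using the standard statistics views, assuming the first query runs on the primary and the second on the standby that serves the logical subscribers; the alias &lt;code&gt;pending_bytes&lt;/code&gt; is illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the primary: physical lag of each connected standby&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT application_name, state, replay_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the standby: WAL bytes each logical slot has not yet confirmed,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- measured against the standby&#39;s own replay position&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT slot_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_wal_lsn_diff(pg_last_wal_replay_lsn(), confirmed_flush_lsn) AS pending_bytes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE slot_type = &#39;logical&#39;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alerting on each number separately shows whether a slow subscriber or a slow standby is the cause when end-to-end lag grows.&lt;/p&gt;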
&lt;p&gt;&lt;strong&gt;3. Role membership — &lt;code&gt;GRANT ... WITH INHERIT&lt;/code&gt; behavior change&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG16 made inheritance and &lt;code&gt;SET ROLE&lt;/code&gt; separate, per-grant options. Before PG16, a membership created by &lt;code&gt;GRANT role TO user&lt;/code&gt; always allowed &lt;code&gt;SET ROLE&lt;/code&gt;, and whether privileges were inherited depended on the member role’s &lt;code&gt;INHERIT&lt;/code&gt; or &lt;code&gt;NOINHERIT&lt;/code&gt; attribute. In PG16, each grant records both choices:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; role&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; INHERIT TRUE;   &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- inherits privileges automatically&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; role&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITH&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TRUE;       &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- can SET ROLE to switch to the role&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
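&lt;p&gt;The default behavior did not change for most cases: a plain &lt;code&gt;GRANT&lt;/code&gt; still allows &lt;code&gt;SET ROLE&lt;/code&gt;, and inheritance defaults to the member role’s &lt;code&gt;INHERIT&lt;/code&gt; attribute at the time of the grant. The &lt;code&gt;WITH INHERIT&lt;/code&gt; and &lt;code&gt;WITH SET&lt;/code&gt; clauses did not exist before PG16, so the operational difference is where the setting lives. Inheritance is now stored per membership, which means a later &lt;code&gt;ALTER ROLE ... NOINHERIT&lt;/code&gt; does not change memberships that already exist. Runbooks that relied on toggling the role attribute need to re-grant instead.&lt;/p&gt;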
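&lt;p&gt;One way to audit memberships after an upgrade is to read the per-grant options directly from the catalog. This is a sketch assuming PG16 or later, where &lt;code&gt;pg_auth_members&lt;/code&gt; carries &lt;code&gt;inherit_option&lt;/code&gt; and &lt;code&gt;set_option&lt;/code&gt;; the column aliases are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- One row per membership, with the options recorded on that grant&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT r.rolname AS granted_role,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       m.rolname AS member,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       g.rolname AS grantor,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       am.admin_option,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       am.inherit_option,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       am.set_option&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_auth_members am&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_roles r ON r.oid = am.roleid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_roles m ON m.oid = am.member&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LEFT JOIN pg_roles g ON g.oid = am.grantor&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY granted_role, member;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rows where &lt;code&gt;inherit_option&lt;/code&gt; is false but the application expects inherited privileges are the ones worth reviewing first.&lt;/p&gt;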
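&lt;p&gt;&lt;strong&gt;4. &lt;code&gt;pg_hba.conf&lt;/code&gt; and &lt;code&gt;pg_ident.conf&lt;/code&gt;: include files and richer system views&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pg_hba_file_rules&lt;/code&gt; (available since PG10) and &lt;code&gt;pg_ident_file_mappings&lt;/code&gt; (since PG15) gained a &lt;code&gt;file_name&lt;/code&gt; column and rule or map numbering in PG16, alongside new support for include directives in both files. The views parse the files as they currently exist on disk, not the configuration the server has loaded, and report any syntax problems in the &lt;code&gt;error&lt;/code&gt; column. That makes them useful for validating an edit before a reload and removes the need to parse config files manually for audit purposes.&lt;/p&gt;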
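&lt;p&gt;A pre-reload check can be as small as the sketch below. It assumes PG16 or later for the &lt;code&gt;file_name&lt;/code&gt; column and a role allowed to read these views, which by default means a superuser.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Lines in pg_hba.conf (and any included files) that fail to parse&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT file_name, line_number, error&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_hba_file_rules&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE error IS NOT NULL;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- The same check for pg_ident.conf&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT file_name, line_number, error&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_ident_file_mappings&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE error IS NOT NULL;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both queries returning zero rows is a reasonable gate before running &lt;code&gt;pg_reload_conf()&lt;/code&gt;.&lt;/p&gt;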
&lt;h3 id=&quot;pg17-operational-changes&quot;&gt;PG17 Operational Changes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Incremental backup with &lt;code&gt;pg_basebackup&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
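&lt;p&gt;PG17 added &lt;code&gt;--incremental&lt;/code&gt; support to &lt;code&gt;pg_basebackup&lt;/code&gt;. An incremental backup contains only the blocks changed since the backup it builds on, which can be a full or an earlier incremental backup. The server identifies changed blocks from WAL summaries, so &lt;code&gt;summarize_wal&lt;/code&gt; must be enabled before the reference backup is taken, and the reference backup’s manifest is passed to &lt;code&gt;--incremental&lt;/code&gt;. The full and incremental backup set must be combined with &lt;code&gt;pg_combinebackup&lt;/code&gt; before restore.&lt;/p&gt;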
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Full backup (save the manifest)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_basebackup&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -D&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /backup/base&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --checkpoint=fast&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Incremental backup&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_basebackup&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -D&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /backup/incr1&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --incremental=/backup/base/backup_manifest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Combine before restore&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_combinebackup&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /backup/base&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /backup/incr1&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -o&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /backup/restored&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This changes the backup workflow: you will need to store and manage backup manifests, and the restore process requires the combine step. Teams that automate restore testing need to update their scripts before moving to PG17 backups.&lt;/p&gt;
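&lt;p&gt;Incremental backups also depend on WAL summarization being active on the server. The sketch below, assuming PG17, checks the setting and lists the most recent summaries; if the second query returns no rows, an incremental backup is unlikely to succeed.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Must report &#39;on&#39; for incremental backups to be possible&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW summarize_wal;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Most recent WAL summaries; the range must reach back to the reference backup&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT tli, start_lsn, end_lsn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_available_wal_summaries()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY end_lsn DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 5;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Summaries are removed after &lt;code&gt;wal_summary_keep_time&lt;/code&gt;, so that setting needs to exceed the longest gap between backups in the schedule.&lt;/p&gt;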
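&lt;p&gt;&lt;strong&gt;2. Vacuum improvements: lower memory use and less WAL&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG17 changed how VACUUM stores the identifiers of dead tuples. The release notes describe this as a substantial reduction in vacuum memory consumption, and vacuum is no longer silently limited to 1 GB of memory when &lt;code&gt;maintenance_work_mem&lt;/code&gt; or &lt;code&gt;autovacuum_work_mem&lt;/code&gt; is set higher. PG17 also removes and freezes tuples more efficiently, which reduces the WAL that vacuum generates. Skipping pages that are already fully frozen is not new in PG17; it relies on the visibility map and predates this release. No configuration change is needed. The observable effect is fewer index-cleanup passes and shorter elapsed time for VACUUM on large tables with many dead tuples.&lt;/p&gt;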
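&lt;p&gt;&lt;strong&gt;3. Logical replication: failover slots, but still no sequence replication&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG17 does not replicate sequences through logical replication; the PG17 documentation still lists sequence data as a restriction. Subscribers therefore continue to hold stale sequence values after promotion or cutover, and the runbook must advance them, for example with &lt;code&gt;setval&lt;/code&gt; from the publisher’s current values, before the subscriber accepts writes. What PG17 did add for operators is failover support for logical slots: a subscription created with &lt;code&gt;failover = true&lt;/code&gt; can have its slot synchronized to a physical standby when &lt;code&gt;sync_replication_slots&lt;/code&gt; is enabled there, so subscribers can continue from the standby after the primary is replaced.&lt;/p&gt;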
&lt;p&gt;&lt;strong&gt;4. MERGE — full support for &lt;code&gt;NOT MATCHED BY SOURCE&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PG17 extended the MERGE statement with &lt;code&gt;WHEN NOT MATCHED BY SOURCE&lt;/code&gt;: the ability to delete or update rows in the target that have no matching row in the source. PostgreSQL documents this clause as an extension to the SQL standard rather than part of it. This is primarily a developer feature, but it affects ETL pipelines that previously required separate DELETE and MERGE logic.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The PostgreSQL 16 release notes (&lt;a href=&quot;https://www.postgresql.org/docs/16/release-16.html&quot;&gt;postgresql.org/docs/16/release-16.html&lt;/a&gt;) document &lt;code&gt;pg_stat_io&lt;/code&gt; as a new view, and the view’s reference page defines each field. PG16 leaves &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; unchanged. The PostgreSQL 17 release notes record the removal of &lt;code&gt;buffers_backend&lt;/code&gt; and &lt;code&gt;buffers_backend_fsync&lt;/code&gt; from &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; as redundant with &lt;code&gt;pg_stat_io&lt;/code&gt;, and the move of the checkpoint counters to &lt;code&gt;pg_stat_checkpointer&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The PostgreSQL 17 &lt;code&gt;pg_basebackup&lt;/code&gt; reference (&lt;a href=&quot;https://www.postgresql.org/docs/17/app-pgbasebackup.html&quot;&gt;postgresql.org/docs/17/app-pgbasebackup.html&lt;/a&gt;) specifies that an incremental backup cannot be restored directly: &lt;code&gt;pg_combinebackup&lt;/code&gt; must first reconstruct a full data directory from the backup chain. Backup manifests are required inputs for incremental backups and must be retained between backup cycles.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Upgrading to PG16 or PG17 without updating monitoring&lt;/td&gt;&lt;td&gt;I/O dashboards stay coarse on PG16; queries on removed columns fail on PG17&lt;/td&gt;&lt;td&gt;PG16 adds &lt;code&gt;pg_stat_io&lt;/code&gt; alongside the old views; PG17 removes &lt;code&gt;buffers_backend&lt;/code&gt; and &lt;code&gt;buffers_backend_fsync&lt;/code&gt; from &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; and moves checkpoint counters to &lt;code&gt;pg_stat_checkpointer&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Logical replication from standby&lt;/td&gt;&lt;td&gt;Subscribers see elevated lag when standby falls behind primary&lt;/td&gt;&lt;td&gt;Two lag dimensions compound: physical replication lag plus logical decoding lag&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PG17 incremental backup without manifest management&lt;/td&gt;&lt;td&gt;The next incremental backup cannot be taken, or restore fails at the &lt;code&gt;pg_combinebackup&lt;/code&gt; step&lt;/td&gt;&lt;td&gt;Each incremental backup needs the manifest of the backup it builds on, and &lt;code&gt;pg_combinebackup&lt;/code&gt; needs every backup in the chain back to the full backup&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Upgrading PostgreSQL without reviewing operational changes breaks monitoring, backup automation, and replication lag calculations without any visible error at upgrade time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: For PG16, migrate I/O monitoring to &lt;code&gt;pg_stat_io&lt;/code&gt; before decommissioning old dashboard queries; for PG17, update backup scripts to retain manifests and add a &lt;code&gt;pg_combinebackup&lt;/code&gt; step to restore runbooks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After upgrading to PG16, query &lt;code&gt;pg_stat_io&lt;/code&gt; and confirm your monitoring system is capturing backend_type-level I/O breakdown; after upgrading to PG17, execute a test incremental restore and confirm &lt;code&gt;pg_combinebackup&lt;/code&gt; completes without error.&lt;/li&gt;
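&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Before upgrading to either version, grep your monitoring configuration for references to the &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; columns &lt;code&gt;buffers_backend&lt;/code&gt;, &lt;code&gt;buffers_backend_fsync&lt;/code&gt;, &lt;code&gt;buffers_checkpoint&lt;/code&gt;, and the &lt;code&gt;checkpoints_*&lt;/code&gt; counters. PG17 removes these, which makes them the queries most likely to break. &lt;code&gt;pg_stat_database.blks_*&lt;/code&gt; is unchanged but worth reviewing against &lt;code&gt;pg_stat_io&lt;/code&gt;.&lt;/li&gt;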
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>MongoDB 8.0: Why Queryable Encryption Matters</title><link>https://rajivonai.com/blog/2024-10-15-mongodb-80-queryable-encryption-matters/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-10-15-mongodb-80-queryable-encryption-matters/</guid><description>MongoDB Queryable Encryption stores and queries sensitive fields in encrypted form — what it enables, how it differs from standard FLE, and where the query type constraints bite.</description><pubDate>Tue, 15 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;MongoDB Queryable Encryption lets specific document fields be queried on the server without the server ever seeing their plaintext values — a fundamentally different security model from field-level encryption, which requires decryption before any server-side filtering can happen.&lt;/strong&gt; The distinction matters for compliance contexts where the database host, DBA access, or cloud infrastructure staff must be excluded from seeing sensitive data, even while the application queries that data.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most encryption-at-rest and field-level encryption (FLE) schemes protect data from attackers who steal storage media or backups. They do not protect data from someone with direct database access: a DBA with credentials, a cloud provider with storage access, or an attacker who compromises the database host. The data is encrypted at rest but decrypted in server memory whenever a query touches the field.&lt;/p&gt;
&lt;p&gt;MongoDB Queryable Encryption (QE), generally available for equality queries in MongoDB 7.0 with range queries following in 8.0, changes that model. Specific document fields are encrypted at the client before they reach the MongoDB server. The server stores ciphertext. When the application queries those fields, it sends an encrypted query token; the server evaluates the query against encrypted data using an encrypted search scheme over randomized ciphertext, which does not require the server to decrypt the field. The server returns matching documents, still encrypted. Only the client, with access to the encryption keys, can read the plaintext.&lt;/p&gt;
&lt;p&gt;This means DBAs, MongoDB Atlas operations staff, and anyone with direct database access see only ciphertext for encrypted fields. The data is not just protected at rest; it is protected from privileged infrastructure access during normal operation.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure mode for teams new to QE is query type mismatch. Queryable Encryption does not support arbitrary query patterns. The server can only evaluate queries that the underlying cryptographic scheme supports: equality (GA in MongoDB 7.0) and range (GA in MongoDB 8.0). Prefix, suffix, and substring matching are not part of 8.0. The server cannot run regex, text search, full-document comparison, or most aggregation pipeline operations on QE-encrypted fields without decryption.&lt;/p&gt;
&lt;p&gt;A team that implements QE on a sensitive field and later discovers that a new feature requires a case-insensitive text search or a LIKE-equivalent pattern on that field is stuck: the field is encrypted in a way that only equality and range queries can be evaluated server-side. Text search falls back to requiring application-layer filtering — fetch all documents, decrypt, filter in memory — which is functionally correct but operationally expensive at scale.&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;Queryable Encryption requires three components: a MongoDB driver with libmongocrypt support (6.0+), a key management configuration, and a schema that identifies which fields are QE-encrypted and which query type each supports.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Client[&quot;Application Client — Holds Keys&quot;] --&gt;|Encrypts data with DEK| Token[&quot;Encrypted Query Token&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Token --&gt;|Sends token| Server[&quot;MongoDB Server 8.0&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Server --&gt;|Evaluates ciphertext| Matches[&quot;Matched Encrypted Documents&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Matches --&gt;|Returns ciphertext| Client&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Client --&gt;|Decrypts with DEK| Plaintext[&quot;Plaintext Result&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Required components:&lt;/strong&gt;&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Component&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;MongoDB driver with libmongocrypt&lt;/td&gt;&lt;td&gt;Client-side encryption and decryption&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Customer Master Key (CMK)&lt;/td&gt;&lt;td&gt;Root key, stored in KMS (AWS KMS, GCP KMS, Azure Key Vault, KMIP, or local for dev)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data Encryption Key (DEK)&lt;/td&gt;&lt;td&gt;Per-field key, encrypted by CMK and stored in a key vault collection&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Encrypted fields map&lt;/td&gt;&lt;td&gt;Tells the driver which fields to encrypt and what query types they support&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;QE vs standard FLE:&lt;/strong&gt;&lt;/p&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Standard FLE&lt;/th&gt;&lt;th&gt;Queryable Encryption&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Server-side queries&lt;/td&gt;&lt;td&gt;Equality only, and only on deterministically encrypted fields; anything else requires the client to decrypt before filtering&lt;/td&gt;&lt;td&gt;Supported for equality and range query types&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Storage format&lt;/td&gt;&lt;td&gt;Deterministic or random encryption&lt;/td&gt;&lt;td&gt;Randomized encryption plus encrypted index metadata for each query type&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Who can query&lt;/td&gt;&lt;td&gt;Client with key access; the server can only match deterministic ciphertext&lt;/td&gt;&lt;td&gt;Server evaluates encrypted tokens; client decrypts results&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Supported queries&lt;/td&gt;&lt;td&gt;Equality on deterministic fields; any other pattern only post-decryption&lt;/td&gt;&lt;td&gt;Equality (GA, 7.0), range (GA, 8.0)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Supported query types in 8.0:&lt;/strong&gt;&lt;/p&gt;
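&lt;p&gt;MongoDB 8.0 made range queries on QE-encrypted fields generally available, covering the comparison operators &lt;code&gt;$gt&lt;/code&gt;, &lt;code&gt;$gte&lt;/code&gt;, &lt;code&gt;$lt&lt;/code&gt;, and &lt;code&gt;$lte&lt;/code&gt;. Prefix and suffix matching are not part of the 8.0 range query type; they arrived later as a preview feature, so check the compatibility page for your server version before relying on them. The types that remain unsupported for server-side evaluation include regex, text search, &lt;code&gt;$elemMatch&lt;/code&gt; on nested QE fields, and most aggregation expressions that operate on field content.&lt;/p&gt;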
&lt;p&gt;&lt;strong&gt;Setting up QE (schema-level declaration):&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// encryptedFields document for one collection, supplied at collection creation.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Each field also needs a keyId referencing its DEK (omitted here for brevity).&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; encryptedFields&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &quot;fields&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      path: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;ssn&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      bsonType: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      queries: [{ queryType: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;equality&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      path: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;salary&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      bsonType: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;int&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      queries: [{ queryType: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;range&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, min: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, max: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1000000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;};&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the client is configured for automatic encryption, encryption and decryption happen transparently in the driver; the &lt;code&gt;ClientEncryption&lt;/code&gt; API is used for key management and for explicit encryption. Queries against encrypted fields use the same MongoDB query syntax, and the driver translates them to encrypted tokens before sending them to the server.&lt;/p&gt;
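&lt;p&gt;As a rough illustration of how that declaration reaches the driver, the sketch below assembles the &lt;code&gt;autoEncryption&lt;/code&gt; options accepted by the Node.js driver’s &lt;code&gt;MongoClient&lt;/code&gt;. The namespace &lt;code&gt;hr.employees&lt;/code&gt;, the key vault namespace, the helper name &lt;code&gt;buildAutoEncryptionOptions&lt;/code&gt;, and the randomly generated local key are all assumptions for illustration; each field would also need a &lt;code&gt;keyId&lt;/code&gt; pointing at its DEK, which is omitted here.&lt;/p&gt;

```javascript
// Sketch only: builds the options object, does not connect to MongoDB.
const crypto = require("crypto");

// Hypothetical helper: wraps one collection's encryptedFields document
// into the shape the driver's autoEncryption option expects.
function buildAutoEncryptionOptions(namespace, encryptedFields, kmsProviders) {
  return {
    keyVaultNamespace: "encryption.__keyVault", // assumed key vault location
    kmsProviders,
    // encryptedFieldsMap is keyed by "database.collection"
    encryptedFieldsMap: { [namespace]: encryptedFields },
  };
}

const options = buildAutoEncryptionOptions(
  "hr.employees", // assumed namespace
  {
    fields: [
      // keyId (the DEK reference) is omitted for brevity
      { path: "ssn", bsonType: "string", queries: [{ queryType: "equality" }] },
    ],
  },
  // Local 96-byte key: development only, never production
  { local: { key: crypto.randomBytes(96) } }
);

// With the mongodb package installed, the options would be passed as:
//   new MongoClient(uri, { autoEncryption: options })
console.log(Object.keys(options.encryptedFieldsMap));
```

&lt;p&gt;Declaring the map in application code, rather than relying only on collection metadata stored on the server, means the client keeps encrypting these fields even if the server-side configuration is changed.&lt;/p&gt;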
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Queryable Encryption reached general availability in MongoDB 7.0; the release is covered in the MongoDB 7.0 release notes and in the “Queryable Encryption” chapter of the MongoDB Manual. Range query support in MongoDB 8.0 is documented in the MongoDB 8.0 release notes (October 2024) and on the Queryable Encryption compatibility page.&lt;/p&gt;
&lt;p&gt;The documented pattern is that QE-encrypted fields are not served by ordinary B-tree indexes created with &lt;code&gt;createIndex()&lt;/code&gt;. Per the MongoDB QE manual, queries on encrypted fields rely on index metadata that the QE subsystem maintains itself, including dedicated metadata collections created alongside the encrypted collection, so the usual habit of adding an index to speed up a slow query does not apply.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application adds regex or text search on QE field&lt;/td&gt;&lt;td&gt;Query cannot run server-side&lt;/td&gt;&lt;td&gt;QE encryption scheme does not support text evaluation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Range query on QE field without range query type configured&lt;/td&gt;&lt;td&gt;Error at query time&lt;/td&gt;&lt;td&gt;Field configured for equality-only QE cannot process range queries&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Local (development) key provider used in production&lt;/td&gt;&lt;td&gt;Security model broken&lt;/td&gt;&lt;td&gt;The CMK sits in a file or in application memory with no KMS access control, rotation, or audit trail&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
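&lt;p&gt;Because these mismatches only surface at query time, it can help to check planned queries against the schema ahead of deployment. The sketch below is one hypothetical way to do that: &lt;code&gt;unsupportedOperators&lt;/code&gt; and the operator lists in &lt;code&gt;OPERATORS_BY_QUERY_TYPE&lt;/code&gt; are assumptions made for illustration and should be verified against the supported-operations page for your server version.&lt;/p&gt;

```javascript
// Assumed mapping from QE query type to the operators it can serve.
// Verify against the MongoDB documentation before relying on it.
const OPERATORS_BY_QUERY_TYPE = {
  equality: ["$eq", "$ne", "$in", "$nin"],
  range: ["$gt", "$gte", "$lt", "$lte"],
};

// Returns the operators that the given field cannot evaluate server-side.
function unsupportedOperators(encryptedFields, path, operators) {
  const field = encryptedFields.fields.find((f) => f.path === path);
  if (field === undefined) {
    return []; // not a QE field, so no QE restriction applies
  }
  const allowed = (field.queries || []).flatMap(
    (q) => OPERATORS_BY_QUERY_TYPE[q.queryType] || []
  );
  return operators.filter((op) => !allowed.includes(op));
}

// Same shape as the schema declared for the collection.
const encryptedFields = {
  fields: [
    { path: "ssn", bsonType: "string", queries: [{ queryType: "equality" }] },
    {
      path: "salary",
      bsonType: "int",
      queries: [{ queryType: "range", min: 0, max: 1000000 }],
    },
  ],
};

// A planned feature wants a regex search on ssn: flagged before it ships.
console.log(unsupportedOperators(encryptedFields, "ssn", ["$eq", "$regex"]));
// A salary band filter only needs range operators: nothing flagged.
console.log(unsupportedOperators(encryptedFields, "salary", ["$gte", "$lt"]));
```

&lt;p&gt;Running a check like this in CI against the list of query shapes the application issues turns a production query error into a failed build.&lt;/p&gt;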
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Teams implement QE on sensitive fields and later discover that new query types — text search, regex, complex aggregations — cannot run server-side against QE-encrypted data, requiring expensive application-layer workarounds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Map every query pattern required for each sensitive field before implementing QE; use QE only for fields where equality and range queries are sufficient; keep non-queryable sensitive fields on standard FLE or separate encryption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Test all application query patterns against the encrypted field in staging before deploying; any unsupported pattern fails at query execution time, not at configuration time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, document the required query types for each sensitive field your application needs to protect — equality, range, or open-ended — and verify that QE’s supported query types cover them before committing to the encryption scheme.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Queryable Encryption solves a real problem — privileged infrastructure access to plaintext sensitive data — but it imposes real query constraints. Understanding those constraints before schema design is the difference between a compliance win and a schema migration at the worst possible time.&lt;/p&gt;</content:encoded><category>databases</category><category>architecture</category></item><item><title>Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works</title><link>https://rajivonai.com/blog/2024-10-15-prometheus-grafana-database-engineers/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-10-15-prometheus-grafana-database-engineers/</guid><description>How to position Prometheus and Grafana as the open-source baseline for teams that cannot send every byte of database telemetry to managed services.</description><pubDate>Tue, 15 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If you blindly enable every database metric exporter without understanding high-cardinality data, your monitoring stack will collapse before your database does.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Managed observability platforms like Datadog and CloudWatch are exceptionally powerful, but their pricing models are fundamentally misaligned with high-volume database metrics. If you operate massive, self-managed database fleets on bare metal or Kubernetes, sending every connection state, wait event, and table-level metric to a SaaS provider quickly becomes a top-three line item on your cloud bill.&lt;/p&gt;
&lt;p&gt;For teams running their own infrastructure, the Prometheus and Grafana stack remains the definitive open-source baseline. OpenTelemetry’s unified model for logs, metrics, and traces provides the standard vocabulary, but Prometheus is the engine that pulls the metrics. However, database engineers often struggle with Prometheus because its pull-based architecture and label-based querying (PromQL) require a different mental model than traditional agent-based monitoring.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Out of the box, a tool like &lt;code&gt;postgres_exporter&lt;/code&gt; or &lt;code&gt;mysqld_exporter&lt;/code&gt; will scrape hundreds of metrics. The immediate trap that database teams fall into is “cardinality explosion.”&lt;/p&gt;
&lt;p&gt;If you configure an exporter to scrape the execution count of every unique normalized SQL query from &lt;code&gt;pg_stat_statements&lt;/code&gt;, and you have a high-churn ORM generating thousands of unique query shapes, Prometheus will attempt to store each of those as a unique time series. Memory consumption on the Prometheus server will skyrocket, OOM kills will follow, and you will lose visibility precisely when you need it most.&lt;/p&gt;
&lt;h2 id=&quot;the-open-source-database-observability-stack&quot;&gt;The Open-Source Database Observability Stack&lt;/h2&gt;
&lt;p&gt;A production-grade open-source monitoring stack for databases requires three strictly managed layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The Exporter Layer:&lt;/strong&gt; This is a lightweight process (e.g., &lt;code&gt;postgres_exporter&lt;/code&gt;) running alongside the database. It translates internal database states into the text-based exposition format Prometheus expects.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Scrape Configuration:&lt;/strong&gt; The Prometheus server pulls data from the exporter at a defined interval (e.g., every 15 seconds). This is where you use &lt;code&gt;metric_relabel_configs&lt;/code&gt; to aggressively drop high-cardinality series and any metrics you do not actively alert on, before they are written to storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Alerting Rules:&lt;/strong&gt; Raw metrics are useless during an incident. You must define Prometheus recording rules to pre-calculate expensive metrics (like the 5-minute rate of disk I/O) and alerting rules (e.g., alert if the connection pool is &gt;90% saturated for 3 minutes).&lt;/li&gt;
&lt;/ol&gt;
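&lt;p&gt;To make the scrape-layer filtering concrete, the sketch below models what a &lt;code&gt;metric_relabel_configs&lt;/code&gt; rule with &lt;code&gt;action: drop&lt;/code&gt; does to scraped series. It is a simplified simulation, not Prometheus code: the function name &lt;code&gt;applyDropRules&lt;/code&gt; and the sample metric names are assumptions, and JavaScript regular expressions stand in for the RE2 syntax Prometheus actually uses.&lt;/p&gt;

```javascript
// Simplified model of metric_relabel_configs rules with action: drop.
// Each series is a plain object of label name to label value.
function applyDropRules(series, rules) {
  return series.filter((labels) => {
    return !rules.some((rule) => {
      const separator = rule.separator === undefined ? ";" : rule.separator;
      // Source label values are joined with the separator; missing labels are empty.
      const value = rule.source_labels
        .map((name) => labels[name] || "")
        .join(separator);
      // Relabel regexes are anchored at both ends, so partial matches do not drop.
      return new RegExp("^(?:" + rule.regex + ")$").test(value);
    });
  });
}

// Hypothetical scrape result: one cheap metric, one per-query metric.
const scraped = [
  { __name__: "pg_up" },
  { __name__: "pg_stat_statements_calls_total", queryid: "123" },
  { __name__: "pg_stat_statements_calls_total", queryid: "456" },
];

const rules = [
  { source_labels: ["__name__"], regex: "pg_stat_statements_.*", action: "drop" },
];

const kept = applyDropRules(scraped, rules);
console.log(kept.map((s) => s.__name__));
```

&lt;p&gt;The anchoring detail matters in practice: a regex of &lt;code&gt;pg_stat&lt;/code&gt; on its own would drop nothing here, because it has to match the whole metric name.&lt;/p&gt;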
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern for surviving Prometheus at scale involves ruthless metric dropping.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; With its &lt;code&gt;collect.perf_schema.eventsstatements&lt;/code&gt; collector enabled, &lt;code&gt;mysqld_exporter&lt;/code&gt; exposes &lt;code&gt;mysql_perf_schema_events_statements_total&lt;/code&gt;, which creates one time series per unique normalized query digest tracked by the Performance Schema. On an ORM-driven application generating thousands of unique query shapes, this single metric can produce hundreds of thousands of unique time series across a fleet. Prometheus’s documentation on instrumentation best practices explicitly warns that unbounded label values (such as &lt;code&gt;digest&lt;/code&gt; or &lt;code&gt;query_hash&lt;/code&gt;) cause memory growth proportional to the number of unique label combinations, and recommends against high-cardinality dimensions in metric labels (&lt;a href=&quot;https://prometheus.io/docs/practices/instrumentation/&quot;&gt;Prometheus: Instrumentation best practices&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The documented mitigation is a &lt;code&gt;metric_relabel_configs&lt;/code&gt; block with a &lt;code&gt;drop&lt;/code&gt; action targeting &lt;code&gt;mysql_perf_schema_events_statements_total&lt;/code&gt; in the Prometheus scrape configuration, combined with a replacement custom collector query that exports only the top-N slowest statements by total execution time from &lt;code&gt;performance_schema.events_statements_summary_by_digest&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The Prometheus TSDB status page (&lt;code&gt;/tsdb-status&lt;/code&gt;) lists the ten highest-cardinality metric names by series count. This is the diagnostic that reveals which exporter metric is consuming most of the Prometheus server’s memory before the server is OOM-killed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Prometheus is an operational alerting database, not a data lake. The test for any scraped metric: does it drive an alert or a live dashboard panel? If not, drop it at the scrape layer rather than ingesting it and paying the memory cost.&lt;/p&gt;
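&lt;p&gt;The same audit can be scripted. The sketch below assumes you have already fetched the JSON from the Prometheus HTTP API endpoint &lt;code&gt;/api/v1/status/tsdb&lt;/code&gt; and passed in its &lt;code&gt;data&lt;/code&gt; object; the function name &lt;code&gt;findCardinalityOffenders&lt;/code&gt;, the share threshold, and the sample numbers are illustrative assumptions rather than real measurements.&lt;/p&gt;

```javascript
// Flags metric names whose series count exceeds a share of all head series.
// tsdbStatus is the "data" object returned by GET /api/v1/status/tsdb.
function findCardinalityOffenders(tsdbStatus, maxShare) {
  const total = tsdbStatus.headStats.numSeries;
  return tsdbStatus.seriesCountByMetricName
    .filter((entry) => entry.value / total > maxShare)
    .map((entry) => ({ name: entry.name, share: entry.value / total }));
}

// Made-up sample shaped like the API response, for illustration only.
const sampleStatus = {
  headStats: { numSeries: 1000 },
  seriesCountByMetricName: [
    { name: "mysql_perf_schema_events_statements_total", value: 700 },
    { name: "mysql_global_status_threads_connected", value: 50 },
  ],
};

// Report anything above a 10% share of total series.
const offenders = findCardinalityOffenders(sampleStatus, 0.1);
console.log(offenders.map((o) => o.name));
```

&lt;p&gt;Note that the endpoint returns only the top entries by series count, which is usually enough here because the offenders are by definition at the top of the list.&lt;/p&gt;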
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;Relying on Prometheus and Grafana involves significant operational tradeoffs compared to managed services:&lt;/p&gt;























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;Advantage&lt;/th&gt;&lt;th&gt;Disadvantage&lt;/th&gt;&lt;th&gt;Failure Mode&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Prometheus (Self-Hosted)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Zero variable cost for high data volume; complete control over scrape intervals.&lt;/td&gt;&lt;td&gt;You must manage the storage, backups, and high availability of the monitoring stack yourself.&lt;/td&gt;&lt;td&gt;The Prometheus server runs out of disk space and stops recording metrics during an outage.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Datadog / Managed SaaS&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Zero maintenance; built-in correlation between logs, traces, and metrics.&lt;/td&gt;&lt;td&gt;High-cardinality custom metrics incur massive monthly costs.&lt;/td&gt;&lt;td&gt;Finance forces engineering to drop critical metrics to meet budget constraints.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Database teams deploy &lt;code&gt;postgres_exporter&lt;/code&gt; or &lt;code&gt;mysqld_exporter&lt;/code&gt; with default settings, then watch the Prometheus server OOM-kill itself from cardinality explosion within days — the monitoring stack fails before the database does.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Apply &lt;code&gt;metric_relabel_configs&lt;/code&gt; to drop high-cardinality per-query metrics on every new exporter deployment, and replace them with a targeted custom collector that exports only top-N slowest queries by total execution time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Check your Prometheus TSDB status page (&lt;code&gt;/tsdb-status&lt;/code&gt;) — if any single metric family consumes more than 10% of total series, you have a cardinality problem that will eventually crash the server under incident load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Audit current exporters via the TSDB status page this week and drop any metric not tied to an active alerting rule or dashboard panel — treat every unalerted metric as operational overhead with a memory cost.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category><category>checklist</category></item><item><title>Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup</title><link>https://rajivonai.com/blog/2024-10-14-datadog-database-monitoring-setup-postgres-mysql-aurora/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-10-14-datadog-database-monitoring-setup-postgres-mysql-aurora/</guid><description>How to configure Datadog Database Monitoring for PostgreSQL, MySQL, and Aurora — query samples, explain plans, wait event analysis, and the specific Agent settings that make the difference between metric collection and real observability.</description><pubDate>Mon, 14 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Datadog Database Monitoring is not just metrics collection with a nicer UI — it ships query-level explain plans, wait event breakdown, and connection pool visibility without requiring &lt;code&gt;pg_stat_statements&lt;/code&gt; configuration or custom PromQL recording rules. The mistake is enabling it and leaving all sampling and explain plan collection at defaults, which produces query data that is too sparse to diagnose production slowdowns.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Teams running Datadog for application performance monitoring have a strong reason to use it for database monitoring too: one dashboard, one query language, and automatic correlation between slow application traces and the database queries those traces hit. The alternative — running a separate Prometheus stack with postgres_exporter, custom recording rules, and Grafana — is operationally heavier for teams that are not already Prometheus-native.&lt;/p&gt;
&lt;p&gt;Datadog Database Monitoring (DBM) covers PostgreSQL, MySQL, Aurora PostgreSQL, Aurora MySQL, SQL Server, and Oracle. This post focuses on PostgreSQL and MySQL/Aurora MySQL — the two most common open-source targets.&lt;/p&gt;
&lt;p&gt;The challenge is not installation. The challenge is that defaults produce incomplete data: explain plans are sampled at a low rate, wait event tracking requires explicit enabling, and the Agent needs database-side configuration (a dedicated monitoring user with the right grants) that Datadog’s quickstart guide underspecifies.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;

































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Symptom in Datadog DBM&lt;/th&gt;&lt;th&gt;Likely cause&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Query samples show “no explain plan available”&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt; not in &lt;code&gt;shared_preload_libraries&lt;/code&gt;, or explain plan sampling rate is too low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow query visible in APM but not in DBM&lt;/td&gt;&lt;td&gt;Query duration is below DBM’s configured min duration threshold&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Wait events show only “ClientRead”&lt;/td&gt;&lt;td&gt;&lt;code&gt;track_activity_query_size&lt;/code&gt; too small; truncating queries before DBM can match them&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Aurora read replicas not appearing in DBM&lt;/td&gt;&lt;td&gt;Agent not configured to connect to the reader endpoint separately&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High DBM Agent CPU on the database host&lt;/td&gt;&lt;td&gt;Explain plan collection running too frequently; throttle via &lt;code&gt;explain_statement_min_duration&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Connection count in DBM does not match &lt;code&gt;pg_stat_activity&lt;/code&gt;&lt;/td&gt;&lt;td&gt;DBM is reading from &lt;code&gt;pg_stat_activity&lt;/code&gt; but the monitoring user lacks &lt;code&gt;pg_monitor&lt;/code&gt; role&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;1. Is the monitoring user configured with the right grants?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For PostgreSQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; USER&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; datadog&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WITH&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; password&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;use-secret-manager-here&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_monitor &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; datadog;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Required for query samples and explain plans:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; datadog&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; USAGE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; datadog &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; datadog;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; USAGE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; public &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; datadog;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_read_all_stats &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; datadog;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Function required for DBM explain plan collection:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE OR REPLACE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FUNCTION&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; datadog&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.explain_statement(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;   l_query &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TEXT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   OUT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; explain &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JSON&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RETURNS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SETOF &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JSON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DECLARE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;curs REFCURSOR;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;plan &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JSON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   OPEN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; curs &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FOR&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXECUTE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; pg_catalog&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;concat&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;EXPLAIN (FORMAT JSON) &apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, l_query);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   FETCH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; curs &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; plan;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   CLOSE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; curs;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;   RETURN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; QUERY &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; plan;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;END&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LANGUAGE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;plpgsql&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;RETURNS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NULL&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; INPUT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECURITY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; DEFINER;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
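&lt;p&gt;The &lt;code&gt;SECURITY DEFINER&lt;/code&gt; function is required because DBM collects explain plans for queries run by other users. The monitoring role normally has no privileges on the tables those queries touch, so &lt;code&gt;EXPLAIN&lt;/code&gt; has to run with the privileges of the function’s owner rather than those of the &lt;code&gt;datadog&lt;/code&gt; user.&lt;/p&gt;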
&lt;p&gt;For MySQL/Aurora MySQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; USER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; &apos;&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;datadog&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;&apos;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; IDENTIFIED &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; mysql_native_password &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BY&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;use-secret-manager-here&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; REPLICATION CLIENT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;datadog&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; PROCESS &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;datadog&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; performance_schema.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;datadog&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Read access to sys views. Explain plans also need the datadog.explain_statement procedure from the Datadog MySQL setup guide:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; sys.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;datadog&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. Is &lt;code&gt;pg_stat_statements&lt;/code&gt; enabled?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW shared_preload_libraries;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Must include &apos;pg_stat_statements&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- If missing, add to postgresql.conf and restart:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- shared_preload_libraries = &apos;pg_stat_statements&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After restart, verify:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_extension &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; extname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pg_stat_statements&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- If absent: CREATE EXTENSION pg_stat_statements;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Tune (pg_stat_statements.max and track_activity_query_size take effect only after a restart):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SYSTEM&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; pg_stat_statements&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;max&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SYSTEM&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; pg_stat_statements&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;track&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;all&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SYSTEM&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; track_activity_query_size &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 4096&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;track_activity_query_size&lt;/code&gt; defaults to 1024 bytes. Queries longer than this are truncated in &lt;code&gt;pg_stat_activity&lt;/code&gt;, which prevents DBM from collecting explain plans for them. Changing the setting requires a server restart, not just a configuration reload.&lt;/p&gt;
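&lt;p&gt;The checks in this step can be scripted. The following is a minimal Python sketch, not part of any Datadog tooling: the function name &lt;code&gt;check_dbm_prerequisites&lt;/code&gt; is hypothetical, and it assumes you have already read the relevant rows of &lt;code&gt;pg_settings&lt;/code&gt; into a dict that maps each setting name to its raw &lt;code&gt;setting&lt;/code&gt; string, so &lt;code&gt;track_activity_query_size&lt;/code&gt; is a plain byte count.&lt;/p&gt;

```python
def check_dbm_prerequisites(settings, min_query_size=4096):
    """Return a list of problems found; an empty list means every check passed.

    `settings` maps setting names to the raw strings from pg_settings.setting
    (an assumption of this sketch; sizes are therefore plain byte counts).
    """
    problems = []

    # shared_preload_libraries is a comma-separated list of library names.
    libraries = [
        name.strip()
        for name in settings.get("shared_preload_libraries", "").split(",")
    ]
    if "pg_stat_statements" not in libraries:
        problems.append("pg_stat_statements is not in shared_preload_libraries")

    # The checklist sets pg_stat_statements.track to 'all'.
    if settings.get("pg_stat_statements.track", "top") != "all":
        problems.append("pg_stat_statements.track is not 'all'")

    # Statements longer than track_activity_query_size are truncated.
    query_size = int(settings.get("track_activity_query_size", "1024"))
    if min_query_size > query_size:
        problems.append(
            f"track_activity_query_size is {query_size}, below {min_query_size}"
        )

    return problems
```

&lt;p&gt;Returning a list of findings rather than a single pass/fail flag lets one run report every missing prerequisite, which matters because two of the fixes need a restart and are best batched together.&lt;/p&gt;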
&lt;p&gt;&lt;strong&gt;3. Is the Datadog Agent configured for DBM?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In &lt;code&gt;/etc/datadog-agent/conf.d/postgres.d/conf.yaml&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;init_config&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;instances&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;host&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;your-db-host&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    port&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5432&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    username&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;datadog&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    password&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;ENC[your-secret]&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;   # use Datadog secret management&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    dbname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;your_database&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # Enable Database Monitoring:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    dbm&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # Query metrics (aggregated statistics from pg_stat_statements):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    query_metrics&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      enabled&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # Query samples — how often to collect explain plans:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    query_samples&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      enabled&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      explain_statement_min_duration&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;500&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;   # ms — only collect plans for queries over 500ms&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      samples_per_second&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;                  # Reduce if CPU pressure on the Agent host&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # Wait events (PostgreSQL 9.6+):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    query_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      enabled&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      collection_interval&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    # seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    tags&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;env:production&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;service:your-app&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;db_engine:postgres&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For MySQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;instances&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;host&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;your-mysql-host&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    user&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;datadog&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    pass&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;ENC[your-secret]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    port&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;3306&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    dbm&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    query_metrics&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      enabled&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    query_samples&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      enabled&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      explain_statement_min_duration&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;500&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    query_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      enabled&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4. Are explain plans being collected?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the Datadog UI, go to &lt;strong&gt;APM → Database Monitoring → Query Samples&lt;/strong&gt; and filter to your database host. If queries show “no explain plan,” verify:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;datadog.explain_statement&lt;/code&gt; function exists in the target database&lt;/li&gt;
&lt;li&gt;&lt;code&gt;explain_statement_min_duration&lt;/code&gt; is not set too high (default 5000ms misses most slow OLTP queries — set to 500ms)&lt;/li&gt;
&lt;li&gt;The query is not a DDL or &lt;code&gt;COPY&lt;/code&gt; statement (explain plans are not collected for these)&lt;/li&gt;
&lt;li&gt;The Agent’s &lt;code&gt;datadog&lt;/code&gt; user has &lt;code&gt;USAGE&lt;/code&gt; on the schema where the queried tables live&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;5. Are wait events visible?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the Datadog UI, go to &lt;strong&gt;Database Monitoring → Query Metrics&lt;/strong&gt;, click a query, and open the &lt;strong&gt;Wait Events&lt;/strong&gt; tab. If the tab is empty:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Verify &lt;code&gt;query_activity.enabled: true&lt;/code&gt; in &lt;code&gt;conf.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Verify the &lt;code&gt;datadog&lt;/code&gt; user has the &lt;code&gt;pg_monitor&lt;/code&gt; role&lt;/li&gt;
&lt;li&gt;Run the check manually with &lt;code&gt;datadog-agent check postgres&lt;/code&gt; and look for errors on the &lt;code&gt;pg_stat_activity&lt;/code&gt; collection&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Set up Datadog DBM] --&gt; B[Create monitoring user with correct grants]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C{PostgreSQL or MySQL?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|PostgreSQL| D[Enable pg_stat_statements — add to shared_preload_libraries]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|MySQL| E[Grant SELECT on performance_schema and sys]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; F[Create datadog.explain_statement SECURITY DEFINER function]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; G[Set dbm:true in Agent conf.yaml]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; H[Set explain_statement_min_duration to 500ms]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt; I[Enable query_activity for wait events]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt; J{Verify data appears}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|Query samples empty| K[Check pg_stat_statements.track — set to all — check track_activity_query_size]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|No explain plans| L[Verify explain_statement function — check USAGE grant on all schemas]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|No wait events| M[Verify pg_monitor grant — check query_activity.enabled in conf.yaml]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|All data visible| N[Set alert thresholds on p99 query latency and connection saturation]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;If DBM is causing database load:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reduce &lt;code&gt;query_samples.samples_per_second&lt;/code&gt; to &lt;code&gt;0.1&lt;/code&gt; or disable query sampling entirely: &lt;code&gt;query_samples.enabled: false&lt;/code&gt;. Query metrics (without explain plans) have minimal database impact.&lt;/li&gt;
&lt;li&gt;Increase &lt;code&gt;explain_statement_min_duration&lt;/code&gt; to 2000ms to reduce explain plan frequency.&lt;/li&gt;
&lt;li&gt;If the monitoring connection itself is causing connection count pressure, reduce Agent check frequency: &lt;code&gt;min_collection_interval: 30&lt;/code&gt; (seconds).&lt;/li&gt;
&lt;li&gt;Disable &lt;code&gt;query_activity&lt;/code&gt; collection if the &lt;code&gt;pg_stat_activity&lt;/code&gt; query is slow on instances with many databases or connections.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;datadog.explain_statement&lt;/code&gt; function runs &lt;code&gt;EXPLAIN&lt;/code&gt; on sampled queries. On very high-throughput databases, this adds measurable load. Disable plan collection and rely on query metrics only if the database is already under pressure.&lt;/li&gt;
&lt;/ol&gt;
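&lt;p&gt;The rollback steps above can be expressed as an ordered list of configuration overrides so they are applied one stage at a time. This is an illustrative Python sketch under stated assumptions: &lt;code&gt;ROLLBACK_STAGES&lt;/code&gt; and &lt;code&gt;apply_rollback&lt;/code&gt; are hypothetical names, the instance is assumed to be the already-parsed &lt;code&gt;conf.yaml&lt;/code&gt; entry as a dict, the ordering from least to most disruptive is a design choice of this sketch, and the option names are simply the ones this checklist uses, so confirm them against the example config shipped with your Agent version.&lt;/p&gt;

```python
import copy

# Ordered from least to most disruptive (an ordering chosen for this sketch).
# Key names are the ones this checklist uses; confirm them against the
# example config shipped with your Agent version before relying on them.
ROLLBACK_STAGES = [
    {"query_samples": {"samples_per_second": 0.1}},
    {"query_samples": {"explain_statement_min_duration": 2000}},
    {"min_collection_interval": 30},
    {"query_activity": {"enabled": False}},
    {"query_samples": {"enabled": False}},
]


def apply_rollback(instance, stage_count):
    """Return a copy of `instance` with the first `stage_count` stages applied."""
    result = copy.deepcopy(instance)
    for stage in ROLLBACK_STAGES[:stage_count]:
        for key, value in stage.items():
            if isinstance(value, dict):
                # Merge nested sections instead of replacing them wholesale.
                result.setdefault(key, {}).update(value)
            else:
                result[key] = value
    return result
```

&lt;p&gt;Working on a deep copy keeps the original instance untouched, so reverting a rollback is just a matter of re-rendering the config with a smaller &lt;code&gt;stage_count&lt;/code&gt;.&lt;/p&gt;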
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provision monitoring user via Terraform&lt;/strong&gt;: manage the &lt;code&gt;datadog&lt;/code&gt; PostgreSQL user and grants through the same Terraform module that provisions the database. Store the password in AWS Secrets Manager or Vault, not in the Agent config file directly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Agent configuration as code&lt;/strong&gt;: manage &lt;code&gt;conf.yaml&lt;/code&gt; through Ansible or a Helm chart value. The &lt;code&gt;explain_statement_min_duration&lt;/code&gt; threshold and &lt;code&gt;collection_interval&lt;/code&gt; settings should be tunable per environment without touching the Agent host directly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alert from DBM metrics&lt;/strong&gt;: create Datadog monitors on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;postgresql.connections&lt;/code&gt; &gt; 80% of &lt;code&gt;max_connections&lt;/code&gt; — warning; 90% critical&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgresql.replication_delay&lt;/code&gt; &gt; 60s warning; 300s critical&lt;/li&gt;
&lt;li&gt;Average query latency (&lt;code&gt;postgresql.queries.time&lt;/code&gt; divided by &lt;code&gt;postgresql.queries.count&lt;/code&gt;) spiking above 2× baseline: warning&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mysql.replication.seconds_behind_master&lt;/code&gt; &gt; 30s warning; null = critical (broken replication)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
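&lt;p&gt;The alert thresholds listed above reduce to a few comparisons. The Python sketch below is only an illustration of that arithmetic, not a Datadog monitor definition: all function names are hypothetical, and the readings are assumed to have been fetched elsewhere.&lt;/p&gt;

```python
def classify(value, warning, critical):
    """Map a reading to 'ok', 'warning', or 'critical' (higher is worse)."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def connection_status(current, max_connections):
    """Connections as a percentage of max_connections: 80 warns, 90 is critical."""
    percent_used = 100.0 * current / max_connections
    return classify(percent_used, warning=80, critical=90)


def replication_status(delay_seconds):
    """PostgreSQL replication delay: 60s warns, 300s is critical."""
    return classify(delay_seconds, warning=60, critical=300)


def latency_status(current, baseline):
    """Query latency: more than twice the baseline is a warning."""
    return "warning" if current > 2 * baseline else "ok"


def mysql_replication_status(seconds_behind):
    """MySQL lag: None (no value reported) is treated as broken replication."""
    if seconds_behind is None:
        return "critical"
    return classify(seconds_behind, warning=30, critical=float("inf"))
```

&lt;p&gt;For example, 85 connections against a &lt;code&gt;max_connections&lt;/code&gt; of 100 is 85% usage, which falls between the two thresholds and is classified as a warning.&lt;/p&gt;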
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;p&gt;Datadog Database Monitoring closes the gap between APM traces and database behavior. When an application trace is slow, DBM lets the team click through to the specific SQL, its explain plan at the time of the slowdown, and the wait events that show what the database was waiting on. Without DBM configured correctly — with the right grants, &lt;code&gt;pg_stat_statements&lt;/code&gt; enabled, &lt;code&gt;track_activity_query_size&lt;/code&gt; large enough, and explain plan sampling at a useful threshold — the team gets query metrics but not query diagnostics. The setup work is one-time; the operational benefit is continuous.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Explain plans absent for short queries&lt;/td&gt;&lt;td&gt;&lt;code&gt;explain_statement_min_duration&lt;/code&gt; set to 5000ms (default)&lt;/td&gt;&lt;td&gt;Lower to 500ms for OLTP databases&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Truncated queries in DBM&lt;/td&gt;&lt;td&gt;&lt;code&gt;track_activity_query_size&lt;/code&gt; too small&lt;/td&gt;&lt;td&gt;Set to 4096 in &lt;code&gt;postgresql.conf&lt;/code&gt; and restart PostgreSQL&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Aurora read replicas not in DBM&lt;/td&gt;&lt;td&gt;Each replica is a separate instance that the Agent must connect to directly&lt;/td&gt;&lt;td&gt;Add a separate &lt;code&gt;instances:&lt;/code&gt; entry in &lt;code&gt;conf.yaml&lt;/code&gt; for each replica’s instance endpoint, not the shared reader endpoint&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;SECURITY DEFINER&lt;/code&gt; function security concern&lt;/td&gt;&lt;td&gt;Function runs &lt;code&gt;EXPLAIN&lt;/code&gt; with its owner’s privileges, often a superuser&lt;/td&gt;&lt;td&gt;Keep the function limited to plain &lt;code&gt;EXPLAIN&lt;/code&gt;, which plans a statement without executing it; never add &lt;code&gt;ANALYZE&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;DBM adds one extra connection per Agent&lt;/td&gt;&lt;td&gt;On databases near &lt;code&gt;max_connections&lt;/code&gt;, the Agent connection pushes usage over the limit&lt;/td&gt;&lt;td&gt;Reserve connections for monitoring: set &lt;code&gt;max_connections&lt;/code&gt; 10 higher than the application pool maximum&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt; reset on restart&lt;/td&gt;&lt;td&gt;Cumulative counters reset; DBM shows a spike&lt;/td&gt;&lt;td&gt;Set &lt;code&gt;pg_stat_statements.save = on&lt;/code&gt;; use rate metrics in Datadog, not raw counters&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Your database is visible in Datadog as infrastructure metrics but slow queries are not linked to their explain plans or wait events.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Enable DBM with the monitoring user grants above, set &lt;code&gt;explain_statement_min_duration&lt;/code&gt; to 500ms, and verify &lt;code&gt;pg_stat_statements&lt;/code&gt; is loaded.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; After setup, trigger a known slow query and verify it appears in Query Samples with an explain plan attached within 60 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; This week, create the &lt;code&gt;datadog&lt;/code&gt; monitoring user, add the &lt;code&gt;SECURITY DEFINER&lt;/code&gt; explain function, and set &lt;code&gt;dbm: true&lt;/code&gt; in the Agent config. Restart the Agent and verify query samples appear in the Datadog UI within 5 minutes.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>Cassandra Observability: Compaction, Tombstones, Repair, Latency, and Hot Partitions</title><link>https://rajivonai.com/blog/2024-09-17-cassandra-observability-compaction-tombstones/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-09-17-cassandra-observability-compaction-tombstones/</guid><description>Why generic server monitoring fails for Apache Cassandra, and how to track the true operational signals of a distributed masterless database.</description><pubDate>Tue, 17 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If you try to monitor a distributed, masterless database like Cassandra using the same dashboard you use for a monolithic relational database, you will misdiagnose every single incident.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Apache Cassandra operates on fundamentally different assumptions than relational systems like PostgreSQL or MySQL. It is an AP system in the CAP theorem context: highly available, partition tolerant, and eventually consistent. Data is distributed across a ring of nodes, writes are appended to memory and disk sequentially, and deletes are executed by inserting a marker called a “tombstone.”&lt;/p&gt;
&lt;p&gt;When teams adopt Cassandra, they often plug it into their existing monitoring stack. They set alerts on CPU utilization, disk space, and memory consumption. But in Cassandra, a node running at 80% CPU might be perfectly healthy and churning through background compaction, while a node at 20% CPU might be silently dropping mutations because it is overwhelmed by tombstones during read repair. Generic infrastructure metrics are insufficient; you must observe Cassandra’s internal state machine.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;p&gt;A Cassandra cluster experiencing distress exhibits unique failure modes that rarely trigger standard host-level alarms until it is too late:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Tombstone Overwhelm:&lt;/strong&gt; Read latency spikes for a specific table. CPU is low, but the application is timing out. The node is scanning and discarding thousands of deleted records (tombstones) to return a single live row.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Compaction Debt:&lt;/strong&gt; Disk usage begins climbing relentlessly. The node is writing data faster than the background compaction threads can merge the SSTables, leading to read latency degradation as queries must scan dozens of fragmented files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Partition Hotspot:&lt;/strong&gt; One node in a 10-node cluster is pegged at 100% CPU while the other nine sit at 15%. A single customer or entity is receiving a disproportionate share of traffic, overwhelming the node responsible for that token range.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Repair Drift:&lt;/strong&gt; Nodes return inconsistent data depending on the consistency level (&lt;code&gt;LOCAL_QUORUM&lt;/code&gt; vs &lt;code&gt;ONE&lt;/code&gt;). Anti-entropy repair processes have fallen behind or failed, leading to stale reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;p&gt;When a Cassandra pager alert fires—especially for p99 latency spikes—these are the five internal metrics you must check:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check Pending Tasks (&lt;code&gt;nodetool tpstats&lt;/code&gt;):&lt;/strong&gt;
This shows the thread pool statistics. The critical metrics are &lt;code&gt;Pending&lt;/code&gt; and &lt;code&gt;Dropped&lt;/code&gt; messages. If &lt;code&gt;MutationStage&lt;/code&gt; or &lt;code&gt;ReadStage&lt;/code&gt; has a high pending count, the node is saturated. If there are dropped mutations, this replica has skipped writes and can serve stale data until hints or repair bring it back in sync.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Evaluate Compaction Backlog (&lt;code&gt;nodetool compactionstats&lt;/code&gt;):&lt;/strong&gt;
Look at &lt;code&gt;pending tasks&lt;/code&gt;. A small number is normal. A number that stays in the hundreds or thousands indicates compaction is not keeping up with the write rate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Analyze Tombstone Ratios (Log inspection or JMX metrics):&lt;/strong&gt;
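Check the &lt;code&gt;system.log&lt;/code&gt; for warnings of the form &lt;code&gt;Read X live rows and Y tombstone cells&lt;/code&gt;, logged once a read crosses &lt;code&gt;tombstone_warn_threshold&lt;/code&gt;, and for errors of the form &lt;code&gt;Scanned over X tombstones&lt;/code&gt;, logged when a read is aborted at &lt;code&gt;tombstone_failure_threshold&lt;/code&gt;. Either message means read queries are doing large amounts of wasted work.&lt;/p&gt;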
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify Client Request Latency via JMX/Metrics:&lt;/strong&gt;
Look at &lt;code&gt;ClientRequest.Latency.Read&lt;/code&gt; and &lt;code&gt;ClientRequest.Latency.Write&lt;/code&gt; at the 99th percentile (p99). Cassandra is highly optimized for writes; if write latency spikes, disk I/O is usually the bottleneck.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Examine Partition Sizes (&lt;code&gt;nodetool tablestats&lt;/code&gt;):&lt;/strong&gt;
Look for the &lt;code&gt;Compacted partition maximum bytes&lt;/code&gt;. If a single partition exceeds 100MB, you have a data modeling problem causing a hotspot, not an infrastructure problem.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
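&lt;p&gt;As a rough sketch, the five checks above map onto the commands below, run on the affected node. The log path &lt;code&gt;/var/log/cassandra/system.log&lt;/code&gt; and the table name &lt;code&gt;my_keyspace.events&lt;/code&gt; are assumptions for illustration; substitute your own. &lt;code&gt;nodetool proxyhistograms&lt;/code&gt; stands in here for the JMX latency check when no metrics pipeline is at hand.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# 1. Thread pool saturation: read the Pending column and the dropped-message section&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;nodetool tpstats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# 2. Compaction backlog: read the pending tasks line&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;nodetool compactionstats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# 3. Recent tombstone messages (log path is an assumption; adjust for your install)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;grep -i &quot;tombstone&quot; /var/log/cassandra/system.log | tail -n 20&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# 4. Coordinator read and write latency percentiles as seen by this node&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;nodetool proxyhistograms&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# 5. Partition size lines for one table (my_keyspace.events is a hypothetical name)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;nodetool tablestats my_keyspace.events | grep -i &quot;partition&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;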
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;p&gt;When diagnosing a Cassandra latency spike, use the following operational flow:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[p99 Latency Spike Detected] --&gt; B{Is it Read or Write Latency?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Write| C[Check Pending Tasks]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; C1{Are Mutations Dropping?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|Yes| C2[Node is Overwhelmed: Add Capacity or Shed Load]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|No| C3[Check Disk I/O Wait]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C3 --&gt;|High| C4[Storage Bottleneck: Upgrade Disks]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Read| D[Check Pending Tasks]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; D1{Are ReadStages Pending?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D1 --&gt;|No| D2[Check Tombstone Warnings in Logs]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D2 --&gt;|High| D3[Tombstone Overwhelm: Change Data Model or Lower GC Grace]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D2 --&gt;|Low| D4[Check Compaction Backlog]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D4 --&gt;|High| D5[Fragmented Reads: Tune Compaction Throughput]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tune Compaction Throughput (Medium Speed, Low Risk):&lt;/strong&gt;
If compaction is falling behind, you can dynamically increase &lt;code&gt;compaction_throughput_mb_per_sec&lt;/code&gt; using &lt;code&gt;nodetool setcompactionthroughput&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; Compaction is highly I/O intensive. Increasing throughput might clear the backlog but can temporarily degrade read and write latencies.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add Nodes to the Ring (Slow, Permanent Fix):&lt;/strong&gt;
If the entire cluster is legitimately saturated (high CPU, high pending tasks, dropping mutations across the ring), you must bootstrap new nodes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; Bootstrapping involves streaming data across the network, which adds load to the existing struggling nodes. Do not wait until the cluster is at 95% capacity to scale.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lower &lt;code&gt;gc_grace_seconds&lt;/code&gt; (Fast, High Risk):&lt;/strong&gt;
If tombstones are crushing read performance on a specific table, and repair reliably completes on every node within a shorter window than the current setting, you can lower &lt;code&gt;gc_grace_seconds&lt;/code&gt; via &lt;code&gt;ALTER TABLE&lt;/code&gt; so compaction can purge tombstones sooner.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; If a node goes down for longer than the new &lt;code&gt;gc_grace_seconds&lt;/code&gt; and misses a delete, that deleted data will “resurrect” when the node comes back online.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
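&lt;p&gt;A minimal sketch of the two runtime changes above. The throughput value &lt;code&gt;64&lt;/code&gt;, the table name &lt;code&gt;my_keyspace.events&lt;/code&gt;, and the one-day &lt;code&gt;gc_grace_seconds&lt;/code&gt; are illustrative assumptions, not recommendations.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# Record the current limit first, so there is a known value to roll back to&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;nodetool getcompactionthroughput&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# Raise the limit on this node at runtime (64 is an illustrative value)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;nodetool setcompactionthroughput 64&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# Lower gc_grace_seconds on one table (hypothetical name; 86400 seconds = 1 day)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;cqlsh -e &quot;ALTER TABLE my_keyspace.events WITH gc_grace_seconds = 86400;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A value set with &lt;code&gt;nodetool setcompactionthroughput&lt;/code&gt; applies only to the node it runs on and does not survive a restart, so a change you want to keep also has to go into the node configuration.&lt;/p&gt;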
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;If you tune compaction throughput too aggressively and disk I/O saturates, causing widespread query timeouts, revert &lt;code&gt;compaction_throughput_mb_per_sec&lt;/code&gt; to its previous conservative value (e.g., 16 MB/s) using &lt;code&gt;nodetool setcompactionthroughput 16&lt;/code&gt;. Note: setting the value to &lt;code&gt;0&lt;/code&gt; removes the limit entirely; it does not pause compaction. If background compaction is actively destroying cluster stability, use &lt;code&gt;nodetool stop COMPACTION&lt;/code&gt; to halt the compaction tasks that are currently running. New compactions will still be scheduled afterwards, so treat this as a short breather while I/O pressure subsides, not a fix.&lt;/p&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Deploy an automated script that polls JMX metrics for &lt;code&gt;Dropped Mutations&lt;/code&gt; across all nodes. If a node keeps dropping mutations for more than 5 minutes, automatically route application traffic away from that node’s local datacenter (if running multi-DC) or trigger a high-severity incident. A dropped mutation leaves that replica missing the write; until hinted handoff or repair restores it, the gap can surface as stale reads, and as lost data if the other replicas fail first.&lt;/p&gt;
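&lt;p&gt;One possible shape for that poller, sketched with &lt;code&gt;nodetool tpstats&lt;/code&gt; instead of a JMX client. It assumes the dropped-message section prints a row that starts with &lt;code&gt;MUTATION&lt;/code&gt; and carries the dropped count in the second column; check this against the output of your Cassandra version. The interval, the threshold, and the &lt;code&gt;echo&lt;/code&gt; placeholder for the paging hook are all assumptions, and the script watches only the node it runs on.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;#!/usr/bin/env bash&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# Sketch: alert when the dropped MUTATION counter keeps growing between polls.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;interval=60   # seconds between polls (assumption)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;limit=5       # consecutive growing polls before alerting, roughly 5 minutes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;streak=0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# Read the dropped count from the row whose first field is MUTATION&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;prev=$(nodetool tpstats | awk &apos;$1 == &quot;MUTATION&quot; {print $2}&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;while sleep &quot;$interval&quot;; do&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  cur=$(nodetool tpstats | awk &apos;$1 == &quot;MUTATION&quot; {print $2}&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  # Count consecutive polls in which the counter grew; reset on a quiet poll&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if [ &quot;${cur:-0}&quot; -gt &quot;${prev:-0}&quot; ]; then streak=$((streak + 1)); else streak=0; fi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if [ &quot;$streak&quot; -ge &quot;$limit&quot; ]; then&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    echo &quot;dropped mutations rising for $streak consecutive polls&quot;  # replace with your paging hook&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    streak=0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  fi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  prev=${cur:-$prev}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;done&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;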
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Acknowledge the Cassandra Tax:&lt;/strong&gt; Cassandra requires ongoing background maintenance (compaction and repair). You must provision your clusters so that they run at no more than 50-60% capacity during normal operations to leave headroom for this maintenance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Modeling is Operations:&lt;/strong&gt; Most Cassandra performance issues trace back to the data model (large partitions or heavy deletes), not to the hardware.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor the 99th Percentile:&lt;/strong&gt; Cassandra is known for stable average latencies but terrifying tail latencies during JVM garbage collection or heavy compaction. Always alert on p99, never on the average.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Cassandra’s most destructive failure modes — tombstone read amplification, compaction debt, hot partitions — don’t register on CPU or memory dashboards until the cluster is already in distress, because a node scanning 50,000 tombstones to return one row can run at 20% CPU while its read latency is at 10 seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Ingest &lt;code&gt;nodetool tpstats&lt;/code&gt; (pending and dropped task counts), &lt;code&gt;nodetool compactionstats&lt;/code&gt; (pending compaction tasks), and tombstone scan warnings from &lt;code&gt;system.log&lt;/code&gt; as time-series metrics alongside host metrics. These signals surface Cassandra-specific distress before it becomes visible to users, which host metrics alone do not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Artificially generate thousands of deletes on a test table in staging and verify that read latency alerts fire before the problem appears on CPU charts — if CPU is the first signal, the monitoring doesn’t give enough lead time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Configure JMX metrics ingestion (Datadog JMX integration or Prometheus JMX exporter) this week and add a panel tracking &lt;code&gt;ClientRequest.Latency.Read&lt;/code&gt; p99 and &lt;code&gt;Pending CompactionExecutor&lt;/code&gt; tasks — these two metrics together explain most Cassandra incidents.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category></item><item><title>Cloud Architecture Review Checklist for Database-Backed Applications</title><link>https://rajivonai.com/blog/2024-09-12-cloud-architecture-review-checklist-for-database-backed-applications/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-09-12-cloud-architecture-review-checklist-for-database-backed-applications/</guid><description>Review checklist for database-backed cloud applications: connection saturation, migration locking, retry amplification, and region dependency failures.</description><pubDate>Thu, 12 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most cloud architecture reviews fail because they inspect topology before they inspect failure. The database is drawn as a box, the application tier as another box, and the review turns into a discussion about instance sizes, replicas, and network paths. The harder question is operational: when latency rises, connections saturate, retries multiply, migrations lock hot tables, or a region loses dependency access, what prevents the application from turning a database symptom into a customer-facing outage?&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Database-backed applications have changed shape. A typical service is no longer a single application talking to one database over a private network. It may run across containers, serverless jobs, queues, caches, search indexes, object storage, feature flag systems, identity providers, and third-party APIs. The database remains the system of record, but the user path increasingly depends on many control planes and data planes staying within their expected latency budgets.&lt;/p&gt;
&lt;p&gt;Cloud platforms make the first version easy to deploy. Managed databases remove backup scripts, failover automation, patch windows, and much of the storage plumbing. That convenience is real. It also changes the review burden. Engineers now need to verify the contracts around the managed service: connection limits, failover behavior, replication lag, backup restore time, parameter changes, maintenance windows, identity policies, encryption boundaries, and observability.&lt;/p&gt;
&lt;p&gt;The architecture review should therefore be less about whether a diagram looks cloud native and more about whether the system degrades deliberately.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The common review checklist is too static. It asks whether the database is replicated, whether backups exist, whether TLS is enabled, whether the application has autoscaling, and whether monitoring is configured. Those are necessary checks, but they do not expose the most expensive failures.&lt;/p&gt;
&lt;p&gt;The expensive failures happen in the interactions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Autoscaling adds application instances faster than the database can accept new connections.&lt;/li&gt;
&lt;li&gt;Retry policies amplify a short database stall into sustained overload.&lt;/li&gt;
&lt;li&gt;Read replicas hide primary pressure until replication lag invalidates user workflows.&lt;/li&gt;
&lt;li&gt;A migration that passed staging blocks production writes because production cardinality is different.&lt;/li&gt;
&lt;li&gt;A cache masks database latency until eviction, deployment, or regional failover makes all callers miss at once.&lt;/li&gt;
&lt;li&gt;A backup policy exists, but the restore path has never been timed against the recovery objective.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The review question is not, “Do we have the right components?” It is: &lt;strong&gt;can this application keep its database failure modes bounded, observable, and reversible under production load?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;A useful architecture review for a database-backed cloud application follows the request path, the write path, and the recovery path. Each path should expose limits, contracts, and rollback points.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[client request — external traffic] --&gt; B[edge controls — auth and rate limits]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C[application tier — bounded concurrency]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D[connection pool — fixed database pressure]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E[primary database — writes and transactions]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[cache layer — explicit freshness contract]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; G[read replica — bounded stale reads]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; H[change stream — async propagation]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt; I[workers — idempotent side effects]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; J[backup system — restore tested]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; K[metrics and traces — saturation visible]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt; L[runbook — rollback and failover]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The checklist should start with traffic admission. Every service needs a clear maximum for concurrent database work. Autoscaling policies should not be allowed to create unbounded database pressure. Connection pools should be sized from database capacity, not from the number of application instances. If the application uses serverless compute, the review must account for burst concurrency and cold starts creating connection storms.&lt;/p&gt;
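&lt;p&gt;The sizing rule can be worked as simple arithmetic. The sketch below uses made-up numbers: a database limit of 500 connections, 50 held back for administration, migrations, and monitoring, and an autoscaling ceiling of 30 instances. All three values are assumptions to replace with your own.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;max_connections=500   # database connection limit (hypothetical)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;reserved=50           # held back for admin, migrations, monitoring (hypothetical)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;max_instances=30      # autoscaling ceiling, not the current instance count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;# Divide what the database can accept by the largest fleet the autoscaler may create&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;pool_size=$(( (max_connections - reserved) / max_instances ))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;echo &quot;per-instance pool size: $pool_size&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these illustrative numbers each instance gets a pool of 15. Dividing by the autoscaling ceiling rather than the current instance count is the point: the total stays under the database limit even when the fleet is fully scaled out.&lt;/p&gt;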
&lt;p&gt;Next, inspect transaction design. Long transactions, interactive transactions, and transactions that call remote services are architecture smells. The database should protect invariants, but application code should avoid holding locks while waiting on external systems. For high-contention workflows, the review should ask how conflicts are detected, retried, surfaced, and measured.&lt;/p&gt;
&lt;p&gt;Then inspect read behavior. Read replicas are not a generic scaling button. They introduce a consistency contract. If a user writes data and immediately reads from a replica, the product may observe stale state unless the application routes read-after-write flows to the primary, uses session consistency, or makes staleness acceptable in the interface.&lt;/p&gt;
&lt;p&gt;Caching deserves a separate pass. The review should document what each cache entry means, how it expires, what invalidates it, and what happens when the cache is empty. A cache that protects a database in steady state can become an outage accelerator during mass eviction. Warmup, request coalescing, negative caching, and backpressure belong in the design, not in the incident retrospective.&lt;/p&gt;
&lt;p&gt;Finally, review recovery. Backups are not a recovery strategy until restores are exercised. The architecture needs defined recovery point objective, recovery time objective, restore ownership, data validation steps, and a tested path for reconnecting applications to the restored database.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;h3 id=&quot;context&quot;&gt;Context&lt;/h3&gt;
&lt;p&gt;The documented pattern across cloud reliability literature is that overload often propagates through retries and shared dependencies. The &lt;a href=&quot;https://sre.google/sre-book/handling-overload/&quot;&gt;Google SRE book chapter on handling overload&lt;/a&gt; describes overload as a system-level condition requiring load shedding, graceful degradation, and capacity-aware admission control. The database-backed application version of this pattern is direct: if every caller retries failed database work without a budget, the database receives more work precisely when it has the least capacity to serve it.&lt;/p&gt;
&lt;h3 id=&quot;action&quot;&gt;Action&lt;/h3&gt;
&lt;p&gt;The review action is to require retry budgets, deadlines, and idempotency. Amazon’s Builders’ Library article on &lt;a href=&quot;https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/&quot;&gt;timeouts, retries, and backoff with jitter&lt;/a&gt; documents the operational pattern: timeouts must be chosen from downstream latency behavior, retries should be limited, and jitter helps avoid synchronized retry waves. In a database-backed system, that means every database call should sit inside a request deadline, every retry should have a bounded count, and every retried write should be safe through an idempotency key, natural constraint, or transactionally recorded operation identifier.&lt;/p&gt;
&lt;h3 id=&quot;result&quot;&gt;Result&lt;/h3&gt;
&lt;p&gt;The result is not “no failures.” The result is bounded failure. PostgreSQL, for example, documents transaction isolation levels and serialization failures as normal concurrency outcomes rather than exceptional mysteries. Under &lt;code&gt;SERIALIZABLE&lt;/code&gt;, applications must be prepared to retry transactions that fail due to serialization anomalies. Under weaker isolation, applications must understand which anomalies they have accepted. The architectural learning is that correctness is partly a database feature and partly an application contract.&lt;/p&gt;
&lt;h3 id=&quot;learning&quot;&gt;Learning&lt;/h3&gt;
&lt;p&gt;The documented pattern is that database reliability depends on explicit contracts at the edges: admission control before the database, transaction boundaries inside the database, consistency rules around replicas, and recovery tests outside the live path. A review that cannot name those contracts has not reviewed the architecture. It has reviewed the drawing.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Review Area&lt;/th&gt;&lt;th&gt;Failure Mode&lt;/th&gt;&lt;th&gt;Better Question&lt;/th&gt;&lt;th&gt;Common Mitigation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Autoscaling&lt;/td&gt;&lt;td&gt;Application fleet outgrows database connection capacity&lt;/td&gt;&lt;td&gt;What caps concurrent database work?&lt;/td&gt;&lt;td&gt;Pool limits, proxy, admission control&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Retries&lt;/td&gt;&lt;td&gt;Short stall becomes sustained overload&lt;/td&gt;&lt;td&gt;What is the retry budget per request?&lt;/td&gt;&lt;td&gt;Deadlines, backoff, jitter, idempotency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replicas&lt;/td&gt;&lt;td&gt;Stale reads break user workflows&lt;/td&gt;&lt;td&gt;Which reads require fresh data?&lt;/td&gt;&lt;td&gt;Primary routing, session reads, explicit staleness&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Migrations&lt;/td&gt;&lt;td&gt;Schema change blocks hot production paths&lt;/td&gt;&lt;td&gt;How is lock impact tested?&lt;/td&gt;&lt;td&gt;Online migrations, batching, rollback plan&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Caching&lt;/td&gt;&lt;td&gt;Cache miss storm overloads primary&lt;/td&gt;&lt;td&gt;What happens on cold cache?&lt;/td&gt;&lt;td&gt;Request coalescing, warmup, backpressure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backups&lt;/td&gt;&lt;td&gt;Backup exists but restore misses objective&lt;/td&gt;&lt;td&gt;When was restore last timed?&lt;/td&gt;&lt;td&gt;Restore drills, validation scripts, runbooks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Observability&lt;/td&gt;&lt;td&gt;Metrics show symptoms but not saturation&lt;/td&gt;&lt;td&gt;Can we see queueing before errors?&lt;/td&gt;&lt;td&gt;Pool metrics, wait time, lock time, replica lag&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failover&lt;/td&gt;&lt;td&gt;Promotion succeeds but app does not recover&lt;/td&gt;&lt;td&gt;Who changes writers and verifies data?&lt;/td&gt;&lt;td&gt;Automated 
failover tests, DNS and connection review&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The tradeoff is that these checks add friction before launch. They force teams to define limits earlier than they would prefer. That friction is useful. A database-backed application without declared limits still has limits; it discovers them during incidents.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt; — Start the review from failure modes, not component inventory. Ask how the application behaves when the database is slow, unavailable, stale, locked, overloaded, or restored from backup.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt; — Require explicit contracts for concurrency, retries, transactions, replicas, caches, migrations, observability, and recovery. Put those contracts in the design review and the runbook.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Proof&lt;/strong&gt; — Verify the contracts with load tests, migration rehearsals, restore drills, replica lag tests, cache cold-start tests, and dashboards that show saturation before user-visible errors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt; — Before approving the architecture, make the team answer one operational question in writing: what exact mechanism prevents this application from making a struggling database worse?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>cloud</category><category>databases</category><category>failures</category><category>checklist</category></item><item><title>Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup</title><link>https://rajivonai.com/blog/2024-09-09-prometheus-grafana-database-monitoring-setup-postgres-mysql/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-09-09-prometheus-grafana-database-monitoring-setup-postgres-mysql/</guid><description>How to instrument PostgreSQL and MySQL with postgres_exporter and mysqld_exporter, configure Prometheus scrape jobs, and build Grafana panels that surface the metrics that matter — with working PromQL queries.</description><pubDate>Mon, 09 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Prometheus and Grafana are the right default for database monitoring when the team already runs them for infrastructure. The mistake is treating database exporters as install-and-forget: they require scope decisions, scrape tuning, recording rules for expensive queries, and panels aligned to operational questions rather than metric availability.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Prometheus with postgres_exporter or mysqld_exporter gives a team database metrics in the same system they use for Kubernetes, application, and infrastructure metrics. That consistency matters during incidents: one tool, one query language, one dashboard system.&lt;/p&gt;
&lt;p&gt;The challenge is setup quality. Both exporters expose hundreds of metrics by default. Without scope decisions and recording rules, the result is a Prometheus instance ingesting metrics that nobody queries, Grafana dashboards that show every metric but answer no operational question, and a scrape interval too infrequent to catch short-duration failures.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Symptom&lt;/th&gt;&lt;th&gt;Likely cause&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Grafana database dashboard shows data but engineer can’t tell if system is healthy&lt;/td&gt;&lt;td&gt;Dashboard shows metrics, not answers — no thresholds, no anomaly detection&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Prometheus scrape latency is high&lt;/td&gt;&lt;td&gt;Exporter is running expensive queries during scrape; needs collector filtering&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Database monitoring is absent during Prometheus downtime&lt;/td&gt;&lt;td&gt;No remote write or long-term storage — single point of failure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Alert fires but metric data is missing&lt;/td&gt;&lt;td&gt;Scrape interval too long for the alert evaluation window&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Exporter crashes after database restart&lt;/td&gt;&lt;td&gt;Exporter not configured to retry connections&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;1. Is postgres_exporter running with appropriate collector scope?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;postgres_exporter&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --collector.stat_activity_autovacuum&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --collector.stat_statements&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --collector.stat_bgwriter&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --collector.replication&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --collector.replication_slot&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --no-collector.wal&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --no-collector.database_wraparound&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --web.listen-address=:9187&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Disable expensive collectors you do not need. &lt;code&gt;database_wraparound&lt;/code&gt; queries &lt;code&gt;age(datfrozenxid)&lt;/code&gt; on every database and can be slow on instances with many databases. Enable only the collectors you have dashboard panels for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Is the scrape interval appropriate?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For OLTP databases, scrape every 30 seconds. For analytics-heavy workloads with slow collector queries, 60 seconds is acceptable. Shorter than 30 seconds risks accumulating scrape delays during high-load periods.&lt;/p&gt;
&lt;p&gt;In &lt;code&gt;prometheus.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;scrape_configs&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;job_name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;postgres&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    scrape_interval&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;30s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    scrape_timeout&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;20s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    static_configs&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;targets&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: [&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;postgres-exporter:9187&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        labels&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          env&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;production&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          db_engine&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;postgres&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          cluster&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;primary&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;3. Are recording rules defined for expensive derived metrics?&lt;/strong&gt;&lt;/p&gt;
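&lt;p&gt;PromQL queries that compute ratios from raw counters on every dashboard load are expensive at query time. Move them into recording rules, which Prometheus evaluates once per rule-group interval and stores as a new series, so every dashboard load reads a precomputed value.&lt;/p&gt;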
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# prometheus/rules/database.yaml&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;groups&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;database_derived&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    interval&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;60s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    rules&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;record&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;postgres:cache_hit_ratio&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        expr&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;|&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          rate(pg_statio_user_tables_heap_blks_hit[5m]) /&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          (rate(pg_statio_user_tables_heap_blks_hit[5m]) +&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;           rate(pg_statio_user_tables_heap_blks_read[5m]))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;record&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;postgres:connections_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        expr&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;|&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          sum by (instance) (pg_stat_activity_count) /&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          on (instance) pg_settings_max_connections * 100&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;record&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;postgres:replication_lag_seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        expr&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;|&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;          pg_replication_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4. Are alert rules configured with meaningful labels?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;groups&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;postgres_alerts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    rules&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;alert&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;PostgresReplicaLagHigh&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        expr&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;pg_replication_lag &gt; 60&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        for&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;2m&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        labels&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          severity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;warning&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          team&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;database&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        annotations&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          summary&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;PostgreSQL replica lag above 60s on {{ $labels.instance }}&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          runbook_url&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;https://wiki.example.com/runbooks/postgres-replica-lag&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      - &lt;/span&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;alert&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;PostgresConnectionsNearLimit&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        expr&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;postgres:connections_pct &gt; 85&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        for&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;5m&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        labels&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          severity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;critical&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          team&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;database&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;        annotations&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;          summary&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;PostgreSQL connections at {{ $value | humanize }}% on {{ $labels.instance }}&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;5. Is mysqld_exporter configured with the right user grants?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; USER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; &apos;&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;prometheus&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;&apos;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; IDENTIFIED &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BY&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;use-secret-manager-here&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; PROCESS, REPLICATION CLIENT, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;prometheus&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- For performance_schema access:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GRANT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; performance_schema.&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TO&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;prometheus&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FLUSH PRIVILEGES;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The exporter connects as this user. Grant only what the collectors actually need — not &lt;code&gt;SUPER&lt;/code&gt;.&lt;/p&gt;
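&lt;p&gt;One way to confirm the account holds only these privileges is to list its grants. This is a minimal check that assumes the &lt;code&gt;prometheus&lt;/code&gt; account created above; the output should show no &lt;code&gt;SUPER&lt;/code&gt; and no write privileges.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- List every privilege held by the exporter account.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW GRANTS FOR &apos;prometheus&apos;@&apos;%&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;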
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Set up database monitoring with Prometheus] --&gt; B[Install exporter]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C{Scope collectors}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|High-traffic OLTP| D[Enable: stat_activity, stat_statements, stat_bgwriter, stat_replication, locks]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|Analytics replica| E[Enable: stat_statements, replication_slot, database_size]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; F[Set scrape interval 30s]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; F&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G[Define recording rules for ratios]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; H[Build Grafana panels by operational question]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt; I{Alert rules}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|Define warning + critical| J[Set runbook URL on every alert]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt; K[Test alert with simulated failure in staging]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;core-grafana-panel-design&quot;&gt;Core Grafana Panel Design&lt;/h2&gt;
&lt;p&gt;Build panels that answer operational questions, not panels that display metrics.&lt;/p&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Question&lt;/th&gt;&lt;th&gt;Panel type&lt;/th&gt;&lt;th&gt;PromQL&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Is replica lag within SLO?&lt;/td&gt;&lt;td&gt;Gauge + threshold&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_replication_lag{instance=&quot;$instance&quot;}&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;How close are we to the connection limit?&lt;/td&gt;&lt;td&gt;Gauge + threshold&lt;/td&gt;&lt;td&gt;&lt;code&gt;postgres:connections_pct{instance=&quot;$instance&quot;}&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Which queries consume the most execution time right now?&lt;/td&gt;&lt;td&gt;Table&lt;/td&gt;&lt;td&gt;&lt;code&gt;topk(10, rate(pg_stat_statements_total_time[5m]))&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Is cache hit ratio healthy?&lt;/td&gt;&lt;td&gt;Time series&lt;/td&gt;&lt;td&gt;&lt;code&gt;postgres:cache_hit_ratio{instance=&quot;$instance&quot;}&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Which tables have the most dead tuples?&lt;/td&gt;&lt;td&gt;Bar chart&lt;/td&gt;&lt;td&gt;&lt;code&gt;topk(10, pg_stat_user_tables_n_dead_tup)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Is checkpoint behavior normal?&lt;/td&gt;&lt;td&gt;Time series&lt;/td&gt;&lt;td&gt;&lt;code&gt;rate(pg_stat_bgwriter_checkpoints_req[5m])&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;For MySQL:&lt;/p&gt;





























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Question&lt;/th&gt;&lt;th&gt;PromQL&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Replication lag&lt;/td&gt;&lt;td&gt;&lt;code&gt;mysql_slave_status_seconds_behind_master&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Threads running&lt;/td&gt;&lt;td&gt;&lt;code&gt;mysql_global_status_threads_running&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;InnoDB buffer pool wait&lt;/td&gt;&lt;td&gt;&lt;code&gt;rate(mysql_global_status_innodb_buffer_pool_wait_free[5m])&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow queries per second&lt;/td&gt;&lt;td&gt;&lt;code&gt;rate(mysql_global_status_slow_queries[5m])&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Open tables vs cache&lt;/td&gt;&lt;td&gt;&lt;code&gt;mysql_global_status_open_tables / mysql_global_variables_table_open_cache&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;If the exporter is causing database load:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Disable the problematic collector immediately: restart the exporter with &lt;code&gt;--no-collector.&amp;#x3C;name&gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;pg_stat_activity&lt;/code&gt; for exporter sessions with long durations.&lt;/li&gt;
&lt;li&gt;If scrapes are timing out, raise &lt;code&gt;scrape_timeout&lt;/code&gt; (it cannot exceed &lt;code&gt;scrape_interval&lt;/code&gt;) so Prometheus stops marking slow scrapes as failed.&lt;/li&gt;
&lt;li&gt;If the database is degraded, disable the exporter entirely and fall back to CloudWatch or basic OS metrics until the database is stable.&lt;/li&gt;
&lt;/ol&gt;
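&lt;p&gt;Step 2 above can be sketched as a query against &lt;code&gt;pg_stat_activity&lt;/code&gt;. This is a minimal sketch that assumes the exporter connects as a role named &lt;code&gt;postgres_exporter&lt;/code&gt;, which is a hypothetical name; substitute your own monitoring role.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Assumption: the exporter role is named &apos;postgres_exporter&apos;.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Lists the exporter&apos;s non-idle sessions, longest-running first.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - query_start AS running_for,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       left(query, 80)     AS query_preview&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE usename = &apos;postgres_exporter&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND state != &apos;idle&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY running_for DESC NULLS LAST;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A session that runs longer than the scrape interval usually identifies the collector to disable first.&lt;/p&gt;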
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dashboards as code&lt;/strong&gt;: store Grafana dashboard JSON in Git and provision it with Grafana’s file-based provisioning or Terraform. This prevents dashboard drift between environments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Exporter configuration templates&lt;/strong&gt;: manage &lt;code&gt;postgres_exporter&lt;/code&gt; configuration through a Helm chart or Ansible role with environment-specific variables. The monitoring role credentials and scrape endpoints should be provisioned through the same credential management pipeline as application secrets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alert rule testing&lt;/strong&gt;: use &lt;code&gt;promtool test rules&lt;/code&gt; to write unit tests for alert rules. Test that alerts fire correctly given synthetic metric data — before deploying the rules to production.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;promtool&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; test&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rules&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; tests/database_alerts_test.yaml&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;p&gt;Prometheus and Grafana database monitoring is operationally complete only when it has four properties: scoped collectors (only the metrics that back a panel or an alert), recording rules for derived metrics (so they are not recomputed on every dashboard load), alert rules with runbook links (not bare thresholds with no context), and tested alert coverage (simulated failures confirm that the alerts fire). An exporter that is installed but not tuned produces more cardinality than signal and slows Prometheus down at query time.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Exporter queries slow the database&lt;/td&gt;&lt;td&gt;Default collectors include expensive queries (e.g., bloat estimation)&lt;/td&gt;&lt;td&gt;Disable unused collectors; enable only what has dashboard panels&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Alert fires too often&lt;/td&gt;&lt;td&gt;Scrape interval is 15s and the &lt;code&gt;for&lt;/code&gt; window is 1m, so transient spikes trigger the alert&lt;/td&gt;&lt;td&gt;Increase the &lt;code&gt;for&lt;/code&gt; duration to 2–5 minutes so short spikes clear before the alert fires&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dashboard has 40 panels, no one knows what to look at&lt;/td&gt;&lt;td&gt;Metrics-first design instead of question-first&lt;/td&gt;&lt;td&gt;Redesign from operational questions, not metric availability&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Exporter loses database connection silently&lt;/td&gt;&lt;td&gt;PostgreSQL restart drops exporter connection; exporter does not reconnect&lt;/td&gt;&lt;td&gt;Alert on &lt;code&gt;pg_up == 0&lt;/code&gt;; use a Kubernetes liveness probe to restart the exporter&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Alert runbook link is dead&lt;/td&gt;&lt;td&gt;Wiki reorganized, link not updated&lt;/td&gt;&lt;td&gt;Store runbook URL as a configmap value; validate links in CI&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Database monitoring uses Prometheus but panels show raw metrics, not operational health.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Add recording rules for derived metrics, build question-first panels, and add alert rules with runbook URLs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Walk through an incident simulation: stall replication on one replica, verify that the lag alert fires once lag has stayed above 60s for the 2-minute &lt;code&gt;for&lt;/code&gt; window, and confirm the runbook link points to the correct procedure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; This week, define three recording rules (connection utilization, replica lag, cache hit ratio), create an alert for each at the critical threshold, and add a Grafana time series panel for each.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>Why pgcrypto Is Not a Full Key Management Strategy</title><link>https://rajivonai.com/blog/2024-08-26-why-pgcrypto-is-not-a-full-key-management-strategy/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-08-26-why-pgcrypto-is-not-a-full-key-management-strategy/</guid><description>PostgreSQL&apos;s pgcrypto is a cryptographic function library, not a key management system. Treating it as one guarantees your encryption keys will eventually leak.</description><pubDate>Mon, 26 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;PostgreSQL’s &lt;code&gt;pgcrypto&lt;/code&gt; is a cryptographic function library, not a key management system. Treating it as one guarantees that your encryption keys will eventually leak into your observability pipelines, rendering your entire encryption strategy mathematically irrelevant.&lt;/strong&gt; If your architecture relies on passing plaintext keys across a database connection, you do not have a key management strategy; you have a compliance illusion.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;When platform teams are tasked with implementing column-level encryption for PII, the path of least resistance is often PostgreSQL’s &lt;code&gt;pgcrypto&lt;/code&gt; extension. It ships with PostgreSQL as a contrib module, is easy to use, and requires no external infrastructure.&lt;/p&gt;




















&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Default approach&lt;/th&gt;&lt;th&gt;Better alternative&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Operating model&lt;/td&gt;&lt;td&gt;Use &lt;code&gt;pgcrypto&lt;/code&gt; to encrypt data within the database engine using keys passed in SQL&lt;/td&gt;&lt;td&gt;Use an external Key Management Service (KMS) to encrypt data in the application memory space&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failure mode&lt;/td&gt;&lt;td&gt;Keys are exposed in plaintext to the database process and observability tools&lt;/td&gt;&lt;td&gt;Keys are isolated in a dedicated IAM-governed control plane&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The fundamental flaw in using &lt;code&gt;pgcrypto&lt;/code&gt; for symmetric encryption (&lt;code&gt;pgp_sym_encrypt&lt;/code&gt;) is that the database engine itself must process the plaintext encryption key to execute the function.&lt;/p&gt;
&lt;p&gt;This exposes the key through several channels at once. &lt;code&gt;pgcrypto&lt;/code&gt; has no native integration with enterprise key management concepts like IAM, automated key rotation, or cryptographic audit trails. Worse, a key passed in the SQL string is visible to everything that sees the statement text: system views, server logs, and any tool that reads them.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Query Telemetry&lt;/td&gt;&lt;td&gt;The full statement text, key included, is visible in &lt;code&gt;pg_stat_activity&lt;/code&gt; while the query runs; &lt;code&gt;pg_stat_statements&lt;/code&gt; replaces constants with placeholders in most statements, but should still be audited&lt;/td&gt;&lt;td&gt;Any engineer or tool with read access to system views can steal the keys&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow Query Logs&lt;/td&gt;&lt;td&gt;Statements that exceed &lt;code&gt;log_min_duration_statement&lt;/code&gt; are written to the server log with their literal text, key included&lt;/td&gt;&lt;td&gt;Keys leak into external log aggregators like Datadog, Splunk, or CloudWatch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replication Streams&lt;/td&gt;&lt;td&gt;Statement-based replication or query-mirroring tools forward the raw SQL (PostgreSQL’s built-in logical replication ships row changes, not SQL)&lt;/td&gt;&lt;td&gt;Downstream consumer databases and data warehouses inadvertently receive the keys&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core architectural question is this: How do we perform column-level encryption without ever exposing the plaintext encryption key to the database’s execution engine or its telemetry pipelines?&lt;/p&gt;
&lt;h2 id=&quot;the-implementation&quot;&gt;The Implementation&lt;/h2&gt;
&lt;p&gt;The solution is to deprecate the use of &lt;code&gt;pgcrypto&lt;/code&gt; for sensitive, high-value data entirely, replacing it with an external Key Management Service (KMS) architecture.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[&quot;Application Service&quot;] --&gt;|1. Fetch Key| B[&quot;Cloud KMS&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|2. Return Key| A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt;|3. Encrypt in Memory| A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt;|4. Execute INSERT| C[&quot;PostgreSQL Database&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|5. Telemetry| D[&quot;pg_stat_statements&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Move encryption to the application compute layer.&lt;/strong&gt;&lt;br&gt;
The application fetches the encryption key from a secure vault (e.g., AWS KMS, HashiCorp Vault).&lt;br&gt;
Confirm: The key exists only in the volatile memory of the application process.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Encrypt the payload before constructing the SQL statement.&lt;/strong&gt;&lt;br&gt;
The application performs the encryption locally.&lt;br&gt;
Confirm: The SQL statement constructed by the ORM or query builder contains only the ciphertext.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execute the query against PostgreSQL.&lt;/strong&gt;&lt;br&gt;
The database receives an &lt;code&gt;INSERT&lt;/code&gt; or &lt;code&gt;UPDATE&lt;/code&gt; containing pure ciphertext.&lt;br&gt;
Confirm: When this query is logged in &lt;code&gt;pg_stat_activity&lt;/code&gt; or shipped to Datadog via a slow query log, no plaintext keys are present in the SQL string.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
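&lt;p&gt;What the database sees in step 3 can be sketched as follows. This is a minimal illustration, not a full implementation: the &lt;code&gt;customers&lt;/code&gt; table, the &lt;code&gt;insert_customer&lt;/code&gt; statement name, and the hex bytes are hypothetical stand-ins for ciphertext the application has already produced.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table: the column stores ciphertext produced by the application.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE customers (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    id             bigserial PRIMARY KEY,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ssn_ciphertext bytea NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- The statement text contains a placeholder, never a key or plaintext.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;PREPARE insert_customer (bytea) AS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    INSERT INTO customers (ssn_ciphertext) VALUES ($1);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Placeholder bytes standing in for real ciphertext from the application.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXECUTE insert_customer (decode(&apos;9f3c2a41&apos;, &apos;hex&apos;));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In a real service the driver binds the ciphertext as a parameter. Either way, the only secret-derived value that reaches the server is ciphertext.&lt;/p&gt;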
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;A common step in maturing database security is to ban inline key passing in SQL across the organization.&lt;/p&gt;
&lt;p&gt;Context: Consider a platform team troubleshooting performance issues. They enable &lt;code&gt;pg_stat_statements&lt;/code&gt; to track query execution times.&lt;/p&gt;
&lt;p&gt;Action: &lt;code&gt;pg_stat_statements&lt;/code&gt; normally replaces literal constants with placeholders, but the slow query log enabled alongside it, and &lt;code&gt;pg_stat_activity&lt;/code&gt; while the statement runs, record the raw string. Queries like &lt;code&gt;SELECT pgp_sym_encrypt(&apos;user_ssn&apos;, &apos;super_secret_key&apos;);&lt;/code&gt; are therefore captured with the key intact.&lt;/p&gt;
&lt;p&gt;Result: The encryption key (&lt;code&gt;super_secret_key&lt;/code&gt;) is now stored wherever those logs are retained. If the logs are shipped to a centralized logging vendor, the key has left your infrastructure perimeter. The encryption is entirely compromised.&lt;/p&gt;
&lt;p&gt;Learning: Cryptographic keys must never traverse the same network boundary or reside in the same system views as the data they are protecting. The database cannot be trusted to keep a secret that it must also use to parse a query.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Infrastructure Complexity&lt;/td&gt;&lt;td&gt;Developers need to encrypt data locally during testing&lt;/td&gt;&lt;td&gt;Provide local KMS emulators (e.g., LocalStack) or deterministic dev-only keys in Docker Compose&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Application CPU Load&lt;/td&gt;&lt;td&gt;Shifting encryption from the database to the application spikes app-tier CPU&lt;/td&gt;&lt;td&gt;Ensure application containers run on hosts with AES-NI hardware acceleration and that the crypto library uses it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Legacy Codebases&lt;/td&gt;&lt;td&gt;Millions of lines of code currently rely on &lt;code&gt;pgcrypto&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Run a phased migration at the ORM layer, column by column, re-encrypting existing rows in batches&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Treating &lt;code&gt;pgcrypto&lt;/code&gt; as a key management system inevitably leaks plaintext encryption keys into logs, metrics, and replication streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Shift the cryptographic workload out of the database and into the application layer using a dedicated KMS.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A query captured in a Datadog slow query log will only show the ciphertext payload, keeping the encryption key entirely out of the observability pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Audit your &lt;code&gt;pg_stat_statements&lt;/code&gt; and slow query logs today. Search for the string &lt;code&gt;pgp_sym_encrypt&lt;/code&gt; to determine whether your keys are being leaked to your logging vendors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your encryption strategy relies on hoping that nobody looks too closely at your query logs, it is time to redesign your key management architecture.&lt;/p&gt;</content:encoded><category>databases</category><category>security</category><category>failures</category></item><item><title>PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans</title><link>https://rajivonai.com/blog/2024-08-20-postgresql-observability-vacuum-bloat-locks/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-08-20-postgresql-observability-vacuum-bloat-locks/</guid><description>Monitoring PostgreSQL requires looking past the operating system and into the internal bookkeeping of MVCC, autovacuum, and replication streams.</description><pubDate>Tue, 20 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If you treat PostgreSQL like a black box that only consumes CPU and Memory, you will eventually be crushed by the invisible weight of its MVCC architecture.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s Multi-Version Concurrency Control (MVCC) is powerful, but it requires continuous internal maintenance. Every &lt;code&gt;UPDATE&lt;/code&gt; writes a new row version and leaves the old one behind, and every &lt;code&gt;DELETE&lt;/code&gt; leaves the deleted row in place. Once no transaction can still see them, those old versions are “dead tuples.” The &lt;code&gt;autovacuum&lt;/code&gt; daemon must eventually clean up these dead tuples to prevent table bloat and transaction ID wraparound.&lt;/p&gt;
&lt;p&gt;When teams migrate to PostgreSQL from other database engines, they often bring their generic monitoring dashboards with them. They alert on CPU spikes or memory exhaustion. But in PostgreSQL, the most dangerous failures are silent. A long-running transaction holds a lock for too long, replication falls silently behind, or autovacuum is misconfigured and cannot keep up with heavily updated tables. By the time these issues manifest as CPU spikes, the database is already deeply unhealthy.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;p&gt;A failing PostgreSQL instance leaves distinct operational footprints before it fully collapses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Bloat Spiral:&lt;/strong&gt; Queries that used to return in milliseconds now take seconds. The table size on disk has doubled, but the actual row count hasn’t changed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Stale Stats Fallacy:&lt;/strong&gt; The query planner suddenly switches from a fast Index Scan to a catastrophic Sequential Scan because the table statistics are out of date.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Lock Cascade:&lt;/strong&gt; Application monitoring shows massive latency spikes across unrelated endpoints because a long-running reporting query is holding an &lt;code&gt;AccessShareLock&lt;/code&gt; that blocks an &lt;code&gt;AccessExclusiveLock&lt;/code&gt; requested by a schema migration, which in turn blocks all subsequent &lt;code&gt;SELECT&lt;/code&gt; queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication Desync:&lt;/strong&gt; The primary database is healthy, but read-heavy applications serving from replicas are displaying data that is five minutes old.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;p&gt;When a PostgreSQL incident begins, these are the queries and metrics you must check first:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check for Blocking Sessions (&lt;code&gt;pg_locks&lt;/code&gt; via &lt;code&gt;pg_blocking_pids()&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- pg_blocking_pids() (PostgreSQL 9.6+) applies the lock-conflict rules itself;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- joining pg_locks to itself on locktype alone reports false blockers&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocked_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       blocking_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       blocked_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocked_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       blocking_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; pg_catalog&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pg_stat_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocked_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; pg_catalog&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pg_stat_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ANY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (pg_blocking_pids(blocked_activity.pid));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check Dead Tuples and Autovacuum Status (&lt;code&gt;pg_stat_user_tables&lt;/code&gt;):&lt;/strong&gt;
Look at &lt;code&gt;n_dead_tup&lt;/code&gt; vs &lt;code&gt;n_live_tup&lt;/code&gt;. Check &lt;code&gt;last_autovacuum&lt;/code&gt; to see if the daemon is actually completing its work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check Replication Lag (&lt;code&gt;pg_stat_replication&lt;/code&gt;):&lt;/strong&gt;
Compare &lt;code&gt;pg_current_wal_lsn()&lt;/code&gt; with the &lt;code&gt;replay_lsn&lt;/code&gt; of the standby to calculate the byte lag.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Identify Long-Running Transactions (&lt;code&gt;pg_stat_activity&lt;/code&gt;):&lt;/strong&gt;
Transactions sitting in &lt;code&gt;idle in transaction&lt;/code&gt; for hours are holding locks and preventing dead tuples from being vacuumed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Examine Query Plan Regressions (&lt;code&gt;pg_stat_statements&lt;/code&gt;):&lt;/strong&gt;
If a specific query is suddenly slow, use &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; to see whether it is executing a sequential scan due to stale statistics. &lt;code&gt;ANALYZE&lt;/code&gt; runs the statement for real, so wrap data-modifying queries in a transaction that you roll back.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
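&lt;p&gt;The second and third checks can be sketched as the queries below. This is a minimal sketch, assuming PostgreSQL 10 or later (for &lt;code&gt;pg_current_wal_lsn()&lt;/code&gt; and &lt;code&gt;replay_lsn&lt;/code&gt;) and that the replication query runs on the primary. The &lt;code&gt;LIMIT 20&lt;/code&gt; cutoff is an arbitrary choice for illustration, not a recommendation.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check 2: tables with the most dead tuples, and when autovacuum last finished&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       last_autovacuum&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY n_dead_tup DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check 3: replay lag in bytes for each standby (run on the primary)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       client_addr,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A table with a high &lt;code&gt;dead_pct&lt;/code&gt; and an old or empty &lt;code&gt;last_autovacuum&lt;/code&gt; is the one to look at first.&lt;/p&gt;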
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;p&gt;When diagnosing sudden latency in PostgreSQL, the triage path branches quickly based on locks vs. load.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Latency Spike Detected] --&gt; B{Are there blocking sessions?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Yes| C[Identify Blocking PID]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; C1{Is the blocker idle in transaction?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|Yes| C2[Terminate Blocker]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|No| C3[Evaluate Impact: Terminate or Wait]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|No| D{Are queries using Sequential Scans?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|Yes| D1[Check n_dead_tup]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D1 --&gt;|High| D2[Run VACUUM ANALYZE manually]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D1 --&gt;|Low| D3[Update pg_statistic via ANALYZE]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|No| E[Check Connection Pool]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; E1[If saturated, increase pool size or shed load]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Kill the Blocking Session (Fast, Disruptive):&lt;/strong&gt;
Calling &lt;code&gt;pg_terminate_backend(pid)&lt;/code&gt; ends the session; its open transaction rolls back and its locks are released.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; The terminated application transaction will fail and must be retried.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
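&lt;p&gt;&lt;strong&gt;Manual &lt;code&gt;VACUUM ANALYZE&lt;/code&gt; (Medium Speed, High I/O):&lt;/strong&gt;
If a table has massive bloat and stale stats, a manual &lt;code&gt;VACUUM ANALYZE&lt;/code&gt; marks dead tuples as reusable space and refreshes the planner statistics. It does not shrink the table file on disk.&lt;/p&gt;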
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; This generates significant disk I/O and can degrade performance further while it runs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tuning &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; (Slow, Permanent Fix):&lt;/strong&gt;
If large tables are never being vacuumed, lower the scale factor for those specific tables using &lt;code&gt;ALTER TABLE ... SET (autovacuum_vacuum_scale_factor = 0.01)&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; Requires understanding the write velocity of the specific table to tune correctly.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
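&lt;p&gt;The first and third options might look like this in practice. This is a sketch under stated assumptions: PostgreSQL 9.6 or later for &lt;code&gt;pg_blocking_pids()&lt;/code&gt;, and &lt;code&gt;orders&lt;/code&gt; as a hypothetical table name. Run the first statement as a plain &lt;code&gt;SELECT pid&lt;/code&gt; before adding the terminate call, so you can see who would be affected.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Option 1: terminate only the blockers that are idle in transaction.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Their open transactions roll back and the application must retry.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid, pg_terminate_backend(pid) AS terminated&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE state = &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND pid IN (SELECT unnest(pg_blocking_pids(waiting.pid))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;              FROM pg_stat_activity AS waiting);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Option 3: vacuum a large, hot table after roughly 1% of its rows change.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- &quot;orders&quot; is a hypothetical table name.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.01);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Confirm the per-table override was stored&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT relname, reloptions FROM pg_class WHERE relname = &apos;orders&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;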
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;If you execute a manual &lt;code&gt;VACUUM FULL&lt;/code&gt; attempting to reclaim disk space, remember that it takes an &lt;code&gt;AccessExclusiveLock&lt;/code&gt; on the entire table. If this blocks production traffic unexpectedly, the rollback plan is to immediately cancel the &lt;code&gt;VACUUM FULL&lt;/code&gt; command. PostgreSQL will safely release the lock and revert to the previous state, though no space will have been reclaimed.&lt;/p&gt;
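&lt;p&gt;Cancelling from a second session can be sketched as follows. The &lt;code&gt;ILIKE&lt;/code&gt; pattern is a heuristic that assumes the command text begins with &lt;code&gt;VACUUM&lt;/code&gt; and mentions &lt;code&gt;FULL&lt;/code&gt;; confirm the pid it returns before relying on it.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Find the session running VACUUM FULL and send it a cancel request&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid, query, pg_cancel_backend(pid) AS cancel_sent&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE state = &apos;active&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND query ILIKE &apos;vacuum%full%&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- never match the session issuing this query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND pid != pg_backend_pid();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;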
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Deploy an agent or cron job that explicitly alerts on “Transactions older than 1 hour” and “Idle in transaction older than 15 minutes.” These are almost always application bugs (leaked connections), and they are a leading cause of autovacuum failing to clean up dead tuples.&lt;/p&gt;
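&lt;p&gt;A query the agent or cron job could run is sketched below. It assumes PostgreSQL 10 or later (for the &lt;code&gt;backend_type&lt;/code&gt; column); any returned row means one of the two alerts should fire. The column aliases are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sessions with a transaction older than 1 hour,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- or idle in transaction for more than 15 minutes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - xact_start   AS xact_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - state_change AS time_in_state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       left(query, 80)      AS last_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE backend_type = &apos;client backend&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND xact_start IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND (now() - xact_start &gt; interval &apos;1 hour&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       OR (state = &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;           AND now() - state_change &gt; interval &apos;15 minutes&apos;))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY xact_start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;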
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vacuum is a Feature, Not a Chore:&lt;/strong&gt; Do not disable or restrict autovacuum. If it is consuming too much I/O, tune it to run more frequently but less aggressively.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert on the Right Metrics:&lt;/strong&gt; Stop alerting purely on CPU. Alert on replication lag, connection saturation, and long-running locks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor Query Plans:&lt;/strong&gt; Use &lt;code&gt;pg_stat_statements&lt;/code&gt; to track the average execution time of your top queries to catch regressions before they cause outages.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; PostgreSQL’s most dangerous failures — bloat spirals, lock cascades, replication desync — are invisible on CPU and memory dashboards until the database is already deeply unhealthy. By the time CPU spikes from bloat, the table has been unvacuumed long enough to cause query plan regressions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Add lock chain detection, dead tuple ratio, replication byte lag, and long transaction age as continuously scraped metrics alongside host metrics — these are the leading indicators CPU can never provide.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Introduce a sleeping &lt;code&gt;idle in transaction&lt;/code&gt; connection in staging and verify that it trips the “Idle in transaction older than 15 minutes” alert before it blocks a schema migration. If the alert doesn’t fire, the monitoring gap is real.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Add &lt;code&gt;SET lock_timeout = &apos;5s&apos;&lt;/code&gt; to the top of every schema migration script this sprint, and create a Grafana panel tracking &lt;code&gt;n_dead_tup / (n_live_tup + n_dead_tup)&lt;/code&gt; per table to catch bloat before it affects query plans.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category></item><item><title>Database Alert Design: Thresholds That Fire on Real Problems</title><link>https://rajivonai.com/blog/2024-08-12-database-alert-design-thresholds-that-fire-on-real-problems/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-08-12-database-alert-design-thresholds-that-fire-on-real-problems/</guid><description>How to set database alert thresholds that catch real failures without burning the team on autovacuum noise, checkpoint churn, and replication lag spikes — with specific values for PostgreSQL, MySQL, and Aurora.</description><pubDate>Mon, 12 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Most database alert fatigue comes from thresholds set to catch anything unusual rather than thresholds calibrated to actual user impact. An alert that fires on every autovacuum run, every checkpoint, and every 5-second replica lag spike will be silenced by engineers within a week — and then the real incidents will go unnoticed.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Database teams accumulate alerts in one of two ways: copy default thresholds from the monitoring tool’s out-of-box configuration, or set thresholds after an incident when the previous absence of an alert was painful. Both approaches produce the wrong result.&lt;/p&gt;
&lt;p&gt;Default thresholds are calibrated for visibility, not signal quality. They generate enough noise that teams learn to ignore them. Incident-driven thresholds overfit to a specific failure pattern and miss adjacent ones.&lt;/p&gt;
&lt;p&gt;The right design is a two-level alert architecture: a warning level that gives the team early signal and time to investigate, and a critical level that triggers paging because user impact is already occurring or imminent.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Symptom in the alert system&lt;/th&gt;&lt;th&gt;What it usually means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Alert fired, no incident found&lt;/td&gt;&lt;td&gt;Threshold is at wrong level or condition is transient and self-resolving&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Alert fired after users already complained&lt;/td&gt;&lt;td&gt;Threshold is too high or measurement resolution is too low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Same alert fires daily at the same time&lt;/td&gt;&lt;td&gt;Normal batch job or backup window — suppress or add time-based exclusion&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Alert never fires in production&lt;/td&gt;&lt;td&gt;Either system is very healthy, or threshold is too permissive&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multiple alerts fire at once for the same root cause&lt;/td&gt;&lt;td&gt;Missing alert correlation — downstream symptoms of a single root cause&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;p&gt;Before setting any threshold, measure the baseline over 7 days on the production workload.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. What is the normal replica lag distribution?&lt;/strong&gt;&lt;/p&gt;
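&lt;p&gt;Collect &lt;code&gt;replay_lag&lt;/code&gt; from &lt;code&gt;pg_stat_replication&lt;/code&gt; (PostgreSQL) or &lt;code&gt;Seconds_Behind_Master&lt;/code&gt; (MySQL; named &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; in &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt; from 8.0.22 onward) every 60 seconds for 7 days. Identify:&lt;/p&gt;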
&lt;ul&gt;
&lt;li&gt;Median lag during business hours&lt;/li&gt;
&lt;li&gt;95th percentile lag during peak write periods&lt;/li&gt;
&lt;li&gt;Maximum lag during known batch jobs or backups&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Set the warning threshold at 2× the 95th percentile peak. Set the critical threshold at the staleness your application can no longer tolerate on replica reads: typically 60–120 seconds for OLTP and 5–15 minutes for analytics.&lt;/p&gt;
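&lt;p&gt;The baseline arithmetic can be sketched in SQL. This assumes PostgreSQL 10 or later (for &lt;code&gt;replay_lag&lt;/code&gt;) and a hypothetical table &lt;code&gt;replica_lag_samples (sampled_at timestamptz, lag_seconds double precision)&lt;/code&gt; that your collector fills every 60 seconds. For brevity it computes one baseline over all hours; splitting business hours from batch windows would need an extra filter on &lt;code&gt;sampled_at&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Collector query, run every 60 seconds on the primary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       EXTRACT(EPOCH FROM replay_lag) AS lag_seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- 7-day baseline from the hypothetical samples table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT percentile_cont(0.5)  WITHIN GROUP (ORDER BY lag_seconds) AS median_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       percentile_cont(0.95) WITHIN GROUP (ORDER BY lag_seconds) AS p95_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       max(lag_seconds) AS max_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;       -- warning threshold = 2x the 95th percentile&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       2 * percentile_cont(0.95) WITHIN GROUP (ORDER BY lag_seconds) AS warning_threshold&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM replica_lag_samples&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE sampled_at &gt; now() - interval &apos;7 days&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;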
&lt;p&gt;&lt;strong&gt;2. What is the normal connection utilization pattern?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL: connections used vs max&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; used,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; setting::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;int&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_settings &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; name&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;max_connections&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; max_conn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;             (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; setting::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;int&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_settings &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; name&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;max_connections&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pct_used&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Exclude background processes (backend_type exists in PostgreSQL 10+)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_type &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;client backend&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Measure this every minute over 7 days. Alert at 70% (warning — time to investigate pool settings) and 85% (critical — application will soon see connection errors).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. What does checkpoint behavior look like during normal operations?&lt;/strong&gt;&lt;/p&gt;
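&lt;p&gt;From &lt;code&gt;pg_stat_bgwriter&lt;/code&gt;, collect &lt;code&gt;checkpoints_req&lt;/code&gt; over time (on PostgreSQL 17 and later the counters live in &lt;code&gt;pg_stat_checkpointer&lt;/code&gt; as &lt;code&gt;num_requested&lt;/code&gt; and &lt;code&gt;num_timed&lt;/code&gt;). The counter is cumulative, so track its increase per interval rather than its absolute value. Ideally it does not move and all checkpoints are &lt;code&gt;checkpoints_timed&lt;/code&gt;. Any increase in &lt;code&gt;checkpoints_req&lt;/code&gt; over a 5-minute period that is not explained by a manual &lt;code&gt;CHECKPOINT&lt;/code&gt; or a base backup means write pressure is forcing early checkpoints. Alert when &lt;code&gt;checkpoints_req&lt;/code&gt; keeps increasing for more than 3 consecutive minutes.&lt;/p&gt;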
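&lt;p&gt;A snapshot query for this check might look like the sketch below. It reads cumulative counters, so the alert itself should compare two successive snapshots; the &lt;code&gt;pct_requested&lt;/code&gt; column is only a quick sanity check since the last &lt;code&gt;stats_reset&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL 16 and earlier; on 17+ read num_timed and num_requested&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- from pg_stat_checkpointer instead&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT checkpoints_timed,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       checkpoints_req,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       round(100.0 * checkpoints_req&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;             / NULLIF(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       stats_reset&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;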
&lt;p&gt;&lt;strong&gt;4. What is the slow query baseline?&lt;/strong&gt;&lt;/p&gt;
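&lt;p&gt;Enable &lt;code&gt;pg_stat_statements&lt;/code&gt; and measure the typical duration of your top 20 query types over 7 days. The view reports mean, minimum, maximum, and standard deviation per normalized query rather than percentiles, so a 95th percentile has to be approximated from periodic snapshots or taken from query logs or APM traces. Use this to set application-specific slow query thresholds, not a global “any query over 1 second” rule, which fires on legitimate analytical queries.&lt;/p&gt;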
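&lt;p&gt;One way to pull the top 20 query types is sketched here. It assumes the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension is installed and uses the PostgreSQL 13+ column names; ranking by total time is one reasonable choice, not the only one.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Top 20 normalized queries by total execution time.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Before PostgreSQL 13 the columns are total_time, mean_time, stddev_time, max_time.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT queryid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       calls,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       round(mean_exec_time::numeric, 2)   AS mean_ms,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       round(stddev_exec_time::numeric, 2) AS stddev_ms,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       round(max_exec_time::numeric, 2)    AS max_ms,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       left(query, 80)                     AS query_sample&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_statements&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY total_exec_time DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;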
&lt;p&gt;&lt;strong&gt;5. What does disk growth look like?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Measure database disk usage daily for 30 days and compute the trend. Alert when the projected exhaustion date (at the current growth rate) falls within 14 days. This is a warning. A critical alert triggers when the projected exhaustion falls within 3 days or when a sudden disk spike exceeds the 30-day average growth by 5×.&lt;/p&gt;
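&lt;p&gt;The projection can be sketched with a linear fit. This assumes a hypothetical table &lt;code&gt;disk_usage_samples (sampled_on date, used_bytes bigint)&lt;/code&gt; with one row per day, and a placeholder capacity of 500 GiB that you would replace with the real volume size. A linear trend is itself an assumption; bursty growth needs a different model.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WITH trend AS (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    -- Least-squares slope of usage against day number = bytes added per day&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    SELECT regr_slope(used_bytes::double precision,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                      (sampled_on - DATE &apos;2000-01-01&apos;)::double precision) AS bytes_per_day&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    FROM disk_usage_samples&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WHERE sampled_on &gt;= current_date - 30&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;latest AS (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    -- Most recent usage sample&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    SELECT used_bytes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    FROM disk_usage_samples&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ORDER BY sampled_on DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    LIMIT 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT trend.bytes_per_day,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;       -- 500 GiB is a placeholder capacity; a zero slope yields NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       (500.0 * 1024 * 1024 * 1024 - latest.used_bytes)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;           / NULLIF(trend.bytes_per_day, 0) AS days_until_full&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM trend, latest;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Warn when &lt;code&gt;days_until_full&lt;/code&gt; drops below 14 and page when it drops below 3; a negative value means usage is shrinking.&lt;/p&gt;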
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Alert fires] --&gt; B{User impact?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Users already reporting issues| C[Critical — escalate to on-call]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|No user reports| D{Trending toward impact?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|Yes — within SLO window| E[Warning — investigate now]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|No — transient spike| F{Is this a known pattern?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|Yes — batch job, backup, maintenance| G[Suppress for this window — add schedule exclusion]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|No — unexpected| H[Investigate root cause — check pg_stat_activity and slow query log]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt; I{Root cause identified?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|Yes| J[Fix or tune threshold — document the baseline]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|No| K[Escalate with evidence package — query plans, metrics window, server log]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;alert-thresholds-reference&quot;&gt;Alert Thresholds Reference&lt;/h2&gt;
&lt;h3 id=&quot;postgresql&quot;&gt;PostgreSQL&lt;/h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Warning&lt;/th&gt;&lt;th&gt;Critical&lt;/th&gt;&lt;th&gt;Notes&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Replica lag&lt;/td&gt;&lt;td&gt;60s&lt;/td&gt;&lt;td&gt;300s&lt;/td&gt;&lt;td&gt;Use &lt;code&gt;replay_lag&lt;/code&gt;; adjust for batch job windows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Connection utilization&lt;/td&gt;&lt;td&gt;70% of &lt;code&gt;max_connections&lt;/code&gt;&lt;/td&gt;&lt;td&gt;85%&lt;/td&gt;&lt;td&gt;Count only non-idle sessions for more accurate signal&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;checkpoints_req&lt;/code&gt; increase&lt;/td&gt;&lt;td&gt;&gt; 0 for 3 min&lt;/td&gt;&lt;td&gt;&gt; 0 for 10 min&lt;/td&gt;&lt;td&gt;Any forced checkpoint means write pressure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dead tuple ratio&lt;/td&gt;&lt;td&gt;20% on tables &gt; 100k rows&lt;/td&gt;&lt;td&gt;40%&lt;/td&gt;&lt;td&gt;Per-table alert, not global&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cache hit ratio&lt;/td&gt;&lt;td&gt;&amp;#x3C; 97%&lt;/td&gt;&lt;td&gt;&amp;#x3C; 90%&lt;/td&gt;&lt;td&gt;Monitor &lt;code&gt;pg_statio_user_tables&lt;/code&gt; hits vs reads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Table bloat (relation size growth)&lt;/td&gt;&lt;td&gt;2× expected&lt;/td&gt;&lt;td&gt;3× expected&lt;/td&gt;&lt;td&gt;Compare against 30-day baseline&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Long-running query&lt;/td&gt;&lt;td&gt;&gt; 60s&lt;/td&gt;&lt;td&gt;&gt; 300s&lt;/td&gt;&lt;td&gt;OLTP threshold; analytical systems need separate policy&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Idle-in-transaction session&lt;/td&gt;&lt;td&gt;&gt; 5 min&lt;/td&gt;&lt;td&gt;&gt; 15 min&lt;/td&gt;&lt;td&gt;Per-session duration, not aggregate count&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pg_replication_slots&lt;/code&gt; retained WAL&lt;/td&gt;&lt;td&gt;100 MB&lt;/td&gt;&lt;td&gt;1 GB&lt;/td&gt;&lt;td&gt;Inactive replication slots block WAL cleanup&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
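&lt;p&gt;Two of the rows above, replication slot lag and cache hit ratio, can be read with queries along these lines. This is a sketch, assuming PostgreSQL 10 or later and that the slot query runs on the primary (&lt;code&gt;pg_current_wal_lsn()&lt;/code&gt; is not available during recovery).&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- WAL retained on behalf of each replication slot (run on the primary)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT slot_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Heap cache hit ratio across all user tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT round(100.0 * sum(heap_blks_hit)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;             / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2) AS cache_hit_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_statio_user_tables;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;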
&lt;h3 id=&quot;mysql--aurora-mysql&quot;&gt;MySQL / Aurora MySQL&lt;/h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Warning&lt;/th&gt;&lt;th&gt;Critical&lt;/th&gt;&lt;th&gt;Notes&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Seconds_Behind_Master&lt;/code&gt;&lt;/td&gt;&lt;td&gt;30s&lt;/td&gt;&lt;td&gt;120s&lt;/td&gt;&lt;td&gt;Named &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; from MySQL 8.0.22; use the Aurora replica lag metric in CloudWatch for Aurora&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Threads_connected&lt;/code&gt;&lt;/td&gt;&lt;td&gt;70% of &lt;code&gt;max_connections&lt;/code&gt;&lt;/td&gt;&lt;td&gt;85%&lt;/td&gt;&lt;td&gt;&lt;code&gt;Threads_running&lt;/code&gt; spike is the lead indicator&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Innodb_buffer_pool_wait_free&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&gt; 0 per 5 min&lt;/td&gt;&lt;td&gt;&gt; 100 per 5 min&lt;/td&gt;&lt;td&gt;No free buffer pool pages available: memory pressure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Innodb_log_waits&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&gt; 0 per 5 min&lt;/td&gt;&lt;td&gt;&gt; 10 per 5 min&lt;/td&gt;&lt;td&gt;Log buffer too small, so writes waited for a flush: write throughput exceeded&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow query rate&lt;/td&gt;&lt;td&gt;2× 7-day average&lt;/td&gt;&lt;td&gt;5× 7-day average&lt;/td&gt;&lt;td&gt;Rate, not absolute count&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Open_tables&lt;/code&gt;&lt;/td&gt;&lt;td&gt;80% of &lt;code&gt;table_open_cache&lt;/code&gt;&lt;/td&gt;&lt;td&gt;95%&lt;/td&gt;&lt;td&gt;Too-small cache causes repeated table opens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Lock wait timeout&lt;/td&gt;&lt;td&gt;&gt; 5 per minute&lt;/td&gt;&lt;td&gt;&gt; 20 per minute&lt;/td&gt;&lt;td&gt;High contention; check for hot rows or large transactions&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
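&lt;p&gt;Most of the status counters in this table can be read in one pass. The sketch below assumes MySQL 5.7 or later, where &lt;code&gt;performance_schema.global_status&lt;/code&gt; is available. The &lt;code&gt;Innodb_*&lt;/code&gt; values are cumulative since server start, so the “per 5 min” thresholds apply to the difference between two samples.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Current values of the status counters referenced in the table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT VARIABLE_NAME, VARIABLE_VALUE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM performance_schema.global_status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE VARIABLE_NAME IN (&apos;Threads_connected&apos;, &apos;Threads_running&apos;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                        &apos;Innodb_buffer_pool_wait_free&apos;, &apos;Innodb_log_waits&apos;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                        &apos;Open_tables&apos;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Limits that the percentage thresholds are measured against&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT @@max_connections  AS max_connections,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       @@table_open_cache AS table_open_cache;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;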
&lt;h3 id=&quot;aurora-postgresql--aurora-mysql-cloudwatch-specific&quot;&gt;Aurora PostgreSQL / Aurora MySQL (CloudWatch-specific)&lt;/h3&gt;















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;CloudWatch metric&lt;/th&gt;&lt;th&gt;Warning&lt;/th&gt;&lt;th&gt;Critical&lt;/th&gt;&lt;th&gt;Notes&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;ReplicaLag&lt;/code&gt;&lt;/td&gt;&lt;td&gt;30s&lt;/td&gt;&lt;td&gt;120s&lt;/td&gt;&lt;td&gt;Distinct from standard PostgreSQL; checked via CloudWatch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;DatabaseConnections&lt;/code&gt;&lt;/td&gt;&lt;td&gt;70% of instance max&lt;/td&gt;&lt;td&gt;85%&lt;/td&gt;&lt;td&gt;Per-instance limit, check RDS parameter group&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;FreeStorageSpace&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&amp;#x3C; 20 GB or &amp;#x3C; 20%&lt;/td&gt;&lt;td&gt;&amp;#x3C; 5 GB&lt;/td&gt;&lt;td&gt;Aurora storage auto-scales but billing changes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;AuroraVolumeBytesLeftTotal&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&amp;#x3C; 10 TB&lt;/td&gt;&lt;td&gt;&amp;#x3C; 1 TB&lt;/td&gt;&lt;td&gt;Aurora 128 TB storage ceiling&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;WriteIOPS&lt;/code&gt;&lt;/td&gt;&lt;td&gt;2× 7-day P95&lt;/td&gt;&lt;td&gt;5× 7-day P95&lt;/td&gt;&lt;td&gt;Sudden IOPS spike — check for bulk loads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;EngineUptime&lt;/code&gt;&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;Unexpected reset&lt;/td&gt;&lt;td&gt;Unexpected restart — check for OOM or crash&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;If a threshold change causes alert fatigue or misses a real incident:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Revert to the previous threshold immediately and document the direction of failure (too sensitive vs. too permissive).&lt;/li&gt;
&lt;li&gt;Collect a 7-day baseline at the previous threshold before making another change.&lt;/li&gt;
&lt;li&gt;For critical alerts, always test in staging with a simulated failure scenario before applying to production.&lt;/li&gt;
&lt;li&gt;Keep a changelog of threshold changes with the justification and the measurement that motivated each change.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Alert routing automation that reduces toil:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Batch job suppression&lt;/strong&gt;: automatically suppress replica lag alerts during known ETL windows (e.g., 01:00–04:00 UTC) and backup windows. Log every suppression rather than dropping alerts silently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alert correlation&lt;/strong&gt;: when connection exhaustion and slow query alerts fire within 5 minutes of each other, group them into a single incident with both signals attached. The root cause is almost always the same event.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Baseline drift detection&lt;/strong&gt;: a weekly job that checks whether current metric values have shifted persistently away from the baseline used to set thresholds 30 days ago. If p95 sits consistently above the warning threshold, the baseline has moved: either the system is degrading or the workload has grown.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;p&gt;Database alert reliability is a trust problem as much as a technical one. Teams stop responding to alerts that have false-positive rates above 20%. The two-level architecture (warning = investigate, critical = page) with calibrated per-metric thresholds keeps signal quality high enough that critical alerts are taken seriously. The measurement-first approach — setting thresholds from 7-day baselines rather than intuition — produces thresholds that reflect actual system behavior, not guesses.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Threshold set without baseline&lt;/td&gt;&lt;td&gt;Alert fires on normal workload variation&lt;/td&gt;&lt;td&gt;Measure 7-day baseline before setting any threshold&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Global slow query threshold&lt;/td&gt;&lt;td&gt;Legitimate analytics queries fire alert constantly&lt;/td&gt;&lt;td&gt;Per-query-class thresholds or separate analytics monitoring policy&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Alert on every autovacuum&lt;/td&gt;&lt;td&gt;autovacuum is working correctly but noisy&lt;/td&gt;&lt;td&gt;Alert on dead tuple ratio, not autovacuum event frequency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Missing maintenance window suppression&lt;/td&gt;&lt;td&gt;Backup and ETL jobs generate false positives every night&lt;/td&gt;&lt;td&gt;Add time-of-day or scheduled suppressions with logging&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No test for false negatives&lt;/td&gt;&lt;td&gt;Team knows when alerts fire too much, but not when they miss&lt;/td&gt;&lt;td&gt;Simulate failure scenarios in staging quarterly to verify alert coverage&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Your database alerts either fire too often (ignored) or too late (users complain first).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Measure 7-day baselines for the five metric groups above, then set two-level thresholds (warning, critical) calibrated to those baselines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Replay the last three database incidents against the proposed thresholds and verify they would have alerted at the warning level before user impact.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; This week, pull 7 days of replica lag, connection utilization, and slow query data from your monitoring tool and set the two-level thresholds using the reference values above as a starting point.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>Database Encryption: TDE, Column Encryption, pgcrypto, KMS</title><link>https://rajivonai.com/blog/2024-08-05-database-encryption-tde-column-pgcrypto-kms/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-08-05-database-encryption-tde-column-pgcrypto-kms/</guid><description>Why Transparent Data Encryption ticks compliance boxes but fails against compromised credentials, and how to push encryption boundaries up the stack.</description><pubDate>Mon, 05 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Transparent Data Encryption (TDE) is a compliance checkbox that protects against a stolen hard drive, but it offers zero protection against the actual threat: an attacker walking through the front door with a compromised database credential.&lt;/strong&gt; To genuinely secure sensitive data, engineering teams must shift cryptographic boundaries out of the storage engine and into the application layer, moving away from legacy patterns that trust the database process with the keys to the kingdom.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The regulatory definition of “encrypted at rest” is colliding with the reality of modern cloud security and zero-trust architectures. For decades, the industry standard was to turn on Transparent Data Encryption (TDE) at the database layer. TDE satisfies auditors—the data on the raw block storage device is mathematically inaccessible to someone who walks into an AWS data center and physically unplugs the hard drive.&lt;/p&gt;
&lt;p&gt;But physical theft is not the failure mode we are fighting in 2024. The threats we face are leaked application credentials in source code, Server-Side Request Forgery (SSRF) hitting internal database endpoints, and SQL injection vulnerabilities upstream. TDE operates seamlessly below the database engine’s shared memory buffers; it decrypts data automatically for any authenticated session. If an attacker has a valid credential, the database engine eagerly decrypts every row the attacker requests.&lt;/p&gt;




















&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Default approach&lt;/th&gt;&lt;th&gt;Better alternative&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Operating model&lt;/td&gt;&lt;td&gt;Turn on disk-level encryption (TDE) at the infrastructure layer, trusting the database process&lt;/td&gt;&lt;td&gt;Envelope encryption managed entirely by the application compute layer via a KMS&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failure mode&lt;/td&gt;&lt;td&gt;Data is completely accessible in plaintext if a valid database credential is leaked&lt;/td&gt;&lt;td&gt;Data remains ciphertext to the database; keys live in a disconnected control plane&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;When you rely on the database engine to handle encryption, you are explicitly deciding that the database process itself is the boundary of trust.&lt;/p&gt;
&lt;p&gt;This breaks down mechanically in two ways: disk-level (TDE) and column-level via database extensions (&lt;code&gt;pgcrypto&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The mechanics of TDE failure:&lt;/strong&gt; TDE encrypts database pages as they are flushed to disk and decrypts them as they are read into memory (like PostgreSQL’s &lt;code&gt;shared_buffers&lt;/code&gt; or MySQL’s &lt;code&gt;InnoDB Buffer Pool&lt;/code&gt;). The database process holds the encryption key in memory. From the perspective of the SQL execution engine, the data is always in plaintext. A leaked database credential bypasses TDE completely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The mechanics of database extension failure:&lt;/strong&gt; To solve the TDE problem, teams often move to column-level encryption using database extensions like PostgreSQL’s &lt;code&gt;pgcrypto&lt;/code&gt;. They execute queries like:
&lt;code&gt;SELECT pgp_sym_encrypt(&apos;sensitive_value&apos;, &apos;my_secret_key&apos;);&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This introduces a catastrophic operational vulnerability. The plaintext encryption key is passed directly across the wire in the SQL string. Unless you aggressively sanitize your telemetry, that plaintext key will instantly leak into:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt; (visible to any monitoring agent)&lt;/li&gt;
&lt;li&gt;Slow query logs shipped to Datadog or CloudWatch&lt;/li&gt;
&lt;li&gt;Statement-level audit trails such as &lt;code&gt;pgaudit&lt;/code&gt; output&lt;/li&gt;
&lt;li&gt;Client-side statement history such as &lt;code&gt;~/.psql_history&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure point&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;TDE (Disk-level)&lt;/td&gt;&lt;td&gt;Database decrypts data automatically on disk reads&lt;/td&gt;&lt;td&gt;Offers zero defense against SQL injection, SSRF, or credential theft&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Database Extensions&lt;/td&gt;&lt;td&gt;Keys are passed as string literals in SQL queries&lt;/td&gt;&lt;td&gt;Keys leak across all database observability and replication pipelines&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Application Encryption&lt;/td&gt;&lt;td&gt;The database engine loses visibility into the payload&lt;/td&gt;&lt;td&gt;Query patterns must be fundamentally redesigned to support exact-match searches&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The core architectural question is this: How do we completely decouple data access from data storage without destroying the database’s ability to efficiently serve queries?&lt;/p&gt;
&lt;h2 id=&quot;the-implementation&quot;&gt;The Implementation&lt;/h2&gt;
&lt;p&gt;The most resilient architecture shifts the cryptographic boundary out of the database entirely. The database is treated as a hostile, untrusted storage plane. The application layer handles all encryption using envelope encryption backed by a cloud Key Management Service (KMS), such as AWS KMS or Google Cloud KMS.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[&quot;Application Memory Pool&quot;] --&gt;|1. Request DEK| B[&quot;Cloud KMS API&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|2. Return Plaintext DEK and Wrapped DEK| A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt;|3. Encrypt Payload locally| A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt;|4. Write Ciphertext| C[&quot;Database Storage Engine&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Request the Data Encryption Key (DEK).&lt;/strong&gt;&lt;br&gt;
The application compute layer calls the KMS API, requesting a new DEK for a specific record.&lt;br&gt;
Confirm: The KMS returns two versions of the DEK to the application: the raw plaintext DEK and a KMS-wrapped ciphertext version of the DEK.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Encrypt locally in the application pool.&lt;/strong&gt;&lt;br&gt;
The application uses a local cryptographic library to encrypt the sensitive payload with an authenticated cipher such as AES-256-GCM, keyed by the plaintext DEK.&lt;br&gt;
Confirm: The plaintext DEK is immediately discarded and zeroed out from the application’s memory pool. Only the ciphertext payload and the ciphertext DEK remain.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write ciphertext to the hostile storage.&lt;/strong&gt;&lt;br&gt;
The application issues an &lt;code&gt;INSERT&lt;/code&gt; or &lt;code&gt;UPDATE&lt;/code&gt; to the database, writing both the encrypted payload and the ciphertext DEK into the row.&lt;br&gt;
Confirm: The database receives pure ciphertext. It cannot read the payload, and it cannot decrypt the DEK. The database is mathematically blind.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When reading the data back, the application fetches the row, sends the ciphertext DEK to the KMS to be unwrapped into plaintext, and then locally decrypts the payload.&lt;/p&gt;
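&lt;p&gt;On the database side, the write in step 3 only needs somewhere to keep the ciphertext payload next to its wrapped DEK. The schema below is a minimal sketch for PostgreSQL 10 or later; the table, column, and statement names are hypothetical, and the nonce and key-id columns are assumptions about what an AES-256-GCM implementation and a later key rotation would need.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table: every name here is illustrative.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE customer_secret (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    id             bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    payload_cipher bytea NOT NULL,  -- ciphertext produced by the application&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    payload_nonce  bytea NOT NULL,  -- per-record nonce used for that encryption&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wrapped_dek    bytea NOT NULL,  -- DEK ciphertext returned by the KMS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    kms_key_id     text  NOT NULL   -- which KMS key wrapped the DEK&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- The application binds only ciphertext; no key material appears in the SQL text.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;PREPARE store_secret(bytea, bytea, bytea, text) AS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;INSERT INTO customer_secret (payload_cipher, payload_nonce, wrapped_dek, kms_key_id)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;VALUES ($1, $2, $3, $4)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;RETURNING id;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because every bound value is already ciphertext, anything that captures the statement or its parameters sees no plaintext and no usable key.&lt;/p&gt;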
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern across mature platform architectures—especially those handling payments, healthcare records, or critical PII—is to enforce application-side envelope encryption over database-native cryptography.&lt;/p&gt;
&lt;p&gt;Context: When storing highly sensitive data points, standard operational posture assumes the database storage tier will eventually be compromised. A snapshot might be copied into a staging environment by a rogue script, or a read-replica credential might be exposed in a Slack channel.&lt;/p&gt;
&lt;p&gt;Action: Teams implement interceptors at the Object-Relational Mapping (ORM) layer or within a dedicated data access service. These interceptors catch writes to designated fields, run the KMS envelope encryption flow, and replace the plaintext with the ciphertext bundle before the SQL statement is ever constructed.&lt;/p&gt;
&lt;p&gt;Result: When a read-replica is inadvertently exposed, the exfiltrated data is entirely useless. An attacker holding the database dump only holds ciphertext. To actually read the data, the attacker would need simultaneous, active access to the specific IAM roles allowed to call the KMS &lt;code&gt;Decrypt&lt;/code&gt; API—a completely isolated security plane with its own rate limits and audit trails.&lt;/p&gt;
&lt;p&gt;Learning: The database must be decoupled from the cryptographic control plane. Relying on the database to police access to its own underlying data is an architectural anti-pattern.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;Shifting the cryptographic boundary to the application layer introduces severe mechanical constraints on the database engine.&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Searchability&lt;/td&gt;&lt;td&gt;Executing &lt;code&gt;SELECT ... WHERE encrypted_column = &apos;value&apos;&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Implement deterministic encryption for exact-match lookups, or build cryptographic blind indexes (e.g., HMAC-SHA256 of the plaintext)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Key Rotation&lt;/td&gt;&lt;td&gt;A KMS key needs to be rotated due to personnel exit&lt;/td&gt;&lt;td&gt;Build asynchronous background workers to iterate over tables, pull ciphertext, unwrap, rewrap with the new key, and write back&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Compute Overhead&lt;/td&gt;&lt;td&gt;The application calls KMS over the network for every row read&lt;/td&gt;&lt;td&gt;Cache the un-wrapped DEKs locally within the application memory space for a strict, short TTL (e.g., 5 minutes) to avoid KMS rate limits&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
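&lt;p&gt;The blind index fix for searchability can be sketched as a second column that stores a keyed hash of the plaintext, which the database can index and compare without ever seeing the value. This is a minimal PostgreSQL sketch with hypothetical table and column names. It assumes the application computes the HMAC-SHA256 with a secret held outside the database and normalizes the input (for example, lowercasing an email) before hashing.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table and column names, for illustration only.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE patient_contact (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    id                bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    email_cipher      bytea NOT NULL,  -- randomized ciphertext, not searchable&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wrapped_dek       bytea NOT NULL,  -- DEK ciphertext returned by the KMS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    email_blind_index bytea NOT NULL   -- HMAC-SHA256 of the normalized email, computed in the app&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX patient_contact_email_bidx ON patient_contact (email_blind_index);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Exact-match lookup: the app hashes the search term and binds the digest as $1.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;PREPARE find_by_email(bytea) AS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT id, email_cipher, wrapped_dek&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM patient_contact&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE email_blind_index = $1;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A blind index supports equality only, and identical plaintexts produce identical digests, so it leaks which rows share a value. That trade-off is usually acceptable for high-cardinality fields and a poor fit for low-cardinality ones.&lt;/p&gt;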
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Database-level encryption features like TDE and &lt;code&gt;pgcrypto&lt;/code&gt; provide a false sense of security against the most common vectors of data exfiltration, leaving data vulnerable to compromised credentials and SQL injection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Move the cryptographic boundary out of the database and up to the application compute layer using KMS envelope encryption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A leaked database credential or snapshot yields only ciphertext; an attacker must breach both the data plane and the IAM control plane simultaneously to extract value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Audit your schema for sensitive columns currently relying on TDE or &lt;code&gt;pgcrypto&lt;/code&gt;. Identify one critical field and scope the engineering effort to migrate it behind an application-side KMS flow with a blind index.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The ultimate measure of a zero-trust data architecture is not whether the disk is encrypted, but how many entirely disparate systems an attacker must compromise at the exact same time to read a single row of plaintext.&lt;/p&gt;</content:encoded><category>databases</category><category>architecture</category><category>security</category></item><item><title>MySQL and Aurora Monitoring: The Dashboard That Catches Problems Before Users Do</title><link>https://rajivonai.com/blog/2024-07-22-mysql-aurora-monitoring-dashboard-queries-replication-innodb/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-07-22-mysql-aurora-monitoring-dashboard-queries-replication-innodb/</guid><description>The seven MySQL and Aurora metric groups that matter for production operations — threads, replication lag, InnoDB buffer pool, slow queries, connections, locks, and disk — with exact SQL, CloudWatch metrics, and alert thresholds.</description><pubDate>Mon, 22 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A MySQL dashboard that shows only CPU and disk IOPS will miss the failures that actually page you at 3 AM: replication stopped because of a single bad row, InnoDB buffer pool thrashing on a cold restart, connection exhaustion from a leaked pool, and a lock chain building behind an ALTER TABLE that forgot &lt;code&gt;LOCK=NONE&lt;/code&gt;.&lt;/strong&gt; The metrics that matter come from &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt;, &lt;code&gt;performance_schema&lt;/code&gt;, and the MySQL status variables — not the OS.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most MySQL monitoring starts with the infrastructure layer: CPU, memory, disk I/O, network. These are necessary for capacity planning but insufficient for operational health. A MySQL instance with 30% CPU and plenty of free memory can still be moments from an outage: replica lag at 45 minutes, InnoDB buffer pool hit rate at 80% (normal is 99%), connection count at 95% of &lt;code&gt;max_connections&lt;/code&gt;, and five sessions blocked behind a lock on a hot row.&lt;/p&gt;
&lt;p&gt;Aurora adds its own layer: storage auto-scaling, volume bytes ceiling, cluster-level failover, and replica lag measured differently than MySQL’s &lt;code&gt;Seconds_Behind_Master&lt;/code&gt;. Monitoring Aurora with only MySQL queries misses the Aurora-specific failure modes.&lt;/p&gt;
&lt;p&gt;The seven metric groups below apply to both self-managed MySQL and Aurora MySQL. Where Aurora differs, the Aurora-specific metric or query is noted.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;


















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Symptom&lt;/th&gt;&lt;th&gt;Likely source&lt;/th&gt;&lt;th&gt;First place to check&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application queries suddenly slower&lt;/td&gt;&lt;td&gt;Lock contention or plan regression&lt;/td&gt;&lt;td&gt;&lt;code&gt;INFORMATION_SCHEMA.INNODB_TRX&lt;/code&gt;, &lt;code&gt;SHOW PROCESSLIST&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Connection pool exhausted&lt;/td&gt;&lt;td&gt;&lt;code&gt;max_connections&lt;/code&gt; hit or leaked connections&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW STATUS LIKE &apos;Threads_connected&apos;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica reads returning stale data&lt;/td&gt;&lt;td&gt;Replication lag&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt; / Aurora CloudWatch &lt;code&gt;ReplicaLag&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Table scan on a previously fast query&lt;/td&gt;&lt;td&gt;Missing index or stale stats&lt;/td&gt;&lt;td&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;, &lt;code&gt;information_schema.STATISTICS&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Got error 1040: Too many connections&lt;/code&gt; in app logs&lt;/td&gt;&lt;td&gt;Connections near or at limit&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW VARIABLES LIKE &apos;max_connections&apos;&lt;/code&gt; vs current threads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Disk filling faster than expected&lt;/td&gt;&lt;td&gt;Binary logs not purging or large temp tables&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW VARIABLES LIKE &apos;expire_logs_days&apos;&lt;/code&gt; (&lt;code&gt;binlog_expire_logs_seconds&lt;/code&gt; on MySQL 8.0+)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;OOM kill on MySQL process&lt;/td&gt;&lt;td&gt;Buffer pool too large for available RAM&lt;/td&gt;&lt;td&gt;&lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; vs system RAM&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Lock wait timeout exceeded&lt;/code&gt; in app&lt;/td&gt;&lt;td&gt;Long-running transaction holding row locks&lt;/td&gt;&lt;td&gt;&lt;code&gt;INFORMATION_SCHEMA.INNODB_TRX&lt;/code&gt; + &lt;code&gt;INNODB_LOCKS&lt;/code&gt; (&lt;code&gt;performance_schema.data_locks&lt;/code&gt; on MySQL 8.0+)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
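&lt;p&gt;Run these in order when something is wrong. Most need only the &lt;code&gt;PROCESS&lt;/code&gt; privilege or &lt;code&gt;SELECT&lt;/code&gt; on &lt;code&gt;performance_schema&lt;/code&gt;; &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt; additionally needs &lt;code&gt;REPLICATION CLIENT&lt;/code&gt;.&lt;/p&gt;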
&lt;p&gt;&lt;strong&gt;1. What are active threads doing right now?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id, user, host, db, command, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;time&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;LEFT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(info, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;120&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;PROCESSLIST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; command &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;!=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;Sleep&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; time&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; time&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for threads in a lock-wait state (for example &lt;code&gt;Waiting for table metadata lock&lt;/code&gt;), &lt;code&gt;Sending data&lt;/code&gt;, or &lt;code&gt;Copying to tmp table&lt;/code&gt; with long durations. Any active query running more than 30 seconds in OLTP deserves investigation. A chain of sessions blocked behind a single lock holder is a reliability event.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Is anyone waiting on InnoDB row locks?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_trx_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_thread,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_trx_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_thread,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  TIMESTAMPDIFF(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECOND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_wait_started&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; wait_seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_LOCK_WAITS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; w&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; r &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; w&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;requesting_trx_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; b &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; w&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;blocking_trx_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; wait_seconds &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On MySQL 8.0+, use &lt;code&gt;performance_schema.data_lock_waits&lt;/code&gt; instead: &lt;code&gt;information_schema.INNODB_LOCK_WAITS&lt;/code&gt; was deprecated in 5.7 and removed in 8.0. A lock wait exceeding 10 seconds on an OLTP system is a reliability event, not a transient blip.&lt;/p&gt;
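&lt;p&gt;A minimal sketch of the MySQL 8.0 form of this check, following the join pattern shown in the MySQL reference manual. The column aliases are illustrative, and deriving &lt;code&gt;wait_seconds&lt;/code&gt; from &lt;code&gt;trx_wait_started&lt;/code&gt; is an assumption about how you want to measure the wait:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- MySQL 8.0+: waiting and blocking transactions, longest wait first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  r.trx_mysql_thread_id AS waiting_thread,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  r.trx_query AS waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  b.trx_mysql_thread_id AS blocking_thread,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  b.trx_query AS blocking_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM performance_schema.data_lock_waits w&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN information_schema.INNODB_TRX r ON r.trx_id = w.REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN information_schema.INNODB_TRX b ON b.trx_id = w.BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY wait_seconds DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;blocking_query&lt;/code&gt; is often &lt;code&gt;NULL&lt;/code&gt; because the blocker is idle inside an open transaction. &lt;code&gt;blocking_thread&lt;/code&gt; is the connection id you would pass to &lt;code&gt;KILL&lt;/code&gt;.&lt;/p&gt;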
&lt;p&gt;&lt;strong&gt;3. How far behind is the replica?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- MySQL self-managed:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW SLAVE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;STATUS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;\G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Key fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running, Last_Error&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
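&lt;p&gt;&lt;code&gt;Seconds_Behind_Master&lt;/code&gt; reports the difference between the current timestamp and the timestamp of the last event the replica’s SQL thread applied. It goes to &lt;code&gt;NULL&lt;/code&gt; when replication is stopped; that is broken replication, not zero lag. On MySQL 8.0.22 and later the same data is exposed by &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt;, where the field is named &lt;code&gt;Seconds_Behind_Source&lt;/code&gt;.&lt;/p&gt;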
&lt;p&gt;For Aurora MySQL, use the CloudWatch metric &lt;code&gt;AuroraReplicaLag&lt;/code&gt; (reported in milliseconds) for replicas inside the cluster. Aurora replicas share the writer’s storage volume, so the metric measures how long the replica takes to apply changes from the writer, not a binary log position difference.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. What is the InnoDB buffer pool hit rate?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  variable_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  variable_value&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;global_status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; variable_name &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;Innodb_buffer_pool_read_requests&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;Innodb_buffer_pool_reads&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;Innodb_buffer_pool_wait_free&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;Innodb_buffer_pool_pages_dirty&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;Innodb_buffer_pool_pages_total&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compute hit rate: &lt;code&gt;(Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests * 100&lt;/code&gt;. These are cumulative counters since the last restart, so compare two samples to get the hit rate for a specific interval. A rate below 99% usually means the buffer pool is too small for the working set. &lt;code&gt;Innodb_buffer_pool_wait_free&lt;/code&gt; is also cumulative: a value that keeps rising means MySQL is waiting for clean pages, a sign of memory pressure.&lt;/p&gt;
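&lt;p&gt;One way to compute the ratio in a single statement, assuming MySQL 5.7+ where the status counters are exposed in &lt;code&gt;performance_schema.global_status&lt;/code&gt;. The aliases &lt;code&gt;rq&lt;/code&gt; and &lt;code&gt;rd&lt;/code&gt; and the column name &lt;code&gt;buffer_pool_hit_pct&lt;/code&gt; are illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Buffer pool hit rate since the last restart, as a percentage&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT ROUND(100 * (1 - rd.variable_value / rq.variable_value), 2) AS buffer_pool_hit_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM performance_schema.global_status rq&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN performance_schema.global_status rd&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON rd.variable_name = &apos;Innodb_buffer_pool_reads&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE rq.variable_name = &apos;Innodb_buffer_pool_read_requests&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because both counters are cumulative, this figure is an average since startup. For alerting, a monitoring agent would more likely store the two raw counters and compute the ratio from the difference between consecutive samples.&lt;/p&gt;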
&lt;p&gt;&lt;strong&gt;5. What does the slow query rate look like?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;STATUS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LIKE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;Slow_queries&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW VARIABLES &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIKE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;long_query_time&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW VARIABLES &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIKE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;slow_query_log%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;slow_query_log&lt;/code&gt; is &lt;code&gt;OFF&lt;/code&gt;, turn it on: &lt;code&gt;SET GLOBAL slow_query_log = &apos;ON&apos;; SET GLOBAL long_query_time = 1;&lt;/code&gt; (a 1-second threshold suits OLTP). The new global &lt;code&gt;long_query_time&lt;/code&gt; applies only to connections opened after the change. &lt;code&gt;Slow_queries&lt;/code&gt; is a cumulative counter since the last restart, so track its rate of change, not the absolute value.&lt;/p&gt;
&lt;p&gt;To rank statements by total latency, query the &lt;code&gt;performance_schema&lt;/code&gt; digest table:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; schema_name, digest_text,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       count_star &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; executions,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(avg_timer_wait &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; 1e12, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;3&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; avg_latency_sec,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(sum_timer_wait &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; 1e12, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;3&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; total_latency_sec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;events_statements_summary_by_digest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; schema_name &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; sum_timer_wait &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Symptom observed] --&gt; B{Active threads check}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Long-running active queries| C[Run EXPLAIN — plan regression or missing index?]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Threads in lock wait| D[Find blocking transaction — INNODB_TRX]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Many Sleep threads| E[Check connection pool — leaked connections or idle timeout not set?]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|All looks normal| F{Check replication}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|Seconds_Behind_Master high or NULL| G[Check Slave_IO_Running and Slave_SQL_Running — IO stopped means network or binlog issue — SQL stopped means error on replica apply]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|Lag acceptable| H{Check InnoDB buffer pool}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|Hit rate below 99%| I[Working set exceeds buffer pool — increase innodb_buffer_pool_size or identify hot tables]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|wait_free above zero| J[Memory pressure — check OS swap and buffer pool size vs available RAM]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|Buffer pool healthy| K{Check slow queries}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|Slow query rate spiking| L[Run EXPLAIN on top queries from performance_schema digest — find index gaps]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|No slow query signal| M{Check connections}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    M --&gt;|Threads_connected near max_connections| N[Check for leaked connections — application not closing pool]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    M --&gt;|Connections healthy| O[Check InnoDB redo log waits and binary log position]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Problem&lt;/th&gt;&lt;th&gt;Immediate action&lt;/th&gt;&lt;th&gt;Durable fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Lock chain blocking transactions&lt;/td&gt;&lt;td&gt;&lt;code&gt;KILL &amp;#x3C;blocking_thread_id&gt;&lt;/code&gt;; use with caution, because it rolls back the blocking transaction&lt;/td&gt;&lt;td&gt;Fix the application transaction that holds locks across slow external calls; set an explicit &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replication stopped (SQL thread error)&lt;/td&gt;&lt;td&gt;Read &lt;code&gt;Last_SQL_Error&lt;/code&gt; from &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt;; run &lt;code&gt;STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;&lt;/code&gt; only if the event is truly safe to skip&lt;/td&gt;&lt;td&gt;Fix the root cause (schema drift, unsupported statement in ROW format); never skip without understanding the error&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;InnoDB buffer pool hit rate below 99%&lt;/td&gt;&lt;td&gt;Identify and cache the hot tables; check if a large dump or batch job is evicting the working set&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; (safe upper bound: 70–80% of total RAM); use buffer pool warmup after restart&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Connection exhaustion&lt;/td&gt;&lt;td&gt;Generate &lt;code&gt;KILL&lt;/code&gt; statements for idle connections, review them, then run them: &lt;code&gt;SELECT CONCAT(&apos;KILL &apos;, id, &apos;;&apos;) FROM information_schema.PROCESSLIST WHERE command=&apos;Sleep&apos; AND time &gt; 300;&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Set &lt;code&gt;wait_timeout&lt;/code&gt; and &lt;code&gt;interactive_timeout&lt;/code&gt;; fix application connection pool to return connections after use&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow query regression&lt;/td&gt;&lt;td&gt;Temporarily add an index with &lt;code&gt;CREATE INDEX ... ALGORITHM=INPLACE, LOCK=NONE&lt;/code&gt;; or force a plan with &lt;code&gt;FORCE INDEX&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Tune the query; rebuild statistics with &lt;code&gt;ANALYZE TABLE&lt;/code&gt;; add index permanently after testing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Disk filling from binary logs&lt;/td&gt;&lt;td&gt;&lt;code&gt;PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY)&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Set &lt;code&gt;expire_logs_days = 7&lt;/code&gt; (on MySQL 8.0, &lt;code&gt;binlog_expire_logs_seconds = 604800&lt;/code&gt;); verify the replica is not lagging first, because purging logs a replica still needs will break replication&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Three MySQL checks can be automated into a runbook trigger:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Replication watchdog&lt;/strong&gt;: poll &lt;code&gt;Seconds_Behind_Master&lt;/code&gt; every 60 seconds; alert when it exceeds 60 seconds; alert as critical when it is &lt;code&gt;NULL&lt;/code&gt; (replication stopped). For Aurora, alarm on the CloudWatch &lt;code&gt;AuroraReplicaLag&lt;/code&gt; metric (reported in milliseconds) with the same two levels.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Connection saturation check&lt;/strong&gt;: query &lt;code&gt;Threads_connected / max_connections&lt;/code&gt; every 60 seconds. Alert at 70%, page at 85%. This gives the team time to identify the source (pool leak, burst traffic, slow query cascade) before connection errors reach the application.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Long-running transaction watchdog&lt;/strong&gt;: query &lt;code&gt;INFORMATION_SCHEMA.INNODB_TRX&lt;/code&gt; every 60 seconds. Alert if any transaction has been running more than 5 minutes. Auto-terminate transactions running more than 30 minutes and log each termination. Long-running transactions hold back the InnoDB purge thread (the rough analog of PostgreSQL autovacuum), hold row locks, and inflate the undo log.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
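&lt;p&gt;Checks 2 and 3 above can be sketched as two polling queries. This assumes MySQL 5.7+ with &lt;code&gt;performance_schema&lt;/code&gt; enabled and a monitoring user that holds &lt;code&gt;PROCESS&lt;/code&gt;; the column aliases are illustrative, and the alert thresholds (70% and 85%, 5 and 30 minutes) are expected to live in the monitoring tool, not in the SQL:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Connection saturation as a percentage of max_connections&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT ROUND(100 * s.variable_value / v.variable_value, 1) AS connections_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM performance_schema.global_status s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN performance_schema.global_variables v&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON v.variable_name = &apos;max_connections&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE s.variable_name = &apos;Threads_connected&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Transactions that have been open for more than 5 minutes (300 seconds)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT trx_mysql_thread_id AS thread_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       trx_started,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM information_schema.INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) &gt; 300&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY trx_started;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Returning &lt;code&gt;thread_id&lt;/code&gt; keeps the auto-terminate step simple: it is the id that &lt;code&gt;KILL&lt;/code&gt; expects, so the watchdog can log the row and then issue the kill for anything past the 30-minute limit.&lt;/p&gt;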
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;p&gt;MySQL health is not visible in CPU and disk IOPS alone. Replication lag, InnoDB buffer pool utilization, lock chains, and connection exhaustion are the failure modes that cause user-visible errors — and all of them are visible in MySQL status variables and &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; before CPU shows any anomaly. The most common monitoring gap in MySQL deployments is treating &lt;code&gt;Seconds_Behind_Master = NULL&lt;/code&gt; as zero lag instead of broken replication, and setting a single global slow query threshold that fires on legitimate batch queries while missing OLTP regressions. The seven metric groups above require only a &lt;code&gt;PROCESS&lt;/code&gt; privilege and a 60-second poll interval.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Seconds_Behind_Master = NULL&lt;/code&gt; treated as healthy&lt;/td&gt;&lt;td&gt;NULL means replication stopped, not zero lag&lt;/td&gt;&lt;td&gt;Alert on &lt;code&gt;NULL&lt;/code&gt; as critical, not informational&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow query alert fires on batch jobs&lt;/td&gt;&lt;td&gt;Global &lt;code&gt;long_query_time&lt;/code&gt; threshold applies to all queries&lt;/td&gt;&lt;td&gt;Set per-session &lt;code&gt;long_query_time&lt;/code&gt; for batch roles; alert on rate from &lt;code&gt;performance_schema&lt;/code&gt; digest by schema&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Buffer pool hit rate appears fine but queries are slow&lt;/td&gt;&lt;td&gt;A large report query is evicting the working set during the report window&lt;/td&gt;&lt;td&gt;Alert on hit rate averaged over 5 minutes; monitor &lt;code&gt;Innodb_buffer_pool_reads&lt;/code&gt; rate alongside hit rate&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Lock wait queries not visible&lt;/td&gt;&lt;td&gt;&lt;code&gt;information_schema.INNODB_LOCK_WAITS&lt;/code&gt; exists only through MySQL 5.7; MySQL 8.0 replaces it with &lt;code&gt;performance_schema.data_lock_waits&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Update monitoring queries for MySQL 8.0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Aurora &lt;code&gt;Seconds_Behind_Master&lt;/code&gt; not available&lt;/td&gt;&lt;td&gt;Aurora replicas don’t expose this variable via &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt; in the same way&lt;/td&gt;&lt;td&gt;Use the CloudWatch &lt;code&gt;AuroraReplicaLag&lt;/code&gt; metric; do not rely on &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt; for Aurora replica lag&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;performance_schema&lt;/code&gt; disabled&lt;/td&gt;&lt;td&gt;Enabled by default since MySQL 5.6.6 but can be turned off, which leaves the digest table empty&lt;/td&gt;&lt;td&gt;Verify &lt;code&gt;performance_schema = ON&lt;/code&gt; in &lt;code&gt;my.cnf&lt;/code&gt;; confirm the &lt;code&gt;statements_digest&lt;/code&gt; consumer is enabled in &lt;code&gt;performance_schema.setup_consumers&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; MySQL and Aurora monitoring shows infrastructure metrics but misses the database-level signals that precede outages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Add the seven metric groups above using a &lt;code&gt;PROCESS&lt;/code&gt;-privileged monitoring user and a 60-second poll interval. For Aurora, add CloudWatch alarms for &lt;code&gt;AuroraReplicaLag&lt;/code&gt;, &lt;code&gt;DatabaseConnections&lt;/code&gt;, and &lt;code&gt;FreeStorageSpace&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Run the five checks above against your production instance right now and confirm that &lt;code&gt;Seconds_Behind_Master&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt;, buffer pool hit rate is above 99%, and no thread has been blocked on a lock for more than 10 seconds.&lt;/li&gt;
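&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; This week, create a monitoring role (&lt;code&gt;GRANT PROCESS, REPLICATION CLIENT ON *.* TO &apos;monitoring&apos;@&apos;%&apos;; GRANT SELECT ON performance_schema.* TO &apos;monitoring&apos;@&apos;%&apos;&lt;/code&gt;; global privileges such as &lt;code&gt;PROCESS&lt;/code&gt; must be granted on &lt;code&gt;*.*&lt;/code&gt;, and &lt;code&gt;REPLICATION CLIENT&lt;/code&gt; is what allows &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt;), enable &lt;code&gt;slow_query_log&lt;/code&gt;, and set a replication lag alert with a 60-second warning threshold.&lt;/li&gt;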
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>CloudWatch Database Insights for Aurora and RDS: The New AWS Monitoring Center</title><link>https://rajivonai.com/blog/2024-07-16-cloudwatch-database-insights-aurora-rds/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-07-16-cloudwatch-database-insights-aurora-rds/</guid><description>How to use CloudWatch and Performance Insights to root-cause Aurora and RDS incidents without deploying third-party agents.</description><pubDate>Tue, 16 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If you are still SSH-ing into a bastion host to run &lt;code&gt;top&lt;/code&gt; and &lt;code&gt;SHOW PROCESSLIST&lt;/code&gt; during an Aurora outage, you are ignoring the richest telemetry plane AWS provides.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Historically, monitoring a managed database like Amazon RDS or Aurora meant making a choice: rely on the sparse, high-level metrics provided by default CloudWatch, or install a third-party agent that required network access, credential management, and additional compute overhead.&lt;/p&gt;
&lt;p&gt;The industry standard has shifted. AWS has unified Performance Insights (PI), Enhanced Monitoring (EM), and CloudWatch into a central observability plane. For teams operating Aurora and RDS at scale, the native AWS monitoring stack now provides enough granularity to diagnose deadlocks, pinpoint bad query plans, and trace I/O saturation without ever leaving the AWS console or writing a custom exporter.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;p&gt;Database failures in Aurora rarely look like hard crashes. They look like creeping degradation. The operational symptoms typically manifest as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Phantom CPU Spike:&lt;/strong&gt; &lt;code&gt;CPUUtilization&lt;/code&gt; hits 99%, but &lt;code&gt;DatabaseConnections&lt;/code&gt; remains flat. The application feels sluggish.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The I/O Ceiling:&lt;/strong&gt; Queries that normally take 5ms suddenly take 500ms. The &lt;code&gt;ReadIOPS&lt;/code&gt; or &lt;code&gt;WriteIOPS&lt;/code&gt; metrics flatline at the exact provisioned limit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Connection Storm:&lt;/strong&gt; &lt;code&gt;DatabaseConnections&lt;/code&gt; spikes vertically, followed immediately by application-side 502 Bad Gateway errors as the connection pool queue fills up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Silent Blocker:&lt;/strong&gt; Application latency increases, but &lt;code&gt;CPUUtilization&lt;/code&gt; is suspiciously low. Threads are waiting, not working.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;p&gt;When a paging alert fires for an Aurora or RDS instance, these are the first five checks an engineer should perform using native AWS tools:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check &lt;code&gt;DBLoad&lt;/code&gt; in Performance Insights:&lt;/strong&gt;
This is the single most important metric. &lt;code&gt;DBLoad&lt;/code&gt; measures the average number of active sessions in the database engine. If it stays above the instance’s vCPU count, sessions are queuing for resources and the database is the bottleneck.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review the &lt;code&gt;Wait Events&lt;/code&gt; Breakdown:&lt;/strong&gt;
Slice the DBLoad metric by &lt;code&gt;waits&lt;/code&gt;. Are sessions on &lt;code&gt;CPU&lt;/code&gt; (working), waiting on &lt;code&gt;io/table/sql/handler&lt;/code&gt; (table I/O in the storage engine), or waiting on a &lt;code&gt;Lock&lt;/code&gt; event (contention)?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check &lt;code&gt;FreeableMemory&lt;/code&gt; and &lt;code&gt;SwapUsage&lt;/code&gt; (CloudWatch):&lt;/strong&gt;
If &lt;code&gt;FreeableMemory&lt;/code&gt; plunges near zero and &lt;code&gt;SwapUsage&lt;/code&gt; begins climbing, the instance is thrashing. This often precedes an Out Of Memory (OOM) crash.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Identify the Top SQL by Load (Performance Insights):&lt;/strong&gt;
Look at the “Top SQL” panel. Is the load caused by a single terrible query plan (one bar dominates), or an aggregate increase in all traffic?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Examine &lt;code&gt;CommitLatency&lt;/code&gt; and &lt;code&gt;Deadlocks&lt;/code&gt; (Aurora Specific):&lt;/strong&gt;
For Aurora PostgreSQL, check the &lt;code&gt;CommitLatency&lt;/code&gt; metric. If commit latency spikes while read IOPS are low, the storage volume might be experiencing multi-AZ replication delays.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;p&gt;When diagnosing an Aurora performance incident, the dominant wait event is the critical pivot point.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[DBLoad Exceeds vCPUs] --&gt; B{What is the Dominant Wait State?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|CPU| C[Check Top SQL by Load]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; C1{Is it a single query?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|Yes| C2[Missing Index or Bad Plan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|No| C3[Traffic Spike: Scale Up Instance]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|I/O| D[Check IOPS Metrics]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; D1{Hitting Provisioned Limits?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D1 --&gt;|Yes| D2[Increase Provisioned IOPS or EBS Volume Size]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D1 --&gt;|No| D3[Check Buffer Cache Hit Ratio]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Locks| E[Check Blocking Sessions]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; E1[Identify the Blocking PID]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E1 --&gt; E2[Kill Blocker or Refactor Transaction Scope]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;Once the root cause is identified, you have a limited set of remediation paths.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Kill the Offending Query (Fastest, High Risk):&lt;/strong&gt;
If a single session is holding an &lt;code&gt;AccessExclusiveLock&lt;/code&gt;, terminating its PID with &lt;code&gt;pg_terminate_backend&lt;/code&gt; releases the lock and restores service once the session’s rollback completes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; The application must handle the failure gracefully. If it immediately retries the exact same bad query, the database will lock up again.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vertical Scaling (Medium Speed, High Cost):&lt;/strong&gt;
Modifying the instance to a larger instance class provides more CPU and memory. For Aurora, this takes minutes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; It requires a brief interruption of service (failover) and treats the symptom (lack of resources) rather than the disease (bad queries).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deploy an Emergency Index (Slowest, Permanent Fix):&lt;/strong&gt;
If the Top SQL reveals a missing index causing a sequential scan, building the index &lt;code&gt;CONCURRENTLY&lt;/code&gt; resolves the CPU load.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; Building an index takes time and adds I/O pressure to an already struggling database.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
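&lt;p&gt;For Aurora PostgreSQL, options 1 and 3 above can be sketched in SQL. This is an illustration under stated assumptions, not a prescribed procedure: &lt;code&gt;12345&lt;/code&gt; is a placeholder PID, and &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, and the index name are hypothetical. &lt;code&gt;pg_blocking_pids&lt;/code&gt; requires PostgreSQL 9.6 or later:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Who is blocking whom: one row per blocked/blocking session pair&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT blocked.pid   AS blocked_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       blocked.query AS blocked_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       blocker.pid   AS blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       blocker.query AS blocking_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity AS blocked&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_stat_activity AS blocker&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON blocker.pid = ANY (pg_blocking_pids(blocked.pid));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Option 1: terminate the blocker (12345 is a placeholder PID)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_terminate_backend(12345);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Option 3: build the missing index without blocking writes (hypothetical names)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; cannot run inside a transaction block, and a failed build leaves an invalid index behind that has to be dropped before retrying.&lt;/p&gt;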
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;If a remediation action worsens the situation (e.g., terminating a session causes a massive rollback that spikes I/O), the immediate rollback plan must be well-defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stop the application traffic at the load balancer to shed load.&lt;/li&gt;
&lt;li&gt;Wait for the database engine to finish its internal rollback procedures.&lt;/li&gt;
&lt;li&gt;Do not reboot the instance during an active transaction rollback, as it will simply restart the rollback process upon recovery.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;CloudWatch supports automated remediation through Alarms and Systems Manager (SSM) Runbooks. For example, you can create a CloudWatch Alarm that triggers when &lt;code&gt;FreeableMemory&lt;/code&gt; drops below a byte threshold equal to about 10% of the instance’s memory (the metric is reported in bytes, not as a percentage). Instead of just paging an engineer, the alarm can trigger an AWS Lambda function that queries Performance Insights, identifies the session consuming the most memory, and automatically terminates it.&lt;/p&gt;
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Standardize on Performance Insights:&lt;/strong&gt; Do not rely purely on basic CloudWatch metrics. PI’s &lt;code&gt;DBLoad&lt;/code&gt; is the only metric that accurately reflects database saturation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tag Your Queries:&lt;/strong&gt; Mandate that application teams use SQL comments (e.g., &lt;code&gt;/* route=checkout, user=123 */&lt;/code&gt;) so that PI can group database load by application feature.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alert on Saturation, Not Averages:&lt;/strong&gt; Set alarms on wait events and connection limits, not just 80% CPU utilization.&lt;/li&gt;
&lt;/ul&gt;
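&lt;p&gt;Query tagging is simple to centralize in the data access layer. The following is a minimal sketch, assuming a hypothetical helper named &lt;code&gt;tag_query&lt;/code&gt; that the application calls before executing SQL; the tag keys mirror the comment format shown above and are not a Performance Insights requirement.&lt;/p&gt;

```python
def tag_query(sql, **tags):
    """Prefix sql with a /* key=value, ... */ comment built from tags."""
    if not tags:
        return sql
    parts = []
    for key in sorted(tags):
        # Drop asterisks so a tag value can never open or close a SQL comment.
        value = str(tags[key]).replace("*", "")
        parts.append(f"{key}={value}")
    return "/* " + ", ".join(parts) + " */ " + sql


tagged = tag_query("SELECT * FROM orders WHERE id = %s", route="checkout", user=123)
print(tagged)
```

&lt;p&gt;Sorting the keys keeps the comment stable across call sites, which helps when searching logs for a specific tag combination.&lt;/p&gt;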
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Engineers SSH into bastion hosts and run &lt;code&gt;SHOW PROCESSLIST&lt;/code&gt; during Aurora incidents because the default CloudWatch dashboard surfaces host saturation, not database saturation — &lt;code&gt;CPUUtilization&lt;/code&gt; at 40% tells you nothing about 500 sessions waiting on a lock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Make &lt;code&gt;DBLoad&lt;/code&gt; sliced by wait event type the primary diagnostic signal in every Aurora incident — it’s the only metric that shows whether the database is blocked, I/O-bound, or genuinely CPU-saturated.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Simulate an I/O spike in staging and verify the corresponding CloudWatch alarm fires within 2 minutes with the wait event correctly identified — if the alarm fires on CPU and not &lt;code&gt;DBLoad&lt;/code&gt;, the triage workflow hasn’t improved.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Enable Performance Insights at 1-second granularity on all production Aurora clusters, add a &lt;code&gt;DBLoad &gt; vCPUs&lt;/code&gt; alarm with wait-event context, and require “Top SQL by Load” in the next database post-mortem.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>Database Changes in CI/CD: Migrations, Backfills, Expand-Contract, and Verification</title><link>https://rajivonai.com/blog/2024-07-16-database-changes-in-ci-cd-migrations-backfills-expand-contract-and-verification/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-07-16-database-changes-in-ci-cd-migrations-backfills-expand-contract-and-verification/</guid><description>Database changes in CI/CD require separate gates for schema migrations, backfills, and expand-contract patterns — not just a shell command before deployment.</description><pubDate>Tue, 16 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A deployment pipeline that treats database change as a shell command is not automated; it is just moving the outage closer to production.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Application delivery has become routine. Every merge can build, test, package, scan, deploy, and roll back. The uncomfortable exception is the database. Schema changes are durable, shared, stateful, and often expensive. A bad application deploy can be rolled back by moving traffic to a previous artifact. A bad column drop, blocking index build, or half-completed backfill is a different class of failure.&lt;/p&gt;
&lt;p&gt;That is why database delivery needs its own release protocol inside CI/CD. Migrations are not just files in a repository. They are operations against a live, contended system with locks, replication lag, query plans, old application versions, new application versions, background workers, and human rollback expectations.&lt;/p&gt;
&lt;p&gt;Rails describes migrations as a way to evolve schema over time, but its own documentation also notes that not every database supports transactional DDL for every schema operation; when a migration fails, some completed parts may not be rolled back automatically.&lt;sup&gt;&lt;a href=&quot;#user-content-fn-rails-migrations&quot; id=&quot;user-content-fnref-rails-migrations&quot; data-footnote-ref=&quot;&quot; aria-describedby=&quot;footnote-label&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; That small detail is the heart of the problem. Database change is deployment, data repair, capacity management, and verification all at once.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Most teams begin with a simple rule: run migrations before deploy. That works until the migration is slow, incompatible, or logically coupled to code that is not fully rolled out.&lt;/p&gt;
&lt;p&gt;The common failure modes are predictable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A deploy adds code that reads a column before the migration is complete.&lt;/li&gt;
&lt;li&gt;A migration drops a column still used by an older application instance.&lt;/li&gt;
&lt;li&gt;A backfill competes with production traffic and creates lock waits or replica lag.&lt;/li&gt;
&lt;li&gt;A new constraint validates existing dirty data and blocks the deploy.&lt;/li&gt;
&lt;li&gt;A rollback reverts application code but leaves the database in the new shape.&lt;/li&gt;
&lt;li&gt;CI proves the migration works on an empty test database but not on production-sized data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The question is not whether database changes should be automated. They should. The question is what the pipeline must know before it is allowed to change shared state.&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The safe pattern is expand, deploy, backfill, verify, contract. It turns a dangerous one-step migration into a sequence of compatible states.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[proposal — schema change request] --&gt; B[static checks — unsafe operation detection]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[expand migration — additive schema]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D[deploy code — dual read or dual write]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[backfill job — bounded batches]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; F[verification — counts constraints and query plans]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; G[contract migration — remove obsolete shape]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; H[post deploy audit — drift and health checks]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt;|reject| X[manual review — lock risk or data risk]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt;|pause| Y[traffic protection — throttle or stop]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt;|fail| Z[remediation — repair data before contract]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first design rule is compatibility. Every production state must tolerate old code and new code running together. That means additive migrations first: add nullable columns, create tables, add indexes concurrently where the database supports it, and avoid immediate destructive changes.&lt;/p&gt;
&lt;p&gt;The second rule is separation. Schema migration and data migration are different operations. A schema migration changes shape. A backfill changes volume. Backfills belong in resumable, observable jobs, not inside a deploy transaction. They need batch size, sleep interval, retry policy, progress state, error quarantine, and an emergency stop.&lt;/p&gt;
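&lt;p&gt;A resumable backfill can be sketched in a few dozen lines. The example below uses Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module only so that it runs anywhere; the &lt;code&gt;users&lt;/code&gt; table, the &lt;code&gt;email_lower&lt;/code&gt; column, the &lt;code&gt;backfill_progress&lt;/code&gt; checkpoint table, and the function name are all hypothetical, and a production job would add retry and error quarantine on top.&lt;/p&gt;

```python
import sqlite3
import time


def backfill_email_lower(conn, batch_size=1000, sleep_s=0.0,
                         should_stop=lambda: False):
    """Fill users.email_lower in bounded batches, resuming from a checkpoint."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_progress "
        "(job TEXT PRIMARY KEY, last_id INTEGER)"
    )
    row = conn.execute(
        "SELECT last_id FROM backfill_progress WHERE job = 'email_lower'"
    ).fetchone()
    last_id = row[0] if row else 0
    updated = 0
    while not should_stop():  # emergency stop is checked before every batch
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )]
        if not ids:
            break
        # One short transaction per batch: data and checkpoint commit together.
        with conn:
            cur = conn.execute(
                "UPDATE users SET email_lower = lower(email) "
                "WHERE id BETWEEN ? AND ? AND email_lower IS NULL",
                (ids[0], ids[-1]),
            )
            updated += cur.rowcount
            conn.execute(
                "INSERT OR REPLACE INTO backfill_progress (job, last_id) "
                "VALUES ('email_lower', ?)",
                (ids[-1],),
            )
        last_id = ids[-1]
        time.sleep(sleep_s)  # throttle between batches
    return updated


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_lower TEXT)"
)
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"User{i}@Example.com",) for i in range(25)],
)
conn.commit()
updated_rows = backfill_email_lower(conn, batch_size=10)
print(updated_rows)
```

&lt;p&gt;Keyset pagination on the primary key keeps each batch bounded, and committing the checkpoint in the same transaction as the update means a crash never loses or repeats more than one batch.&lt;/p&gt;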
&lt;p&gt;The third rule is verification as a gate, not a dashboard. The pipeline should not merely run &lt;code&gt;db:migrate&lt;/code&gt; and report success. It should ask whether the resulting database state is compatible with the next release step. That means verifying migration order, expected columns, indexes, constraints, row counts, null rates, duplicate keys, backfill completion, and query plan changes for critical paths.&lt;/p&gt;
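&lt;p&gt;A verification gate can start as a function that returns the list of failed checks and lets the pipeline block on a non-empty result. This sketch covers only two of the checks named above, remaining nulls and duplicate keys, and again uses &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in database; the table and column names are hypothetical and are assumed to come from trusted migration metadata, not user input.&lt;/p&gt;

```python
import sqlite3


def verify_backfill(conn, table, new_column, key_column):
    """Return failed checks as strings; an empty list means the gate passes."""
    failures = []
    # Identifiers are interpolated, so they must come from trusted metadata.
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {new_column} IS NULL"
    ).fetchone()[0]
    if nulls:
        failures.append(f"{nulls} rows still have NULL {new_column}")
    dupes = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {key_column} FROM {table} "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dupes:
        failures.append(f"{dupes} duplicated values in {key_column}")
    return failures


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_lower TEXT)"
)
conn.executemany(
    "INSERT INTO users (email, email_lower) VALUES (?, ?)",
    [("A@x.io", "a@x.io"), ("a@x.io", "a@x.io"), ("B@x.io", None)],
)
failures = verify_backfill(conn, "users", "email_lower", "email_lower")
for failure in failures:
    print("GATE FAILED:", failure)
```

&lt;p&gt;In a pipeline, a non-empty result would stop the contract phase and hand the list to the owning team for data repair.&lt;/p&gt;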
&lt;p&gt;The fourth rule is delayed destruction. Contract migrations happen only after the system has proven that the old shape is unused. Dropping a column is not the rollback plan. It is the last step after telemetry, code search, deploy completion, and data verification say the old contract is gone.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; The documented pattern across mature systems is that schema change must be decoupled from ordinary deploy speed. GitLab documents post-deployment migrations for changes that should run after application code is deployed, and it separately documents batched background migrations for long-running data changes.&lt;sup&gt;&lt;a href=&quot;#user-content-fn-gitlab-post-deploy&quot; id=&quot;user-content-fnref-gitlab-post-deploy&quot; data-footnote-ref=&quot;&quot; aria-describedby=&quot;footnote-label&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;sup&gt;&lt;a href=&quot;#user-content-fn-gitlab-batched&quot; id=&quot;user-content-fnref-gitlab-batched&quot; data-footnote-ref=&quot;&quot; aria-describedby=&quot;footnote-label&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; That is not an exotic optimization. It is an acknowledgement that different database operations belong at different points in the release lifecycle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The platform should encode those phases directly. A pull request that adds a column should pass static migration checks. A deploy should apply only migrations that are safe before code rollout. A post-deploy phase should run operations that depend on new code being present. A backfill worker should own data movement in controlled batches. A final contract migration should be blocked until verification proves the old path is no longer required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The result is not zero risk. It is localized risk. A failed additive migration can block a deploy before incompatible code ships. A slow backfill can be paused without rolling back the application. A failed verification can stop the contract phase while production continues using the expanded schema. GitHub’s &lt;code&gt;gh-ost&lt;/code&gt; is an example of the same operational instinct for MySQL schema changes: online migration machinery exists because directly altering large production tables can couple migration workload to user-facing database load.&lt;sup&gt;&lt;a href=&quot;#user-content-fn-github-ghost-blog&quot; id=&quot;user-content-fnref-github-ghost-blog&quot; data-footnote-ref=&quot;&quot; aria-describedby=&quot;footnote-label&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;sup&gt;&lt;a href=&quot;#user-content-fn-github-ghost-repo&quot; id=&quot;user-content-fnref-github-ghost-repo&quot; data-footnote-ref=&quot;&quot; aria-describedby=&quot;footnote-label&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; The important lesson is that database CI/CD should optimize for reversible application states, not reversible SQL files. Rollback is often a code movement back to a compatible version while the database remains expanded. The database should move forward through safe states, with destructive changes delayed until they are boring.&lt;/p&gt;
&lt;h3 id=&quot;the-pipeline-contract&quot;&gt;The Pipeline Contract&lt;/h3&gt;
&lt;p&gt;A serious database pipeline needs more than a migration runner.&lt;/p&gt;
&lt;p&gt;It needs a classifier. Additive operations can proceed automatically. Potentially blocking operations require review. Destructive operations require proof that they are in the contract phase. Data rewrites require a backfill plan.&lt;/p&gt;
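&lt;p&gt;As a rough illustration, a first-pass classifier can be a handful of ordered regular expressions over each migration statement. The categories follow the paragraph above, but the patterns are assumptions written for PostgreSQL-style DDL and will miss cases; a real implementation would use a SQL parser or an existing migration linter.&lt;/p&gt;

```python
import re

# Ordered from most to least severe; the first matching rule wins.
RULES = [
    ("destructive", re.compile(
        r"\bDROP\s+(TABLE|COLUMN)\b|\bTRUNCATE\b", re.I)),
    ("data_rewrite", re.compile(
        r"^\s*(UPDATE|DELETE)\b", re.I)),
    ("blocking", re.compile(
        r"\bCREATE\s+(UNIQUE\s+)?INDEX\s+(?!\s*CONCURRENTLY)"
        r"|\bSET\s+NOT\s+NULL\b"
        r"|\bALTER\s+COLUMN\s+\w+\s+TYPE\b", re.I)),
    ("additive", re.compile(
        r"\bADD\s+COLUMN\b|\bCREATE\s+TABLE\b"
        r"|\bCREATE\s+(UNIQUE\s+)?INDEX\s+CONCURRENTLY\b", re.I)),
]


def classify(statement):
    """Return the risk category for one SQL statement."""
    for label, pattern in RULES:
        if pattern.search(statement):
            return label
    return "needs_review"  # unknown statements are never auto-approved


for sql in [
    "ALTER TABLE users ADD COLUMN nickname text",
    "CREATE INDEX idx_users_email ON users (email)",
    "CREATE INDEX CONCURRENTLY idx_users_email ON users (email)",
    "UPDATE users SET nickname = name",
    "ALTER TABLE users DROP COLUMN legacy_name",
]:
    print(classify(sql), "|", sql)
```

&lt;p&gt;Defaulting unknown statements to &lt;code&gt;needs_review&lt;/code&gt; keeps the safe path automatic without letting unrecognized DDL through silently.&lt;/p&gt;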
&lt;p&gt;It needs production realism. CI should run migrations from both an empty database and a recent schema snapshot. The empty case catches ordering problems. The snapshot case catches drift, long-forgotten assumptions, and migrations that only work when no data exists.&lt;/p&gt;
&lt;p&gt;It needs policy checks. Examples include rejecting column drops outside a contract migration, requiring concurrent index creation where supported, blocking non-null constraints without a prior validation plan, and requiring idempotent backfill jobs with checkpoints.&lt;/p&gt;
&lt;p&gt;It needs observability. A backfill without progress is just a long-running incident with a friendlier name. Track rows scanned, rows changed, error rate, lock waits, deadlocks, replica lag, batch latency, and estimated completion. The deploy system should be able to pause the job automatically when the database is under stress.&lt;/p&gt;
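&lt;p&gt;The automatic pause can be expressed as a pure function over the metrics the job already tracks. In this sketch the metric names and threshold values are illustrative assumptions, not recommendations; how the metrics are collected is left out.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Illustrative thresholds only; tune them per database and workload.
    max_replica_lag_s: float = 30.0
    max_lock_waits: int = 5
    max_batch_latency_s: float = 2.0


def pause_reasons(metrics, limits=Limits()):
    """Return why a backfill should pause; an empty list means keep going."""
    checks = [
        ("replica_lag_s", limits.max_replica_lag_s),
        ("lock_waits", limits.max_lock_waits),
        ("batch_latency_s", limits.max_batch_latency_s),
    ]
    return [
        f"{name}={metrics[name]} exceeds {limit}"
        for name, limit in checks
        if metrics.get(name, 0) > limit
    ]


sample = {"replica_lag_s": 45.0, "lock_waits": 1, "batch_latency_s": 0.4}
reasons = pause_reasons(sample)
print(reasons)
```

&lt;p&gt;Returning reasons instead of a boolean gives the deploy system something to log when it pauses the job.&lt;/p&gt;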
&lt;p&gt;It needs explicit ownership. The author of a migration owns the full lifecycle: expand, application compatibility, backfill, verification, and contract. Platform automation can enforce the gates, but it cannot infer the business invariant. Only the owning team can say what “fully backfilled” or “safe to remove” means.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Mitigation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Migration passes CI but blocks production&lt;/td&gt;&lt;td&gt;Test data is too small and lock behavior is invisible&lt;/td&gt;&lt;td&gt;Run static checks, use realistic schema snapshots, require online patterns for large tables&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backfill overloads the primary&lt;/td&gt;&lt;td&gt;Data movement is deployed like code instead of operated like workload&lt;/td&gt;&lt;td&gt;Use bounded batches, throttling, checkpoints, and automatic pause conditions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rollback expectation is false&lt;/td&gt;&lt;td&gt;Application rollback cannot undo destructive schema changes&lt;/td&gt;&lt;td&gt;Use expand-contract and keep old schema available through rollback windows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Constraint validation fails late&lt;/td&gt;&lt;td&gt;Existing data violates the new invariant&lt;/td&gt;&lt;td&gt;Add constraints in stages, preflight violations, repair data before enforcement&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Contract happens too early&lt;/td&gt;&lt;td&gt;Old code path still exists in workers, scripts, or delayed jobs&lt;/td&gt;&lt;td&gt;Verify usage with telemetry, code search, deploy completion, and job drain checks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Pipeline becomes too slow&lt;/td&gt;&lt;td&gt;Every change is treated as maximum risk&lt;/td&gt;&lt;td&gt;Classify operations and automate the safe path while escalating only risky changes&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Database changes fail differently than application changes because they mutate shared durable state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Treat schema migration, code rollout, backfill, verification, and contract as separate CI/CD phases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Use documented patterns such as post-deployment migrations, batched background migrations, and online schema migration tools as evidence that mature systems separate risk by operation type.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Add pipeline gates for unsafe DDL, require resumable backfills, block destructive changes until verification passes, and make every database change declare its expand-contract plan.&lt;/li&gt;
&lt;/ul&gt;
&lt;section data-footnotes=&quot;&quot; class=&quot;footnotes&quot;&gt;&lt;h2 class=&quot;sr-only&quot; id=&quot;footnote-label&quot;&gt;Footnotes&lt;/h2&gt;
&lt;ol&gt;
&lt;li id=&quot;user-content-fn-rails-migrations&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://guides.rubyonrails.org/active_record_migrations.html&quot;&gt;Rails Guides — Active Record Migrations&lt;/a&gt; &lt;a href=&quot;#user-content-fnref-rails-migrations&quot; data-footnote-backref=&quot;&quot; aria-label=&quot;Back to reference 1&quot; class=&quot;data-footnote-backref&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;user-content-fn-gitlab-post-deploy&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.gitlab.com/development/database/post_deployment_migrations/&quot;&gt;GitLab Docs — Post-deployment migrations&lt;/a&gt; &lt;a href=&quot;#user-content-fnref-gitlab-post-deploy&quot; data-footnote-backref=&quot;&quot; aria-label=&quot;Back to reference 2&quot; class=&quot;data-footnote-backref&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;user-content-fn-gitlab-batched&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.gitlab.com/development/database/batched_background_migrations/&quot;&gt;GitLab Docs — Batched background migrations&lt;/a&gt; &lt;a href=&quot;#user-content-fnref-gitlab-batched&quot; data-footnote-backref=&quot;&quot; aria-label=&quot;Back to reference 3&quot; class=&quot;data-footnote-backref&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;user-content-fn-github-ghost-blog&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://github.blog/2016-08-01-gh-ost-github-s-online-migration-tool-for-mysql/&quot;&gt;GitHub Blog — gh-ost: GitHub’s online schema migration tool for MySQL&lt;/a&gt; &lt;a href=&quot;#user-content-fnref-github-ghost-blog&quot; data-footnote-backref=&quot;&quot; aria-label=&quot;Back to reference 4&quot; class=&quot;data-footnote-backref&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;user-content-fn-github-ghost-repo&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/github/gh-ost&quot;&gt;GitHub — gh-ost repository&lt;/a&gt; &lt;a href=&quot;#user-content-fnref-github-ghost-repo&quot; data-footnote-backref=&quot;&quot; aria-label=&quot;Back to reference 5&quot; class=&quot;data-footnote-backref&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;</content:encoded><category>databases</category><category>architecture</category></item><item><title>PostgreSQL Monitoring: The Dashboard That Surfaces Problems Before Users Do</title><link>https://rajivonai.com/blog/2024-07-08-postgresql-monitoring-dashboard-queries-connections-replication-vacuum/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-07-08-postgresql-monitoring-dashboard-queries-connections-replication-vacuum/</guid><description>The eight PostgreSQL metric groups that matter for production operations — queries, connections, replication lag, autovacuum, locks, cache pressure, checkpoint behavior, and bloat — with exact SQL and alert thresholds.</description><pubDate>Mon, 08 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A PostgreSQL dashboard that only shows CPU and memory is a late warning system. The database tells you about problems in its own catalog — in &lt;code&gt;pg_stat_activity&lt;/code&gt;, &lt;code&gt;pg_stat_statements&lt;/code&gt;, &lt;code&gt;pg_stat_replication&lt;/code&gt;, and &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; — before they surface as user-visible errors. The question is whether you’re reading those catalogs before or after the incident page fires.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most PostgreSQL monitoring setups start with the OS metrics the infrastructure team already collects: CPU, memory, disk I/O, network. Those metrics are necessary but not sufficient. A database with 20% CPU and 60% memory can still be in deep trouble: connection pools exhausted, replica 45 minutes behind, autovacuum fighting bloat on the largest tables, and a lock chain building behind a slow migration.&lt;/p&gt;
&lt;p&gt;The eight PostgreSQL metric groups below come from the database itself. Most can be collected by any monitoring agent — Datadog, Prometheus + postgres_exporter, CloudWatch with Enhanced Monitoring, or direct queries from a read-only monitoring role.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Symptom&lt;/th&gt;&lt;th&gt;Likely source&lt;/th&gt;&lt;th&gt;First catalog to check&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application queries suddenly slower&lt;/td&gt;&lt;td&gt;Lock contention or bad plan&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt;, &lt;code&gt;pg_locks&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Connection pool exhausted&lt;/td&gt;&lt;td&gt;Idle-in-transaction sessions or &lt;code&gt;max_connections&lt;/code&gt; reached&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt; filtered by state&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica reads returning stale data&lt;/td&gt;&lt;td&gt;Replication lag&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_replication&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Table scan on a previously fast query&lt;/td&gt;&lt;td&gt;Stale planner statistics or table bloat&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Checkpoint warnings in server log&lt;/td&gt;&lt;td&gt;Checkpoints forced too frequently by WAL volume&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_bgwriter&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Application sees deadlock errors&lt;/td&gt;&lt;td&gt;Write contention on hot rows&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_locks&lt;/code&gt; + server log&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Disk filling faster than expected&lt;/td&gt;&lt;td&gt;Orphaned temp files or unarchived WAL&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_database&lt;/code&gt; (&lt;code&gt;temp_bytes&lt;/code&gt;), &lt;code&gt;pg_stat_archiver&lt;/code&gt;, WAL directory&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;OOM kill on the database server&lt;/td&gt;&lt;td&gt;&lt;code&gt;work_mem&lt;/code&gt; overrun from parallel queries&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt; + &lt;code&gt;work_mem&lt;/code&gt; setting&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
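&lt;p&gt;Run these in order when something is wrong. Each check only reads statistics views, but seeing other users’ query text in &lt;code&gt;pg_stat_activity&lt;/code&gt; requires superuser or membership in &lt;code&gt;pg_read_all_stats&lt;/code&gt; (included in &lt;code&gt;pg_monitor&lt;/code&gt;).&lt;/p&gt;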
&lt;p&gt;&lt;strong&gt;1. What are active sessions doing right now?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query_start&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       query, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, wait_event_type, wait_event, usename&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; !=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;5 seconds&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for sessions in &lt;code&gt;idle in transaction&lt;/code&gt; (holding locks while waiting on an application) or &lt;code&gt;active&lt;/code&gt; with long durations. Any query running more than 30 seconds in OLTP deserves investigation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Is anyone waiting on locks?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocked_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocked_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;usename&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_user,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;       now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query_start&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocked_duration&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_locks blocked_locks&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity blocked &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked_locks&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_locks blocking_locks&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;     ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking_locks&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transactionid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked_locks&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;transactionid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;     AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking_locks&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; !=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked_locks&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity blocking &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking_locks&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocked_locks&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;granted&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocked_duration &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
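&lt;p&gt;A lock wait that lasts longer than 10 seconds is a reliability event, not a monitoring blip. This join only matches waits on transaction ID locks, which covers typical row-level conflicts; for waits on relation or other lock types, &lt;code&gt;pg_blocking_pids(pid)&lt;/code&gt; returns the blockers of any waiting backend.&lt;/p&gt;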
&lt;p&gt;&lt;strong&gt;3. How far behind is the replica?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the primary:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; client_addr, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, sent_lsn, write_lsn, flush_lsn, replay_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       (sent_lsn &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replay_lsn) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replication_lag_bytes,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       write_lag, flush_lag, replay_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For byte lag, &lt;code&gt;pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)&lt;/code&gt; returns how many bytes of WAL the replica has yet to replay. For lag in seconds, read the &lt;code&gt;replay_lag&lt;/code&gt; column directly (PostgreSQL 10+). Many monitoring agents compute both. Alert at 60 seconds; page at 300 seconds for read-replica-dependent applications.&lt;/p&gt;
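&lt;p&gt;Lag can also be measured from the replica side. The query below is a minimal sketch, assuming a streaming physical replica on PostgreSQL 10 or later; the column aliases are illustrative. The timestamp-based figure keeps growing while the primary is idle, because no new transactions arrive to replay, so it is best read alongside byte lag.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the replica: time since the last replayed transaction was committed on the primary.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_is_in_recovery() AS is_replica,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_last_wal_receive_lsn() AS received_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_last_wal_replay_lsn() AS replayed_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS replay_delay_secs;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;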
&lt;p&gt;&lt;strong&gt;4. Is autovacuum keeping up?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_dead_tup::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; NULLIF&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; dead_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       last_autovacuum,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       last_autoanalyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dead tuple ratio over 20% on a high-traffic table means autovacuum is behind. Tables not autovacuumed in 24 hours are candidates for bloat investigation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. What is checkpoint pressure?&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; checkpoints_timed, checkpoints_req,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       checkpoint_write_time &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; write_secs,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       checkpoint_sync_time &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; sync_secs,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       buffers_checkpoint, buffers_clean, buffers_backend,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       buffers_alloc,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       stats_reset&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
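&lt;p&gt;These counters are cumulative since &lt;code&gt;stats_reset&lt;/code&gt;, so watch how fast they grow rather than whether they are above zero. &lt;code&gt;checkpoints_req&lt;/code&gt; growing quickly relative to &lt;code&gt;checkpoints_timed&lt;/code&gt; typically means WAL is reaching &lt;code&gt;max_wal_size&lt;/code&gt; before &lt;code&gt;checkpoint_timeout&lt;/code&gt; elapses, so PostgreSQL is forcing checkpoints early. &lt;code&gt;buffers_backend&lt;/code&gt; growing quickly means application backends are writing out dirty buffers that the background writer and checkpointer should handle, which is a sign of write pressure. On PostgreSQL 17 and later, the checkpoint counters live in &lt;code&gt;pg_stat_checkpointer&lt;/code&gt; instead.&lt;/p&gt;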
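&lt;p&gt;One way to turn the raw counters into a single number is the share of checkpoints that were requested rather than timed. This is a sketch, assuming PostgreSQL 16 or earlier, where these columns are still in &lt;code&gt;pg_stat_bgwriter&lt;/code&gt;. The aliases &lt;code&gt;requested_pct&lt;/code&gt; and &lt;code&gt;counting_for&lt;/code&gt; are illustrative, and any alert threshold on the ratio is a judgment call that depends on the workload.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Share of checkpoints forced by WAL volume (or manual requests) since the last stats reset.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT checkpoints_timed,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       checkpoints_req,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ROUND(checkpoints_req::numeric / NULLIF(checkpoints_timed + checkpoints_req, 0) * 100, 2) AS requested_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - stats_reset AS counting_for&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;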
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Symptom observed] --&gt; B{Active sessions check}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Long-running active queries| C[Check pg_stat_statements — plan regression or new query?]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Idle in transaction sessions| D[Find the application holding transactions open]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Lock waits| E[Kill blocking session or escalate to application team]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|All looks normal| F{Check replication}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|Replica lag above threshold| G[Identify write pressure source — high-volume writes or bloated WAL archiving?]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|Lag acceptable| H{Check autovacuum}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|Dead tuples high| I[Manual VACUUM on table or lower autovacuum_vacuum_scale_factor]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|Autovacuum absent| J[Check autovacuum_max_workers and pg_stat_activity for autovacuum processes]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|No autovacuum issues| K{Check checkpoint pressure}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|checkpoints_req high| L[Increase max_wal_size or spread write workload]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|buffers_backend high| M[Tune bgwriter_lru_maxpages or review write amplification]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Problem&lt;/th&gt;&lt;th&gt;Immediate action&lt;/th&gt;&lt;th&gt;Durable fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Long-running idle-in-transaction&lt;/td&gt;&lt;td&gt;&lt;code&gt;SELECT pg_terminate_backend(pid)&lt;/code&gt; on sessions over threshold&lt;/td&gt;&lt;td&gt;Set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; on the application role&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Lock chain&lt;/td&gt;&lt;td&gt;Identify and terminate the root blocking session&lt;/td&gt;&lt;td&gt;Fix the application transaction that holds locks across slow external calls&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica lag&lt;/td&gt;&lt;td&gt;Check for write burst or long transaction on primary&lt;/td&gt;&lt;td&gt;Add replication slot monitoring; check for long-running replica queries that delay replay (&lt;code&gt;max_standby_streaming_delay&lt;/code&gt;) and size replica I/O to match the primary&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High dead tuples&lt;/td&gt;&lt;td&gt;&lt;code&gt;VACUUM (VERBOSE) tablename;&lt;/code&gt; directly&lt;/td&gt;&lt;td&gt;Lower &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; for high-traffic tables; increase &lt;code&gt;autovacuum_max_workers&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Checkpoint pressure&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;max_wal_size&lt;/code&gt; (default 1GB, common to set 4–16GB)&lt;/td&gt;&lt;td&gt;Review write amplification from bulk loads; separate OLAP workloads to replicas&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cache hit ratio below 95%&lt;/td&gt;&lt;td&gt;Review &lt;code&gt;shared_buffers&lt;/code&gt; sizing (roughly 25% of RAM is the usual starting point)&lt;/td&gt;&lt;td&gt;Identify tables with heavy disk reads using &lt;code&gt;pg_statio_user_tables&lt;/code&gt; and frequent sequential scans using &lt;code&gt;pg_stat_user_tables&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
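&lt;p&gt;Three of the durable fixes in the table can be sketched as configuration statements. The role &lt;code&gt;app_user&lt;/code&gt; and table &lt;code&gt;orders&lt;/code&gt; are hypothetical names, and the values are examples rather than recommendations. &lt;code&gt;ALTER SYSTEM&lt;/code&gt; needs superuser privileges, so on managed services the equivalent change usually goes through the provider’s parameter settings instead.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- End sessions of this role that sit idle inside a transaction for more than 5 minutes.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Per-table autovacuum: trigger at about 2% dead rows instead of the 20% default.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.02);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Allow more WAL between checkpoints; applied on configuration reload.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER SYSTEM SET max_wal_size = '4GB';&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;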
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Three PostgreSQL checks can be automated into a runbook trigger:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Idle-in-transaction watchdog&lt;/strong&gt;: query &lt;code&gt;pg_stat_activity&lt;/code&gt; every 60 seconds; alert if any session has been &lt;code&gt;idle in transaction&lt;/code&gt; for more than 5 minutes. Auto-terminate sessions over 30 minutes with a logged record.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Replica lag SLO&lt;/strong&gt;: collect &lt;code&gt;pg_stat_replication.replay_lag&lt;/code&gt; as a gauge metric; alert at 60s, page at 5 minutes, trigger read traffic rerouting away from the reader endpoint at 10 minutes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Autovacuum health check&lt;/strong&gt;: daily scheduled query against &lt;code&gt;pg_stat_user_tables&lt;/code&gt;; flag tables where &lt;code&gt;last_autovacuum&lt;/code&gt; is null or more than 48 hours old AND &lt;code&gt;n_live_tup &gt; 100000&lt;/code&gt;. Output as a structured JSON payload to the operations channel.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
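&lt;p&gt;The first and third checks can be sketched as plain queries for a scheduler to run. The thresholds mirror the list above; the aliases &lt;code&gt;idle_for&lt;/code&gt; and &lt;code&gt;stale_tables&lt;/code&gt; are illustrative, and the scheduling, alert routing, and auto-termination logic are left to whatever tooling runs the queries.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Watchdog: sessions that have been idle inside a transaction for more than 5 minutes.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid, usename, application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - state_change AS idle_for&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE state = 'idle in transaction'&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND now() - state_change &gt; interval '5 minutes'&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY idle_for DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Autovacuum health check: large tables never autovacuumed or not in 48 hours, as one JSON array.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT COALESCE(json_agg(t), '[]'::json) AS stale_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SELECT schemaname, relname, n_live_tup, last_autovacuum&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  WHERE n_live_tup &gt; 100000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    AND (last_autovacuum IS NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         OR now() - last_autovacuum &gt; interval '48 hours')&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) AS t;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;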
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;p&gt;PostgreSQL health is not visible in CPU and memory alone. The database catalogs tell you about lock chains, replica lag, bloat accumulation, and checkpoint pressure — all of which affect user-visible latency before CPU crosses 80%. The metrics above require a read-only monitoring role and a scrape interval of 60 seconds or less. The most common monitoring gap in PostgreSQL deployments is not the absence of metrics but the absence of thresholds: teams collect data without defining what “bad” looks like until they are in an incident trying to find historical baselines.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Alert on every autovacuum completion&lt;/td&gt;&lt;td&gt;Autovacuum runs are logged as activity; thresholds not tuned to table size&lt;/td&gt;&lt;td&gt;Alert on dead tuple ratio, not autovacuum frequency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Lock alert fires during schema migration&lt;/td&gt;&lt;td&gt;Intentional DDL lock causes alert storm&lt;/td&gt;&lt;td&gt;Suppress lock alerts during maintenance windows; use &lt;code&gt;lock_timeout&lt;/code&gt; on migrations&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica lag alert on writes&lt;/td&gt;&lt;td&gt;Single large write causes temporary lag; recovers in seconds&lt;/td&gt;&lt;td&gt;Use 60-second averages, not point-in-time values&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt; not populated&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt; not in &lt;code&gt;shared_preload_libraries&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Add to &lt;code&gt;shared_preload_libraries&lt;/code&gt;, restart, then run &lt;code&gt;CREATE EXTENSION pg_stat_statements&lt;/code&gt;; raise &lt;code&gt;track_activity_query_size&lt;/code&gt; if query text in &lt;code&gt;pg_stat_activity&lt;/code&gt; is truncated&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Monitoring role missing&lt;/td&gt;&lt;td&gt;Agent lacks read access to catalogs&lt;/td&gt;&lt;td&gt;Create a dedicated &lt;code&gt;monitoring&lt;/code&gt; role with &lt;code&gt;pg_monitor&lt;/code&gt; system role (PG 10+)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica lag in the wrong unit&lt;/td&gt;&lt;td&gt;Lag reported in bytes, not seconds&lt;/td&gt;&lt;td&gt;Use the &lt;code&gt;replay_lag&lt;/code&gt; column directly (PG 10+), or compute &lt;code&gt;now() - pg_last_xact_replay_timestamp()&lt;/code&gt; on the replica&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
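&lt;p&gt;Several of the fixes in the table reduce to a few setup statements. This is a sketch only: the role name &lt;code&gt;monitoring&lt;/code&gt; follows the table, authentication for the role is deliberately left out, and the 5-second lock timeout is an example value.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Dedicated read-only monitoring role (PostgreSQL 10+).&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE ROLE monitoring WITH LOGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GRANT pg_monitor TO monitoring;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After adding pg_stat_statements to shared_preload_libraries and restarting:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE EXTENSION IF NOT EXISTS pg_stat_statements;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- In a migration session: fail after 5 seconds instead of queueing behind a lock.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET lock_timeout = '5s';&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;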
&lt;h2 id=&quot;what-this-post-does-not-cover&quot;&gt;What This Post Does Not Cover&lt;/h2&gt;
&lt;p&gt;This post covers catalog-level PostgreSQL monitoring from inside the database. It does not cover: Prometheus exporter configuration and recording rules (covered in the Prometheus and Grafana post in this series), CloudWatch Enhanced Monitoring for RDS/Aurora, PgBouncer pool metrics, or logical replication slot lag as a distinct monitoring dimension. Each of those has a dedicated post in this series.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; PostgreSQL is reporting problems through its catalogs, but your dashboard only shows OS-level metrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Add the eight metric groups above to your monitoring stack using &lt;code&gt;pg_monitor&lt;/code&gt; role and a 60-second scrape interval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Run the five checks above against your production instance right now and note whether any sessions are idle-in-transaction, whether replicas are within SLO, and whether any table has a dead tuple ratio above 10%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; This week, create a &lt;code&gt;monitoring&lt;/code&gt; role with &lt;code&gt;GRANT pg_monitor TO monitoring&lt;/code&gt;, add it to your Datadog, Prometheus, or CloudWatch configuration, and set a replica lag alert with a 60-second threshold.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>Search Index Drift Workflow: Rebuilds, Dual Writes, CDC, and User-Visible Staleness</title><link>https://rajivonai.com/blog/2024-06-14-search-index-drift-workflow-rebuilds-dual-writes-cdc-and-user-visible-staleness/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-06-14-search-index-drift-workflow-rebuilds-dual-writes-cdc-and-user-visible-staleness/</guid><description>Search index drift is a truth-management failure: when to rebuild vs. dual-write vs. CDC, and how to bound user-visible staleness.</description><pubDate>Fri, 14 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Search drift is not a search problem first. It is a truth-management problem that becomes visible through search.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most product systems keep their source of truth in a transactional database and serve discovery from a separate search index. The database is optimized for correctness, constraints, and writes. The index is optimized for ranking, tokenization, faceting, filtering, autocomplete, and latency.&lt;/p&gt;
&lt;p&gt;That split is normal. PostgreSQL, MySQL, DynamoDB, Spanner, or another system owns the canonical record. Elasticsearch, OpenSearch, Solr, Vespa, Algolia, or a custom retrieval layer owns the read path for search. Between them sits a workflow that turns database mutations into index mutations.&lt;/p&gt;
&lt;p&gt;The uncomfortable part is that the index is not merely a cache. Users treat search results as product truth. If a deleted document still appears, if a price update lags, if an access-control change is missing, or if a newly created object is absent, the failure is not described as “eventual consistency.” It is described as “the product is wrong.”&lt;/p&gt;
&lt;p&gt;Search index drift is the gap between canonical state and searchable state. Some drift is expected. Unbounded drift is an incident.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Teams usually discover drift after adopting one of three write patterns.&lt;/p&gt;
&lt;p&gt;The first is application dual write: the request handler writes the database and then writes the search index. This looks simple until partial failure appears. The database commit succeeds, the index write times out, the retry creates stale ordering, or the process crashes between operations. If the two systems cannot share a transaction boundary, the application has accepted a consistency gap.&lt;/p&gt;
&lt;p&gt;The second is asynchronous job indexing: writes enqueue work, and workers update the index later. This removes latency from the request path, but it creates a backlog system. Queue lag, poison messages, deploy bugs, and schema incompatibilities become search correctness risks.&lt;/p&gt;
&lt;p&gt;The third is periodic rebuild: the team periodically scans the database and recreates the index. Rebuilds are useful, but they are not a complete freshness strategy. A nightly rebuild can repair silent corruption, but it cannot provide minute-level correctness unless the product accepts a full day of visible staleness.&lt;/p&gt;
&lt;p&gt;The core question is not “which tool indexes fastest?” It is: how do we bound, observe, repair, and communicate the difference between source-of-truth state and search-visible state?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The practical architecture combines four ideas: change capture, idempotent indexing, rebuildable indexes, and user-visible freshness controls.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[primary database — canonical records] --&gt; B[transaction log — ordered changes]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[change capture workers — durable cursor]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D[index writer — idempotent updates]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[active search index — user queries]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A --&gt; F[bulk rebuild job — full snapshot]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; G[shadow search index — validation target]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; H[index alias switch — controlled cutover]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; I[drift monitor — lag and mismatches]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt; J[operator workflow — replay repair rebuild]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; K[user interface — freshness signals]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The database remains the only source of truth. Search documents carry source version metadata: record ID, updated timestamp, logical sequence number, schema version, and deletion marker. Index writes are idempotent, so replaying the same change is safe. Out-of-order writes are rejected when the incoming version is older than the indexed version.&lt;/p&gt;
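&lt;p&gt;The version rule can be sketched in SQL against a hypothetical projection table; a real search engine would apply the same comparison through its own versioning mechanism. The table &lt;code&gt;search_projection&lt;/code&gt;, its columns, and the sample values are assumptions made for illustration.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical stand-in for a search document with source version metadata.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE IF NOT EXISTS search_projection (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  record_id      bigint PRIMARY KEY,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  source_version bigint NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  is_deleted     boolean NOT NULL DEFAULT false,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  body           jsonb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Replaying the same change is a no-op; an older version never overwrites a newer one.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;INSERT INTO search_projection AS p (record_id, source_version, is_deleted, body)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;VALUES (42, 7, false, jsonb_build_object('title', 'example'))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ON CONFLICT (record_id) DO UPDATE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET source_version = EXCLUDED.source_version,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    is_deleted     = EXCLUDED.is_deleted,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    body           = EXCLUDED.body&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE EXCLUDED.source_version &gt; p.source_version;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Deletes travel through the same statement with &lt;code&gt;is_deleted&lt;/code&gt; set to true, so a late-arriving update with a lower version cannot resurrect a deleted record.&lt;/p&gt;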
&lt;p&gt;Change data capture is the preferred steady-state path because it follows committed database changes rather than application intent. The application writes the database once. A CDC pipeline reads the transaction log and updates the index. This does not eliminate drift, but it moves drift into a measurable workflow: cursor lag, event age, failure rate, dead-letter volume, and version mismatch count.&lt;/p&gt;
&lt;p&gt;Rebuilds remain mandatory. CDC preserves forward progress; rebuilds repair historical mistakes. A rebuild creates a shadow index from a consistent source snapshot, validates document counts and sampled records, warms query paths, then atomically moves an alias or routing pointer. The old index remains available for rollback until confidence is high.&lt;/p&gt;
&lt;p&gt;Dual writes are still useful in narrow places. For example, a product may write directly to search for low-risk preview experiences while CDC provides authoritative correction. But dual writes should not be the only correctness mechanism for objects where permissions, money, inventory, or deletion semantics matter.&lt;/p&gt;
&lt;p&gt;User-visible staleness must be designed deliberately. Some systems can show “results updated a few seconds ago.” Others need read-after-write behavior for the author of a change, even if global search is eventually consistent. That can be handled by merging canonical database reads for the user’s own recent writes, routing a specific object lookup to the database, or hiding search results whose indexed version is older than a known permission version.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Elasticsearch documents its &lt;code&gt;_reindex&lt;/code&gt; API and alias-based index management as operational mechanisms for copying documents into a new index and switching traffic through aliases. The documented pattern is that index structure changes and large repairs are handled by creating a new index, filling it, and moving the read alias rather than mutating every serving assumption in place.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Apply that pattern to search drift recovery. Treat every serving index as replaceable. Keep index mappings and analyzers versioned. Build a shadow index from the canonical store, compare counts and sampled documents, then switch the alias when validation passes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Rebuilds become a normal maintenance operation instead of a one-off incident script. The system can repair missed CDC events, analyzer mistakes, mapping errors, and accidental partial deletes without taking search offline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Rebuildability is a correctness property. If the index cannot be recreated from truth, then the index has quietly become truth.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Debezium’s documented architecture captures database changes from transaction logs and emits ordered change events to downstream consumers. PostgreSQL logical decoding and MySQL binlog replication expose the same architectural principle: committed database changes can be read after the fact without placing a second write inside the application request path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Use CDC as the default index mutation source. Persist consumer offsets. Make index writes idempotent. Store source versions in documents. Send failed records to a dead-letter workflow that can be replayed after the bug is fixed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The indexing path becomes observable as a pipeline rather than hidden inside application handlers. Operators can measure lag, pause consumers, replay records, and distinguish source write failures from projection failures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; CDC does not make search strongly consistent. It makes inconsistency bounded, inspectable, and repairable.&lt;/p&gt;
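&lt;p&gt;On PostgreSQL, the cursor behind a CDC consumer is a logical replication slot, and its lag is visible from the catalog. The sketch below assumes &lt;code&gt;wal_level = logical&lt;/code&gt; and uses an example slot name, &lt;code&gt;search_cdc&lt;/code&gt;; tools such as Debezium normally create and manage their own slot, so the first statement is only for illustration. A slot that nobody consumes keeps WAL on the primary, which is why the second query belongs on a dashboard.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Create a logical slot using the built-in pgoutput plugin (PostgreSQL 10+).&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT * FROM pg_create_logical_replication_slot('search_cdc', 'pgoutput');&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- How much WAL each logical consumer has not yet confirmed.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT slot_name, active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS unconfirmed_wal&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE slot_type = 'logical';&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;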
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Amazon DynamoDB Streams documents an ordered stream of item-level modifications that can trigger downstream processing. The documented pattern is not specific to search: one durable primary write can fan out to derived views.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; For key-value or document stores, use the database’s change stream as the trigger for index projection. Preserve deletion events, because missing tombstones are one of the most common sources of user-visible drift.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The index can track creates, updates, and deletes from the same committed mutation source. Replays can reconstruct the projected state if the index writer is deterministic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Deletes deserve first-class workflow design. A stale creation is annoying; a stale deletion can be a privacy, permission, or compliance failure.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;













































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Mitigation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Out-of-order updates&lt;/td&gt;&lt;td&gt;Retries and parallel workers race&lt;/td&gt;&lt;td&gt;Store source versions and reject older writes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Missing deletes&lt;/td&gt;&lt;td&gt;Tombstones expire before indexing catches up&lt;/td&gt;&lt;td&gt;Retain delete events long enough for replay and rebuild&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rebuild cutover errors&lt;/td&gt;&lt;td&gt;Shadow index differs from serving assumptions&lt;/td&gt;&lt;td&gt;Use aliases, validation queries, and rollback windows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CDC backlog&lt;/td&gt;&lt;td&gt;Consumer deploy, poison event, or downstream throttling&lt;/td&gt;&lt;td&gt;Alert on event age, not only queue depth&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Mapping drift&lt;/td&gt;&lt;td&gt;Application emits fields the index cannot parse&lt;/td&gt;&lt;td&gt;Version schemas and fail records into replayable quarantine&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Permission staleness&lt;/td&gt;&lt;td&gt;Search document carries old access metadata&lt;/td&gt;&lt;td&gt;Version authorization data or verify sensitive results against truth&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Silent corruption&lt;/td&gt;&lt;td&gt;Index accepts wrong but valid documents&lt;/td&gt;&lt;td&gt;Run sampled truth-versus-index audits continuously&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
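&lt;p&gt;A sampled truth-versus-index audit can be sketched as a single query, assuming the index writer records what it last indexed in a table the database can read. Both tables here are hypothetical: &lt;code&gt;documents(id, version)&lt;/code&gt; stands for the canonical store and &lt;code&gt;search_index_state(record_id, indexed_version)&lt;/code&gt; for the writer’s bookkeeping or a periodic export from the index.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sample roughly 1% of canonical rows; report those missing from the index or indexed at another version.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT d.id, d.version AS source_version, s.indexed_version&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM documents AS d TABLESAMPLE BERNOULLI (1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LEFT JOIN search_index_state AS s ON s.record_id = d.id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE s.indexed_version IS DISTINCT FROM d.version;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rows written in the last few seconds will show up as mismatches simply because CDC has not caught up, so a real audit would likely exclude records newer than the freshness SLO.&lt;/p&gt;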
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Search drift becomes dangerous when nobody can say how stale the index is. Define freshness SLOs by product surface, not by infrastructure component.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use CDC for steady-state propagation, idempotent writers for replay, shadow rebuilds for repair, and alias cutovers for controlled replacement.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Instrument source version, indexed version, CDC cursor lag, oldest unprocessed event age, dead-letter count, rebuild validation count, and sampled mismatch rate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Start with one high-value entity. Add version metadata to its search document, build a truth-versus-index audit, and write the runbook for replay, rebuild, and rollback before the next drift incident.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>databases</category><category>cloud</category></item><item><title>The Database Observability Baseline: What Every DBA Dashboard Must Show</title><link>https://rajivonai.com/blog/2024-06-04-database-observability-baseline-dashboard/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-06-04-database-observability-baseline-dashboard/</guid><description>Before you can adopt AI-assisted triage, your database dashboard needs a foundation built on saturation, locking, and lag metrics.</description><pubDate>Tue, 04 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If your primary database monitoring signal is a CPU spike, your telemetry is designed to tell you when the application is already broken, rather than telling you why the database is about to break.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most engineering teams rely on default cloud dashboards that prioritize host-level metrics: CPU utilization, memory consumption, and disk I/O. While these metrics matter for capacity planning, they are lagging indicators of database health. A CPU spike is the &lt;em&gt;result&lt;/em&gt; of a problem (a bad query plan, a missing index, or a connection storm), not the problem itself.&lt;/p&gt;
&lt;p&gt;As teams move toward automated operations and AI-assisted triage, the agentic systems investigating incidents need granular telemetry. You cannot build a reliable AI SRE if the only context it receives is “CPU is at 99%.” The foundation of database observability must shift from host-level symptoms to engine-level state.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;When a database fails, it usually does so in one of three ways: it runs out of connections, it gets blocked by a lock, or it falls behind on maintenance tasks (like replication or vacuuming) until performance collapses.&lt;/p&gt;
&lt;p&gt;Default dashboards rarely surface these states clearly. Engineers spend critical incident minutes running ad-hoc SQL queries to figure out what is currently executing, who is blocking whom, and whether the connection pool is saturated. If your observability strategy relies on engineers SSH-ing into a bastion or querying &lt;code&gt;pg_stat_activity&lt;/code&gt; by hand during an outage, your time-to-mitigation will never improve.&lt;/p&gt;
&lt;h2 id=&quot;the-saturation-and-contention-baseline&quot;&gt;The Saturation and Contention Baseline&lt;/h2&gt;
&lt;p&gt;Every database dashboard must surface three categories of engine-level telemetry:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Saturation Metrics&lt;/strong&gt;: Active connections vs. maximum allowed, thread pool utilization, and cache hit ratios. You must know if the database is refusing work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contention Metrics&lt;/strong&gt;: Row locks, table locks, and wait events. In PostgreSQL, this means tracking &lt;code&gt;wait_event_type&lt;/code&gt;. In MySQL, it means watching InnoDB row lock waits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lag Metrics&lt;/strong&gt;: Replication lag (in bytes and seconds) and maintenance lag (e.g., autovacuum backlog, compaction queue depth).&lt;/li&gt;
&lt;/ol&gt;
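&lt;p&gt;As a rough sketch of the saturation and lag categories above, the two queries below show one way to read connection headroom and replication lag from PostgreSQL’s built-in statistics views. They assume PostgreSQL 10 or later and are meant to run on the primary; the column aliases are illustrative rather than a standard naming scheme.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Saturation: client connections in use versus the configured ceiling.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    count(*) AS client_connections,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    current_setting(&apos;max_connections&apos;)::int AS max_connections&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE backend_type = &apos;client backend&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Lag: per-replica replay lag in bytes and as an interval.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    replay_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;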
&lt;p&gt;A baseline SQL query for PostgreSQL contention, which should be turned into a continuously collected metric, looks like this:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wait_event_type, &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    wait_event, &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;as&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_sessions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity &lt;/span&gt;&lt;/span&gt;
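&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; wait_event_type &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; state &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;active&apos;&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- skip idle sessions parked on ClientRead&lt;/span&gt;&lt;/span&gt;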
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; wait_event_type, wait_event&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_sessions &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your dashboard shows a spike in &lt;code&gt;Lock&lt;/code&gt; wait events while CPU and cache hit ratio look normal, you can attribute the slowdown to lock contention right away instead of spending the first part of the incident on ad-hoc queries.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern for robust observability involves turning engine-state queries into time-series data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; PostgreSQL’s lock architecture means that sessions waiting for a lock consume zero CPU — a blocked process is simply parked, not working. This makes host-level monitoring blind to lock-induced latency. The PostgreSQL documentation describes &lt;code&gt;pg_stat_activity.wait_event_type&lt;/code&gt; as the authoritative source for what a session is waiting on, with &lt;code&gt;Lock&lt;/code&gt; as the wait event type for sessions blocked behind another session’s hold (&lt;a href=&quot;https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW&quot;&gt;PostgreSQL docs: pg_stat_activity&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The documented operational pattern is to export &lt;code&gt;pg_stat_activity&lt;/code&gt; wait event counts as a time-series metric polled every 10–15 seconds, so that lock contention spikes appear on dashboards alongside — and often well ahead of — latency metrics.&lt;/p&gt;
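&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; This approach surfaces sessions queued behind &lt;code&gt;AccessExclusiveLock&lt;/code&gt; holders. DDL operations such as &lt;code&gt;TRUNCATE&lt;/code&gt;, &lt;code&gt;VACUUM FULL&lt;/code&gt;, and many schema migrations take that lock and block all concurrent readers without generating any CPU activity on the database host. In &lt;code&gt;pg_stat_activity&lt;/code&gt; the waiters appear as &lt;code&gt;wait_event_type = &apos;Lock&apos;&lt;/code&gt; with a &lt;code&gt;wait_event&lt;/code&gt; such as &lt;code&gt;relation&lt;/code&gt;; the lock mode itself is recorded in &lt;code&gt;pg_locks.mode&lt;/code&gt;.&lt;/p&gt;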
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; PostgreSQL lock waits are invisible to infrastructure monitoring. The only signal is in the engine itself: &lt;code&gt;wait_event_type = &apos;Lock&apos;&lt;/code&gt; in &lt;code&gt;pg_stat_activity&lt;/code&gt; is the diagnostic that turns a “CPU looks fine, why is the app slow?” incident into a sub-minute diagnosis.&lt;/p&gt;
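&lt;p&gt;To answer “who is blocking whom” from the same view, a minimal sketch using the built-in &lt;code&gt;pg_blocking_pids()&lt;/code&gt; function (PostgreSQL 9.6 or later) follows. The column aliases and the 80-character query preview are illustrative choices, not a convention from the PostgreSQL documentation.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- One row per blocked session, with the PIDs of the sessions holding it up.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    a.pid                   AS blocked_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    a.wait_event            AS lock_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    pg_blocking_pids(a.pid) AS blocking_pids,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    now() - a.query_start   AS query_age,  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- time since the statement started, not pure wait time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    left(a.query, 80)       AS blocked_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity AS a&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE a.wait_event_type = &apos;Lock&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY a.query_start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The PostgreSQL documentation notes that &lt;code&gt;pg_blocking_pids()&lt;/code&gt; briefly needs exclusive access to the lock manager’s shared state, so this query is probably better suited to an on-demand panel or a slow polling interval than to a tight scrape loop.&lt;/p&gt;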
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;Relying entirely on custom engine metrics introduces its own set of tradeoffs:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;Advantage&lt;/th&gt;&lt;th&gt;Disadvantage&lt;/th&gt;&lt;th&gt;Failure Mode&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;High-Frequency Polling&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Catches micro-spikes in locks and connection exhaustion.&lt;/td&gt;&lt;td&gt;Puts continuous load on the database just to monitor it.&lt;/td&gt;&lt;td&gt;The monitoring query itself times out when the database is fully saturated.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Log-Based Telemetry&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Zero additional query load; captures exact slow queries.&lt;/td&gt;&lt;td&gt;High ingestion costs and parsing delays.&lt;/td&gt;&lt;td&gt;Log volumes spike during an incident, delaying the very telemetry needed to diagnose it.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Cloud Provider Insights (e.g., AWS Performance Insights)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Managed, low-overhead, deep integration with the hypervisor.&lt;/td&gt;&lt;td&gt;Locked into the vendor’s UI; harder to expose to internal AI agents.&lt;/td&gt;&lt;td&gt;The data cannot be easily correlated with external application traces.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Default cloud dashboards report CPU and memory, lagging indicators that fire after the database is already broken, not before. Lock-induced latency produces zero CPU signal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Add a “What is Waiting?” panel tracking &lt;code&gt;pg_stat_activity&lt;/code&gt; wait event counts, active lock counts, connection pool saturation, and replication byte lag as continuously scraped time-series metrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; A staging game day that artificially locks a row should fire an alert within 60 seconds based on wait events — if it doesn’t, the telemetry foundation is incomplete and the next production incident will look exactly like the current one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Deploy a PostgreSQL exporter polling &lt;code&gt;pg_stat_activity&lt;/code&gt; every 15 seconds and add a dashboard panel for &lt;code&gt;Lock&lt;/code&gt; wait event counts this week.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category><category>checklist</category></item><item><title>pgvector Basics: Embeddings Inside PostgreSQL</title><link>https://rajivonai.com/blog/2024-06-03-pgvector-basics-embeddings-inside-postgresql/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-06-03-pgvector-basics-embeddings-inside-postgresql/</guid><description>How pgvector adds vector storage and similarity search to PostgreSQL, what the three distance operators do, and the index you must create before you hit 100K rows.</description><pubDate>Mon, 03 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;pgvector lets you store and query embeddings directly in PostgreSQL — no separate vector database required. The extension is straightforward to install and the SQL surface is small. What catches engineers is that PostgreSQL will silently fall back to a full sequential scan if you never create a vector index, and at 10K rows that’s fine, but at 1M rows it’s unusable.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Embedding-based search has moved from ML research into standard backend work. Any feature that does semantic search, recommendations, or RAG retrieval needs to store embedding vectors and query them by similarity. The default answer for the past few years was to reach for a dedicated vector database — Pinecone, Weaviate, Qdrant. That’s still reasonable for pure vector workloads at scale. But for teams already running PostgreSQL, adding a second operational system for vectors means new infrastructure, new credentials, a second backup strategy, and cross-system consistency problems when the embedding and the source document live in different stores.&lt;/p&gt;
&lt;p&gt;pgvector, a PostgreSQL extension maintained on GitHub at &lt;code&gt;pgvector/pgvector&lt;/code&gt;, adds a native &lt;code&gt;vector&lt;/code&gt; column type and two approximate nearest-neighbor index types (IVFFlat and HNSW) to an existing Postgres instance. If your application already runs on PostgreSQL and your vector search latency requirements are in the tens-of-milliseconds range rather than single-digit milliseconds, pgvector lets you keep vectors and metadata in the same rows, under the same ACID guarantees, queried with the same SQL you already write.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Engineers discover pgvector, install it in an afternoon, add a &lt;code&gt;vector(1536)&lt;/code&gt; column to an existing table, and populate it with OpenAI embeddings using &lt;code&gt;text-embedding-ada-002&lt;/code&gt;. The first few similarity queries are fast. They ship the feature. Six months later, the table has grown to several hundred thousand rows and those queries are timing out.&lt;/p&gt;
&lt;p&gt;The root cause is almost always the same: no index was created on the vector column. PostgreSQL’s query planner has no way to prune a vector search geometrically without an index, so it scans every row and computes the distance to the query vector one row at a time. At 10K rows a sequential scan takes milliseconds. At 1M rows it takes seconds. The extension documentation on the pgvector GitHub README is explicit about this — approximate nearest-neighbor indexes are required for large datasets — but the requirement is easy to miss when the extension works so well at small scale.&lt;/p&gt;
&lt;p&gt;The core question this post answers: what do you need to set up correctly on day one so that pgvector stays fast as data grows?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  App[Application] --&gt; Query[SQL Query with Embedding]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  Query --&gt; PG[PostgreSQL — pgvector extension]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  PG --&gt; Planner[Query Planner]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  Planner --&gt; CheckIndex{Vector Index Exists}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CheckIndex --&gt;|No| SeqScan[Sequential Scan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SeqScan --&gt; ComputeAll[Compute Distance for Every Row]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CheckIndex --&gt;|Yes| IndexScan[HNSW or IVFFlat Index Scan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  IndexScan --&gt; ComputeApprox[Approximate Nearest Neighbor Search]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ComputeAll --&gt; Results[Return Top K Results]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ComputeApprox --&gt; Results&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Installation.&lt;/strong&gt; pgvector ships as a standard PostgreSQL extension. On most managed cloud databases (Amazon RDS, Google Cloud SQL, Supabase, Neon) it’s already available. On a self-managed Postgres instance, install from the pgvector GitHub repository or via your distro’s package manager, then run:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; EXTENSION &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; vector;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s the full installation step. No daemon, no separate service.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column type and table shape.&lt;/strong&gt; pgvector adds a &lt;code&gt;vector(n)&lt;/code&gt; column type where &lt;code&gt;n&lt;/code&gt; is the number of dimensions. OpenAI’s &lt;code&gt;text-embedding-ada-002&lt;/code&gt; model produces 1536-dimensional vectors; &lt;code&gt;text-embedding-3-small&lt;/code&gt; and &lt;code&gt;text-embedding-3-large&lt;/code&gt; default to 1536 and 3072 dimensions respectively and can be shortened at generation time with the API’s &lt;code&gt;dimensions&lt;/code&gt; parameter. A minimal embeddings table looks like:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; documents&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  id       &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;bigserial&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; PRIMARY KEY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  content  &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;text&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;       NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  embedding vector(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1536&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inserting a row with an embedding means passing the vector as a string literal or using a client library that serializes it for you:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; documents (content, embedding)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;The query planner chooses scan strategies based on statistics.&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;[0.021, -0.008, 0.034, ...]&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The three distance operators.&lt;/strong&gt; pgvector exposes three similarity operators, each suited to different use cases:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Operator&lt;/th&gt;&lt;th&gt;Name&lt;/th&gt;&lt;th&gt;When to use&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;&amp;#x3C;-&gt;&lt;/code&gt;&lt;/td&gt;&lt;td&gt;L2 (Euclidean) distance&lt;/td&gt;&lt;td&gt;General-purpose; works on raw or normalized vectors&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;&amp;#x3C;=&gt;&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Cosine distance&lt;/td&gt;&lt;td&gt;Text embeddings; robust to vectors of different magnitudes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;&amp;#x3C;#&gt;&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Negative inner product&lt;/td&gt;&lt;td&gt;Normalized vectors only; fastest to compute&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;A cosine similarity query — “return the 5 documents most semantically similar to this query embedding” — looks like:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id, content, embedding &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;=&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;[0.021, -0.008, 0.034, ...]&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; distance&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; documents&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; distance&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For text embeddings, &lt;code&gt;&amp;#x3C;=&gt;&lt;/code&gt; (cosine) is the safe default. It is magnitude-insensitive, which matters because embedding models do not guarantee that all vectors will have the same norm.&lt;/p&gt;
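&lt;p&gt;The following sketch puts the three operators side by side and converts two of them into the more familiar similarity scores. It assumes the &lt;code&gt;documents&lt;/code&gt; table above and, purely for illustration, uses the row with &lt;code&gt;id = 1&lt;/code&gt; as the query vector so that no 1536-value literal is needed.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical setup: the row with id = 1 stands in for the query embedding.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WITH q AS (SELECT embedding AS v FROM documents WHERE id = 1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    d.id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    d.embedding &amp;#x3C;-&gt; q.v        AS l2_distance,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    1 - (d.embedding &amp;#x3C;=&gt; q.v)  AS cosine_similarity,  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- the operator returns distance, so subtract from 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    (d.embedding &amp;#x3C;#&gt; q.v) * -1 AS inner_product       &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- the operator returns the negative inner product&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM documents AS d, q&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE d.id &amp;#x3C;&gt; 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY d.embedding &amp;#x3C;=&gt; q.v&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 5;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This form is meant for inspecting what each operator returns. For index-backed search, the simpler &lt;code&gt;ORDER BY&lt;/code&gt; plus &lt;code&gt;LIMIT&lt;/code&gt; shape shown earlier is the safer one to rely on.&lt;/p&gt;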
&lt;p&gt;&lt;strong&gt;Index types.&lt;/strong&gt; Without an index, every query above is a full sequential scan. pgvector supports two approximate nearest-neighbor index types:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Index&lt;/th&gt;&lt;th&gt;Build cost&lt;/th&gt;&lt;th&gt;Query recall&lt;/th&gt;&lt;th&gt;Memory use&lt;/th&gt;&lt;th&gt;Good for&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;IVFFlat&lt;/td&gt;&lt;td&gt;Lower&lt;/td&gt;&lt;td&gt;Tunable (&lt;code&gt;lists&lt;/code&gt; at build time, &lt;code&gt;ivfflat.probes&lt;/code&gt; at query time)&lt;/td&gt;&lt;td&gt;Lower&lt;/td&gt;&lt;td&gt;Datasets that change infrequently; faster to build&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;HNSW&lt;/td&gt;&lt;td&gt;Higher&lt;/td&gt;&lt;td&gt;Higher by default&lt;/td&gt;&lt;td&gt;Higher&lt;/td&gt;&lt;td&gt;Datasets that are queried heavily; better recall at the same speed&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
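&lt;p&gt;For an initial deployment, IVFFlat is cheaper to build, but it should be created after the table already holds representative data, because the cluster centroids are computed from the rows present at build time. The &lt;code&gt;lists&lt;/code&gt; parameter divides the vector space into clusters; the pgvector README suggests starting at &lt;code&gt;rows / 1000&lt;/code&gt; for up to 1M rows and &lt;code&gt;sqrt(rows)&lt;/code&gt; above that. A minimal IVFFlat index on cosine distance:&lt;/p&gt;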
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; documents &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;USING&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ivfflat (embedding vector_cosine_ops)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (lists &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For HNSW:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; documents &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;USING&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; hnsw (embedding vector_cosine_ops);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For datasets below roughly 10K rows, a sequential scan will often outperform an approximate index because the index lookup overhead isn’t amortized. At 100K rows and beyond, the index becomes necessary. An HNSW index has no training step, so there is no harm in creating it early, even on an empty table; an IVFFlat index should wait until the table holds representative data.&lt;/p&gt;
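&lt;p&gt;Both index types expose tuning knobs that the minimal statements above leave at their defaults. The sketch below spells them out, assuming the &lt;code&gt;documents&lt;/code&gt; table from earlier; the values shown for &lt;code&gt;hnsw.ef_search&lt;/code&gt; and &lt;code&gt;ivfflat.probes&lt;/code&gt; are illustrative starting points rather than recommendations.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Alternative to the bare HNSW statement: the same index with build parameters written out.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- m = 16 and ef_construction = 64 are pgvector&apos;s defaults.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WITH (m = 16, ef_construction = 64);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Query-time knobs for the current session: raise either to trade speed for recall.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET hnsw.ef_search = 100;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- default is 40&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET ivfflat.probes = 10;   &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- default is 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;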
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The pgvector GitHub README documents the full operator and index syntax. The project is maintained at &lt;code&gt;pgvector/pgvector&lt;/code&gt; on GitHub and the README is the authoritative source for supported Postgres versions, operator names, and index parameter ranges.&lt;/p&gt;
&lt;p&gt;OpenAI’s embeddings API documentation specifies that &lt;code&gt;text-embedding-ada-002&lt;/code&gt; produces 1536-dimensional vectors. That dimension count is a fixed constraint — the &lt;code&gt;vector(n)&lt;/code&gt; column type enforces an exact match, and a query embedding with a different dimension count will return a PostgreSQL type error at runtime. This is a documented behavior of the pgvector type system, not an edge case.&lt;/p&gt;
&lt;p&gt;The documented behavior of PostgreSQL’s query planner is that without a vector index, the planner will perform a sequential scan and compute all distances. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on a similarity query against an unindexed column will show &lt;code&gt;Seq Scan&lt;/code&gt; in the plan. Adding an IVFFlat or HNSW index causes the planner to switch to an index scan for large enough datasets — observable directly in the &lt;code&gt;EXPLAIN&lt;/code&gt; output.&lt;/p&gt;
&lt;p&gt;A practical pattern for vector deployments is to assert index usage in CI to prevent regressions. Because the PostgreSQL planner silently falls back to a sequential scan when the vector index is invalid or dropped, automated tests that run &lt;code&gt;EXPLAIN&lt;/code&gt; against a sample dataset can confirm that the planner selects an &lt;code&gt;Index Scan&lt;/code&gt; rather than a &lt;code&gt;Seq Scan&lt;/code&gt; before code reaches production.&lt;/p&gt;
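&lt;p&gt;One way such an assertion could look, written as a plain SQL script that a CI job runs with &lt;code&gt;psql&lt;/code&gt;, is sketched below. It assumes the vector index kept PostgreSQL’s default name for an unnamed index on this column, &lt;code&gt;documents_embedding_idx&lt;/code&gt;, and that the table holds at least one row to serve as the query vector; adjust both to your schema.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DO $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DECLARE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    plan json;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    -- Penalize sequential scans so a small CI dataset still exercises the index.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    SET LOCAL enable_seqscan = off;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    EXECUTE $q$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        EXPLAIN (FORMAT JSON)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        SELECT id FROM documents&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        ORDER BY embedding &amp;#x3C;=&gt; (SELECT embedding FROM documents LIMIT 1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        LIMIT 5&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    $q$ INTO plan;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    -- Match on the index name: the inner subquery may itself use the primary-key index.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    IF plan::text NOT LIKE &apos;%documents_embedding_idx%&apos; THEN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        RAISE EXCEPTION &apos;vector index not used: %&apos;, plan;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    END IF;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;END&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$$;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the script with &lt;code&gt;psql -v ON_ERROR_STOP=1&lt;/code&gt; makes the raised exception fail the job. A test that passes a literal query embedding instead of the subquery would mirror production queries more closely.&lt;/p&gt;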
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;No index at scale&lt;/td&gt;&lt;td&gt;Similarity queries time out above ~100K rows&lt;/td&gt;&lt;td&gt;PostgreSQL falls back to a sequential scan, computing the distance from the query vector to every row&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dimension mismatch&lt;/td&gt;&lt;td&gt;Type error at query time&lt;/td&gt;&lt;td&gt;pgvector enforces exact dimension count; query embedding must match column definition&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Inner product (&lt;code&gt;&amp;#x3C;#&gt;&lt;/code&gt;) on non-normalized vectors&lt;/td&gt;&lt;td&gt;Unexpected result rankings&lt;/td&gt;&lt;td&gt;Inner product depends on magnitude as well as angle, so a vector with a large norm can rank highly even when it is semantically distant; use &lt;code&gt;&amp;#x3C;=&gt;&lt;/code&gt;, not &lt;code&gt;&amp;#x3C;#&gt;&lt;/code&gt;, unless you normalize at insertion time&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
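&lt;p&gt;The dimension-mismatch and normalization rows above can be reproduced with a few statements. This is a sketch that assumes the &lt;code&gt;documents&lt;/code&gt; table from earlier and a pgvector release that includes &lt;code&gt;l2_normalize()&lt;/code&gt; (0.7.0 or later, to the best of my knowledge); run the statements one at a time, since the first is expected to fail.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Dimension mismatch: comparing a 3-dimensional vector with a 2-dimensional one raises an error.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT &apos;[1,2,3]&apos;::vector &amp;#x3C;-&gt; &apos;[1,2]&apos;::vector;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Inspect stored norms: values far from 1 mean inner-product ranking will be skewed.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT id, vector_norm(embedding) AS norm FROM documents LIMIT 5;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Normalize stored vectors to unit length so inner product and cosine distance rank alike.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UPDATE documents SET embedding = l2_normalize(embedding);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;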
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: pgvector silently uses a sequential scan on unindexed vector columns, so similarity queries that are fast at development scale become unusable in production without a code change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Create an HNSW index on the vector column in the same migration that creates the table, or an IVFFlat index once the table holds representative data, using &lt;code&gt;vector_cosine_ops&lt;/code&gt; for text embeddings; verify with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; that the planner uses the index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on your similarity query — the plan should show &lt;code&gt;Index Scan using ... on documents&lt;/code&gt; rather than &lt;code&gt;Seq Scan&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, add the &lt;code&gt;CREATE INDEX ... USING hnsw&lt;/code&gt; statement to your schema migration for any table with a vector column, and add an &lt;code&gt;EXPLAIN&lt;/code&gt; assertion to your staging smoke test so an index regression is caught before it reaches production.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>vector-db</category><category>ai-engineering</category></item><item><title>Top GitHub Breakouts: March 2025 (Part 2)</title><link>https://rajivonai.com/blog/2024-05-23-github-stars-mar-2024/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-05-23-github-stars-mar-2024/</guid><description>Three March 2025 open-source projects that eliminate the iteration pauses engineers manually bridge — research review loops, vector index calibration, and agent provisioning YAML.</description><pubDate>Thu, 23 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The bottleneck in AI engineering has shifted from what you can build to how fast you can iterate. Three March 2025 breakouts targeted the pauses that stop that iteration: the overnight research loop that waits for a human reviewer in the morning, the vector index that must be calibrated before it can serve queries, and the agent workload that cannot run until someone authors its Kubernetes manifest.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;AI teams building and evaluating models share a common operational pattern: each iteration cycle contains at least one manual handoff that blocks the next step. Researchers run an experiment, stop to evaluate results by hand, and start the next run the next day. RAG engineers set up a FAISS index, discover the quantization codebook needs retraining when the corpus changes, and block query serving while the rebuild runs. Platform teams deploying AI agents write per-workload Kubernetes YAML, configure API gateways separately, and repeat the process for each new agent runtime.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;Manual bottleneck&lt;/th&gt;&lt;th&gt;What it costs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;System design&lt;/td&gt;&lt;td&gt;Researcher must manually score, critique, and restart experiment loops&lt;/td&gt;&lt;td&gt;Each iteration cycle requires a human present; overnight compute goes unreviewed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;FAISS and similar indexes require data-dependent codebook training before serving queries&lt;/td&gt;&lt;td&gt;Index becomes stale when corpus grows; rebuild blocks query serving for the duration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;Float32 vector storage grows linearly with corpus — 10M docs consume 31 GB RAM&lt;/td&gt;&lt;td&gt;Infrastructure cost forces engineers to cap corpus size or over-provision memory&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform engineering&lt;/td&gt;&lt;td&gt;Per-agent Kubernetes YAML must be authored before any new agent workload can be scheduled&lt;/td&gt;&lt;td&gt;4+ hours of manifest authoring, gateway configuration, and credential wiring per new agent type&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Can purpose-built tooling available today replace these four manual steps without adding new framework dependencies?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[AI iteration overhead] --&gt; B[System Design]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[Databases — Vector Storage]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; D[Platform Engineering]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; E[ARIS]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[turbovec]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; G[ClawManager]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; H[autonomous overnight research loops]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; I[zero-calibration quantized vector index]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; J[K8s-native agent provisioning control plane]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;aris--eliminating-the-manual-research-review-loop&quot;&gt;ARIS — eliminating the manual research review loop&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: ML research iteration pauses each cycle to wait for a human to score results, identify weaknesses, and restart the next run — compute sits idle overnight while the researcher sleeps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces or accelerates that task&lt;/strong&gt;: According to the project README, ARIS implements a five-stage autonomous loop — plan, draft, adversarial review, iterate, persist — using cross-model collaboration. Claude Code (or Codex CLI) executes the research while an external LLM acts as a critical reviewer. The README explains the design choice: “using the same model reviewing its own patterns creates blind spots.” A second model actively probes weaknesses the executor did not anticipate, breaking the self-play local minimum. The system is implemented as plain Markdown skill files — zero dependencies, no database, no Docker. The entire workflow state is stored in files the agent can read and write.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Install Claude Code, then clone ARIS skills&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;git&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; clone&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# In your research project directory, run the W1 workflow&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# (score paper, identify weaknesses, propose experiments)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;claude&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /review-paper&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --workflow&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; W1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Runs overnight: scores the draft, adversarial review, iterates,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# writes findings to Research Wiki — no human required until morning&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
According to the README, the W2 workflow adds experiment automation and the W3 workflow adds multi-paper synthesis. The Research Wiki is a persistent knowledge base that accumulates scored papers, ideas, and experiment results across sessions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: The README notes that decomposing ambiguous research goals produces weaker review loops — concrete research questions (“does X outperform Y on benchmark Z?”) work better than open-ended ones (“improve this paper”). The cross-model setup requires API access to at least two model providers; teams with access to only one model must use single-model mode, which the README acknowledges loses the adversarial benefit.&lt;/li&gt;
&lt;/ul&gt;
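&lt;p&gt;The five-stage loop described above can be sketched in a few lines of Python. This is an illustration only: ARIS itself is implemented as Markdown skill files, and the names &lt;code&gt;research_loop&lt;/code&gt;, &lt;code&gt;stub_executor&lt;/code&gt;, and &lt;code&gt;stub_reviewer&lt;/code&gt;, the score threshold, and the JSON file standing in for the Research Wiki are all hypothetical. The two stubs stand in for calls to two different model providers.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def research_loop(goal, executor, reviewer, wiki_file, max_rounds=3, target_score=8):
    """Plan, draft, adversarial review, iterate, persist (hypothetical sketch)."""
    plan = executor("PLAN: " + goal)
    draft = executor("DRAFT: " + plan)
    history = []
    for round_no in range(1, max_rounds + 1):
        # The reviewer is deliberately a different model from the executor.
        review = reviewer(draft)
        history.append({"round": round_no, "score": review["score"]})
        if review["score"] >= target_score:
            break
        fixes = "; ".join(review["weaknesses"])
        draft = executor("REVISE: " + draft + " | FIX: " + fixes)
    # Persist state to a plain file so a later session can pick it up.
    state = {"goal": goal, "draft": draft, "history": history}
    Path(wiki_file).write_text(json.dumps(state, indent=2))
    return draft, history


def stub_executor(prompt):
    # Stand-in for the executing model: echoes the prompt back.
    return prompt


def stub_reviewer(draft):
    # Stand-in for the reviewing model: the score rises once a revision exists.
    return {"score": 6 + 2 * draft.count("REVISE:"), "weaknesses": ["no baseline comparison"]}


with tempfile.TemporaryDirectory() as tmp:
    final_draft, rounds = research_loop(
        "does X outperform Y on benchmark Z?",
        stub_executor,
        stub_reviewer,
        Path(tmp) / "wiki.json",
    )
    print([r["score"] for r in rounds])
```

&lt;p&gt;The loop stops as soon as the reviewer's score reaches the threshold or the round budget runs out, which is one simple way to bound an unattended overnight run.&lt;/p&gt;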
&lt;h3 id=&quot;turbovec--eliminating-vector-index-calibration-and-rebuild-cycles&quot;&gt;turbovec — eliminating vector index calibration and rebuild cycles&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: FAISS and product quantization indexes require data-dependent codebook training before they can serve queries; when the corpus grows, the codebook must be retrained and the index rebuilt, blocking query serving for the rebuild duration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces or accelerates that task&lt;/strong&gt;: According to the project README, turbovec uses Google Research’s TurboQuant algorithm — a data-oblivious quantizer that “matches the Shannon lower bound on distortion with zero training and zero data passes.” The README states: “A 10 million document corpus takes 31 GB of RAM as float32. turbovec fits it in 4 GB — and searches it faster than FAISS.” Because the quantizer is data-oblivious, vectors can be added incrementally without rebuilding. The README documents that NEON (ARM) and AVX-512BW (x86) hand-written kernels beat FAISS IndexPQFastScan by 12–20% on ARM and match or beat it on x86. Filtered search (restricting results to a candidate set from SQL, BM25, or ACL) is built into the kernel directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Before: FAISS PQ index requires codebook training on a data sample&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; faiss&lt;/span&gt;&lt;/span&gt;
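&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;dim &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1536&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  # embedding dimension; must be divisible by the number of PQ sub-quantizers (8 here)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;quantizer &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; faiss.IndexFlatL2(dim)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; faiss.IndexIVFPQ(quantizer, dim, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;8&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;8&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  # nlist=100, m=8, nbits=8&lt;/span&gt;&lt;/span&gt;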
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index.train(training_vectors)   &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# blocks until training completes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index.add(vectors)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# After: turbovec — no training, incremental adds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;from&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; turbovec &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;import&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TurboQuantIndex&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TurboQuantIndex(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;dim&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1536&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;bit_width&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;4&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index.add(vectors)              &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# no training step; index is ready immediately&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index.add(more_vectors)         &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# incremental adds work without rebuilding&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;scores, indices &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; index.search(query, &lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;k&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;index.write(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;my_index.tq&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
For filtered hybrid retrieval, the README shows passing an id allowlist directly to &lt;code&gt;search()&lt;/code&gt; — the filter is applied inside the SIMD kernel rather than as a post-filter, so recall is maintained on selective filters without over-fetching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: According to the project documentation, turbovec is Python and Rust only; there are no JavaScript or Go bindings in the current release. The &lt;code&gt;bit_width=4&lt;/code&gt; default trades some recall for the memory reduction — the README documents this tradeoff but does not publish a benchmark table mapping bit widths to recall across common datasets. Teams requiring guaranteed recall thresholds should benchmark against their specific corpus before replacing FAISS in production.&lt;/li&gt;
&lt;/ul&gt;
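&lt;p&gt;The README's memory figures can be sanity-checked with simple arithmetic. The sketch below assumes 768-dimensional embeddings, which the source does not state; that value is used only because it reproduces the quoted 31 GB. It also counts decimal gigabytes and ignores ids and any per-vector metadata, so treat the output as a rough estimate rather than a measurement.&lt;/p&gt;

```python
def index_size_gb(num_vectors, dim, bits_per_component):
    """Raw payload size in decimal gigabytes, ignoring ids and per-vector metadata."""
    return num_vectors * dim * bits_per_component / 8 / 1e9


NUM_DOCS = 10_000_000
DIM = 768  # assumption: not stated in the source; chosen because it reproduces ~31 GB

float32_gb = index_size_gb(NUM_DOCS, DIM, 32)
four_bit_gb = index_size_gb(NUM_DOCS, DIM, 4)
print(f"float32: {float32_gb:.2f} GB, 4-bit: {four_bit_gb:.2f} GB")
```

&lt;p&gt;At 1536 dimensions, as in the code example above, both figures double, so the 8x ratio between float32 and 4-bit storage is the more portable number.&lt;/p&gt;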
&lt;h3 id=&quot;clawmanager--eliminating-per-agent-kubernetes-yaml-authoring&quot;&gt;ClawManager — eliminating per-agent Kubernetes YAML authoring&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The productivity problem it solves&lt;/strong&gt;: Platform teams deploying AI agents author Kubernetes manifests per workload, configure AI API gateways separately, and repeat the process for each new agent runtime — the README describes this as the “YAML sprawl” problem for agent infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How AI replaces or accelerates that task&lt;/strong&gt;: According to the project README, ClawManager is a Kubernetes-native control plane that provides a unified interface for agent instance management, AI Gateway governance, skill discovery, and multi-runtime orchestration. The README’s product demo GIF shows a new agent instance being provisioned from the web UI in under 60 seconds. The AI Gateway layer centralizes API key management and access control across all agent runtimes, eliminating per-agent gateway configuration. Skill scanning discovers and registers agent capabilities automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The workflow&lt;/strong&gt;:
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Install ClawManager into an existing K8s cluster&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;helm&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; repo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; add&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; clawmanager&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; https://yuan-lab-llm.github.io/ClawManager/charts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;helm&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; install&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; clawmanager&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; clawmanager/clawmanager&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Open the web UI — provision a new agent instance from the Agent Control Plane&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Skills are scanned and registered automatically; AI Gateway injects API access&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# No per-agent YAML authoring or gateway configuration required&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
According to the README changelog (2024-05-18), team workspace support was added with one-click team creation, shared storage, task dispatch, and Redis Team Bus injection. The changelog also documents Hermes runtime integration for Webtop-based agent provisioning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it breaks&lt;/strong&gt;: ClawManager is designed for teams already running Kubernetes; bare-metal or Docker Compose deployments are not documented. The README’s changelog shows rapid weekly releases (v0.1 through multiple patches in the first 60 days), indicating the platform is early and the API surface may shift. Teams adopting it today should expect schema and config changes between minor releases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ARIS&lt;/strong&gt;: The five-stage loop and the Research Wiki behavior are documented in the project’s &lt;code&gt;AGENT_GUIDE.md&lt;/code&gt;, and the README explains the rationale for the adversarial cross-model design. Consult the accompanying research paper (arXiv:2405.03042) for methodology claims; evidence about the quality of research the loop produces in production use is still limited.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;turbovec&lt;/strong&gt;: The “no training” property comes from the TurboQuant quantizer itself (arXiv:2404.19874), which is data-oblivious. The memory reduction (31 GB as float32 down to 4 GB for a 10M-document corpus) and the search speed comparison (12–20% faster than FAISS IndexPQFastScan on ARM) are both claims stated in the project README. Benchmark figures at other corpus scales or for specific embedding models have not been independently verified.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ClawManager&lt;/strong&gt;: The README describes an AI Gateway, agent provisioning, skill scanning, and team workspaces. The 60-second provisioning claim is illustrated by a demo GIF in the README. No independent production-scale deployment report is available; the project is pre-1.0.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;ARIS review loop produces shallow critique&lt;/td&gt;&lt;td&gt;Open-ended research goal without concrete evaluation criteria&lt;/td&gt;&lt;td&gt;Define specific benchmark tasks and success thresholds before invoking the review loop&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ARIS second model not accessible&lt;/td&gt;&lt;td&gt;Single-provider API access or rate limit hit during overnight run&lt;/td&gt;&lt;td&gt;Configure the fallback single-model mode (documented in the README); schedule runs for periods when API usage is low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;turbovec recall falls below the required threshold&lt;/td&gt;&lt;td&gt;Bit width too low for the embedding model’s effective dimensionality&lt;/td&gt;&lt;td&gt;Benchmark bit_width=4 vs bit_width=8 on your corpus before production; increase bit width if recall is below threshold&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;turbovec no Go or JavaScript bindings&lt;/td&gt;&lt;td&gt;Services written outside Python or Rust need vector search&lt;/td&gt;&lt;td&gt;Wrap turbovec search behind a thin Python REST service; use FAISS for non-Python runtimes in the interim&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ClawManager API surface changes between releases&lt;/td&gt;&lt;td&gt;Adopting ClawManager while it is pre-1.0&lt;/td&gt;&lt;td&gt;Pin to a specific release in Helm; track the changelog for breaking changes before upgrading&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ClawManager requires Kubernetes&lt;/td&gt;&lt;td&gt;Team running Docker Compose or bare-metal&lt;/td&gt;&lt;td&gt;Deploy a lightweight K3s cluster for agent infrastructure even if the rest of the stack uses Docker Compose&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
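&lt;p&gt;The turbovec recall row in the table recommends benchmarking on your own corpus. One common way to score such a comparison is recall@k against exact-search ground truth. The helper below is a library-agnostic sketch: it takes plain lists of result ids, and how those ids are produced (for example, a brute-force search versus a quantized index) is left open. The function name and the toy data are illustrative, not part of any library.&lt;/p&gt;

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Mean fraction of the exact top-k neighbours that the approximate index also returned."""
    if len(exact_ids) != len(approx_ids):
        raise ValueError("need one result list per query from both indexes")
    if not exact_ids:
        raise ValueError("need at least one query")
    total = 0.0
    for exact, approx in zip(exact_ids, approx_ids):
        hits = set(exact[:k]).intersection(approx[:k])
        total += len(hits) / k
    return total / len(exact_ids)


# Toy data: two queries, with ids as they might come back from each index.
exact = [[1, 2, 3, 4], [10, 11, 12, 13]]
approx = [[1, 2, 4, 9], [10, 11, 12, 13]]
print(recall_at_k(exact, approx, k=4))
```

&lt;p&gt;Running the same queries at each candidate bit width and comparing the resulting scores against a recall target gives a concrete basis for the choice.&lt;/p&gt;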
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: AI iteration speed is blocked at three manual handoffs — research review loops that pause overnight, vector indexes that cannot grow without a rebuild, and agent workloads that cannot be provisioned without per-workload YAML authoring.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use ARIS to run cross-model research review overnight without human intervention, turbovec to replace FAISS with a zero-calibration index that grows incrementally, and ClawManager to provision and govern agent instances from a single Kubernetes-native control plane.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After &lt;code&gt;pip install turbovec&lt;/code&gt;, replace one FAISS index with a &lt;code&gt;TurboQuantIndex&lt;/code&gt;, add the same vectors, and run the same benchmark queries. If the index builds without a training call and returns results within the expected latency range and at acceptable recall, the integration is validated.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;pip install turbovec&lt;/code&gt; and convert one existing FAISS index this week; the turbovec side of the before/after example is only a few lines and requires no corpus changes.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>ai-engineering</category><category>architecture</category><category>databases</category></item><item><title>Vectorless RAG Patterns for Database Knowledge Systems</title><link>https://rajivonai.com/blog/2024-05-16-vectorless-rag-patterns-for-database-knowledge-systems/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-05-16-vectorless-rag-patterns-for-database-knowledge-systems/</guid><description>How tree-based retrieval can improve DB runbooks, schema docs, and incident knowledge over chunked vector search.</description><pubDate>Thu, 16 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;RAG (Retrieval-Augmented Generation) is the default pattern for giving AI assistants context, but chunking structured operational documentation into 300-token vectors destroys the sequence of runbooks precisely when you need them most.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Engineering teams are increasingly feeding their incident response channels and database documentation into vector databases to build automated on-call assistants. The goal is to surface the right mitigation command at 2:13 a.m. when replica lag climbs or autovacuum gets blocked, without manually paging through Git repositories or wiki pages.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The default chunked vector search implementation fails catastrophically for procedural database runbooks. It splits documents into arbitrary token-sized pieces, embeds each piece as a vector, and retrieves chunks by vocabulary similarity.&lt;/p&gt;
&lt;p&gt;A PostgreSQL schema migration runbook contains a precheck, the DDL command, a validation query, and a rollback step. Vector chunking breaks this structure apart. Similarity scoring finds the chunk with the best vocabulary match for “migration,” which might return the validation query without the precheck that must run before it or the rollback step that undoes it. How do we retrieve operational knowledge while preserving the exact order of execution?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;Vectorless RAG bypasses embedding models for structured documentation by using &lt;strong&gt;section tree retrieval&lt;/strong&gt;. Instead of slicing text into chunks and measuring cosine similarity, documents are stored as a structured JSON tree keyed by document path. Retrieval happens via path prefixes rather than semantic approximation, guaranteeing that the precheck, command, validation, and rollback remain attached and in sequence.&lt;/p&gt;
&lt;h2 id=&quot;section-tree-retrieval-architecture&quot;&gt;Section Tree Retrieval Architecture&lt;/h2&gt;
&lt;p&gt;To build this, store your operational docs as a structured JSON tree in PostgreSQL using JSONB, keeping a vector store only for messy operational memory like Slack exports.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Convert one critical runbook into a section tree.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The tree builder parses your Markdown headings into a nested JSON structure where each node has a &lt;code&gt;path&lt;/code&gt; (array of heading titles from root to section), a &lt;code&gt;summary&lt;/code&gt;, and the section &lt;code&gt;body&lt;/code&gt;. No embeddings — just structure.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;python&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; scripts/build_doc_tree.py&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --input&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; docs/postgres/replication-lag.md&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --doc-id&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; postgres-replication-lag&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --output&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; build/postgres-replication-lag.json&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Confirm with:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;jq&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;.doc_id, .children[0].path, .children[0].summary&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; build/postgres-replication-lag.json&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
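&lt;p&gt;The post does not show &lt;code&gt;scripts/build_doc_tree.py&lt;/code&gt;. The following is a minimal sketch of what such a builder could do, assuming Markdown headings written with leading &lt;code&gt;#&lt;/code&gt; characters. The function names, the node layout, and the "first line of the body" summary heuristic are assumptions made for illustration; a real builder might generate or hand-write summaries instead.&lt;/p&gt;

```python
import json
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this listing readable


def build_doc_tree(doc_id, markdown):
    """Parse Markdown headings into a nested tree; each node carries path, summary, body."""
    root = {"doc_id": doc_id, "path": [], "level": 0, "body": [], "children": []}
    stack = [root]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside code fences are never headings (shell comments start with "#").
        match = None if in_fence else HEADING.match(line)
        if match is None:
            stack[-1]["body"].append(line)
            continue
        level = len(match.group(1))
        while stack[-1]["level"] >= level:
            stack.pop()
        parent = stack[-1]
        node = {"path": parent["path"] + [match.group(2)], "level": level, "body": [], "children": []}
        parent["children"].append(node)
        stack.append(node)
    return _finalize(root)


def _finalize(node):
    """Join body lines, derive a placeholder summary, and drop parser bookkeeping."""
    body = "\n".join(node.pop("body")).strip()
    node.pop("level")
    node["summary"] = body.splitlines()[0] if body else ""
    node["body"] = body
    node["children"] = [_finalize(child) for child in node["children"]]
    return node


sample = "# Postgres\n## Replication\n### Lag\nCheck replay lag before acting.\n"
print(json.dumps(build_doc_tree("postgres-replication-lag", sample), indent=2))
```

&lt;p&gt;Because each node keeps its full &lt;code&gt;path&lt;/code&gt; from the root, a section can later be fetched together with its siblings without any embedding step.&lt;/p&gt;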
&lt;p&gt;&lt;strong&gt;Step 2: Store the tree in Postgres JSONB with path-aware lookup.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Each row is one document section. The &lt;code&gt;path&lt;/code&gt; column is an array (&lt;code&gt;ARRAY[&apos;Postgres&apos;,&apos;Replication&apos;,&apos;Lag&apos;]&lt;/code&gt;) so you can query by prefix — “give me all Replication sections” — without scanning the full document body.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; doc_index&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  doc_id        &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;text&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  path&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;          text&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;[]  &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  title         &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;text&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  summary       &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;text&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  body          &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;text&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  owner&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;         text&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_verified &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  node          jsonb   &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  PRIMARY KEY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (doc_id, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;path&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; doc_index_path_gin&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; doc_index &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;USING&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; gin (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;path&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; doc_index_node_gin&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; doc_index &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;USING&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; gin (node jsonb_path_ops);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
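&lt;p&gt;Step 2 describes querying by path prefix. In PostgreSQL, the array containment operator &lt;code&gt;@&gt;&lt;/code&gt; checks only that every listed element appears somewhere in the array, regardless of position or order, so it is looser than a strict prefix match. The sketch below illustrates the difference in plain Python; the helper names and sample paths are made up for the example.&lt;/p&gt;

```python
def contains_all(path, wanted):
    """Mirrors array containment: every wanted element appears somewhere in path."""
    return set(wanted).issubset(path)


def has_prefix(path, prefix):
    """Strict prefix: path starts with exactly these elements, in this order."""
    return path[: len(prefix)] == prefix


lag = ["Postgres", "Replication", "Lag"]
odd = ["MySQL", "Replication", "Migrating to Postgres", "Postgres"]
wanted = ["Postgres", "Replication"]

print(contains_all(lag, wanted), has_prefix(lag, wanted))
print(contains_all(odd, wanted), has_prefix(odd, wanted))
```

&lt;p&gt;If strict prefixes matter, one option in SQL is to add a slice comparison such as &lt;code&gt;path[1:2] = ARRAY[&apos;Postgres&apos;,&apos;Replication&apos;]&lt;/code&gt; alongside the containment test, so that the GIN index can still narrow the candidates first.&lt;/p&gt;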
&lt;p&gt;&lt;strong&gt;Step 3: Load sections without flattening the procedure.&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;python&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; scripts/load_doc_tree_pg.py&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --file&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; build/postgres-replication-lag.json&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --dsn&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$DOC_INDEX_DSN&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
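&lt;p&gt;The loader script is likewise not shown in the post. As a rough sketch, the core of such a loader could be a depth-first walk that turns the tree into one tuple per section, in document order, matching the non-nullable columns of &lt;code&gt;doc_index&lt;/code&gt;. The function name and the sample tree are hypothetical, and the JSON shape (a &lt;code&gt;doc_id&lt;/code&gt; plus nested &lt;code&gt;children&lt;/code&gt; carrying &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;summary&lt;/code&gt;, and &lt;code&gt;body&lt;/code&gt;) is assumed from the description in Step 1.&lt;/p&gt;

```python
import json


def tree_to_rows(tree):
    """Flatten a section tree into doc_index-shaped tuples, depth-first, in document order."""
    doc_id = tree["doc_id"]
    rows = []

    def walk(node):
        for child in node.get("children", []):
            rows.append((
                doc_id,
                child["path"],
                child["path"][-1],         # title: last element of the path
                child.get("summary", ""),
                child.get("body", ""),
                json.dumps(child),         # full subtree for the jsonb column
            ))
            walk(child)

    walk(tree)
    return rows


sample_tree = {
    "doc_id": "postgres-replication-lag",
    "children": [
        {
            "path": ["Replication Lag"],
            "summary": "Runbook for replica lag",
            "body": "",
            "children": [
                {"path": ["Replication Lag", "Precheck"], "summary": "Measure lag", "body": "...", "children": []},
                {"path": ["Replication Lag", "Mitigation"], "summary": "Reduce load", "body": "...", "children": []},
                {"path": ["Replication Lag", "Rollback"], "summary": "Undo changes", "body": "...", "children": []},
            ],
        }
    ],
}

for row in tree_to_rows(sample_tree):
    print(" / ".join(row[1]))
```

&lt;p&gt;The resulting list could then be handed to a database driver's bulk insert. Note that the table as defined has no ordinal column, so if execution order must survive a round trip through SQL, storing each row's list position is one way to keep it.&lt;/p&gt;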
&lt;p&gt;&lt;strong&gt;Step 4: Route structured questions to tree retrieval first.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At query time, match document class before calling an LLM. Runbooks and schema docs route to the &lt;code&gt;doc_index&lt;/code&gt; table. Incident postmortems route to the vector store.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; path&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, title, summary, body&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; doc_index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; doc_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;postgres-replication-lag&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    summary ILIKE &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%schema migration%&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    OR&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; body   ILIKE &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%replica lag%&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    OR&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; path&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; @&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ARRAY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;[&apos;Postgres&apos;,&apos;Replication&apos;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  )&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; array_length(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;path&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 5: Keep vector search for messy incident memory.&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;python&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; scripts/embed_incidents.py&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --source&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; s3://db-knowledge/incidents/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --collection&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; db_incidents&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --vector-store&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; qdrant&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    DBA[DBA question] --&gt; Router[retrieval router]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Router --&gt;|structured runbook| PostgresJSONB[doc_index in Postgres JSONB]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Router --&gt;|unstructured tickets| Qdrant[Qdrant — incidents collection]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    PostgresJSONB --&gt; TreePath[section path — parent summaries — body]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Qdrant --&gt; VectorHits[top-k incident snippets]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    TreePath --&gt; LLM[LLM answer composer]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    VectorHits --&gt; LLM&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    LLM --&gt; Answer[answer with exact citation]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Answer --&gt; DBA&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The router decision is intentionally boring: classify the document type first, then retrieve. Boring routing wakes you up less often.&lt;/p&gt;
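&lt;p&gt;A minimal sketch of that routing rule, assuming each document already carries a class label. The class names, the phrase list, and the &lt;code&gt;route&lt;/code&gt; function are hypothetical and only illustrate the idea of classifying before retrieving; they are not taken from a real codebase.&lt;/p&gt;

```python
from enum import Enum


class Route(Enum):
    TREE = "doc_index"        # structured runbooks and schema docs
    VECTOR = "db_incidents"   # messy incident memory


# Hypothetical document classes; the names are illustrative only.
TREE_CLASSES = {"runbook", "schema_doc"}
VECTOR_CLASSES = {"postmortem", "ticket", "slack_export"}

# Explicit phrases that force the tree route even when the question mentions an incident.
PROCEDURE_PHRASES = ("rollback procedure", "official procedure", "runbook")


def route(doc_class: str, question: str) -> Route:
    """Pick a retrieval backend from explicit rules, before any LLM call."""
    lowered = question.lower()
    if any(phrase in lowered for phrase in PROCEDURE_PHRASES):
        return Route.TREE
    if doc_class in TREE_CLASSES:
        return Route.TREE
    if doc_class in VECTOR_CLASSES:
        return Route.VECTOR
    # Unknown classes fall back to vector search rather than guessing a section path.
    return Route.VECTOR
```

&lt;p&gt;The rules are plain set lookups on purpose: when a route is wrong, the fix is a one-line change to a set or a phrase list rather than a retraining job.&lt;/p&gt;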
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern across operational knowledge systems is to bound retrieval by how database engines actually execute commands. PostgreSQL locking is a good example: many schema changes take an &lt;code&gt;AccessExclusiveLock&lt;/code&gt; that queues all subsequent reads and writes on the table, which often surfaces as connection exhaustion or replication lag. When a standard chunked RAG system gets a question about this lock state, it can stitch together a &lt;code&gt;pg_stat_activity&lt;/code&gt; query from a minor version upgrade document with a generic &lt;code&gt;pg_cancel_backend&lt;/code&gt; snippet. That disjointed context encourages operators to kill processes without first verifying the blocker. A section tree instead returns the entire operational branch as one unit: the specific diagnostic query, the targeted termination command, and the required rollback sequence.&lt;/p&gt;
&lt;p&gt;This structural alignment changes how retrieval behaves during incidents:&lt;/p&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Chunked vector search&lt;/th&gt;&lt;th&gt;Section tree retrieval&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Runbook answer citation&lt;/td&gt;&lt;td&gt;Chunk ID + similarity score&lt;/td&gt;&lt;td&gt;Exact section path&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Migration rollback retrieval&lt;/td&gt;&lt;td&gt;Often split across 2–4 chunks&lt;/td&gt;&lt;td&gt;Full prerequisite, command, validation, rollback in one section&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Embedding model change&lt;/td&gt;&lt;td&gt;Re-embed runbooks, tickets, postmortems&lt;/td&gt;&lt;td&gt;Re-embed tickets only; tree index unchanged&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Incident query behavior&lt;/td&gt;&lt;td&gt;Finds similar language&lt;/td&gt;&lt;td&gt;Follows operational structure first&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The architectural split between structured and unstructured data typically looks like this:&lt;/p&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Corpus&lt;/th&gt;&lt;th&gt;Best retrieval pattern&lt;/th&gt;&lt;th&gt;Reason&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL failover runbook&lt;/td&gt;&lt;td&gt;Section tree&lt;/td&gt;&lt;td&gt;Procedure order and rollback must stay together&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Snowflake warehouse guide&lt;/td&gt;&lt;td&gt;Section tree&lt;/td&gt;&lt;td&gt;Sections map to operational decisions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Prior SEV2 postmortems&lt;/td&gt;&lt;td&gt;Vector search&lt;/td&gt;&lt;td&gt;Language and structure vary across incidents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slack incident channel export&lt;/td&gt;&lt;td&gt;Vector search&lt;/td&gt;&lt;td&gt;Messy, duplicated, high volume&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Schema ownership docs&lt;/td&gt;&lt;td&gt;Section tree&lt;/td&gt;&lt;td&gt;Paths and citations matter&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow query examples&lt;/td&gt;&lt;td&gt;Hybrid&lt;/td&gt;&lt;td&gt;Similar query shape + exact remediation docs&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Bad tree structure&lt;/td&gt;&lt;td&gt;Markdown headings are inconsistent or PDF parsing invents sections&lt;/td&gt;&lt;td&gt;Normalize docs to Markdown before building the tree; reject trees with missing &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;summary&lt;/code&gt;, or &lt;code&gt;last_verified&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Wrong retrieval route&lt;/td&gt;&lt;td&gt;Query says “incident” but asks for the official rollback procedure&lt;/td&gt;&lt;td&gt;Add explicit document-class rules before any semantic routing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Stale runbook answer&lt;/td&gt;&lt;td&gt;Section exists but has not been tested since PostgreSQL 14&lt;/td&gt;&lt;td&gt;Require &lt;code&gt;last_verified&lt;/code&gt;; suppress sections older than the last engine upgrade&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;JSONB table abuse&lt;/td&gt;&lt;td&gt;Teams start dumping every Slack export as a tree&lt;/td&gt;&lt;td&gt;Enforce: high-volume, messy text stays in the vector store&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;LLM over-summarizes commands&lt;/td&gt;&lt;td&gt;Retrieved section has multiple guarded branches&lt;/td&gt;&lt;td&gt;Return command blocks verbatim; make the model cite the section path, not paraphrase it&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
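&lt;p&gt;The first and third fixes in the table can be sketched as a small validation pass. This assumes each section node is a dict whose &lt;code&gt;last_verified&lt;/code&gt; value is an ISO date string; that format and the function names are assumptions made for illustration.&lt;/p&gt;

```python
from datetime import date

REQUIRED_FIELDS = ("path", "summary", "last_verified")


def missing_fields(node: dict) -> list:
    """Return the required fields that are absent or empty on a section node."""
    return [name for name in REQUIRED_FIELDS if not node.get(name)]


def is_stale(node: dict, last_engine_upgrade: date) -> bool:
    """A section is stale when it was last verified before the last engine upgrade."""
    verified = date.fromisoformat(node["last_verified"])  # assumed ISO format
    return last_engine_upgrade > verified


def usable_sections(nodes: list, last_engine_upgrade: date) -> list:
    """Keep only complete sections verified on or after the last engine upgrade."""
    return [
        node for node in nodes
        if not missing_fields(node) and not is_stale(node, last_engine_upgrade)
    ]
```

&lt;p&gt;Running the completeness check at load time and the staleness check at query time keeps the two concerns separate: an incomplete tree never enters &lt;code&gt;doc_index&lt;/code&gt;, while a stale section stays stored but is not served.&lt;/p&gt;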
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Chunked vector search destroys the procedural sequence of database runbooks, leading to dangerous out-of-order execution during incidents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Implement section tree retrieval using PostgreSQL JSONB to store and query operational documentation by hierarchical paths instead of token embeddings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Extracting a full node path guarantees that prerequisites, commands, and rollbacks are returned as cohesive units, respecting the database’s locking behaviors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Convert one critical PostgreSQL failover runbook into a JSON tree in &lt;code&gt;doc_index&lt;/code&gt;, and test 20 questions from recent incidents against both the tree index and the legacy vector store to compare citation accuracy.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>vector-db</category><category>ai-engineering</category></item><item><title>Redis Licensing and Valkey: What Engineers Should Know</title><link>https://rajivonai.com/blog/2024-05-13-redis-licensing-valkey-what-engineers-should-know/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-05-13-redis-licensing-valkey-what-engineers-should-know/</guid><description>In March 2024, Redis Ltd changed Redis 7.4+ to a non-OSS license. Here is what that actually means for your deployment — and what Valkey is.</description><pubDate>Mon, 13 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The Redis license change affects far fewer engineers than the headlines implied — but the engineers it does affect have real decisions to make.&lt;/strong&gt; In March 2024, Redis Ltd relicensed Redis 7.4 and later versions from BSD to a dual SSPL/RSALv2 license. The Linux Foundation forked Redis 7.2.4 — the last BSD-licensed version — into a project called Valkey. Understanding which of these events actually applies to your situation determines what, if anything, you need to do.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Redis is one of the most widely deployed in-memory data stores in the industry. It runs as a cache, a session store, a message queue, a rate limiter, and more. For most application developers, Redis is a network dependency: you point a client library at a host and port, and it works.&lt;/p&gt;
&lt;p&gt;That familiarity is also why the licensing announcement in March 2024 generated so much noise. Engineers who had never thought about Redis licensing suddenly had to decide whether to care. Most of them do not need to. But the engineers who do — platform teams managing self-hosted Redis, teams using managed services, and teams building products that bundle Redis — need a clear picture before their next infrastructure review.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The license change created a widely-shared misconception: that all Redis users are now on proprietary software and must act immediately. That is not accurate, and acting on it without understanding the scope leads to unnecessary migration work or, worse, ignored risk where it actually exists.&lt;/p&gt;
&lt;p&gt;The SSPL (Server Side Public License) is a copyleft license written by MongoDB. Its key clause applies when you offer the software itself as a service to third parties: in that case you must either release the source of the stack that runs the service under the SSPL or obtain a commercial license. The RSALv2 (Redis Source Available License v2) prohibits commercializing Redis or providing it to others as a managed service. Neither license affects a team using Redis as an internal application dependency.&lt;/p&gt;
&lt;p&gt;The concrete failure mode is a platform team that does not audit its Redis version, does not track the managed service provider’s roadmap, and then discovers that their AWS ElastiCache clusters are already running Valkey, or that a Redis module they depend on (RediSearch, RedisJSON) has incomplete Valkey compatibility.&lt;/p&gt;
&lt;p&gt;The decision this forces: what is your organization’s relationship to Redis — user, operator, or distributor?&lt;/p&gt;
&lt;h2 id=&quot;what-the-license-change-actually-changes-by-role&quot;&gt;What the License Change Actually Changes by Role&lt;/h2&gt;
&lt;p&gt;The answer depends entirely on how your organization uses Redis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Application developers using Redis as a cache or queue&lt;/strong&gt; are not affected. Your application connects to Redis over the network — you are not distributing it. Existing deployments continue to work. Redis 6.x and 7.2.x remain under BSD license.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Platform teams running self-managed Redis&lt;/strong&gt; need to make a decision, but not immediately. Redis 7.2.4 and earlier are BSD-licensed. Options: stay on 7.2.x (accepting that it will eventually fall behind on security fixes), migrate to Valkey 7.2 or 8.x, or move to a managed service. The first Valkey 7.2 release shipped under the Linux Foundation in April 2024 with backing from AWS, Google, Oracle, and Ericsson. It maintains protocol and API compatibility with Redis 7.2, so most Redis client libraries need no changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Teams on AWS ElastiCache or GCP Memorystore&lt;/strong&gt; should check their provider’s roadmap. AWS made ElastiCache for Valkey generally available in October 2024; new clusters default to Valkey. GCP Memorystore offers both modes. Staying on the default may mean you are already running Valkey without having made an explicit decision.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Teams building a product that includes Redis&lt;/strong&gt; are in scope for the SSPL. If you expose Redis to external users as part of a service, get a legal opinion before your next release.&lt;/p&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;License risk&lt;/th&gt;&lt;th&gt;Action&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;App developer using Redis as a dependency&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform team — self-managed Redis 7.2.4 or earlier&lt;/td&gt;&lt;td&gt;None immediately&lt;/td&gt;&lt;td&gt;Plan migration timeline&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform team — self-managed Redis 7.4+&lt;/td&gt;&lt;td&gt;SSPL applies if distributing&lt;/td&gt;&lt;td&gt;Evaluate Valkey or commercial license&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AWS ElastiCache or GCP Memorystore user&lt;/td&gt;&lt;td&gt;Provider-managed&lt;/td&gt;&lt;td&gt;Check current cluster engine version&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Product builder distributing Redis&lt;/td&gt;&lt;td&gt;SSPL applies&lt;/td&gt;&lt;td&gt;Legal review required&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Redis Ltd announced the license change on March 20, 2024. The Linux Foundation announced the Valkey fork eight days later, on March 28, based on Redis 7.2.4. The Valkey repository is at github.com/valkey-io/valkey.&lt;/p&gt;
&lt;p&gt;Valkey 8.0 reached general availability in September 2024, adding features beyond the Redis 7.2 baseline. AWS made Amazon ElastiCache for Valkey generally available in October 2024, confirming that Valkey 7.2 is API- and protocol-compatible with Redis 7.2 and that existing applications required no code changes to switch.&lt;/p&gt;
&lt;p&gt;The documented pattern from this event: a fork with institutional backing can reach production stability quickly when it starts from a well-tested codebase. The Redis-to-Valkey path is cleaner than many license-driven forks because Valkey explicitly maintains the Redis Serialization Protocol (RESP) and the standard Redis command set.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;SSPL applicability confusion&lt;/td&gt;&lt;td&gt;Engineers treat SSPL as affecting all Redis users and trigger unnecessary migration projects&lt;/td&gt;&lt;td&gt;The SSPL copyleft clause is narrow: it targets service providers, not application users&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Redis module dependency&lt;/td&gt;&lt;td&gt;Teams using RediSearch, RedisJSON, or RedisTimeSeries migrate to Valkey and find incomplete or missing module support&lt;/td&gt;&lt;td&gt;Valkey compatibility with Redis modules varies; some modules are Redis Ltd proprietary and have no Valkey equivalent&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Valkey feature divergence over time&lt;/td&gt;&lt;td&gt;Applications assume long-term Redis and Valkey compatibility, but the projects diverge on new features&lt;/td&gt;&lt;td&gt;Current divergence is minimal; future compatibility depends on both projects’ roadmaps and is unknown&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Platform teams that have not audited their Redis deployments since March 2024 may be running Redis 7.4+ in a distribution context without meeting its license terms, or may be unaware that their managed service has already migrated to Valkey.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Audit your Redis deployment: check the exact version in each environment, identify whether you are distributing Redis to external users, and confirm your managed service provider’s current engine version and roadmap.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Query &lt;code&gt;INFO server&lt;/code&gt; on a running instance — the output identifies the fork and exact version unambiguously:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;redis-cli&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; INFO&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; server&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -E&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;redis_version|valkey_version|server_name|redis_git|os:&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Redis:  redis_version:7.2.4&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Valkey: redis_version:7.2.4  (Valkey 7.2.x keeps reporting the Redis version it is compatible with)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;#         server_name:valkey, valkey_version:7.2.5  (added by Valkey; absent on Redis)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;INFO server&lt;/code&gt; against each production Redis instance and record the version. If any are 7.4 or later, assess your distribution exposure. If you are on AWS ElastiCache, open the console and check the engine version — you may already be on Valkey and just not know it.&lt;/li&gt;
&lt;/ul&gt;
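&lt;p&gt;For a fleet of instances, the same check can be scripted. The sketch below parses the text returned by &lt;code&gt;INFO server&lt;/code&gt; and flags instances that need a license review. It assumes Valkey reports a &lt;code&gt;valkey_version&lt;/code&gt; field, as noted in the comments above; the function names are hypothetical, and how you collect the output from each host is left out.&lt;/p&gt;

```python
def parse_info(text: str) -> dict:
    """Parse key:value lines of INFO output, skipping blanks and # section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def classify(fields: dict) -> tuple:
    """Return (engine, version, needs_license_review)."""
    if "valkey_version" in fields:
        return ("valkey", fields["valkey_version"], False)
    version = fields.get("redis_version", "")
    if not version:
        return ("unknown", "", False)
    major, minor = (int(part) for part in version.split(".")[:2])
    # Redis 7.4 and later ship under the RSALv2/SSPL terms discussed in this post.
    return ("redis", version, (major, minor) >= (7, 4))
```

&lt;p&gt;A flagged instance is only a prompt to assess distribution exposure; the version number alone does not decide whether the license terms apply to you.&lt;/p&gt;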
&lt;p&gt;The license change matters for a specific set of roles, and it barely registers for everyone else. The engineers who get hurt are the ones who either ignore it completely when they shouldn’t, or treat it as a fire drill when it doesn’t apply to them. Know which situation you are in before deciding how much energy to spend.&lt;/p&gt;</content:encoded><category>databases</category><category>architecture</category></item><item><title>MySQL 8.4 LTS: What DBAs Should Check Before Upgrade</title><link>https://rajivonai.com/blog/2024-01-29-mysql-84-lts-what-dbas-should-check-before-upgrade/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-01-29-mysql-84-lts-what-dbas-should-check-before-upgrade/</guid><description>MySQL 8.4 is the first long-term support release in the 8.x line — five breaking changes that require verification before any production upgrade.</description><pubDate>Tue, 07 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;MySQL 8.4, released April 30, 2024, is the first long-term support release in the 8.x series and will receive extended security and bug-fix support — but the upgrade path has real breaking changes that will silently break application authentication, pagination queries, and GROUP BY logic if you do not check them first.&lt;/strong&gt; The most dangerous change is the authentication plugin enforcement. Old client libraries that do not support &lt;code&gt;caching_sha2_password&lt;/code&gt; will fail to connect after the upgrade, and the failure mode is a hard connection error, not a graceful fallback.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Oracle shipped MySQL 8.4 as the first LTS release in April 2024, consolidating changes introduced throughout the 8.x Innovation releases. MySQL 8.0 introduced &lt;code&gt;caching_sha2_password&lt;/code&gt; as the new default authentication plugin in 2018, but left &lt;code&gt;mysql_native_password&lt;/code&gt; available as a fallback. Many applications stayed on the native password plugin because connector support for &lt;code&gt;caching_sha2_password&lt;/code&gt; was uneven in the early years. In MySQL 8.4, that path is now narrower: &lt;code&gt;caching_sha2_password&lt;/code&gt; is fully enforced as the default, and &lt;code&gt;mysql_native_password&lt;/code&gt; is deprecated and disabled by default.&lt;/p&gt;
&lt;p&gt;The LTS designation matters operationally: 8.4 will receive bug fixes and security patches through a longer window than standard Innovation releases, making it the natural target for organizations that want a stable upgrade from 8.0. But “long-term support” does not mean “backward compatible with everything in 8.0.” Five specific changes require explicit verification before any production upgrade.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The authentication change is the most disruptive because it fails at connection time, before the application executes any SQL. A Django app using &lt;code&gt;mysqlclient&lt;/code&gt; 1.x, a PHP application using an outdated &lt;code&gt;mysqlnd&lt;/code&gt;, or any service using the legacy &lt;code&gt;mysql-connector-python&lt;/code&gt; without SHA-2 support will fail to connect to a MySQL 8.4 server where user accounts are configured with the new default plugin.&lt;/p&gt;
&lt;p&gt;Beyond authentication, MySQL 8.4 still carries two deprecated features that appear in more production codebases than most DBAs realize: &lt;code&gt;SQL_CALC_FOUND_ROWS&lt;/code&gt; and the associated &lt;code&gt;FOUND_ROWS()&lt;/code&gt; function, which are commonly used for pagination. Both have been deprecated since MySQL 8.0.17 and are scheduled for removal in a future release. Applications that use &lt;code&gt;SELECT SQL_CALC_FOUND_ROWS * FROM table WHERE ... LIMIT 20&lt;/code&gt; to get both the page results and the total row count in one query still run on 8.4 but raise deprecation warnings, so the LTS upgrade is the right moment to replace the pattern before it becomes a syntax error. How can engineering teams ensure their applications survive the transition to MySQL 8.4 LTS?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The core concept for a safe MySQL 8.4 upgrade is a pre-flight verification checklist that audits client connector capabilities, application query patterns, and server configuration prior to the cutover.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Pre-flight Check] --&gt; B[Audit Authentication]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[Audit Query Patterns]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; D[Audit Server Config]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; E[Identify Legacy Accounts]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; F[Verify SHA-2 Support]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; G[Remove SQL_CALC_FOUND_ROWS]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; H[Add Explicit ORDER BY]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; I[Enforce GTID Consistency]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; J[Audit utf8mb3 Usage]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
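&lt;p&gt;Part of the query-pattern audit in the diagram can be automated with a plain text scan. The sketch below is one possible approach, not an official tool: it searches source text for a few identifiers named in this checklist. The pattern list and the &lt;code&gt;scan&lt;/code&gt; function are illustrative assumptions, and a text match only marks a line for human review.&lt;/p&gt;

```python
import re

# Identifiers named in the checklist; extend the mapping as needed.
PATTERNS = {
    "SQL_CALC_FOUND_ROWS": re.compile(r"\bSQL_CALC_FOUND_ROWS\b", re.IGNORECASE),
    "FOUND_ROWS()": re.compile(r"\bFOUND_ROWS\s*\(", re.IGNORECASE),
    "mysql_native_password": re.compile(r"\bmysql_native_password\b", re.IGNORECASE),
    "utf8mb3": re.compile(r"\butf8mb3\b", re.IGNORECASE),
}


def scan(source: str) -> list:
    """Return (line_number, pattern_name) for each checklist pattern found in the text."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits
```

&lt;p&gt;Run it over application code, migration files, and ORM-generated SQL logs; queries built dynamically at runtime will not show up in a static scan, so pair it with the server-side checks that follow.&lt;/p&gt;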
&lt;p&gt;&lt;strong&gt;1. Authentication plugin: caching_sha2_password enforcement&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Check which accounts still use &lt;code&gt;mysql_native_password&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; User, Host, plugin&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; mysql&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;user&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; plugin &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;mysql_native_password&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For each account returned, verify the connecting client library version supports &lt;code&gt;caching_sha2_password&lt;/code&gt;. Upgrade connectors before migrating accounts. To migrate an account:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; USER&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;appuser&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;@&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; IDENTIFIED &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; caching_sha2_password &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BY&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;password&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. SQL_CALC_FOUND_ROWS deprecation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Search application code for &lt;code&gt;SQL_CALC_FOUND_ROWS&lt;/code&gt; and &lt;code&gt;FOUND_ROWS()&lt;/code&gt;. Both are deprecated and scheduled for removal in a future release. The replacement is a separate &lt;code&gt;COUNT(*)&lt;/code&gt; query:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Old pattern (deprecated; still runs in 8.4 with a warning)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SQL_CALC_FOUND_ROWS &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;active&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; FOUND_ROWS();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Replacement pattern&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; COUNT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;active&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;active&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The MySQL Reference Manual documents the deprecation and recommends this two-query replacement.&lt;/p&gt;
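&lt;p&gt;The search step can be automated. The following is a minimal sketch, assuming a hypothetical helper named &lt;code&gt;findFoundRowsUsage&lt;/code&gt; that receives the text of one source file. It uses plain pattern matching, so it will also flag matches inside comments and string fragments.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Hypothetical helper: list each line that mentions SQL_CALC_FOUND_ROWS or FOUND_ROWS().&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function findFoundRowsUsage(sourceText) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Matches the modifier or a FOUND_ROWS( call, in any letter case.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const pattern = /\bSQL_CALC_FOUND_ROWS\b|\bFOUND_ROWS\s*\(/i;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const hits = [];&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  sourceText.split(&apos;\n&apos;).forEach(function (line, index) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    if (pattern.test(line)) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;      // Line numbers are 1-based so they match editor output.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      hits.push({ line: index + 1, text: line.trim() });&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  });&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return hits;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running it over each file in the repository and reviewing the reported line numbers gives a worklist for the pagination rewrite.&lt;/p&gt;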
&lt;p&gt;&lt;strong&gt;3. GROUP BY implicit sort behavior&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MySQL 5.7 and earlier returned GROUP BY results sorted by the grouped columns as a side effect of implementation, and applications came to depend on it. MySQL 8.0 removed that implicit sort, and 8.4 behaves the same way. Any query that still relies on implicit GROUP BY ordering needs an explicit ORDER BY clause added before the upgrade.&lt;/p&gt;
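&lt;p&gt;A first pass over extracted queries can be sketched as follows. The helper name &lt;code&gt;needsExplicitOrderBy&lt;/code&gt; is an assumption, and the check is a text heuristic rather than a SQL parser, so treat its output as candidates for review rather than a verdict.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Heuristic sketch: flag statements that group rows but never state an order.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Subqueries and SQL comments can fool it, because it only matches text.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function needsExplicitOrderBy(sql) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const hasGroupBy = /\bGROUP\s+BY\b/i.test(sql);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const hasOrderBy = /\bORDER\s+BY\b/i.test(sql);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Only grouped queries without any ORDER BY are reported.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return hasGroupBy ? !hasOrderBy : false;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;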
&lt;p&gt;&lt;strong&gt;4. GTID enforcement&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MySQL 8.4 more strongly encourages &lt;code&gt;gtid_mode=ON&lt;/code&gt; and treats GTID-related settings as preferred defaults. Verify your replication setup:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; @@gtid_mode, @@enforce_gtid_consistency;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are on &lt;code&gt;OFF&lt;/code&gt; or &lt;code&gt;OFF_PERMISSIVE&lt;/code&gt;, test the upgrade path in staging with GTID implications in scope.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. utf8mb3 deprecation acceleration&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MySQL 8.4 accelerates warnings around &lt;code&gt;utf8mb3&lt;/code&gt; (the 3-byte UTF-8 variant that MySQL labeled as &lt;code&gt;utf8&lt;/code&gt;). Any schema still using the &lt;code&gt;utf8&lt;/code&gt; alias that intends 3-byte encoding should be explicitly audited. The MySQL documentation notes that &lt;code&gt;utf8mb3&lt;/code&gt; remains functional but its deprecation path is active.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Oracle’s MySQL documentation confirms that &lt;code&gt;mysql_native_password&lt;/code&gt; is deprecated (since MySQL 8.0.34) and disabled by default in MySQL 8.4. Accounts still assigned to the plugin cannot authenticate unless the server explicitly re-enables it, and clients that lack SHA-2 support are rejected with a fatal error rather than falling back to an older mechanism.&lt;/p&gt;
&lt;p&gt;Oracle’s MySQL documentation marks the &lt;code&gt;SQL_CALC_FOUND_ROWS&lt;/code&gt; query modifier and the &lt;code&gt;FOUND_ROWS()&lt;/code&gt; function as deprecated since MySQL 8.0.17 and slated for removal in a future release. MySQL 8.4 still accepts them and raises a deprecation warning, so pagination code that depends on them keeps working for now but carries a scheduled break.&lt;/p&gt;
&lt;p&gt;The MySQL documentation also states that &lt;code&gt;GROUP BY&lt;/code&gt; results carry no guaranteed order unless an &lt;code&gt;ORDER BY&lt;/code&gt; clause is provided. Systems that rely on legacy implicit sorting can see row order change after moving to the 8.4 execution engine.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Old client library without SHA-2 support&lt;/td&gt;&lt;td&gt;Hard connection failure at connect time&lt;/td&gt;&lt;td&gt;Client cannot negotiate caching_sha2_password handshake&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;SQL_CALC_FOUND_ROWS in pagination layer&lt;/td&gt;&lt;td&gt;Deprecation warning today, syntax error once the modifier is removed&lt;/td&gt;&lt;td&gt;Deprecated since 8.0.17 and slated for removal in a future release&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Implicit GROUP BY ordering in report queries&lt;/td&gt;&lt;td&gt;Result order changes silently&lt;/td&gt;&lt;td&gt;Row order is not guaranteed without an explicit ORDER BY&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: The upcoming MySQL 8.4 LTS has breaking changes that fail silently or hard depending on the client library, query patterns, and schema encoding in use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Run the authentication query to find &lt;code&gt;mysql_native_password&lt;/code&gt; accounts, search application code for &lt;code&gt;SQL_CALC_FOUND_ROWS&lt;/code&gt;, and verify connector versions before any upgrade.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Deploy to a staging environment running 8.4 with production schema and a representative set of application queries; connection failures and syntax errors surface immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;SELECT User, Host, plugin FROM mysql.user WHERE plugin = &apos;mysql_native_password&apos;&lt;/code&gt; on any server targeted for 8.4 upgrade and cross-reference each account against the connecting application’s connector version.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The LTS designation makes 8.4 worth upgrading to — but LTS means the maintenance window is longer, not that the upgrade is risk-free. The five checks above are the difference between a smooth cutover and an unplanned rollback at 2 AM.&lt;/p&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>Shopify-Style Multi-Tenant Commerce Databases: Isolation, Sharding, and Operational Controls</title><link>https://rajivonai.com/blog/2024-04-15-shopify-style-multi-tenant-commerce-databases-isolation-sharding-and-operational-controls/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-04-15-shopify-style-multi-tenant-commerce-databases-isolation-sharding-and-operational-controls/</guid><description>Shopify-style per-merchant sharding prevents one large tenant from turning shared commerce database infrastructure into a shared outage.</description><pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The dangerous part of a multi-tenant commerce database is not that one merchant becomes large; it is that one merchant can turn shared infrastructure into a shared failure.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Commerce platforms start with an attractive database model: every shop shares one application, one schema, and one operational surface. A &lt;code&gt;shop_id&lt;/code&gt; column scopes orders, products, customers, inventory, discounts, and fulfillment state. The product team moves quickly because every feature lands once. The platform team can provision a new merchant without creating databases, queues, caches, dashboards, and backup policies for each account.&lt;/p&gt;
&lt;p&gt;That model is rational. Early in the life of a commerce platform, tenant-per-database looks cleaner on a whiteboard but expensive in practice. It multiplies migrations, connection pools, backups, schema drift, and incident response. Shared tables with strict tenant scoping are often the correct first architecture.&lt;/p&gt;
&lt;p&gt;The shift comes when the workload stops being statistically smooth. A flash sale, bot campaign, import job, app integration, or checkout burst can make one shop dominate write IOPS, row locks, cache churn, background jobs, and replication lag. The platform is still logically multi-tenant, but operationally it behaves like the largest tenant owns the database.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure mode is subtle because the schema still looks isolated. Queries include &lt;code&gt;shop_id&lt;/code&gt;. Authorization checks pass. Unit tests prove that one shop cannot read another shop’s rows. Yet the database has no idea that tenants deserve independent blast radii. A hot merchant can fill the buffer pool with its products, pin locks around its checkouts, delay replication for unrelated shops, and consume worker capacity through retries.&lt;/p&gt;
&lt;p&gt;The usual reaction is to add read replicas, indexes, queue workers, or cache layers. Those help until the shared writer, shared migration path, or shared operational runbook becomes the bottleneck. The deeper problem is that tenant isolation has been implemented as a query predicate, not as an operational control.&lt;/p&gt;
&lt;p&gt;The design question is therefore: how do you keep the developer ergonomics of a shared commerce platform while making failures, migrations, and capacity decisions tenant-aware?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;A Shopify-style answer is to treat the tenant key as both a data model primitive and an operations primitive. The platform still presents one product, one admin, and one API surface, but internally each shop maps to a pod: a bounded slice of databases, caches, queues, and runtime capacity.&lt;/p&gt;
&lt;p&gt;The pod is not just a shard. A shard answers where the rows live. A pod answers what fails together, what scales together, what is drained together, and what can be moved under operational control.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[commerce request — shop context required] --&gt; B[tenant resolver — authenticated shop id]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[routing catalog — shop id to pod]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D[pod boundary — app workers and caches]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[writer shard — shop owned tables]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; F[replica set — guarded reads]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; G[async jobs — tenant scoped queues]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; H[CDC stream — logical table topics]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; I[control plane — shard moves and kill switches]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt; D&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt; E&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The request path must resolve tenant identity before touching application state. That identity chooses the pod, the writer shard, the replica policy, cache namespace, job routing, and operational limits. Once the request enters the pod, every downstream system should still carry the tenant context. The architecture should assume that missing tenant context is a production bug, not a convenience.&lt;/p&gt;
&lt;p&gt;The control plane is the important part. It owns the routing catalog, tenant placement, shard movement, read routing policy, throttles, and emergency controls. Without that layer, sharding becomes a library call scattered through application code. With it, operators can move a hot shop, drain a pod, disable expensive background work, or pin reads to a writer during replica lag without shipping a feature change.&lt;/p&gt;
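&lt;p&gt;The routing catalog is the smallest piece of that control plane to reason about. The following is a minimal in-memory sketch, not Shopify’s implementation: the class name &lt;code&gt;RoutingCatalog&lt;/code&gt;, its methods, and the pod identifiers are all assumptions made for illustration.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Hypothetical in-memory routing catalog: shop id to pod, with an audit trail.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;class RoutingCatalog {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  constructor() {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    this.placements = new Map(); // shopId to podId&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    this.auditLog = [];          // placement changes, oldest first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Place a shop on a pod, or move it; every change is recorded.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  assign(shopId, podId, operator) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    const previousPod = this.placements.has(shopId) ? this.placements.get(shopId) : null;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    this.placements.set(shopId, podId);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    this.auditLog.push({ shopId: shopId, from: previousPod, to: podId, operator: operator });&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Resolve the pod for a request; missing tenant context is treated as a bug.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  resolve(shopId) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    if (shopId === undefined || shopId === null) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      throw new Error(&apos;tenant context is required before routing&apos;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    if (!this.placements.has(shopId)) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      throw new Error(&apos;no pod placement for shop &apos; + shopId);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return this.placements.get(shopId);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In a real system the catalog would live in a replicated store, and a move would be coordinated with data copy and cutover. The sketch only shows the lookup and audit surface that operators change without an application deploy.&lt;/p&gt;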
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context.&lt;/strong&gt; Shopify publicly described reaching the point where buying a larger database server was no longer viable in 2015, then moving toward pods as an isolation model for its Rails monolith. In Shopify’s description, a pod is an isolated instance containing a MySQL shard and related datastores such as Redis and Memcached, while some infrastructure remains shared outside the pod boundary. See Shopify Engineering’s &lt;a href=&quot;https://shopify.engineering/a-pods-architecture-to-allow-shopify-to-scale&quot;&gt;“A Pods Architecture to Allow Shopify to Scale”&lt;/a&gt; and &lt;a href=&quot;https://shopify.engineering/blogs/engineering/mysql-database-shard-balancing-terabyte-scale&quot;&gt;“Shard Balancing: Moving Shops Confidently with Zero-Downtime at Terabyte-scale”&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action.&lt;/strong&gt; Shopify attached &lt;code&gt;shop_id&lt;/code&gt; to shop-owned tables and used it as the sharding key, according to its shard balancing write-up. That action matters because it makes tenant placement explicit. The data model, routing layer, and operational tooling can all agree on the same unit of movement: the shop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result.&lt;/strong&gt; Shopify’s public Rails patterns article describes Core as using a podded architecture where each pod contains a distinct subset of shops, and notes that if one pod shuts down temporarily, the other pods are not affected. That is the architectural result to target: not perfect uptime, but bounded failure. See &lt;a href=&quot;https://shopify.engineering/shopify-made-patterns-in-our-rails-apps&quot;&gt;“Shopify-Made Patterns in Our Rails Apps”&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning.&lt;/strong&gt; Sharding alone does not solve multi-tenancy. The documented pattern is that the shard key must become a control surface. Shopify’s CDC work shows the same lesson on the analytics side: their public write-up describes consuming changes from 100-plus MySQL shards and producing Kafka topics per logical table so downstream consumers did not need to understand source shard topology. See &lt;a href=&quot;https://shopify.engineering/capturing-every-change-shopify-sharded-monolith&quot;&gt;“Capturing Every Change From Shopify’s Sharded Monolith”&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The broader learning is portable: operational isolation should be designed before the first emergency shard split. If the only way to react to a noisy tenant is to add capacity to everyone, the architecture is still shared in the place that matters.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Control&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Cross-tenant reads&lt;/td&gt;&lt;td&gt;Tenant context is optional in application code&lt;/td&gt;&lt;td&gt;Require tenant resolution at request entry and enforce scoped data access helpers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Hot merchant overload&lt;/td&gt;&lt;td&gt;One shop dominates writer, cache, queue, or replica capacity&lt;/td&gt;&lt;td&gt;Move the shop, throttle expensive paths, isolate queues, and set pod-level budgets&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica inconsistency&lt;/td&gt;&lt;td&gt;Reads go to lagging replicas after writes&lt;/td&gt;&lt;td&gt;Track replication lag and route sensitive reads to the writer when needed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Shard imbalance&lt;/td&gt;&lt;td&gt;Tenant growth changes after initial placement&lt;/td&gt;&lt;td&gt;Maintain shard balancing tooling and measure load by tenant, not only by database&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Global migrations stall&lt;/td&gt;&lt;td&gt;Schema changes execute across every shard at once&lt;/td&gt;&lt;td&gt;Roll out by pod, pause safely, and verify per-shard completion&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Analytics coupling&lt;/td&gt;&lt;td&gt;Downstream systems depend on physical shard layout&lt;/td&gt;&lt;td&gt;Publish logical streams that hide shard placement&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Control plane drift&lt;/td&gt;&lt;td&gt;Routing metadata differs from actual data placement&lt;/td&gt;&lt;td&gt;Treat routing changes as audited operations with validation and rollback&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
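&lt;p&gt;The replica inconsistency control in the table can be sketched as a small routing decision. This is an illustration under assumed names: &lt;code&gt;chooseReadTarget&lt;/code&gt;, &lt;code&gt;readAfterWrite&lt;/code&gt;, &lt;code&gt;replicaLagSeconds&lt;/code&gt;, and &lt;code&gt;maxLagSeconds&lt;/code&gt; are hypothetical, and the lag budget would be set per pod by the control plane.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Hypothetical read-routing policy: fall back to the writer when replicas lag.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function chooseReadTarget(request) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Reads that must observe a write just made by the caller go to the writer.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (request.readAfterWrite) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return &apos;writer&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Otherwise use a replica unless its lag exceeds the assumed budget.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return request.replicaLagSeconds &gt; request.maxLagSeconds ? &apos;writer&apos; : &apos;replica&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;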
&lt;p&gt;The hardest breakage is cultural. Once a platform shards by tenant, product teams can no longer pretend the database is a single invisible resource. They need APIs for tenant-scoped jobs, shard-safe migrations, cross-shop reporting, and backfills. Querying across all shops becomes an explicit platform workflow, not an accidental SQL habit.&lt;/p&gt;
&lt;p&gt;That cost is worth paying only when the shared model is already creating operational risk. Premature sharding slows engineering. Late sharding turns every incident into archaeology. The right time is when the team can name the tenants, jobs, tables, and operational events that would benefit from a smaller blast radius.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Identify the top tenant-driven failure modes: write saturation, lock contention, replica lag, cache churn, job backlog, and migration duration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Make tenant identity mandatory at the request boundary, then route data, cache, queues, and controls through a pod-aware control plane.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Run failure drills by disabling a pod, forcing replica lag, moving a tenant, pausing a shard migration, and replaying CDC from one shard.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Build the smallest operational primitive first: a routing catalog that maps tenant to shard, is audited, is testable, and can be changed without redeploying application code.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>databases</category><category>cloud</category></item><item><title>MongoDB Version Upgrade Risk Review</title><link>https://rajivonai.com/blog/2024-04-08-mongodb-version-upgrade-risk-review/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-04-08-mongodb-version-upgrade-risk-review/</guid><description>A systematic runbook for assessing MongoDB version upgrade risk — FCV, driver compatibility, deprecated operators, and rollback paths before any production cutover.</description><pubDate>Mon, 08 Apr 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;MongoDB version upgrades carry more production risk than most teams account for, because the feature compatibility version (FCV) mechanism decouples the binary version from the data format — and most rollback paths close permanently once FCV advances past the point where downgrade is possible.&lt;/strong&gt; An upgrade that goes wrong after FCV has been bumped is not a rollback problem. It is a restore-from-backup problem.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;A team is planning a MongoDB upgrade from 5.0 to 6.0, or 6.0 to 7.0. The driver compatibility matrix has changed. Several aggregation operators behave differently or are deprecated. The replica set protocol version may need to advance. And someone on the platform team has noted that the mongosh syntax for a few administrative commands changed.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;MongoDB upgrades require sequential major version hops — you cannot skip from 5.0 to 7.0 directly. Each hop involves verifying FCV, testing driver compatibility, checking for removed or changed operators in application code, running staging validation, and confirming the rollback window before advancing FCV.&lt;/p&gt;
&lt;p&gt;This is not a simple package upgrade. The upgrade and the FCV advancement are two separate actions with different risk profiles. If a team simply upgrades the binaries and immediately bumps the FCV without validating application driver compatibility or verifying the removal of deprecated operators, they can trigger an immediate production outage. Worse, because the FCV bump updates internal catalog formats, the team can no longer simply downgrade the binaries to recover.&lt;/p&gt;
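&lt;p&gt;The sequential-hop rule is easy to make explicit when planning. The sketch below assumes a hand-maintained list named &lt;code&gt;MAJOR_SERIES&lt;/code&gt; and a hypothetical helper &lt;code&gt;upgradePath&lt;/code&gt;; confirm the series list against the release notes for your deployment before relying on it.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Assumed list of major release series in order; extend it as new majors ship.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;const MAJOR_SERIES = [&apos;4.0&apos;, &apos;4.2&apos;, &apos;4.4&apos;, &apos;5.0&apos;, &apos;6.0&apos;, &apos;7.0&apos;];&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Return every hop needed to get from one major series to another, in order.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// The result is empty when the target is not newer than the starting series.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function upgradePath(fromSeries, toSeries) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const start = MAJOR_SERIES.indexOf(fromSeries);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const end = MAJOR_SERIES.indexOf(toSeries);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (start === -1 || end === -1) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    throw new Error(&apos;unknown major series&apos;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return MAJOR_SERIES.slice(start + 1, end + 1);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each entry in the returned list is a full upgrade cycle with its own validation and its own FCV decision, not a step that can be batched with the next one.&lt;/p&gt;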
&lt;p&gt;Symptoms that an upgrade is poorly prepared or encountering friction include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;FCV below current server version:&lt;/strong&gt; &lt;code&gt;db.adminCommand({getParameter:1, featureCompatibilityVersion:1})&lt;/code&gt; shows a lower version, meaning features are locked.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver version mismatch warnings:&lt;/strong&gt; Seen in the &lt;code&gt;mongod&lt;/code&gt; log at startup when the client driver version is not supported by the target MongoDB version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deprecated operator warnings:&lt;/strong&gt; Seen in the &lt;code&gt;mongod&lt;/code&gt; log during query execution if the application uses operators slated for removal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unexpected replica set elections:&lt;/strong&gt; Protocol version changes triggering re-elections post-upgrade.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Application connection failures:&lt;/strong&gt; Authentication plugin or TLS changes breaking connections immediately after the upgrade.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The core question is: how can a team safely upgrade MongoDB while preserving a fast rollback path until stability is proven?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;To manage MongoDB upgrades safely, the binary upgrade must be decoupled from the FCV advancement, with rigorous validation gates in between.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[MongoDB version upgrade planned] --&gt; B{FCV at current version}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| C[Set FCV to current version — validate stability]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D[Wait 24h — confirm no issues]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| E{Driver version compatible with target}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|no| F[Upgrade drivers first — deploy app changes]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G[Validate app against current server with new driver]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; E&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| H{Staging environment tested}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| I[Run full upgrade in staging — execute application test suite]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| J{Removed operators found in app code}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| K[Update application code — remove deprecated operators]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| L{Rollback plan documented}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| M[Document FCV downgrade path and backup restore procedure]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| N[Proceed with binary upgrade on replica set members]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    N --&gt; O[Validate application — then advance FCV]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
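&lt;p&gt;The gates in the flowchart can also be read as an ordered checklist. The following is a minimal sketch, assuming a hypothetical &lt;code&gt;state&lt;/code&gt; object whose boolean fields are filled in by whoever runs the review; the field names are illustrative and not part of any MongoDB tool.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Walk the gates in flowchart order and return the first one that is not met.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function nextUpgradeStep(state) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (!state.fcvAtCurrentVersion) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return &apos;Set FCV to current version and validate stability&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (!state.driverCompatible) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return &apos;Upgrade drivers first and deploy app changes&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (!state.stagingTested) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return &apos;Run full upgrade in staging and execute the application test suite&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (state.removedOperatorsInCode) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return &apos;Update application code to remove deprecated operators&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (!state.rollbackDocumented) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return &apos;Document FCV downgrade path and backup restore procedure&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Every gate passed: binaries first, FCV only after the application is validated.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return &apos;Proceed with binary upgrade, validate application, then advance FCV&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;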
&lt;h3 id=&quot;pre-flight-checks&quot;&gt;Pre-Flight Checks&lt;/h3&gt;
&lt;p&gt;Before touching any binaries, the following conditions must be validated:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Feature Compatibility Version — current state:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;adminCommand&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ getParameter: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, featureCompatibilityVersion: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The FCV must be set to the current major version before starting the upgrade. If you are on MongoDB 5.0 and FCV is &lt;code&gt;&quot;4.4&quot;&lt;/code&gt;, you need to advance FCV to &lt;code&gt;&quot;5.0&quot;&lt;/code&gt; first and confirm stability before proceeding to 6.0. Running a higher binary version with a lower FCV is a temporary supported state, not a stable configuration.&lt;/p&gt;
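&lt;p&gt;That comparison can be scripted against the command output. The sketch below is an illustration: &lt;code&gt;fcvMatchesBinary&lt;/code&gt; is a hypothetical helper that takes the document returned by the &lt;code&gt;getParameter&lt;/code&gt; call above and a binary version string such as the one reported by &lt;code&gt;db.version()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Hypothetical helper: true only when FCV equals the major series of the binary.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function fcvMatchesBinary(fcvResult, binaryVersion) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const fcv = fcvResult.featureCompatibilityVersion;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // A targetVersion field is present while an FCV change is still in progress.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  if (fcv.targetVersion !== undefined) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    return false;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  // Reduce a full version such as 5.0.14 to its major series, 5.0.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const majorSeries = binaryVersion.split(&apos;.&apos;).slice(0, 2).join(&apos;.&apos;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return fcv.version === majorSeries;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In mongosh the call would look like &lt;code&gt;fcvMatchesBinary(db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 }), db.version())&lt;/code&gt;, run against each replica set before the upgrade is scheduled.&lt;/p&gt;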
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Driver version compatibility:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each MongoDB driver has a minimum supported server version. The compatibility matrix is published in the MongoDB documentation. Key checks:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// In your application, log the driver version at startup&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Python (pymongo), run in a Python shell:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;//   import pymongo; print(pymongo.version)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Node.js (mongodb driver), run in the project directory:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;//   npm ls mongodb   (or check package.json for the mongodb driver version)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The MongoDB 6.0 server dropped support for drivers older than specific versions. Any driver that predates the compatibility matrix minimum will fail to connect or exhibit undefined behavior.&lt;/p&gt;
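&lt;p&gt;Once the minimum driver version has been read from the compatibility matrix, the comparison itself is mechanical. The following sketch assumes plain dotted numeric versions and two hypothetical helpers, &lt;code&gt;compareVersions&lt;/code&gt; and &lt;code&gt;driverIsSupported&lt;/code&gt;; the minimum is an input you supply from the published matrix, not a value the code knows.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Compare dotted version strings numerically, part by part.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Returns -1, 0, or 1. Missing or non-numeric parts count as 0.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function compareVersions(a, b) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const left = a.split(&apos;.&apos;).map(Number);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const right = b.split(&apos;.&apos;).map(Number);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  const length = Math.max(left.length, right.length);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  for (let i = 0; i &lt; length; i += 1) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    const diff = (left[i] || 0) - (right[i] || 0);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    if (diff !== 0) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      return Math.sign(diff);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return 0;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// minimumVersion comes from the compatibility matrix for the target server.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;function driverIsSupported(driverVersion, minimumVersion) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  return compareVersions(driverVersion, minimumVersion) &gt;= 0;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Comparing part by part matters because a plain string comparison would rank 4.9 above 4.10.&lt;/p&gt;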
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Deprecated or removed commands:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// List available commands on current server&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;adminCommand&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ listCommands: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;MongoDB 6.0 removed several commands and changed the behavior of others. The release notes are authoritative.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Deprecated aggregation operators:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The release notes document restrictions on &lt;code&gt;$where&lt;/code&gt; and requirements for &lt;code&gt;$accumulator&lt;/code&gt; and &lt;code&gt;$function&lt;/code&gt;, all of which run server-side JavaScript. Search application code for these patterns before upgrading:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Search for commonly changed operators in application code&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -r&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;\$where\|\$function\|\$accumulator\|\$group.*\$sort&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ./src/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Replica set protocol version:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;adminCommand&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ replSetGetConfig: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }).config&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check &lt;code&gt;protocolVersion&lt;/code&gt;: MongoDB 4.0 and later use protocol version 1. Any legacy replica set configuration referencing protocol version 0 needs to be updated. Also review election-related settings such as &lt;code&gt;settings.electionTimeoutMillis&lt;/code&gt;, because election behavior depends on the protocol version.&lt;/p&gt;
&lt;h3 id=&quot;remediation-paths&quot;&gt;Remediation Paths&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Sequential FCV advancement with validation gates&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The safe upgrade path requires waiting before executing the final step:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Step 1: Confirm current FCV&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;adminCommand&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ getParameter: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, featureCompatibilityVersion: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Step 2: After binary upgrade, validate application for 24-48 hours&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// DO NOT advance FCV until validation is complete&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Step 3: Advance FCV only after application validates&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;adminCommand&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ setFeatureCompatibilityVersion: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;6.0&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
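&lt;p&gt;The validation gate between Step 2 and Step 3 can be enforced in code rather than by convention. This is a sketch under stated assumptions: &lt;code&gt;canAdvanceFcv&lt;/code&gt; and its parameters are hypothetical names, and the timestamp of the binary upgrade is assumed to be recorded by your own tooling.&lt;/p&gt;

```javascript
// Decide whether the FCV bump is allowed yet.
// `binaryUpgradedAt` and `now` are Date objects; `minHours` defaults to the
// lower bound of the 24-48 hour validation window.
function canAdvanceFcv({ binaryUpgradedAt, now, validationPassed, minHours = 24 }) {
  const elapsedHours = (now.getTime() - binaryUpgradedAt.getTime()) / 3600000;
  if (!validationPassed) {
    return { allowed: false, reason: "application validation has not passed" };
  }
  if (minHours > elapsedHours) {
    return { allowed: false, reason: `only ${elapsedHours.toFixed(1)}h since binary upgrade` };
  }
  return { allowed: true, reason: "validation window complete" };
}

const upgradedAt = new Date("2026-01-01T00:00:00Z");
console.log(canAdvanceFcv({
  binaryUpgradedAt: upgradedAt,
  now: new Date("2026-01-01T10:00:00Z"),
  validationPassed: true,
}).allowed); // false: only 10 hours have elapsed
```

&lt;p&gt;Returning a reason alongside the decision makes the gate usable in a runbook or CI job, where a bare &lt;code&gt;false&lt;/code&gt; tells the operator nothing.&lt;/p&gt;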
&lt;p&gt;&lt;strong&gt;Rolling upgrades&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MongoDB supports rolling upgrades: upgrade secondaries first, step down the primary, then upgrade the former primary.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Step down primary after secondaries are upgraded and caught up&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;adminCommand&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ replSetStepDown: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;60&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Upgrade primary binary, then confirm replica set is healthy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;rs.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
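&lt;p&gt;The ordering rule (secondaries first, primary last) can be derived from the replica set state instead of being typed by hand. The sketch below assumes input shaped like &lt;code&gt;rs.status().members&lt;/code&gt;, using only the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;stateStr&lt;/code&gt; fields; &lt;code&gt;planRollingUpgrade&lt;/code&gt; is a hypothetical helper, and members in any other state are surfaced for manual review rather than scheduled.&lt;/p&gt;

```javascript
// `members` mirrors the shape of rs.status().members: each entry has a
// `name` (host:port) and a `stateStr` such as "PRIMARY" or "SECONDARY".
function planRollingUpgrade(members) {
  const byState = (state) =>
    members.filter((m) => m.stateStr === state).map((m) => m.name);
  const secondaries = byState("SECONDARY");
  const primaries = byState("PRIMARY");
  // Arbiters, recovering nodes, and so on are not scheduled automatically.
  const needsAttention = members
    .filter((m) => !["PRIMARY", "SECONDARY"].includes(m.stateStr))
    .map((m) => m.name);

  const steps = secondaries.map((name) => `upgrade secondary ${name}`);
  for (const name of primaries) {
    steps.push(`step down primary ${name}`);
    steps.push(`upgrade former primary ${name}`);
  }
  return { steps, needsAttention };
}

console.log(planRollingUpgrade([
  { name: "db1:27017", stateStr: "PRIMARY" },
  { name: "db2:27017", stateStr: "SECONDARY" },
  { name: "db3:27017", stateStr: "SECONDARY" },
]).steps);
```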
&lt;h3 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h3&gt;
&lt;p&gt;A pre-upgrade validation script in staging can catch failure modes before they reach production:&lt;/p&gt;
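&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Validate FCV is at current version (example value: set this to the release you are upgrading from)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; EXPECTED_VERSION&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;5.0&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;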
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;let&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fcv &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;adminCommand&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ getParameter: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, featureCompatibilityVersion: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; });&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;assert.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;eq&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(fcv.featureCompatibilityVersion.version, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;EXPECTED_VERSION&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &quot;FCV not at current version — do not proceed&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Check for active connections with outdated drivers&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;currentOp&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().inprog.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;forEach&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#FFAB70&quot;&gt;op&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  if&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (op.clientMetadata &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x26;&amp;#x26;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; op.clientMetadata.driver) {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;    print&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Driver:&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, op.clientMetadata.driver.name, op.clientMetadata.driver.version);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;});&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
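&lt;p&gt;Printing one line per operation gets noisy on a busy cluster. A small aggregation step makes the output reviewable. The sketch below assumes input shaped like &lt;code&gt;db.currentOp().inprog&lt;/code&gt;; &lt;code&gt;summarizeDrivers&lt;/code&gt; is a hypothetical helper. Since &lt;code&gt;currentOp&lt;/code&gt; reports only in-progress operations by default, a single sample may miss idle clients, so it is worth sampling several times.&lt;/p&gt;

```javascript
// `inprog` mirrors db.currentOp().inprog: operations from driver connections
// carry clientMetadata.driver with `name` and `version`.
function summarizeDrivers(inprog) {
  const counts = new Map();
  for (const op of inprog) {
    const driver = op.clientMetadata ? op.clientMetadata.driver : undefined;
    if (!driver) continue; // internal operations have no driver metadata
    const key = `${driver.name} ${driver.version}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Most common driver/version pairs first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(summarizeDrivers([
  { clientMetadata: { driver: { name: "nodejs", version: "4.1.0" } } },
  { clientMetadata: { driver: { name: "nodejs", version: "4.1.0" } } },
  { clientMetadata: { driver: { name: "PyMongo", version: "3.12.0" } } },
  { op: "none" },
]));
```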
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A)&lt;/strong&gt; The engineering team at Coinbase has publicly documented their MongoDB cluster management strategies, emphasizing that major upgrades at scale require rigorous, automated testing of driver compatibility and data format changes in staging before touching production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;B)&lt;/strong&gt; Derived directly from MongoDB’s architecture, the &lt;code&gt;setFeatureCompatibilityVersion&lt;/code&gt; command actively rewrites internal system collections. For example, upgrading to 6.0 and setting FCV to “6.0” alters how change streams and time-series collections are structured, permanently preventing older 5.0 binaries from reading the files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;C)&lt;/strong&gt; The documented pattern across high-reliability platform teams is to leave the FCV at the older version for days or even weeks after a rolling binary upgrade, treating the final FCV bump as the true point-of-no-return.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Tradeoff&lt;/th&gt;&lt;th&gt;Why it fails&lt;/th&gt;&lt;th&gt;How to mitigate&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Driver Mismatches&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Upgraded MongoDB servers drop support for older drivers, causing connection drops or authentication failures at startup.&lt;/td&gt;&lt;td&gt;Always upgrade application drivers and validate against the current MongoDB version &lt;em&gt;before&lt;/em&gt; touching the database binaries.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Premature FCV Bump&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Running &lt;code&gt;setFeatureCompatibilityVersion&lt;/code&gt; immediately after a binary upgrade destroys the ability to downgrade if application bugs appear.&lt;/td&gt;&lt;td&gt;Enforce a strict 24 to 48 hour validation period between binary upgrade and FCV advancement.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Deprecated Operators&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Target versions remove or restrict deprecated operators (e.g., specific &lt;code&gt;$where&lt;/code&gt; behaviors), so affected queries fail at runtime.&lt;/td&gt;&lt;td&gt;Audit application code via static analysis and review slow query logs for deprecated operators before starting the upgrade.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Protocol Version Changes&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Upgrading replica sets with legacy protocol configurations can trigger unexpected elections or split-brain scenarios.&lt;/td&gt;&lt;td&gt;Verify &lt;code&gt;protocolVersion&lt;/code&gt; is 1 and review election timeout settings before upgrading secondaries.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Data Format Rollback&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;After FCV is advanced, binary downgrade is blocked: the older binary will refuse to start on the upgraded data files.&lt;/td&gt;&lt;td&gt;The only recovery path is a full snapshot restore from a backup taken before the FCV change. Ensure restores are tested in staging.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; In-place MongoDB upgrades risk irreversible data format changes and application outages if compatibility is not strictly validated before the point of no return.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Decouple the binary upgrade from the Feature Compatibility Version (FCV) advancement, use a rolling replica set upgrade, and codify a strict validation window.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; MongoDB’s internal architecture requires FCV bumps to restructure data formats, meaning rollback paths permanently close the moment the command is executed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt;
&lt;ol&gt;
&lt;li&gt;Confirm FCV is at the current major version via &lt;code&gt;db.adminCommand({getParameter:1, featureCompatibilityVersion:1})&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Upgrade application drivers to target-compatible versions.&lt;/li&gt;
&lt;li&gt;Perform a rolling binary upgrade: upgrade the secondaries, step down the primary, then upgrade the former primary.&lt;/li&gt;
&lt;li&gt;Validate application behavior against the new binary for 24–48 hours before running &lt;code&gt;db.adminCommand({setFeatureCompatibilityVersion: &quot;X.0&quot;})&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category><category>architecture</category></item><item><title>Index Debt Review: How to Find Bad, Missing, and Duplicate Indexes</title><link>https://rajivonai.com/blog/2024-03-18-index-debt-review-bad-missing-duplicate-indexes/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-18-index-debt-review-bad-missing-duplicate-indexes/</guid><description>A SQL-driven audit workflow for identifying unused, duplicate, bloated, and missing indexes in PostgreSQL before they drain write performance and storage.</description><pubDate>Mon, 18 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Indexes accumulate silently.&lt;/strong&gt; Engineers add them to fix slow queries, migration scripts add them to enforce constraints, ORM scaffolding adds them speculatively, and nobody systematically removes them. Over several years, a database with 50 tables can accumulate 200 indexes — half of which are never used, a tenth of which duplicate each other, and several of which are invalid or bloated. The cost is paid on every write: each insert, update, and delete must maintain every index on the affected table, whether or not that index is ever scanned.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; tracks cumulative scan counts for every index since the last statistics reset. An index with &lt;code&gt;idx_scan = 0&lt;/code&gt; has not been scanned by any query since that reset. An index that duplicates another index means two identical maintenance operations happen on every write. An invalid index (one that failed partway through a &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;) takes up space and adds maintenance overhead without ever being selected by the planner.&lt;/p&gt;
&lt;p&gt;Index debt reviews should happen on a schedule, not just when disk is running low. Write amplification from carrying 40 unused indexes on a high-write table is not dramatic — it adds microseconds per write — but it compounds. At high write volume, the cumulative effect shows up as elevated lock contention during bulk operations and higher checkpoint I/O pressure.&lt;/p&gt;
&lt;p&gt;The review is a structured SQL audit. No tools required beyond &lt;code&gt;psql&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Table size growing faster than row count&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_size_pretty(pg_total_relation_size(...))&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Index bloat accumulating alongside table bloat&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow bulk inserts or updates on large tables&lt;/td&gt;&lt;td&gt;Application timing logs&lt;/td&gt;&lt;td&gt;Too many indexes being maintained per write&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;idx_scan = 0&lt;/code&gt; on multiple indexes&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_indexes&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Unused indexes consuming write bandwidth&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Duplicate entries in &lt;code&gt;pg_index&lt;/code&gt; by &lt;code&gt;indrelid&lt;/code&gt; and &lt;code&gt;indkey&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_index&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Redundant indexes doubling maintenance overhead&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;indisvalid = false&lt;/code&gt; in &lt;code&gt;pg_index&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_index&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Invalid indexes from failed concurrent builds&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High seq_scan count with low idx_scan&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Missing index on a frequently filtered column&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Unused indexes (zero scan count)&lt;/strong&gt; — the first thing to remove:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schemaname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;relname&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelname&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;idx_scan&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; index_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  i&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indisprimary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_indexes s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_index i &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; i&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;idx_scan&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; i&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indisprimary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; i&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indisunique&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sort by size to prioritize — a 10 GB unused index is a higher-priority removal than a 10 MB one. Exclude primary keys and unique constraints; those enforce data integrity regardless of query usage.&lt;/p&gt;
&lt;p&gt;Check when statistics were last reset before acting on zero-scan counts:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; stats_reset &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_database &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; datname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; current_database();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;stats_reset&lt;/code&gt; was yesterday, a zero scan count is not evidence. If it was 60+ days ago, it is reliable.&lt;/p&gt;
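&lt;p&gt;That rule of thumb is easy to encode so an audit script can refuse to recommend drops on fresh statistics. The sketch below is illustrative: &lt;code&gt;zeroScanEvidence&lt;/code&gt; is a hypothetical helper, the 60-day threshold is the one quoted above, and it assumes &lt;code&gt;stats_reset&lt;/code&gt; is non-null and has already been parsed into a &lt;code&gt;Date&lt;/code&gt;.&lt;/p&gt;

```javascript
// Classify how much weight a zero idx_scan count deserves, based on how long
// statistics have been accumulating since the last reset.
function zeroScanEvidence(statsReset, now, reliableAfterDays = 60) {
  const msPerDay = 86400000;
  const ageDays = Math.floor((now.getTime() - statsReset.getTime()) / msPerDay);
  return { ageDays, reliable: ageDays >= reliableAfterDays };
}

const now = new Date("2024-03-18T00:00:00Z");
console.log(zeroScanEvidence(new Date("2024-03-17T00:00:00Z"), now)); // { ageDays: 1, reliable: false }
console.log(zeroScanEvidence(new Date("2024-01-01T00:00:00Z"), now)); // { ageDays: 77, reliable: true }
```

&lt;p&gt;Anything between the two extremes is a judgment call. A window that misses a monthly or quarterly batch job can still hide a legitimately used index.&lt;/p&gt;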
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Duplicate indexes&lt;/strong&gt; — same table, same column list:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  indrelid::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  array_agg(indexrelid::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_relation_size(indexrelid) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexes,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  array_agg(pg_size_pretty(pg_relation_size(indexrelid)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_relation_size(indexrelid) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; sizes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indrelid, indkey&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;HAVING&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two indexes on &lt;code&gt;(customer_id)&lt;/code&gt; with identical definitions are pure overhead — keep the one with higher &lt;code&gt;idx_scan&lt;/code&gt; and drop the other. Duplicates often result from migration tools generating a new index when a unique constraint was added on a column that already had a regular index.&lt;/p&gt;
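&lt;p&gt;The keep-or-drop decision for each duplicate group can be sketched as a small ranking function. Everything here is an assumption layered on the rule above: &lt;code&gt;splitDuplicates&lt;/code&gt; and the row fields &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;idxScan&lt;/code&gt;, and &lt;code&gt;backsConstraint&lt;/code&gt; are hypothetical, and the sketch adds one safeguard that the query itself does not apply: an index backing a primary key or unique constraint is never proposed for removal.&lt;/p&gt;

```javascript
// Given one group of duplicate indexes, pick the one to keep and list the rest
// as drop candidates. Constraint-backed indexes win over plain ones; among
// equals, the index with the higher scan count is kept.
function splitDuplicates(indexes) {
  const ranked = [...indexes].sort((a, b) => {
    if (a.backsConstraint !== b.backsConstraint) return a.backsConstraint ? -1 : 1;
    return b.idxScan - a.idxScan;
  });
  return {
    keep: ranked[0].name,
    drop: ranked.slice(1).filter((i) => !i.backsConstraint).map((i) => i.name),
  };
}

console.log(splitDuplicates([
  { name: "idx_orders_customer_id", idxScan: 9200, backsConstraint: false },
  { name: "orders_customer_id_key", idxScan: 15, backsConstraint: true },
]));
// { keep: 'orders_customer_id_key', drop: [ 'idx_orders_customer_id' ] }
```

&lt;p&gt;Treat the output as a review list, not a script to execute. Confirm with the full index definitions that the candidates really are interchangeable before dropping anything.&lt;/p&gt;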
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Bloated or low-use large indexes&lt;/strong&gt; — high storage cost relative to usage:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;relname&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;idx_scan&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; index_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; raw_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_indexes s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;idx_scan&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; raw_size &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An index with fewer than 10 scans that takes 5 GB of storage is worth examining closely. Combine with the age of statistics reset to determine if “&amp;#x3C; 10 scans” reflects weeks of production traffic or just a few hours.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Tables with high sequential scan counts&lt;/strong&gt; — potential missing indexes:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  seq_scan,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  idx_scan,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
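&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  seq_scan &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; COALESCE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(idx_scan, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; seq_excess&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; seq_scan &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; COALESCE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(idx_scan, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- idx_scan is NULL when a table has no indexes&lt;/span&gt;&lt;/span&gt;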
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; seq_scan &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; seq_scan &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 15&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A table with 500,000 rows where &lt;code&gt;seq_scan = 10000&lt;/code&gt; and &lt;code&gt;idx_scan = 50&lt;/code&gt; is performing full table scans on almost every access. Pair this with &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; on the most frequent queries against that table to identify which column would benefit from an index.&lt;/p&gt;
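&lt;p&gt;To check whether those sequential scans really read most of the table, one option is to compare rows read per scan against the live row count. This is a sketch that reuses the thresholds from the query above; &lt;code&gt;avg_rows_per_seq_scan&lt;/code&gt; is an illustrative alias:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Average rows read per sequential scan, next to the live row count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  seq_scan,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  seq_tup_read,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  seq_tup_read / seq_scan AS avg_rows_per_seq_scan&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE seq_scan &gt; 100          -- also guards the division above&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND n_live_tup &gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY seq_tup_read DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 15;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An average close to &lt;code&gt;n_live_tup&lt;/code&gt; suggests each scan reads the whole table. A much smaller average can mean the scans stop early, for example under a &lt;code&gt;LIMIT&lt;/code&gt;, and an index may help less than the raw counts imply.&lt;/p&gt;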
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Invalid indexes&lt;/strong&gt; — indexes that must be rebuilt:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  indexrelid::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  indrelid::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(indexrelid)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; index_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indisvalid;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An invalid index results from a &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; that failed partway through, typically due to a deadlock or constraint violation. PostgreSQL keeps the partially-built index but marks it as invalid — it takes up space and triggers write maintenance but is never used by the planner. These must be rebuilt or dropped.&lt;/p&gt;
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Index audit triggered] --&gt; B{stats_reset recent?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes — under 30 days| C[Wait for 30 days of data before removing]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no — over 30 days of data| D{idx_scan = 0 indexes found?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| E{Primary key or unique constraint?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| F[Keep — data integrity requirement]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|no| G[DROP INDEX CONCURRENTLY]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no| H{Duplicate indexes found?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| I[Keep higher-scan index — drop duplicate]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| J{Invalid indexes found?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| K[REINDEX CONCURRENTLY]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| L{High seq_scan on large table?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| M[EXPLAIN slow query — add covering index]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| N[Index health OK — schedule next audit]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Drop unused indexes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Always use &lt;code&gt;CONCURRENTLY&lt;/code&gt; to avoid blocking writes:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Drop a specific unused index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; CONCURRENTLY &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schema_name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;unused_index_name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify it is gone&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_indexes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;unused_index_name&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt; waits for all transactions that reference the index to complete, then removes it. It does not hold an ACCESS EXCLUSIVE lock for the duration — it uses multiple lower-level locks and can coexist with reads and writes. It cannot run inside a transaction block.&lt;/p&gt;
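&lt;p&gt;Because the command waits on other transactions, it can help to look for long-running ones before starting. A minimal sketch; the 80-character truncation and the limit of 10 rows are arbitrary choices, and &lt;code&gt;xact_age&lt;/code&gt; and &lt;code&gt;query_snippet&lt;/code&gt; are illustrative aliases:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Oldest open transactions; any of these may hold up DROP INDEX CONCURRENTLY&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  xact_start,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - xact_start AS xact_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  left(query, 80) AS query_snippet&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE xact_start IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND pid != pg_backend_pid()   -- ignore this session&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY xact_start&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;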
&lt;p&gt;&lt;strong&gt;Option 2 — Rebuild invalid or bloated indexes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For invalid indexes from failed concurrent builds:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rebuild concurrently — creates new valid index, replaces old&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;REINDEX &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; CONCURRENTLY &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schema_name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;invalid_index_name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Or drop and recreate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; CONCURRENTLY &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schema_name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;invalid_index_name&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; CONCURRENTLY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_status &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For bloated indexes where the size has grown disproportionately to the data (common on tables with many deletes and updates), &lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt; reclaims the space. As a rough screening heuristic, compare &lt;code&gt;pg_relation_size(indexrelid)&lt;/code&gt; against &lt;code&gt;pg_relation_size(indrelid) * 0.1&lt;/code&gt;: an index larger than 10% of its table’s size on a low-selectivity column is worth investigating. The ratio only flags candidates; it does not measure bloat directly.&lt;/p&gt;
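&lt;p&gt;The 10% comparison can be sketched as a single query against &lt;code&gt;pg_stat_user_indexes&lt;/code&gt;, where &lt;code&gt;relid&lt;/code&gt; is the table and &lt;code&gt;indexrelid&lt;/code&gt; is the index. The &lt;code&gt;index_to_table_ratio&lt;/code&gt; alias and the limit of 20 rows are illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Indexes larger than 10% of their table, largest ratio first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.schemaname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.indexrelname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(s.relid)) AS table_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(pg_relation_size(s.indexrelid)::numeric&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        / nullif(pg_relation_size(s.relid), 0), 2) AS index_to_table_ratio&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_indexes s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE pg_relation_size(s.indexrelid) &gt; pg_relation_size(s.relid) * 0.1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY index_to_table_ratio DESC NULLS LAST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Wide multi-column indexes can legitimately exceed this ratio, so treat the output as a list to inspect rather than a list to rebuild.&lt;/p&gt;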
&lt;p&gt;&lt;strong&gt;Option 3 — Create missing indexes for high-seq-scan tables&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When &lt;code&gt;pg_stat_user_tables&lt;/code&gt; shows a table with &lt;code&gt;seq_scan &gt;&gt; idx_scan&lt;/code&gt; and large &lt;code&gt;n_live_tup&lt;/code&gt;, identify the query pattern and create a covering index:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Always create concurrently in production&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; CONCURRENTLY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_status_created&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;processing&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- partial index if applicable&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify the index is used after creation&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;7 days&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 50&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A partial index (&lt;code&gt;WHERE status IN (...)&lt;/code&gt;) is smaller, faster to maintain, and more selective than a full index on the same column. Use it when the query always filters to a known subset.&lt;/p&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt;: reversible by recreating the index with &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;. Keep the original index DDL in a migration file before dropping so reconstruction is a single command. Note that recreation is not instant on large tables — budget time for it.&lt;/li&gt;
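&lt;li&gt;&lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt;: leaves the original index in place until the rebuild is complete, then swaps atomically. Safe to abort at any point: if aborted, the original index stays in its prior state, but the interrupted rebuild leaves behind an invalid transient index (named with a &lt;code&gt;_ccnew&lt;/code&gt; suffix) that should be dropped before retrying.&lt;/li&gt;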
&lt;li&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;: if the new index turns out to worsen plan choices, drop it with &lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt;. The planner will revert to its prior plan immediately.&lt;/li&gt;
&lt;li&gt;No rollback is needed for the read-only audit queries — they have no side effects.&lt;/li&gt;
&lt;/ol&gt;
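&lt;p&gt;Rollback step 1 depends on having the original DDL on hand. One way to capture it before the drop is &lt;code&gt;pg_get_indexdef&lt;/code&gt;; in this sketch &lt;code&gt;schema_name.unused_index_name&lt;/code&gt; is a placeholder and &lt;code&gt;recreate_sql&lt;/code&gt; is an illustrative alias:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Capture the index definition before dropping it (placeholder index name)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_get_indexdef(&apos;schema_name.unused_index_name&apos;::regclass) AS recreate_sql;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The returned statement is a plain &lt;code&gt;CREATE INDEX&lt;/code&gt;, so add &lt;code&gt;CONCURRENTLY&lt;/code&gt; by hand before saving it to the migration file.&lt;/p&gt;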
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Index audits are well-suited to a quarterly automated report. This query generates a prioritized removal candidate list:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Quarterly index debt report&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;DROP INDEX CONCURRENTLY &apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; schemaname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;||&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;.&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexrelname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;||&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;;&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; removal_sql,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; reclaimed,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  idx_scan,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_idx_scan&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_indexes s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_index i &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; i&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;idx_scan&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; i&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indisprimary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; i&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indisunique&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1024&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1024&lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- &gt; 10 MB only&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;indexrelid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;last_idx_scan&lt;/code&gt; (added in PostgreSQL 16) shows the timestamp of the last use, which is more precise than relying on &lt;code&gt;stats_reset&lt;/code&gt;. For earlier versions, &lt;code&gt;stats_reset&lt;/code&gt; from &lt;code&gt;pg_stat_database&lt;/code&gt; is the best proxy.&lt;/p&gt;
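&lt;p&gt;On PostgreSQL 16 and later, &lt;code&gt;last_idx_scan&lt;/code&gt; also allows a recency-based filter that catches indexes with a nonzero but stale scan count. This is a sketch; the 90-day window is an assumption chosen to match a quarterly cadence:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL 16+: indexes never scanned, or not scanned in the last 90 days&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.schemaname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.indexrelname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.idx_scan,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.last_idx_scan,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_indexes s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_index i ON i.indexrelid = s.indexrelid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE (s.last_idx_scan IS NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       OR now() - s.last_idx_scan &gt; interval &apos;90 days&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND NOT i.indisprimary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND NOT i.indisunique&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY pg_relation_size(s.indexrelid) DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;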
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The PostgreSQL documentation for &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; explicitly notes that &lt;code&gt;idx_scan&lt;/code&gt; is reset by &lt;code&gt;pg_stat_reset()&lt;/code&gt; and reflects cumulative counts since the last reset. This means that before acting on zero-scan counts, verifying the age of the statistics reset is not optional — it is required. The PostgreSQL wiki recommends a minimum of 2–4 weeks of production traffic before treating a zero scan count as evidence of permanent non-use.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt; does not take a single ACCESS EXCLUSIVE lock on the table the way a plain &lt;code&gt;DROP INDEX&lt;/code&gt; does. Per the PostgreSQL documentation, it instead waits until conflicting transactions have completed, so concurrent selects, inserts, updates, and deletes on the table are not locked out. That makes it safe to run on production tables under normal load, with the caveats that it cannot be executed inside a transaction block, accepts only one index name per command, and does not support &lt;code&gt;CASCADE&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Dropped index turns out to be needed&lt;/td&gt;&lt;td&gt;Statistics reset was recent; index was used before reset&lt;/td&gt;&lt;td&gt;Recreate with &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;; add to rollback script before next drop&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt; hangs&lt;/td&gt;&lt;td&gt;Long-running transaction holds a lock on the table&lt;/td&gt;&lt;td&gt;Wait for transaction to complete; monitor &lt;code&gt;pg_stat_activity&lt;/code&gt; for blockers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt; fails midway&lt;/td&gt;&lt;td&gt;Disk full during index rebuild&lt;/td&gt;&lt;td&gt;Free disk space, drop the leftover invalid &lt;code&gt;_ccnew&lt;/code&gt; index, then retry; the original index is unaffected by the failure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Duplicate index removal breaks constraint&lt;/td&gt;&lt;td&gt;Duplicate was actually a unique constraint enforced via index&lt;/td&gt;&lt;td&gt;Check &lt;code&gt;indisunique&lt;/code&gt; in &lt;code&gt;pg_index&lt;/code&gt; before dropping; never drop unique indexes without confirming the constraint is covered elsewhere&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;New covering index triggers plan regression&lt;/td&gt;&lt;td&gt;Planner prefers new index for a query it should not&lt;/td&gt;&lt;td&gt;Drop the new index and use &lt;code&gt;pg_hint_plan&lt;/code&gt; or partial index to constrain scope&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
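&lt;p&gt;For the constraint failure mode in the table, &lt;code&gt;indisunique&lt;/code&gt; is one signal; another is to ask &lt;code&gt;pg_constraint&lt;/code&gt; directly whether any constraint is backed by the index. A minimal sketch, with &lt;code&gt;schema_name.candidate_index_name&lt;/code&gt; as a placeholder:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Constraints that rely on this index; zero rows means none were found&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  c.conname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  c.contype,                          -- p = primary key, u = unique, x = exclusion, f = foreign key&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  c.conrelid::regclass AS table_name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_constraint c&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE c.conindid = &apos;schema_name.candidate_index_name&apos;::regclass;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any row returned here is a reason to leave the index alone, including foreign keys on other tables that reference it.&lt;/p&gt;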
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Unused and duplicate indexes consume write bandwidth on every insert, update, and delete, with no benefit — and invalid indexes waste space and maintenance work while never being selected by the planner.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Run the five audit queries on a schedule, confirm statistics age, and use &lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt; and &lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt; to clean up, always with &lt;code&gt;CONCURRENTLY&lt;/code&gt; so that writes are not blocked.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After removing a high-overhead unused index, bulk insert timing on the affected table should improve, and &lt;code&gt;pg_stat_bgwriter.buffers_clean&lt;/code&gt; (a cluster-wide cumulative counter) should grow more slowly under write-heavy load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run Check 1 and Check 5 this week. Rebuild any invalid indexes immediately with &lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt; (or drop them if they are no longer needed), and flag any zero-scan indexes over 1 GB for the next review cycle.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Check &lt;code&gt;pg_stat_database.stats_reset&lt;/code&gt; — confirm statistics are at least 30 days old before acting&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; for &lt;code&gt;idx_scan = 0&lt;/code&gt; — exclude primary keys and unique constraints&lt;/li&gt;
&lt;li&gt;Sort zero-scan indexes by &lt;code&gt;pg_relation_size&lt;/code&gt; — prioritize largest for removal&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_index&lt;/code&gt; for duplicate &lt;code&gt;indrelid + indkey&lt;/code&gt; combinations — identify redundant indexes&lt;/li&gt;
&lt;li&gt;For duplicates, keep the index with the higher &lt;code&gt;idx_scan&lt;/code&gt; count and drop the other&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_index WHERE NOT indisvalid&lt;/code&gt; — list all invalid indexes&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt; on all invalid indexes immediately&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;pg_stat_user_tables&lt;/code&gt; for tables with &lt;code&gt;seq_scan &gt;&gt; idx_scan&lt;/code&gt; and &lt;code&gt;n_live_tup &gt; 10000&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;For high-seq-scan tables, run &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; on frequent queries to identify missing indexes&lt;/li&gt;
&lt;li&gt;Create any missing indexes with &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Document all dropped indexes with their original DDL before removing&lt;/li&gt;
&lt;li&gt;Schedule the next index audit for 90 days out — add to the team runbook&lt;/li&gt;
&lt;/ol&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>Aurora Serverless v2: Good Fit, Bad Fit</title><link>https://rajivonai.com/blog/2024-03-11-aurora-serverless-v2-good-fit-bad-fit/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-11-aurora-serverless-v2-good-fit-bad-fit/</guid><description>Aurora Serverless v2 scales ACUs rather than to zero — understanding the cost floor, scale-up lag, and workload fit before you commit to it for production OLTP.</description><pubDate>Mon, 11 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Aurora Serverless v2 is not a zero-cost idle database. It does not scale to zero. The minimum ACU setting is a cost floor, not a free tier — and the seconds-long lag while capacity adds is invisible in load tests until it hits you at 9am on a Monday when traffic ramps faster than the scaler reacts. Picking the right workload for this product matters more than the configuration.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Aurora Serverless v2 replaced the original Aurora Serverless (v1) as AWS’s elastic capacity layer for Aurora MySQL and PostgreSQL. The core pitch is straightforward: instead of choosing an instance class and living with it, you set a minimum and maximum in Aurora Capacity Units (ACUs), and Aurora scales between them as your workload changes. One ACU is approximately 2 GiB of memory with proportional CPU.&lt;/p&gt;
&lt;p&gt;Engineers encounter Aurora Serverless v2 in two scenarios: they are building a new application and want to avoid instance sizing decisions, or they are running development and staging databases that sit idle most of the day. Both are valid entry points. The confusion arrives when teams read “serverless” and assume it behaves like Lambda — scaling to zero and costing nothing when unused. That is not how v2 works.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Aurora Serverless v2 does not scale to zero. Per &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html&quot;&gt;AWS Aurora Serverless v2 documentation&lt;/a&gt;, the minimum ACU setting is 0.5 ACU. A cluster sitting at 0.5 ACU is still running, still consuming storage, and still billing you for compute capacity — just at the floor. At 0.5 ACU the cluster is not responsive enough for most production workloads; it is a warm-standby state, not an off state.&lt;/p&gt;
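&lt;p&gt;The floor is easy to put in rough numbers. The sketch below is plain arithmetic: the 2 GiB per ACU figure is the approximation used in this article, while the 0.12 USD per ACU-hour rate and the 730-hour month are placeholder assumptions, not quoted AWS prices. Substitute the current rate for your Region:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Cost-floor arithmetic; the rate is a placeholder, not a quoted AWS price&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  min_acu,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  min_acu * 2 AS approx_memory_gib,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(min_acu * assumed_usd_per_acu_hour * 730, 2) AS floor_usd_per_month&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM (VALUES (0.5, 0.12), (2, 0.12), (8, 0.12))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AS t(min_acu, assumed_usd_per_acu_hour);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is the compute bill for a cluster that never leaves its minimum; storage and I/O are billed separately.&lt;/p&gt;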
&lt;p&gt;The second operational problem is scale-up latency. AWS documentation describes Aurora Serverless v2 scaling as happening in increments as fine as 0.5 ACU, and the scaling response is measured in seconds rather than the minutes v1 required. But “seconds” still means your application sees elevated latency during a rapid ramp. A workload that goes from idle to peak in under 30 seconds — a flash sale, a morning cron job flushing a large batch, a viral event — will encounter query latency spikes while ACUs catch up. That behavior does not show up in steady-state load tests.&lt;/p&gt;
&lt;p&gt;The core question becomes: Which production workloads can actually tolerate Aurora Serverless v2’s scaling latency and cost floor, and which should stay on provisioned instances?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;Aurora Serverless v2 and a provisioned Aurora instance solve different cost problems. The behavior that drives the difference is the scaling loop: Aurora continuously monitors resource pressure such as CPU and memory on each serverless instance, steps capacity up when demand exceeds what the current allocation can serve, and steps it back down when demand drops.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App[&quot;Application Workload&quot;] --&gt; Router[&quot;Aurora Query Router&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Router --&gt; Instance[&quot;Serverless v2 Instance&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Instance --&gt; Monitor[&quot;Capacity Monitor — CPU and Memory&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Monitor --&gt;|&quot;Demand Exceeds Threshold&quot;| ScaleUp[&quot;Step Up ACU Allocation&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Monitor --&gt;|&quot;Demand Drops&quot;| ScaleDown[&quot;Step Down ACU Allocation&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ScaleUp --&gt; Storage[&quot;Aurora Shared Cluster Volume&quot;]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ScaleDown --&gt; Storage&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The table below reflects the documented scaling behavior and AWS’s own guidance on workload suitability based on these architectural constraints.&lt;/p&gt;















































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Workload type&lt;/th&gt;&lt;th&gt;Serverless v2 fit&lt;/th&gt;&lt;th&gt;Provisioned fit&lt;/th&gt;&lt;th&gt;Reason&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Development and staging databases&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;td&gt;Acceptable&lt;/td&gt;&lt;td&gt;Usage is variable; v2 saves money vs always-on provisioned at dev scale&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Unpredictable traffic spikes — e-commerce, events&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;td&gt;Acceptable&lt;/td&gt;&lt;td&gt;v2 scales up to handle bursts; burst lag is usually tolerable if gradual&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multi-tenant SaaS — many low-utilization tenant DBs&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;td&gt;Poor&lt;/td&gt;&lt;td&gt;Per-tenant provisioned capacity wastes money; v2 consolidates cost&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Steady high-throughput OLTP — payment rails, order processing&lt;/td&gt;&lt;td&gt;Poor&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;td&gt;Provisioned is cheaper at consistent high utilization; no scale-lag risk&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Latency-sensitive workloads with P99 budget under 100ms&lt;/td&gt;&lt;td&gt;Poor&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;td&gt;Scale-up pause exceeds latency budget during capacity adds&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Workloads that regularly hit the ACU maximum&lt;/td&gt;&lt;td&gt;Poor&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;td&gt;You are paying provisioned-equivalent prices with serverless overhead&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The rows rated “Poor” for Serverless v2 share a single failure mode in different clothing: the workload’s demand profile does not benefit from dynamic scaling, but you pay the cost and latency overhead of dynamic scaling anyway.&lt;/p&gt;
&lt;p&gt;Unlike Aurora Serverless v1, v2 supports Multi-AZ deployments, Global Database, and read replicas. For teams that rejected v1 because of those feature gaps, v2 is worth re-evaluating — the operational parity with provisioned Aurora is close. Aurora Global Database architecture details, including how the storage-level replication layer works beneath both provisioned and serverless configurations, are covered in &lt;a href=&quot;https://rajivonai.com/blog/2024-02-19-aurora-global-database-what-it-solves-and-does-not/&quot;&gt;Aurora Global Database: What It Solves and What It Does Not&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The AWS documentation makes the cost model explicit: Aurora Serverless v2 bills per ACU-hour for the capacity consumed, with a floor at whatever minimum ACU you configure. A cluster set to a minimum of 0.5 ACU and a maximum of 16 ACU will never bill less than 0.5 ACU-hours per hour, even at 3am with zero connections. Because the minimum is a running floor rather than an off state, overnight idle time still incurs compute charges, unlike a stopped RDS instance.&lt;/p&gt;
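&lt;p&gt;The floor is easy to put a number on. The sketch below multiplies a minimum ACU setting by an hourly rate and the hours in a month. The rate of 0.12 USD per ACU-hour and the 730-hour month are assumptions for illustration, not quoted AWS prices; substitute the current rate for your region.&lt;/p&gt;

```python
HOURS_PER_MONTH = 730  # ASSUMPTION: average hours in a month, a common billing approximation

# ASSUMPTION: placeholder rate in USD per ACU-hour; check current pricing for your region.
ASSUMED_PRICE_PER_ACU_HOUR = 0.12


def monthly_floor_cost(min_acu: float, price_per_acu_hour: float) -> float:
    """Lowest possible monthly compute charge for a cluster that never scales up."""
    return min_acu * price_per_acu_hour * HOURS_PER_MONTH


for min_acu in (0.5, 2.0, 8.0):
    cost = monthly_floor_cost(min_acu, ASSUMED_PRICE_PER_ACU_HOUR)
    print(f"min {min_acu:>4} ACU -> at least {cost:.2f} USD/month in compute")
```

&lt;p&gt;Storage and I/O charges come on top of this figure. The point of the calculation is only that the result is never zero while the cluster is running.&lt;/p&gt;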
&lt;p&gt;The scaling increment behavior — as small as 0.5 ACU per step — is explicitly described in &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2-setting-acus.html&quot;&gt;AWS Aurora Serverless v2 capacity documentation&lt;/a&gt;. The architectural consequence is that a cluster at minimum ACU receiving a sudden large query load will step up through multiple increments before reaching steady-state capacity, and each step takes a moment. Writer and reader instances scale independently, which matters for read-heavy workloads using read replicas — adding read capacity does not help a CPU-bound writer.&lt;/p&gt;
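&lt;p&gt;A rough way to reason about the ramp is to count how many minimum-size steps separate the floor from the capacity a burst needs. The sketch below is a worst-case model only: it assumes every step is the smallest 0.5 ACU increment, and &lt;code&gt;seconds_per_step&lt;/code&gt; is a hypothetical placeholder, not a documented figure. Real scaling can take larger steps, so treat the output as an upper bound under these assumptions.&lt;/p&gt;

```python
import math


def worst_case_ramp(start_acu: float, target_acu: float,
                    increment: float = 0.5, seconds_per_step: float = 1.0):
    """Return (steps, seconds) to climb from start to target capacity.

    ASSUMPTION: every step is the minimum increment and takes a fixed,
    hypothetical amount of time. Real scaling may take larger steps.
    """
    if target_acu > start_acu:
        steps = math.ceil((target_acu - start_acu) / increment)
    else:
        steps = 0  # already at or above the target, nothing to climb
    return steps, steps * seconds_per_step


# Same burst target, two different minimum ACU settings.
print(worst_case_ramp(0.5, 16.0))
print(worst_case_ramp(8.0, 16.0))
```

&lt;p&gt;Raising the minimum from 0.5 to 8 ACU roughly halves the number of steps in this model, which is the intuition behind sizing the floor for the morning ramp instead of for the cheapest idle state.&lt;/p&gt;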
&lt;p&gt;The documented pattern from AWS is that development environments and low-traffic production workloads see meaningful savings from v2 over always-on provisioned instances. Conversely, workloads with consistent high utilization do not see these savings and incur the scale-up latency penalty unnecessarily.&lt;/p&gt;
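&lt;p&gt;One way to tell which side of that line a cluster falls on is to compare its average ACU consumption with the configured maximum. The sketch below uses the 80% rule of thumb from this article’s action list. The sample values are hypothetical; in practice they would come from the CloudWatch &lt;code&gt;ServerlessDatabaseCapacity&lt;/code&gt; metric.&lt;/p&gt;

```python
from statistics import mean


def fit_check(acu_samples, max_acu, threshold=0.80):
    """Compare average ACU consumption to the configured maximum."""
    ratio = mean(acu_samples) / max_acu
    if ratio > threshold:
        verdict = "consider provisioned"
    else:
        verdict = "serverless v2 looks reasonable"
    return round(ratio, 3), verdict


# ASSUMPTION: hypothetical hourly averages, not real measurements.
busy_cluster = [14.0, 15.5, 16.0, 15.0, 14.5, 16.0]
quiet_cluster = [0.5, 0.5, 2.0, 4.0, 1.0, 0.5]

print(fit_check(busy_cluster, max_acu=16))
print(fit_check(quiet_cluster, max_acu=16))
```

&lt;p&gt;A single average hides the shape of the day, so it is worth looking at the peak and the time spent at the maximum as well before moving a workload.&lt;/p&gt;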
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Sudden traffic burst from a low ACU floor&lt;/td&gt;&lt;td&gt;Query latency spikes for seconds to tens of seconds&lt;/td&gt;&lt;td&gt;ACU scaling is fast but not instant; gap between demand arrival and capacity availability causes queuing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Minimum ACU misread as zero-cost idle&lt;/td&gt;&lt;td&gt;Surprise monthly bill for compute on a database with no traffic&lt;/td&gt;&lt;td&gt;0.5 ACU minimum is always running; “idle” is not “off”&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Maximum ACU cap during sustained high load&lt;/td&gt;&lt;td&gt;Connections queue or queries fail when ACU ceiling is hit&lt;/td&gt;&lt;td&gt;v2 does not exceed the maximum you set; a too-low ceiling behaves like an undersized provisioned instance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High-utilization steady OLTP workload&lt;/td&gt;&lt;td&gt;v2 cost exceeds provisioned equivalent&lt;/td&gt;&lt;td&gt;At constant high utilization, provisioned instance pricing is cheaper and eliminates scale-up lag risk&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: A team selects Aurora Serverless v2 for production OLTP expecting elastic cost savings, sets a low minimum ACU to reduce idle cost, and discovers latency spikes every morning when traffic ramps faster than ACUs add.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Match the ACU minimum to the lowest acceptable sustained capacity for your P99 latency target, not to the cheapest idle state; use provisioned Aurora for workloads with consistent high utilization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Set minimum ACU at least to the capacity needed to handle your initial morning ramp without queuing — then observe scale-up events in CloudWatch Aurora metrics (the &lt;code&gt;ServerlessDatabaseCapacity&lt;/code&gt; metric shows ACU consumption in real time) and verify latency does not spike during ramp-up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Pull one week of CloudWatch &lt;code&gt;ServerlessDatabaseCapacity&lt;/code&gt; metrics for any existing Aurora Serverless v2 cluster and compare average ACU consumption to your configured maximum; if average is consistently above 80% of maximum, the workload belongs on provisioned.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>Vector Search on GPU Databases</title><link>https://rajivonai.com/blog/2024-03-06-vector-search-on-gpu-databases/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-06-vector-search-on-gpu-databases/</guid><description>A DBA-friendly explanation of how vector search works, why GPUs help, and where vector retrieval fits inside modern database and AI systems.</description><pubDate>Wed, 06 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Vector search sounds mysterious until you map it to familiar database concepts.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Retrieval systems are shifting from pure lexical matching to meaning-based retrieval. Developers are generating high-dimensional embeddings—numerical representations of meaning—for documents, chat logs, and product catalogs to enable semantic search. Traditional databases have bolted on vector data types to support this new access pattern. In DBA language, embeddings place content into coordinates in a high-dimensional space so semantically related items are close, even when the exact text differs.&lt;/p&gt;
&lt;p&gt;Traditional indexes optimize exact or ordered lookups. Embeddings optimize semantic proximity. Production systems now regularly combine metadata filters, keyword retrieval, and vector similarity retrieval into a single serving path.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Traditional indexing strategies break down when the core query requirement shifts from equality to similarity. Instead of exact match queries like:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; products&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; category &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;laptop&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;vector retrieval executes:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;query vector -&gt; nearest stored vectors&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This requires comparing a query vector against millions of stored vectors to find the nearest neighbors. At scale, that means repeated arithmetic over large arrays—such as dot products, cosine similarity, or Euclidean distance. Exact vector search compares against all candidates, which is accurate but computationally costly. When the vector corpus is large and queries per second (QPS) are meaningful, CPU-based execution bottlenecks on candidate scoring. How do you maintain strict latency targets when distance calculations dominate the runtime?&lt;/p&gt;
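&lt;p&gt;The exact search described above can be sketched in a few lines of plain Python. This is a minimal illustration and not how any particular database implements it: the corpus, document names, and three-dimensional vectors are made up, and real embeddings have hundreds or thousands of dimensions.&lt;/p&gt;

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Score every stored vector, so the cost grows linearly with corpus size."""
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy corpus.
corpus = {
    "laptop-review": [0.9, 0.1, 0.0],
    "notebook-pc": [0.8, 0.2, 0.1],
    "banana-bread": [0.0, 0.1, 0.9],
}

for score, doc_id in exact_top_k([1.0, 0.0, 0.0], corpus, k=2):
    print(f"{doc_id}: {score:.3f}")
```

&lt;p&gt;Every query touches every stored vector and every dimension. That inner loop of multiply-and-add is the work that dominates runtime at scale.&lt;/p&gt;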
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;Vector search is nearest-neighbor retrieval over high-dimensional coordinates, and GPU databases accelerate the specific mathematical bottlenecks of this workload.&lt;/p&gt;
&lt;p&gt;Approximate Nearest Neighbor (ANN) indexes reduce the search space to hit practical latency targets. ANN narrows candidate sets quickly, and then GPU acceleration scores and ranks these large candidate sets efficiently. This combination is why vector search and GPU databases are frequently paired.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Client Query] --&gt; B[Embedding Model]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C[Query Vector]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D[Database Engine]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E[Metadata Filter]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; F[ANN Index Search]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G[Candidate Set Fetch]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; H[GPU Scoring Engine]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt; I[Top K Reranked Results]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
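&lt;p&gt;The ANN step in the flow above can be illustrated with an inverted-file (IVF) style partition: vectors are grouped under their nearest centroid, and a query scores only the groups closest to it. This is a simplified sketch. The centroids are hard-coded instead of learned with k-means, and the function names and &lt;code&gt;nprobe&lt;/code&gt; parameter are chosen for illustration, not taken from a specific library.&lt;/p&gt;

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (one inverted list per centroid)."""
    lists = {i: [] for i in range(len(centroids))}
    for doc_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        lists[nearest].append(doc_id)
    return lists


def ann_search(query, vectors, centroids, lists, k, nprobe=1):
    """Score only the vectors in the nprobe closest lists, not the whole corpus."""
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [doc_id for i in order[:nprobe] for doc_id in lists[i]]
    ranked = sorted(candidates, key=lambda d: dist(query, vectors[d]))
    return ranked[:k], len(candidates)


# Hypothetical data: two obvious clusters and hand-picked centroids.
vectors = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [5.0, 5.1], "d": [5.2, 4.9]}
centroids = [[0.0, 0.0], [5.0, 5.0]]
lists = build_ivf(vectors, centroids)

top, scored = ann_search([0.1, 0.0], vectors, centroids, lists, k=1)
print(top, "after scoring", scored, "of", len(vectors), "vectors")
```

&lt;p&gt;The trade-off is recall: a true neighbor that sits in a list the query did not probe is never scored. Raising &lt;code&gt;nprobe&lt;/code&gt; recovers recall at the cost of a larger candidate set, which is where parallel scoring starts to matter.&lt;/p&gt;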
&lt;p&gt;For a DBA, this is not a different universe; it is a new retrieval access pattern with familiar system tradeoffs:&lt;/p&gt;





























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Traditional DB Concept&lt;/th&gt;&lt;th&gt;Vector Search Equivalent&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Row&lt;/td&gt;&lt;td&gt;Content item — chunk&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Indexed column&lt;/td&gt;&lt;td&gt;Embedding vector&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Equality predicate&lt;/td&gt;&lt;td&gt;Similarity function&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Top-N query&lt;/td&gt;&lt;td&gt;Top-K nearest neighbors&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Post-filtering&lt;/td&gt;&lt;td&gt;Metadata filtering and reranking&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Production retrieval usually combines metadata filters (tenant, region, ACL scope, content type, time window) with semantic search. This is why databases still matter deeply in AI retrieval systems: governance, filtering, structure, and access control do not disappear.&lt;/p&gt;
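&lt;p&gt;A small sketch shows why the filter matters as much as the similarity score. The rows, tenant names, and two-dimensional embeddings below are hypothetical, and the dot product stands in for whatever similarity function the engine uses.&lt;/p&gt;

```python
def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vec, rows, tenant, k):
    """Apply the metadata predicate first, then rank the survivors by similarity."""
    allowed = [row for row in rows if row["tenant"] == tenant]
    ranked = sorted(allowed, key=lambda row: dot(query_vec, row["embedding"]), reverse=True)
    return [row["id"] for row in ranked[:k]]


# Hypothetical rows: id 3 is the closest match but belongs to another tenant.
rows = [
    {"id": 1, "tenant": "acme", "embedding": [0.9, 0.1]},
    {"id": 2, "tenant": "acme", "embedding": [0.2, 0.8]},
    {"id": 3, "tenant": "globex", "embedding": [1.0, 0.0]},
]

print(filtered_search([1.0, 0.0], rows, tenant="acme", k=2))
```

&lt;p&gt;The nearest vector overall is excluded because it fails the tenant predicate. A ranking that ignored the filter would return the most similar row and leak data across tenants.&lt;/p&gt;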
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern is that CPU-based databases struggle under high QPS when computing exact distances over high-dimensional vectors. PostgreSQL with &lt;code&gt;pgvector&lt;/code&gt; performs well with HNSW (Hierarchical Navigable Small World) indexes for moderate workloads, but ranking the final candidate set still requires a distance calculation for every candidate.&lt;/p&gt;
&lt;p&gt;NVIDIA’s RAPIDS RAFT library shows how GPUs handle these operations. The SIMT (Single Instruction, Multiple Threads) architecture of a GPU is well suited to repeated vector arithmetic over large arrays. By offloading candidate scoring and reranking to GPUs, systems like Milvus (using GPU-accelerated indexes like IVF-PQ) can evaluate larger candidate sets without missing latency targets. The GPU runs the same distance arithmetic across many candidates in parallel, allowing the system to scale throughput without degrading response times.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;GPU acceleration introduces setup complexity and is not a universal solution. It is a specific tool for candidate scoring bottlenecks.&lt;/p&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dimension&lt;/th&gt;&lt;th&gt;CPU Vector Search&lt;/th&gt;&lt;th&gt;GPU Vector Search&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Setup complexity&lt;/td&gt;&lt;td&gt;Lower&lt;/td&gt;&lt;td&gt;Higher&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Small datasets&lt;/td&gt;&lt;td&gt;Usually fine&lt;/td&gt;&lt;td&gt;Often overkill&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Large candidate scoring&lt;/td&gt;&lt;td&gt;Can bottleneck&lt;/td&gt;&lt;td&gt;Strong fit&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Throughput&lt;/td&gt;&lt;td&gt;Moderate&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Latency under load&lt;/td&gt;&lt;td&gt;Degrades sooner&lt;/td&gt;&lt;td&gt;Stronger at scale&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Best fit&lt;/td&gt;&lt;td&gt;Smaller and simpler workloads&lt;/td&gt;&lt;td&gt;Large-scale retrieval and ranking&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;CPU-only architectures are often sufficient when the corpus is small, QPS is low, latency constraints are loose, or retrieval runs as an offline batch process. GPU acceleration is worth serious consideration when candidate scoring dominates runtime, retrieval is user-facing, or reranking and inference exist in the same serving path.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: CPU candidate scoring bottlenecks high-throughput semantic search when exact distance calculations scale linearly with candidate size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Offload candidate scoring and vector similarity math to GPU execution to process large arrays in parallel.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Database implementations leveraging NVIDIA RAFT or GPU-accelerated Milvus indexes demonstrate high throughput scaling for dense vector workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Profile your vector search workloads to determine if distance arithmetic is the primary bottleneck before adopting GPU instances.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>gpu</category><category>vector-search</category><category>retrieval</category></item><item><title>How a 10 Billion Row SQL Query Runs in 200ms on a GPU Database</title><link>https://rajivonai.com/blog/2024-03-05-how-a-10-billion-row-sql-query-runs-in-200ms-on-a-gpu-database/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-05-how-a-10-billion-row-sql-query-runs-in-200ms-on-a-gpu-database/</guid><description>A DBA-friendly walkthrough of how modern GPU databases execute large analytical SQL queries using columnar storage, parallel scans, and GPU aggregation.</description><pubDate>Tue, 05 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The same SQL that takes 60 seconds on a CPU database runs in 200ms on a GPU database — and the reason is not that GPUs are faster processors, it is that the execution model changes what happens between query plan and result.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Every database engineer has seen a query that looks harmless in code review and painful in production:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; country, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(revenue)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; events&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; country;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At 10,000 rows, nobody cares. At 10 billion rows, this becomes a serious execution problem. CPU-based execution engines process this query through a bounded number of threads, each handling a sequential slice of the data. The query is I/O-intensive and compute-intensive, but the CPU serializes its work in ways that GPU execution does not.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The structural gap is parallelism. A CPU-based database runs this query with dozens to hundreds of parallel workers. A GPU-based engine runs it with thousands to tens of thousands of parallel threads, each processing a slice of columnar data simultaneously. The difference in wall time is not incremental — it is a category change for the right workload shape.&lt;/p&gt;
&lt;p&gt;The engineering question is not “why is this fast?” but rather “which queries change category, and which don’t?” Getting this wrong leads to GPU infrastructure that produces no benefit for the actual hot paths, because the bottleneck is I/O or coordination, not compute throughput.&lt;/p&gt;
&lt;h2 id=&quot;step-by-step-how-the-query-executes&quot;&gt;Step-by-Step: How the Query Executes&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://rajivonai.com/diagrams/gpu-database-execution/10b_row_query_gpu_timeline.svg&quot; alt=&quot;10B row GPU query timeline&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: CPU plans the query&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The request starts as a normal SQL path: parse SQL, resolve objects, build logical plan, choose physical plan. CPU remains the control plane for planning, scheduling, and orchestration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Engine isolates the heavy path&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The planner identifies operators suitable for acceleration. In most systems, this is hybrid execution — CPU keeps control-flow-heavy tasks, GPU takes scan/compute-heavy operators. The right model is not “GPU-only database” but “GPU-accelerated execution.”&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Columnar data minimizes work&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For this query, the engine only needs &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;revenue&lt;/code&gt;. Columnar layouts avoid moving irrelevant columns and align better with parallel arithmetic over dense vectors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4: GPU fan-out across threads&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The heavy scan/compute path is fanned out across many threads:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Thread 1     -&gt; rows 1-1M&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Thread 2     -&gt; rows 1M-2M&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Thread 3     -&gt; rows 2M-3M&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;...&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Thread 10000 -&gt; rows 9.999B-10B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each thread performs repeated, regular work over a slice of data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 5: Partial aggregation and reduction&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Each worker builds partial aggregates, then the engine reduces them into final grouped totals. This is familiar database behavior, but at much higher degrees of parallelism.&lt;/p&gt;
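&lt;p&gt;The partial-aggregate-then-reduce shape can be sketched in plain Python. This runs sequentially on a CPU, so it models only the dataflow and not GPU execution; the column values and the three-worker split are made up for illustration.&lt;/p&gt;

```python
from collections import Counter


def partial_aggregate(countries, revenues, start, stop):
    """One worker: sum revenue per country over its slice of the two columns."""
    partial = Counter()
    for i in range(start, stop):
        partial[countries[i]] += revenues[i]
    return partial


def reduce_partials(partials):
    """Merge per-worker partial sums into final grouped totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


# Hypothetical columnar data: two parallel arrays, one per column.
countries = ["US", "DE", "US", "IN", "DE", "US"]
revenues = [10, 5, 7, 3, 2, 1]

num_workers = 3
chunk = len(countries) // num_workers  # assumes the row count divides evenly
partials = [
    partial_aggregate(countries, revenues, w * chunk, (w + 1) * chunk)
    for w in range(num_workers)
]
print(reduce_partials(partials))
```

&lt;p&gt;Each worker touches only its own slice and its own small hash table, so no coordination is needed until the merge. That independence is what lets the same plan spread across thousands of threads.&lt;/p&gt;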
&lt;p&gt;&lt;strong&gt;Step 6: Finalize on CPU&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After heavy compute, final result shaping and response serialization return through CPU-side control flow.&lt;/p&gt;
&lt;p&gt;The complete flow:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SQL query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-&gt; CPU planner&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-&gt; column selection&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-&gt; GPU scan + compute&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-&gt; GPU partial aggregates&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-&gt; GPU reduction&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-&gt; CPU final return&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Stage ownership summary&lt;/strong&gt;&lt;/p&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Stage&lt;/th&gt;&lt;th&gt;CPU-centric path&lt;/th&gt;&lt;th&gt;GPU-accelerated path&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Parse + optimize&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Column selection&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Large scan&lt;/td&gt;&lt;td&gt;CPU workers&lt;/td&gt;&lt;td&gt;GPU threads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Partial aggregation&lt;/td&gt;&lt;td&gt;CPU workers&lt;/td&gt;&lt;td&gt;GPU threads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Reduction&lt;/td&gt;&lt;td&gt;CPU merge&lt;/td&gt;&lt;td&gt;GPU reduction + CPU finalize&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Result shaping&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;img src=&quot;https://rajivonai.com/diagrams/gpu-database-execution/inside_gpu_database_engine.svg&quot; alt=&quot;Inside a GPU database engine&quot;&gt;&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;NVIDIA RAPIDS cuDF documents the execution pattern for DataFrame aggregations: the GPU receives a columnar memory representation, applies the projection and filter kernels in parallel across all rows, builds partial hash aggregates per thread block, then reduces across blocks. The documented behavior is that this execution model is fastest when the working set fits in GPU VRAM. When the query exceeds VRAM capacity, data spills to system RAM over NVLink or PCIe, and the bandwidth of that interconnect becomes the new bottleneck.&lt;/p&gt;
&lt;p&gt;Academic work on GPU-accelerated query processing (e.g., &lt;a href=&quot;https://dl.acm.org/doi/10.14778/1453856.1453915&quot;&gt;He et al., VLDB 2008&lt;/a&gt;) established the baseline behavior: scan-heavy queries that read most of a table see the largest speedups because the GPU’s memory bandwidth advantage over CPU memory bandwidth is largest for sequential reads. Selective point lookups see no benefit because GPU thread management overhead dominates the per-row compute time.&lt;/p&gt;
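&lt;p&gt;A back-of-envelope calculation shows why residency in GPU memory matters so much for the example query. Every number below is an assumption chosen for round arithmetic: 12 bytes per row for the two columns, roughly 2,000 GB/s for on-device memory, and roughly 32 GB/s for a PCIe 4.0 x16 link. These are order-of-magnitude figures, not measurements from any specific system.&lt;/p&gt;

```python
ROWS = 10_000_000_000
BYTES_PER_ROW = 4 + 8  # ASSUMPTION: 4-byte encoded country + 8-byte revenue

# ASSUMPTION: round, order-of-magnitude bandwidth figures in GB/s.
GPU_MEMORY_GBPS = 2000  # on-device memory of a recent data-center GPU
PCIE_GBPS = 32          # PCIe 4.0 x16, one direction


def scan_seconds(total_bytes, gigabytes_per_second):
    """Time to stream total_bytes once at the given bandwidth."""
    return total_bytes / (gigabytes_per_second * 1e9)


working_set = ROWS * BYTES_PER_ROW
print(f"working set:     {working_set / 1e9:.0f} GB")
print(f"from GPU memory: {scan_seconds(working_set, GPU_MEMORY_GBPS):.2f} s")
print(f"over PCIe:       {scan_seconds(working_set, PCIE_GBPS):.2f} s")
```

&lt;p&gt;Under these assumptions the same scan takes tens of milliseconds from device memory and several seconds across the interconnect. The two columns also add up to more than many single GPUs hold, so a sub-second result would depend on compression, several devices, or both.&lt;/p&gt;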
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Query workload is OLTP&lt;/td&gt;&lt;td&gt;No speedup, higher latency&lt;/td&gt;&lt;td&gt;GPU kernel overhead is larger than the compute savings for small, indexed lookups&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Working set exceeds GPU VRAM&lt;/td&gt;&lt;td&gt;Speedup collapses to CPU-level or slower&lt;/td&gt;&lt;td&gt;PCIe/NVLink transfer becomes the bottleneck; GPU’s internal bandwidth advantage disappears&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Query is I/O-bound, not compute-bound&lt;/td&gt;&lt;td&gt;Adding GPU does not help&lt;/td&gt;&lt;td&gt;The storage read is the bottleneck; GPU sits idle waiting for data&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Write-heavy workload&lt;/td&gt;&lt;td&gt;Incorrect fit&lt;/td&gt;&lt;td&gt;Transactional writes require coordination machinery that GPUs do not accelerate&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Irregular or sparse data access&lt;/td&gt;&lt;td&gt;Lower GPU utilization&lt;/td&gt;&lt;td&gt;Branching access patterns lead to thread divergence, reducing GPU parallelism efficiency&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: At 10B row scale, CPU-based analytical engines hit a parallelism ceiling that cannot be solved by adding CPU cores — the bottleneck is the number of simultaneous arithmetic operations, not the sophistication of the logic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Move scan-heavy, aggregate-heavy SQL workloads to a GPU-accelerated execution engine; verify the query is compute-bound (not I/O-bound) before attributing speedup to GPU offload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on the target query and confirm the majority of time is in scan, aggregate, or join operators (not in network or storage I/O), then benchmark on a GPU-enabled instance with the same query and data volume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Identify your three slowest analytical queries this week and profile whether the bottleneck is CPU compute, memory bandwidth, or storage I/O. Compute-bound and memory-bandwidth-bound scans and aggregations are the GPU-offload candidates; storage-I/O-bound queries are not.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>ai-engineering</category></item><item><title>Why Databases Are Moving Toward GPU Execution Engines</title><link>https://rajivonai.com/blog/2024-03-04-why-databases-are-moving-toward-gpu-execution-engines/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-04-why-databases-are-moving-toward-gpu-execution-engines/</guid><description>A practical, DBA-friendly explanation of why modern analytical databases are increasingly using GPUs for scans, joins, aggregations, and AI-adjacent workloads.</description><pubDate>Mon, 04 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The CPU-centric query engine is not being replaced — it is being augmented, and the teams who are not planning for that shift are about to face a capacity ceiling on their analytical workloads.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Database engines were designed around one default assumption: the CPU is the center of query execution. That was the right design for an era dominated by OLTP, indexed lookups, branch-heavy logic, and transaction coordination. Workload shape has changed. Modern platforms increasingly need to support large analytical scans, interactive dashboards, join-heavy columnar queries, vector search and retrieval, and AI-adjacent ranking and reranking. CPU-only systems are being asked to handle execution patterns they were not optimized for.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The operational symptom is predictable: a query that looked fine at 10 million rows becomes a sustained 60-second runtime at 10 billion rows, and adding more CPU capacity produces diminishing returns. The underlying problem is structural. A CPU core works through a small number of instruction streams at a time, so even well-parallelized CPU queries are constrained by core and thread count, cache pressure, and branch misprediction overhead. The expensive paths in modern analytical workloads (scan, filter, join, aggregate) are massively data-parallel operations, not coordination-heavy operations. CPUs are excellent at coordination. They are less efficient at executing the same arithmetic operation across a billion rows.&lt;/p&gt;
&lt;p&gt;The core question for operators: when does a GPU-accelerated execution engine deliver a speedup that adding more CPU capacity cannot?&lt;/p&gt;
&lt;h2 id=&quot;gpu-accelerated-database-architecture&quot;&gt;GPU-Accelerated Database Architecture&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;CPU-only&lt;/th&gt;&lt;th&gt;GPU-augmented&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Planning and coordination&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Heavy analytical execution&lt;/td&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;td&gt;CPU + GPU&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AI retrieval and vector serving&lt;/td&gt;&lt;td&gt;External stack&lt;/td&gt;&lt;td&gt;Integrated into the data platform&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The shift is not CPU replaced by GPU. The shift is: &lt;strong&gt;CPU for control, GPU for throughput.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://rajivonai.com/diagrams/gpu-database-execution/inside_gpu_database_engine.svg&quot; alt=&quot;Inside a GPU database engine&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What problem GPUs solve&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A lot of analytical SQL reduces to this execution shape:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SCAN -&gt; FILTER -&gt; PROJECT -&gt; JOIN -&gt; AGGREGATE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Take:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; country, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(revenue)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; events&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; country;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At billion-row scale, this is a throughput problem. The engine repeatedly does similar work — read values, compare values, transform values, aggregate partial results — over large datasets. That repeated, data-parallel pattern maps well to GPU execution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why columnar storage enabled the shift&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GPU execution fits far better with columnar data than row-heavy transactional layouts. If a query only needs &lt;code&gt;price&lt;/code&gt; and &lt;code&gt;quantity&lt;/code&gt;, a columnar engine can feed only those vectors into execution. That aligns with GPU-friendly flow:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;vector in -&gt; vector transform -&gt; vector reduce&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The industry trend followed a rough progression: columnar storage and compression → vectorized execution → GPU-aware operator offload.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why AI is accelerating adoption&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;AI-oriented data systems increasingly require embeddings, nearest-neighbor retrieval, reranking, vector similarity, and inference near data. Those are not classic OLTP operations. They align with accelerator-friendly execution patterns, making GPU-capable systems easier to justify for combined analytical + AI workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Architecture evaluation checklist&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What dominates the hot path: transactions, scans, joins, vector math, or ranking?&lt;/li&gt;
&lt;li&gt;Is the data layout GPU-friendly: columnar, batched, predictable access?&lt;/li&gt;
&lt;li&gt;Is the workload large enough to amortize offload overhead?&lt;/li&gt;
&lt;li&gt;Is the bottleneck compute, or actually data movement, modeling, or partitioning?&lt;/li&gt;
&lt;/ol&gt;
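&lt;p&gt;A minimal sketch of how checklist items 1 and 4 might be checked on PostgreSQL, reusing the example query above. It assumes a hypothetical &lt;code&gt;events&lt;/code&gt; table with &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;revenue&lt;/code&gt; columns; other engines expose similar plan output under different syntax.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table: events(country, revenue), as in the example query.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- track_io_timing adds I/O wait time to the plan; changing it may need elevated privileges.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET track_io_timing = on;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT country, SUM(revenue)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM events&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY country;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Reading the output:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   mostly &quot;shared hit&quot; buffers, little I/O time -&gt; likely compute-bound&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   mostly &quot;read&quot; buffers, high I/O time         -&gt; likely storage-bound; a GPU will not help&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the statement twice helps separate cold-cache behavior from warm-cache behavior before drawing a conclusion.&lt;/p&gt;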
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;NVIDIA’s RAPIDS cuDF, a GPU DataFrame library rather than a full database engine, illustrates the design split: the GPU handles columnar data operations while the CPU-side host program handles orchestration, result finalization, and control flow. The documented limitation is PCIe transfer overhead: data movement between CPU memory and GPU memory is the dominant latency cost for small-to-medium datasets. RAPIDS’ own documentation recommends GPU offload only when the working set is large enough that the transfer overhead is amortized across the computation.&lt;/p&gt;
&lt;p&gt;PostgreSQL extensions for GPU offload, such as PG-Strom (documented at heterodb.com), follow the same documented hybrid pattern: the PostgreSQL planner runs on CPU, while scan-heavy and join-heavy operators are offloaded to the GPU. PG-Strom’s documented design states that only operators with high arithmetic intensity are candidates for GPU offload — point lookups and index scans remain on CPU.&lt;/p&gt;
&lt;p&gt;DuckDB’s documented vectorized execution (CPU-based, not GPU) is a useful reference point for the floor: a CPU-based columnar engine can execute analytical queries at speeds that were GPU-exclusive five years ago, which means the decision to add GPU hardware requires a workload that exceeds what modern in-process columnar execution can handle.&lt;/p&gt;
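&lt;p&gt;One way to establish that CPU floor before pricing GPU hardware is to run the same aggregate in DuckDB. This is a sketch only: the Parquet path and the thread count are placeholders, not values taken from any benchmark.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Placeholder path and thread count; adjust both to your environment.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET threads TO 16;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN ANALYZE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT country, SUM(revenue)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM read_parquet(&apos;events/*.parquet&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY country;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the CPU baseline already meets the latency target, the GPU evaluation can stop there.&lt;/p&gt;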
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;GPU for small indexed lookups&lt;/td&gt;&lt;td&gt;No throughput gain, higher latency&lt;/td&gt;&lt;td&gt;GPU kernel launch overhead exceeds the per-request compute time&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPU for write-heavy OLTP&lt;/td&gt;&lt;td&gt;Incorrect fit — no benefit&lt;/td&gt;&lt;td&gt;Transactional writes are coordination-bound, not compute-bound&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPU for branch-heavy procedural logic&lt;/td&gt;&lt;td&gt;Falls back to CPU or performs worse&lt;/td&gt;&lt;td&gt;Divergent execution paths across GPU threads reduce parallelism&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPU without columnar storage&lt;/td&gt;&lt;td&gt;Poor data locality and excess data movement&lt;/td&gt;&lt;td&gt;Row-oriented layouts require reading irrelevant columns into GPU memory&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Adding GPU without profiling the hot path&lt;/td&gt;&lt;td&gt;Wasted infrastructure spend&lt;/td&gt;&lt;td&gt;GPU acceleration only moves the needle when compute, not I/O or coordination, is the bottleneck&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: CPU-only analytical engines hit a scalability ceiling on scan-heavy, aggregate-heavy workloads — and that ceiling arrives earlier as AI retrieval and vector search enter the data platform.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Classify hot paths by execution pattern first; move scan-heavy, arithmetic-heavy workloads to GPU-accelerated execution while keeping planning, coordination, and OLTP on CPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run your top five analytical queries on a GPU-enabled instance or a GPU-accelerated engine such as RAPIDS cuDF, compare elapsed time and I/O throughput, and confirm the query is actually compute-bound (not I/O-bound) before attributing speedup to GPU offload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, profile your three slowest analytical queries and determine whether the bottleneck is CPU compute, memory bandwidth, storage I/O, or query plan shape — only the CPU compute bottleneck is a GPU-offload candidate.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>ai-engineering</category></item><item><title>PostgreSQL Statistics Drift Workflow</title><link>https://rajivonai.com/blog/2024-02-26-postgresql-statistics-drift-workflow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-02-26-postgresql-statistics-drift-workflow/</guid><description>When the query planner gets row estimates wrong, queries regress silently. This runbook diagnoses statistics drift and restores accurate plans.</description><pubDate>Mon, 26 Feb 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A query that ran in 8 milliseconds last week and now takes 4 seconds has not changed, but the planner’s model of the data has.&lt;/strong&gt; PostgreSQL’s query optimizer builds execution plans from table statistics: column value distributions and correlation coefficients stored in &lt;code&gt;pg_statistic&lt;/code&gt;, plus row and page count estimates kept in &lt;code&gt;pg_class&lt;/code&gt;. When those statistics drift from reality, the optimizer chooses wrong plans with confidence, and the resulting regressions are difficult to catch because no error is raised: the queries just get slower.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL uses a cost-based optimizer that estimates how many rows each plan step will process. Those estimates come from statistics gathered by &lt;code&gt;ANALYZE&lt;/code&gt;. If statistics are stale — from a bulk load, a large delete, or simply not running &lt;code&gt;ANALYZE&lt;/code&gt; for an extended period — the planner’s row estimates diverge from actual counts, and plan choices that were correct for the old data distribution become wrong for the current one.&lt;/p&gt;
&lt;p&gt;The most common presentation: a query that joins two tables starts doing a nested loop instead of a hash join because the planner underestimates the inner table’s row count. Or an index scan gets chosen when the data has changed enough that a sequential scan would be faster. Or a partial index gets selected for a query where the filtered row count no longer makes that index selective.&lt;/p&gt;
&lt;p&gt;Statistics drift is distinct from index bloat or table bloat. The physical storage might be fine. The problem is that the optimizer’s mental model of the data is wrong, and it is building plans optimized for a database that no longer exists.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; estimated rows far from actual rows&lt;/td&gt;&lt;td&gt;Query plan output&lt;/td&gt;&lt;td&gt;Statistics are stale or the column distribution is unusual&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;last_analyze&lt;/code&gt; or &lt;code&gt;last_autoanalyze&lt;/code&gt; is days old&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Automatic statistics updates not running on this table&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Query plan changed after a bulk load or large delete&lt;/td&gt;&lt;td&gt;Application performance logs&lt;/td&gt;&lt;td&gt;The new data volume or distribution triggered a different plan&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Planner chooses sequential scan on a selective query&lt;/td&gt;&lt;td&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output&lt;/td&gt;&lt;td&gt;Row count estimate too high; planner thinks index would cost more&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Planner chooses nested loop for a large result set&lt;/td&gt;&lt;td&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output&lt;/td&gt;&lt;td&gt;Row count estimate too low; planner underestimated join output&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;n_distinct&lt;/code&gt; in &lt;code&gt;pg_stats&lt;/code&gt; shows -1 for a column with few distinct values&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stats&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Planner assumes every value is unique, so equality estimates on that column come out far too low&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Confirm the estimate-vs-actual divergence&lt;/strong&gt; — the EXPLAIN output is the primary diagnostic:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS, FORMAT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TEXT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders o&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; customers c &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; o&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;customer_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; o&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; o&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;created_at&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;7 days&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
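&lt;p&gt;Look for plan nodes where the estimated &lt;code&gt;rows=N&lt;/code&gt; and the actual &lt;code&gt;rows=M&lt;/code&gt; differ by more than a factor of 10. Both figures are per execution of the node; multiply by &lt;code&gt;loops&lt;/code&gt; to get the total rows processed. A nested loop chosen over a hash join when the actual row count exceeds 10,000 is a strong sign of a statistics failure. Note the exact node type (Seq Scan, Index Scan, Hash Join, Nested Loop) and start from the lowest node in the plan whose estimate is off, since that tells you which estimate was wrong first.&lt;/p&gt;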
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Inspect column statistics for the affected table&lt;/strong&gt; — &lt;code&gt;pg_stats&lt;/code&gt; stores what the planner knows:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  attname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_distinct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  correlation,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  null_frac,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  avg_width,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  most_common_vals,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  most_common_freqs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; attname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;status&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;created_at&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;customer_id&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; attname;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;n_distinct &gt; 0&lt;/code&gt; is an absolute count of distinct values; &lt;code&gt;n_distinct &amp;#x3C; 0&lt;/code&gt; is the negated ratio of distinct values to rows, so &lt;code&gt;-0.5&lt;/code&gt; means about one distinct value for every two rows. If &lt;code&gt;n_distinct = -1&lt;/code&gt;, PostgreSQL believes every row holds a unique value, which is a red flag on a column you know to be low-cardinality. Low &lt;code&gt;correlation&lt;/code&gt; (near 0) on a column used in a range scan means physical row order does not match logical sort order, which raises index scan costs.&lt;/p&gt;
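&lt;p&gt;To make the sign convention concrete, the following sketch converts &lt;code&gt;n_distinct&lt;/code&gt; into an estimated distinct-value count by joining &lt;code&gt;pg_stats&lt;/code&gt; to &lt;code&gt;pg_class.reltuples&lt;/code&gt;. The &lt;code&gt;orders&lt;/code&gt; table is the running example; the &lt;code&gt;public&lt;/code&gt; schema filter is an assumption and should match wherever your table lives.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Estimated distinct values per column of the example table &apos;orders&apos;.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Assumes the table lives in schema &apos;public&apos;.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.attname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.n_distinct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  c.reltuples AS estimated_rows,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CASE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WHEN s.n_distinct &gt;= 0 THEN s.n_distinct   -- absolute count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ELSE -s.n_distinct * c.reltuples           -- negated fraction of rows&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  END AS estimated_distinct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stats s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_namespace n ON n.nspname = s.schemaname&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_class c ON c.relname = s.tablename AND c.relnamespace = n.oid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE s.schemaname = &apos;public&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND s.tablename = &apos;orders&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY s.attname;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Comparing &lt;code&gt;estimated_distinct&lt;/code&gt; with a &lt;code&gt;count(DISTINCT ...)&lt;/code&gt; on a suspect column shows how far off the planner is, at the cost of a full scan of that column.&lt;/p&gt;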
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check when statistics were last collected&lt;/strong&gt; — stale analyze timestamps are the first explanation:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_analyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autoanalyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_mod_since_analyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;customers&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; last_analyze &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NULLS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LAST&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;n_mod_since_analyze&lt;/code&gt; is the counter that autovacuum uses to decide whether to run &lt;code&gt;ANALYZE&lt;/code&gt;. If it is large relative to &lt;code&gt;n_live_tup&lt;/code&gt;, statistics are very likely stale. A NULL &lt;code&gt;last_analyze&lt;/code&gt; only means no manual &lt;code&gt;ANALYZE&lt;/code&gt; has been recorded; the table has had no recorded statistics collection only if &lt;code&gt;last_autoanalyze&lt;/code&gt; is NULL as well, which can also happen after the statistics counters are reset.&lt;/p&gt;
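&lt;p&gt;Because manual and automatic runs are tracked in separate columns, it can help to collapse them into a single freshness value. A sketch using the same example tables; the column aliases are illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- greatest() ignores NULLs, so the result is NULL only when neither a manual&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- nor an automatic ANALYZE has been recorded for the table.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  greatest(last_analyze, last_autoanalyze) AS last_stats_refresh,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - greatest(last_analyze, last_autoanalyze) AS stats_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE relname IN (&apos;orders&apos;, &apos;customers&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY stats_age DESC NULLS FIRST;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;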
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check for bulk data changes that were not followed by ANALYZE&lt;/strong&gt; — look at table modification counts:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_mod_since_analyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_mod_since_analyze::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; nullif&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; mod_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_mod_since_analyze &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; mod_pct &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
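&lt;p&gt;With default settings, autoanalyze triggers once modifications exceed &lt;code&gt;autovacuum_analyze_threshold&lt;/code&gt; (50 rows) plus &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; (0.1) times the table’s row count, which is roughly 10% of a large table. A &lt;code&gt;mod_pct&lt;/code&gt; above 20% is therefore well past the point where autovacuum should have run. Common reasons it has not: autovacuum is disabled or tuned differently for the table, the workers are busy elsewhere, or a bulk change landed only moments ago.&lt;/p&gt;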
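&lt;p&gt;The trigger point can be computed rather than guessed. The sketch below reads the live global settings and places each table’s modification count next to the resulting autoanalyze threshold. It deliberately ignores per-table storage-parameter overrides, which would have to be read from &lt;code&gt;pg_class.reloptions&lt;/code&gt;; the &lt;code&gt;analyze_trigger_point&lt;/code&gt; alias is illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Global settings only; per-table overrides in pg_class.reloptions are not applied.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.n_mod_since_analyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  current_setting(&apos;autovacuum_analyze_threshold&apos;)::int&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    + current_setting(&apos;autovacuum_analyze_scale_factor&apos;)::float8&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      * greatest(c.reltuples, 0) AS analyze_trigger_point&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_class c ON c.oid = s.relid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE s.n_mod_since_analyze &gt; 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY s.n_mod_since_analyze DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tables whose &lt;code&gt;n_mod_since_analyze&lt;/code&gt; sits well above &lt;code&gt;analyze_trigger_point&lt;/code&gt; are the ones autovacuum has not kept up with.&lt;/p&gt;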
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check raw statistics storage&lt;/strong&gt; — to understand what the planner is actually seeing:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  staattnum,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  stakind1,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  stavalues1,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  stanumbers1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_statistic&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; starelid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
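&lt;p&gt;&lt;code&gt;stakind&lt;/code&gt; 1 = most-common-values, 2 = histogram, 3 = correlation. Each row has five such slots (&lt;code&gt;stakind1&lt;/code&gt; through &lt;code&gt;stakind5&lt;/code&gt;) and this query shows only the first, so an empty &lt;code&gt;stavalues1&lt;/code&gt; is informative only alongside the other slots; if every slot is empty, the planner has no useful distribution data for that column. Reading &lt;code&gt;pg_statistic&lt;/code&gt; directly normally requires superuser privileges. This is the raw form of what &lt;code&gt;pg_stats&lt;/code&gt; presents in human-readable form.&lt;/p&gt;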
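&lt;p&gt;Since &lt;code&gt;staattnum&lt;/code&gt; is only a column number, joining to &lt;code&gt;pg_attribute&lt;/code&gt; makes the raw rows easier to read. A sketch that lists which statistic kind occupies each slot for the example table; like the query above, it assumes a role allowed to read &lt;code&gt;pg_statistic&lt;/code&gt;, typically a superuser.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Map attribute numbers to column names and show the kind stored in each slot.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  a.attname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.stakind1, s.stakind2, s.stakind3, s.stakind4, s.stakind5&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_statistic s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_attribute a&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON a.attrelid = s.starelid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; AND a.attnum = s.staattnum&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE s.starelid = &apos;orders&apos;::regclass&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY a.attnum;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;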
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Slow query — plan regression suspected] --&gt; B{EXPLAIN estimated rows match actual?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes — estimates correct| C[Statistics not the problem — check indexes or locks]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no — large divergence| D{last_analyze recent?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no — stale or never| E[ANALYZE tablename — re-check plan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes — but still wrong| F{Column has unusual distribution?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|yes — skewed or correlated| G[ALTER COLUMN SET STATISTICS 500]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt; H[ANALYZE tablename — re-check plan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|no| I{Multiple columns in WHERE clause?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|yes| J[CREATE STATISTICS for correlated columns]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt; K[ANALYZE tablename — re-check plan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|no| L{n_distinct estimate wrong?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| M[ALTER COLUMN SET n_distinct — explicit override]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| N[Check for partial index mismatch or planner bugs]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
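&lt;p&gt;The last branch of the flowchart, whether the &lt;code&gt;n_distinct&lt;/code&gt; estimate is wrong, can be checked by comparing the stored estimate with a real count. A minimal sketch, assuming the table lives in the &lt;code&gt;public&lt;/code&gt; schema and using &lt;code&gt;customer_id&lt;/code&gt; and the override value 250000 purely as placeholders:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Stored estimate used by the planner&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT attname, n_distinct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE schemaname = &apos;public&apos; AND tablename = &apos;orders&apos; AND attname = &apos;customer_id&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Ground truth; this scans the whole table, so run it off-peak on large tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT count(DISTINCT customer_id) AS actual_distinct FROM orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- If the two disagree badly, pin the value (250000 is a placeholder) and re-analyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ALTER COLUMN customer_id SET (n_distinct = 250000);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A negative &lt;code&gt;n_distinct&lt;/code&gt; in &lt;code&gt;pg_stats&lt;/code&gt; is a fraction of the row count rather than an absolute number, so multiply it by the table row count before comparing.&lt;/p&gt;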
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Run ANALYZE to refresh statistics&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The simplest fix — and always the first step:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Analyze a specific table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VERBOSE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Analyze multiple tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VERBOSE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders, customers, order_items;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Analyze specific columns only (faster on large tables)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, created_at, customer_id);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;ANALYZE VERBOSE&lt;/code&gt; prints a summary of rows sampled, which is useful for confirming the statistics update ran successfully. After &lt;code&gt;ANALYZE&lt;/code&gt;, re-run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; on the slow query to see if the estimates improved.&lt;/p&gt;
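&lt;p&gt;&lt;code&gt;ANALYZE&lt;/code&gt; takes a &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;/code&gt; lock: it blocks DDL and a concurrent &lt;code&gt;VACUUM&lt;/code&gt; or &lt;code&gt;ANALYZE&lt;/code&gt; on the same table, but not reads or writes. From a locking standpoint it is safe to run on production tables at any time; the cost to watch is the extra read I/O on very large tables.&lt;/p&gt;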
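&lt;p&gt;The re-check after &lt;code&gt;ANALYZE&lt;/code&gt; might look like the sketch below. The query itself is hypothetical and stands in for whatever statement regressed; note that &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; actually executes the statement, so wrap data-modifying queries in a transaction you roll back.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical slow query; substitute your own&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT *&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE status = &apos;shipped&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND created_at &gt;= now() - interval &apos;7 days&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- For each plan node, compare rows= in the estimate with rows= in the actual figures&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;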
&lt;p&gt;&lt;strong&gt;Option 2 — Increase statistics target for selective columns&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With the default setting of &lt;code&gt;default_statistics_target = 100&lt;/code&gt;, &lt;code&gt;ANALYZE&lt;/code&gt; samples 300 × 100 = 30,000 rows per table. For columns with many distinct values or highly skewed distributions, this sample may not capture the tail. Increase the per-column target:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Increase statistics detail for a specific column&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATISTICS&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 500&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATISTICS&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 500&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Then refresh statistics&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;statistics target&lt;/code&gt; of 500 collects approximately 150,000 rows — 5x the default. The &lt;code&gt;pg_stats&lt;/code&gt; documentation notes that &lt;code&gt;n_distinct&lt;/code&gt; estimates and histogram bucket counts improve with higher targets, especially for columns where the value distribution has a long tail.&lt;/p&gt;
&lt;p&gt;After increasing the target, verify in &lt;code&gt;pg_stats&lt;/code&gt; that &lt;code&gt;most_common_vals&lt;/code&gt; is more populated and that histogram buckets look representative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; attname, array_length(most_common_vals, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; mcv_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       array_length(histogram_bounds, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; histogram_buckets&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 3 — Create extended statistics for correlated columns&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When a &lt;code&gt;WHERE&lt;/code&gt; clause filters on two columns that are correlated — e.g., &lt;code&gt;status = &apos;shipped&apos; AND region = &apos;EU&apos;&lt;/code&gt; where shipped orders are disproportionately from EU — the planner multiplies the selectivity of each column independently and underestimates the result set. PostgreSQL 10 introduced extended statistics to model this:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Create statistics tracking correlation between two columns&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATISTICS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders_status_region (dependencies)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, region&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Collect the extended statistics&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; stxname, stxkind, stxkeys&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_statistic_ext&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; stxrelid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Extended statistics of kind &lt;code&gt;dependencies&lt;/code&gt; teach the planner that the two columns are correlated. The &lt;code&gt;ndistinct&lt;/code&gt; kind captures combined distinct value counts; the &lt;code&gt;mcv&lt;/code&gt; kind, available from PostgreSQL 12, captures the most common value combinations. After collecting, re-run &lt;code&gt;EXPLAIN&lt;/code&gt; to see if the multi-column estimate improved.&lt;/p&gt;
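&lt;p&gt;The other two statistics kinds are created the same way. A sketch, assuming the same &lt;code&gt;orders&lt;/code&gt; table; the object names ending in &lt;code&gt;_nd&lt;/code&gt; and &lt;code&gt;_mcv&lt;/code&gt; are made up for illustration, and both the &lt;code&gt;mcv&lt;/code&gt; kind and the &lt;code&gt;pg_stats_ext&lt;/code&gt; view require PostgreSQL 12 or later:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ndistinct: combined distinct counts, mainly used for GROUP BY estimates&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE STATISTICS orders_status_region_nd (ndistinct) ON status, region FROM orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- mcv: frequencies of the most common value combinations (PostgreSQL 12+)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE STATISTICS orders_status_region_mcv (mcv) ON status, region FROM orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Nothing is collected until the next ANALYZE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Inspect what was collected (PostgreSQL 12+)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT statistics_name, attnames, kinds, n_distinct, dependencies&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stats_ext&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE tablename = &apos;orders&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;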
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;ANALYZE&lt;/code&gt; is always safe to run and always safe to re-run. It does not modify data. The only rollback consideration is performance: on a very large table with a high statistics target, &lt;code&gt;ANALYZE&lt;/code&gt; can take minutes and create I/O pressure. Run during off-peak hours on tables over 100 GB.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALTER COLUMN SET STATISTICS N&lt;/code&gt; is reversible: &lt;code&gt;ALTER TABLE orders ALTER COLUMN status SET STATISTICS -1&lt;/code&gt; returns the column to the default target. The setting only changes what the next &lt;code&gt;ANALYZE&lt;/code&gt; collects; statistics gathered under the old target stay in place until that run, so follow the revert with &lt;code&gt;ANALYZE&lt;/code&gt; if it needs to take effect right away.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE STATISTICS&lt;/code&gt; is reversible: &lt;code&gt;DROP STATISTICS orders_status_region&lt;/code&gt;. The planner reverts to independent column estimates immediately.&lt;/li&gt;
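&lt;li&gt;&lt;code&gt;ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = N)&lt;/code&gt;, an explicit override that bypasses sampling, is reversible: &lt;code&gt;ALTER TABLE orders ALTER COLUMN col RESET (n_distinct)&lt;/code&gt; removes the override, and the next &lt;code&gt;ANALYZE&lt;/code&gt; goes back to estimating the value from the sample. Do not use &lt;code&gt;n_distinct = -1&lt;/code&gt; for this: a value of -1 tells the planner that every row holds a distinct value.&lt;/li&gt;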
&lt;/ol&gt;
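&lt;p&gt;Put together, a full revert of the changes from Options 2 and 3 could look like the following sketch. It assumes the object and column names used earlier in this post; &lt;code&gt;customer_id&lt;/code&gt; is a placeholder for whichever column carried an &lt;code&gt;n_distinct&lt;/code&gt; override, and the sketch clears that override with &lt;code&gt;RESET&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Return the per-column statistics targets to the default&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ALTER COLUMN status SET STATISTICS -1;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS -1;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Drop the extended statistics object&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DROP STATISTICS IF EXISTS orders_status_region;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Remove an n_distinct override, if one was set (placeholder column name)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ALTER COLUMN customer_id RESET (n_distinct);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rebuild statistics under the restored settings&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;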
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Stale statistics are predictable: they happen after bulk loads and large deletes. A pattern worth automating is a post-ETL &lt;code&gt;ANALYZE&lt;/code&gt; call baked into the data pipeline itself, rather than relying on autovacuum timing:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After any bulk insert, run ANALYZE immediately&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders_archive &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;completed&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;1 year&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DELETE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;completed&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;1 year&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders, orders_archive;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- do not skip this; both tables changed&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For monitoring, a pg_cron query that alerts when &lt;code&gt;n_mod_since_analyze&lt;/code&gt; exceeds a threshold gives advance notice before the planner starts making wrong decisions:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; cron&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schedule&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;stats-staleness-check&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;30 * * * *&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; ops&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;stats_alerts&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (tablename, mod_pct, captured_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_mod_since_analyze::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; nullif&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_mod_since_analyze::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; nullif&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;15&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$$);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
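&lt;p&gt;The scheduled job writes into &lt;code&gt;ops.stats_alerts&lt;/code&gt;, which has to exist first. A minimal sketch of that table follows; the column types are assumptions chosen to match the &lt;code&gt;INSERT&lt;/code&gt; above, not a prescribed schema.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Landing table for the staleness check (assumed layout)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE SCHEMA IF NOT EXISTS ops;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE IF NOT EXISTS ops.stats_alerts (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  tablename   text        NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  mod_pct     numeric     NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  captured_at timestamptz NOT NULL DEFAULT now()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Findings from the last day, worst first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT tablename, mod_pct, captured_at&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM ops.stats_alerts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE captured_at &gt; now() - interval &apos;1 day&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY mod_pct DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The job only records rows; paging or chat notifications would still need something that reads this table.&lt;/p&gt;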
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The PostgreSQL statistics documentation describes the statistics target as controlling both the number of histogram buckets and the most-common-values list length. The documented relationship is: &lt;code&gt;statistics_target&lt;/code&gt; × 300 = rows sampled. For a column where 0.01% of rows have a specific value that is frequently queried, the default 30,000-row sample is expected to contain only about three such rows (30,000 × 0.0001), and sometimes none. That is usually too few for the value to earn a place in the most-common-values list, so the planner falls back to its generic estimate for values outside that list, which can be substantially wrong.&lt;/p&gt;
&lt;p&gt;The documented behavior of &lt;code&gt;CREATE STATISTICS&lt;/code&gt; with &lt;code&gt;dependencies&lt;/code&gt; is that it computes functional dependency statistics between columns. Where the selectivity of &lt;code&gt;col_a = &apos;x&apos;&lt;/code&gt; is 0.01 and &lt;code&gt;col_b = &apos;y&apos;&lt;/code&gt; is 0.05, the planner without extended statistics estimates the joint selectivity as 0.01 × 0.05 = 0.0005. With a dependencies statistic showing that &lt;code&gt;col_a = &apos;x&apos;&lt;/code&gt; implies &lt;code&gt;col_b = &apos;y&apos;&lt;/code&gt; with 95% probability, the planner correctly estimates closer to 0.01.&lt;/p&gt;
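&lt;p&gt;The arithmetic in that example can be reproduced directly. This sketch assumes the planner combines the two selectivities as P(a) × (f + (1 − f) × P(b)), where f is the dependency degree (0.95 here); that formula is taken from the notes in the PostgreSQL source on functional dependencies and should be treated as an assumption rather than a guarantee for every version.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- P(a) = 0.01, P(b) = 0.05, dependency degree f = 0.95&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT 0.01 * 0.05                       AS independent_estimate,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       0.01 * (0.95 + (1 - 0.95) * 0.05) AS with_dependency_estimate;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works out to 0.0005 for the independent estimate and 0.009525 with the dependency, roughly a 19x difference in estimated rows, which is often enough to flip a join strategy.&lt;/p&gt;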
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;ANALYZE&lt;/code&gt; runs but estimates still wrong&lt;/td&gt;&lt;td&gt;Column has extreme skew: 99% of rows share one value&lt;/td&gt;&lt;td&gt;Raise the per-column statistics target to 1000; for skewed combinations of columns, add &lt;code&gt;CREATE STATISTICS ... (mcv)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Extended statistics do not help&lt;/td&gt;&lt;td&gt;Correlation is partial, not a functional dependency&lt;/td&gt;&lt;td&gt;Try the &lt;code&gt;mcv&lt;/code&gt; kind of &lt;code&gt;CREATE STATISTICS&lt;/code&gt; (PostgreSQL 12+); &lt;code&gt;ndistinct&lt;/code&gt; mainly improves &lt;code&gt;GROUP BY&lt;/code&gt; estimates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;ANALYZE&lt;/code&gt; is too slow on large table&lt;/td&gt;&lt;td&gt;Table has 1B+ rows and wide schema&lt;/td&gt;&lt;td&gt;Analyze specific columns only: &lt;code&gt;ANALYZE table (col1, col2)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Autovacuum is running ANALYZE but estimates still drift&lt;/td&gt;&lt;td&gt;The default &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; of 0.1 is crossed only after roughly 10% of rows change&lt;/td&gt;&lt;td&gt;Lower &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; per-table to 0.01&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Plan regression returns after ANALYZE&lt;/td&gt;&lt;td&gt;Statistics are correct but planner cost constants (for example &lt;code&gt;random_page_cost&lt;/code&gt;) do not match the hardware&lt;/td&gt;&lt;td&gt;Consider &lt;code&gt;pg_hint_plan&lt;/code&gt; as a temporary override while investigating&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
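&lt;p&gt;The per-table autovacuum fix from the table is a storage parameter change. A sketch, assuming the &lt;code&gt;orders&lt;/code&gt; table; the threshold of 1000 rows is an illustrative value, not a recommendation:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Trigger auto-analyze after about 1% of rows change instead of the default 10%&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders SET (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_analyze_scale_factor = 0.01,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_analyze_threshold = 1000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Confirm the per-table settings&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT relname, reloptions FROM pg_class WHERE relname = &apos;orders&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Revert to the server-wide defaults&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders RESET (autovacuum_analyze_scale_factor, autovacuum_analyze_threshold);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;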
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Stale or low-resolution statistics cause the planner to choose wrong join types and scan methods, producing query regressions that look like load spikes but are actually optimizer failures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Run &lt;code&gt;ANALYZE&lt;/code&gt; after bulk loads, raise &lt;code&gt;statistics target&lt;/code&gt; to 500 for join and filter columns on large tables, and create extended statistics for correlated column pairs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After &lt;code&gt;ANALYZE&lt;/code&gt;, &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; estimated rows should be within a factor of 2 of actual rows for the primary scan nodes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run the &lt;code&gt;n_mod_since_analyze&lt;/code&gt; query from Check 4 this week. Any table where &lt;code&gt;mod_pct &gt; 20%&lt;/code&gt; needs an &lt;code&gt;ANALYZE&lt;/code&gt; run today.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; on the slow query — compare estimated vs actual rows at each node&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_stats&lt;/code&gt; for the filtered columns — check &lt;code&gt;n_distinct&lt;/code&gt;, &lt;code&gt;correlation&lt;/code&gt;, and &lt;code&gt;most_common_vals&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_user_tables&lt;/code&gt; for &lt;code&gt;last_analyze&lt;/code&gt;, &lt;code&gt;last_autoanalyze&lt;/code&gt;, and &lt;code&gt;n_mod_since_analyze&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;If both &lt;code&gt;last_analyze&lt;/code&gt; and &lt;code&gt;last_autoanalyze&lt;/code&gt; are stale or NULL: run &lt;code&gt;ANALYZE tablename&lt;/code&gt; immediately&lt;/li&gt;
&lt;li&gt;Re-run &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; after &lt;code&gt;ANALYZE&lt;/code&gt; to verify estimates improved&lt;/li&gt;
&lt;li&gt;If estimates still wrong: check for correlated columns in the &lt;code&gt;WHERE&lt;/code&gt; clause&lt;/li&gt;
&lt;li&gt;Raise &lt;code&gt;statistics_target&lt;/code&gt; to 500 for high-cardinality or skewed columns&lt;/li&gt;
&lt;li&gt;Create extended statistics with &lt;code&gt;CREATE STATISTICS (dependencies)&lt;/code&gt; for correlated column pairs&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;ANALYZE&lt;/code&gt; again after any statistics configuration change&lt;/li&gt;
&lt;li&gt;Lower &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; to 0.01 per-table for high-write tables&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;ANALYZE&lt;/code&gt; calls to ETL pipelines immediately after bulk loads or large deletes&lt;/li&gt;
&lt;li&gt;Add a monitoring query on &lt;code&gt;n_mod_since_analyze&lt;/code&gt; — alert when &lt;code&gt;mod_pct &gt; 15%&lt;/code&gt; on production tables&lt;/li&gt;
&lt;/ol&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>Aurora Global Database: What It Solves and What It Does Not</title><link>https://rajivonai.com/blog/2024-02-19-aurora-global-database-what-it-solves-and-does-not/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-02-19-aurora-global-database-what-it-solves-and-does-not/</guid><description>Aurora Global Database delivers sub-second cross-region replication and under-one-minute RTO for disaster recovery — but it is not active-active, and application failover is never automatic.</description><pubDate>Mon, 19 Feb 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Aurora Global Database is frequently evaluated as an active-active multi-region database. It is not. The secondary region is read-only until you explicitly promote it, promotion does not re-point your application endpoints, and the RPO on an unplanned failover is measured in seconds, not zero. Understanding what the product actually delivers — and what it leaves to you — is the only way to size it correctly for a DR or read-scale design.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Multi-region database architecture sits at the intersection of two pressures: latency-sensitive reads that cross region boundaries unnecessarily, and disaster recovery designs that require tighter RTO/RPO than a daily snapshot gives you. Aurora Global Database is the AWS answer to both, and the marketing framing — “single database spanning multiple regions” — sounds closer to active-active than the implementation actually is.&lt;/p&gt;
&lt;p&gt;Engineers evaluating Global Database typically encounter it while building a DR failover plan or routing global reads to a closer region. Both use cases are real. The confusion starts when teams assume they compound into active-active behavior.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Aurora Global Database does not detect primary region failure and promote the secondary automatically. Promotion is an API call — manually triggered or triggered by your application logic. The application’s connection string still points at the old primary endpoint after promotion. The database cluster comes up cleanly; your application is still talking to a dead region.&lt;/p&gt;
&lt;p&gt;The “sub-one-minute RTO” claim is precise: it covers the time to promote a new primary cluster. It does not include DNS propagation, application reconfiguration, or connection pool drain. The actual application recovery time is longer, and the gap is entirely under your control rather than Aurora’s.&lt;/p&gt;
&lt;p&gt;What does Aurora Global Database actually guarantee, where does that guarantee stop, and what does your application need to provide for the rest?&lt;/p&gt;
&lt;h2 id=&quot;how-aurora-global-database-replicates&quot;&gt;How Aurora Global Database Replicates&lt;/h2&gt;
&lt;p&gt;Aurora’s replication mechanism is not binlog-based or WAL-shipping-based in the traditional sense. The Aurora storage layer replicates storage-level redo log records directly between regions. According to &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html&quot;&gt;AWS Aurora documentation&lt;/a&gt;, this typically achieves under one second of replication lag using dedicated infrastructure separate from database compute nodes. Because replication does not go through the compute layer, writes on the primary are not slowed by cross-region replication — the storage tier handles it asynchronously.&lt;/p&gt;
&lt;p&gt;The secondary cluster can serve reads from its local storage copy. Those reads trail the primary by the current replication lag, which is typically under one second but has no guaranteed upper bound. For dashboards, reporting, and non-transactional API endpoints that is fine. For reads that must reflect a just-completed write, it is not.&lt;/p&gt;
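&lt;p&gt;Because staleness tracks replication lag, it is worth measuring that lag rather than assuming it. The sketch below applies only to Aurora PostgreSQL and assumes the built-in functions &lt;code&gt;aurora_global_db_status()&lt;/code&gt; and &lt;code&gt;aurora_global_db_instance_status()&lt;/code&gt; are available on your engine version; check the Aurora function reference before relying on specific output columns.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Aurora PostgreSQL only (assumed available): per-region replication status and lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT * FROM aurora_global_db_status();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Per-instance view across the global cluster&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT * FROM aurora_global_db_instance_status();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For alerting, the CloudWatch metric &lt;code&gt;AuroraGlobalDBReplicationLag&lt;/code&gt; covers the same ground without opening a database connection.&lt;/p&gt;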
&lt;h3 id=&quot;planned-vs-unplanned-failover&quot;&gt;Planned vs. Unplanned Failover&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database-disaster-recovery.html&quot;&gt;AWS documents two distinct failover modes&lt;/a&gt; with different guarantees.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Managed planned failover&lt;/strong&gt; (called a switchover in current AWS documentation and exposed in the CLI as &lt;code&gt;aws rds switchover-global-cluster&lt;/code&gt;) is for intentional region changes: maintenance, a region move, or a DR drill. Aurora coordinates the promotion, waits for the secondary to fully catch up, and promotes with an RPO of zero, meaning no data loss. The original primary must be reachable, and the operation takes longer than a forced failover.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unplanned failover&lt;/strong&gt; is what you invoke when the primary region has failed. There is no coordination; the secondary region’s data reflects whatever was replicated before the failure. Given sub-one-second typical lag, RPO in practice is low — but it is not zero. AWS documentation states the RPO depends on replication lag at the time of failure.&lt;/p&gt;
&lt;p&gt;The promotion is an API call you must issue explicitly. For an unplanned failover:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; failover-global-cluster&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --global-cluster-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; my-global-cluster&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --target-db-cluster-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; arn:aws:rds:us-west-2:123456789012:cluster:my-secondary-cluster&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --allow-data-loss&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After promotion, the secondary cluster becomes the new writer. Your application’s connection string still points at the old primary endpoint — updating that is separate from the promotion step and is your responsibility.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html&quot;&gt;Aurora Global Database user guide&lt;/a&gt; documents three patterns worth internalizing before committing to the architecture.&lt;/p&gt;
&lt;p&gt;Storage-layer replication means the secondary cluster can be promoted without replaying a long log — a genuine DR advantage over traditional streaming replication, where a lagging replica must finish replay before accepting writes.&lt;/p&gt;
&lt;p&gt;Read routing is not automatic. The application must explicitly send reads to the secondary cluster endpoint. Reads on the secondary reflect data up to the current replication lag behind the primary.&lt;/p&gt;
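&lt;p&gt;As a rough sketch of how to check that lag before trusting a secondary read, and assuming the Aurora PostgreSQL-compatible edition (where AWS documents an &lt;code&gt;aurora_global_db_status()&lt;/code&gt; function), a query like the following reports lag per region. The column names are an assumption to confirm against your engine version, not a guarantee:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Assumes Aurora PostgreSQL-compatible; verify the column names on your version.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT aws_region,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       durability_lag_in_msec,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       rpo_lag_in_msec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM aurora_global_db_status();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sampling this on a schedule gives you a lag history to compare against the freshness your secondary-region reads can tolerate.&lt;/p&gt;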
&lt;p&gt;Cost includes storage in both regions (a full copy in each) plus cross-region data transfer for replication. For large databases, storage cost effectively doubles. This is rarely in the first-pass sizing estimate.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application assumes automatic endpoint failover&lt;/td&gt;&lt;td&gt;Application continues targeting the old primary endpoint after promotion&lt;/td&gt;&lt;td&gt;Aurora promotes the cluster but does not update the application’s connection string&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Writes needed in both regions simultaneously&lt;/td&gt;&lt;td&gt;Active-active writes are not supported&lt;/td&gt;&lt;td&gt;The secondary is read-only until promoted; there is no multi-primary write path&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;RPO must be exactly zero on unplanned failure&lt;/td&gt;&lt;td&gt;RPO on unplanned failover is bounded by replication lag, not guaranteed zero&lt;/td&gt;&lt;td&gt;Only managed planned failover provides zero data loss&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Aurora Global Database does not automatically re-point application traffic after a regional failure, so an untested failover plan typically means manual intervention under pressure during an outage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Build and test the full failover path — promotion API call, DNS update or connection-string reconfiguration, connection pool reset — as a runbook that runs end-to-end in a staging environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A successful failover drill where the application resumes writes within your RTO target, with the promotion time and application re-point time measured separately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, find your current RTO target in your DR documentation, then measure how long the non-Aurora steps (DNS propagation, app reconfiguration, connection validation) actually take in your environment. That is your gap.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>architecture</category></item><item><title>Why SELECT * Still Hurts Production Systems</title><link>https://rajivonai.com/blog/2023-10-02-why-select-star-still-hurts-production-systems/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-10-02-why-select-star-still-hurts-production-systems/</guid><description>SELECT * causes four distinct problems that compound at scale: it prevents covering index usage, transfers unnecessary data, breaks application code silently, and defeats column pruning in analytical systems.</description><pubDate>Mon, 02 Oct 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;&lt;code&gt;SELECT *&lt;/code&gt; is not a minor style violation. It is a query that opts out of covering indexes, pulls every TOAST column unconditionally, and defeats columnar storage’s only performance advantage — column pruning.&lt;/strong&gt; Engineers know the advice, but most have never seen the actual mechanism that makes &lt;code&gt;SELECT *&lt;/code&gt; expensive in production. The problem almost always shows up the same way: the query ran fine in development, shipped, then became the top line in I/O bytes as the table grew.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Applications accumulate columns over time. A &lt;code&gt;users&lt;/code&gt; table starts with a dozen fields and grows incrementally — a &lt;code&gt;preferences&lt;/code&gt; JSONB column here, a &lt;code&gt;bio&lt;/code&gt; TEXT there, an audit field, a feature flag blob. Each migration is routine. The &lt;code&gt;SELECT *&lt;/code&gt; queries that read that table are unchanged.&lt;/p&gt;
&lt;p&gt;By the time a query shows up in slow query logs, the table has 50 columns and two of them average 40KB per row. Development databases rarely catch this because dev tables hold few rows and their TEXT or JSONB values are usually short.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;There are four distinct mechanisms through which &lt;code&gt;SELECT *&lt;/code&gt; degrades production workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Covering indexes become useless.&lt;/strong&gt; PostgreSQL’s index-only scan resolves a query entirely from the index without touching the heap — but only when every output column is present in the index. &lt;code&gt;SELECT *&lt;/code&gt; forces a heap fetch for every matching row regardless, turning a fast index-only scan into a random I/O operation per result.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TOAST columns are fetched unconditionally.&lt;/strong&gt; PostgreSQL stores values larger than roughly 2KB out-of-line in a secondary TOAST table. A &lt;code&gt;TEXT&lt;/code&gt;, &lt;code&gt;JSONB&lt;/code&gt;, or &lt;code&gt;BYTEA&lt;/code&gt; column that exceeds the threshold is fetched separately when accessed. &lt;code&gt;SELECT *&lt;/code&gt; includes every column, so every oversized value triggers a secondary read — even when the application uses only two fields from the row.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema changes break application code silently.&lt;/strong&gt; ORM or hand-written code that maps &lt;code&gt;SELECT *&lt;/code&gt; results onto struct fields by position can read the wrong values once a column is added, dropped, or reordered. The query succeeds; the struct carries unexpected data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Columnar systems lose column pruning.&lt;/strong&gt; Redshift, BigQuery, and DuckDB store data by column. Their foundational I/O optimization is reading only the columns the query names. &lt;code&gt;SELECT *&lt;/code&gt; forces reads across every column in the table, with I/O cost proportional to column count.&lt;/p&gt;
&lt;p&gt;What does a query that avoids all four problems look like, and what needs to change at the schema and index layer?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s index-only scan allows the executor to return results directly from index pages, skipping the heap for any page the visibility map marks as all-visible. For this to work, every column in the SELECT list and WHERE clause must be present in the index.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Query execution] --&gt; B{All selected columns in index?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B -- Yes --&gt; C[Index-only Scan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B -- No — SELECT star used --&gt; D[Fetch full row from heap]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E{Has out-of-line TOAST columns?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E -- Yes --&gt; F[Fetch secondary TOAST pages]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E -- No --&gt; G[Return heap data]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A query like this can use an index-only scan if an index exists on &lt;code&gt;(email, id, name)&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;name&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; users &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; email &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;user@example.com&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Change that to &lt;code&gt;SELECT *&lt;/code&gt; and the covering index is bypassed. The executor must fetch the full heap row for every match regardless of index efficiency. The practical guidance from PostgreSQL’s documentation is direct: include output columns in the index using &lt;code&gt;INCLUDE&lt;/code&gt;, and name only the columns the query needs. &lt;code&gt;SELECT *&lt;/code&gt; makes both impossible because the output column list is unbounded.&lt;/p&gt;
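&lt;p&gt;A minimal sketch of that guidance, assuming PostgreSQL 11 or later (where &lt;code&gt;INCLUDE&lt;/code&gt; is available) and the &lt;code&gt;users&lt;/code&gt; columns from the example above. The index name is hypothetical:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical index name; email is the search key, id and name ride along as payload.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX users_email_covering_idx&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ON users (email)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    INCLUDE (id, name);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Index-only scans consult the visibility map, so vacuum before measuring.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;VACUUM (ANALYZE) users;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Putting &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt; in &lt;code&gt;INCLUDE&lt;/code&gt; rather than in the key keeps them out of the index’s search key, which is usually the right trade for columns that are only returned and never filtered on.&lt;/p&gt;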
&lt;p&gt;To verify, run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; before and after switching from &lt;code&gt;SELECT *&lt;/code&gt; to named columns. The heap fetch cost shows up as the difference in the &lt;code&gt;Buffers: shared hit&lt;/code&gt; and &lt;code&gt;read&lt;/code&gt; counts, and an Index Only Scan node reports its own &lt;code&gt;Heap Fetches&lt;/code&gt; figure. The &lt;a href=&quot;https://rajivonai.com/blog/2022-06-06-mysql-explain-reading-the-plan/&quot;&gt;MySQL EXPLAIN post&lt;/a&gt; walks through reading query plans systematically; the same principle applies to PostgreSQL’s EXPLAIN ANALYZE output when comparing index-only scan eligibility.&lt;/p&gt;
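&lt;p&gt;That comparison can be sketched as two runs against the same predicate. The table and literal are the ones used earlier; the plan shapes noted in the comments are what you would expect when a covering index exists and the table has been vacuumed, not guaranteed output:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Named columns: eligible for an Index Only Scan when a covering index exists.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT id, name FROM users WHERE email = &apos;user@example.com&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- SELECT *: the executor has to visit the heap for each matching row.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT * FROM users WHERE email = &apos;user@example.com&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run each statement more than once so that cache warm-up does not distort the buffer counts you compare.&lt;/p&gt;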
&lt;p&gt;For vector queries, column selection matters in the same way. A query retrieving pgvector embeddings alongside large JSON metadata columns pays the TOAST cost on every result row when &lt;code&gt;SELECT *&lt;/code&gt; is used. Selecting only the embedding and the fields the application reads avoids that fetch entirely. Index setup is only half the battle; column selection determines what gets fetched once the index returns its matches.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented behavior of PostgreSQL’s index-only scan is that it is unavailable when the query output includes columns not present in the index. The PostgreSQL documentation states this explicitly: every column in the query’s target list and WHERE clause must be available from the index. &lt;code&gt;SELECT *&lt;/code&gt; prevents this by construction.&lt;/p&gt;
&lt;p&gt;The PostgreSQL TOAST documentation describes out-of-line threshold behavior: values are not fetched unless the column is accessed. This means &lt;code&gt;SELECT id, name FROM users&lt;/code&gt; genuinely avoids reading oversized &lt;code&gt;metadata&lt;/code&gt; values, while &lt;code&gt;SELECT *&lt;/code&gt; fetches them for every row regardless of whether the application uses them.&lt;/p&gt;
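&lt;p&gt;To see how much of a table lives out-of-line, a catalog query along these lines can help. It is a sketch: &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;metadata&lt;/code&gt; are the illustrative names from this post, and &lt;code&gt;pg_column_size&lt;/code&gt; reports the stored (possibly compressed) size rather than the logical length:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Heap size versus TOAST size for the users table.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT c.relname AS table_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_class AS c&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE c.relname = &apos;users&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND c.reltoastrelid != 0;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Average stored size of one suspect column (metadata is an illustrative name).&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT avg(pg_column_size(metadata)) AS avg_stored_bytes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM users;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A TOAST size that rivals or exceeds the heap size is a strong hint that &lt;code&gt;SELECT *&lt;/code&gt; on this table is paying for data the application may never read.&lt;/p&gt;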
&lt;p&gt;Google’s BigQuery documentation is explicit under query optimization guidance: selecting only needed columns reduces bytes scanned and therefore cost. The documented design of Redshift and DuckDB follows the same principle — column pruning requires a bounded output list. &lt;code&gt;SELECT *&lt;/code&gt; removes that bound entirely.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Covering index bypassed&lt;/td&gt;&lt;td&gt;Index-only scan degrades to heap fetch per row&lt;/td&gt;&lt;td&gt;&lt;code&gt;SELECT *&lt;/code&gt; requires columns the index does not contain&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;TOAST column on every row&lt;/td&gt;&lt;td&gt;Extra TOAST-table reads for every row returned, growing with result size&lt;/td&gt;&lt;td&gt;Large out-of-line values fetched even when the app discards them&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ORM struct mapping&lt;/td&gt;&lt;td&gt;Application reads wrong values after schema migration&lt;/td&gt;&lt;td&gt;Positional mapping breaks when columns are added or reordered&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Columnar storage full-scan&lt;/td&gt;&lt;td&gt;Query cost proportional to column count instead of query selectivity&lt;/td&gt;&lt;td&gt;Column pruning requires knowing the output columns at parse time&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: &lt;code&gt;SELECT *&lt;/code&gt; bypasses covering indexes, unconditionally fetches TOAST columns, and eliminates column pruning — costs invisible in development, expensive in production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Name only the columns the application consumes, and build indexes with &lt;code&gt;INCLUDE&lt;/code&gt; to cover the output columns needed on frequent read paths.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; before and after switching from &lt;code&gt;SELECT *&lt;/code&gt; to named columns. A drop in the &lt;code&gt;shared hit&lt;/code&gt; and &lt;code&gt;read&lt;/code&gt; buffer counts, together with an Index Only Scan node in the plan, indicates the heap fetch is no longer happening.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Audit the top 10 queries by blocks touched (&lt;code&gt;shared_blks_read&lt;/code&gt; plus &lt;code&gt;shared_blks_hit&lt;/code&gt;) in &lt;code&gt;pg_stat_statements&lt;/code&gt; this week and identify which use &lt;code&gt;SELECT *&lt;/code&gt; on tables containing TEXT, JSONB, or BYTEA columns.&lt;/li&gt;
&lt;/ul&gt;
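&lt;p&gt;The audit in the Action item can be started with a query like this one. It assumes the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension is installed, and it ranks by shared blocks touched because the view reports blocks rather than bytes. The regular expression is a rough filter and will miss forms such as &lt;code&gt;SELECT t.*&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Top 10 statements by shared blocks touched, flagged when the text contains SELECT *.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT queryid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       calls,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       shared_blks_hit + shared_blks_read AS shared_blks_touched,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       (query ~* &apos;select\s+\*&apos;) AS uses_select_star,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       left(query, 80) AS query_start&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_statements&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY shared_blks_touched DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;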
&lt;p&gt;The rule exists not because of style but because the optimizer needs a bounded column list to make cost decisions. Give the optimizer that list and three of these four problems disappear entirely.&lt;/p&gt;</content:encoded><category>databases</category><category>failures</category></item><item><title>Product Catalog Modeling: Relational, Document, Search Index, or All Three</title><link>https://rajivonai.com/blog/2023-09-18-product-catalog-modeling-relational-document-search-index-or-all-three/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-09-18-product-catalog-modeling-relational-document-search-index-or-all-three/</guid><description>Modeling a product catalog across relational, document, and search-index layers: where each fits and why a single schema fails all three workloads.</description><pubDate>Mon, 18 Sep 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Product catalogs fail when teams treat “the product” as one data shape instead of three competing workloads: correctness, merchandising flexibility, and discovery.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;A catalog begins innocently. There is a &lt;code&gt;products&lt;/code&gt; table, a few categories, a price, a description, and an image URL. Then the business asks for variants, bundles, regional availability, marketplace sellers, promotions, localized copy, regulated attributes, and category-specific fields.&lt;/p&gt;
&lt;p&gt;Shoes need size and material. Laptops need CPU, RAM, warranty, and energy labels. Groceries need allergens, pack size, substitution rules, and fulfillment temperature. The product catalog stops being a table of products and becomes the contract between commerce, fulfillment, search, analytics, ads, and customer support.&lt;/p&gt;
&lt;p&gt;At that point the database question becomes architectural. A relational model gives integrity and joins. A document model gives shape flexibility. A search index gives retrieval behavior that neither of the first two should be forced to emulate.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The common failure is picking one model and making it serve all catalog workloads.&lt;/p&gt;
&lt;p&gt;A purely relational catalog often starts clean, then accumulates entity-attribute-value tables, nullable columns, category-specific side tables, and migration anxiety. The schema protects invariants, but product teams wait on DDL for every new attribute family.&lt;/p&gt;
&lt;p&gt;A purely document catalog moves faster, but correctness gets harder. If price, availability, tax classification, seller state, and compliance flags live as loosely governed blobs, downstream systems have to rediscover which fields are authoritative.&lt;/p&gt;
&lt;p&gt;A search-only catalog feels fast until the index becomes the source of truth. Search indexes are optimized for denormalized retrieval, ranking, tokenization, and filtering. They are not designed to be the system of record for transactional correctness.&lt;/p&gt;
&lt;p&gt;The core question is not “which database stores products best?” It is: which parts of the product catalog must be correct, which parts must be flexible, and which parts must be discoverable?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The strongest pattern is usually not relational or document or search. It is relational and document and search, with ownership boundaries that prevent each store from pretending to be the others.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[merchant tools — catalog edits] --&gt; B[relational core — identity and invariants]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A --&gt; C[document attributes — category shape]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; D[change stream — catalog events]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[index builder — denormalized projection]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; F[search index — retrieval and ranking]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; G[customer experience — browse and search]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; H[commerce services — price and availability checks]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; I[content services — product detail pages]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The relational core owns identity and invariants: product ID, SKU, variant relationships, seller ownership, lifecycle state, tax classification references, and other fields where duplication or ambiguity creates operational risk.&lt;/p&gt;
&lt;p&gt;The document layer owns attribute shape: category-specific specs, localized content blocks, merchandising metadata, and optional fields that change faster than the canonical model. This can be a document database, a JSON column, or a structured object store. The key is governance: the document is flexible, but not lawless.&lt;/p&gt;
&lt;p&gt;The search index owns retrieval: tokenized text, facets, ranking signals, autocomplete fields, synonyms, and denormalized category views. It is rebuilt from upstream truth. It can be tuned aggressively because losing or corrupting it should degrade discovery, not corrupt orders.&lt;/p&gt;
&lt;p&gt;This split also clarifies write paths. Merchant edits update the system of record. A change stream or outbox emits catalog events. Index builders create projections for search and browse. Customer-facing product pages can read from a precomputed projection, but checkout-critical decisions still revalidate against authoritative services.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; PostgreSQL documents two catalog-relevant capabilities that are often combined: relational constraints for integrity and &lt;code&gt;jsonb&lt;/code&gt; for semi-structured data, including GIN indexes for querying JSON content. The documented pattern is not “put everything in JSON.” It is that relational and semi-structured fields can coexist when the boundary is deliberate. See the PostgreSQL documentation on JSON types and indexing: &lt;a href=&quot;https://www.postgresql.org/docs/current/datatype-json.html&quot;&gt;https://www.postgresql.org/docs/current/datatype-json.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Keep product identity, variant hierarchy, lifecycle state, and ownership in relational columns and tables. Put category-specific attributes in governed JSON only when they do not define core transactional identity. Validate those JSON documents with application schema checks or database constraints where appropriate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The catalog can evolve attribute families without turning every new merchandising idea into a schema migration, while preserving relational guarantees where duplicate or inconsistent state would break commerce.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; JSON inside a relational database is useful when it extends a relational model. It becomes a liability when it replaces the model’s authority.&lt;/p&gt;
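&lt;p&gt;A minimal sketch of that boundary in PostgreSQL follows. Every table, column, and index name here is hypothetical, and the lifecycle values are placeholders rather than a recommended state model:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Relational core: identity, ownership, and lifecycle are typed and constrained.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE products (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    product_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    sku        text   NOT NULL UNIQUE,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    seller_id  bigint NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    lifecycle  text   NOT NULL CHECK (lifecycle IN (&apos;draft&apos;, &apos;active&apos;, &apos;retired&apos;)),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;    -- Category-specific attributes: flexible, but constrained to be a JSON object.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    attributes jsonb  NOT NULL DEFAULT &apos;{}&apos;::jsonb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;               CHECK (jsonb_typeof(attributes) = &apos;object&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- GIN index so containment queries on attributes can use an index.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX products_attributes_gin ON products USING gin (attributes);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Find products whose attributes contain material = leather.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT product_id, sku&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM products&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE attributes @&gt; &apos;{&quot;material&quot;: &quot;leather&quot;}&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;CHECK&lt;/code&gt; on &lt;code&gt;jsonb_typeof&lt;/code&gt; is deliberately the weakest useful rule; stricter per-category validation would typically live in application schema checks.&lt;/p&gt;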
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Elasticsearch describes its core strength as search over indexed documents, including full-text search, filtering, aggregations, and relevance scoring. The documented behavior is projection-oriented: documents are indexed for retrieval, not normalized for source-of-truth integrity. See Elastic’s guide to mapping and search behavior: &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html&quot;&gt;https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Build the search document as a derived catalog projection. Include names, descriptions, category paths, normalized facets, popularity signals, availability hints, and merchandising boosts. Do not make the search document the final authority for price, inventory, seller eligibility, or compliance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Search can be tuned for relevance and latency without coupling ranking experiments to transactional correctness. If an index build fails, the recovery path is to replay events or rebuild from source, not manually repair business truth inside the index.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Search indexes are excellent read models. They are poor systems of record.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; MongoDB’s public schema design guidance uses product catalogs as a natural fit for document modeling because products in different categories can carry different attribute sets. The documented pattern is flexible representation for heterogeneous entities, not abandoning data ownership. See MongoDB’s data modeling guidance: &lt;a href=&quot;https://www.mongodb.com/docs/manual/data-modeling/&quot;&gt;https://www.mongodb.com/docs/manual/data-modeling/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Use document modeling for product attributes when category diversity is the main source of change. Keep cross-product invariants explicit: identifiers, references, lifecycle state, and integration contracts should remain stable and validated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Attribute-heavy catalogs avoid brittle table explosions, but downstream systems still receive predictable contracts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Document flexibility pays off when the business changes shape faster than the core identity model changes.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Architecture choice&lt;/th&gt;&lt;th&gt;Works well when&lt;/th&gt;&lt;th&gt;Breaks when&lt;/th&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Relational only&lt;/td&gt;&lt;td&gt;Catalog shape is stable and invariants dominate&lt;/td&gt;&lt;td&gt;Category attributes change constantly&lt;/td&gt;&lt;td&gt;EAV tables, nullable sprawl, slow schema evolution&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Document only&lt;/td&gt;&lt;td&gt;Products are heterogeneous and mostly read as whole objects&lt;/td&gt;&lt;td&gt;Checkout correctness depends on embedded mutable fields&lt;/td&gt;&lt;td&gt;Conflicting truth across services&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Search index only&lt;/td&gt;&lt;td&gt;The problem is discovery and ranking&lt;/td&gt;&lt;td&gt;The index becomes authoritative&lt;/td&gt;&lt;td&gt;Orders use stale or denormalized data&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Relational plus document&lt;/td&gt;&lt;td&gt;Core identity is stable but attributes vary&lt;/td&gt;&lt;td&gt;JSON fields are unvalidated&lt;/td&gt;&lt;td&gt;Flexible fields become hidden contracts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Relational plus document plus search&lt;/td&gt;&lt;td&gt;Multiple workloads need different read shapes&lt;/td&gt;&lt;td&gt;Eventing and rebuild paths are weak&lt;/td&gt;&lt;td&gt;Index drift, stale projections, unclear ownership&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The combined model has real cost. You now own propagation, idempotency, rebuilds, schema versioning, and observability across stores. The win is not simplicity of implementation. The win is operational clarity.&lt;/p&gt;
&lt;p&gt;You should be able to answer these questions during an incident:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which store is authoritative for this field?&lt;/li&gt;
&lt;li&gt;Can this projection be rebuilt from upstream state?&lt;/li&gt;
&lt;li&gt;What happens if the search index is ten minutes stale?&lt;/li&gt;
&lt;li&gt;Which fields must be revalidated before checkout?&lt;/li&gt;
&lt;li&gt;Which schema changes require backfills?&lt;/li&gt;
&lt;li&gt;Which consumers are pinned to old document versions?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those answers are unclear, adding more databases will amplify the failure rather than contain it.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Your catalog probably contains multiple workloads hidden behind one noun: product.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Separate the relational core, flexible attribute model, and search projection by ownership and failure behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Use relational constraints for invariants, governed documents for heterogeneous attributes, and rebuildable indexes for discovery.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Audit the top twenty catalog fields by authority, freshness requirement, write owner, read path, and rebuild strategy before changing the storage engine.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>databases</category><category>cloud</category></item><item><title>Partitioning Is Not a Performance Feature by Default</title><link>https://rajivonai.com/blog/2023-08-21-partitioning-is-not-a-performance-feature-by-default/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-08-21-partitioning-is-not-a-performance-feature-by-default/</guid><description>PostgreSQL declarative partitioning only speeds up queries when the partition key appears in the WHERE clause — without it, you get the overhead of many tables with none of the pruning benefit.</description><pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Partitioning a PostgreSQL table does not make queries faster. Partition pruning makes queries faster — and pruning only happens when the query’s WHERE clause includes the partition key.&lt;/strong&gt; Teams partition large tables expecting a general performance improvement, then discover that analytics queries without a date filter now touch every partition instead of one unified table, and the planner overhead makes things worse than before. Partitioning is a data management feature first; it is a performance feature only under specific, verifiable conditions.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL declarative partitioning (introduced in PG10, significantly improved in PG11–PG13) routes rows to child tables based on a partition key — most commonly a date column for time-series data. The mental model engineers carry is usually: “the table is split into smaller pieces, so queries run faster.” That is true only when the planner can eliminate the pieces that are not relevant.&lt;/p&gt;
&lt;p&gt;Teams with large event, audit, order, or log tables encounter partitioning as the recommended solution to table size problems. The recommendation is often correct, but the mechanism is misunderstood. Partitioning helps with archival (you can drop a partition instantly rather than running a DELETE), parallel query (PG11+ can parallelize across partitions), and large-table DDL operations. It does not help — and can hurt — when queries touch all partitions.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;When PostgreSQL receives a query against a partitioned table, it checks whether the planner can eliminate partitions based on the WHERE clause. This is partition pruning. PostgreSQL documents two types: static pruning at planning time (for literal values in the WHERE clause) and runtime pruning during execution (for parameterized queries, available since PG11 with &lt;code&gt;enable_partition_pruning = on&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Pruning requires the WHERE clause to include the partition key with a condition that maps to a subset of partitions. A range-partitioned table on &lt;code&gt;created_at&lt;/code&gt; prunes when you write &lt;code&gt;WHERE created_at &gt;= &apos;2024-01-01&apos; AND created_at &amp;#x3C; &apos;2024-02-01&apos;&lt;/code&gt;. It does not prune when you write &lt;code&gt;WHERE user_id = 12345&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The failure mode: a team partitions an &lt;code&gt;orders&lt;/code&gt; table by &lt;code&gt;created_at&lt;/code&gt; month, creating 36 partitions for three years of data. Most OLTP queries are by &lt;code&gt;order_id&lt;/code&gt; or &lt;code&gt;user_id&lt;/code&gt; — neither of which is the partition key. The planner must now plan against 36 child tables instead of one, generate separate plan nodes for each, and execute the query across all of them. Parallel query on partitions helps only if the query is large enough to benefit from parallelism — for point lookups, it adds overhead without benefit.&lt;/p&gt;
&lt;p&gt;You can verify whether pruning is happening using &lt;code&gt;EXPLAIN&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-03-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-04-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
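&lt;p&gt;The plan should show only the relevant partition(s) under &lt;code&gt;Append&lt;/code&gt; or &lt;code&gt;Merge Append&lt;/code&gt;; on PostgreSQL 12 and later, a single surviving partition may appear as a plain scan with no &lt;code&gt;Append&lt;/code&gt; node above it. If all 36 partitions are listed, pruning did not occur.&lt;/p&gt;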
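&lt;p&gt;The same check can be automated. The sketch below walks a plan in the shape &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt; returns and lists the relations it scans. The sample plan is hand-written and abbreviated, as it might look for a filter spanning two months, and the partition names are hypothetical; a real plan carries many more fields.&lt;/p&gt;

```python
# Hand-written, abbreviated stand-in for EXPLAIN (FORMAT JSON) output.
# The relation names are hypothetical partition names.
SAMPLE_PLAN = [{
    "Plan": {
        "Node Type": "Append",
        "Plans": [
            {"Node Type": "Seq Scan", "Relation Name": "orders_2024_03"},
            {"Node Type": "Seq Scan", "Relation Name": "orders_2024_04"},
        ],
    }
}]


def scanned_relations(node: dict) -> list[str]:
    """Collect every relation name that appears anywhere in a plan tree."""
    found = []
    if "Relation Name" in node:
        found.append(node["Relation Name"])
    for child in node.get("Plans", []):
        found.extend(scanned_relations(child))
    return found


relations = scanned_relations(SAMPLE_PLAN[0]["Plan"])
print(len(relations), relations)
```

&lt;p&gt;Comparing the count against the total number of partitions gives a simple pass or fail signal: a count equal to the total means nothing was pruned.&lt;/p&gt;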
&lt;p&gt;The core question: what conditions must be true for partitioning to improve — rather than degrade — performance?&lt;/p&gt;
&lt;h2 id=&quot;how-partition-pruning-actually-works&quot;&gt;How Partition Pruning Actually Works&lt;/h2&gt;
&lt;p&gt;The planner evaluates partition constraints during planning. For a range partition on &lt;code&gt;created_at&lt;/code&gt;, the constraint is effectively &lt;code&gt;created_at &gt;= lower_bound AND created_at &amp;#x3C; upper_bound&lt;/code&gt;. If the WHERE clause contains a compatible condition on &lt;code&gt;created_at&lt;/code&gt;, the planner eliminates non-matching partitions before execution.&lt;/p&gt;
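&lt;p&gt;A toy model makes the elimination rule concrete. This is not PostgreSQL’s implementation; it is a minimal sketch that assumes monthly range partitions with hypothetical names of the form &lt;code&gt;orders_YYYY_MM&lt;/code&gt; and keeps only the partitions whose bounds overlap the filtered range.&lt;/p&gt;

```python
from datetime import date


def month_partitions(start_year: int, years: int) -> list[tuple[str, date, date]]:
    """Build (name, lower_bound, upper_bound) for monthly range partitions."""
    parts = []
    for i in range(years * 12):
        y, m = start_year + i // 12, i % 12 + 1
        lower = date(y, m, 1)
        upper = date(y + 1, 1, 1) if m == 12 else date(y, m + 1, 1)
        parts.append((f"orders_{y}_{m:02d}", lower, upper))
    return parts


def prune(parts, key_from=None, key_to=None):
    """Keep partitions whose [lower, upper) range overlaps [key_from, key_to).

    With no condition on the partition key, nothing can be eliminated.
    """
    if key_from is None or key_to is None:
        return [name for name, _, _ in parts]
    return [name for name, lower, upper in parts
            if upper > key_from and key_to > lower]


parts = month_partitions(2022, 3)  # 36 partitions, three years of data

# Date-filtered query: only one partition survives.
print(prune(parts, date(2024, 3, 1), date(2024, 4, 1)))  # ['orders_2024_03']

# WHERE user_id = 12345: no condition on created_at, so all 36 remain.
print(len(prune(parts)))  # 36
```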
&lt;p&gt;Two settings control this behavior:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;enable_partition_pruning&lt;/code&gt; (default: &lt;code&gt;on&lt;/code&gt;): enables both static and runtime pruning for declaratively partitioned tables. With it off, the planner stops eliminating partitions and every query scans all of them.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;constraint_exclusion&lt;/code&gt; (default: &lt;code&gt;partition&lt;/code&gt;): controls exclusion based on &lt;code&gt;CHECK&lt;/code&gt; constraints, the mechanism inheritance-based partitioning (pre-PG10 style) uses to skip child tables. Declarative partitioning relies on &lt;code&gt;enable_partition_pruning&lt;/code&gt; instead, so the default can stay as it is; setting this to &lt;code&gt;on&lt;/code&gt; adds planning overhead on non-partitioned tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When partitioning genuinely helps:&lt;/p&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Use case&lt;/th&gt;&lt;th&gt;Why partitioning helps&lt;/th&gt;&lt;th&gt;What to verify&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Time-series archival&lt;/td&gt;&lt;td&gt;Drop old partitions as a quick catalog operation instead of a bulk &lt;code&gt;DELETE&lt;/code&gt;; the drop still takes an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock on the parent, so it can queue behind long-running queries&lt;/td&gt;&lt;td&gt;&lt;code&gt;DROP TABLE orders_2021&lt;/code&gt; completes in milliseconds&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Range-filtered analytics&lt;/td&gt;&lt;td&gt;Prune scans to relevant time window&lt;/td&gt;&lt;td&gt;&lt;code&gt;EXPLAIN&lt;/code&gt; shows only matching partitions in the plan&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Parallel query on large scans&lt;/td&gt;&lt;td&gt;PG11+ can assign workers per partition&lt;/td&gt;&lt;td&gt;&lt;code&gt;EXPLAIN&lt;/code&gt; shows &lt;code&gt;Parallel Append&lt;/code&gt; with multiple workers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Bulk data ingestion&lt;/td&gt;&lt;td&gt;New data lands in the current-period partition, reducing index maintenance scope&lt;/td&gt;&lt;td&gt;Insert throughput measured before and after&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;When partitioning hurts or provides no benefit:&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Pattern&lt;/th&gt;&lt;th&gt;Problem&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Queries filter only on non-partition-key columns&lt;/td&gt;&lt;td&gt;All partitions scanned; planner overhead added&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Default partition exists&lt;/td&gt;&lt;td&gt;The default partition must be scanned whenever the planner cannot prove the requested values fall outside it, so an otherwise pruned plan may still include it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Very high partition count (500+)&lt;/td&gt;&lt;td&gt;Planning time and memory grow with partition count, even when pruning works&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Foreign keys referencing a partitioned table&lt;/td&gt;&lt;td&gt;Foreign key checks must scan all partitions&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s declarative partitioning documentation (postgresql.org/docs/current/ddl-partitioning.html) describes partition pruning as an optimization driven by the partition bounds: the planner can skip a partition only when the query’s WHERE clause constrains the partition key. The documentation also notes that pruning is controlled by &lt;code&gt;enable_partition_pruning&lt;/code&gt; and can happen during execution as well as planning, for example when the partition key is compared to a parameter whose value is not known until run time.&lt;/p&gt;
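&lt;p&gt;PostgreSQL documents dropping or detaching a partition as far faster than a bulk &lt;code&gt;DELETE&lt;/code&gt;: &lt;code&gt;DROP TABLE&lt;/code&gt; removes the child table’s storage files without scanning rows and avoids the &lt;code&gt;VACUUM&lt;/code&gt; work a bulk delete leaves behind. The same documentation notes that the command takes an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock on the parent table. Fast removal of old data is the principal operational benefit of partitioning for time-series data with defined retention policies.&lt;/p&gt;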
&lt;p&gt;PostgreSQL 11’s release notes document the introduction of partition-wise joins and partition-wise aggregation as explicit opt-in settings (&lt;code&gt;enable_partitionwise_join&lt;/code&gt;, &lt;code&gt;enable_partitionwise_aggregate&lt;/code&gt;). These are off by default because they can increase planning time significantly on highly partitioned schemas.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Query lacks partition key in WHERE&lt;/td&gt;&lt;td&gt;All partitions scanned; query may be slower than on a non-partitioned table of the same total size&lt;/td&gt;&lt;td&gt;Planner cannot eliminate any partition; must generate plan nodes for all child tables&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Default partition prevents pruning&lt;/td&gt;&lt;td&gt;Even queries with the partition key may scan the default partition&lt;/td&gt;&lt;td&gt;Planner cannot prove a value is not in the default partition without scanning it&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Partition key does not match primary query access pattern&lt;/td&gt;&lt;td&gt;Partitioning optimizes the wrong dimension; primary key and foreign key lookups cross all partitions&lt;/td&gt;&lt;td&gt;Design decision cannot be undone without a full table rewrite&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Partitioning a table on a date column and then running OLTP queries filtered by user ID or order ID produces a plan that scans all partitions — no pruning, more overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Validate that the most frequent WHERE clause patterns include the partition key before committing to a partitioning scheme; use &lt;code&gt;EXPLAIN&lt;/code&gt; to confirm partition pruning in production-representative queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: &lt;code&gt;EXPLAIN&lt;/code&gt; output for a date-filtered query shows only the relevant partition(s) listed under the &lt;code&gt;Append&lt;/code&gt; node — not all 36.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;EXPLAIN&lt;/code&gt; on the five highest-volume queries against any recently partitioned table and check whether the plan shows one partition or many — if the answer is many, the partitioning key is wrong for those queries.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category></item><item><title>OCI for Oracle-Heavy Enterprises: Migration Pattern, Risk Boundary, and Cost Model</title><link>https://rajivonai.com/blog/2023-08-19-oci-for-oracle-heavy-enterprises-migration-pattern-risk-boundary-and-cost-model/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-08-19-oci-for-oracle-heavy-enterprises-migration-pattern-risk-boundary-and-cost-model/</guid><description>OCI migration risk model for Oracle-heavy enterprises — where the lift-and-shift boundary shifts from the database tier into dependent application contracts.</description><pubDate>Sat, 19 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The expensive OCI migration is not the one where Oracle databases move slowly; it is the one where the enterprise accidentally moves the risk boundary from the database tier into every dependent application at the same time.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Oracle-heavy enterprises rarely start cloud migration from a clean portfolio. They usually start with decades of Oracle Database, RAC, Exadata, Data Guard, RMAN, batch schedulers, ERP integrations, reporting replicas, vendor packages, and operational runbooks that assume stable network topology and known failure behavior.&lt;/p&gt;
&lt;p&gt;That estate creates a different cloud question from a generic replatforming program. The strategic issue is not whether workloads can run on Kubernetes, whether object storage is cheaper than SAN, or whether a new data platform would be more modern. The first-order issue is that the database is already the system of record, the operational contracts are already written around Oracle behavior, and the blast radius of a failed migration includes month-end close, payroll, order capture, tax, inventory, and customer commitments.&lt;/p&gt;
&lt;p&gt;OCI is attractive in this context because it gives Oracle-heavy enterprises a lower-friction target for Oracle Database services, Exadata-based capacity, managed database operations, and multicloud adjacency. But that does not make the migration simple. It changes the shape of the problem: the safest migration is usually not a full-stack rewrite, but a staged relocation of the Oracle control plane with hard gates around latency, licensing, failover, and cost attribution.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Most cloud migration plans fail Oracle estates in one of three ways.&lt;/p&gt;
&lt;p&gt;The first failure mode is treating database migration as an application migration dependency. Teams create a massive dependency graph, declare that app and database tiers must move together, and then discover that every cutover window requires coordinated changes across connection pools, DNS, batch jobs, firewall rules, reporting users, and operational dashboards. The program becomes a release train with database physics attached.&lt;/p&gt;
&lt;p&gt;The second failure mode is underestimating stateful rollback. Stateless services can often redeploy, reroute, or scale out. Oracle databases require point-in-time recovery strategy, redo transport design, replication lag monitoring, backup validation, and a decision about whether the old primary can safely resume writes after a cutover failure.&lt;/p&gt;
&lt;p&gt;The third failure mode is treating cloud cost as a rate-card exercise. For Oracle estates, cost is not just compute, storage, and network. It is license position, Exadata shape, database edition, support model, backup retention, disaster recovery capacity, migration overlap, reserved capacity, and the operational cost of keeping parallel environments alive.&lt;/p&gt;
&lt;p&gt;The question is therefore: how do you move an Oracle-heavy enterprise to OCI without turning the database migration into a full-enterprise outage domain?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The practical architecture is a database-first migration boundary. Move the Oracle estate into an OCI landing zone designed for database operations, keep application movement optional, and use private connectivity to preserve controlled communication between tiers during transition.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[Oracle estate — RAC, Exadata, ERP databases] --&gt; B[Discovery — workload classes]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[Risk boundary — database first]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D[OCI database landing zone — VCN, IAM, keys]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[Migration lane — ZDM, Data Guard, GoldenGate]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; F[Cutover gate — lag, backups, rollback]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; G[Application remap — connection pools and batch]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; H[Cost loop — tags, budgets, unit metrics]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; I[Keep app tier where it runs]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt; J[Private connectivity — FastConnect or interconnect]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  J --&gt; G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The boundary has one rule: only dependencies required for database correctness cross it early. That usually includes identity, networking, key management, backup storage, observability, replication, and runbooks. It does not automatically include every application server, reporting tool, ETL job, or vendor appliance.&lt;/p&gt;
&lt;p&gt;This pattern gives the program three control points.&lt;/p&gt;
&lt;p&gt;First, classify workloads by recoverability, not by org chart. A Tier 0 database with synchronous business impact needs a different lane from a reporting replica. For each database, document RPO, RTO, peak write rate, backup size, maintenance windows, database version, option usage, character set, external directory dependencies, and application connection behavior.&lt;/p&gt;
&lt;p&gt;Second, build the OCI landing zone around operational contracts. The database subnet, route tables, security lists or network security groups, IAM policies, KMS keys, vaults, backup policy, monitoring, DNS, and logging must exist before migration tooling touches production. This is where many programs lose time: they build a cloud account and call it a landing zone, but the database team still cannot answer who can restore, who can rotate keys, who can approve failover, and who gets paged on replication lag.&lt;/p&gt;
&lt;p&gt;Third, treat cutover as a controlled state transition. A safe cutover gate includes validated backup, measured replication lag, application freeze rules, connection drain behavior, rollback authority, post-cutover smoke tests, and a written rule for when rollback is no longer safe because writes have committed on the target.&lt;/p&gt;
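&lt;p&gt;One way to make that gate explicit is to encode it as a check that returns every reason cutover must not proceed. The sketch below is illustrative only: the field names and the 5-second lag threshold are assumptions, not values taken from Oracle tooling.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class CutoverState:
    backup_validated: bool
    replication_lag_seconds: float
    app_writes_frozen: bool
    connections_drained: bool
    rollback_owner: str  # person or role holding rollback authority


def gate_failures(state: CutoverState, max_lag_seconds: float = 5.0) -> list[str]:
    """Return the reasons cutover must not proceed; an empty list means go."""
    failures = []
    if not state.backup_validated:
        failures.append("backup not validated by a test restore")
    if state.replication_lag_seconds > max_lag_seconds:
        failures.append("replication lag above threshold")
    if not state.app_writes_frozen:
        failures.append("application writes not frozen")
    if not state.connections_drained:
        failures.append("connections not drained")
    if not state.rollback_owner:
        failures.append("no rollback authority assigned")
    return failures


# Lag is too high and nobody owns the rollback decision, so the gate stays closed.
state = CutoverState(True, 12.0, True, True, "")
print(gate_failures(state))
```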
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Oracle documents Zero Downtime Migration as a migration utility for moving Oracle databases into Oracle-owned infrastructure, including OCI and Exadata Cloud targets. The documented pattern supports online and offline migration paths, and the offline path can use Object Storage as the intermediate backup location. See Oracle’s &lt;a href=&quot;https://docs.oracle.com/en/database/oracle/zero-downtime-migration/19.7/zdmug/introduction-to-zero-downtime-migration.html&quot;&gt;Zero Downtime Migration documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Use ZDM as the orchestrated migration lane when the source and target meet support requirements. Keep the migration lane separate from the application modernization lane. That means the database team owns replication, backup, restore, and cutover verification, while application teams own connection behavior and functional validation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The result is not literally zero risk; it is a smaller risk boundary. The operational result is that the enterprise can rehearse database movement before committing every application tier to OCI. Failed rehearsals produce database-specific fixes instead of enterprise-wide release delays.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; The documented pattern is that stateful migration needs a migration control plane, not a collection of manual restore steps. ZDM is useful because it makes the migration sequence explicit, but the engineering value comes from the surrounding gates: prechecks, backup validation, lag measurement, and rollback decision points.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Oracle’s Maximum Availability Architecture patterns use technologies such as Data Guard, Active Data Guard, backups, and cross-region deployment to define database availability posture. Oracle’s MAA guidance for Exadata and cloud database services emphasizes role transition, protection mode, and recovery design rather than simple VM placement. See Oracle’s &lt;a href=&quot;https://docs.oracle.com/en/database/oracle/oracle-database/19/haovw/oracle-maximum-availability-architecture-oracle-databaseaws.html&quot;&gt;MAA documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Map each workload to an availability tier before choosing the OCI service shape. A dev database, a reporting standby, a regional ERP database, and a global financial close system should not share the same architecture just because they are all Oracle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The result is a cost and resilience model with visible tradeoffs. Some systems justify Exadata Database Service, cross-region standby, and aggressive recovery objectives. Others are better served by simpler database services, backup-driven recovery, or scheduled migration windows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; The documented pattern is that high availability is an application contract expressed through database topology. OCI does not remove the need to choose protection levels; it makes the cost of each protection level more explicit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Oracle and Microsoft document private interconnection between Azure and OCI through ExpressRoute and FastConnect for cross-cloud Oracle workloads. This matters because many Oracle-heavy enterprises also have application, identity, analytics, or integration tiers in Azure. See Microsoft’s &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/configure-azure-oci-networking&quot;&gt;Azure and OCI networking guidance&lt;/a&gt; and Oracle’s &lt;a href=&quot;https://blogs.oracle.com/cloud-infrastructure/post/overview-of-the-interconnect-between-oracle-and-microsoft&quot;&gt;interconnect overview&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Use private connectivity when the application tier stays outside OCI during the first migration phase. Measure latency and failure behavior under production-like load before declaring the architecture acceptable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The result is a migration path that does not require all application tiers to move on the database cutover date. It also exposes hidden assumptions: chatty SQL access, hardcoded database addresses, batch windows that depend on LAN latency, and reporting jobs that overload the primary.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; The documented pattern is that multicloud adjacency is useful only when latency, routing, DNS, and failover behavior are engineered as first-class production dependencies.&lt;/p&gt;
&lt;h2 id=&quot;cost-model&quot;&gt;Cost Model&lt;/h2&gt;
&lt;p&gt;The useful OCI cost model is not a single monthly estimate. It is a set of cost buckets tied to architectural decisions.&lt;/p&gt;
&lt;p&gt;Start with database capacity: service type, Exadata shape, OCPU allocation, storage, database edition, options, and license model. Then add resilience: standby capacity, cross-region replication, backup retention, recovery service, test restores, and nonproduction environments. Then add network: FastConnect, VPN, interconnect, data transfer, DNS, and observability traffic. Then add migration overlap: source environment, target environment, replication tooling, temporary storage, parallel support, and extended freeze windows.&lt;/p&gt;
&lt;p&gt;The model should produce three numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Steady-state run cost:&lt;/strong&gt; what the estate costs after migration and decommissioning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migration overlap cost:&lt;/strong&gt; what the enterprise pays while both old and new environments run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Risk-reduction cost:&lt;/strong&gt; what is intentionally spent on standby, backup, rehearsal, monitoring, and rollback.&lt;/li&gt;
&lt;/ul&gt;
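&lt;p&gt;A minimal sketch of how the three numbers might be kept separate follows. Every figure is a placeholder in arbitrary currency units, not an OCI price, and the bucket names are assumptions chosen to mirror the categories above.&lt;/p&gt;

```python
# Placeholder monthly figures in arbitrary currency units; none are real OCI prices.
run_buckets = {"database_capacity": 100.0, "network": 10.0, "backup_retention": 15.0}
risk_buckets = {"standby_capacity": 40.0, "test_restores": 5.0, "rehearsals": 8.0}
overlap_buckets = {"source_environment": 90.0, "replication_tooling": 12.0}
overlap_months = 4  # assumed length of parallel running


def cost_summary(run, risk, overlap, months):
    """Keep the three numbers separate so each can be challenged on its own."""
    return {
        "steady_state_per_month": sum(run.values()),
        "risk_reduction_per_month": sum(risk.values()),
        "migration_overlap_total": sum(overlap.values()) * months,
    }


summary = cost_summary(run_buckets, risk_buckets, overlap_buckets, overlap_months)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

&lt;p&gt;Multiplying the overlap bucket by months makes the cost of a slipped cutover date visible: each extra month of parallel running adds the full overlap figure again.&lt;/p&gt;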
&lt;p&gt;OCI Cost Management supports cost analysis, reports, budgets, and scheduled reporting, which makes it suitable for a tagged cost loop rather than a one-time spreadsheet. See Oracle’s &lt;a href=&quot;https://docs.oracle.com/en-us/iaas/Content/Billing/Concepts/costmanagementoverview.htm&quot;&gt;Cost Management overview&lt;/a&gt; and &lt;a href=&quot;https://docs.oracle.com/iaas/Content/Billing/Concepts/FinOps.htm&quot;&gt;FinOps Hub documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Mitigation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application latency surprise&lt;/td&gt;&lt;td&gt;The app tier remains outside OCI but was written for low-latency database access&lt;/td&gt;&lt;td&gt;Run production-like SQL traces and batch tests across the private link before cutover&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rollback ambiguity&lt;/td&gt;&lt;td&gt;Teams do not define when writes make rollback unsafe&lt;/td&gt;&lt;td&gt;Create a written rollback gate with ownership, timing, and data divergence rules&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cost overrun&lt;/td&gt;&lt;td&gt;Source and target run in parallel longer than planned&lt;/td&gt;&lt;td&gt;Track migration overlap as its own cost category with an executive burn-down&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;License confusion&lt;/td&gt;&lt;td&gt;Database options and editions are not inventoried before sizing&lt;/td&gt;&lt;td&gt;Run option usage discovery and map license position before target architecture selection&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Standby underdesign&lt;/td&gt;&lt;td&gt;DR is copied from on-premises without validating cloud failure domains&lt;/td&gt;&lt;td&gt;Assign each workload an RPO and RTO tier, then design standby topology from that contract&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tooling optimism&lt;/td&gt;&lt;td&gt;ZDM or replication tooling is treated as the whole plan&lt;/td&gt;&lt;td&gt;Pair migration tooling with rehearsals, observability, backup validation, and cutover authority&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Oracle estates fail cloud migration when the database move becomes coupled to every application and operational dependency at once.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Put OCI behind a database-first risk boundary, migrate Oracle systems through explicit lanes, and keep application movement optional until latency and cutover behavior are proven.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Use documented Oracle migration, availability, interconnect, and cost-management patterns rather than invented transformation stories.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Inventory workload tiers, build the OCI database landing zone, rehearse one representative migration per tier, publish the rollback gate, and track steady-state, overlap, and risk-reduction cost separately.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>databases</category><category>cloud</category></item><item><title>Deadlocks vs Blocking: The Difference Engineers Miss</title><link>https://rajivonai.com/blog/2023-07-31-deadlocks-vs-blocking-the-difference-engineers-miss/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-07-31-deadlocks-vs-blocking-the-difference-engineers-miss/</guid><description>Blocking and deadlocks are two distinct failure modes that require opposite responses — confusing them leads to retry logic that doesn&apos;t help and investigations that point at the wrong cause.</description><pubDate>Mon, 31 Jul 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Deadlocks and blocking look similar in a dashboard — queries stuck, latency climbing, transactions piling up — but the database resolves them differently, and so must you.&lt;/strong&gt; Adding retry logic when you have a blocking problem won’t help. Investigating lock contention when you have a long-running transaction holding locks will send you down the wrong path entirely. These are two distinct failure modes. Treating them as one is how engineers waste hours in incident response.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Row-level locking is how relational databases protect concurrent writes. Any transaction that modifies a row acquires a lock on it; others that need the same row wait. This is expected behavior — not a bug — and for most workloads it resolves quickly as transactions commit or roll back.&lt;/p&gt;
&lt;p&gt;Lock problems surface when that assumption breaks: a transaction holds a lock longer than expected, two transactions each wait for what the other holds, or a missing index forces the database to lock far more rows than necessary. The symptoms look similar from the outside — stalled queries, timeouts, connection pool pressure — but the causes and correct responses are completely different.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Engineers see “lock wait timeout exceeded” or a deadlock error, conclude there is a locking problem, and apply whatever fix they read about most recently — retry logic, a &lt;code&gt;lock_timeout&lt;/code&gt; change, an index. Any of those might be wrong for the actual problem present.&lt;/p&gt;
&lt;p&gt;Blocking and deadlocks have different root causes, different detection mechanisms, and different remediation paths. Applying deadlock fixes to a blocking problem — or vice versa — obscures the real signal and delays finding the actual cause.&lt;/p&gt;
&lt;p&gt;The core question: given a stalled transaction or a lock error, how do you determine which condition you have, and what do you do about each one?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;These are not the same condition expressed at different severity levels. They are structurally different.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blocking&lt;/strong&gt; is one transaction waiting for a lock held by another. The waiter sits until the holder commits or rolls back; the database does not resolve the wait on its own. PostgreSQL waits indefinitely unless &lt;code&gt;lock_timeout&lt;/code&gt; or &lt;code&gt;statement_timeout&lt;/code&gt; is set, while MySQL InnoDB gives up after &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; (50 seconds by default). The fix is almost always about the holder: find it, understand why it’s holding the lock longer than expected, and address that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A deadlock&lt;/strong&gt; is a cycle. Transaction A holds lock X and waits for lock Y. Transaction B holds lock Y and waits for lock X. Neither can proceed. PostgreSQL and MySQL InnoDB detect this automatically via a wait-for graph, pick one transaction as the victim, and terminate it — the other proceeds. Deadlocks resolve themselves; the application must handle the error and retry. The fix is about eliminating the cycle, typically by acquiring locks in a consistent order across transactions.&lt;/p&gt;
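&lt;p&gt;Handling the victim’s error usually means re-running the whole transaction, not just the failed statement. The sketch below shows one possible shape for that retry. The &lt;code&gt;DeadlockDetected&lt;/code&gt; class is a stand-in defined here for illustration; real drivers raise their own error types, and the attempt count and delay are assumptions.&lt;/p&gt;

```python
import random
import time


class DeadlockDetected(Exception):
    """Stand-in for a driver-specific deadlock error (hypothetical)."""


def run_with_deadlock_retry(txn, attempts: int = 3, base_delay: float = 0.05):
    """Re-run the whole transaction when it is chosen as the deadlock victim."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == attempts:
                raise
            # Jittered backoff so the two transactions do not collide in lockstep.
            time.sleep(base_delay * attempt * random.random())


attempts_seen = []


def transfer():
    """Fake transaction that is the deadlock victim on its first attempt only."""
    attempts_seen.append(1)
    if len(attempts_seen) == 1:
        raise DeadlockDetected("chosen as deadlock victim")
    return "committed"


print(run_with_deadlock_retry(transfer), len(attempts_seen))  # committed 2
```

&lt;p&gt;A wrapper like this does nothing for blocking: a blocked transaction raises no deadlock error, so there is nothing to catch until a timeout fires.&lt;/p&gt;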
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    subgraph Blocking [Blocking — Linear Wait]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        T1[Transaction A] --&gt;|Holds Lock| R1[Row 1]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        T2[Transaction B] --&gt;|Waits for Lock| R1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    end&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    subgraph Deadlock [Deadlock — Circular Wait]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        T3[Transaction C] --&gt;|Holds Lock| R2[Row 2]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        T4[Transaction D] --&gt;|Holds Lock| R3[Row 3]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        T3 --&gt;|Waits for Lock| R3&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        T4 --&gt;|Waits for Lock| R2&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    end&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
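&lt;p&gt;A minimal sketch of consistent lock ordering in PostgreSQL, assuming a hypothetical &lt;code&gt;accounts(id, balance)&lt;/code&gt; table; the table name, ids, and amounts are illustrative only. Every transfer locks the rows it will touch in ascending &lt;code&gt;id&lt;/code&gt; order before updating them, so two concurrent transfers between the same accounts queue behind each other instead of forming a cycle:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table: accounts(id, balance)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Lock both rows in ascending id order, whatever the transfer direction&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT id FROM accounts WHERE id IN (17, 42) ORDER BY id FOR UPDATE;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- The updates can now run in any order: both row locks are already held&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UPDATE accounts SET balance = balance - 100 WHERE id = 42;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UPDATE accounts SET balance = balance + 100 WHERE id = 17;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The order itself is arbitrary. What matters is that every code path touching these rows uses the same one.&lt;/p&gt;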
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Blocking&lt;/th&gt;&lt;th&gt;Deadlock&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Cause&lt;/td&gt;&lt;td&gt;One transaction holds a lock another needs&lt;/td&gt;&lt;td&gt;Two transactions each wait for what the other holds&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Resolution&lt;/td&gt;&lt;td&gt;Manual — requires the holder to commit or roll back&lt;/td&gt;&lt;td&gt;Automatic — database detects the cycle and kills one victim&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Error surfaced&lt;/td&gt;&lt;td&gt;&lt;code&gt;lock_timeout&lt;/code&gt; if configured; otherwise the query just waits&lt;/td&gt;&lt;td&gt;Explicit deadlock error (PostgreSQL: &lt;code&gt;ERROR: deadlock detected&lt;/code&gt;; MySQL: &lt;code&gt;ERROR 1213: Deadlock found&lt;/code&gt;)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Correct response&lt;/td&gt;&lt;td&gt;Find and address the long-running transaction&lt;/td&gt;&lt;td&gt;Handle the error in the application; fix lock ordering&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Where to look&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt; (PostgreSQL); &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; (MySQL)&lt;/td&gt;&lt;td&gt;PostgreSQL server log; MySQL &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL detection:&lt;/strong&gt; &lt;code&gt;pg_stat_activity&lt;/code&gt; surfaces every session currently blocked on a lock via &lt;code&gt;SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE wait_event_type = &apos;Lock&apos;;&lt;/code&gt;. That query lists the waiters only; &lt;code&gt;pg_blocking_pids(pid)&lt;/code&gt; returns the sessions holding the locks each waiter needs. Deadlocks are logged at &lt;code&gt;ERROR&lt;/code&gt; level in the server log.&lt;/p&gt;
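&lt;p&gt;To go from waiter to holder in one step, the waiting sessions can be joined to the sessions blocking them. This is a sketch that assumes PostgreSQL 9.6 or later, where &lt;code&gt;pg_blocking_pids()&lt;/code&gt; is available; the column aliases are illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Pair each waiting session with the session(s) blocking it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  waiter.pid                AS waiting_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  waiter.query              AS waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  holder.pid                AS blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  holder.state              AS blocking_state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  holder.query              AS blocking_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - holder.xact_start AS blocking_xact_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity AS waiter&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_stat_activity AS holder&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON holder.pid = ANY (pg_blocking_pids(waiter.pid))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE waiter.wait_event_type = &apos;Lock&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY blocking_xact_age DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A holder whose &lt;code&gt;blocking_state&lt;/code&gt; is &lt;code&gt;idle in transaction&lt;/code&gt; is usually an application that opened a transaction and never finished it.&lt;/p&gt;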
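&lt;p&gt;&lt;strong&gt;MySQL InnoDB detection:&lt;/strong&gt; &lt;code&gt;SHOW ENGINE INNODB STATUS\G&lt;/code&gt; includes a &lt;code&gt;LATEST DETECTED DEADLOCK&lt;/code&gt; section showing the two transactions, the locks held and waited for, and which was rolled back as the victim. For blocking, &lt;code&gt;sys.innodb_lock_waits&lt;/code&gt; shows live lock waits; in MySQL 8.0 it is backed by &lt;code&gt;performance_schema.data_lock_waits&lt;/code&gt;. The older &lt;code&gt;information_schema.INNODB_LOCK_WAITS&lt;/code&gt; table exists only in MySQL 5.7 and earlier and was removed in 8.0.&lt;/p&gt;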
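&lt;p&gt;A sketch of the equivalent waiter-to-holder lookup on MySQL, assuming the &lt;code&gt;sys&lt;/code&gt; schema is installed (it ships by default with MySQL 5.7 and 8.0):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Live InnoDB lock waits, longest wait first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wait_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  locked_table,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  waiting_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  blocking_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM sys.innodb_lock_waits&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY wait_age DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;blocking_query&lt;/code&gt; of NULL typically means the holder is idle inside an open transaction, which points at the application and not at a slow statement.&lt;/p&gt;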
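&lt;p&gt;&lt;strong&gt;Lock timeout vs deadlock detection&lt;/strong&gt; are separate mechanisms. &lt;code&gt;lock_timeout&lt;/code&gt; (PostgreSQL) and &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; (MySQL) abort a waiting statement after a configured interval; that is a timeout, not a deadlock. Deadlock detection runs on the server side independently of those settings (InnoDB lets you switch it off with &lt;code&gt;innodb_deadlock_detect&lt;/code&gt;, in which case only the timeout breaks a cycle). With detection enabled, a blocking event terminated by a timeout was never a deadlock, and the error codes differ accordingly: PostgreSQL raises SQLSTATE &lt;code&gt;55P03&lt;/code&gt; for a lock timeout and &lt;code&gt;40P01&lt;/code&gt; for a deadlock; MySQL raises error 1205 for a lock wait timeout and 1213 for a deadlock.&lt;/p&gt;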
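&lt;p&gt;A short PostgreSQL sketch of the timeout path, assuming a hypothetical &lt;code&gt;orders(id, status)&lt;/code&gt; table and an illustrative 5-second limit:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Bound how long this session waits for any lock&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET lock_timeout = &apos;5s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- If another session holds a conflicting row lock for more than 5 seconds,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- this fails with SQLSTATE 55P03 (lock_not_available), not 40P01 (deadlock_detected)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UPDATE orders SET status = &apos;shipped&apos; WHERE id = 1001;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After a failure the transaction is aborted, so COMMIT acts as ROLLBACK&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting the value per session or per role keeps the limit away from maintenance jobs that legitimately wait longer.&lt;/p&gt;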
&lt;p&gt;&lt;strong&gt;Row-level vs table-level locking:&lt;/strong&gt; missing indexes widen the lock footprint. InnoDB has no lock escalation, but it locks every row it scans: under the default REPEATABLE READ isolation level, a &lt;code&gt;DELETE WHERE status = &apos;pending&apos;&lt;/code&gt; without an index on &lt;code&gt;status&lt;/code&gt; scans the whole table and locks every row along the way, not only the matching ones. The effect is close to a table lock, turning a narrow delete into a blocking event for every other writer on that table.&lt;/p&gt;
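&lt;p&gt;One way to check and narrow the lock footprint on MySQL, sketched against a hypothetical &lt;code&gt;jobs(id, status)&lt;/code&gt; table; the table and index names are illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Before: type = ALL in the plan means a full scan, so every row gets locked&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN DELETE FROM jobs WHERE status = &apos;pending&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Add the index so InnoDB locks only the index range it actually reads&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX idx_jobs_status ON jobs (status);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After: the key column of the plan should show idx_jobs_status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN DELETE FROM jobs WHERE status = &apos;pending&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;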
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s explicit locking documentation states the behavior directly: “PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the transactions involved, allowing the other(s) to complete.” It recommends consistent lock ordering as the main defense against deadlocks (&lt;a href=&quot;https://www.postgresql.org/docs/current/explicit-locking.html&quot;&gt;https://www.postgresql.org/docs/current/explicit-locking.html&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;MySQL’s InnoDB deadlock documentation draws a sharp distinction from lock wait timeouts: by default a lock wait timeout rolls back only the current SQL statement (the whole transaction only when &lt;code&gt;innodb_rollback_on_timeout&lt;/code&gt; is enabled), whereas a deadlock detection event rolls back the entire transaction (&lt;a href=&quot;https://dev.mysql.com/doc/refman/8.0/en/innodb-deadlocks.html&quot;&gt;https://dev.mysql.com/doc/refman/8.0/en/innodb-deadlocks.html&lt;/a&gt;). That distinction matters for application retry logic: a partial statement rollback and a full transaction rollback require different recovery paths.&lt;/p&gt;
&lt;p&gt;The documented pattern from both systems: deadlock handling belongs in the application layer with a full-transaction retry. Blocking calls for operational investigation — find the long-running holder and address it at source.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;ORM batch inserts without consistent row ordering&lt;/td&gt;&lt;td&gt;Deadlocks under concurrent batch operations&lt;/td&gt;&lt;td&gt;Two batches inserting the same rows in different orders create a lock cycle; the ORM doesn’t guarantee insertion order&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Missing index on a filtered column used in writes&lt;/td&gt;&lt;td&gt;Blocking affects far more writers than the contended rows justify&lt;/td&gt;&lt;td&gt;InnoDB locks every row it scans, so without an index the write locks far more than the matching rows; PostgreSQL locks only the rows it modifies, but the full scan keeps the transaction and its locks open longer&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Connection pool holding open transactions&lt;/td&gt;&lt;td&gt;Long-running blocking events that appear intermittent&lt;/td&gt;&lt;td&gt;Idle connections holding uncommitted transactions keep locks live; the blocking appears random because it follows the pool’s transaction lifecycle, not the application’s&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Engineers apply the wrong fix because blocking and deadlocks produce similar symptoms but have structurally different causes and resolution paths.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Identify which condition you have first — use &lt;code&gt;pg_stat_activity&lt;/code&gt; or &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; to determine whether a lock cycle or a long-running holder is the root cause — then respond accordingly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: If &lt;code&gt;pg_stat_activity&lt;/code&gt; shows a session with &lt;code&gt;wait_event_type = &apos;Lock&apos;&lt;/code&gt; and &lt;code&gt;pg_blocking_pids()&lt;/code&gt; points to a holder that is not itself waiting, you have blocking. If the PostgreSQL log shows &lt;code&gt;ERROR: deadlock detected&lt;/code&gt; or MySQL reports a deadlock in &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;, you have a deadlock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, add &lt;code&gt;lock_timeout = &apos;5s&apos;&lt;/code&gt; (PostgreSQL) or lower &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; (MySQL) to surface blocking events that would otherwise wait silently, and confirm your application explicitly handles the &lt;code&gt;40P01&lt;/code&gt; error code (PostgreSQL deadlock) or error 1213 (MySQL deadlock) with a retry path that reruns the whole transaction.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>failures</category></item><item><title>Logical Replication Failure Workflow</title><link>https://rajivonai.com/blog/2023-07-17-logical-replication-failure-workflow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-07-17-logical-replication-failure-workflow/</guid><description>A diagnostic runbook for logical replication lag, apply worker failures, replication conflicts, and schema drift between publisher and subscriber.</description><pubDate>Mon, 17 Jul 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Logical replication lag does not announce itself with an error message — it accumulates silently in the WAL retention on the publisher, and the subscriber falls further and further behind until either the replication slot fills the disk or you notice the data is hours stale.&lt;/strong&gt; Unlike streaming replication, which breaks loudly, logical replication degrades quietly: the subscription stays connected, the apply worker reports running, and the divergence grows until something downstream catches it.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL logical replication works by decoding WAL changes on the publisher into a row-level change stream, which the subscriber applies table by table. This is fundamentally different from physical replication, which ships binary WAL blocks. Logical replication lets you replicate subsets of tables, replicate across major versions, and fan out to multiple subscribers — but it introduces failure modes that streaming replication does not have.&lt;/p&gt;
&lt;p&gt;The most common operational problems: a subscription falls behind because the apply worker hit a conflict (for example, an insert arriving for a key that already exists on the subscriber); the subscription is technically active but the apply worker is stalled waiting for a lock; the publisher and subscriber diverge on schema, causing the apply worker to crash with a missing-column or type mismatch error; or the replication slot on the publisher accumulates enough unreleased WAL to fill the disk.&lt;/p&gt;
&lt;p&gt;The diagnostic workflow must cover all four of these. They share symptoms but have different root causes and different remediations.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Increasing lag between publisher and subscriber&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_replication_slots.confirmed_flush_lsn&lt;/code&gt; vs &lt;code&gt;pg_current_wal_lsn()&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Apply worker not keeping up; lag in bytes growing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replication slot holding excessive WAL&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_replication_slots&lt;/code&gt;: slot not advancing&lt;/td&gt;&lt;td&gt;Subscriber disconnected or stalled; disk risk if slot persists&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Apply worker process absent from &lt;code&gt;pg_stat_subscription&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt;, &lt;code&gt;pg_stat_subscription&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Apply worker crashed; check PostgreSQL error log&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Table not reaching state &lt;code&gt;r&lt;/code&gt; (ready) in &lt;code&gt;pg_subscription_rel&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_subscription_rel.srsubstate&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Initial sync for that table is stalled or failing; the catalog has no error state, so check the log for a conflict or schema mismatch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Error message in logs: “conflict in logical replication”&lt;/td&gt;&lt;td&gt;&lt;code&gt;postgresql.log&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Row-level conflict on insert, update, or delete&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Schema-related error in logs: “column X of relation Y does not exist”&lt;/td&gt;&lt;td&gt;&lt;code&gt;postgresql.log&lt;/code&gt;&lt;/td&gt;&lt;td&gt;DDL executed on publisher without matching DDL on subscriber&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Replication lag in bytes&lt;/strong&gt; — the most immediate measure of how far behind the subscriber is:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the publisher&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  slot_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  plugin,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  confirmed_flush_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_current_wal_lsn() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; confirmed_flush_lsn &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; lag_bytes,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_current_wal_lsn() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; confirmed_flush_lsn) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; lag_human&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; slot_type &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;logical&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A growing &lt;code&gt;lag_bytes&lt;/code&gt; means the subscriber is not confirming changes as fast as the publisher generates them. Read it together with &lt;code&gt;active&lt;/code&gt;: an inactive slot (no connected subscriber) keeps retaining WAL and is a disk risk unless &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; caps it (PostgreSQL 13+), while an active slot with growing lag means the apply worker is connected but falling behind.&lt;/p&gt;
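&lt;p&gt;Disk risk is driven by how much WAL the slot forces the publisher to keep, which is measured from &lt;code&gt;restart_lsn&lt;/code&gt;, not &lt;code&gt;confirmed_flush_lsn&lt;/code&gt;. A sketch of that check, assuming PostgreSQL 13 or later for the &lt;code&gt;wal_status&lt;/code&gt; and &lt;code&gt;safe_wal_size&lt;/code&gt; columns:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the publisher&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  slot_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wal_status,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained_wal,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(safe_wal_size) AS safe_wal_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE slot_type = &apos;logical&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;safe_wal_size&lt;/code&gt; is NULL when &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; is unset, meaning nothing limits how much WAL the slot can retain.&lt;/p&gt;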
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Subscription status&lt;/strong&gt; — verify the subscription is enabled and the apply worker is running:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  subname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  subenabled,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  subpublications,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  subconninfo&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_subscription;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
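&lt;p&gt;&lt;code&gt;subenabled = false&lt;/code&gt; means the subscription is disabled, either manually or, on PostgreSQL 15 and later, automatically after an apply error if it was created with &lt;code&gt;disable_on_error = true&lt;/code&gt;. It will not apply changes until re-enabled. A forgotten manual disable is the most common cause of lag that looks like a network issue but is actually an administrative action.&lt;/p&gt;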
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Per-table replication state&lt;/strong&gt; — identify which tables are in which state:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  srrelid::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  srsubstate,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  srsublsn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_subscription_rel&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; srsubstate;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;State codes: &lt;code&gt;i&lt;/code&gt; = initialize, &lt;code&gt;d&lt;/code&gt; = data copy in progress, &lt;code&gt;f&lt;/code&gt; = finished table copy, &lt;code&gt;s&lt;/code&gt; = synchronized, &lt;code&gt;r&lt;/code&gt; = ready. The catalog has no error state: a table whose sync keeps failing stays in &lt;code&gt;i&lt;/code&gt; or &lt;code&gt;d&lt;/code&gt; while the worker restarts, so check the error log for the specific conflict or error. A table stuck in state &lt;code&gt;d&lt;/code&gt; for an extended period means the initial data copy is running slowly, stalled, or repeatedly failing.&lt;/p&gt;
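&lt;p&gt;To narrow the output to tables that need attention, and to see whether errors are accumulating, something like the following can help. It is a sketch; the second query assumes PostgreSQL 15 or later, where &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt; was introduced:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber: tables whose initial sync has not reached ready&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT srrelid::regclass AS tablename, srsubstate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_subscription_rel&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE srsubstate != &apos;r&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL 15+: cumulative apply and table-sync error counts per subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT subname, apply_error_count, sync_error_count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_subscription_stats;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A counter that keeps climbing between two runs indicates a worker stuck in a fail-and-restart loop.&lt;/p&gt;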
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Apply worker activity&lt;/strong&gt; — check what the apply worker is currently doing:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the publisher: one row per walsender serving a connected subscriber&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  client_addr,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  sent_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  write_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  flush_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  replay_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; backend_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; worker_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber: check the subscription worker directly&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  subname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  received_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_msg_send_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_msg_receipt_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  latest_end_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  latest_end_time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_subscription;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;pid&lt;/code&gt; that is NULL in &lt;code&gt;pg_stat_subscription&lt;/code&gt; means no worker is running for that subscription. Check the PostgreSQL log for the crash reason.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Error log review&lt;/strong&gt; — the log contains the exact conflict type and LSN:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Find conflict-related errors in the PostgreSQL log&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -E&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;ERROR|conflict|replication&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/log/postgresql/postgresql.log&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; tail&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -50&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# More targeted&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;grep&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;logical replication&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/log/postgresql/postgresql.log&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; tail&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -20&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
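&lt;p&gt;The log will contain lines like &lt;code&gt;ERROR: duplicate key value violates unique constraint&lt;/code&gt;, which identify the conflict type. On PostgreSQL 15 and later, the accompanying &lt;code&gt;CONTEXT&lt;/code&gt; line ends with &lt;code&gt;finished at&lt;/code&gt; followed by the LSN of the failed transaction, which is the value needed for the &lt;code&gt;SKIP&lt;/code&gt; remediation in Option 1 below. Updates and deletes that target a row missing on the subscriber do not raise an error: PostgreSQL skips them, so they surface as data divergence, not as a stopped worker.&lt;/p&gt;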
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Logical replication lag growing] --&gt; B{Subscription enabled?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| C[ALTER SUBSCRIPTION sub ENABLE]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| D{Apply worker running?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no, pid null| E[Check server log and pg_subscription_rel]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; F{Apply error in log?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|yes| G{Conflict type?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|insert conflict| H[ALTER SUBSCRIPTION sub SKIP lsn]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|update violating a constraint| I[ALTER SUBSCRIPTION sub SKIP lsn]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|schema mismatch| J[Apply DDL to subscriber — re-enable]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes — worker running| K{Lag growing despite active worker?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|yes| L{Publisher write rate too high?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| M[Tune max_logical_replication_workers]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| N{Lock wait on subscriber?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    N --&gt;|yes| O[Identify blocking query on subscriber]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    N --&gt;|no| P[Check network throughput publisher to subscriber]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|no — stuck in data copy| Q[Check disk and I/O on subscriber]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
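&lt;p&gt;The “Lock wait on subscriber?” branch can be checked with a query along these lines. It is a sketch: the &lt;code&gt;LIKE&lt;/code&gt; pattern is an assumption chosen to match the worker &lt;code&gt;backend_type&lt;/code&gt; values across versions, since the exact names changed in PostgreSQL 17:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber: is a replication worker waiting on a lock, and on whom?&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  w.pid                   AS worker_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  w.backend_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  w.wait_event_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  w.wait_event,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_blocking_pids(w.pid) AS blocked_by&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity AS w&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE w.backend_type LIKE &apos;logical replication%&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND w.wait_event_type = &apos;Lock&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any pid in &lt;code&gt;blocked_by&lt;/code&gt; is a local session on the subscriber, often a long-running report or a manual DDL, holding a lock the worker needs.&lt;/p&gt;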
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Skip a conflicting transaction&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When the apply worker fails due to a row conflict, typically an insert or update that violates a unique or other constraint on the subscriber, the resolution is to identify the finish LSN of the conflicting transaction and skip it. An update or delete targeting a row that does not exist on the subscriber does not stop the worker, because PostgreSQL skips such changes on its own.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the subscriber: received_lsn is only a cross-check; the LSN to skip is the finish LSN from the server log&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; received_lsn &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; subname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;my_subscription&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Skip the conflicting transaction (PostgreSQL 15+)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUBSCRIPTION my_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SKIP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (lsn &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;LSN_VALUE&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- For PostgreSQL 14 and earlier, use:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_replication_origin_advance(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;pg_16399&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;LSN_VALUE&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- where 16399 is the subscription OID from pg_subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After skipping, re-enable the subscription if it was auto-disabled:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUBSCRIPTION my_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ENABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The skipped transaction is permanently lost on the subscriber, including any changes in it that did not conflict. Before skipping, verify the row conflict is expected; for example, the subscriber already has the correct version of that row through another path. If data integrity is critical, investigate why the divergence occurred before skipping blindly.&lt;/p&gt;
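&lt;p&gt;For the PostgreSQL 14 path, the origin name does not have to be assembled by hand. A sketch that derives it from the catalog and shows the progress currently recorded for each origin, using &lt;code&gt;my_subscription&lt;/code&gt; as a placeholder name:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber: replication origin name for a subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT subname, &apos;pg_&apos; || oid AS origin_name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE subname = &apos;my_subscription&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Progress currently recorded for each origin&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT external_id, remote_lsn, local_lsn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_replication_origin_status;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Comparing &lt;code&gt;remote_lsn&lt;/code&gt; before and after the skip confirms that the origin actually moved past the failed transaction.&lt;/p&gt;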
&lt;p&gt;&lt;strong&gt;Option 2 — Resync after schema drift&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When a schema change (DDL) was applied to the publisher without also being applied to the subscriber, the apply worker will crash with a column or type mismatch error. The fix is to apply the matching DDL to the subscriber, then re-enable the subscription:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the subscriber: apply the matching DDL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN shipped_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;timestamptz&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Re-enable the subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUBSCRIPTION my_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ENABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify lag starts recovering&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_size_pretty(pg_current_wal_lsn() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; confirmed_flush_lsn)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; slot_name &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;my_subscription&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- check on publisher&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Logical replication does not replicate DDL. Every schema change on the publisher must be manually applied to the subscriber in the correct order before re-enabling the subscription.&lt;/p&gt;
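&lt;p&gt;One way to catch drift before the apply worker does is to run the same column listing on both sides and diff the output. This is a minimal sketch, assuming the replicated tables live in the &lt;code&gt;public&lt;/code&gt; schema; adjust the filter if yours do not. Extra columns on the subscriber are harmless, so the rows to look for are columns that exist on the publisher but not on the subscriber, or that differ in type.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the publisher and on the subscriber, then diff the two result sets&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT table_name, column_name, data_type, is_nullable&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM information_schema.columns&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE table_schema = &apos;public&apos;  -- assumption: replicated tables are in public&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY table_name, column_name;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;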
&lt;p&gt;&lt;strong&gt;Option 3 — Full resync of a specific table&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When the data divergence is too large to resolve by skipping individual transactions, resync the affected table:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the subscriber: pick up tables newly added to the publication (tables already in sync are not re-copied)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUBSCRIPTION my_subscription REFRESH PUBLICATION &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (copy_data &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Or drop and recreate with initial data copy&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUBSCRIPTION my_subscription &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DISABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUBSCRIPTION my_subscription;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUBSCRIPTION my_subscription&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  CONNECTION&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;host=publisher port=5432 dbname=mydb&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  PUBLICATION my_publication&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WITH&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (copy_data &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true, create_slot &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
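&lt;p&gt;A full resync re-copies all data for the subscribed tables, which can take hours on large tables. Truncate the affected tables on the subscriber before recreating the subscription; otherwise the initial copy collides with the rows already there and fails on primary key or unique constraints. Until the copy finishes, the subscriber is incomplete and inconsistent, so downstream applications that read from it should be redirected or told that the data is being rebuilt.&lt;/p&gt;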
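&lt;p&gt;For a single diverged table, a commonly used alternative to recreating the whole subscription is to remove the table from the publication and add it back, which makes the subscriber copy it again. The following is a sketch only, under these assumptions: &lt;code&gt;my_publication&lt;/code&gt; lists its tables explicitly (it was not created &lt;code&gt;FOR ALL TABLES&lt;/code&gt;), &lt;code&gt;orders&lt;/code&gt; is the diverged table, and no foreign keys on the subscriber reference it, so it can be truncated.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- 1. On the publisher: stop publishing the table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER PUBLICATION my_publication DROP TABLE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- 2. On the subscriber: drop the table from the subscription, then empty it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER SUBSCRIPTION my_subscription REFRESH PUBLICATION;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TRUNCATE TABLE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- 3. On the publisher: publish the table again&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER PUBLICATION my_publication ADD TABLE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- 4. On the subscriber: register it as a new table and copy its data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER SUBSCRIPTION my_subscription REFRESH PUBLICATION WITH (copy_data = true);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- 5. On the subscriber: watch srsubstate move to &apos;r&apos; (ready)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT srrelid::regclass AS table_name, srsubstate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_subscription_rel&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE srrelid = &apos;orders&apos;::regclass;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The other tables in the subscription keep replicating throughout, which is the main reason to prefer this over dropping the subscription when only one table has diverged.&lt;/p&gt;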
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
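&lt;li&gt;&lt;code&gt;ALTER SUBSCRIPTION sub ENABLE&lt;/code&gt; and &lt;code&gt;DISABLE&lt;/code&gt; are immediately reversible; toggle between them as needed. No data is lost, but the publisher keeps retaining WAL for the slot while the subscription is disabled, so do not leave it off indefinitely.&lt;/li&gt;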
&lt;li&gt;&lt;code&gt;ALTER SUBSCRIPTION sub SKIP (lsn)&lt;/code&gt; is irreversible — the skipped transaction is permanently lost on the subscriber. There is no undo. The only recovery if the skipped data was needed is a full table resync.&lt;/li&gt;
&lt;li&gt;DDL applied to the subscriber for schema drift: cannot be automatically undone — but the DDL itself can be reversed (e.g., &lt;code&gt;ALTER TABLE DROP COLUMN&lt;/code&gt;) if the column is not yet populated. Coordinate DDL rollback with the publisher-side change.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DROP SUBSCRIPTION&lt;/code&gt; followed by &lt;code&gt;CREATE SUBSCRIPTION&lt;/code&gt;: dropping a subscription removes the replication slot on the publisher. The slot must be recreated (it happens automatically with &lt;code&gt;create_slot = true&lt;/code&gt;). Once dropped, WAL that was retained for the old slot is released.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Replication lag monitoring should be a first-class alert, not a periodic check. The key metric is the byte lag at the replication slot:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Scheduled query to capture slot lag for alerting&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; cron&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schedule&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;replication-lag-monitor&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;*/5 * * * *&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; ops&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;replication_lag&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (slot_name, lag_bytes, active, captured_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    slot_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    pg_current_wal_lsn() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; confirmed_flush_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_replication_slots&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; slot_type &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;logical&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$$);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alert thresholds: lag exceeding 1 GB warrants a warning; lag exceeding 10 GB is an incident — the publisher is retaining that much WAL, and disk exhaustion is a real risk. A slot that becomes &lt;code&gt;active = false&lt;/code&gt; for more than 5 minutes outside a maintenance window should page immediately.&lt;/p&gt;
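&lt;p&gt;The thresholds above can be applied directly to the captured samples. A minimal sketch, assuming the &lt;code&gt;ops.replication_lag&lt;/code&gt; table from the scheduled job exists with a numeric &lt;code&gt;lag_bytes&lt;/code&gt; column and a &lt;code&gt;captured_at&lt;/code&gt; timestamp; the severity labels are illustrative, and 1 GB is taken as 1024&lt;sup&gt;3&lt;/sup&gt; bytes:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Latest sample per slot, classified with the 1 GB and 10 GB thresholds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT DISTINCT ON (slot_name)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  slot_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  lag_bytes,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  active,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CASE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WHEN lag_bytes &gt; 10737418240 THEN &apos;incident&apos;  -- above 10 GB&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WHEN lag_bytes &gt; 1073741824 THEN &apos;warning&apos;    -- above 1 GB&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ELSE &apos;ok&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  END AS severity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM ops.replication_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY slot_name, captured_at DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An alerting tool can poll this query and page on any row that is not &lt;code&gt;ok&lt;/code&gt; or that shows &lt;code&gt;active = false&lt;/code&gt;.&lt;/p&gt;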
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The PostgreSQL logical replication documentation describes conflict handling behavior: when an apply worker hits a conflict (e.g., a unique constraint violation), it raises an error and replication stops until the conflict is resolved manually. By default the worker restarts and fails on the same transaction repeatedly; with &lt;code&gt;disable_on_error&lt;/code&gt; set (PostgreSQL 15+), the subscription is disabled instead. The documented resolution is either to change data on the subscriber so the incoming change no longer conflicts, or to skip the conflicting transaction using &lt;code&gt;ALTER SUBSCRIPTION ... SKIP&lt;/code&gt; (PostgreSQL 15+) or &lt;code&gt;pg_replication_origin_advance&lt;/code&gt; on earlier versions. The documentation warns that skipping drops the whole transaction, including changes that did not conflict, and can easily leave the subscriber inconsistent.&lt;/p&gt;
&lt;p&gt;The documented constraint on logical replication and DDL is unambiguous: schema definitions are not replicated, so every change has to be applied on both sides by hand. What matters is the order. The subscriber table may have columns the publisher lacks, but not the reverse, so additive changes such as adding a column should be applied to the subscriber first and the publisher second. Removing a column runs the other way: drop it on the publisher first, then on the subscriber. Applying an additive change to the publisher first is exactly the schema drift described in Option 2, because the apply worker fails as soon as a row carrying the new column arrives.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Replication slot fills disk on publisher&lt;/td&gt;&lt;td&gt;Subscriber disconnected for hours while high-write workload runs&lt;/td&gt;&lt;td&gt;Monitor slot lag; set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; to cap WAL retention&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Apply worker stuck waiting for lock&lt;/td&gt;&lt;td&gt;Long-running query on subscriber table being replicated&lt;/td&gt;&lt;td&gt;Identify and terminate the blocking query on subscriber&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;SKIP&lt;/code&gt; causes downstream data inconsistency&lt;/td&gt;&lt;td&gt;Skipped row was a critical update needed for referential integrity&lt;/td&gt;&lt;td&gt;Resync the table after skip; audit downstream data for orphaned rows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Schema divergence not caught until conflict&lt;/td&gt;&lt;td&gt;Publisher DDL run without notifying the subscriber&lt;/td&gt;&lt;td&gt;Add subscriber DDL to publisher migration scripts; use migration locking tools&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;max_wal_senders&lt;/code&gt; exceeded&lt;/td&gt;&lt;td&gt;Too many replication connections — logical and physical combined&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;max_wal_senders&lt;/code&gt; in &lt;code&gt;postgresql.conf&lt;/code&gt;; requires restart&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
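&lt;p&gt;For the lock-wait row in the table, the blocking session can usually be found from &lt;code&gt;pg_stat_activity&lt;/code&gt; and &lt;code&gt;pg_blocking_pids()&lt;/code&gt;. A minimal sketch to run on the subscriber; it lists every session waiting on a lock, so look for the row whose &lt;code&gt;backend_type&lt;/code&gt; identifies a logical replication worker:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the subscriber: sessions waiting on a lock, and the sessions blocking them&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  waiting.pid AS waiting_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  waiting.backend_type AS waiting_backend,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  blocker.pid AS blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  blocker.state AS blocking_state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  blocker.xact_start AS blocking_xact_start,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  blocker.query AS blocking_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity AS waiting&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_stat_activity AS blocker&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON blocker.pid = ANY (pg_blocking_pids(waiting.pid))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE waiting.wait_event_type = &apos;Lock&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Whether to end the blocker with &lt;code&gt;pg_terminate_backend(blocking_pid)&lt;/code&gt; or wait for it to finish is a judgment call that depends on what that session is doing.&lt;/p&gt;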
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Logical replication lag accumulates silently, WAL retention grows on the publisher, and by the time the disk alert fires, the subscriber is hours behind with no fast path to catch up.&lt;/li&gt;
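&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Add active monitoring on replication slot lag bytes with an alert threshold at 1 GB, set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; as a disk safety cap, and treat any table stuck outside the &lt;code&gt;r&lt;/code&gt; (ready) state in &lt;code&gt;pg_subscription_rel&lt;/code&gt;, or any subscription whose apply worker keeps erroring, as an incident requiring same-day resolution. (&lt;code&gt;srsubstate&lt;/code&gt; has no error code; its values are &lt;code&gt;i&lt;/code&gt;, &lt;code&gt;d&lt;/code&gt;, &lt;code&gt;f&lt;/code&gt;, &lt;code&gt;s&lt;/code&gt;, and &lt;code&gt;r&lt;/code&gt;.)&lt;/li&gt;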
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After resolving a conflict and re-enabling the subscription, &lt;code&gt;pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn)&lt;/code&gt; from the publisher should decrease steadily — the subscriber is catching up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run Check 1 on the publisher this week. If any replication slot shows &lt;code&gt;lag_bytes &gt; 1 GB&lt;/code&gt; or &lt;code&gt;active = false&lt;/code&gt;, treat it as an open incident. If lag is normal, add a monitoring alert so you know before it becomes critical.&lt;/li&gt;
&lt;/ul&gt;
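&lt;p&gt;The disk safety cap mentioned above can be set and observed roughly as follows. This is a sketch for PostgreSQL 13 or later, where &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; and the &lt;code&gt;wal_status&lt;/code&gt; and &lt;code&gt;safe_wal_size&lt;/code&gt; columns exist; the &lt;code&gt;50GB&lt;/code&gt; value is only a placeholder and should be sized against the free disk on the publisher.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the publisher: cap the WAL retained for replication slots (example value)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER SYSTEM SET max_slot_wal_keep_size = &apos;50GB&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Watch how close each slot is to the cap&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT slot_name, active, wal_status,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_size_pretty(safe_wal_size) AS wal_left_before_invalidation&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_replication_slots;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The cap trades one failure for another: a slot that exceeds it is invalidated (&lt;code&gt;wal_status = &apos;lost&apos;&lt;/code&gt;) and its subscriber needs a full resync, but the publisher keeps its disk.&lt;/p&gt;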
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Query &lt;code&gt;pg_replication_slots&lt;/code&gt; on publisher — check &lt;code&gt;active&lt;/code&gt; status and &lt;code&gt;lag_bytes&lt;/code&gt; for each logical slot&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_subscription&lt;/code&gt; on subscriber — verify &lt;code&gt;subenabled = true&lt;/code&gt; for each subscription&lt;/li&gt;
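&lt;li&gt;Query &lt;code&gt;pg_subscription_rel&lt;/code&gt; on subscriber: check &lt;code&gt;srsubstate&lt;/code&gt; for tables stuck in any state other than &lt;code&gt;r&lt;/code&gt; (ready); the catalog has no error state, so on PostgreSQL 15+ also check &lt;code&gt;pg_stat_subscription_stats&lt;/code&gt; for a nonzero &lt;code&gt;apply_error_count&lt;/code&gt; or &lt;code&gt;sync_error_count&lt;/code&gt;&lt;/li&gt;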
&lt;li&gt;Query &lt;code&gt;pg_stat_subscription&lt;/code&gt; on subscriber — confirm &lt;code&gt;pid&lt;/code&gt; is not NULL for each subscription&lt;/li&gt;
&lt;li&gt;Review PostgreSQL log on subscriber for conflict type and LSN&lt;/li&gt;
&lt;li&gt;If the apply worker is failing on a row conflict: use &lt;code&gt;ALTER SUBSCRIPTION sub SKIP (lsn)&lt;/code&gt; to unblock, after confirming the skipped change is not needed&lt;/li&gt;
&lt;li&gt;If schema mismatch: apply matching DDL to subscriber, then re-enable subscription&lt;/li&gt;
&lt;li&gt;If apply worker stalled on lock: identify and resolve the blocking query on subscriber&lt;/li&gt;
&lt;li&gt;After resolution, monitor &lt;code&gt;lag_bytes&lt;/code&gt; decreasing — confirm subscriber is catching up&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; on publisher to cap disk usage from stalled slots&lt;/li&gt;
&lt;li&gt;Add monitoring alert at lag &gt; 1 GB per logical replication slot&lt;/li&gt;
&lt;li&gt;Document schema change protocol — every publisher DDL must have a matching subscriber DDL step&lt;/li&gt;
&lt;/ol&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>Database Connection Pooling: Why Apps Kill Databases</title><link>https://rajivonai.com/blog/2023-07-10-database-connection-pooling-why-apps-kill-databases/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-07-10-database-connection-pooling-why-apps-kill-databases/</guid><description>Without a connection pool, traffic spikes exhaust OS-level resources before a single slow query runs — here is what actually happens and how to fix it.</description><pubDate>Mon, 10 Jul 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Most applications exhaust their database long before the database is under load.&lt;/strong&gt; The failure is not query pressure — it is connection pressure. Every new connection to PostgreSQL forks a backend process. Every new connection to MySQL spawns a thread. Without a pool capping that number, a traffic spike generates hundreds of OS-level resources in seconds, and the database runs out of capacity to accept connections before it runs out of capacity to execute queries.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Backend engineers know connection pools exist. Most frameworks configure one by default — SQLAlchemy, HikariCP, ActiveRecord, and similar libraries all ship with pool settings. The problem is that those library-level pools live inside a single application process. Scale to five app pods and you have five independent pools, each with their own ten connections: fifty total connections to the database. Scale to fifty pods and you have five hundred. Add a deployment rollout that starts new pods before draining old ones and the math gets worse fast.&lt;/p&gt;
&lt;p&gt;This matters because databases have hard limits. PostgreSQL’s &lt;code&gt;max_connections&lt;/code&gt; defaults to 100. MySQL’s defaults to 151. Those limits are not arbitrary — they map to real resource consumption per connection.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s connection model, documented in the &lt;a href=&quot;https://www.postgresql.org/docs/current/connect-estab.html&quot;&gt;PostgreSQL internals documentation&lt;/a&gt;, forks a new backend process for each client connection. Each backend process carries its own memory space, typically 5–10 MB per connection depending on work_mem settings and query state. One hundred connections means one hundred processes. At five hundred connections you are consuming several gigabytes of RAM just in process overhead before a single row is read.&lt;/p&gt;
&lt;p&gt;MySQL uses a thread-per-connection model rather than processes, which reduces per-connection overhead, but the problem is structurally identical: threads consume stack space, file descriptors, and scheduler overhead. At high connection counts both systems degrade.&lt;/p&gt;
&lt;p&gt;The acute failure mode is a connection storm: an app deployment or autoscale event brings up many new pods simultaneously, each opening their full pool. The database hits &lt;code&gt;max_connections&lt;/code&gt;, new connection attempts queue or return errors, and the application starts logging “too many connections” at the moment it most needs to be available — during a traffic spike or recovery event. The database itself is not overloaded. It simply cannot accept new clients.&lt;/p&gt;
&lt;p&gt;What is the right way to decouple application instance count from database connection count?&lt;/p&gt;
&lt;h2 id=&quot;how-connection-poolers-work&quot;&gt;How Connection Poolers Work&lt;/h2&gt;
&lt;p&gt;A connection pooler sits between application processes and the database. Applications connect to the pooler, which maintains a fixed, smaller set of long-lived connections to the actual database. The application sees a normal database endpoint; the database sees a bounded number of backend processes regardless of how many application pods are running.&lt;/p&gt;
&lt;p&gt;The two dominant tools are PgBouncer for PostgreSQL and ProxySQL for MySQL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PgBouncer&lt;/strong&gt; operates in three modes, documented in the &lt;a href=&quot;https://www.pgbouncer.org/config.html&quot;&gt;PgBouncer documentation&lt;/a&gt;:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Mode&lt;/th&gt;&lt;th&gt;How it works&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Session mode&lt;/td&gt;&lt;td&gt;One server connection per client session; held for the life of the client connection&lt;/td&gt;&lt;td&gt;Minimal breakage; connection count reduction only happens if clients disconnect promptly&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Transaction mode&lt;/td&gt;&lt;td&gt;Server connection returned to pool after each transaction completes&lt;/td&gt;&lt;td&gt;LISTEN/NOTIFY, session-level advisory locks, prepared statements, and session-level SET state do not survive across transactions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Statement mode&lt;/td&gt;&lt;td&gt;Server connection returned after each statement&lt;/td&gt;&lt;td&gt;Breaks transactions; use only for simple read-only workloads&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Transaction mode delivers the most aggressive multiplexing (a pooler with 20 server-side connections can service hundreds of application clients that are between transactions), but it breaks any feature that assumes state persists across transactions. PostgreSQL’s &lt;code&gt;LISTEN/NOTIFY&lt;/code&gt; mechanism relies on a persistent server connection; in transaction mode the pooler may reassign that connection to another client between events. Session-scope advisory locks stay attached to the server connection rather than to the client, so once the transaction commits, the next client to receive that connection inherits the lock and the original client may be unable to release it. Session-level &lt;code&gt;SET&lt;/code&gt; commands have the same problem: the setting stays on a server connection the client may never see again. &lt;code&gt;SET LOCAL&lt;/code&gt; is the exception, because it is transaction-scoped by design and reverts at commit.&lt;/p&gt;
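&lt;p&gt;Where the application controls its own SQL, transaction-scoped equivalents sidestep the session-state problem in transaction mode. A minimal sketch; the timeout value and the lock key &lt;code&gt;42&lt;/code&gt; are arbitrary placeholders:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Transaction-scoped setting: reverts at COMMIT or ROLLBACK&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET LOCAL statement_timeout = &apos;5s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Transaction-scoped advisory lock: released automatically at transaction end&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_advisory_xact_lock(42);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ... statements that need the lock and the timeout go here ...&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because nothing outlives the transaction, it does not matter which server connection the pooler hands out next.&lt;/p&gt;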
&lt;p&gt;&lt;strong&gt;ProxySQL&lt;/strong&gt; applies the same multiplexing principle to MySQL, with additional query routing capabilities (read-write splitting, rule-based routing) that make it common in MySQL environments with replicas. Its connection pool size is configured independently of the application-side connection settings.&lt;/p&gt;
&lt;p&gt;The practical deployment pattern is to configure application connection pools small (3–5 connections per pod) so the pooler remains the single point of configuration, and set the pooler’s server-side pool to a number the database can sustain — typically 20–50% of &lt;code&gt;max_connections&lt;/code&gt;, leaving headroom for administrative connections and monitoring.&lt;/p&gt;
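&lt;p&gt;The 20–50% guideline can be turned into concrete numbers on the database itself. A small sketch; the column aliases are illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Candidate range for the pooler server-side pool, derived from max_connections&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  current_setting(&apos;max_connections&apos;)::int AS max_connections,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  current_setting(&apos;superuser_reserved_connections&apos;)::int AS reserved_for_superuser,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  floor(current_setting(&apos;max_connections&apos;)::int * 0.2)::int AS pool_size_low,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  floor(current_setting(&apos;max_connections&apos;)::int * 0.5)::int AS pool_size_high;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the default &lt;code&gt;max_connections&lt;/code&gt; of 100, this yields a range of 20 to 50 server connections.&lt;/p&gt;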
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The PostgreSQL project documents the process-per-connection model explicitly, and the &lt;a href=&quot;https://www.pgbouncer.org/faq.html&quot;&gt;PgBouncer FAQ&lt;/a&gt; describes the transaction mode tradeoffs in detail, noting that applications must be verified compatible before enabling it.&lt;/p&gt;
&lt;p&gt;The Heroku Postgres team published guidance on PgBouncer in transaction mode specifically because Heroku’s platform runs many small dynos, each with its own application process, which is exactly the multi-pod scaling problem described above. Their tooling, &lt;a href=&quot;https://github.com/heroku/heroku-buildpack-pgbouncer&quot;&gt;heroku-buildpack-pgbouncer&lt;/a&gt;, emerged from the documented operational reality that a modest Heroku app on ten dynos could exhaust a standard PostgreSQL &lt;code&gt;max_connections&lt;/code&gt; without any pooler in place.&lt;/p&gt;
&lt;p&gt;The documented pattern from the PgBouncer project itself is: use session mode as a starting point when application compatibility is uncertain, verify that no LISTEN/NOTIFY or advisory lock usage exists, then migrate to transaction mode for maximum multiplexing.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Transaction mode with LISTEN/NOTIFY&lt;/td&gt;&lt;td&gt;Notifications are never received or delivered to the wrong client&lt;/td&gt;&lt;td&gt;The pooler reassigns server connections between events; the persistent channel the listener expects does not exist&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Pool exhaustion under bursts&lt;/td&gt;&lt;td&gt;New client connections are queued or rejected by the pooler itself&lt;/td&gt;&lt;td&gt;The pooler’s server-side pool is also bounded; if all server connections are busy, clients wait or time out&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Health check connections consuming pool slots&lt;/td&gt;&lt;td&gt;Liveness probes open a connection and close it repeatedly, consuming pool capacity&lt;/td&gt;&lt;td&gt;Health checks should connect to the pooler’s stats port or use a single persistent probe connection rather than opening fresh database connections&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
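&lt;p&gt;Pool exhaustion is visible from PgBouncer itself. Assuming you can connect with &lt;code&gt;psql&lt;/code&gt; to PgBouncer’s virtual &lt;code&gt;pgbouncer&lt;/code&gt; admin database as a user listed in &lt;code&gt;admin_users&lt;/code&gt; or &lt;code&gt;stats_users&lt;/code&gt;, the following admin console command shows per-pool client and server counts:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW POOLS;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;cl_waiting&lt;/code&gt; value that stays above zero, together with a growing &lt;code&gt;maxwait&lt;/code&gt;, suggests clients are queueing because every server connection in that pool is busy.&lt;/p&gt;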
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Without a standalone pooler, application pod count directly drives database connection count — a deployment event can exhaust &lt;code&gt;max_connections&lt;/code&gt; before the database processes a single query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Deploy PgBouncer (PostgreSQL) or ProxySQL (MySQL) as a sidecar or dedicated service; configure application pools to 3–5 connections per pod; set the pooler’s server pool to a fraction of &lt;code&gt;max_connections&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After deploying the pooler, run &lt;code&gt;SELECT count(*) FROM pg_stat_activity&lt;/code&gt; during a load test — the number should stay flat as application replicas scale, rather than increasing proportionally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, check your current connection count and compare it to your &lt;code&gt;max_connections&lt;/code&gt; setting; if you are above 60% of the limit without a pooler, that is the gap to close first:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Connection count by state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;), &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Show the configured limit&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW max_connections;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;</content:encoded><category>databases</category><category>failures</category></item><item><title>Schema Deployment Risk Checklist</title><link>https://rajivonai.com/blog/2023-06-26-schema-deployment-risk-checklist/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-06-26-schema-deployment-risk-checklist/</guid><description>Assessing lock type, table size, reversibility, and rollback plan before every schema migration — a structured checklist for zero-downtime deployments.</description><pubDate>Mon, 26 Jun 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The most dangerous moment in a schema deployment is not the migration itself — it is the 30 seconds before you run it when you think you understand the lock behavior but haven’t confirmed it.&lt;/strong&gt; &lt;code&gt;ALTER TABLE ADD COLUMN&lt;/code&gt; on a 2 GB table is instantaneous on PostgreSQL 11 and later. The same statement on PostgreSQL 10 can hold an ACCESS EXCLUSIVE lock for minutes. &lt;code&gt;CREATE INDEX&lt;/code&gt; without &lt;code&gt;CONCURRENTLY&lt;/code&gt; will block all writes on the table for the duration of the build. Understanding which statement takes which lock, and what the options are to avoid it, is table stakes for schema work on production databases.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Schema migrations in a running production system have three risk dimensions: lock duration, reversibility, and execution time. These are independent axes. A migration can be fast but irreversible (dropping a column). It can be slow but non-blocking (&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;). It can be fast, reversible, and still dangerous because the lock type is wrong for the traffic pattern.&lt;/p&gt;
&lt;p&gt;Most teams have learned about &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;. Fewer have mapped out the full lock table for &lt;code&gt;ALTER TABLE&lt;/code&gt; variants. The failure pattern is predictable: an engineer runs &lt;code&gt;ALTER TABLE orders ADD COLUMN tax_id VARCHAR(32) NOT NULL DEFAULT &apos;&apos;&lt;/code&gt; on a table with 500 million rows on a PostgreSQL 10 server, assumes it is fast because they have done it before on small tables, and discovers it is holding an ACCESS EXCLUSIVE lock while the table is rewritten for 12 minutes to backfill the default. On PostgreSQL 11 and later a constant default is stored as metadata and this particular statement is fast, but a volatile default such as &lt;code&gt;clock_timestamp()&lt;/code&gt; still forces the rewrite.&lt;/p&gt;
&lt;p&gt;This checklist forces the assessment before the migration runs, not after it starts.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;When schema migrations fail, they usually do not corrupt data — they corrupt availability. A migration that holds an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock on a heavily trafficked table causes all incoming queries to queue. Once the connection pool saturates, the application begins dropping requests, triggering an escalating cascade of timeouts.&lt;/p&gt;
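&lt;p&gt;One common guard against the queueing described above is to bound how long the migration may wait for its lock, so that it fails fast instead of stalling every query queued behind it. A minimal sketch; the 5-second value and the &lt;code&gt;orders.shipped_at&lt;/code&gt; column are illustrative, not recommendations:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Give up after 5 seconds instead of queueing behind a long-running transaction&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET lock_timeout = &apos;5s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ADD COLUMN shipped_at timestamptz;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Restore the session default for whatever runs next&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;RESET lock_timeout;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the lock cannot be acquired in time the statement errors out and can be retried later, which is usually cheaper than a saturated connection pool.&lt;/p&gt;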
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application connection queuing after migration started&lt;/td&gt;&lt;td&gt;APM or &lt;code&gt;pg_stat_activity&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Migration holding ACCESS EXCLUSIVE lock — connections waiting&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Migration running longer than expected&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt; with &lt;code&gt;state = &apos;active&apos;&lt;/code&gt; and old &lt;code&gt;xact_start&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Table size or data backfill underestimated on staging&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replication lag spiking during migration&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_replication&lt;/code&gt; — &lt;code&gt;replay_lag&lt;/code&gt; growing&lt;/td&gt;&lt;td&gt;Migration WAL volume causing replication to fall behind&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Migration script fails with lock timeout&lt;/td&gt;&lt;td&gt;Application or migration tool error log&lt;/td&gt;&lt;td&gt;Lock acquisition timed out — another transaction holding the table&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rollback script unavailable&lt;/td&gt;&lt;td&gt;Migration tool history&lt;/td&gt;&lt;td&gt;Migration was run without a matching down migration&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The traditional approach of “test it on staging” provides a false sense of security. A deployment that runs in two seconds on a 100 MB staging table can stall for twenty minutes on a 500 GB production table. Furthermore, if a migration blocks mid-execution due to lock contention or disk space limits, the lack of an immediate, tested rollback plan forces engineers to invent recovery strategies during an active incident.&lt;/p&gt;
&lt;p&gt;How can a team systematically verify the lock behavior, execution duration, and reversibility of a schema migration before it ever touches production?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The solution is a structured evaluation that categorizes migrations by lock type, table size, and rollback complexity before execution.&lt;/p&gt;
&lt;h3 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h3&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Schema migration planned] --&gt; B{Requires ACCESS EXCLUSIVE lock?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no — CONCURRENTLY or ANALYZE| C[Safe to run anytime — proceed]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| D{Table size greater than 1 GB?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| E{Online alternative available?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| F[Use online alternative — see options below]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|no| G[Schedule maintenance window]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no — small table| H{Traffic pattern allows short lock?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| I[Run during low-traffic window]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| J[Use online alternative or maintenance window]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; K{NOT NULL without default?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|yes| L[3-step split — nullable then backfill then constraint]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|no| M[ADD COLUMN with DEFAULT on PG11 or later — instant]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
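&lt;p&gt;&lt;strong&gt;What this diagram shows:&lt;/strong&gt; A migration risk decision tree. The first branch asks whether the operation requires an ACCESS EXCLUSIVE lock; if not, it can run at any time. If it does, table size sets the next question: for tables over 1 GB, whether an online alternative exists (otherwise schedule a maintenance window), and for smaller tables, whether traffic can tolerate a short lock. The final branch covers adding a NOT NULL column without a default, which requires the three-step pattern: add the column as nullable, backfill, then add the constraint.&lt;/p&gt;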
&lt;/blockquote&gt;
&lt;h3 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Does the migration require ACCESS EXCLUSIVE lock?&lt;/strong&gt; — the most important question to answer first:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check the lock type for common DDL operations:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ACCESS EXCLUSIVE (blocks reads AND writes):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   ALTER TABLE (most variants)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   DROP TABLE, TRUNCATE, DROP INDEX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   VACUUM FULL, CLUSTER&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- SHARE UPDATE EXCLUSIVE (allows reads and writes):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   CREATE INDEX CONCURRENTLY&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   VACUUM, ANALYZE, CREATE STATISTICS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- SHARE (allows reads, blocks writes):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;--   CREATE INDEX (without CONCURRENTLY)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- To confirm lock behavior during a migration, check what is waiting:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid, relation::regclass, mode, granted&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_locks&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; granted&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
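&lt;p&gt;Most &lt;code&gt;ALTER TABLE&lt;/code&gt; forms take ACCESS EXCLUSIVE regardless of table size; size affects how long the lock is held, not whether it is taken. A few forms take weaker locks (for example, &lt;code&gt;VALIDATE CONSTRAINT&lt;/code&gt; takes SHARE UPDATE EXCLUSIVE), so confirm the lock level for your specific statement before starting.&lt;/p&gt;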
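&lt;p&gt;One common way to keep a blocked migration from queueing every other query behind it is to cap how long the DDL may wait for its lock. The following is a minimal sketch; the 5-second value is an assumption to tune for your workload, and the column is the one used in the examples in this checklist.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Fail instead of waiting if the lock is not acquired within 5 seconds (assumed value)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET LOCAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; lock_timeout = &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;5s&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders ADD COLUMN IF NOT EXISTS archived_at timestamptz;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;COMMIT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SET LOCAL&lt;/code&gt; scopes the timeout to this transaction, so a timed-out attempt rolls back cleanly and can be retried later.&lt;/p&gt;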
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;What is the table size?&lt;/strong&gt; — execution time scales with table size for any migration that rewrites rows:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_total_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; total_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; heap_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_indexes_size(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; index_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  reltuples::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;bigint&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; estimated_rows&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_class&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; oid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- match by OID so a same-named table in another schema is not counted&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
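&lt;p&gt;For any migration that rewrites the heap (ADD COLUMN with a default on PG10 and earlier, most column type changes) or scans every row (ADD CONSTRAINT without &lt;code&gt;NOT VALID&lt;/code&gt;), the lock duration is roughly proportional to table size. If duration scales linearly, a migration that runs in 3 seconds on a 100 MB staging table takes about 18 minutes on a 36 GB production table, which holds 360 times the data.&lt;/p&gt;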
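&lt;p&gt;The same back-of-the-envelope estimate can be computed against the real table. This sketch assumes linear scaling and uses hypothetical staging measurements (3 seconds on a 100 MB heap); treat the result as a rough lower bound, since index rebuilds and I/O contention are not modeled.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Assumed staging measurements: 3 seconds for a 100 MB heap&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass)) AS prod_heap_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(3.0 * pg_relation_size(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass) / (100 * 1024 * 1024.0) / 60, 1) AS est_minutes;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;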
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Is the migration reversible?&lt;/strong&gt; — classify before running:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check existing column definitions before adding or dropping&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  column_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  data_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  is_nullable,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  column_default&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;columns&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; table_name &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ordinal_position;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reversibility classification:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ADD COLUMN nullable&lt;/code&gt; — reversible: &lt;code&gt;DROP COLUMN&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ADD COLUMN NOT NULL DEFAULT value&lt;/code&gt; — reversible: &lt;code&gt;DROP COLUMN&lt;/code&gt; (on PG11 and later the forward step is also cheap, because a constant default is stored in the catalog with no table rewrite)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DROP COLUMN&lt;/code&gt; — irreversible: the column is inaccessible as soon as the statement commits, and restoring it means recovering the data from a backup&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALTER COLUMN TYPE&lt;/code&gt; — reversible in principle, but requires another full rewrite; plan carefully&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE INDEX&lt;/code&gt; — fully reversible: &lt;code&gt;DROP INDEX&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ADD CONSTRAINT CHECK&lt;/code&gt; — reversible: &lt;code&gt;DROP CONSTRAINT&lt;/code&gt;, but adding it scans the whole table under ACCESS EXCLUSIVE; use the &lt;code&gt;NOT VALID&lt;/code&gt; + &lt;code&gt;VALIDATE CONSTRAINT&lt;/code&gt; split&lt;/li&gt;
&lt;/ul&gt;
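&lt;p&gt;The &lt;code&gt;NOT VALID&lt;/code&gt; + &lt;code&gt;VALIDATE CONSTRAINT&lt;/code&gt; split for a CHECK constraint can be sketched as below. The constraint name and status values mirror the idempotency example later in this checklist and are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 1: add the constraint without scanning existing rows (brief ACCESS EXCLUSIVE lock)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders ADD CONSTRAINT chk_orders_status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CHECK (status IN (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;processing&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;shipped&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;cancelled&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)) NOT VALID;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 2: check existing rows under SHARE UPDATE EXCLUSIVE, so reads and writes continue&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders VALIDATE CONSTRAINT chk_orders_status;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rollback at either step: ALTER TABLE orders DROP CONSTRAINT IF EXISTS chk_orders_status;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;New and updated rows are checked as soon as step 1 commits; step 2 only covers rows that already existed.&lt;/p&gt;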
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Test the migration on a production-sized staging database&lt;/strong&gt; — estimate true execution time:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Time the migration on a copy of production data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; staging_prod_copy&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;\timing&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;ALTER TABLE orders ADD COLUMN archived_at timestamptz;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# To observe lock behavior without keeping the change, run the DDL in a transaction inside a psql session and roll back (EXPLAIN does not support DDL)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; orders&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; ADD&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; COLUMN&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; archived_at&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; timestamptz&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;--&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; Check&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pg_locks&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; here&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; to&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; observe&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; lock&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; behavior&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;ROLLBACK&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;  &lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;--&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; abort&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; to&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; avoid&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; actual&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; change&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Timing on staging with a production-sized dataset is the only reliable estimate. Factor-of-10 size differences between staging and production are common and explain most migration surprises.&lt;/p&gt;
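&lt;p&gt;While the test transaction is still open, the locks it holds can be inspected from a second session. This is a minimal sketch that assumes the table is named &lt;code&gt;orders&lt;/code&gt; and that both sessions are connected to the same database.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run in a second session while the first holds its transaction open&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; l.pid, l.mode, l.granted, a.state, left(a.query, 60) AS query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_locks l&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity a USING (pid)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; l.relation = &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- pg_locks spans all databases, so restrict to the current one&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; l.database = (SELECT oid FROM pg_database WHERE datname = current_database())&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; l.granted DESC, l.pid;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A granted &lt;code&gt;AccessExclusiveLock&lt;/code&gt; row for the migration&apos;s PID confirms the lock level; ungranted rows are the sessions it would block in production.&lt;/p&gt;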
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Is the migration idempotent?&lt;/strong&gt; — essential for safe retries:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Idempotent column addition&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; archived_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;timestamptz&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Idempotent index creation&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; CONCURRENTLY&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_archived_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (archived_at);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Idempotent constraint addition&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DO $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_constraint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; conname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;chk_orders_status&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; conrelid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;::regclass&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;THEN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; CONSTRAINT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; chk_orders_status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    CHECK&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;processing&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;shipped&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;cancelled&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    NOT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; VALID;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  END&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IF&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;END&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $$;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
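&lt;p&gt;A migration that fails midway and cannot be safely retried creates recovery debt. &lt;code&gt;IF NOT EXISTS&lt;/code&gt; guards on &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; and &lt;code&gt;ADD COLUMN&lt;/code&gt; are the standard pattern, with one caveat: a failed concurrent build leaves an invalid index behind, and &lt;code&gt;IF NOT EXISTS&lt;/code&gt; will skip it on retry, so drop the invalid index first.&lt;/p&gt;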
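&lt;p&gt;Before retrying a concurrent index build, it helps to check for an index left in the invalid state by an earlier failed attempt. A minimal sketch, assuming the index name from the idempotency example above:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Indexes marked invalid, typically by a failed CREATE INDEX CONCURRENTLY&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indexrelid::regclass AS index_name, indrelid::regclass AS table_name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE NOT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; indisvalid;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Remove the leftover, then rerun the CREATE INDEX CONCURRENTLY statement&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP INDEX CONCURRENTLY IF EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_archived_at;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;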
&lt;h3 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Lock-safe online alternatives&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For the most common migration types, online alternatives avoid ACCESS EXCLUSIVE:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ADD INDEX: always use CONCURRENTLY on production&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; CONCURRENTLY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_customer_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (customer_id);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ADD COLUMN with a non-volatile default (PostgreSQL 11 and later): instant, no table rewrite&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- The value for existing rows is stored in the catalog (pg_attribute.attmissingval), not in the heap&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN archived_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;timestamptz&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DEFAULT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ADD NOT NULL constraint without default: 3-step split&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 1: Add column as nullable&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tax_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VARCHAR&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;32&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 2: Backfill in batches (do NOT do this in a single UPDATE)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DO $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DECLARE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  batch_size &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; :&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  rows_updated &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  LOOP&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    UPDATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tax_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;      SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;      WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tax_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;      LIMIT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; batch_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    );&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    GET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; DIAGNOSTICS rows_updated &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ROW_COUNT;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    EXIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHEN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; rows_updated &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
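&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    COMMIT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- end each batch&apos;s transaction so row locks are released (PG11 and later; the DO block must not run inside an outer transaction)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    PERFORM pg_sleep(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- brief pause between batches&lt;/span&gt;&lt;/span&gt;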
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  END&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LOOP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;END&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $$;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 3: Add NOT NULL constraint (scans the whole table under ACCESS EXCLUSIVE to verify no NULLs remain)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN tax_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NOT NULL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PG12 and later: the scan is skipped if a validated CHECK (tax_id IS NOT NULL) constraint already exists&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 2 — Table rewrite with &lt;code&gt;pg_repack&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For bloated tables needing a full rewrite (e.g., reclaiming space after dropping a column or after many deletes), &lt;code&gt;pg_repack&lt;/code&gt; rebuilds the table online without holding an ACCESS EXCLUSIVE lock for the duration of the rebuild:&lt;/p&gt;
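&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Install the pg_repack extension in the target database (SQL, run through psql)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; postgres&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;CREATE EXTENSION pg_repack;&quot;&lt;/span&gt;&lt;/span&gt;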
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Run repack online — rebuilds table without long lock&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_repack&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; postgres&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -t&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Same command using the long-form --table flag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_repack&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -h&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -U&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; postgres&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --table&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;pg_repack&lt;/code&gt; works by building a new copy of the table online, capturing concurrent changes via a trigger, then swapping the copy in at the end. It takes a brief ACCESS EXCLUSIVE lock when it installs the trigger and again for the final swap; the swap usually completes in under a second. Per the &lt;code&gt;pg_repack&lt;/code&gt; documentation, the table must have a primary key or a unique index on NOT NULL columns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 3 — Scheduled maintenance window with monitoring&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When no online alternative exists — changing a column type, adding a foreign key that requires a full scan, or truncating a large table — execute during a maintenance window with active monitoring:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Set a lock timeout to abort if the migration waits too long for a lock&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; lock_timeout&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;5s&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Set a statement timeout as a safety net&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; statement_timeout &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;10min&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run migration&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TYPE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NUMERIC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;12&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;4&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Monitor from a second session during execution&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, query, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; !=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;lock_timeout&lt;/code&gt; prevents the migration from queuing indefinitely behind a long-running transaction. If the migration cannot acquire its lock in 5 seconds, it aborts cleanly, allowing you to investigate what is holding the lock before retrying.&lt;/p&gt;
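&lt;p&gt;Because a &lt;code&gt;lock_timeout&lt;/code&gt; abort is an expected outcome rather than a failure, it can help to wrap the migration in a bounded retry loop. The following is a minimal sketch, not a definitive implementation: &lt;code&gt;retry_migration&lt;/code&gt; is a hypothetical helper name, and the attempt count and pause are illustrative values.&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical helper: retry a migration command a bounded number of times,
# pausing between attempts, so a lock_timeout abort does not need a manual rerun.
# Usage: retry_migration MAX_ATTEMPTS PAUSE_SECONDS command [args...]
retry_migration() {
  local max_attempts="$1" pause="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the command exactly as given; a non-zero exit counts as a failed attempt.
    if "$@"; then
      echo "attempt ${attempt}: succeeded"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "attempt ${attempt}: failed, giving up"
      return 1
    fi
    echo "attempt ${attempt}: failed, retrying in ${pause}s"
    sleep "$pause"
    attempt=$((attempt + 1))
  done
}

# Example (assumes migration.sql begins with SET lock_timeout = '5s'):
# retry_migration 5 30 psql -v ON_ERROR_STOP=1 -f migration.sql
```

&lt;p&gt;Keeping the timeout short and retrying means the migration never sits in the lock queue for long, so queries arriving behind it are not blocked while it waits. Use the pause between attempts to check what is holding the lock.&lt;/p&gt;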
&lt;h3 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h3&gt;
&lt;p&gt;For every migration, have the rollback command written before running the forward migration:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Forward: add column&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN archived_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;timestamptz&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rollback: drop column&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; archived_at;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Forward: create index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; CONCURRENTLY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_status &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rollback: drop index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; CONCURRENTLY &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_status;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Forward: add constraint (using NOT VALID to avoid full scan)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; CONSTRAINT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; chk_orders_positive_amount&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  CHECK&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (amount &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; VALID;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Validate separately (allows reads and writes during validation)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders VALIDATE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CONSTRAINT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; chk_orders_positive_amount;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rollback: drop constraint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DROP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; CONSTRAINT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; IF&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; EXISTS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; chk_orders_positive_amount;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For migrations that are irreversible at the data level (DROP COLUMN, TRUNCATE), the rollback plan is: restore from backup. This should be documented explicitly in the migration, and the backup should be confirmed current before running.&lt;/p&gt;
&lt;h3 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h3&gt;
&lt;p&gt;A pre-migration risk assessment script that runs before any &lt;code&gt;ALTER TABLE&lt;/code&gt; in your CI pipeline catches most issues automatically:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;#!/bin/bash&lt;/span&gt;&lt;/span&gt;
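&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Flag migrations that target a large table (size check only; total size includes indexes and TOAST)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;:&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;${&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TABLE_NAME&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;:?&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;set TABLE_NAME to the schema-qualified table to check}&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TABLE_SIZE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -tAc&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;SELECT pg_total_relation_size(&apos;${&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TABLE_NAME&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;}&apos;::regclass)&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;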
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TABLE_SIZE_GB&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;scale=2; ${&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TABLE_SIZE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;}/1073741824&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; bc&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;if&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (( $(echo &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$TABLE_SIZE_GB&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &gt; 1&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; bc &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;l) )); &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;then&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;WARNING: Table ${&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TABLE_NAME&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;} is ${&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TABLE_SIZE_GB&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;}GB&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;Verify migration is CONCURRENTLY-safe or schedule maintenance window&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  exit&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;fi&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For teams using schema migration tools (Flyway, Liquibase, golang-migrate), pre-migration hooks that run the size check and lock-type classification against the target SQL are the standard pattern.&lt;/p&gt;
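&lt;p&gt;The lock-type classification half of such a hook can be sketched as a simple pattern match over each statement. This is an illustrative sketch under stated assumptions: &lt;code&gt;classify_lock&lt;/code&gt; is a hypothetical function, it handles one statement at a time, and it is deliberately conservative, reporting ACCESS EXCLUSIVE for any &lt;code&gt;ALTER TABLE&lt;/code&gt; form it does not recognise even though some forms take weaker locks.&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical classifier: map one SQL statement to the table lock it is
# likely to take in PostgreSQL. Conservative by design: unrecognised
# ALTER TABLE forms are reported as ACCESS EXCLUSIVE.
classify_lock() {
  local sql
  # Upper-case the statement and squeeze whitespace so the patterns stay simple.
  sql=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -s '[:space:]' ' ')
  sql="${sql# }"   # drop a single leading space left by the squeeze
  case "$sql" in
    "CREATE INDEX CONCURRENTLY"*|"CREATE UNIQUE INDEX CONCURRENTLY"*)
      echo "SHARE UPDATE EXCLUSIVE" ;;
    "CREATE INDEX"*|"CREATE UNIQUE INDEX"*)
      echo "SHARE" ;;
    "ALTER TABLE"*"VALIDATE CONSTRAINT"*)
      echo "SHARE UPDATE EXCLUSIVE" ;;
    "ALTER TABLE"*|"TRUNCATE"*|"DROP TABLE"*)
      echo "ACCESS EXCLUSIVE" ;;
    *)
      echo "UNKNOWN" ;;
  esac
}

classify_lock "CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);"
classify_lock "ALTER TABLE orders ALTER COLUMN amount TYPE NUMERIC(12, 4);"
```

&lt;p&gt;A hook would fail the pipeline when the result is ACCESS EXCLUSIVE and the size check also flags the table, and would treat UNKNOWN as a prompt for manual review rather than as safe.&lt;/p&gt;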
&lt;h3 id=&quot;schema-deployment-checklist&quot;&gt;Schema Deployment Checklist&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Identify the SQL statement and its lock type — ACCESS EXCLUSIVE, SHARE, or SHARE UPDATE EXCLUSIVE&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_total_relation_size&lt;/code&gt; for the target table — flag if greater than 1 GB&lt;/li&gt;
&lt;li&gt;Determine if the migration is reversible — write the rollback SQL before running the forward migration&lt;/li&gt;
&lt;li&gt;Test execution time on a production-sized staging database with &lt;code&gt;\timing&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Confirm the migration is idempotent — add &lt;code&gt;IF NOT EXISTS&lt;/code&gt; and &lt;code&gt;IF EXISTS&lt;/code&gt; guards where applicable&lt;/li&gt;
&lt;li&gt;Determine if an online alternative exists — &lt;code&gt;CONCURRENTLY&lt;/code&gt; index, PG11+ ADD COLUMN, 3-step NOT NULL&lt;/li&gt;
&lt;li&gt;For ACCESS EXCLUSIVE on large tables — schedule a maintenance window or use the online alternative&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;lock_timeout = &apos;5s&apos;&lt;/code&gt; and &lt;code&gt;statement_timeout&lt;/code&gt; before running any blocking migration&lt;/li&gt;
&lt;li&gt;Confirm a current backup exists before running any irreversible migration (DROP COLUMN, TRUNCATE)&lt;/li&gt;
&lt;li&gt;Monitor &lt;code&gt;pg_stat_activity&lt;/code&gt; for lock contention during the migration window from a second session&lt;/li&gt;
&lt;li&gt;Verify replication lag does not spike during migration — check &lt;code&gt;pg_stat_replication&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;After the migration completes, run &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; on the primary affected queries to confirm the plan is correct&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The PostgreSQL documentation for &lt;code&gt;ADD COLUMN&lt;/code&gt; describes the behavioral change in PostgreSQL 11: prior to version 11, &lt;code&gt;ADD COLUMN&lt;/code&gt; with a &lt;code&gt;DEFAULT&lt;/code&gt; clause required a full table rewrite to store the default in every existing row. PostgreSQL 11 stores a non-volatile default once in the catalog (&lt;code&gt;pg_attribute.attmissingval&lt;/code&gt;), so &lt;code&gt;ADD COLUMN ... DEFAULT&lt;/code&gt; completes in milliseconds regardless of table size once it has acquired its lock; the default is applied on read for existing rows, not during the migration. A volatile default such as &lt;code&gt;clock_timestamp()&lt;/code&gt; still forces a rewrite. This behavior is documented in the PostgreSQL 11 release notes.&lt;/p&gt;
&lt;p&gt;The documentation for &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; describes its two-scan approach: it scans the table twice, once to build the initial index and once to incorporate concurrent changes, and waits for transactions that could still modify or use the index to finish before marking the index valid. This means it takes longer than non-concurrent index creation, but it never holds an ACCESS EXCLUSIVE lock. The documentation states the tradeoff plainly: writes are not blocked during the build, but the build requires more total work and takes significantly longer to complete.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; leaves invalid index&lt;/td&gt;&lt;td&gt;Deadlock, uniqueness violation, or cancellation during build&lt;/td&gt;&lt;td&gt;Drop the invalid index; recreate with CONCURRENTLY&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;NOT VALID&lt;/code&gt; constraint skips existing data violations&lt;/td&gt;&lt;td&gt;Backfill was incomplete before constraint was added&lt;/td&gt;&lt;td&gt;Fix the violating rows, then run &lt;code&gt;VALIDATE CONSTRAINT&lt;/code&gt; to enforce the constraint on all rows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3-step NOT NULL breaks if backfill is skipped&lt;/td&gt;&lt;td&gt;Developer runs step 1 and step 3 without step 2&lt;/td&gt;&lt;td&gt;Enforce step ordering in migration tooling; use explicit progress markers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;lock_timeout&lt;/code&gt; causes migration abort&lt;/td&gt;&lt;td&gt;Another long transaction holds an incompatible lock&lt;/td&gt;&lt;td&gt;Identify the blocking transaction and wait for it to finish; retry the migration with the same short timeout&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pg_repack&lt;/code&gt; fails on table with no primary key&lt;/td&gt;&lt;td&gt;Table has neither a primary key nor a unique index on NOT NULL columns&lt;/td&gt;&lt;td&gt;Add a surrogate primary key first, or use a maintenance window rewrite&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-this-post-does-not-cover&quot;&gt;What This Post Does Not Cover&lt;/h2&gt;
&lt;p&gt;This checklist covers schema migration risk for PostgreSQL and MySQL. It does not cover: migration tooling comparisons (Flyway vs Liquibase vs sqitch), zero-downtime application deployment patterns when schema and code changes must roll out together, MongoDB schema validation evolution, or database-level encryption key rotation during schema changes. Each of those is a separate decision area.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Schema migrations that appear safe on small staging tables can hold ACCESS EXCLUSIVE locks for minutes on large production tables, queuing and dropping connections until they complete or are killed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Classify every migration by lock type and table size before running it; use &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; and the 3-step NOT NULL split for large tables; and always have the rollback command written before the forward migration runs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After implementing &lt;code&gt;CONCURRENTLY&lt;/code&gt; and deferred NOT NULL patterns, migration deployments should complete with zero connection queuing, observable in &lt;code&gt;pg_stat_activity&lt;/code&gt; as no sessions with &lt;code&gt;wait_event_type = &apos;Lock&apos;&lt;/code&gt; during the migration window.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Pick one upcoming schema migration and run through this checklist before executing it. If it requires ACCESS EXCLUSIVE on a table over 1 GB, find the online alternative or schedule the maintenance window before the deployment date.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category><category>architecture</category></item><item><title>Cloud Database Cost Triage: Storage, IOPS, CPU, Replicas</title><link>https://rajivonai.com/blog/2023-06-05-cloud-database-cost-triage-storage-iops-cpu-replicas/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-06-05-cloud-database-cost-triage-storage-iops-cpu-replicas/</guid><description>A structured runbook for identifying which cost dimension is driving your AWS RDS or Aurora bill before making any changes.</description><pubDate>Mon, 05 Jun 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The RDS bill is higher than expected and the instinct to scale up the instance or add a replica is almost always the wrong first move.&lt;/strong&gt; Cost spikes in cloud databases have four distinct drivers — storage, IOPS, instance class, and replicas — and each requires a different remediation. Acting on the wrong one wastes money and may make the problem worse. The right move is triage first.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;AWS RDS and Aurora bill on four independent cost dimensions: storage consumed, I/O operations performed, the instance class running the engine, and the number of instances attached to the cluster. When a monthly bill grows faster than traffic, it is usually one of these dimensions accelerating — not all four simultaneously.&lt;/p&gt;
&lt;p&gt;The problem is that Cost Explorer’s default view shows total database spend, not cost per dimension. An engineer looking at a $4,000 line item for “Amazon RDS” cannot tell whether the driver is 2 TB of unreclaimed storage, a gp2 volume depleting its burst I/O credits, an over-provisioned db.r6g.2xlarge sitting at 8% CPU, or three read replicas that no longer carry meaningful traffic.&lt;/p&gt;
&lt;p&gt;Each of those four scenarios has a different first command to run and a different remediation. Conflating them means you might rightsize the instance when the actual driver is 800 GB of dead tuples waiting on autovacuum.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;



































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Storage cost growing without traffic growth&lt;/td&gt;&lt;td&gt;AWS Cost Explorer, grouped by usage type&lt;/td&gt;&lt;td&gt;Table bloat, dead tuples, or log accumulation not being reclaimed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;IOPS pressure on a gp2 volume&lt;/td&gt;&lt;td&gt;CloudWatch &lt;code&gt;ReadIOPS&lt;/code&gt;, &lt;code&gt;WriteIOPS&lt;/code&gt;, and &lt;code&gt;BurstBalance&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Burst credit balance depleted; the volume is throttled to its baseline rate, which pushes teams toward larger volumes or provisioned IOPS&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High instance cost relative to CPU utilization&lt;/td&gt;&lt;td&gt;CloudWatch &lt;code&gt;CPUUtilization&lt;/code&gt; p95 over 30 days&lt;/td&gt;&lt;td&gt;Instance class is over-provisioned for the actual workload&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica count grew over time&lt;/td&gt;&lt;td&gt;RDS console, DB instances view&lt;/td&gt;&lt;td&gt;Replicas added reactively without a retirement policy; each one bills at primary instance rates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Snapshot retention set to maximum&lt;/td&gt;&lt;td&gt;RDS console, Maintenance and backups&lt;/td&gt;&lt;td&gt;Snapshots older than policy requires accumulate silently at $0.095 per GB-month&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Database and table sizes&lt;/strong&gt; — connect to the PostgreSQL instance and run both queries. The first gives total database size; the second surfaces the top bloat candidates by table.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Total database size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_size_pretty(pg_database_size(current_database()));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Top 10 tables by total size (including indexes and toast)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  schemaname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  tablename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_total_relation_size(schemaname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;||&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;.&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; total_size,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_relation_size(schemaname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;||&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;.&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename))       &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; table_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_total_relation_size(schemaname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;||&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;.&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If a table’s total size is significantly larger than its live row count implies, dead tuples are accumulating. Cross-reference with &lt;code&gt;pg_stat_user_tables.n_dead_tup&lt;/code&gt;.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Write amplification signal from the background writer&lt;/strong&gt;: PostgreSQL’s &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; tracks how many buffers the checkpointer, the background writer, and ordinary backends write. High &lt;code&gt;buffers_checkpoint&lt;/code&gt; relative to &lt;code&gt;buffers_clean&lt;/code&gt; or &lt;code&gt;buffers_backend&lt;/code&gt; indicates that checkpointing, not the application directly, is driving write I/O. These columns exist in PostgreSQL 16 and earlier; PostgreSQL 17 moved the checkpoint counters to &lt;code&gt;pg_stat_checkpointer&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  checkpoints_timed,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  checkpoints_req,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  buffers_checkpoint,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  buffers_clean,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  buffers_backend,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  maxwritten_clean&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
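&lt;p&gt;The raw counters are easier to read as shares of total buffer writes. The following is a sketch under the assumption that the server runs PostgreSQL 16 or earlier, where all of these columns are present in &lt;code&gt;pg_stat_bgwriter&lt;/code&gt;; the output column names are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Share of buffer writes by source since the last statistics reset&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(100.0 * buffers_checkpoint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        / NULLIF(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS checkpoint_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(100.0 * buffers_backend&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        / NULLIF(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS backend_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- Requested checkpoints are the ones forced by WAL volume rather than the timer&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(100.0 * checkpoints_req&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        / NULLIF(checkpoints_timed + checkpoints_req, 0), 1) AS requested_checkpoint_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  stats_reset&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The counters are cumulative since &lt;code&gt;stats_reset&lt;/code&gt;, so compare two samples taken some hours apart rather than reading a single snapshot.&lt;/p&gt;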
&lt;p&gt;AWS documents that RDS gp2 volumes use a credit-based burst model. A gp2 volume has a baseline of 3 IOPS per GiB (minimum 100 IOPS), earns I/O credits at that baseline rate, and can burst to 3,000 IOPS until the credit bucket empties. Once the credits are depleted, the volume is throttled to its baseline rate. gp2 does not bill per I/O, so the cost shows up indirectly: latency climbs, and the usual fixes are to over-allocate storage for a higher baseline or to move to provisioned IOPS. &lt;code&gt;buffers_checkpoint&lt;/code&gt; growing while CloudWatch &lt;code&gt;BurstBalance&lt;/code&gt; drops toward zero is the signature of this problem.&lt;/p&gt;
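&lt;p&gt;The burst arithmetic is easy to work through for a specific volume. The sketch below assumes a hypothetical 200 GiB gp2 volume and the 5.4 million I/O credit bucket described in the AWS gp2 documentation; &lt;code&gt;size_gib&lt;/code&gt; and the output column names are illustrative, and it runs in any PostgreSQL session as plain arithmetic.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical volume size; change size_gib to match your instance&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WITH vol AS (SELECT 200 AS size_gib),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;perf AS (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- Baseline is 3 IOPS per GiB with a floor of 100 IOPS&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SELECT size_gib, GREATEST(100, 3 * size_gib) AS baseline_iops&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  FROM vol&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  size_gib,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  baseline_iops,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- A full bucket drains at (3000 - baseline) credits per second during a full burst&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CASE WHEN baseline_iops &gt;= 3000 THEN NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ELSE round(5400000.0 / (3000 - baseline_iops) / 60, 1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  END AS full_burst_minutes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM perf;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For 200 GiB this gives a baseline of 600 IOPS and 37.5 minutes of sustained 3,000 IOPS from a full bucket. Volumes whose baseline already reaches 3,000 IOPS return &lt;code&gt;NULL&lt;/code&gt; because bursting no longer applies to them.&lt;/p&gt;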
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;IOPS consumption in CloudWatch&lt;/strong&gt; — pull &lt;code&gt;ReadIOPS&lt;/code&gt; and &lt;code&gt;WriteIOPS&lt;/code&gt; for the last 30 days at 1-hour resolution (the Aurora cluster-level equivalents are &lt;code&gt;VolumeReadIOPS&lt;/code&gt; and &lt;code&gt;VolumeWriteIOPS&lt;/code&gt;). If the volume is gp2 and total IOPS sit flat at the volume’s baseline while &lt;code&gt;BurstBalance&lt;/code&gt; is near zero, the burst credits are gone and the workload is being throttled.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# date -v-30d is BSD/macOS syntax; with GNU date use: date -u -d &apos;30 days ago&apos; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cloudwatch&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get-metric-statistics&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --namespace&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; AWS/RDS&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --metric-name&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; WriteIOPS&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --dimensions&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; Name=DBInstanceIdentifier,Value=YOUR_DB_ID&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --start-time&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -v-30d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --end-time&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --period&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 3600&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --statistics&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; Average&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --output&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;CPU utilization p95 over 30 days&lt;/strong&gt; — pull &lt;code&gt;CPUUtilization&lt;/code&gt; statistics. AWS Compute Optimizer analyzes RDS utilization over a lookback window (14 days by default) and flags instances it finds over-provisioned. As a manual rule of thumb, if p95 CPU is consistently below 40%, the instance is a rightsizing candidate.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Percentiles are requested with --extended-statistics; date -v-30d is BSD/macOS syntax&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; cloudwatch&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get-metric-statistics&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --namespace&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; AWS/RDS&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --metric-name&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; CPUUtilization&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --dimensions&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; Name=DBInstanceIdentifier,Value=YOUR_DB_ID&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --start-time&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -v-30d&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --end-time&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --period&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 3600&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --extended-statistics&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; p95&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --output&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rightsizing down one size within a family (e.g., db.r6g.2xlarge to db.r6g.xlarge) roughly halves the instance-hour cost. It also halves vCPUs and memory, and smaller sizes can carry lower network and EBS bandwidth limits, so compare those limits against observed peaks first.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Replica replication activity&lt;/strong&gt; — query &lt;code&gt;pg_stat_replication&lt;/code&gt; on the primary to see what each replica is actually doing. &lt;code&gt;sent_lsn&lt;/code&gt; minus &lt;code&gt;replay_lsn&lt;/code&gt; is the replication lag in bytes. If a replica’s &lt;code&gt;state&lt;/code&gt; is &lt;code&gt;streaming&lt;/code&gt; but it is rarely queried (verify via the replica’s own &lt;code&gt;pg_stat_activity&lt;/code&gt; or CloudWatch &lt;code&gt;DatabaseConnections&lt;/code&gt;), it is a cost-only presence.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  client_addr,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  sent_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  write_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  flush_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  replay_lsn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  sync_state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
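&lt;p&gt;The byte lag can be computed in the query itself rather than by eye. This sketch uses the built-in &lt;code&gt;pg_wal_lsn_diff&lt;/code&gt; and &lt;code&gt;pg_current_wal_lsn&lt;/code&gt; functions and assumes it runs on the primary of a streaming-replication setup; the output column names are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Per-replica lag, most lagged first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  client_addr,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- WAL already sent to the replica but not yet replayed there&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- Everything the primary has generated that the replica has not replayed&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS total_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  replay_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY replay_lag_bytes DESC NULLS LAST;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Lag tells you whether a replica is keeping up, not whether anyone reads from it, so it still has to be paired with connection or query counts from the replica side.&lt;/p&gt;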
&lt;p&gt;For the broader question of whether read replicas are delivering value relative to their cost, see &lt;a href=&quot;https://rajivonai.com/blog/2023-04-17-read-replicas-are-not-free-scale/&quot;&gt;Read Replicas Are Not Free Scale&lt;/a&gt; — which covers the replication lag model and the routing decisions that make replicas worth keeping.&lt;/p&gt;
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Bill spike detected] --&gt; B{Storage cost growing?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| C{Table bloat above 20%?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|yes| D[Run VACUUM or pg_repack]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|no| E[Audit snapshot retention policy]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| F{IOPS charges high?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|yes| G{gp2 burst balance depleted?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|yes| H[Migrate volume to gp3]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|no| I[Check pg_stat_bgwriter for write amplification]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|no| J{CPU p95 below 40%?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| K[Rightsize instance class down]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| L{CPU p95 above 70%?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| M[Optimize queries or scale up]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| N{Replica traffic justified?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    N --&gt;|no| O[Remove idle replicas]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    N --&gt;|yes| P[No cost action needed — monitor]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Reclaim storage from table bloat&lt;/strong&gt;&lt;/p&gt;
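&lt;p&gt;PostgreSQL’s MVCC model retains dead tuples until autovacuum or a manual vacuum cleans them. On RDS, autovacuum runs automatically but can fall behind on high-write tables. Bloat inflates &lt;code&gt;pg_database_size&lt;/code&gt;. On Aurora that feeds storage billing directly, because Aurora charges per GB-month for the storage the cluster uses, dead tuple space included. On RDS you pay for allocated storage, so bloat costs money when it forces a larger allocation or triggers storage autoscaling, and RDS allocated storage cannot be reduced afterwards.&lt;/p&gt;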
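&lt;p&gt;&lt;code&gt;VACUUM FULL&lt;/code&gt; rewrites the table into a new file and releases the freed space to the OS, but it holds an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock that blocks reads and writes for the entire rewrite, and it needs enough free disk for the new copy. Reserve it for tables that can be unavailable for that long. For live tables, &lt;code&gt;pg_repack&lt;/code&gt; rebuilds the table online and takes only short exclusive locks at the start and end of the run.&lt;/p&gt;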
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Identify bloat candidates&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  schemaname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_dead_tup::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; NULLIF&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; dead_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autovacuum,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_vacuum&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; dead_pct &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Reclaim space (holds an ACCESS EXCLUSIVE lock for the whole rewrite)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;VACUUM FULL &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VERBOSE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; your_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;your_table&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 2 — Migrate gp2 to gp3 for explicit IOPS control&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;AWS documents the gp2 volume type as a burst model: baseline performance is 3 IOPS per GiB, maximum burst is 3,000 IOPS, and burst credits replenish at the baseline rate. Once the credit bucket empties, the volume is throttled to baseline. There is no per-I/O charge on gp2; the only way to buy a higher baseline is to allocate more storage.&lt;/p&gt;
&lt;p&gt;gp3 eliminates the burst model. Storage and IOPS are provisioned independently: a baseline of 3,000 IOPS and 125 MiB/s is included at no additional cost, with additional IOPS purchasable at $0.02 per provisioned IOPS-month. On RDS, the included baseline and the range you can provision depend on the engine and the allocated size, and prices vary by region, so confirm both for your volume. For workloads that have depleted their gp2 burst balance, gp3 is typically lower cost at equivalent IOPS.&lt;/p&gt;
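&lt;p&gt;The IOPS portion of a gp3 bill is simple arithmetic once the included baseline is known. The figures below are hypothetical: 6,000 provisioned IOPS against the 3,000 included IOPS and the $0.02 rate quoted above. Substitute the baseline and regional price that apply to your volume.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical inputs; all three values are assumptions to replace&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WITH plan AS (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SELECT 6000 AS provisioned_iops,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         3000 AS included_iops,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         0.02 AS usd_per_iops_month&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- Only IOPS above the included baseline are charged&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  GREATEST(0, provisioned_iops - included_iops) AS billable_iops,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  GREATEST(0, provisioned_iops - included_iops) * usd_per_iops_month AS extra_usd_per_month&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM plan;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these inputs, 3,000 billable IOPS come to $60 per month on top of the storage charge.&lt;/p&gt;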
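&lt;p&gt;The migration is online and reversible: RDS performs it as a storage modification with no downtime required. After any storage modification the instance enters a storage-optimization phase, and RDS blocks further storage changes for six hours or until optimization completes, whichever is longer, so a revert cannot be immediate.&lt;/p&gt;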
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# --iops is omitted so the volume takes the baseline RDS includes; add it only to provision above that baseline&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; modify-db-instance&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; YOUR_DB_ID&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --storage-type&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; gp3&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --apply-immediately&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 3 — Rightsize the instance class&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When CloudWatch &lt;code&gt;CPUUtilization&lt;/code&gt; p95 stays below 40% over a 30-day window, the instance class is likely over-provisioned for CPU; check &lt;code&gt;FreeableMemory&lt;/code&gt; as well before concluding that a smaller class will fit. AWS Compute Optimizer surfaces RDS rightsizing recommendations automatically; the recommendations include projected savings and a confidence rating based on observed utilization.&lt;/p&gt;
&lt;p&gt;Rightsizing down one size within the same instance family (e.g., db.r6g.2xlarge to db.r6g.xlarge) keeps the same memory-to-vCPU ratio while roughly halving instance-hour cost, vCPUs, and memory. Network and EBS bandwidth limits can also be lower on the smaller size. Verify that the target instance class can accommodate peak connection count, memory requirements, and I/O throughput before applying.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Apply instance class change with minimal downtime (uses MultiAZ failover if enabled)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; modify-db-instance&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; YOUR_DB_ID&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-class&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; db.r6g.xlarge&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --apply-immediately&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 4 — Remove idle read replicas&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Each RDS or Aurora read replica is a full instance billed at the rate for its own instance class, which usually matches the primary. Replicas that carry negligible query traffic (verify via CloudWatch &lt;code&gt;DatabaseConnections&lt;/code&gt; on the replica endpoint) are pure cost with no throughput benefit.&lt;/p&gt;
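&lt;p&gt;Removing a replica is a permanent action with no undo. If the replica exists for failover or disaster recovery, it is not idle in the sense that matters here, so keep it. If you want to retain its data as an independent copy, promote it first: promotion ends the replication relationship, and the promoted instance keeps billing as a standalone database. If it is genuinely unused, delete it directly.&lt;/p&gt;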
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Delete a replica with no promotion needed&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; delete-db-instance&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; YOUR_REPLICA_ID&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --skip-final-snapshot&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Storage VACUUM FULL&lt;/strong&gt; — there is nothing to revert once it completes; the operation only releases space. While it runs, watch &lt;code&gt;pg_stat_activity&lt;/code&gt; for sessions blocked behind its lock. If the lock causes application errors, cancel the &lt;code&gt;VACUUM FULL&lt;/code&gt; session; the rewrite is rolled back and the original table is left intact. Prefer &lt;code&gt;pg_repack&lt;/code&gt; on production tables to avoid the long lock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;gp2 to gp3 migration&lt;/strong&gt; — reversible. AWS allows reverting a gp3 volume back to gp2 via another storage modification. Monitor CloudWatch &lt;code&gt;WriteLatency&lt;/code&gt; and &lt;code&gt;ReadLatency&lt;/code&gt; after the change; if latency increases, revert.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instance class rightsize&lt;/strong&gt; — reversible. Scale back up via &lt;code&gt;modify-db-instance&lt;/code&gt;. If using Multi-AZ, the downtime is a failover window (typically one to two minutes); without Multi-AZ, expect the instance to be unavailable for the duration of the change. Monitor &lt;code&gt;DatabaseConnections&lt;/code&gt;, &lt;code&gt;FreeableMemory&lt;/code&gt;, and &lt;code&gt;CPUUtilization&lt;/code&gt; for 48 hours after the change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replica removal&lt;/strong&gt; — not reversible. A deleted replica cannot be re-attached. Create a new replica from scratch if needed. Before deleting, capture the replica’s CloudWatch &lt;code&gt;DatabaseConnections&lt;/code&gt; over the last 30 days to confirm it was idle.&lt;/li&gt;
&lt;/ol&gt;
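&lt;p&gt;For the lock monitoring mentioned in the first rollback item, a query along these lines lists sessions that are currently waiting on another session’s lock. It is a sketch that relies on the built-in &lt;code&gt;pg_blocking_pids&lt;/code&gt; function; the 80-character query preview and the column aliases are arbitrary choices.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sessions blocked by another session, with the PIDs holding the lock&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  a.pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  a.state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  a.wait_event_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_blocking_pids(a.pid) AS blocked_by,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - a.query_start AS running_for,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  left(a.query, 80) AS query_preview&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity AS a&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE cardinality(pg_blocking_pids(a.pid)) &gt; 0;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- To abandon a rewrite, pass the PID of the VACUUM FULL session to pg_cancel_backend(pid)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the PID running &lt;code&gt;VACUUM FULL&lt;/code&gt; appears in many &lt;code&gt;blocked_by&lt;/code&gt; arrays, the table is busier than expected and the rewrite is a candidate for cancellation.&lt;/p&gt;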
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Cost anomaly detection in AWS Cost Explorer can alert when RDS spend deviates from a predicted baseline. Set a threshold of 10–15% above the trailing 30-day average for the database service line; this catches storage growth and IOPS spikes before the end-of-month invoice.&lt;/p&gt;
&lt;p&gt;AWS Compute Optimizer generates RDS rightsizing recommendations on a rolling basis. Export the recommendations weekly via the Compute Optimizer API and route flagged instances to a Slack channel or ticket queue for review. The documented API call is straightforward:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; compute-optimizer&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; get-rds-database-recommendations&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --filters&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; name=Finding,values=Overprovisioned&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --output&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; json&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For replica auditing, schedule a job that snapshots &lt;code&gt;pg_stat_replication&lt;/code&gt; on the primary into a monitoring table and records each replica’s CloudWatch &lt;code&gt;DatabaseConnections&lt;/code&gt; alongside it; the connection metric comes from CloudWatch, not from a query on the primary. Together they give a weekly audit trail. Flag replicas where the rolling 7-day average connection count on the replica endpoint is below five; those are candidates for removal review.&lt;/p&gt;
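&lt;p&gt;The PostgreSQL half of that audit could look like the sketch below. The &lt;code&gt;replica_audit&lt;/code&gt; table and its columns are hypothetical names chosen for illustration, and the scheduler (for example &lt;code&gt;pg_cron&lt;/code&gt; or an external job) is left to you; connection counts would still need to be collected from CloudWatch separately.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical audit table; the name and columns are illustrative&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE IF NOT EXISTS replica_audit (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  captured_at      timestamptz NOT NULL DEFAULT now(),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name text,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  client_addr      inet,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  state            text,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  sync_state       text,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  replay_lag_bytes numeric&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the primary on a schedule; each run appends one row per connected replica&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;INSERT INTO replica_audit (application_name, client_addr, state, sync_state, replay_lag_bytes)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT application_name, client_addr, state, sync_state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_wal_lsn_diff(sent_lsn, replay_lsn)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A replica that is missing from a run was not connected at capture time, which is itself worth flagging in the review.&lt;/p&gt;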
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;What broke: The RDS billing line grew faster than traffic because one or more of four cost dimensions — storage bloat, IOPS burst depletion, over-provisioned instance class, or idle replicas — was not monitored against a policy.&lt;/li&gt;
&lt;li&gt;What was done: Each dimension was triaged in order using documented CloudWatch metrics and PostgreSQL system catalog queries; the offending dimension was identified and remediated with a targeted change, reversible where the platform allows it (storage type and instance class).&lt;/li&gt;
&lt;li&gt;What prevents recurrence: Compute Optimizer rightsizing alerts, Cost Explorer anomaly detection, and a weekly replica audit ensure each dimension is reviewed before it compounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Pull AWS Cost Explorer grouped by RDS usage type to identify which billing dimension is growing.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;SELECT pg_size_pretty(pg_database_size(current_database()))&lt;/code&gt; on each RDS instance to establish a storage baseline.&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_user_tables&lt;/code&gt; for tables with dead tuple percentages above 20%; schedule &lt;code&gt;VACUUM FULL&lt;/code&gt; or &lt;code&gt;pg_repack&lt;/code&gt; for the top offenders.&lt;/li&gt;
&lt;li&gt;Check CloudWatch &lt;code&gt;BurstBalance&lt;/code&gt; on any gp2 volume; if it is below 50% and trending down, plan a gp3 migration.&lt;/li&gt;
&lt;li&gt;Pull 30-day &lt;code&gt;ReadIOPS&lt;/code&gt; and &lt;code&gt;WriteIOPS&lt;/code&gt; with 1-hour resolution; compare their sum to the gp2 baseline rate for the volume size.&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; to detect write amplification from checkpoint pressure; tune &lt;code&gt;checkpoint_completion_target&lt;/code&gt; and &lt;code&gt;max_wal_size&lt;/code&gt; if &lt;code&gt;checkpoints_req&lt;/code&gt; is high.&lt;/li&gt;
&lt;li&gt;Pull 30-day &lt;code&gt;CPUUtilization&lt;/code&gt; p95; flag any instance where p95 is below 40% as an over-provisioning candidate.&lt;/li&gt;
&lt;li&gt;Review AWS Compute Optimizer recommendations for the RDS cluster; document each flagged instance and projected savings.&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_replication&lt;/code&gt; on the primary and cross-reference replica endpoint &lt;code&gt;DatabaseConnections&lt;/code&gt; to identify replicas with no meaningful traffic.&lt;/li&gt;
&lt;li&gt;Remove or repurpose idle replicas after confirming they are not required for failover topology.&lt;/li&gt;
&lt;li&gt;Set snapshot retention to match the recovery point objective in the database’s SLA; remove retention beyond policy.&lt;/li&gt;
&lt;li&gt;Enable Cost Explorer anomaly detection for the RDS service line at a 10–15% deviation threshold.&lt;/li&gt;
&lt;/ol&gt;
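&lt;p&gt;Before tuning the checkpoint parameters named in item 6, it helps to see what the instance is currently running with. This sketch reads the standard &lt;code&gt;pg_settings&lt;/code&gt; view; &lt;code&gt;checkpoint_timeout&lt;/code&gt; is included as an assumption that it is relevant to the same tuning pass. On RDS the values are changed through the DB parameter group rather than by editing &lt;code&gt;postgresql.conf&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Current checkpoint-related settings and where each value comes from&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT name, setting, unit, source&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_settings&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE name IN (&apos;checkpoint_completion_target&apos;, &apos;max_wal_size&apos;, &apos;checkpoint_timeout&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY name;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Record these values before changing anything so the effect on &lt;code&gt;checkpoints_req&lt;/code&gt; can be compared against a known starting point.&lt;/p&gt;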
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: An RDS bill spike triggers the instinct to scale the instance or add replicas — changes that are expensive, slow to take effect, and often targeting the wrong cost dimension entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Triage the four cost dimensions in order — storage bloat, IOPS burst depletion, over-provisioned instance class, idle replicas — using CloudWatch metrics and PostgreSQL system catalog queries before making any change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A specific dimension is identified as the driver, a targeted remediation is applied, and the next month’s Cost Explorer line for that dimension is lower — without touching the dimensions that were not the cause.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, enable AWS Compute Optimizer for your RDS instances and set a Cost Explorer anomaly detection alert at 15% above your 30-day RDS baseline — both are free to configure and will surface the next cost spike before it compounds.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>checklist</category></item><item><title>MySQL Binlog Format: Row vs Statement vs Mixed</title><link>https://rajivonai.com/blog/2023-05-29-mysql-binlog-format-row-statement-mixed/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-05-29-mysql-binlog-format-row-statement-mixed/</guid><description>Choosing the wrong MySQL binary log format silently breaks replication or bloats the binlog — this is the decision tree for picking the right one.</description><pubDate>Mon, 29 May 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;MySQL’s binary log records every change for replication and point-in-time recovery, but the format it uses to record those changes determines whether replicas stay consistent.&lt;/strong&gt; Three formats are available. One of them has a silent correctness problem that surfaces only when non-deterministic SQL runs on a replica, at which point the divergence is already committed to disk.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The binary log (binlog) is the backbone of MySQL replication and PITR. Every write that commits on the primary is written to the binlog. Replicas consume the binlog and replay those writes locally. The format controls how each write is recorded: as the original SQL statement, as the actual row values that changed, or as a combination of both selected automatically.&lt;/p&gt;
&lt;p&gt;Engineers provisioning a new MySQL server or migrating from an older version frequently encounter the format question without a clear default rationale. MySQL releases before 5.7.7 defaulted to STATEMENT. MySQL 5.7.7 changed the default to ROW, and MySQL 8.0 keeps it. The reason for that change is the correctness problem in STATEMENT format, and understanding it clarifies why ROW is the right default for most production workloads.&lt;/p&gt;
&lt;p&gt;You can check the current format on any running server:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; @@binlog_format;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;STATEMENT format logs the SQL text that ran on the primary. When the replica applies the statement, it re-executes that SQL. For most deterministic DML this is fine. The problem appears with functions whose result depends on the server that runs them: &lt;code&gt;UUID()&lt;/code&gt;, &lt;code&gt;RAND()&lt;/code&gt;, &lt;code&gt;SYSDATE()&lt;/code&gt;, &lt;code&gt;USER()&lt;/code&gt;, user-defined functions, and some stored procedure patterns. &lt;code&gt;NOW()&lt;/code&gt; is a notable exception: each binlog event carries the original statement timestamp, so the replica reproduces the primary’s value.&lt;/p&gt;
&lt;p&gt;Consider this insert:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (id, session_token, created_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VALUES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;42&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, UUID(), &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;());&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the primary, &lt;code&gt;UUID()&lt;/code&gt; generates a specific UUID. The statement is written to the binlog verbatim. On the replica, the statement re-executes and &lt;code&gt;UUID()&lt;/code&gt; generates a different value, so the primary and replica now hold different &lt;code&gt;session_token&lt;/code&gt; values for the same row. &lt;code&gt;NOW()&lt;/code&gt; does not diverge, because the replica reuses the timestamp recorded in the binlog event; replacing it with &lt;code&gt;SYSDATE()&lt;/code&gt; would add a second divergent column. The replica has not errored. It has silently diverged.&lt;/p&gt;
&lt;p&gt;The same problem appears with &lt;code&gt;RAND()&lt;/code&gt;, triggers that call non-deterministic functions, and stored procedures whose output depends on server state. MySQL logs a warning in STATEMENT mode when it detects a non-deterministic statement, but the warning is easy to miss in a busy log.&lt;/p&gt;
&lt;h2 id=&quot;how-the-three-formats-work&quot;&gt;How the Three Formats Work&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Format&lt;/th&gt;&lt;th&gt;What is logged&lt;/th&gt;&lt;th&gt;Safe for non-deterministic SQL&lt;/th&gt;&lt;th&gt;Binlog size&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;STATEMENT&lt;/td&gt;&lt;td&gt;SQL text of the change&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;td&gt;Small&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ROW&lt;/td&gt;&lt;td&gt;Before and after values for each row&lt;/td&gt;&lt;td&gt;Yes&lt;/td&gt;&lt;td&gt;Large for bulk operations&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MIXED&lt;/td&gt;&lt;td&gt;Automatically ROW when unsafe, STATEMENT otherwise&lt;/td&gt;&lt;td&gt;Mostly; unsafe-statement detection is not exhaustive&lt;/td&gt;&lt;td&gt;Moderate&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;ROW format&lt;/strong&gt; logs the actual column values that changed for every row. For a statement that updates 10,000 rows, ROW format writes 10,000 row images to the binlog. This is verbose. A bulk DELETE or UPDATE that touches millions of rows produces a proportionally large binlog event. Binlog disk usage and replication bandwidth both increase relative to STATEMENT.&lt;/p&gt;
&lt;p&gt;The tradeoff is correctness: ROW format replicas always apply the exact values the primary committed. There is no re-execution, no non-determinism, no divergence risk.&lt;/p&gt;
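&lt;p&gt;The size cost of ROW format can be approximated with simple arithmetic before a batch job runs. The sketch below is a rough estimate only: the function name is hypothetical, the 200-byte average row size is an illustrative assumption, and the calculation ignores event headers, &lt;code&gt;binlog_row_image&lt;/code&gt; settings, and any binlog compression. It assumes an UPDATE logs two images per row (before and after) while an INSERT or DELETE logs one.&lt;/p&gt;

```shell
# Rough estimate of ROW-format binlog bytes produced by one bulk statement.
# Args: rows changed, average row size in bytes (assumed), images per row
#       (2 for UPDATE with before and after images, 1 for INSERT or DELETE).
estimate_row_binlog_bytes() {
  local rows=$1 avg_row_bytes=$2 images=$3
  echo $(( rows * avg_row_bytes * images ))
}

# The 10,000-row UPDATE from the text, assuming 200-byte rows.
estimate_row_binlog_bytes 10000 200 2   # prints 4000000
```

&lt;p&gt;Under the same assumptions, an UPDATE that touches ten million rows would write on the order of 4 GB, which is the kind of number worth checking against binlog retention and free disk space.&lt;/p&gt;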
&lt;p&gt;&lt;strong&gt;MIXED format&lt;/strong&gt; attempts to get the best of both: it uses STATEMENT by default and switches to ROW automatically when MySQL detects that the statement is unsafe for statement-based replication. The detection covers most known unsafe patterns, but coverage is not exhaustive — some stored procedure and trigger combinations can still produce unsafe MIXED-format behavior in edge cases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MySQL 8.0 default:&lt;/strong&gt; ROW, unchanged since MySQL 5.7.7. The MySQL Reference Manual describes ROW as the safest format for replication consistency, and some features require it, including Group Replication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Changing the format at runtime&lt;/strong&gt; (requires SUPER or, in MySQL 8.0, SYSTEM_VARIABLES_ADMIN; SESSION_VARIABLES_ADMIN is enough for the session-level change):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Session level&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SESSION&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; binlog_format &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;ROW&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Global level (takes effect for new connections)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; GLOBAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; binlog_format &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;ROW&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a permanent change, set it in the MySQL configuration file:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;ini&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;[mysqld]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;binlog_format&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = ROW&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that changing the global binlog format does not affect the current session’s format. Each session that was open before the change continues using the old format until reconnected.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The MySQL 8.0 Reference Manual, in the section “Binary Logging Formats,” explicitly documents the non-deterministic function risk in STATEMENT mode and lists the categories of unsafe statements. The change of default from STATEMENT to ROW dates to MySQL 5.7.7 and is documented in that release’s notes and in the replication chapter of the manual.&lt;/p&gt;
&lt;p&gt;The binlog size growth with ROW format is documented behavior: the MySQL documentation notes that ROW format generates more log data for statements that modify many rows, particularly for bulk DELETE, UPDATE, and INSERT…SELECT operations. The practical implication is that teams migrating from STATEMENT to ROW should audit their batch operations and ensure binlog retention and disk capacity account for the larger volume.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;STATEMENT with non-deterministic functions&lt;/td&gt;&lt;td&gt;Replica silently diverges from primary&lt;/td&gt;&lt;td&gt;Different values for UUID(), SYSDATE(), or USER() on re-execution&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ROW format with bulk multi-row operations&lt;/td&gt;&lt;td&gt;Binlog grows very large; replication bandwidth spikes&lt;/td&gt;&lt;td&gt;One row image written per changed row&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MIXED with complex stored procedures or triggers&lt;/td&gt;&lt;td&gt;Unsafe pattern not detected; the event is logged as STATEMENT&lt;/td&gt;&lt;td&gt;MySQL’s unsafe-detection does not cover all trigger and procedure edge cases&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: STATEMENT format silently breaks replica consistency when an unsafe non-deterministic function such as &lt;code&gt;UUID()&lt;/code&gt; or &lt;code&gt;SYSDATE()&lt;/code&gt; appears in DML, and the divergence is committed without any error being raised.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code&gt;binlog_format = ROW&lt;/code&gt; in the MySQL configuration for all production servers; MySQL 8.0 defaults to this already.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Check &lt;code&gt;SELECT @@binlog_format&lt;/code&gt; on all replicas and the primary; run SHOW REPLICA STATUS and verify &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; stays near zero after the format change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;SELECT @@binlog_format&lt;/code&gt; on every MySQL instance in production. For any instance running STATEMENT or MIXED, review whether non-deterministic functions appear in the application’s DML patterns before the next major version upgrade.&lt;/li&gt;
&lt;/ul&gt;
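&lt;p&gt;The audit in the Action item could be scripted roughly as follows. This is a minimal sketch, assuming the &lt;code&gt;mysql&lt;/code&gt; client is on the PATH and credentials come from an option file such as &lt;code&gt;~/.my.cnf&lt;/code&gt;. The function names are hypothetical, and so are the host names in the usage line.&lt;/p&gt;

```shell
# Hypothetical helper: turn a binlog_format value into an audit verdict.
classify_binlog_format() {
  case "$1" in
    ROW) echo "ok" ;;
    MIXED|STATEMENT) echo "review" ;;
    *) echo "unknown" ;;
  esac
}

# Hypothetical helper: read the global format from one host.
# Assumes credentials are supplied by the client option file.
get_binlog_format() {
  mysql --host="$1" --batch --skip-column-names -e 'SELECT @@GLOBAL.binlog_format'
}

# Print one tab-separated line per host: host, format, verdict.
audit_hosts() {
  local host fmt
  for host in "$@"; do
    fmt=$(get_binlog_format "$host") || fmt="unreachable"
    printf '%s\t%s\t%s\n' "$host" "$fmt" "$(classify_binlog_format "$fmt")"
  done
}
```

&lt;p&gt;Calling &lt;code&gt;audit_hosts db1.internal db2.internal&lt;/code&gt; would print one line per host. Any host marked &lt;code&gt;review&lt;/code&gt; is a candidate for the DML review described above, and &lt;code&gt;unknown&lt;/code&gt; covers hosts that could not be queried.&lt;/p&gt;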
&lt;p&gt;ROW format is not a performance optimization — it is a correctness requirement for any workload that uses non-deterministic SQL. The binlog size cost is real but manageable. Replica divergence is not.&lt;/p&gt;</content:encoded><category>databases</category></item><item><title>Database Backup Validation Workflow</title><link>https://rajivonai.com/blog/2023-05-15-database-backup-validation-workflow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-05-15-database-backup-validation-workflow/</guid><description>A repeatable runbook for proving that your database backups are actually restorable — with exact commands, decision tree, and automation patterns.</description><pubDate>Mon, 15 May 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A backup that has never been restored is a hypothesis, not a safety net.&lt;/strong&gt; The job of a backup validation workflow is not to confirm that backup files exist — it is to prove that a recoverable database can be produced from them within your documented RTO, on demand, and on a schedule that keeps that proof fresh.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most teams reach a point where backup jobs are running nightly, retention windows are configured, and monitoring shows no failures. The backup checkbox is green. What is rarely true is that anyone has measured how long a restore actually takes, or whether the restored database is consistent enough to serve traffic.&lt;/p&gt;
&lt;p&gt;The gap between “backups are running” and “we can recover from backups” is where most recovery failures live. That gap expands silently: schema migrations add tables that the restore script does not verify, sequences drift out of sync, foreign key constraints that were dropped for a bulk load never get re-added, and PITR windows shrink as WAL archiving falls behind. None of these register as a backup failure. They register as a recovery failure — at 3am, under incident pressure, with customers waiting.&lt;/p&gt;
&lt;p&gt;This runbook operationalizes the difference. The goal is a weekly validation cycle that produces a measured RTO, a verified consistent restore, and documented PITR coverage — before you need any of them.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;No documented restore time&lt;/td&gt;&lt;td&gt;Runbook or incident playbook&lt;/td&gt;&lt;td&gt;RTO is aspirational, not measured&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup job shows “succeeded” but restore has never been tested&lt;/td&gt;&lt;td&gt;CI logs, backup tool dashboard&lt;/td&gt;&lt;td&gt;File integrity is confirmed; recoverability is not&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Backup files exist but manifest or catalog is unverified&lt;/td&gt;&lt;td&gt;pg_dump output, S3 bucket listing&lt;/td&gt;&lt;td&gt;Partial or corrupt dump may silently pass a file-size check&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Last restore test was more than 90 days ago&lt;/td&gt;&lt;td&gt;Backup validation log, calendar&lt;/td&gt;&lt;td&gt;Schema and data drift since last test may invalidate assumptions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;RTO and RPO are in the SLA doc but not measured&lt;/td&gt;&lt;td&gt;SLA document, incident retrospectives&lt;/td&gt;&lt;td&gt;Numbers were estimated at design time and never validated&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;pg_stat_archiver shows gaps or lag&lt;/td&gt;&lt;td&gt;PostgreSQL system view&lt;/td&gt;&lt;td&gt;WAL archive is falling behind; PITR window is narrowing&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify backup file integrity&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a PostgreSQL logical dump, verify the catalog without performing a full restore:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_restore&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --list&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; backup.dump&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /dev/null&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; &amp;#x26;&amp;#x26; &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;catalog OK&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
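&lt;p&gt;The &lt;code&gt;--list&lt;/code&gt; flag reads the table of contents from an archive-format dump (custom, directory, or tar). If the header or table of contents is damaged, this fails immediately. A clean exit with “catalog OK” confirms only that the catalog is readable. It does not confirm data integrity, and a dump truncated partway through the data section can still pass; proving the data requires a restore.&lt;/p&gt;
&lt;p&gt;For RDS instance snapshots, check snapshot status and progress via the CLI (Aurora clusters use &lt;code&gt;describe-db-cluster-snapshots&lt;/code&gt; with &lt;code&gt;--db-cluster-identifier&lt;/code&gt; instead):&lt;/p&gt;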
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; describe-db-snapshots&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --query&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;DBSnapshots[*].[DBSnapshotIdentifier,Status,PercentProgress]&apos;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --output&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any snapshot not in &lt;code&gt;available&lt;/code&gt; status cannot be used for restore. The &lt;code&gt;PercentProgress&lt;/code&gt; field indicates whether an automated snapshot is still in progress.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check backup age and frequency&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For PostgreSQL with WAL archiving, query the archiver process state:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; archived_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       last_archived_wal,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       last_archived_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       failed_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       last_failed_wal,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       last_failed_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       stats_reset&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_archiver;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The documented behavior of &lt;code&gt;pg_stat_archiver&lt;/code&gt; (PostgreSQL documentation, §28.2) is that &lt;code&gt;last_archived_time&lt;/code&gt; reflects when the most recent WAL segment was successfully archived. A &lt;code&gt;failed_count&lt;/code&gt; greater than zero with a recent &lt;code&gt;last_failed_time&lt;/code&gt; means the archive pipeline is broken and your PITR window has stopped advancing. &lt;code&gt;archived_count&lt;/code&gt; resetting unexpectedly can indicate a statistics reset, not necessarily a problem — check &lt;code&gt;stats_reset&lt;/code&gt;.&lt;/p&gt;
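&lt;p&gt;The interpretation above can be reduced to a small decision rule. The following is a sketch under stated assumptions, not a standard tool: the function name and the threshold are hypothetical, and the ages are assumed to be computed in seconds beforehand, for example with &lt;code&gt;extract(epoch FROM now() - last_archived_time)&lt;/code&gt;. A failure that is more recent than the last success is treated as a stuck archiver.&lt;/p&gt;

```shell
# Hypothetical classifier for pg_stat_archiver readings.
# Args: seconds since last_archived_time,
#       seconds since last_failed_time (-1 if the archiver never failed),
#       maximum acceptable archive age in seconds.
archiver_health() {
  local archived_age=$1 failed_age=$2 max_age=$3
  if [ "$failed_age" -ge 0 ]; then
    if [ "$failed_age" -lt "$archived_age" ]; then
      echo "broken"   # the latest failure is newer than the latest success
      return
    fi
  fi
  if [ "$archived_age" -gt "$max_age" ]; then
    echo "stale"      # no newer failure, but nothing archived recently
  else
    echo "healthy"
  fi
}
```

&lt;p&gt;A &lt;code&gt;stale&lt;/code&gt; result is not always a fault: on an idle server a WAL segment is archived only when it fills or when &lt;code&gt;archive_timeout&lt;/code&gt; forces a switch, so the threshold should reflect the write rate.&lt;/p&gt;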
&lt;p&gt;For RDS, list recent snapshots with a date filter:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; describe-db-snapshots&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --query&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;DBSnapshots[?SnapshotCreateTime&gt;=`2023-05-08`].[DBSnapshotIdentifier,SnapshotCreateTime,Status]&apos;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --output&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Time a restore to a test instance&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Record the start time, execute the restore, and record the end time. This is your measured RTO. Do not estimate — measure:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;RESTORE_START&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;Restore started: &lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$RESTORE_START&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# PostgreSQL logical restore to a test instance&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_restore&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --host=test-db.internal&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --port=5432&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --username=restore_user&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --dbname=restore_target&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --verbose&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  backup.dump&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;RESTORE_END&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;Restore completed: &lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$RESTORE_END&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For an RDS instance, restore from a snapshot using the AWS CLI (Aurora clusters use &lt;code&gt;restore-db-cluster-from-snapshot&lt;/code&gt; instead):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; restore-db-instance-from-db-snapshot&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb-validation-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y%m%d&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-snapshot-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb-snapshot-id&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-class&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; db.t3.medium&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --no-multi-az&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --no-publicly-accessible&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Log start and end times. The elapsed wall-clock time is your real RTO for this backup type and database size.&lt;/p&gt;
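&lt;p&gt;To turn the two timestamps into a pass or fail result, the elapsed time can be compared against the documented RTO. The sketch below assumes the start and end are captured as epoch seconds (for example with &lt;code&gt;date -u +%s&lt;/code&gt;) instead of ISO strings; the function name and the one-hour target are hypothetical.&lt;/p&gt;

```shell
# Hypothetical helper: compare a measured restore time against an RTO target.
# Args: start epoch seconds, end epoch seconds, RTO target in seconds.
rto_verdict() {
  local start=$1 end=$2 target=$3
  local elapsed=$(( end - start ))
  if [ "$elapsed" -le "$target" ]; then
    echo "PASS elapsed=${elapsed}s target=${target}s"
  else
    echo "FAIL elapsed=${elapsed}s target=${target}s"
  fi
}

# Usage sketch (not executed here):
#   start=$(date -u +%s); pg_restore ... ; end=$(date -u +%s)
#   rto_verdict "$start" "$end" 3600
```

&lt;p&gt;Writing the verdict line to the validation log each week gives a history of measured restore times next to the target.&lt;/p&gt;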
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify data consistency post-restore&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Row counts on critical tables catch gross data loss. Sequence values confirm identity columns are in sync. Foreign key constraints confirm referential integrity was preserved:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Row counts on high-value tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; schemaname, relname, n_live_tup&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; schemaname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;public&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check current sequence values (pg_sequences exposes last_value; PostgreSQL 10+)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; sequencename, last_value&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_sequences&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; schemaname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;public&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify foreign key constraints are present&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; conname, contype, conrelid::regclass &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; table_name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_constraint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; contype &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;f&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The expected output is that row counts roughly match production (accounting for any lag, and for &lt;code&gt;n_live_tup&lt;/code&gt; being an estimate rather than an exact count), the &lt;code&gt;last_value&lt;/code&gt; of each sequence is at or ahead of the maximum id value in its table, and all foreign key constraints are present. A missing constraint row indicates the constraint was dropped and not re-added before the backup was taken.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test point-in-time recovery&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For PostgreSQL, a PITR test restores to a target LSN or timestamp rather than replaying to the end of the archived WAL. This verifies that the WAL segments needed to reach that target are intact and readable:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# In recovery.conf (Postgres 11 and earlier) or postgresql.conf plus an empty recovery.signal file (12+):&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# recovery_target_time = &apos;2023-05-14 22:00:00 UTC&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# restore_command = &apos;cp /mnt/wal_archive/%f %p&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
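&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# For an RDS PostgreSQL instance, restore to a point in time one hour before present.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Aurora clusters use restore-db-cluster-to-point-in-time with --restore-to-time instead.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# date -v-1H is BSD/macOS syntax; with GNU date use: date -u -d &apos;1 hour ago&apos; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;/span&gt;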
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;aws&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; rds&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; restore-db-instance-to-point-in-time&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --source-db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --target-db-instance-identifier&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; mydb-pitr-validation-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y%m%d&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --restore-time&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -v-1H&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y-%m-%dT%H:%M:%SZ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --db-instance-class&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; db.t3.medium&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --no-publicly-accessible&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
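&lt;p&gt;The &lt;code&gt;--restore-time&lt;/code&gt; parameter accepts a UTC timestamp in ISO 8601 format (for Aurora clusters, the corresponding parameter on &lt;code&gt;restore-db-cluster-to-point-in-time&lt;/code&gt; is &lt;code&gt;--restore-to-time&lt;/code&gt;). The restored instance should come up in a consistent state at the target time. Verify by checking a table that had known writes in the hour before the target timestamp, and confirm that rows written after the target are absent.&lt;/p&gt;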
&lt;/li&gt;
&lt;/ol&gt;
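&lt;p&gt;The consistency checks above come down to two comparisons: row counts against a production baseline, and sequence values against the ids they feed. The sketch below is one way to express them, assuming the query results have already been loaded into plain dictionaries. The function names, the 1% default tolerance, and the dictionary shapes are illustrative assumptions, not part of any PostgreSQL tooling.&lt;/p&gt;

```python
from typing import Dict, List


def compare_row_counts(
    production: Dict[str, int],
    restored: Dict[str, int],
    tolerance: float = 0.01,  # assumed acceptable relative drift (1%)
) -> List[str]:
    """Return a list of problems; an empty list means the counts look consistent."""
    problems = []
    for table, expected in production.items():
        if table not in restored:
            problems.append(f"{table}: missing from restored instance")
            continue
        actual = restored[table]
        # Allow a small drift to account for writes made since the backup.
        allowed = max(1, int(expected * tolerance))
        if abs(expected - actual) > allowed:
            problems.append(f"{table}: expected ~{expected}, restored has {actual}")
    return problems


def lagging_sequences(last_values: Dict[str, int], max_ids: Dict[str, int]) -> List[str]:
    """Return sequences whose last_value is behind the max id of the table they feed."""
    return [
        name
        for name, last_value in last_values.items()
        if max_ids.get(name, 0) > last_value
    ]
```

&lt;p&gt;An empty result from both functions is the pass condition. Anything returned is a starting point for investigation, not a verdict, because the row counts come from statistics views and are estimates.&lt;/p&gt;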
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Backup exists in storage] --&gt; B{Integrity verified?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| C[Re-run backup — check for errors]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| D{Restore timed in last 30 days?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no| E[Run restore drill — record start and end time]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; F{Measured RTO within SLA?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|no| G[Escalate — switch to physical backup or optimize]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt;|yes| H{Data consistency verified?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| I[Investigate — row counts, constraints, sequences]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| J{PITR tested in last 30 days?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| K[Run PITR drill — restore to timestamp minus 1 hour]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt; L{PITR restore succeeded?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| M[Check WAL archive — review pg_stat_archiver]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| N[Mark validation complete — log date and RTO]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| N&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
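&lt;p&gt;The flowchart can also be read as a function from the current validation state to the next action. The following is a minimal sketch of that reading. The &lt;code&gt;ValidationState&lt;/code&gt; fields and the &lt;code&gt;next_action&lt;/code&gt; name are hypothetical, and the drill outcome fields are only consulted when the corresponding drill was overdue.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ValidationState:
    integrity_verified: bool
    restore_timed_recently: bool  # timed restore within the last 30 days
    rto_within_sla: bool          # result of the drill, if one had to be run
    consistency_verified: bool
    pitr_tested_recently: bool    # PITR test within the last 30 days
    pitr_succeeded: bool          # result of the PITR drill, if one had to be run


def next_action(state: ValidationState) -> str:
    """Walk the flowchart top to bottom and return the first blocking step."""
    if not state.integrity_verified:
        return "Re-run backup and check for errors"
    if not state.restore_timed_recently and not state.rto_within_sla:
        return "Escalate: switch to physical backup or optimize"
    if not state.consistency_verified:
        return "Investigate row counts, constraints, sequences"
    if not state.pitr_tested_recently and not state.pitr_succeeded:
        return "Check WAL archive and review pg_stat_archiver"
    return "Mark validation complete: log date and RTO"
```

&lt;p&gt;Encoding the tree this way makes the ordering explicit: integrity comes before timing, timing before consistency, and PITR last, so a failure early in the chain is never masked by a later success.&lt;/p&gt;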
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Switch from logical to physical backup for faster RTO&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PostgreSQL &lt;code&gt;pg_dump&lt;/code&gt; produces a portable logical backup, but restore time scales with database size because every row has to be reloaded and every index rebuilt. Restores are single-threaded by default; parallel restore with &lt;code&gt;-j&lt;/code&gt; (available for custom-format and directory-format dumps) helps but still has to transfer and reload all of the data. For large databases where RTO is failing its SLA target, switching to a physical backup method (&lt;code&gt;pg_basebackup&lt;/code&gt; for self-managed PostgreSQL, or Aurora snapshots, which are taken at the storage layer) typically reduces restore time significantly because a physical restore copies data files as they are instead of reloading rows and rebuilding indexes.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Physical base backup for self-managed PostgreSQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_basebackup&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --host=primary.internal&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --username=replication_user&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --pgdata=/var/lib/postgresql/base_backup&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --format=tar&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --gzip&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --progress&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --wal-method=stream&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use when: logical restore times consistently exceed RTO targets and the database is large enough that parallel restore does not close the gap.&lt;/p&gt;
&lt;p&gt;Risk: physical backups are not portable across major PostgreSQL versions and must be restored onto a compatible platform (same CPU architecture and the same PostgreSQL build options, such as block size, as the source).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 2 — Automate weekly restore drill to an isolated test instance&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Manual restore drills get deferred. An automated weekly drill that spins up a test instance, runs consistency checks, logs the RTO, and terminates the instance provides continuous validation without engineer attention. The pattern works for both self-managed PostgreSQL (via cron + pg_restore + psql checks) and Aurora (via AWS Lambda + EventBridge + the RDS API).&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;#!/bin/bash&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Shell skeleton for a self-managed weekly drill&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;set&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -euo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; pipefail&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BACKUP_FILE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/backups/latest.dump&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;TEST_HOST&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;test-restore.internal&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LOG_FILE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;/var/log/backup_validation/$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%Y%m%d).log&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;START&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_restore&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --host=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$TEST_HOST&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --dbname=restore_target&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$BACKUP_FILE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$LOG_FILE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; 2&gt;&amp;#x26;1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;END&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; +%s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ELAPSED&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$((&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;END&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; -&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; START&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;RTO measured: ${&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ELAPSED&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;}s&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$LOG_FILE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --host=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$TEST_HOST&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --dbname=restore_target&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;SELECT count(*) FROM pg_stat_user_tables;&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$LOG_FILE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;Validation complete: $(&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;date&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -u&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;)&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$LOG_FILE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use when: restore drills are happening less than monthly, or the team wants evidence of RTO measurements for compliance purposes.&lt;/p&gt;
&lt;p&gt;Risk: the test instance must be isolated from production network paths to avoid accidental writes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 3 — Add catalog verification to CI/CD for schema migrations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Schema migrations are a common way for a logical backup to become silently unrestorable: a migration drops and re-creates a constraint, a sequence, or a table in a way that the backup catalog does not reflect. Adding &lt;code&gt;pg_restore --list&lt;/code&gt; verification as a post-migration CI check confirms that the dump catalog matches the expected objects after every migration run.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# In CI pipeline, after migration:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_dump&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --format=custom&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --schema-only&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  --file=schema_backup.dump&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; \&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$DATABASE_URL&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_restore&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --list&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; schema_backup.dump&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; grep&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -E&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;TABLE|SEQUENCE|CONSTRAINT&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; |&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; sort&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /tmp/current_objects.txt&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Diff against expected objects baseline&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;diff&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /tmp/expected_objects.txt&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /tmp/current_objects.txt&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use when: the team runs frequent migrations and wants early warning before a corrupt backup reaches the weekly restore drill.&lt;/p&gt;
&lt;p&gt;Risk: schema-only catalog verification does not catch data integrity issues — it only confirms structural completeness.&lt;/p&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;The backup validation workflow is entirely read-only on production. All restore operations target isolated test instances. There is nothing to roll back from the validation process itself.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If Option 1 (physical backup) causes issues&lt;/strong&gt;: The original logical backup schedule is unchanged. Run both in parallel for one validation cycle before cutting over. Revert by disabling the &lt;code&gt;pg_basebackup&lt;/code&gt; cron job and monitoring the next scheduled logical backup.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If Option 2 (automated restore drill) causes unexpected resource usage&lt;/strong&gt;: The EventBridge or cron schedule can be disabled immediately. If a test instance was not terminated by the script, terminate it manually via &lt;code&gt;aws rds delete-db-instance --db-instance-identifier mydb-validation-YYYYMMDD --skip-final-snapshot&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If Option 3 (CI catalog check) produces false positives after a migration&lt;/strong&gt;: Regenerate the &lt;code&gt;expected_objects.txt&lt;/code&gt; baseline from the current schema and commit it. The diff will be clean on the next run.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
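&lt;p&gt;The most impactful automation for this runbook is a weekly restore drill that requires no engineer involvement. One AWS pattern for Aurora uses EventBridge to trigger a Lambda function once per week. The Lambda calls &lt;code&gt;restore-db-cluster-from-snapshot&lt;/code&gt; using the most recent available cluster snapshot, adds an instance to the restored cluster with &lt;code&gt;create-db-instance&lt;/code&gt;, polls the instance status until it reaches &lt;code&gt;available&lt;/code&gt;, runs row count checks via the RDS Data API or a temporary Lambda-to-RDS connection, logs the elapsed time and results to CloudWatch Logs, then calls &lt;code&gt;delete-db-instance&lt;/code&gt; and &lt;code&gt;delete-db-cluster&lt;/code&gt; to tear the test cluster down. For a non-Aurora RDS instance the equivalent restore call is &lt;code&gt;restore-db-instance-from-db-snapshot&lt;/code&gt;. Because a restore can outlast the 15-minute Lambda timeout, the polling step may need to be split across invocations.&lt;/p&gt;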
&lt;p&gt;For a 100 GB Aurora database, the AWS RDS pricing documentation indicates that snapshot restore charges apply at the storage rate for the duration the instance is running. A validation instance that runs for two hours per week at &lt;code&gt;db.t3.medium&lt;/code&gt; pricing (on-demand) costs approximately $0.34 per week at current us-east-1 rates — less than the cost of one engineer-hour spent on a manual drill. The actual cost depends on instance class, storage provisioned, and region.&lt;/p&gt;
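&lt;p&gt;The cost of a drill is simple arithmetic: instance hours plus the share of a month that the restored storage exists. The sketch below makes that calculation explicit so it can be rerun with current prices. Every rate passed in is a placeholder to be replaced with figures from the AWS pricing page, and the 730 hours-per-month constant is a billing approximation assumed here.&lt;/p&gt;

```python
HOURS_PER_MONTH = 730  # assumed approximation for prorating monthly storage rates


def weekly_drill_cost(
    instance_rate_per_hour: float,
    hours_per_drill: float,
    storage_gb: float,
    storage_rate_per_gb_month: float,
) -> float:
    """Estimate the cost of one drill: instance time plus prorated storage."""
    instance_cost = instance_rate_per_hour * hours_per_drill
    storage_cost = (
        storage_gb * storage_rate_per_gb_month * hours_per_drill / HOURS_PER_MONTH
    )
    return instance_cost + storage_cost
```

&lt;p&gt;With placeholder rates of 0.10 per instance hour and 0.10 per GB-month, a two-hour drill on 100 GB comes to roughly 0.23, most of it instance time. The point of the function is the shape of the calculation, not those numbers.&lt;/p&gt;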
&lt;p&gt;For self-managed PostgreSQL, a cron job or a systemd timer can trigger the shell skeleton from Option 2 (&lt;code&gt;pg_cron&lt;/code&gt; schedules SQL inside the database, so it is not the right tool for launching a shell script). The key instrumentation addition is writing the elapsed RTO and row count results to a table in a monitoring database so that trend data is available. A restore time that grows month over month as the database grows is a signal to revisit the backup type before it breaches SLA.&lt;/p&gt;
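&lt;p&gt;Once RTO measurements are stored, the trend check itself can be very small. The following sketch assumes one measurement per drill, oldest first, and uses the average change between drills as a crude linear projection. &lt;code&gt;drills_until_breach&lt;/code&gt; is a hypothetical name, and a real deployment might prefer a more robust fit.&lt;/p&gt;

```python
from typing import List, Optional


def drills_until_breach(rto_seconds: List[float], sla_seconds: float) -> Optional[float]:
    """Estimate how many more drills fit before measured RTO reaches the SLA.

    Returns 0.0 if the latest measurement already meets or exceeds the SLA,
    and None if there is no upward trend to project from.
    """
    if not rto_seconds:
        raise ValueError("at least one RTO measurement is required")
    latest = rto_seconds[-1]
    if latest >= sla_seconds:
        return 0.0
    if len(rto_seconds) == 1:
        return None
    # Average growth per drill between the first and latest measurement.
    growth = (latest - rto_seconds[0]) / (len(rto_seconds) - 1)
    if not growth > 0:
        return None  # flat or shrinking restore times
    return (sla_seconds - latest) / growth
```

&lt;p&gt;A small return value is the cue to open the ticket about physical backups or restore parallelism before the SLA is actually breached.&lt;/p&gt;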
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What broke&lt;/strong&gt;: Backup jobs were succeeding but restorability had never been tested, meaning the team’s documented RTO had no measured basis and recovery from a real incident would be slower and less certain than assumed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What was done&lt;/strong&gt;: A validation workflow was implemented that measures actual restore time, verifies data consistency post-restore, and tests point-in-time recovery on a documented schedule.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What prevents recurrence&lt;/strong&gt;: Automated weekly restore drills log measured RTO to a persistent store, and a CI catalog check flags schema migrations that would make a backup unrestorable before they reach production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Verify backup file integrity using &lt;code&gt;pg_restore --list&lt;/code&gt; (PostgreSQL), or confirm the latest snapshot shows status &lt;code&gt;available&lt;/code&gt; in &lt;code&gt;aws rds describe-db-cluster-snapshots&lt;/code&gt; (Aurora), and resolve any errors before proceeding&lt;/li&gt;
&lt;li&gt;Check backup age: confirm the most recent backup is within the expected retention window and frequency&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_archiver&lt;/code&gt; and confirm &lt;code&gt;failed_count&lt;/code&gt; is zero and &lt;code&gt;last_archived_time&lt;/code&gt; is recent&lt;/li&gt;
&lt;li&gt;Run a timed restore to an isolated test instance and record wall-clock start and end times as the measured RTO&lt;/li&gt;
&lt;li&gt;Compare measured RTO against documented SLA target — escalate if over threshold&lt;/li&gt;
&lt;li&gt;Run row counts on the top 20 tables by row count on the restored instance and compare to production baseline&lt;/li&gt;
&lt;li&gt;Verify sequence values are at or ahead of their respective table maximum id values&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;pg_constraint&lt;/code&gt; on the restored instance and confirm all expected foreign key constraints are present&lt;/li&gt;
&lt;li&gt;Run a PITR drill to a timestamp 1 hour before the current time — confirm the instance comes up and data at the target time is present&lt;/li&gt;
&lt;li&gt;Document the validation date, measured RTO, PITR result, and any anomalies in the validation log&lt;/li&gt;
&lt;li&gt;Set a calendar reminder or automate a trigger to repeat this cycle within 30 days&lt;/li&gt;
&lt;li&gt;If measured RTO exceeds SLA: open a ticket to evaluate physical backup method or restore parallelism before the next scheduled drill&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Backup jobs report success but the team has never measured actual restore time or verified data consistency — meaning the documented RTO is a guess and a real recovery event will be slower and less certain than expected.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Run a timed restore to an isolated test instance, verify row counts and foreign key constraints post-restore, and test PITR to a target timestamp — on a schedule that keeps the measurement fresh.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A logged RTO that fits inside the SLA target, verified by wall-clock start and end times from the last restore drill, plus a confirmed PITR result within the last 30 days.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;pg_restore --list backup.dump&lt;/code&gt; (or &lt;code&gt;aws rds describe-db-cluster-snapshots&lt;/code&gt; for Aurora) to verify your most recent backup is structurally intact, then schedule the first timed restore drill if one has not been run in the past 30 days.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>Logical Replication vs Physical Replication in PostgreSQL</title><link>https://rajivonai.com/blog/2023-05-08-logical-replication-vs-physical-replication/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-05-08-logical-replication-vs-physical-replication/</guid><description>Physical replication copies bytes; logical replication copies row changes — and confusing the two causes silent schema drift, sequence divergence, and failed zero-downtime upgrades.</description><pubDate>Mon, 08 May 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;PostgreSQL ships with two replication mechanisms that solve different problems, but they get confused often enough that teams use one where the other is required — and discover the difference during a failover.&lt;/strong&gt; Physical (streaming) replication is for high availability and read scaling. Logical replication is for selective data movement and zero-downtime major version upgrades. Using logical replication as a drop-in HA replacement leaves you with sequence values that have diverged, DDL changes that never arrived at the subscriber, and a schema state on the standby that does not match the primary.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most PostgreSQL deployments start with physical streaming replication. It works, it is simple to configure, and for HA purposes it does exactly what is needed: a replica that is continuously kept in sync and can be promoted in seconds if the primary fails.&lt;/p&gt;
&lt;p&gt;Logical replication was added in PostgreSQL 10 and extended significantly in each subsequent release. It has a specific purpose: moving a subset of data across PostgreSQL instances that may differ by major version, schema, or platform. The canonical use case is a zero-downtime major version upgrade: replicate from a PG14 primary to a PG15 target, validate, then cut traffic over to the target.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Teams encounter confusion when they try to use logical replication for HA or try to use physical replication for version upgrades.&lt;/p&gt;
&lt;p&gt;The failure mode that hurts: an engineer sets up logical replication from a PG13 primary to a PG14 standby as the HA plan, does no DDL synchronization, runs several migrations over six months, and then fails over. The standby runs, but queries immediately fail because the schema is months out of date.&lt;/p&gt;
&lt;p&gt;How do we safely distinguish these mechanisms and use the right one for the right operational constraint?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    subgraph Physical Replication&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    P1[Primary — PG14] --&gt;|Raw WAL Bytes| S1[Standby — PG14]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    S1 -.-&gt;|Exact Clone| R1[Read Only Query]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    end&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    subgraph Logical Replication&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    P2[Publisher — PG14] --&gt;|Decoded Row Changes| S2[Subscriber — PG15]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    S2 -.-&gt;|Writeable Target| R2[Zero Downtime Upgrade]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    end&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this diagram shows:&lt;/strong&gt; Physical replication sends raw WAL bytes to an exact binary copy of the primary that must run the same major PostgreSQL version and stays read-only. Logical replication decodes individual row changes and sends them to a subscriber that can run a different PostgreSQL version and accept writes — which is what enables zero-downtime major version upgrades.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Physical replication copies WAL byte-for-byte. The replica is a binary clone of the primary: same files, same transaction IDs, same system catalog. This means it requires the same PostgreSQL major version as the primary (minor version differences are allowed). It replicates everything — all databases, all tables, all sequences, system catalogs — because it is literally replaying the raw write-ahead log.&lt;/p&gt;
&lt;p&gt;Logical replication decodes WAL into row-level changes: INSERT, UPDATE, DELETE events per table. A publication on the primary defines which tables to send; a subscription on the target applies those changes. The target is a separate, writeable PostgreSQL instance — it can be a different major version, a different schema, or even a different Postgres fork.&lt;/p&gt;
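&lt;p&gt;A minimal sketch of that publication and subscription pair. The table names (&lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;customers&lt;/code&gt;), the &lt;code&gt;replicator&lt;/code&gt; role, and the connection string are hypothetical placeholders. It assumes the publisher runs with &lt;code&gt;wal_level = logical&lt;/code&gt; and that the subscriber’s tables already exist with a compatible schema:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the publisher (for example PG14): publish only the listed tables.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE PUBLICATION upgrade_pub FOR TABLE orders, customers;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the subscriber (for example PG15): connect and start the initial copy.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Host, database, and role below are placeholders, not real values.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE SUBSCRIPTION upgrade_sub&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CONNECTION &apos;host=pg14-primary.example.internal dbname=appdb user=replicator&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  PUBLICATION upgrade_pub;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only the tables named in the publication are replicated; anything else on the publisher never reaches the subscriber.&lt;/p&gt;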
&lt;p&gt;There are specific limitations of logical replication that dictate when it can be used:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DDL is not replicated.&lt;/strong&gt; Schema changes executed on the publisher, such as &lt;code&gt;ALTER TABLE ... ADD COLUMN&lt;/code&gt; or &lt;code&gt;CREATE INDEX&lt;/code&gt;, are not sent to the subscriber. The subscriber’s schema must be managed separately. A column added on the publisher will not exist on the subscriber, and the apply worker will stop with an error when it receives rows that include that column.&lt;/p&gt;
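&lt;p&gt;One way to keep an additive schema change from stalling replication is to apply it on the subscriber before the publisher, since a subscriber table may carry extra columns that the publisher does not send. A sketch, assuming a hypothetical &lt;code&gt;orders&lt;/code&gt; table and &lt;code&gt;notes&lt;/code&gt; column:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 1, on the subscriber: add the column first. Incoming rows leave it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- at its default (NULL here) until the publisher starts sending it.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ADD COLUMN notes text;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 2, on the publisher: add the same column once step 1 has succeeded.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ADD COLUMN notes text;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dropping a column would typically run in the opposite order: publisher first, then subscriber.&lt;/p&gt;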
&lt;p&gt;&lt;strong&gt;Sequences are not replicated.&lt;/strong&gt; Sequence state (the current counter) is not sent over logical replication. After promotion of a logical subscriber, all &lt;code&gt;SERIAL&lt;/code&gt; and &lt;code&gt;IDENTITY&lt;/code&gt; columns will restart from wherever the sequence was initialized on the subscriber — which may be far below the primary’s current value, causing primary key conflicts on first insert.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Large objects are excluded.&lt;/strong&gt; PostgreSQL logical replication does not support &lt;code&gt;pg_largeobject&lt;/code&gt; — any data stored via the large object interface is not sent.&lt;/p&gt;























































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Property&lt;/th&gt;&lt;th&gt;Physical Replication&lt;/th&gt;&lt;th&gt;Logical Replication&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;WAL content&lt;/td&gt;&lt;td&gt;Raw bytes, page-level&lt;/td&gt;&lt;td&gt;Decoded row changes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Version requirement&lt;/td&gt;&lt;td&gt;Same PG major version&lt;/td&gt;&lt;td&gt;Cross-major-version capable&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Scope&lt;/td&gt;&lt;td&gt;Entire cluster&lt;/td&gt;&lt;td&gt;Per-table, per-publication&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;DDL replicated&lt;/td&gt;&lt;td&gt;Yes (byte-for-byte)&lt;/td&gt;&lt;td&gt;No — must apply manually&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sequences replicated&lt;/td&gt;&lt;td&gt;Yes&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Large objects&lt;/td&gt;&lt;td&gt;Yes&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Subscriber writeable&lt;/td&gt;&lt;td&gt;No (hot standby read-only)&lt;/td&gt;&lt;td&gt;Yes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Primary use case&lt;/td&gt;&lt;td&gt;HA, read replicas&lt;/td&gt;&lt;td&gt;Version upgrades, selective sync&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failover time&lt;/td&gt;&lt;td&gt;Seconds (promote standby)&lt;/td&gt;&lt;td&gt;Minutes (manual schema validation needed)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s streaming replication documentation (postgresql.org/docs/current/warm-standby.html) describes physical replication’s behavior: the standby continuously applies WAL records and can be promoted instantly because it shares the same timeline and transaction state as the primary.&lt;/p&gt;
&lt;p&gt;PostgreSQL’s logical replication documentation (postgresql.org/docs/current/logical-replication.html) lists these limitations explicitly in its Restrictions section: the database schema and DDL commands are not replicated, sequence data is not replicated, and large objects are not replicated. It also notes that if a switchover or failover to the subscriber is intended, the sequences must be updated to the latest values, either by copying the current data from the publisher or by determining a sufficiently high value from the tables themselves.&lt;/p&gt;
&lt;p&gt;The same documentation describes the initial table sync for a new subscription as a snapshot copy of the current table contents. On large tables this can take hours, and changes made on the publisher during that window accumulate as lag until the copy finishes. Physical replication pays its startup cost in a different form: the standby is seeded from a base backup and then streams WAL from that point, with no per-table copy phase.&lt;/p&gt;
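&lt;p&gt;Progress of that initial sync can be checked on the subscriber through the &lt;code&gt;pg_subscription_rel&lt;/code&gt; catalog. A minimal sketch; the state codes in the comment are the ones used by recent PostgreSQL releases, and older versions may not report all of them:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the subscriber. srsubstate codes: i = initialize, d = data copy in&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- progress, f = copy finished, s = synchronized, r = ready (streaming).&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT srrelid::regclass AS table_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       srsubstate        AS sync_state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_subscription_rel&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY srsubstate, srrelid::regclass::text;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Until every table reports &lt;code&gt;r&lt;/code&gt;, the subscriber is not a complete copy and should not be treated as a cutover target.&lt;/p&gt;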
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;The limitations of logical replication create operational risk if used incorrectly:&lt;/p&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;DDL on publisher not applied to subscriber&lt;/td&gt;&lt;td&gt;Replication stream errors when row data includes columns not present in subscriber schema; apply worker stops&lt;/td&gt;&lt;td&gt;Logical replication does not decode or forward DDL; subscriber schema must be kept in sync manually&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sequence values diverge after failover&lt;/td&gt;&lt;td&gt;First INSERT after promotion generates IDs that conflict with rows that existed on the former primary&lt;/td&gt;&lt;td&gt;Subscriber sequences were never updated; they restart from initialization value, not primary’s current value&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Initial snapshot for large tables&lt;/td&gt;&lt;td&gt;Replication lag grows during the hours-long initial sync; the subscriber cannot be used as an HA target during this window&lt;/td&gt;&lt;td&gt;Logical replication’s initial sync is a table-level snapshot copy, not a streaming catchup&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;For a zero-downtime major version upgrade, the sequence problem is solved by advancing the subscriber’s sequences past the publisher’s current values before promotion. The Restrictions section of PostgreSQL’s logical replication documentation describes this requirement; in practice it is scripted with &lt;code&gt;setval()&lt;/code&gt; against each affected sequence immediately before the cutover.&lt;/p&gt;
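&lt;p&gt;A sketch of how the sequence step might be scripted, using the &lt;code&gt;pg_sequences&lt;/code&gt; view to generate one &lt;code&gt;setval()&lt;/code&gt; call per sequence. The margin of 1000 is an arbitrary assumption for illustration, not a recommended value:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the publisher once writes have stopped. Each output row is a&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- statement to execute on the subscriber before it takes traffic.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT format(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         &apos;SELECT setval(%L, %s, true);&apos;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         quote_ident(schemaname) || &apos;.&apos; || quote_ident(sequencename),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         last_value + 1000   -- assumed safety margin; tune or drop it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ) AS advance_stmt&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_sequences&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE last_value IS NOT NULL;  -- skip sequences that were never used&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated statements assume the sequences exist under the same names on the subscriber.&lt;/p&gt;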
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Teams treating logical replication as a drop-in HA mechanism get schema drift and sequence conflicts at promotion time — failover appears to succeed, then applications fail immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use physical streaming replication for HA; reserve logical replication for cross-version migration or selective data movement, and build explicit DDL sync and sequence advancement steps into the cutover runbook.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After a logical replication setup, run &lt;code&gt;SELECT table_name, column_name, data_type FROM information_schema.columns WHERE table_schema = &apos;public&apos; ORDER BY table_name, column_name&lt;/code&gt; on both publisher and subscriber and diff the results. Schema parity must be verified before any promotion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: If you have an existing logical replication setup intended for HA, audit it this week: list all DDL changes since the subscription was created and confirm each was applied on the subscriber.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category></item><item><title>Read Replicas Are Not Free Scale</title><link>https://rajivonai.com/blog/2023-04-17-read-replicas-are-not-free-scale/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-04-17-read-replicas-are-not-free-scale/</guid><description>Read replicas add read throughput but they do not reduce write load, do not eliminate replication lag, and silently serve stale data under write bursts — understanding those constraints before you add replicas is the decision engineers skip.</description><pubDate>Mon, 17 Apr 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Adding a read replica is often the first instinct when a database is under load — and it often makes things worse in ways that take weeks to surface.&lt;/strong&gt; Replicas do increase read throughput, but they do not reduce write pressure on the primary, do not guarantee consistent data, and the operational burden of managing lag, failover, and session consistency accumulates quietly until something breaks.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Read replicas are standard infrastructure in most relational deployments. AWS RDS, Aurora, Cloud SQL, and self-managed PostgreSQL and MySQL all support them. The pitch is straightforward: offload read traffic to replica nodes, keep the primary free for writes, scale horizontally without sharding.&lt;/p&gt;
&lt;p&gt;That pitch is accurate as far as it goes. The problem is what it leaves out.&lt;/p&gt;
&lt;p&gt;Engineers reach for replicas when they see high CPU or query latency on the primary. What this misses: replication is not free. Replicas consume resources on the primary for log shipping, introduce lag between writes and reads, and create an eventual-consistency model that most application code is not written to handle.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The silent failure mode: your application writes a record, then immediately reads it back, but the read lands on a replica that has not yet applied the write. No error is returned. The user sees stale data. This is the documented behavior of asynchronous replication — the bug is routing the read to a replica without accounting for the replication window.&lt;/p&gt;
&lt;p&gt;Under normal conditions, lag is milliseconds and rarely surfaces. Under a write burst — a batch import, a traffic spike, a schema migration — lag climbs to seconds or minutes. During that window, every read routed to a replica is potentially wrong.&lt;/p&gt;
&lt;p&gt;The core question: which reads are safe to serve from a replica, and how do you verify that the replica is current enough to answer them?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App[Application Client] --&gt;|1. Write Record| Primary[Primary Database Node]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Primary --&gt;|2. Ship WAL Asynchronously| Replica[Read Replica Node]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App --&gt;|3. Immediate Read| Replica&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Replica --&gt;|4. Returns Stale Data| App&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replication lag is the delay between a commit on the primary and that commit being visible on a replica. How large the window gets — and what you can do about it — depends on the model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL streaming replication&lt;/strong&gt; is asynchronous by default. The primary commits before the replica confirms receipt or apply. &lt;code&gt;pg_stat_replication&lt;/code&gt; exposes &lt;code&gt;write_lag&lt;/code&gt;, &lt;code&gt;flush_lag&lt;/code&gt;, and &lt;code&gt;replay_lag&lt;/code&gt;. Under write load, replay lag dominates; the WAL apply process is fundamentally single-threaded for physical streaming replication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MySQL replication&lt;/strong&gt; is asynchronous by default and offers two stricter options. Semi-synchronous replication makes the source wait until at least one replica acknowledges receipt of the transaction, but not its apply, so lag persists at the relay log. Group Replication blocks the commit until a majority of the group has received and certified the transaction, which reduces read lag at the cost of write latency (MySQL 8.0 Reference Manual, Group Replication).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; uses shared distributed storage rather than WAL shipping, so replicas observe page mutations directly. AWS documentation cites typical lag below 10 ms. Faster than streaming replication, but the session consistency problem remains: reads routed to the Aurora reader endpoint immediately after a write can still miss it.&lt;/p&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Replication model&lt;/th&gt;&lt;th&gt;Lag driver&lt;/th&gt;&lt;th&gt;Session consistency risk&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL streaming (async)&lt;/td&gt;&lt;td&gt;WAL ship and replay&lt;/td&gt;&lt;td&gt;Yes — read can land before write applies&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MySQL semi-synchronous&lt;/td&gt;&lt;td&gt;Binlog receipt confirmed; apply async&lt;/td&gt;&lt;td&gt;Yes — same apply lag pattern&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MySQL Group Replication (sync)&lt;/td&gt;&lt;td&gt;Commit blocked until majority confirms receipt&lt;/td&gt;&lt;td&gt;Reduced but not eliminated&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Aurora read replicas&lt;/td&gt;&lt;td&gt;Storage page propagation — sub-10 ms&lt;/td&gt;&lt;td&gt;Yes — writer endpoint required for read-after-write&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s &lt;code&gt;pg_stat_replication.replay_lag&lt;/code&gt; can grow unbounded under write load — including during heavy &lt;code&gt;COPY&lt;/code&gt; operations — because the WAL apply process cannot keep pace with the primary (PostgreSQL documentation, “Monitoring Replication”). The application has no visibility into this metric unless explicitly instrumented.&lt;/p&gt;
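&lt;p&gt;A minimal sketch of the instrumentation this calls for: a query an exporter or health check could run on the primary. The &lt;code&gt;replay_bytes_behind&lt;/code&gt; alias is a name chosen here for illustration:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the primary: one row per connected standby.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       write_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       flush_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       replay_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_bytes_behind&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The three lag columns are intervals and can be NULL when a standby is idle and fully caught up, so the byte distance is a useful second signal.&lt;/p&gt;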
&lt;p&gt;AWS documentation on Aurora Replicas explicitly recommends the writer endpoint for read-after-write consistency. Even sub-10 ms storage propagation creates a window where the reader endpoint can miss the most recent write. The shared storage architecture changes the lag mechanism but not the session consistency constraint.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Write burst&lt;/td&gt;&lt;td&gt;Reads return stale data silently&lt;/td&gt;&lt;td&gt;Replica apply process falls behind; no error surfaces to the client&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica promotion during failover&lt;/td&gt;&lt;td&gt;Writes fail for 30–120 seconds in streaming replication setups&lt;/td&gt;&lt;td&gt;The primary’s failure must be confirmed, a replica promoted, DNS or the proxy updated, and applications reconnected&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Session consistency violation&lt;/td&gt;&lt;td&gt;User writes then immediately reads stale data&lt;/td&gt;&lt;td&gt;Connection pooler routes the read to a replica before replication applies the write&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Routing reads to replicas without accounting for lag means applications silently return wrong answers during write bursts — no error, just stale data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Classify reads by consistency requirement before routing. Reads that must see the latest write go to the primary; reads that tolerate bounded staleness go to replicas, with lag monitored against that bound.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Query &lt;code&gt;pg_stat_replication.replay_lag&lt;/code&gt; on the primary (or &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; in MySQL) during a write spike. If it exceeds your application’s staleness tolerance, replica routing is already producing silent correctness errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Audit your connection pooler or load balancer this week to confirm which queries reach replicas, then add a lag threshold alert — reject or redirect replica reads when lag exceeds your application’s tolerance.&lt;/li&gt;
&lt;/ul&gt;
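&lt;p&gt;For PostgreSQL, one way to decide whether a replica is current enough for a read-after-write is to compare WAL positions. This is a sketch that assumes the application can carry an LSN from the write to the later read; the literal LSN shown is a placeholder, not real output:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the primary, right after the write commits: remember this position.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_current_wal_lsn() AS write_lsn;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the replica, before serving the read: substitute the captured value&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- for the placeholder LSN below.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_last_wal_replay_lsn() &gt;= &apos;0/3000060&apos;::pg_lsn AS replica_has_write;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;replica_has_write&lt;/code&gt; is false, the read goes to the primary instead.&lt;/p&gt;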
&lt;p&gt;The cost of replicas shows up in consistency, failover latency, and operational complexity — not on a throughput graph. That mismatch is why replica failures are hard to catch until they surface as user-visible data errors.&lt;/p&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category></item><item><title>PostgreSQL Connection Storm Runbook</title><link>https://rajivonai.com/blog/2023-04-03-postgresql-connection-storm-runbook/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-04-03-postgresql-connection-storm-runbook/</guid><description>Diagnosing and resolving connection exhaustion in PostgreSQL: too many clients, idle-in-transaction accumulation, and the case for connection pooling.</description><pubDate>Mon, 03 Apr 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;“Sorry, too many clients already” means PostgreSQL has rejected a connection before your application could run a single query.&lt;/strong&gt; Every connection to PostgreSQL is a forked OS process consuming memory — typically 5–10 MB of RAM per connection — so &lt;code&gt;max_connections&lt;/code&gt; is a hard ceiling that cannot be stretched without consequences. Once you hit it, the failure mode is not graceful degradation; it is hard rejection of new connections until existing ones close.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s process-per-connection architecture dates to a period when connection counts were measured in dozens, not thousands. Each connection forks a backend process, inherits a memory allocation, and holds that allocation for the duration of the connection regardless of whether a query is running. At 200 connections, this overhead is manageable. At 1,000 connections, PostgreSQL is spending more memory serving idle backends than it is serving active queries.&lt;/p&gt;
&lt;p&gt;The default &lt;code&gt;max_connections = 100&lt;/code&gt; reflects this constraint; it is not a conservative setting that exists to be raised. Raising &lt;code&gt;max_connections&lt;/code&gt; makes PostgreSQL reserve more shared memory and lock-table space at startup, and the memory overhead of idle connections is real and measurable.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Connection storms occur in three patterns: application connection leaks (connections opened and never closed), pool exhaustion from too many services competing for the same pool, and deployments that spin up new application instances without shutting down old ones cleanly. The &lt;code&gt;idle in transaction&lt;/code&gt; state is particularly damaging because those connections are holding transactions open, which keeps vacuum from removing dead rows and holds back the transaction ID horizon.&lt;/p&gt;
&lt;p&gt;Without a centralized connection multiplexer, every new microservice or horizontal pod autoscaling event directly multiplies the active TCP connections to the database host. Eventually, the database runs out of available connection slots or OS memory, triggering catastrophic connection rejection. How do you scale application instances without proportionally scaling database connection overhead?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The structural solution is to decouple application connection counts from PostgreSQL process counts using connection pooling, specifically PgBouncer in transaction mode, while implementing aggressive server-side transaction timeouts to prevent zombie state accumulation.&lt;/p&gt;
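&lt;p&gt;The server-side timeout half of that approach can be sketched with &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt;. The &lt;code&gt;app_user&lt;/code&gt; role and the 60-second value are assumptions for illustration; the right threshold depends on the longest legitimate transaction in the workload:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Terminate sessions that sit idle inside an open transaction.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Per-role setting; applies to new sessions opened by that role.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE app_user SET idle_in_transaction_session_timeout = &apos;60s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Alternative: set it cluster-wide, then reload the configuration.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER SYSTEM SET idle_in_transaction_session_timeout = &apos;60s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scoping the timeout to application roles leaves administrative sessions unaffected.&lt;/p&gt;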
&lt;h3 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h3&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Application errors: “sorry, too many clients already”&lt;/td&gt;&lt;td&gt;Application logs&lt;/td&gt;&lt;td&gt;&lt;code&gt;max_connections&lt;/code&gt; ceiling hit — no new connections possible&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;count(*)&lt;/code&gt; near &lt;code&gt;max_connections&lt;/code&gt; value&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Connection headroom nearly exhausted&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High count of &lt;code&gt;idle in transaction&lt;/code&gt; state&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Connections holding open transactions, blocking vacuum&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;One client IP with &gt; 50 connections&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt; grouped by &lt;code&gt;client_addr&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Connection leak on a specific application server&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No PgBouncer or pgpool in the stack&lt;/td&gt;&lt;td&gt;Infrastructure review&lt;/td&gt;&lt;td&gt;Direct connection architecture that cannot scale safely&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Memory pressure on the PostgreSQL host&lt;/td&gt;&lt;td&gt;OS metrics&lt;/td&gt;&lt;td&gt;Each idle connection consuming 5–10 MB RAM&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h3 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Count connections by state&lt;/strong&gt; — get the distribution of active, idle, and idle-in-transaction connections:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; connection_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  max&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; state_change) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; oldest_in_state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;!=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_backend_pid()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; connection_count &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;High &lt;code&gt;idle&lt;/code&gt; counts mean connections are staying open without doing work — a pooling problem. High &lt;code&gt;idle in transaction&lt;/code&gt; counts mean applications are opening transactions and not committing or rolling back — a connection leak or long-running operation pattern.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check the connection ceiling&lt;/strong&gt; — confirm max_connections and how close you are:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW max_connections;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; total_connections,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; setting::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;int&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_settings &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; name&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;max_connections&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; max_connections,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; setting::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;int&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_settings &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; name&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;max_connections&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pct_used&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Anything above 80% of &lt;code&gt;max_connections&lt;/code&gt; is operational risk. At 90%, connection failures are likely during traffic spikes. PostgreSQL reserves a small number of connections for superusers via &lt;code&gt;superuser_reserved_connections&lt;/code&gt; (default 3), so regular users lose access before the absolute ceiling.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Count idle-in-transaction connections&lt;/strong&gt; — these are the most damaging:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idle_in_txn_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  max&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_start) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; oldest_open_txn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle in transaction&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any &lt;code&gt;oldest_open_txn&lt;/code&gt; value above 5 minutes should be treated as an incident. These connections are holding their transaction’s snapshot, preventing vacuum from advancing the horizon, and consuming a process slot doing nothing.&lt;/p&gt;
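&lt;p&gt;The count alone does not say which sessions to chase. A minimal sketch that lists the offenders, assuming the same 5-minute threshold as above; the pid in the commented-out line is a placeholder, not a real value:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sketch: sessions idle in a transaction for more than 5 minutes, oldest first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  client_addr,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - xact_start AS txn_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - state_change AS idle_for,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  left(query, 80) AS last_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE state = &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND now() - xact_start &gt; interval &apos;5 minutes&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY xact_start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Optional and destructive: terminate one offender after review.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- 12345 is a placeholder pid taken from the result above.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- SELECT pg_terminate_backend(12345);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;last_query&lt;/code&gt; column shows the last statement the session ran before going idle, which is usually enough to trace the transaction back to a code path.&lt;/p&gt;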
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Connection distribution by client address&lt;/strong&gt; — identify connection hogs:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  client_addr,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; connections,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  sum&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CASE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WHEN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle in transaction&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; THEN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ELSE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; END&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idle_in_txn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;!=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_backend_pid()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; client_addr, usename&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; connections &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A single application server holding 80 connections to PostgreSQL while a second server holds 2 is a strong signal of either a connection leak or misconfigured pool sizing on the first server.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check for a connection pooler&lt;/strong&gt; — if there is no PgBouncer or pgpool in front of PostgreSQL, that is the fix:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Check whether PgBouncer is running on the standard port&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;nc&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -z&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; localhost&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 6432&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; &amp;#x26;&amp;#x26; &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;PgBouncer present&quot;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ||&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; echo&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;No pooler on 6432&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Or check from the PostgreSQL side. Poolers do not always set application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# so an empty result is not conclusive. Add your usual psql connection flags.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;psql&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -X&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -c&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;SELECT client_addr, application_name, count(*)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;WHERE application_name ILIKE &apos;%pgbouncer%&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;   OR application_name ILIKE &apos;%pgpool%&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;GROUP BY client_addr, application_name;&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If no pooler is present and connection counts are near the ceiling, adding PgBouncer in transaction mode is the fastest structural fix available. The other remediation options relieve pressure, but without pooling the same exhaustion tends to recur under load.&lt;/p&gt;
&lt;h3 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h3&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Connections near max_connections] --&gt; B{idle in transaction count high?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| C[Set idle_in_transaction_session_timeout]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| D{idle connection count high?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| E{Pooler in front of Postgres?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|no| F[Add PgBouncer in transaction mode]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| G{Pool sized correctly?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|no| H[Reduce pool_size per service]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|yes| I{One client addr dominant?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|yes| J[Investigate connection leak on that host]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|no| K[Too many services — reduce direct connections]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no| L{Connection rate spiking?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| M[Check deploy — new instances not closing old]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| N[Increase max_connections as last resort]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Add PgBouncer in transaction mode (fastest structural fix)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PgBouncer in transaction mode multiplexes many application connections onto a small number of PostgreSQL backend processes. A typical configuration allows 1,000 application connections to share 20 PostgreSQL connections if the average transaction is short.&lt;/p&gt;
&lt;p&gt;Install and configure PgBouncer with a minimal &lt;code&gt;pgbouncer.ini&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;ini&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;[databases]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;mydb&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;host&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;=127.0.0.1 &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;port&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;=5432 &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;dbname&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;=mydb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;[pgbouncer]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;listen_addr&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 0.0.0.0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;listen_port&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 6432&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;pool_mode&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = transaction&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;max_client_conn&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 1000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;default_pool_size&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 20&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;server_idle_timeout&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 600&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;log_connections&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;log_disconnections&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Application changes: point connection strings to PgBouncer’s port (6432) instead of PostgreSQL’s port (5432). This is the only change required at the application layer.&lt;/p&gt;
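&lt;p&gt;One way to confirm that traffic is actually flowing through the pooler is PgBouncer’s admin console. A minimal sketch, assuming you connect with &lt;code&gt;psql&lt;/code&gt; to the virtual &lt;code&gt;pgbouncer&lt;/code&gt; database on port 6432 as a user listed in &lt;code&gt;admin_users&lt;/code&gt; or &lt;code&gt;stats_users&lt;/code&gt; (neither setting appears in the minimal config above, so that user is an assumption here):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW POOLS;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW DATABASES;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;code&gt;SHOW POOLS&lt;/code&gt; output, &lt;code&gt;cl_active&lt;/code&gt; and &lt;code&gt;cl_waiting&lt;/code&gt; count client connections, while &lt;code&gt;sv_active&lt;/code&gt; and &lt;code&gt;sv_idle&lt;/code&gt; count PostgreSQL backends. A &lt;code&gt;cl_waiting&lt;/code&gt; value that stays above zero suggests the pool is too small for the workload.&lt;/p&gt;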
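&lt;p&gt;Transaction mode comes with constraints described in the PgBouncer documentation. The one that most often surprises applications is prepared statements: a statement prepared on one backend is not guaranteed to exist on the backend that serves the next transaction. PgBouncer 1.21 and later can track protocol-level prepared statements in transaction mode when &lt;code&gt;max_prepared_statements&lt;/code&gt; is set to a non-zero value. SQL-level &lt;code&gt;PREPARE&lt;/code&gt; is not tracked, so applications that rely on it must be moved to session mode.&lt;/p&gt;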
&lt;p&gt;&lt;strong&gt;Option 2 — Set idle_in_transaction_session_timeout&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For immediate relief from accumulated &lt;code&gt;idle in transaction&lt;/code&gt; connections, set a server-side timeout:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Immediate change, no restart required&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SYSTEM&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idle_in_transaction_session_timeout &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;5min&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify it took effect&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW idle_in_transaction_session_timeout;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After reload, any session that stays in &lt;code&gt;idle in transaction&lt;/code&gt; state for more than 5 minutes will be automatically terminated by PostgreSQL. The application will see a connection error and must handle reconnection.&lt;/p&gt;
&lt;p&gt;This parameter was added in PostgreSQL 9.6. It does not affect sessions with actively running queries — only sessions that have an open transaction but are not executing SQL.&lt;/p&gt;
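&lt;p&gt;Some workloads hold transactions open on purpose. A hedged sketch of a per-role override that keeps the 5-minute default for everyone else; &lt;code&gt;batch_worker&lt;/code&gt; is a hypothetical role name and the 30-minute value is only illustrative:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Applies to new sessions of this role; batch_worker is a hypothetical role&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE batch_worker SET idle_in_transaction_session_timeout = &apos;30min&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- List the role-level and database-level overrides currently stored&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT r.rolname, d.datname, s.setconfig&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_db_role_setting s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LEFT JOIN pg_roles r ON r.oid = s.setrole&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LEFT JOIN pg_database d ON d.oid = s.setdatabase;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A role-level setting takes precedence over the value written by &lt;code&gt;ALTER SYSTEM&lt;/code&gt;, so the exception stays scoped to the one workload that needs it.&lt;/p&gt;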
&lt;p&gt;&lt;strong&gt;Option 3 — Increase max_connections (last resort)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Increasing &lt;code&gt;max_connections&lt;/code&gt; requires a PostgreSQL restart and should be paired with a memory check, because every additional backend can allocate its own &lt;code&gt;work_mem&lt;/code&gt; and process memory:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Edit postgresql.conf&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;max_connections&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 200&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# shared_buffers is sized from available RAM, not per connection (2GB is only an example);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# the per-connection cost comes from work_mem and backend process memory&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;shared_buffers&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; 2GB&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Restart required&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pg_ctl&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; restart&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -D&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/lib/postgresql/data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the last resort because it treats the symptom — not enough connection slots — without addressing the underlying cause, which is direct connections rather than pooled connections. Each additional connection slot adds OS process overhead. The PostgreSQL wiki notes that raising &lt;code&gt;max_connections&lt;/code&gt; above 200 without a pooler in front rarely solves connection exhaustion; it only defers it.&lt;/p&gt;
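&lt;p&gt;Before raising the ceiling, it helps to put a number on the memory exposure. The query below is a rough sketch, assuming one &lt;code&gt;work_mem&lt;/code&gt; allocation per connection slot; real usage can be lower (idle sessions) or higher (several sort or hash nodes in one query):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rough estimate only: max_connections multiplied by work_mem&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  current_setting(&apos;max_connections&apos;)::int AS max_connections,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  current_setting(&apos;work_mem&apos;) AS work_mem,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    current_setting(&apos;max_connections&apos;)::bigint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    * pg_size_bytes(current_setting(&apos;work_mem&apos;))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ) AS work_mem_if_every_slot_sorts_once;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If that figure plus &lt;code&gt;shared_buffers&lt;/code&gt; approaches the host’s RAM, the new ceiling is a latent out-of-memory risk rather than extra capacity.&lt;/p&gt;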
&lt;h3 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt;: Revert immediately with &lt;code&gt;ALTER SYSTEM SET idle_in_transaction_session_timeout = 0; SELECT pg_reload_conf();&lt;/code&gt; — zero disables the timeout. No restart required.&lt;/li&gt;
&lt;li&gt;PgBouncer addition: PgBouncer is a proxy; removing it means pointing application connection strings back to the direct PostgreSQL port. No PostgreSQL changes are needed. PgBouncer itself can be stopped or removed at any time.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_connections&lt;/code&gt; increase: Decreasing &lt;code&gt;max_connections&lt;/code&gt; requires a restart. Before decreasing, verify that active connections at the new lower limit will not be rejected. Query &lt;code&gt;SELECT count(*) FROM pg_stat_activity&lt;/code&gt; first to confirm actual utilization.&lt;/li&gt;
&lt;/ol&gt;
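&lt;p&gt;The utilization check in the last rollback step can be made more precise. A minimal sketch, assuming PostgreSQL 10 or later (for &lt;code&gt;backend_type&lt;/code&gt;); the value 100 is a placeholder for the proposed lower limit:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sketch: compare current client sessions with a proposed max_connections of 100&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  count(*) AS client_sessions,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  100 - current_setting(&apos;superuser_reserved_connections&apos;)::int AS usable_slots_at_new_limit,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  count(*) &gt; 100 - current_setting(&apos;superuser_reserved_connections&apos;)::int AS would_overflow&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE backend_type = &apos;client backend&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a point-in-time sample, so it is worth running at peak traffic rather than during a quiet window.&lt;/p&gt;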
&lt;h3 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h3&gt;
&lt;p&gt;A Prometheus alert on &lt;code&gt;pg_stat_activity_count&lt;/code&gt; by state is the standard monitoring approach. If you do not have Prometheus, this pg_cron query captures connection utilization hourly for capacity planning:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; cron&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schedule&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;connection-capacity-log&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;0 * * * *&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; ops&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;connection_log&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (ts, total, idle, idle_in_txn, active, max_conn)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FILTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FILTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle in transaction&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FILTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;active&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; setting::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;int&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_settings &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; name&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;max_connections&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;!=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_backend_pid();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$$);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
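&lt;p&gt;The scheduled job writes to &lt;code&gt;ops.connection_log&lt;/code&gt;, which has to exist first. The definition below is a hypothetical sketch that simply mirrors the column list in the &lt;code&gt;INSERT&lt;/code&gt;; the types and the index name are assumptions:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table matching the INSERT column list&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE SCHEMA IF NOT EXISTS ops;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE IF NOT EXISTS ops.connection_log (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ts          timestamptz NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  total       integer NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  idle        integer NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  idle_in_txn integer NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  active      integer NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  max_conn    integer NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Time-range queries over the log benefit from an index on ts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX IF NOT EXISTS connection_log_ts_idx ON ops.connection_log (ts);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;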
&lt;p&gt;Alert thresholds worth setting: &lt;code&gt;total &gt; 0.8 * max_connections&lt;/code&gt; for capacity warning, &lt;code&gt;idle_in_txn &gt; 10&lt;/code&gt; for transaction hygiene alert, &lt;code&gt;idle_in_txn&lt;/code&gt; with &lt;code&gt;age &gt; 5 minutes&lt;/code&gt; for immediate escalation.&lt;/p&gt;
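&lt;p&gt;Without an alerting stack, the first two thresholds can be checked directly against the hourly log. A minimal sketch, assuming &lt;code&gt;ops.connection_log&lt;/code&gt; has been populated by the job above; the 24-hour window is arbitrary:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sketch: hourly samples from the last 24 hours that crossed a threshold&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ts,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  total,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  max_conn,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  round(100.0 * total / max_conn, 1) AS pct_used,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  idle_in_txn&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM ops.connection_log&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE ts &gt; now() - interval &apos;24 hours&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND (total &gt; 0.8 * max_conn OR idle_in_txn &gt; 10)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY ts DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The log does not record transaction age, so the 5-minute escalation threshold still has to be evaluated against live &lt;code&gt;pg_stat_activity&lt;/code&gt;.&lt;/p&gt;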
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The PgBouncer documentation describes transaction mode as suitable for any application that does not rely on session state surviving across transactions: session-level advisory locks, session-level &lt;code&gt;SET&lt;/code&gt;, &lt;code&gt;LISTEN&lt;/code&gt;, SQL-level prepared statements, &lt;code&gt;WITH HOLD&lt;/code&gt; cursors, and temporary tables that outlive a transaction. Transaction-scoped features such as &lt;code&gt;SET LOCAL&lt;/code&gt; are unaffected. For applications that do depend on session state, session mode provides pooling with fewer constraints but with lower connection multiplexing ratios.&lt;/p&gt;
&lt;p&gt;The documented pattern from the PostgreSQL documentation on &lt;code&gt;max_connections&lt;/code&gt; is that each additional connection adds approximately &lt;code&gt;400 bytes&lt;/code&gt; of shared memory overhead, plus the per-process allocation (typically 5–10 MB). The PostgreSQL wiki explicitly recommends that databases serving more than a few hundred concurrent application connections place a pooler in front rather than raising &lt;code&gt;max_connections&lt;/code&gt; beyond 200.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PgBouncer transaction mode breaks application&lt;/td&gt;&lt;td&gt;Application uses SQL-level &lt;code&gt;PREPARE&lt;/code&gt;, session-level &lt;code&gt;SET&lt;/code&gt;, or other session state across transactions&lt;/td&gt;&lt;td&gt;Switch those pools to session mode; for protocol-level prepared statements, set &lt;code&gt;max_prepared_statements&lt;/code&gt; (PgBouncer 1.21+)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; causes unexpected rollbacks&lt;/td&gt;&lt;td&gt;Application holds open transactions intentionally for long operations&lt;/td&gt;&lt;td&gt;Increase the timeout for those connections, or refactor to commit-per-batch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Increasing &lt;code&gt;max_connections&lt;/code&gt; causes OOM&lt;/td&gt;&lt;td&gt;New connection ceiling consumes available RAM&lt;/td&gt;&lt;td&gt;Reduce &lt;code&gt;max_connections&lt;/code&gt; and add PgBouncer instead&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PgBouncer pool exhausted under burst load&lt;/td&gt;&lt;td&gt;&lt;code&gt;default_pool_size&lt;/code&gt; too small for concurrent query volume&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;default_pool_size&lt;/code&gt;; add read replicas for read traffic&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Application does not retry on connection error&lt;/td&gt;&lt;td&gt;&lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; terminates the session and the app crashes&lt;/td&gt;&lt;td&gt;Add connection retry logic with exponential backoff&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: PostgreSQL rejects connections hard when &lt;code&gt;max_connections&lt;/code&gt; is exhausted — no graceful degradation, just immediate errors for every new connection attempt.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Add PgBouncer in transaction mode between applications and PostgreSQL to multiplex application connections onto a small pool of PostgreSQL backends, and set &lt;code&gt;idle_in_transaction_session_timeout = &apos;5min&apos;&lt;/code&gt; to prevent zombie transactions from consuming connection slots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After adding PgBouncer, &lt;code&gt;SELECT count(*) FROM pg_stat_activity&lt;/code&gt; on the PostgreSQL side should show a small, stable number of client backends (at most &lt;code&gt;default_pool_size&lt;/code&gt; per database/user pair, plus any connections that bypass the pooler) regardless of how many application-side connections exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run the connection-by-state query from Check 1 against your production database today. If &lt;code&gt;idle in transaction&lt;/code&gt; count exceeds 5, set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; immediately — it requires only a config reload, not a restart.&lt;/li&gt;
&lt;/ul&gt;
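&lt;p&gt;The proof step is easier to read when backends are grouped the same way PgBouncer groups pools. A minimal sketch, assuming PostgreSQL 10 or later for &lt;code&gt;backend_type&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Sketch: backend connections per database and user, as seen by PostgreSQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  datname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  count(*) AS backends,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  count(*) FILTER (WHERE state = &apos;active&apos;) AS active&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE backend_type = &apos;client backend&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY datname, usename&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY backends DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the pooler in place, each row should stay at or below the pool size configured for that database and user, even while application-side connection counts climb.&lt;/p&gt;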
&lt;h3 id=&quot;checklist&quot;&gt;Checklist&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_activity&lt;/code&gt; grouped by state to see total, idle, idle-in-transaction, and active counts&lt;/li&gt;
&lt;li&gt;Compare total connections to &lt;code&gt;max_connections&lt;/code&gt; — flag if &gt; 80% used&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;idle in transaction&lt;/code&gt; count and age of oldest open transaction&lt;/li&gt;
&lt;li&gt;Group connections by &lt;code&gt;client_addr&lt;/code&gt; to identify any single-host leak&lt;/li&gt;
&lt;li&gt;Confirm whether PgBouncer or pgpool is present and accepting connections&lt;/li&gt;
&lt;li&gt;If no pooler: install PgBouncer in transaction mode before the next traffic event&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;idle_in_transaction_session_timeout = &apos;5min&apos;&lt;/code&gt; and reload config&lt;/li&gt;
&lt;li&gt;Verify &lt;code&gt;pool_mode&lt;/code&gt; in PgBouncer config is &lt;code&gt;transaction&lt;/code&gt; for OLTP workloads&lt;/li&gt;
&lt;li&gt;Confirm application handles connection errors with retry logic&lt;/li&gt;
&lt;li&gt;Review &lt;code&gt;max_connections&lt;/code&gt; setting — resist raising it without adding a pooler&lt;/li&gt;
&lt;li&gt;Add a monitoring alert at 80% of &lt;code&gt;max_connections&lt;/code&gt; utilization&lt;/li&gt;
&lt;li&gt;Log connection counts hourly to build a capacity baseline for the next 30 days&lt;/li&gt;
&lt;/ol&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>MongoDB WiredTiger Cache: Practical Basics</title><link>https://rajivonai.com/blog/2023-03-13-mongodb-wiredtiger-cache-practical-basics/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-03-13-mongodb-wiredtiger-cache-practical-basics/</guid><description>WiredTiger&apos;s internal cache is MongoDB&apos;s primary memory tier — how to read its metrics, recognize eviction pressure, and size it correctly for your working set.</description><pubDate>Mon, 13 Mar 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;MongoDB’s WiredTiger storage engine maintains its own internal cache independent of the OS page cache, and when that cache fills beyond capacity, eviction pressure causes reads to go to disk — a transition that happens silently until IOPS spike and ops/sec drops.&lt;/strong&gt; The default cache size is 50% of available RAM minus 1 GB, but the uncompressed nature of the cache means a dataset that looks modest on disk can consume several times more memory once loaded into WiredTiger.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;WiredTiger has been MongoDB’s default storage engine since version 3.2. It stores data compressed on disk but decompresses pages into the internal cache when they are loaded for reads or writes. A collection that occupies 10 GB on disk with snappy compression might occupy 25–35 GB in the WiredTiger cache, because the cache holds the uncompressed representation.&lt;/p&gt;
&lt;p&gt;Engineers managing MongoDB capacity frequently size hardware based on disk footprint or compressed data size. That works until the working set exceeds the uncompressed cache size, at which point WiredTiger begins evicting pages to make room for new reads — and those evicted pages, when needed again, require disk reads.&lt;/p&gt;
&lt;p&gt;The OS page cache sits below WiredTiger and caches the compressed on-disk representation. MongoDB uses both layers, but WiredTiger’s internal cache governs how much uncompressed working set fits in memory. The distinction matters when diagnosing whether a performance problem is a WiredTiger cache miss or an OS-level page cache miss.&lt;/p&gt;
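&lt;p&gt;To make the disk-versus-cache gap concrete, the expansion ratio can be estimated from collection statistics. The following is a minimal sketch, not a sizing tool. It assumes an object shaped like the output of &lt;code&gt;db.collection.stats()&lt;/code&gt;, where &lt;code&gt;size&lt;/code&gt; is the uncompressed data size and &lt;code&gt;storageSize&lt;/code&gt; is the space allocated on disk. The function name and the sample numbers are hypothetical, and neither figure includes indexes.&lt;/p&gt;

```javascript
// Sketch: estimate how much larger a collection is in cache than on disk.
// Assumes a stats object shaped like db.collection.stats() output:
//   size        = uncompressed data size in bytes
//   storageSize = bytes allocated on disk (after block compression)
function compressionRatio(stats) {
  if (!(stats.storageSize > 0)) {
    throw new Error("storageSize must be a positive number");
  }
  return stats.size / stats.storageSize;
}

// Hypothetical numbers: 10 GiB on disk, 30 GiB uncompressed.
const GiB = 1024 ** 3;
const sample = { size: 30 * GiB, storageSize: 10 * GiB };
console.log(
  "In-cache size is about",
  compressionRatio(sample).toFixed(1),
  "times the on-disk size"
);
```

&lt;p&gt;In mongosh, the same function could be given the stats of a real collection. Because &lt;code&gt;storageSize&lt;/code&gt; also counts free space inside the data file, treat the result as a rough lower bound on the expansion, not an exact figure.&lt;/p&gt;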
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;WiredTiger eviction normally runs in background threads that try to hold cache occupancy near a target (&lt;code&gt;eviction_target&lt;/code&gt;, 80% of cache size by default). When reads and writes fill the cache faster than those threads can drain it and occupancy reaches the trigger (&lt;code&gt;eviction_trigger&lt;/code&gt;, 95% by default), application threads are drafted into foreground eviction: they pause to evict pages before completing their own operations. This is the condition that turns a slow cache miss into a stalled application thread.&lt;/p&gt;
&lt;p&gt;The failure mode on Atlas and self-managed deployments looks similar: read throughput drops, latency climbs, and CloudWatch or Atlas metrics show disk IOPS climbing while CPU stays flat. The usual first diagnosis is a missing index: add one, and IOPS should fall. When the working set exceeds the cache, IOPS do not fall, because the index pages themselves no longer fit in cache.&lt;/p&gt;
&lt;p&gt;The core question: is the WiredTiger cache sized for your actual uncompressed working set, and is eviction pressure currently active?&lt;/p&gt;
&lt;h2 id=&quot;how-wiredtiger-cache-works&quot;&gt;How WiredTiger Cache Works&lt;/h2&gt;
&lt;p&gt;WiredTiger cache metrics are accessible through &lt;code&gt;db.serverStatus()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;serverStatus&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().wiredTiger.cache&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key fields to examine:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Field&lt;/th&gt;&lt;th&gt;What it measures&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;bytes currently in the cache&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Current uncompressed bytes in cache&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;maximum bytes configured&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Configured cache ceiling&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pages evicted by application threads&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Foreground eviction — application threads stalled for eviction&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;pages read into cache&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Cumulative physical reads from disk into cache&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;tracked dirty bytes in the cache&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Modified pages not yet flushed to disk&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The ratio that matters most operationally:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;cache fill ratio = bytes currently in the cache / maximum bytes configured&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
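&lt;p&gt;A ratio that sits above roughly 80% (the default eviction target) means background eviction is continuously working to keep the cache from filling. A ratio at or near 95%, combined with a &lt;code&gt;pages evicted by application threads&lt;/code&gt; counter that keeps increasing between samples, means foreground eviction is active and application threads are being paused. The counter is cumulative since server start, so compare two readings instead of checking for a nonzero value.&lt;/p&gt;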
&lt;p&gt;&lt;strong&gt;Checking cache pressure:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;let&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; c &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;serverStatus&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().wiredTiger.cache;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;print&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;Cache fill %:&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, Math.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(c[&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;bytes currently in the cache&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;] &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; c[&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;maximum bytes configured&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;] &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;print&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;App thread evictions:&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, c[&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;pages evicted by application threads&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;]);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
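&lt;p&gt;Because &lt;code&gt;pages evicted by application threads&lt;/code&gt; is a cumulative counter, a single reading says little about what is happening now. The sketch below compares two snapshots of the cache section taken some seconds apart. The function name, the snapshot values, and the 60-second interval are illustrative assumptions; only the three field names come from the &lt;code&gt;serverStatus&lt;/code&gt; output shown above.&lt;/p&gt;

```javascript
// Sketch: derive current cache pressure from two snapshots of
// db.serverStatus().wiredTiger.cache taken intervalSec seconds apart.
function cachePressure(before, after, intervalSec) {
  const fillPct =
    (after["bytes currently in the cache"] / after["maximum bytes configured"]) * 100;
  const evicted =
    after["pages evicted by application threads"] -
    before["pages evicted by application threads"];
  return {
    fillPct: Math.round(fillPct),
    appEvictionsPerSec: evicted / intervalSec,
    foregroundEvictionActive: evicted > 0,
  };
}

// Hypothetical snapshots, 60 seconds apart, on a 10 GB cache.
const t0 = {
  "bytes currently in the cache": 9.5e9,
  "maximum bytes configured": 10e9,
  "pages evicted by application threads": 1200,
};
const t1 = {
  "bytes currently in the cache": 9.6e9,
  "maximum bytes configured": 10e9,
  "pages evicted by application threads": 1500,
};
console.log(cachePressure(t0, t1, 60));
```

&lt;p&gt;In mongosh, the two snapshots would come from calling &lt;code&gt;db.serverStatus().wiredTiger.cache&lt;/code&gt; twice with a &lt;code&gt;sleep()&lt;/code&gt; in between.&lt;/p&gt;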
&lt;p&gt;&lt;strong&gt;Cache sizing:&lt;/strong&gt; MongoDB documentation specifies the default as the larger of 256 MB or &lt;code&gt;(RAM - 1GB) * 0.5&lt;/code&gt;. On a 16 GB server, that is &lt;code&gt;(16-1) * 0.5 = 7.5 GB&lt;/code&gt;. For a server dedicated to MongoDB, a common operational rule of thumb is to raise &lt;code&gt;cacheSizeGB&lt;/code&gt; (&lt;code&gt;--wiredTigerCacheSizeGB&lt;/code&gt; on the command line) to roughly 60% of RAM, leaving headroom for the OS page cache, sort operations, and connection overhead. This is not the documented default, and MongoDB’s documentation advises caution about going above the default, so validate any increase against the eviction metrics.&lt;/p&gt;
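&lt;p&gt;The default formula is easy to get wrong by applying the 50% before subtracting the 1 GB. A small sketch of the arithmetic, with a hypothetical function name and RAM expressed in GB:&lt;/p&gt;

```javascript
// Sketch of the documented default: the larger of 256 MB or 50% of (RAM - 1 GB).
// ramGB is total RAM in GB; the result is the default cache size in GB.
function defaultCacheGB(ramGB) {
  const floorGB = 0.25; // 256 MB
  return Math.max(floorGB, (ramGB - 1) * 0.5);
}

for (const ram of [1, 4, 16, 64]) {
  console.log(ram + " GB RAM: default cache " + defaultCacheGB(ram) + " GB");
}
```

&lt;p&gt;For the 16 GB server in the example this gives 7.5 GB, which is the baseline to compare against before changing &lt;code&gt;cacheSizeGB&lt;/code&gt;.&lt;/p&gt;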
&lt;p&gt;Configure via mongod.conf:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;yaml&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;storage&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;  wiredTiger&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;    engineConfig&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#85E89D&quot;&gt;      cacheSizeGB&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The two-layer memory model:&lt;/strong&gt; When MongoDB reads a document from disk, the OS page cache loads the compressed block. WiredTiger decompresses it into the internal cache. Both layers retain the data independently. On a cache miss in WiredTiger but a hit in OS page cache, the read is a decompression operation rather than a physical disk I/O — faster than a full disk read, but slower than a WiredTiger cache hit. Monitoring only disk IOPS can understate the actual working set pressure if the OS page cache is absorbing misses.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The MongoDB documentation for the WiredTiger storage engine describes this asymmetry: data in the filesystem cache matches the compressed on-disk format, while collection data in the WiredTiger internal cache is uncompressed and uses a different representation. Index pages in the internal cache keep their prefix compression, so indexes expand less than documents do. This is the source of the common sizing mistake where teams provision RAM based on compressed disk size.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;db.serverStatus().wiredTiger.cache&lt;/code&gt; output is documented in the MongoDB Server Manual under “db.serverStatus() output — wiredTiger.” The field &lt;code&gt;pages evicted by application threads&lt;/code&gt; is specifically called out in MongoDB documentation as an indicator of eviction pressure reaching foreground threads.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Working set exceeds cache&lt;/td&gt;&lt;td&gt;Read IOPS spike; ops/sec drops&lt;/td&gt;&lt;td&gt;Cache misses require physical disk reads after eviction&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Read-heavy analytics scanning full collections&lt;/td&gt;&lt;td&gt;Normal OLTP reads get evicted&lt;/td&gt;&lt;td&gt;Analytics scan floods cache with pages that are not reused&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Uncompressed cache significantly larger than disk size&lt;/td&gt;&lt;td&gt;Undersized WiredTiger cache despite adequate disk&lt;/td&gt;&lt;td&gt;Engineers sized RAM for compressed footprint, not uncompressed working set&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: WiredTiger cache is sized for compressed disk footprint, not the uncompressed working set — eviction pressure is causing application threads to stall on foreground eviction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Check cache fill ratio and foreground eviction count via &lt;code&gt;db.serverStatus().wiredTiger.cache&lt;/code&gt;; if fill ratio exceeds 90% consistently, increase &lt;code&gt;wiredTigerCacheSizeGB&lt;/code&gt; to 60% of available RAM or upgrade instance size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After resizing, the &lt;code&gt;pages evicted by application threads&lt;/code&gt; counter should stop climbing (it is cumulative, so watch its rate of increase fall to near zero); ops/sec should stabilize and disk IOPS should drop.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run the cache fill ratio check above against any MongoDB deployment that has been showing elevated IOPS or latency — verify whether cache pressure is the underlying cause before adding indexes or upgrading storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The WiredTiger cache and the OS page cache are two separate memory pools with two separate capacities. Sizing only one correctly is not enough.&lt;/p&gt;</content:encoded><category>databases</category></item><item><title>Aurora MySQL Writer CPU Spike Workflow</title><link>https://rajivonai.com/blog/2023-03-06-aurora-mysql-writer-cpu-spike-workflow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-03-06-aurora-mysql-writer-cpu-spike-workflow/</guid><description>A systematic runbook for diagnosing Aurora MySQL writer CPU spikes — from Performance Insights through lock contention, long transactions, and read offload.</description><pubDate>Mon, 06 Mar 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;An Aurora MySQL writer CPU spike is almost never just a CPU problem.&lt;/strong&gt; The writer processes writes exclusively for the cluster, and when CPU spikes, the culprit is usually a query that changed execution plan, a lock contention burst, a batch job running longer than expected, or a sudden increase in connection count. Treating it as a capacity problem and scaling the instance is the expensive, slow-feedback response. The fast response starts with Performance Insights.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;CloudWatch shows Aurora MySQL writer &lt;code&gt;CPUUtilization&lt;/code&gt; at 80–95%. Application latency is climbing. The P99 for write endpoints has doubled. The on-call engineer opens the console and sees the CPU metric, the latency metric, and a blinking cursor.&lt;/p&gt;
&lt;p&gt;An Aurora MySQL cluster exposes separate writer and reader endpoints. The writer handles all DML. Readers handle only SELECT queries that have been explicitly routed to the reader endpoint. When the writer is saturated, writes stall, and any reads routed to the writer stall with them. Scaling the writer instance buys time but does not address the root cause, and with Aurora Serverless v2 the workload stays constrained while capacity is still ramping up, which can prolong the incident in the short term.&lt;/p&gt;
&lt;p&gt;The diagnostic sequence determines whether this resolves in 10 minutes or 2 hours.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;CPUUtilization 80–100%&lt;/td&gt;&lt;td&gt;CloudWatch — Aurora writer&lt;/td&gt;&lt;td&gt;Writer is bottlenecked; cause unknown&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High DBLoad&lt;/td&gt;&lt;td&gt;Performance Insights — DBLoad metric&lt;/td&gt;&lt;td&gt;Confirms sessions waiting; compare DBLoadCPU vs DBLoadNonCPU&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;One query dominating AAS&lt;/td&gt;&lt;td&gt;Performance Insights — Top SQL&lt;/td&gt;&lt;td&gt;Single query is consuming most writer capacity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Long lock wait in INNODB STATUS&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW ENGINE INNODB STATUS\G&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Lock contention between concurrent transactions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Active connections spike&lt;/td&gt;&lt;td&gt;CloudWatch — DatabaseConnections&lt;/td&gt;&lt;td&gt;Connection pool exhausted or connection storm&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PROCESSLIST shows many similar queries&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW FULL PROCESSLIST&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Hot query pattern, not a single rogue query&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Performance Insights — split CPU vs wait&lt;/strong&gt; — Determine whether the bottleneck is CPU execution or wait events:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Performance Insights DBLoad chart separates &lt;code&gt;db.load.avg&lt;/code&gt; into &lt;code&gt;DBLoadCPU&lt;/code&gt; (sessions executing on CPU) and &lt;code&gt;DBLoadNonCPU&lt;/code&gt; (sessions waiting on locks, I/O, and other events). If &lt;code&gt;DBLoadNonCPU&lt;/code&gt; dominates, the CPU spike is a secondary effect of sessions piling up behind a lock or slow I/O, not pure execution load.&lt;/p&gt;
&lt;p&gt;Navigate to: RDS Console → your Aurora cluster → Performance Insights → select DB Load breakdown by wait event.&lt;/p&gt;
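&lt;p&gt;The CPU-versus-wait split can be reduced to one number when comparing incidents. The sketch below takes the two load values (average active sessions) as read from Performance Insights and reports the share that is waiting. The function name and the 0.5 cut-off for calling the load wait-bound are assumptions for illustration, not AWS guidance.&lt;/p&gt;

```javascript
// Sketch: classify writer load from the two DBLoad components.
// dbLoadCpu and dbLoadNonCpu are average active sessions (AAS).
function classifyLoad(dbLoadCpu, dbLoadNonCpu) {
  const total = dbLoadCpu + dbLoadNonCpu;
  if (!(total > 0)) {
    return { nonCpuShare: 0, verdict: "idle" };
  }
  const nonCpuShare = dbLoadNonCpu / total;
  // Assumed threshold: more than half of active sessions waiting means wait-bound.
  return { nonCpuShare, verdict: nonCpuShare > 0.5 ? "wait-bound" : "cpu-bound" };
}

// Hypothetical reading: 2 sessions on CPU, 6 waiting.
console.log(classifyLoad(2, 6));
```

&lt;p&gt;A wait-bound result points toward the lock and long-transaction checks that follow; a cpu-bound result points toward Top SQL and execution plans.&lt;/p&gt;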
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Top SQL by average active sessions&lt;/strong&gt; — Identify the specific query driving load:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Performance Insights → Top SQL tab, sorted by &lt;code&gt;Load (AAS)&lt;/code&gt;. The top query by AAS is the first candidate. Note its digest, get the full SQL text, and examine its execution plan.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run on the Aurora writer; replace this example with the statement text of the top digest from Performance Insights&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; customer_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 12345&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Currently running queries:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW FULL PROCESSLIST;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for queries in &lt;code&gt;State: executing&lt;/code&gt; or &lt;code&gt;State: Waiting for table metadata lock&lt;/code&gt; or &lt;code&gt;State: updating&lt;/code&gt;. A large number of identical or similar queries stacking up indicates the query is not returning promptly — the connection pool is filling with in-flight sessions.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;InnoDB lock contention:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW ENGINE INNODB &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;STATUS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;\G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scroll to the &lt;code&gt;TRANSACTIONS&lt;/code&gt; section and look for &lt;code&gt;LOCK WAIT&lt;/code&gt;. Lock waits indicate two or more transactions competing for the same row or range. The &lt;code&gt;LATEST DETECTED DEADLOCK&lt;/code&gt; section shows the most recent deadlock event. If it is recent and matches the CPU spike timing, lock contention is a prime suspect.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Long transactions:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_id, trx_started, trx_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       TIMESTAMPDIFF(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECOND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, trx_started, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; age_sec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_started&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any transaction older than 60 seconds on the writer during a CPU spike is a strong suspect. Long transactions hold row locks longer, block concurrent writes, and generate undo log that increases internal InnoDB maintenance work.&lt;/p&gt;
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Aurora writer CPU spike] --&gt; B{Performance Insights — single query dominant?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| C[EXPLAIN the query — check for full scan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D{Missing index?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| E[Add index — test in staging first]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no| F[Check statistics staleness — run ANALYZE TABLE]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| G{DBLoadNonCPU dominant?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|yes| H{INNODB STATUS shows lock waits?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| I[Find blocking transaction — reduce scope or kill]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| J[Check I/O metrics — consider read offload]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|no| K{Many connections in PROCESSLIST?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|yes| L[Check connection pool config — reduce max connections]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|no| M{Aurora Serverless v2 scaling in progress?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    M --&gt;|yes| N[Wait for scale-up — increase minimum ACU to prevent recurrence]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    M --&gt;|no| O[Check recent schema or code deployment]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
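&lt;p&gt;For teams that keep runbooks as code, the flowchart above can be written as a plain function. This is a direct transcription under assumed input names (the boolean fields are hypothetical and would be filled in by whoever ran the five checks); it adds no logic beyond the diagram.&lt;/p&gt;

```javascript
// Sketch: the decision tree as a function. Each field of `s` is a boolean
// answer to one diamond in the flowchart (field names are illustrative).
function nextStep(s) {
  if (s.singleQueryDominant) {
    return s.missingIndex
      ? "Add index (test in staging first)"
      : "Check statistics staleness (run ANALYZE TABLE)";
  }
  if (s.nonCpuDominant) {
    return s.lockWaits
      ? "Find blocking transaction; reduce scope or kill"
      : "Check I/O metrics; consider read offload";
  }
  if (s.manyConnections) {
    return "Check connection pool config; reduce max connections";
  }
  if (s.serverlessScaling) {
    return "Wait for scale-up; increase minimum ACU to prevent recurrence";
  }
  return "Check recent schema or code deployment";
}

// Example: no single dominant query, load is mostly waits, lock waits present.
console.log(nextStep({ singleQueryDominant: false, nonCpuDominant: true, lockWaits: true }));
```

&lt;p&gt;Keeping the order of the checks identical to the diagram matters: a dominant query is examined before lock waits, and connection counts only after both.&lt;/p&gt;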
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Add index for the top query&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If Performance Insights identifies a query doing a full scan (&lt;code&gt;type=ALL&lt;/code&gt; in EXPLAIN) as the top AAS consumer, adding the right index is the highest-leverage fix:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Confirm execution plan before adding index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; customer_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 12345&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Add the index (run during low-traffic window or use pt-online-schema-change for large tables)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ADD&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_customer_status (customer_id, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify the new plan&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; customer_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 12345&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
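&lt;p&gt;Aurora MySQL supports online DDL for most index additions. For large tables, progress can be followed through the Performance Schema stage events for &lt;code&gt;ALTER TABLE&lt;/code&gt; (&lt;code&gt;performance_schema.events_stages_current&lt;/code&gt;, with the &lt;code&gt;stage/innodb/alter table%&lt;/code&gt; instruments and the stage consumers enabled), provided the Performance Schema is turned on for the instance.&lt;/p&gt;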
&lt;p&gt;&lt;strong&gt;Option 2 — Route reads to Aurora reader endpoint&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If reads are being sent to the writer endpoint — intentionally or by misconfiguration — routing them to the reader reduces writer load immediately:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify no heavy reads are running on writer&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user, info, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;PROCESSLIST&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; command &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;!=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;Sleep&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; info &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIKE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;SELECT%&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; time&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
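&lt;p&gt;Update application connection configuration to direct SELECT queries to the cluster’s reader endpoint, which is the hostname containing &lt;code&gt;.cluster-ro-&lt;/code&gt; (for example &lt;code&gt;mycluster.cluster-ro-abc123example.us-east-1.rds.amazonaws.com&lt;/code&gt;, where the cluster name, identifier, and region are placeholders). For applications that cannot distinguish read from write connections, a proxy that routes by query rules, such as ProxySQL, is an intermediate step. RDS Proxy does not split reads from writes automatically; it can expose a separate read-only endpoint, but the application still has to choose it.&lt;/p&gt;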
&lt;p&gt;&lt;strong&gt;Option 3 — Kill long-running blocking transactions&lt;/strong&gt;&lt;/p&gt;
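&lt;p&gt;If a transaction is blocking others and has been running longer than its normal expected duration, find its thread ID. &lt;code&gt;INFORMATION_SCHEMA.INNODB_TRX&lt;/code&gt; lists open transactions and their age; it does not show who blocks whom, so confirm the blocker in &lt;code&gt;sys.innodb_lock_waits&lt;/code&gt; before acting:&lt;/p&gt;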
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Identify the blocking thread&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_mysql_thread_id, trx_started, trx_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_started&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Kill it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;KILL&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
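&lt;p&gt;Before killing anything, it helps to confirm which session is actually holding the lock. A minimal sketch, assuming MySQL 5.7 or later (including Aurora MySQL) with the &lt;code&gt;sys&lt;/code&gt; schema available: the &lt;code&gt;sys.innodb_lock_waits&lt;/code&gt; view pairs each waiting session with the session blocking it.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- One row per lock wait: who is waiting, and which connection blocks them&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_pid, waiting_query, wait_age_secs,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       blocking_pid, blocking_query, blocking_trx_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; sys&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;innodb_lock_waits&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; wait_age_secs &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;blocking_pid&lt;/code&gt; is the connection ID to pass to &lt;code&gt;KILL&lt;/code&gt;. An empty result means no session is currently waiting on an InnoDB lock, in which case killing the oldest transaction is unlikely to relieve the CPU spike.&lt;/p&gt;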
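&lt;p&gt;Coordinate with the application team before killing production transactions. On Aurora and RDS, if plain &lt;code&gt;KILL&lt;/code&gt; is denied for another user’s session, use &lt;code&gt;CALL mysql.rds_kill(&amp;#x3C;thread_id&gt;)&lt;/code&gt; instead. For recurring batch jobs that have grown too large, the durable fix is chunking: process rows in batches of 1,000–10,000 with an explicit commit after each chunk rather than in one large transaction.&lt;/p&gt;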
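&lt;p&gt;The chunking pattern can be sketched as a statement the job runner repeats until no rows remain. The table &lt;code&gt;orders_archive&lt;/code&gt;, its &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt; columns, and the cutoff date are hypothetical, and the sketch assumes autocommit is on so each statement commits by itself:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table and cutoff; each run deletes at most 5,000 rows and commits&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DELETE FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders_archive&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2024-01-01&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rows removed by the DELETE above; the runner stops when this reaches 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ROW_COUNT() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; rows_deleted;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each iteration holds locks only briefly, and a short pause between iterations gives other writers room to proceed.&lt;/p&gt;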
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Index additions:&lt;/strong&gt; Indexes can be dropped if they cause unexpected plan changes for other queries: &lt;code&gt;ALTER TABLE orders DROP INDEX idx_customer_status&lt;/code&gt;. For 24 hours after adding an index, watch Top SQL in Performance Insights for statements whose load rises, and re-run &lt;code&gt;EXPLAIN&lt;/code&gt; on them to check whether their plan changed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Read routing changes:&lt;/strong&gt; Application-level changes to reader endpoint routing can be reverted by changing the connection string back. Connections already in the pool keep using the old endpoint until they are recycled, typically within one connection TTL cycle.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
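&lt;p&gt;&lt;strong&gt;Killed transactions:&lt;/strong&gt; The killed transaction rolls back automatically. InnoDB rollback time grows with the number of rows the transaction modified and can take as long as the original work. Monitor &lt;code&gt;information_schema.INNODB_TRX&lt;/code&gt; (the row shows &lt;code&gt;trx_state = ROLLING BACK&lt;/code&gt; until it disappears) to confirm completion. Because the transaction never committed, its InnoDB changes are not written to the binlog.&lt;/p&gt;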
&lt;/li&gt;
&lt;/ol&gt;
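&lt;p&gt;On MySQL 8.0 and Aurora MySQL version 3, a gentler first step than dropping a suspect index is to hide it from the optimizer. A sketch using the same table and index names as above; invisible indexes are not available on MySQL 5.7:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hide the index from the optimizer; it is still maintained on writes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_customer_status INVISIBLE;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Restore it if the original plans were better with the index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_customer_status VISIBLE;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Toggling visibility is a metadata change, so the rollback of the rollback is immediate and avoids rebuilding the index on a large table.&lt;/p&gt;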
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Aurora Performance Insights exposes API access to DB load metrics and publishes &lt;code&gt;DBLoad&lt;/code&gt;, &lt;code&gt;DBLoadCPU&lt;/code&gt;, and &lt;code&gt;DBLoadNonCPU&lt;/code&gt; to CloudWatch. A CloudWatch alarm on &lt;code&gt;DBLoad&lt;/code&gt; exceeding a vCPU-based threshold (2× the instance’s vCPU count is a conservative starting point) can trigger automated notification before CPU fully saturates. The threshold is tied to vCPUs, not to &lt;code&gt;max_connections&lt;/code&gt;, because &lt;code&gt;DBLoad&lt;/code&gt; counts average active sessions competing for those vCPUs.&lt;/p&gt;
&lt;p&gt;A more targeted detection: schedule a query every 2 minutes on the writer that counts long-running transactions:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Long transaction detection (run on writer, schedule via external monitor)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; COUNT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; long_txn_count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TIMESTAMPDIFF(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECOND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, trx_started, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 120&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alert if &lt;code&gt;long_txn_count&lt;/code&gt; exceeds 2 during business hours. In most workloads, a transaction open for more than 2 minutes on a write-heavy Aurora cluster is either a stuck batch job or a session that opened a transaction and never committed or rolled back. Deadlock victims are not the cause: InnoDB rolls them back immediately.&lt;/p&gt;
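&lt;p&gt;When the count is nonzero, the next question is who owns those transactions. A sketch that joins &lt;code&gt;INNODB_TRX&lt;/code&gt; to &lt;code&gt;performance_schema.threads&lt;/code&gt;, assuming Performance Schema is enabled on the writer; the 120-second cutoff mirrors the check above:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Owner and state of each transaction open longer than 120 seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t.trx_mysql_thread_id, th.PROCESSLIST_USER, th.PROCESSLIST_HOST,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       th.PROCESSLIST_COMMAND, t.trx_started, t.trx_rows_modified&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;threads&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; th&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; th.PROCESSLIST_ID &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t.trx_mysql_thread_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TIMESTAMPDIFF(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECOND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, t.trx_started, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 120&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t.trx_started;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;PROCESSLIST_COMMAND&lt;/code&gt; of &lt;code&gt;Sleep&lt;/code&gt; points to a session idle inside an open transaction; &lt;code&gt;Query&lt;/code&gt; points to a statement that is still running.&lt;/p&gt;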
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What broke:&lt;/strong&gt; Aurora MySQL writer CPU spiked to 90%+, causing write latency to climb and application error rates to increase. The root cause was a high-AAS query executing a full table scan on a growing table after a recent data volume increase changed the query’s cost model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What was done:&lt;/strong&gt; Performance Insights identified the specific query. An index was added targeting the full-scan column. Writer CPU returned to baseline within 5 minutes of the index becoming active.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What prevents recurrence:&lt;/strong&gt; Performance Insights monitoring with a DBLoad alarm at 4 AAS (writer-size-appropriate threshold) provides early warning. The long-transaction check query is scheduled to run every 2 minutes as a canary for batch job runaway.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Open Performance Insights — confirm DBLoad is elevated on the writer, not the reader&lt;/li&gt;
&lt;li&gt;Compare &lt;code&gt;DBLoadCPU&lt;/code&gt; vs &lt;code&gt;DBLoadNonCPU&lt;/code&gt; — determine if wait events or CPU execution dominate&lt;/li&gt;
&lt;li&gt;Identify top query by AAS in Performance Insights Top SQL tab&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;EXPLAIN&lt;/code&gt; on the top query — look for &lt;code&gt;type=ALL&lt;/code&gt; or high &lt;code&gt;rows&lt;/code&gt; estimate&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;SHOW FULL PROCESSLIST&lt;/code&gt; — check for many stacked identical queries&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;SHOW ENGINE INNODB STATUS\G&lt;/code&gt; — look for lock waits and recent deadlocks&lt;/li&gt;
&lt;li&gt;Run long-transaction query on &lt;code&gt;INFORMATION_SCHEMA.INNODB_TRX&lt;/code&gt; — look for transactions older than 60 seconds&lt;/li&gt;
&lt;li&gt;If full scan confirmed — add index in staging, test plan change, deploy to production&lt;/li&gt;
&lt;li&gt;If lock contention confirmed — identify blocking transaction, coordinate kill or reduce transaction scope&lt;/li&gt;
&lt;li&gt;Verify no SELECT queries are routed to writer endpoint — check connection strings in application config&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: An Aurora MySQL writer CPU spike is treated as a capacity problem, which leads to scaling the instance or adding replicas — changes that are slow, expensive, and do not address a bad query plan, lock contention, or a batch job that outgrew its transaction scope.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Open Performance Insights first: split &lt;code&gt;DBLoadCPU&lt;/code&gt; from &lt;code&gt;DBLoadNonCPU&lt;/code&gt; to determine whether the bottleneck is execution or waiting, identify the top AAS query, then follow the decision tree to the targeted remediation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: CPU returns to baseline and DBLoad drops below the vCPU-count threshold within minutes of addressing the root cause — without any instance scaling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, enable a CloudWatch alarm on &lt;code&gt;DBLoad&lt;/code&gt; at a threshold of 2× the instance’s vCPU count, and verify that Performance Insights is enabled on your Aurora writer so the top SQL tab is populated the next time a spike occurs.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cloud</category><category>checklist</category><category>failures</category></item><item><title>MySQL Replication Lag Decision Tree</title><link>https://rajivonai.com/blog/2023-02-06-mysql-replication-lag-decision-tree/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-02-06-mysql-replication-lag-decision-tree/</guid><description>A systematic runbook for diagnosing MySQL replication lag — from initial SHOW REPLICA STATUS to parallel apply, long transactions, and relay log space.</description><pubDate>Mon, 06 Feb 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Replication lag in MySQL is a symptom, not a cause — but the cause is almost always one of five things, and the diagnostic sequence matters.&lt;/strong&gt; Engineers who start tuning parallel replica workers before they check whether the replica’s SQL thread is even running waste an hour on the wrong problem. This runbook covers the decision tree from first alert to targeted remediation.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The alert fires: &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; is 300 and climbing. Read queries routed to the replica are returning data that is several minutes stale. The application is surfacing incorrect balances, missing recent records, or serving out-of-date inventory counts depending on what is being replicated.&lt;/p&gt;
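&lt;p&gt;&lt;code&gt;Seconds_Behind_Source&lt;/code&gt; is the difference between the replica’s current clock time and the timestamp the primary recorded in its binlog for the event the replica’s SQL (applier) thread is currently processing. It estimates how far behind the replica is in applying committed transactions, with two caveats: it reads &lt;code&gt;NULL&lt;/code&gt; when the SQL thread is not running, and it can read 0 while the I/O thread is still behind, because it only compares against events already in the relay log. When it grows without bound, the replica is applying events slower than the primary is producing them; when it is &lt;code&gt;NULL&lt;/code&gt;, the replica has stopped applying events entirely.&lt;/p&gt;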
&lt;p&gt;The distinction between “stopped” and “slow” is the first fork in the diagnostic tree.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Seconds_Behind_Source&lt;/code&gt; growing&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW REPLICA STATUS\G&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Replica is falling behind; does not indicate why&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Replica_SQL_Running: No&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW REPLICA STATUS\G&lt;/code&gt;&lt;/td&gt;&lt;td&gt;SQL thread stopped; replication is halted, not just slow&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Replica_IO_Running: No&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW REPLICA STATUS\G&lt;/code&gt;&lt;/td&gt;&lt;td&gt;I/O thread stopped; the replica is not receiving new binlog events&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Last_SQL_Error&lt;/code&gt; non-empty&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW REPLICA STATUS\G&lt;/code&gt;&lt;/td&gt;&lt;td&gt;SQL thread encountered an error on a specific event&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High relay log space&lt;/td&gt;&lt;td&gt;&lt;code&gt;Relay_Log_Space&lt;/code&gt; in &lt;code&gt;SHOW REPLICA STATUS\G&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Binlog events are arriving faster than the SQL thread can apply them&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Long-running transactions on primary&lt;/td&gt;&lt;td&gt;&lt;code&gt;INFORMATION_SCHEMA.INNODB_TRX&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Large transactions produce large groups of binlog events that take time to apply&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Thread status&lt;/strong&gt; — Verify both replication threads are running before investigating lag causes:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;REPLICA&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATUS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;\G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for &lt;code&gt;Replica_IO_Running: Yes&lt;/code&gt; and &lt;code&gt;Replica_SQL_Running: Yes&lt;/code&gt;. If either is &lt;code&gt;No&lt;/code&gt;, read &lt;code&gt;Last_IO_Error&lt;/code&gt; or &lt;code&gt;Last_SQL_Error&lt;/code&gt; for the stop reason. A stopped thread is not a lag problem — it is a replication failure. Fix the root cause before any lag remediation.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Long-running transactions on the primary&lt;/strong&gt; — A single long transaction creates one large binlog event that the replica must apply sequentially:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_id, trx_started, trx_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       TIMESTAMPDIFF(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECOND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, trx_started, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_age_sec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_started&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
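&lt;p&gt;Any transaction that has been open for more than 30–60 seconds is a candidate for stalling replica apply. A transaction is written to the binlog only when it commits, so the lag appears after the commit, when the replica has to apply all of its changes as one unit. Check &lt;code&gt;trx_query&lt;/code&gt; for the SQL responsible; it is &lt;code&gt;NULL&lt;/code&gt; when the session is idle between statements.&lt;/p&gt;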
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Top queries by wait time on primary&lt;/strong&gt; — Identify what the primary is spending time on:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(SUM_TIMER_WAIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COUNT_STAR &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; 1e12, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;3&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; avg_latency_sec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;events_statements_summary_by_digest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUM_TIMER_WAIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;High-latency statements that generate large binlog events are a common cause of chronic lag. With single-threaded apply, a 10-second DELETE that runs every minute can stall the replica for roughly 10 seconds on each run while everything committed behind it queues up.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Parallel apply configuration&lt;/strong&gt; — Check whether multi-threaded replica apply is enabled:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; @@replica_parallel_workers, @@replica_parallel_type;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
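&lt;p&gt;If &lt;code&gt;replica_parallel_workers&lt;/code&gt; is 0 or 1, the replica applies one transaction at a time. MySQL 5.7 and later support &lt;code&gt;LOGICAL_CLOCK&lt;/code&gt; parallelism, which applies transactions from the same binlog group commit in parallel. On a high-throughput primary, single-threaded apply is the most common cause of chronic lag. The variable names above apply to MySQL 8.0.26 and later; earlier versions use &lt;code&gt;slave_parallel_workers&lt;/code&gt; and &lt;code&gt;slave_parallel_type&lt;/code&gt;. Since 8.0.27 the defaults are 4 workers with &lt;code&gt;LOGICAL_CLOCK&lt;/code&gt;, so a value of 0 or 1 on such a server was set deliberately. On releases where &lt;code&gt;replica_parallel_type&lt;/code&gt; no longer exists, query only &lt;code&gt;replica_parallel_workers&lt;/code&gt;.&lt;/p&gt;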
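&lt;p&gt;To see what the applier workers are actually doing, MySQL 8.0 exposes one row per worker in the Performance Schema. A sketch, assuming both primary and replica run 8.0 so that the commit timestamps are populated:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Per-worker state, plus the delay between commit on the primary and apply on the replica&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; WORKER_ID, SERVICE_STATE, APPLYING_TRANSACTION, LAST_APPLIED_TRANSACTION,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       TIMESTAMPDIFF(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECOND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                     LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;                     LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; last_apply_delay_sec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;replication_applier_status_by_worker&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; WORKER_ID;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If only one worker ever shows a non-empty &lt;code&gt;APPLYING_TRANSACTION&lt;/code&gt;, the workload is not parallelizing, and adding more workers is unlikely to help.&lt;/p&gt;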
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Relay log space&lt;/strong&gt; — Check if the relay log backlog is growing:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;REPLICA&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATUS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;\G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look at &lt;code&gt;Relay_Log_Space&lt;/code&gt;. If this is large and growing, the I/O thread is receiving binlog events faster than the SQL thread processes them — confirming a slow-apply bottleneck rather than a network or connectivity issue.&lt;/p&gt;
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Seconds_Behind_Source growing] --&gt; B{SQL_Running = YES?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| C[Read Last_SQL_Error]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D[Fix SQL error — skip or repair event]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| E{IO_Running = YES?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|no| F[Read Last_IO_Error]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G[Fix network or auth issue]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| H{Long transaction on primary?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| I[Reduce transaction size on primary]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| J{parallel_workers is 0 or 1?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| K[Enable LOGICAL_CLOCK parallel apply]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| L{Relay log space growing?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| M[Increase relay_log_space_limit or scale replica]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| N[Check primary write volume vs replica capacity]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Enable parallel replica apply&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Single-threaded apply is the most common cause of lag on busy primaries. Enable multi-threaded apply using the &lt;code&gt;LOGICAL_CLOCK&lt;/code&gt; algorithm, which reproduces on the replica the parallelism of the primary’s binlog group commit. Stop the SQL thread first: MySQL rejects changes to &lt;code&gt;replica_parallel_type&lt;/code&gt; and &lt;code&gt;replica_preserve_commit_order&lt;/code&gt; while it is running, and a new &lt;code&gt;replica_parallel_workers&lt;/code&gt; value takes effect only when the thread starts again:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;STOP&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; REPLICA&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SQL_THREAD;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; GLOBAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replica_parallel_workers &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 4&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; GLOBAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replica_parallel_type &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;LOGICAL_CLOCK&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Commit in the same order as the primary so readers never see gaps&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; GLOBAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replica_preserve_commit_order &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start the SQL thread so the new settings take effect:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;START&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; REPLICA&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SQL_THREAD;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
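&lt;p&gt;Monitor &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; to confirm the replica is catching up. The MySQL documentation recommends &lt;code&gt;replica_preserve_commit_order = 1&lt;/code&gt; when using parallel apply to maintain consistent external visibility. &lt;code&gt;SET GLOBAL&lt;/code&gt; does not survive a server restart; on MySQL 8.0, use &lt;code&gt;SET PERSIST&lt;/code&gt; or the server configuration file to make the settings permanent.&lt;/p&gt;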
&lt;p&gt;&lt;strong&gt;Option 2 — Kill blocking long transactions on the primary&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If a single large transaction is still running on the primary and would take minutes to apply once it commits, identify and interrupt it before it reaches the binlog:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the primary&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_id, trx_started, trx_mysql_thread_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; trx_started&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;KILL&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After killing the transaction, verify that it rolls back cleanly. Killing is disruptive, and rolling back a large transaction takes time of its own, so confirm that the transaction is the actual cause before killing it. If the transaction is a scheduled batch job, coordinate with the application team to reduce its scope (process in smaller batches) or schedule it for a window when stale replica reads matter less.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 3 — Promote replica or add a new downstream replica&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If the primary’s write volume consistently exceeds what a single replica can apply even with parallel workers, the architecture has reached a scale limit. Options:&lt;/p&gt;
&lt;ul&gt;
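&lt;li&gt;For planned maintenance or a topology change, promote the replica to primary and demote the original, but only once the replica has fully caught up; promoting a lagging replica loses the transactions it has not yet applied&lt;/li&gt;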
&lt;li&gt;Add a second-tier replica that replicates from a relay replica closer to the primary&lt;/li&gt;
&lt;li&gt;Evaluate whether reads can be sharded or moved to a read-optimized layer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not a quick fix — it is an architectural response to sustained primary write volume exceeding replica apply capacity.&lt;/p&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;For parallel apply changes:&lt;/strong&gt; Disable by setting &lt;code&gt;replica_parallel_workers = 0&lt;/code&gt; (or &lt;code&gt;1&lt;/code&gt; on releases that deprecate 0) and restarting the SQL thread. The change is non-destructive: the replica returns to sequential apply as soon as the SQL thread restarts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;For killed transactions on primary:&lt;/strong&gt; The transaction will roll back automatically. Monitor &lt;code&gt;information_schema.INNODB_TRX&lt;/code&gt; to confirm the rollback completes. If the transaction was large, rollback can take as long as the original execution. No binlog event is emitted for the rolled-back transaction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
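&lt;p&gt;&lt;strong&gt;For relay log space changes:&lt;/strong&gt; &lt;code&gt;relay_log_space_limit&lt;/code&gt; is not a dynamic variable, so it cannot be changed with &lt;code&gt;SET GLOBAL&lt;/code&gt;; changing it means editing the server configuration and restarting &lt;code&gt;mysqld&lt;/code&gt;. Raising the limit is non-destructive. Before lowering it, wait for the SQL thread to consume the existing relay logs, because the I/O thread pauses once the limit is reached.&lt;/p&gt;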
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Replication lag monitoring lends itself to a simple alerting script. The core signal, &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; above a threshold, can be captured from &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt; via any MySQL-compatible monitoring tool (Percona Monitoring and Management, the CloudWatch &lt;code&gt;ReplicaLag&lt;/code&gt; metric on Amazon RDS, or a custom cron-driven script).&lt;/p&gt;
&lt;p&gt;A more targeted automation: schedule a query on the primary every 5 minutes to check for transactions older than 60 seconds and record the result. The count query below is the simplest form of that check. A variant that stores one row per transaction with its &lt;code&gt;trx_age_sec&lt;/code&gt; allows alerting on any transaction older than 300 seconds, before it commits and becomes a multi-minute apply that stalls the replica.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Scheduled check for long-running transactions (run on primary)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; COUNT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; long_txn_count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; TIMESTAMPDIFF(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SECOND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, trx_started, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 60&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If this returns nonzero during steady-state operation, a common cause of replication lag is already present on the primary, even if the lag itself is not yet visible.&lt;/p&gt;
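&lt;p&gt;The monitoring-table variant described above could be sketched as follows. This is a minimal sketch, not a finished job: the table name &lt;code&gt;long_trx_log&lt;/code&gt; and its columns are hypothetical, and the &lt;code&gt;INSERT&lt;/code&gt; is assumed to run from whatever scheduler you already use (cron, the MySQL event scheduler, or a monitoring agent).&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical monitoring table (create once, in a schema of your choosing)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE IF NOT EXISTS long_trx_log (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  captured_at  DATETIME        NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  trx_id       VARCHAR(18)     NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  thread_id    BIGINT UNSIGNED NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  trx_age_sec  INT             NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  trx_query    VARCHAR(1024)   NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Scheduled every 5 minutes on the primary: record each transaction older than 60 seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;INSERT INTO long_trx_log (captured_at, trx_id, thread_id, trx_age_sec, trx_query)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT NOW(),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       trx_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       trx_mysql_thread_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       TIMESTAMPDIFF(SECOND, trx_started, NOW()),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       trx_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM information_schema.INNODB_TRX&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) &gt; 60;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An alert rule can then select rows from the latest capture where &lt;code&gt;trx_age_sec &gt; 300&lt;/code&gt;. Keeping &lt;code&gt;thread_id&lt;/code&gt; and &lt;code&gt;trx_query&lt;/code&gt; in the row means the on-call engineer can see which session to investigate without re-running the check.&lt;/p&gt;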
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What broke:&lt;/strong&gt; MySQL replication lag caused read replicas to serve stale data. The replica was applying committed transactions slower than the primary was producing them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What was done:&lt;/strong&gt; Identified the root cause (long transactions or single-threaded apply), enabled parallel replica apply or reduced transaction scope on the primary, and verified &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; returned to near zero.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What prevents recurrence:&lt;/strong&gt; Parallel apply configured with &lt;code&gt;LOGICAL_CLOCK&lt;/code&gt; handles normal write volume. Long-transaction alerting on the primary gives early warning before binlog events stall the replica apply thread.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;SHOW REPLICA STATUS\G&lt;/code&gt; and confirm both &lt;code&gt;Replica_IO_Running&lt;/code&gt; and &lt;code&gt;Replica_SQL_Running&lt;/code&gt; are &lt;code&gt;Yes&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Read &lt;code&gt;Last_SQL_Error&lt;/code&gt; and &lt;code&gt;Last_IO_Error&lt;/code&gt; — if either is non-empty, address the error before diagnosing lag&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; trend — is it growing, stable, or recovering?&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;INFORMATION_SCHEMA.INNODB_TRX&lt;/code&gt; on primary for transactions older than 30 seconds&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;performance_schema.events_statements_summary_by_digest&lt;/code&gt; on primary for the statements with the highest total wait time&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;SELECT @@replica_parallel_workers, @@replica_parallel_type&lt;/code&gt; — if workers is 0 or 1, evaluate enabling parallel apply&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;Relay_Log_Space&lt;/code&gt; from &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt; — large growing relay log confirms slow-apply bottleneck&lt;/li&gt;
&lt;li&gt;If enabling parallel apply, set &lt;code&gt;replica_preserve_commit_order = 1&lt;/code&gt; before restarting the SQL thread&lt;/li&gt;
&lt;li&gt;After any change, monitor &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; for 10–15 minutes to confirm the trend reverses&lt;/li&gt;
&lt;li&gt;Document the root cause and resolution in your incident log for pattern tracking&lt;/li&gt;
&lt;/ol&gt;
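&lt;p&gt;Two of the checklist steps translate directly into SQL. The sketch below assumes MySQL 8.0.26 or later, where the &lt;code&gt;replica_*&lt;/code&gt; variable names exist (older releases use the &lt;code&gt;slave_*&lt;/code&gt; equivalents). The worker count of 4 is an illustrative assumption, not a recommendation; size it against your own write concurrency.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Step 5 (on the primary): statements with the highest total wait time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- SUM_TIMER_WAIT is reported in picoseconds, so divide by 1e12 for seconds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT DIGEST_TEXT,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       COUNT_STAR,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_wait_sec&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM performance_schema.events_statements_summary_by_digest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY SUM_TIMER_WAIT DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Steps 6 and 8 (on the replica): enable parallel apply with commit order preserved&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;STOP REPLICA SQL_THREAD;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET GLOBAL replica_parallel_type = &apos;LOGICAL_CLOCK&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET GLOBAL replica_parallel_workers = 4;  -- assumed value for illustration&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET GLOBAL replica_preserve_commit_order = ON;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;START REPLICA SQL_THREAD;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SET GLOBAL&lt;/code&gt; does not survive a restart. Once the change is verified, use &lt;code&gt;SET PERSIST&lt;/code&gt; or update the configuration file so the replica does not silently fall back to single-threaded apply.&lt;/p&gt;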
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; grows during an incident and the natural instinct is to tune parallel workers — but if the SQL thread has stopped or there is a long transaction blocking apply, that tuning changes nothing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Follow the decision tree: check thread status first, long transactions second, parallel apply configuration third, relay log space last. Each check either identifies the cause or rules it out before the next step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After the correct remediation, &lt;code&gt;Seconds_Behind_Source&lt;/code&gt; stops growing and trends back toward zero within a few minutes, confirming the apply bottleneck was addressed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;SELECT @@replica_parallel_workers, @@replica_parallel_type&lt;/code&gt; on every replica in your fleet. If any replica reports &lt;code&gt;replica_parallel_workers&lt;/code&gt; of &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt;, evaluate enabling &lt;code&gt;LOGICAL_CLOCK&lt;/code&gt; parallel apply before the next high-write event.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>MySQL Cardinality and Index Selectivity</title><link>https://rajivonai.com/blog/2023-01-30-mysql-cardinality-and-index-selectivity/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-01-30-mysql-cardinality-and-index-selectivity/</guid><description>MySQL ignores an index when the optimizer estimates a full scan is cheaper — which happens when cardinality is too low, statistics are stale, or the query shape doesn&apos;t match index selectivity. How to diagnose which problem it is and what to do about each.</description><pubDate>Mon, 30 Jan 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;MySQL can have a perfectly valid index on a column and still choose a full table scan — not because the optimizer is broken, but because the index is genuinely not worth using.&lt;/strong&gt; Understanding cardinality and selectivity is what separates engineers who add indexes thoughtfully from those who add them and then wonder why EXPLAIN still shows &lt;code&gt;type=ALL&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most engineers learn early that indexes speed up queries. What the introductory materials skip is the optimizer’s decision logic: an index is only used when the optimizer estimates it will be cheaper than not using it. That estimate is driven by selectivity — how many rows the index is expected to filter out. A high-selectivity index on an email column eliminates nearly every row it does not match. A low-selectivity index on a status column with three possible values eliminates almost nothing, and the optimizer correctly concludes that scanning the whole table in a single sequential pass is cheaper than bouncing through the index structure.&lt;/p&gt;
&lt;p&gt;This distinction matters most on large tables. On a 200-row test database, the optimizer often uses indexes it would ignore on a 50-million-row production table, because the cost model changes with scale. Engineers who tune queries against small datasets frequently miss the issue until the table grows.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure mode is specific: you create an index, run EXPLAIN, and see &lt;code&gt;type=ALL&lt;/code&gt;. The index exists. The query filters on the indexed column. But the optimizer ignores it. This confuses engineers who expect index presence to imply index use.&lt;/p&gt;
&lt;p&gt;The root cause is low selectivity. If a &lt;code&gt;status&lt;/code&gt; column has three values — &lt;code&gt;active&lt;/code&gt;, &lt;code&gt;inactive&lt;/code&gt;, &lt;code&gt;deleted&lt;/code&gt; — and 60% of rows are &lt;code&gt;active&lt;/code&gt;, an index on &lt;code&gt;status&lt;/code&gt; where the query filters &lt;code&gt;WHERE status = &apos;active&apos;&lt;/code&gt; returns 60% of the table. InnoDB’s cost model estimates that reading 60% of a large table via random index lookups is more expensive than a sequential full scan, and it is usually right.&lt;/p&gt;
&lt;p&gt;The second failure mode is stale cardinality estimates. InnoDB samples pages to estimate cardinality rather than counting exact distinct values. After a large bulk insert, a table truncate and reload, or months of accumulating rows, the stored cardinality estimate can be wildly wrong, causing the optimizer to make poor choices.&lt;/p&gt;
&lt;p&gt;Why does the optimizer choose a full table scan despite an index, and how can engineers design indexes that the database will actually use?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Cardinality&lt;/strong&gt; is the number of distinct values in an index, as estimated by InnoDB. &lt;strong&gt;Selectivity&lt;/strong&gt; is the ratio of cardinality to total rows, driving the optimizer’s cost model.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Query filters by status] --&gt; B{MySQL Optimizer}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C[Evaluate index — High random IO cost]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; D[Evaluate table scan — Sequential IO cost]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; E{Cost Model}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt; F[Table scan chosen]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G[Index ignored]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A selectivity of 0.99 (nearly unique column) is excellent. A selectivity of 0.000003 (three values across a million rows) is almost worthless for filtering.&lt;/p&gt;
&lt;p&gt;You can query estimated selectivity directly:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;INDEX_NAME&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;COLUMN_NAME&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;CARDINALITY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_ROWS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  ROUND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;CARDINALITY&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_ROWS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;4&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; selectivity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;STATISTICS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLES&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; t&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_SCHEMA&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_NAME&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_NAME&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_SCHEMA&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;your_db&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; s&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;TABLE_NAME&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;your_table&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
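&lt;p&gt;The figures in &lt;code&gt;information_schema&lt;/code&gt; are sampled estimates, and rounding to four decimal places hides very small ratios. When the estimate looks suspicious, an exact count settles the question. The sketch below assumes a hypothetical &lt;code&gt;orders&lt;/code&gt; table with a &lt;code&gt;status&lt;/code&gt; column. Both statements scan the whole table, so they are better run on a replica or off-peak.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Exact inputs to the selectivity ratio (distinct values / total rows)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT COUNT(DISTINCT status) AS distinct_values,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       COUNT(*)               AS total_rows&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Value distribution: the fraction of the table each value would return&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT status,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       COUNT(*) AS row_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       ROUND(COUNT(*) / (SELECT COUNT(*) FROM orders), 4) AS fraction_of_table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY row_count DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second query is often the more useful one. An index on a skewed column can be worthless for the common value and still effective for a rare one, which a single selectivity number does not show.&lt;/p&gt;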
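&lt;p&gt;&lt;strong&gt;How InnoDB estimates cardinality:&lt;/strong&gt; InnoDB uses random page sampling rather than a full scan. The number of pages sampled is controlled by &lt;code&gt;innodb_stats_persistent_sample_pages&lt;/code&gt; (default 20) when persistent statistics are enabled, which is the default, and by &lt;code&gt;innodb_stats_transient_sample_pages&lt;/code&gt; (default 8) otherwise. The older &lt;code&gt;innodb_stats_sample_pages&lt;/code&gt; name was removed in MySQL 8.0. Small samples on large tables with skewed data distributions produce inaccurate estimates.&lt;/p&gt;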
&lt;p&gt;&lt;strong&gt;Refreshing stale estimates:&lt;/strong&gt; Running &lt;code&gt;ANALYZE TABLE orders;&lt;/code&gt; re-runs the sampling process and updates the stored statistics: per-index cardinality in &lt;code&gt;mysql.innodb_index_stats&lt;/code&gt; and the row count in &lt;code&gt;mysql.innodb_table_stats&lt;/code&gt;. After bulk loads, table rebuilds, or significant data changes, running this is the fastest way to restore accurate optimizer decisions.&lt;/p&gt;
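&lt;p&gt;A minimal sketch of raising the sample size for one table and confirming what InnoDB stored afterwards. The &lt;code&gt;orders&lt;/code&gt; table, the &lt;code&gt;your_db&lt;/code&gt; schema, and the value of 100 pages are assumptions for illustration. Larger samples make &lt;code&gt;ANALYZE TABLE&lt;/code&gt; slower, so raise the number only for tables whose estimates are demonstrably off.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Per-table override of the sampling depth (100 is an assumed value)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders STATS_PERSISTENT = 1, STATS_SAMPLE_PAGES = 100;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Re-sample with the new setting&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE TABLE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Inspect the stored per-index statistics and when they were last refreshed&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT index_name, stat_name, stat_value, sample_size, last_update&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM mysql.innodb_index_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE database_name = &apos;your_db&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND table_name = &apos;orders&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;last_update&lt;/code&gt; value that predates the most recent bulk load is a quick sign that the optimizer is still working from pre-load statistics.&lt;/p&gt;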
&lt;p&gt;&lt;strong&gt;Composite indexes and leading column selectivity:&lt;/strong&gt; A composite index on &lt;code&gt;(status, created_at)&lt;/code&gt; is only useful when the query can filter on &lt;code&gt;status&lt;/code&gt; first. If &lt;code&gt;status&lt;/code&gt; has low selectivity, the optimizer may still prefer a full scan, unless the &lt;code&gt;created_at&lt;/code&gt; range is exceptionally narrow.&lt;/p&gt;
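&lt;p&gt;The leading-column behavior is easiest to see by comparing plans. The sketch below assumes a hypothetical &lt;code&gt;orders&lt;/code&gt; table and index name. The plans you get depend on your data volume and distribution, so treat the comments as what to look for, not guaranteed output.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical composite index with the low-selectivity column first&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders ADD INDEX idx_status_created (status, created_at);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Filters on the leading column plus a narrow range: check whether key = idx_status_created&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN SELECT * FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE status = &apos;active&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND created_at &gt;= NOW() - INTERVAL 1 HOUR;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Same index, wide range: compare the rows estimate and whether type falls back to ALL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN SELECT * FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE status = &apos;active&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND created_at &gt;= NOW() - INTERVAL 1 YEAR;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- No predicate on the leading column: the index generally cannot be used for lookups&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN SELECT * FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE created_at &gt;= NOW() - INTERVAL 1 HOUR;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Comparing the &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt;, and &lt;code&gt;rows&lt;/code&gt; columns across the three plans shows how much of the filtering the index is actually doing for each query shape.&lt;/p&gt;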
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern across high-scale engineering teams is to enforce strict index selectivity thresholds during schema reviews. Shopify’s engineering blog explicitly outlines their MySQL indexing strategy, noting that adding an index on a boolean or low-cardinality column is an anti-pattern. They observe that MySQL’s optimizer will frequently ignore these indexes because the random I/O required to fetch rows exceeds the sequential I/O cost of a full table scan.&lt;/p&gt;
&lt;p&gt;Similarly, with persistent statistics enabled, InnoDB’s estimates are only as good as the pages sampled under &lt;code&gt;innodb_stats_persistent_sample_pages&lt;/code&gt;. If the sample does not reflect the distribution of the data, for example immediately after a massive backfill, the optimizer can choose poor plans. A common countermeasure is to run &lt;code&gt;ANALYZE TABLE&lt;/code&gt; from post-migration automation so that the optimizer has fresh cardinality estimates before the table takes production traffic.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Stale cardinality after bulk load&lt;/td&gt;&lt;td&gt;Optimizer uses wrong index or skips a valid one&lt;/td&gt;&lt;td&gt;Estimate reflects pre-load row distribution&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Composite index with low-selectivity leading column&lt;/td&gt;&lt;td&gt;Index not entered even when tail columns are selective&lt;/td&gt;&lt;td&gt;Optimizer evaluates leading column selectivity first&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FORCE INDEX overriding a correct low-selectivity decision&lt;/td&gt;&lt;td&gt;Query runs slower than a full scan would&lt;/td&gt;&lt;td&gt;Forces random I/O on a column that benefits from sequential scan&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
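&lt;p&gt;Before leaving a &lt;code&gt;FORCE INDEX&lt;/code&gt; hint in application code, it is worth measuring both paths. The sketch below assumes MySQL 8.0.18 or later (where &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; is available), a hypothetical &lt;code&gt;orders&lt;/code&gt; table, and a hypothetical single-column index named &lt;code&gt;idx_status&lt;/code&gt;. Note that &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; executes the query, so run it where a full scan is acceptable.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Path 1: force the low-selectivity index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN ANALYZE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT * FROM orders FORCE INDEX (idx_status)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE status = &apos;active&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Path 2: take the index out of consideration so the table is scanned&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN ANALYZE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT * FROM orders IGNORE INDEX (idx_status)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE status = &apos;active&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compare the actual time reported for the top-level iterator in each plan, and run each statement more than once so that buffer pool warm-up does not decide the result. If the forced path is not clearly faster, the hint is working against the optimizer.&lt;/p&gt;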
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: An index exists but EXPLAIN shows &lt;code&gt;type=ALL&lt;/code&gt; because selectivity is too low for the optimizer to prefer it over a full scan.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Check selectivity using the &lt;code&gt;information_schema&lt;/code&gt; query above; run &lt;code&gt;ANALYZE TABLE&lt;/code&gt; after bulk data changes; design composite indexes with the most selective column first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Compare &lt;code&gt;EXPLAIN&lt;/code&gt; output before and after ANALYZE TABLE on a table with stale stats; watch &lt;code&gt;type&lt;/code&gt; change from &lt;code&gt;ALL&lt;/code&gt; to &lt;code&gt;ref&lt;/code&gt; or &lt;code&gt;range&lt;/code&gt; when the estimate is accurate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run the selectivity query on your largest tables and verify that indexes on low-cardinality columns are intentional.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>failures</category></item><item><title>PostgreSQL Autovacuum Failure Workflow</title><link>https://rajivonai.com/blog/2023-01-16-postgresql-autovacuum-failure-workflow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-01-16-postgresql-autovacuum-failure-workflow/</guid><description>A step-by-step runbook for diagnosing and resolving autovacuum failures: dead tuple accumulation, bloat, and transaction ID wraparound risk.</description><pubDate>Mon, 16 Jan 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;When &lt;code&gt;n_dead_tup&lt;/code&gt; climbs and autovacuum isn’t keeping up, you have roughly two problems running in parallel: the bloat you can see today, and the transaction ID wraparound risk you might not notice until PostgreSQL forces an emergency shutdown.&lt;/strong&gt; The failure modes compound — bloat slows queries, which slows transactions, which delays vacuum, which grows bloat further. Getting out requires understanding which part of the cycle broke first.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s MVCC model keeps old row versions in the heap rather than updating in place. Autovacuum’s job is to reclaim those dead tuples and to freeze old row versions so that transaction ID age stays well short of wraparound. Under moderate write load, autovacuum usually runs unnoticed. Under high write volume, such as bulk loads, frequent deletes, or update-heavy workloads, it can fall behind.&lt;/p&gt;
&lt;p&gt;When autovacuum falls behind, the visible effects are growing table size on disk, sequential scans replacing index scans as bloat and stale statistics skew the planner’s cost estimates, and latency variance in queries that used to run in single-digit milliseconds. The less visible effect is &lt;code&gt;age(relfrozenxid)&lt;/code&gt; creeping toward the wraparound limit of roughly 2 billion transactions. Shortly before that limit, PostgreSQL stops assigning new transaction IDs, which blocks writes until the affected tables have been vacuumed and frozen.&lt;/p&gt;
&lt;p&gt;The root cause is almost never “autovacuum is broken.” It is almost always one of three things: a long-running transaction blocking vacuum from removing dead tuples, the &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; threshold being too coarse for a large table, or &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt; throttling throughput below what the write rate demands.&lt;/p&gt;
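&lt;p&gt;The scale-factor cause can be checked with arithmetic. Autovacuum vacuums a table once its dead tuples exceed &lt;code&gt;autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples&lt;/code&gt;. With the defaults of 50 and 0.2, a 100-million-row table needs about 20 million dead tuples before autovacuum starts. The sketch below computes that trigger point from the global settings only; it deliberately ignores per-table overrides in &lt;code&gt;reloptions&lt;/code&gt;, so treat the result as an approximation for tables that have them.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Dead tuples vs. the point at which autovacuum would trigger (global settings only)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  greatest(c.reltuples, 0)::bigint AS est_rows,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  (current_setting(&apos;autovacuum_vacuum_threshold&apos;)::float8&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    + current_setting(&apos;autovacuum_vacuum_scale_factor&apos;)::float8&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      * greatest(c.reltuples, 0))::bigint AS vacuum_trigger&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_class c ON c.oid = s.relid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY s.n_dead_tup DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A large table whose &lt;code&gt;n_dead_tup&lt;/code&gt; sits in the millions but below &lt;code&gt;vacuum_trigger&lt;/code&gt; is not blocked; it simply has not qualified for a vacuum yet, which points at the scale factor and not at a blocker.&lt;/p&gt;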
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;n_dead_tup&lt;/code&gt; rising continuously&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Vacuum not keeping up with write rate&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Table size growing without row count growth&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_size_pretty(pg_total_relation_size(...))&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Physical bloat accumulating in heap&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sequential scans replacing index scans&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables.seq_scan&lt;/code&gt; increasing&lt;/td&gt;&lt;td&gt;Planner estimates degrading due to bloat&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;age(datfrozenxid)&lt;/code&gt; &gt; 1.5 billion&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_database&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Transaction ID wraparound risk is real&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Last autovacuum timestamp hours or days stale&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables.last_autovacuum&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Vacuum is being blocked or never triggered&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Long-lived idle-in-transaction sessions&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Blocking vacuum horizon advancement&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Dead tuple accumulation by table&lt;/strong&gt; — find which tables are most behind:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  schemaname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_dead_tup::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; nullif&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; dead_pct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autovacuum,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autoanalyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;High &lt;code&gt;dead_pct&lt;/code&gt; on a large table tells you where to focus. A &lt;code&gt;last_autovacuum&lt;/code&gt; that is hours old on a high-write table means the trigger threshold was never crossed or vacuum was blocked.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Active blocking transactions&lt;/strong&gt; — long-running transactions prevent vacuum from advancing the horizon:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wait_event_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wait_event,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_duration,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  left&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(query, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;80&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_preview&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; !=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_duration &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any session with &lt;code&gt;xact_duration&lt;/code&gt; over 10 minutes that is &lt;code&gt;idle in transaction&lt;/code&gt; is a primary vacuum-blocker candidate. PostgreSQL cannot remove dead tuples older than the oldest open transaction’s snapshot.&lt;/p&gt;
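&lt;p&gt;To narrow the result to the sessions that matter most, the same view can be filtered to idle-in-transaction sessions past the 10-minute mark. This is a sketch: the 10-minute cutoff mirrors the guidance above, and the termination statement is left commented out because ending a session rolls back its open transaction and should be a deliberate decision.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Idle-in-transaction sessions whose transaction has been open for more than 10 minutes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - xact_start   AS xact_duration,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  now() - state_change AS idle_for&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE state = &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND now() - xact_start &gt; interval &apos;10 minutes&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY xact_start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After confirming with the owning team, end a specific session by its pid:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- SELECT pg_terminate_backend(12345);  -- 12345 is a placeholder pid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sorting by &lt;code&gt;xact_start&lt;/code&gt; puts the oldest transaction first, which is the one holding the vacuum horizon back the furthest.&lt;/p&gt;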
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Transaction ID wraparound risk&lt;/strong&gt; — check how close each database is to the 2-billion limit:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  datname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  age(datfrozenxid) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xid_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  2000000000&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; -&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; age(datfrozenxid) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xid_remaining&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_database&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; age(datfrozenxid) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
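&lt;p&gt;Autovacuum starts forced anti-wraparound vacuums once a table’s age passes &lt;code&gt;autovacuum_freeze_max_age&lt;/code&gt; (200 million by default), so a healthy database rarely climbs far beyond that. PostgreSQL itself only starts emitting WARNINGs when roughly 40 million transaction IDs remain before the wraparound point near 2.1 billion, and it refuses to assign new transaction IDs once about 3 million remain (the exact margins differ in older releases). By then there is little room to react, so set operational thresholds much earlier: any value above 1 billion warrants attention, and above 1.5 billion, treat it as an incident in progress.&lt;/p&gt;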
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Current autovacuum scale factor&lt;/strong&gt; — determine whether the threshold is too coarse:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW autovacuum_vacuum_scale_factor;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Also check per-table overrides:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relname, reloptions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_class&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; reloptions &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;IS NOT NULL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relkind &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;r&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The default &lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; means autovacuum triggers once dead tuples exceed roughly 20% of the table’s estimated row count (plus a fixed base of 50 rows). On a 100-million-row table, that is about 20 million dead tuples before vacuum runs, which leaves a substantial amount of dead space in both the heap and its indexes before the first cleanup pass even starts.&lt;/p&gt;
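&lt;p&gt;The trigger point can be estimated per table from the same inputs autovacuum uses. The query below is a sketch that applies only the global settings; it assumes no per-table overrides in &lt;code&gt;reloptions&lt;/code&gt;, so tables that have overrides will show a misleading value:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  s.n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;  -- global threshold + global scale factor * estimated row count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  current_setting(&apos;autovacuum_vacuum_threshold&apos;)::bigint&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    + current_setting(&apos;autovacuum_vacuum_scale_factor&apos;)::float8&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;      * greatest(c.reltuples, 0) AS global_trigger_point&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_class c ON c.oid = s.relid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY s.n_dead_tup DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A table whose &lt;code&gt;n_dead_tup&lt;/code&gt; sits far below &lt;code&gt;global_trigger_point&lt;/code&gt; while still being large in absolute terms is a likely candidate for a per-table override.&lt;/p&gt;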
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Background writer and checkpoint pressure&lt;/strong&gt; — determine if I/O is the bottleneck:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  checkpoints_timed,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  checkpoints_req,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  checkpoint_write_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  checkpoint_sync_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  buffers_clean,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  maxwritten_clean,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  buffers_backend&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;High &lt;code&gt;maxwritten_clean&lt;/code&gt; means the background writer repeatedly stopped a cleaning round because it hit its &lt;code&gt;bgwriter_lru_maxpages&lt;/code&gt; limit. High &lt;code&gt;buffers_backend&lt;/code&gt; means backends are writing out dirty buffers themselves, a sign that the background writer and checkpointer are not keeping up and that vacuum is competing for already constrained I/O.&lt;/p&gt;
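&lt;p&gt;The query above matches the &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; layout in PostgreSQL 16 and earlier. In PostgreSQL 17 the checkpoint counters moved to a separate &lt;code&gt;pg_stat_checkpointer&lt;/code&gt; view and &lt;code&gt;buffers_backend&lt;/code&gt; was dropped from &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; (backend writes are reported through &lt;code&gt;pg_stat_io&lt;/code&gt; instead). A rough equivalent for newer servers, to be checked against the documentation for the version in use:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL 17+: checkpoint counters&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT num_timed, num_requested, write_time, sync_time, buffers_written&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_checkpointer;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL 17+: background writer counters&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT buffers_clean, maxwritten_clean, buffers_alloc&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;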
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[n_dead_tup growing] --&gt; B{last_autovacuum recent?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no — never triggered| C{autovacuum=on globally?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|no| D[Enable autovacuum in postgresql.conf]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|yes| E{scale_factor too high?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| F[Lower per-table scale_factor]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes — vacuum ran but did not help| G{oldest xact blocking vacuum?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|yes| H{safe to terminate?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| I[pg_terminate_backend — then VACUUM]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| J[Wait for transaction — then VACUUM]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|no| K{cost_delay throttling?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|yes| L[Reduce cost_delay per-table]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|no| M{xid_age above 1.5B?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    M --&gt;|yes| N[VACUUM FREEZE — emergency]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    M --&gt;|no| O[Manual VACUUM VERBOSE — diagnose output]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Manual VACUUM to clear immediate bloat&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Run a manual &lt;code&gt;VACUUM VERBOSE&lt;/code&gt; to force reclamation and get diagnostic output:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;VACUUM &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VERBOSE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The verbose output shows how many dead tuples were removed, how many pages were scanned, and whether any tuples could not be removed because of the transaction horizon. If the output reports dead row versions that cannot be removed yet (worded as “dead but not yet removable” in recent releases), an old snapshot, most often a blocking transaction, is the problem, not the configuration.&lt;/p&gt;
&lt;p&gt;For wraparound risk specifically, add &lt;code&gt;FREEZE&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;VACUUM FREEZE tablename;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
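&lt;p&gt;&lt;code&gt;FREEZE&lt;/code&gt; forces an aggressive scan that freezes every eligible tuple and advances the table’s &lt;code&gt;relfrozenxid&lt;/code&gt;. &lt;code&gt;age(datfrozenxid)&lt;/code&gt; only drops once the tables with the oldest &lt;code&gt;relfrozenxid&lt;/code&gt; in that database have been advanced, so start with those. Freezing is I/O-intensive on large tables, so run it during off-peak hours when possible.&lt;/p&gt;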
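&lt;p&gt;To decide which tables to freeze first, a query along these lines lists the relations with the oldest &lt;code&gt;relfrozenxid&lt;/code&gt; in the current database. It is a sketch: the 20-row limit is arbitrary, and it has to be run separately in each database of interest:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  c.oid::regclass AS table_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  age(c.relfrozenxid) AS xid_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pg_size_pretty(pg_table_size(c.oid)) AS table_size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_class c&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- ordinary tables, materialized views, and TOAST tables carry a relfrozenxid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE c.relkind IN (&apos;r&apos;, &apos;m&apos;, &apos;t&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY age(c.relfrozenxid) DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;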
&lt;p&gt;&lt;strong&gt;Option 2 — Tune per-table autovacuum thresholds&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For high-write tables where the global &lt;code&gt;scale_factor&lt;/code&gt; is too coarse, override at the table level:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; high_write_table &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_scale_factor &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;01&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_threshold &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_cost_delay &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_cost_limit &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 400&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;scale_factor = 0.01&lt;/code&gt; triggers autovacuum after about 1% dead tuples instead of 20%. The effect of the cost settings depends on the server version: on PostgreSQL 12 and later the default &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt; is already 2ms, so raising &lt;code&gt;cost_limit&lt;/code&gt; from the default 200 to 400 doubles autovacuum’s I/O budget for this table; on PostgreSQL 11 and earlier, where the default delay is 20ms, the two changes together raise it by roughly 20 times. These are per-table and do not affect global behavior.&lt;/p&gt;
&lt;p&gt;To verify the override is active:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relname, reloptions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_class&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;high_write_table&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 3 — Terminate blocking long-running transactions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If &lt;code&gt;pg_stat_activity&lt;/code&gt; shows a session that has been &lt;code&gt;idle in transaction&lt;/code&gt; for an extended period and it cannot be resolved through application-layer means, terminate it:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_terminate_backend(pid)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;10 minutes&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
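&lt;p&gt;The statement above terminates every matching session in one pass, so it may be worth previewing the candidates first. A minimal sketch using the same filter; the 80-character truncation of the query text is an arbitrary choice:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Dry run: list the sessions the terminate statement would hit&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid, usename, application_name, xact_start, backend_xmin,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       left(query, 80) AS last_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE state = &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AND now() - xact_start &gt; interval &apos;10 minutes&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY xact_start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The session with the earliest &lt;code&gt;xact_start&lt;/code&gt; is usually the one holding the horizon, so terminating that single &lt;code&gt;pid&lt;/code&gt; is often enough.&lt;/p&gt;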
&lt;p&gt;After terminating, run &lt;code&gt;VACUUM VERBOSE&lt;/code&gt; on the affected table immediately to reclaim the dead tuples that were being held.&lt;/p&gt;
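&lt;p&gt;To prevent recurrence, set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; cluster-wide (in &lt;code&gt;postgresql.conf&lt;/code&gt; or with &lt;code&gt;ALTER SYSTEM&lt;/code&gt;, as here) or per role:&lt;/p&gt;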
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SYSTEM&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idle_in_transaction_session_timeout &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;5min&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_reload_conf();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
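&lt;p&gt;The cluster-wide setting applies to every role, including batch jobs and migration tools that may legitimately hold a transaction open for longer. A narrower alternative is to set the timeout per role. In this sketch &lt;code&gt;app_user&lt;/code&gt; is a hypothetical role name, and the setting applies to sessions that role opens after the change:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- app_user is a placeholder; substitute the application&apos;s login role&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE app_user SET idle_in_transaction_session_timeout = &apos;5min&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;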
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;VACUUM&lt;/code&gt; and &lt;code&gt;VACUUM FREEZE&lt;/code&gt; need no rollback. They take a &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;/code&gt; lock, which does not block reads or writes but does conflict with schema changes and with another vacuum on the same table. They can be cancelled at any time without data risk.&lt;/li&gt;
&lt;li&gt;Per-table &lt;code&gt;autovacuum_*&lt;/code&gt; overrides via &lt;code&gt;ALTER TABLE ... SET (...)&lt;/code&gt; are immediately active and immediately reversible: &lt;code&gt;ALTER TABLE tablename RESET (autovacuum_vacuum_scale_factor)&lt;/code&gt; returns to the global default.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pg_terminate_backend&lt;/code&gt; ends the target session and rolls back its open transaction. It cannot be undone: the application will see a connection error and must reconnect and retry. This is the most disruptive remediation and should only be used when the blocking duration justifies it.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; changes take effect for new transactions immediately after &lt;code&gt;pg_reload_conf()&lt;/code&gt;. Existing connections are not affected until they start a new transaction.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;The most impactful automation is a scheduled query that surfaces tables where &lt;code&gt;n_dead_tup&lt;/code&gt; exceeds a threshold before vacuum falls far enough behind to cause bloat. Using &lt;code&gt;pg_cron&lt;/code&gt; (if installed):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Run every hour; log tables where dead_pct &gt; 10%&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; cron&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schedule&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;vacuum-watch&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;0 * * * *&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; ops&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;vacuum_alerts&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (tablename, n_dead_tup, dead_pct, captured_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;    round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_dead_tup::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; nullif&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_dead_tup::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;numeric&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; nullif&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;$$);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
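&lt;p&gt;The scheduled job assumes that an &lt;code&gt;ops.vacuum_alerts&lt;/code&gt; table already exists. One possible definition is sketched below; the column types are assumptions chosen to match the &lt;code&gt;INSERT&lt;/code&gt; above, not a required layout:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical target table for the vacuum-watch job&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE SCHEMA IF NOT EXISTS ops;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE IF NOT EXISTS ops.vacuum_alerts (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  tablename   text        NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup  bigint      NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  dead_pct    numeric,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  captured_at timestamptz NOT NULL DEFAULT now()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because &lt;code&gt;pg_stat_user_tables&lt;/code&gt; only covers the database a session is connected to, the job reports on the database where &lt;code&gt;pg_cron&lt;/code&gt; runs it (by default the one named in &lt;code&gt;cron.database_name&lt;/code&gt;), and the table should be created there.&lt;/p&gt;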
&lt;p&gt;Separately, a daily alert on &lt;code&gt;age(datfrozenxid)&lt;/code&gt; crossing 500 million gives operational lead time well before the 1.5-billion incident threshold used above.&lt;/p&gt;
&lt;p&gt;For the deeper argument on why autovacuum should be treated as a capacity planning problem rather than a maintenance task, see &lt;a href=&quot;https://rajivonai.com/blog/2025-09-13-autovacuum-is-a-capacity-problem-not-a-maintenance-task/&quot;&gt;Autovacuum Is a Capacity Problem, Not a Maintenance Task&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The foundation of what autovacuum is doing and why its defaults are sized the way they are is covered in &lt;a href=&quot;https://rajivonai.com/blog/2022-04-11-postgresql-autovacuum-what-every-engineer-should-know/&quot;&gt;PostgreSQL Autovacuum: What Every Engineer Should Know&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s autovacuum documentation describes the trigger formula directly: a table is eligible for autovacuum when &lt;code&gt;n_dead_tup &gt; autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * pg_class.reltuples&lt;/code&gt;. Because the second term scales with table size, the default &lt;code&gt;scale_factor&lt;/code&gt; of 0.2 works reasonably for tables of up to a few million rows and poorly beyond that. For tables with tens or hundreds of millions of rows, a common recommendation is to lower &lt;code&gt;scale_factor&lt;/code&gt; to 0.01 or even 0.001 and set &lt;code&gt;autovacuum_vacuum_threshold&lt;/code&gt; to a fixed row count.&lt;/p&gt;
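&lt;p&gt;As a worked example of that formula, assume a table with 100 million rows. With the stock settings (threshold 50, scale factor 0.2) the trigger sits at about 20 million dead tuples; with the per-table values used earlier in this article (threshold 1000, scale factor 0.01) it drops to about 1 million. The row count is an illustrative assumption:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- threshold + scale_factor * reltuples, for reltuples = 100,000,000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  50   + 0.2  * 100000000 AS default_trigger,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  1000 + 0.01 * 100000000 AS tuned_trigger;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;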
&lt;p&gt;The PostgreSQL MVCC documentation explains why vacuum cannot remove a dead tuple that is still visible to any open transaction. This is not a bug; it is a consequence of snapshot isolation. The oldest running transaction’s &lt;code&gt;xmin&lt;/code&gt; forms the vacuum horizon, and tuples that became dead after that horizon cannot be reclaimed regardless of how aggressively autovacuum is configured.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Trigger&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Vacuum makes no progress despite running&lt;/td&gt;&lt;td&gt;Long-running transaction holds vacuum horizon&lt;/td&gt;&lt;td&gt;Terminate the blocking session; set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Autovacuum rarely triggers on a large table&lt;/td&gt;&lt;td&gt;&lt;code&gt;scale_factor&lt;/code&gt; too high; the trigger point grows with table size&lt;/td&gt;&lt;td&gt;Lower &lt;code&gt;scale_factor&lt;/code&gt; to 0.01 per-table&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;VACUUM FREEZE&lt;/code&gt; takes hours and saturates I/O&lt;/td&gt;&lt;td&gt;Emergency freeze on a table with billions of rows&lt;/td&gt;&lt;td&gt;Run during a maintenance window; freeze one partition at a time if the table is partitioned&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;cost_delay&lt;/code&gt; throttles vacuum below write rate&lt;/td&gt;&lt;td&gt;Cost-based throttling (a 20ms default delay on PostgreSQL 11 and earlier, or a low &lt;code&gt;cost_limit&lt;/code&gt;) keeps vacuum slower than dead tuples are created&lt;/td&gt;&lt;td&gt;Lower &lt;code&gt;cost_delay&lt;/code&gt; to 2ms and raise &lt;code&gt;cost_limit&lt;/code&gt; to 400 per-table&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Manual vacuum finishes quickly but removes nothing&lt;/td&gt;&lt;td&gt;An old &lt;code&gt;backend_xmin&lt;/code&gt; in &lt;code&gt;pg_stat_activity&lt;/code&gt; is holding the horizon&lt;/td&gt;&lt;td&gt;Wait for long transaction to close, then re-run vacuum&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Autovacuum falling behind grows bloat silently until queries slow, and eventually creates transaction ID wraparound risk that can force an emergency database shutdown.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Tune per-table &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; and &lt;code&gt;cost_delay&lt;/code&gt; for high-write tables, and set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; to prevent long transactions from blocking the vacuum horizon.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After applying per-table overrides, &lt;code&gt;last_autovacuum&lt;/code&gt; timestamps on affected tables should refresh within minutes, and &lt;code&gt;n_dead_tup&lt;/code&gt; should stabilize rather than grow between checks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run the dead tuple query from Check 1 this week against your production database. If any table has &lt;code&gt;dead_pct &gt; 10%&lt;/code&gt; and a &lt;code&gt;last_autovacuum&lt;/code&gt; older than an hour, that table needs a per-table threshold override today.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Query &lt;code&gt;pg_stat_user_tables&lt;/code&gt; to identify tables with high &lt;code&gt;n_dead_tup&lt;/code&gt; and stale &lt;code&gt;last_autovacuum&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;pg_stat_activity&lt;/code&gt; for sessions in &lt;code&gt;idle in transaction&lt;/code&gt; state longer than 5 minutes&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;age(datfrozenxid)&lt;/code&gt; in &lt;code&gt;pg_database&lt;/code&gt; — alert if any value exceeds 500 million&lt;/li&gt;
&lt;li&gt;Verify &lt;code&gt;autovacuum = on&lt;/code&gt; is set globally in &lt;code&gt;postgresql.conf&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Check per-table &lt;code&gt;reloptions&lt;/code&gt; for existing autovacuum overrides on affected tables&lt;/li&gt;
&lt;li&gt;If no blocking transaction: run &lt;code&gt;VACUUM VERBOSE tablename&lt;/code&gt; and inspect output for horizon messages&lt;/li&gt;
&lt;li&gt;Apply per-table &lt;code&gt;autovacuum_vacuum_scale_factor = 0.01&lt;/code&gt; to any table with &gt; 10 million rows&lt;/li&gt;
&lt;li&gt;Apply per-table &lt;code&gt;autovacuum_vacuum_cost_delay = 2&lt;/code&gt; for high-write tables&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;xid_age &gt; 1.5 billion&lt;/code&gt;: run an emergency &lt;code&gt;VACUUM FREEZE&lt;/code&gt; immediately&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;idle_in_transaction_session_timeout = &apos;5min&apos;&lt;/code&gt; in &lt;code&gt;postgresql.conf&lt;/code&gt; to prevent recurrence&lt;/li&gt;
&lt;li&gt;Apply configuration changes with &lt;code&gt;pg_reload_conf()&lt;/code&gt; and re-check &lt;code&gt;pg_stat_user_tables&lt;/code&gt; after 15 minutes&lt;/li&gt;
&lt;li&gt;Add a monitoring alert for &lt;code&gt;n_dead_tup / (n_live_tup + n_dead_tup) &gt; 0.1&lt;/code&gt; on your largest tables&lt;/li&gt;
&lt;/ol&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>PostgreSQL Statistics: Why the Optimizer Gets It Wrong</title><link>https://rajivonai.com/blog/2023-01-09-postgresql-statistics-why-the-optimizer-gets-it-wrong/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-01-09-postgresql-statistics-why-the-optimizer-gets-it-wrong/</guid><description>PostgreSQL&apos;s query planner depends entirely on per-column statistics that go stale after bulk loads — here is what that means for query plan quality and how to fix it.</description><pubDate>Mon, 09 Jan 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The PostgreSQL query planner does not look at your data. It looks at statistics about your data — histograms, most-common values, null fractions, and row count estimates stored in &lt;code&gt;pg_statistic&lt;/code&gt;. When those statistics are stale, the planner makes wrong decisions: it picks sequential scans over index scans, chooses nested loops over hash joins, and estimates 100 rows for a query that will return 10 million.&lt;/strong&gt; This is not a bug. It is an expected consequence of how cost-based optimization works, and it is entirely under operator control.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL builds query plans by estimating the cost of each possible execution path. Cost estimates depend on row count estimates, and row count estimates come from statistics. The statistics are not computed continuously — they are snapshots taken by &lt;code&gt;ANALYZE&lt;/code&gt; (or automatically by autovacuum’s analyze pass).&lt;/p&gt;
&lt;p&gt;Engineers typically encounter statistics problems in two situations. The first is after a bulk data load: a table that had 10,000 rows now has 10 million, but the planner still thinks it has 10,000 because &lt;code&gt;ANALYZE&lt;/code&gt; has not run since the load. The second is on tables with highly skewed distributions — a few values account for most rows, but the planner’s histogram does not have enough resolution to represent that accurately.&lt;/p&gt;
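&lt;p&gt;The first situation is straightforward to detect. A minimal sketch that lists the tables with the most row changes since their statistics were last refreshed; the 20-row limit is arbitrary:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- n_mod_since_analyze: estimated rows modified since the last ANALYZE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT relname, n_live_tup, n_mod_since_analyze, last_analyze, last_autoanalyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY n_mod_since_analyze DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A table where &lt;code&gt;n_mod_since_analyze&lt;/code&gt; is large relative to &lt;code&gt;n_live_tup&lt;/code&gt;, or where both timestamps predate a known bulk load, is a candidate for a manual &lt;code&gt;ANALYZE&lt;/code&gt;.&lt;/p&gt;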
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;PostgreSQL stores column statistics in &lt;code&gt;pg_statistic&lt;/code&gt;, exposed through the human-readable view &lt;code&gt;pg_stats&lt;/code&gt;. The key columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;most_common_vals&lt;/code&gt; — the N most frequent values and their frequencies (&lt;code&gt;most_common_freqs&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;histogram_bounds&lt;/code&gt; — bucket boundaries dividing the non-MCV value range into equal-frequency slices&lt;/li&gt;
&lt;li&gt;&lt;code&gt;null_frac&lt;/code&gt; — fraction of rows that are NULL&lt;/li&gt;
&lt;li&gt;&lt;code&gt;correlation&lt;/code&gt; — how well physical row order matches logical sort order (1.0 = perfectly sorted; near 0 = random)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The planner combines these to estimate how many rows will pass a given filter condition. When the statistics are accurate, estimates are close to reality. When they are stale, the estimates can be off by orders of magnitude.&lt;/p&gt;
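&lt;p&gt;As a rough illustration of how these columns become a row estimate, the sketch below mimics a simplified equality filter: a value found in &lt;code&gt;most_common_vals&lt;/code&gt; uses its recorded frequency, and any other value gets an even share of what is left. The function name and the sample statistics are assumptions made up for this example, and the real estimator covers many more cases.&lt;/p&gt;

```python
def estimate_equality_rows(value, reltuples, null_frac, mcv_vals, mcv_freqs, n_distinct):
    """Simplified equality-filter estimate in the spirit of PostgreSQL's planner."""
    if value in mcv_vals:
        # Value is a most-common value: use its recorded frequency directly.
        selectivity = mcv_freqs[mcv_vals.index(value)]
    else:
        # Otherwise spread the remaining non-NULL fraction evenly over the
        # distinct values that are not in the MCV list.
        remaining_fraction = 1.0 - null_frac - sum(mcv_freqs)
        remaining_distinct = max(n_distinct - len(mcv_vals), 1)
        selectivity = remaining_fraction / remaining_distinct
    return round(reltuples * selectivity)


# Hypothetical statistics for orders.status (not taken from a real system).
stats = dict(
    reltuples=1_000_000,
    null_frac=0.0,
    mcv_vals=["shipped", "pending"],
    mcv_freqs=[0.90, 0.08],
    n_distinct=6,
)
print(estimate_equality_rows("shipped", **stats))   # 900000
print(estimate_equality_rows("refunded", **stats))  # 5000
```

&lt;p&gt;If &lt;code&gt;reltuples&lt;/code&gt; or the frequencies are stale, every estimate derived from them is scaled by the same error, which is how a small staleness turns into a wrong plan.&lt;/p&gt;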
&lt;p&gt;A common failure mode, consistent with how PostgreSQL’s planner documentation describes row estimation: after a bulk insert of 10 million rows into a table whose last &lt;code&gt;ANALYZE&lt;/code&gt; ran when the table had 1,000 rows, the &lt;code&gt;reltuples&lt;/code&gt; estimate in &lt;code&gt;pg_class&lt;/code&gt; will still read approximately 1,000 until the next &lt;code&gt;ANALYZE&lt;/code&gt; or vacuum. A query with &lt;code&gt;WHERE id = $1&lt;/code&gt; on the now-large table may get a sequential scan plan, because the planner believes the table is small and the index overhead is not worth it.&lt;/p&gt;
&lt;p&gt;The core question: which statistics settings should you tune, and when should you manually trigger &lt;code&gt;ANALYZE&lt;/code&gt;?&lt;/p&gt;
&lt;h2 id=&quot;how-statistics-collection-works&quot;&gt;How Statistics Collection Works&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;default_statistics_target&lt;/code&gt; controls how much detail is collected per column. The default is 100, meaning PostgreSQL tracks up to 100 most common values and up to 100 histogram buckets per column. The valid range is 1 to 10,000.&lt;/p&gt;
&lt;p&gt;Increasing &lt;code&gt;default_statistics_target&lt;/code&gt; makes &lt;code&gt;ANALYZE&lt;/code&gt; slower and the statistics larger, but improves estimate accuracy for skewed distributions. For most tables, the default is fine. For columns used in highly selective filters — especially foreign keys, status columns with many distinct values, or columns where the top 100 values do not capture the actual distribution — increasing the target at the column level is the right lever:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATISTICS&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 500&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can observe what the planner currently knows about a column:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  attname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_distinct,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  most_common_vals,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  most_common_freqs,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  histogram_bounds&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; attname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;status&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;n_distinct&lt;/code&gt; tells you how many distinct values PostgreSQL believes exist. A positive value is a raw count. A negative value is a ratio of the row count: -1 means every row is distinct (typical for primary keys), and -0.5 means the planner estimates the number of distinct values at 50% of the row count. If this number looks wrong, the statistics are probably stale.&lt;/p&gt;
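&lt;p&gt;The sign convention is easy to misread, so here is a small sketch that converts an &lt;code&gt;n_distinct&lt;/code&gt; value into an absolute count. The helper name and the 2-million-row table are assumptions for illustration only.&lt;/p&gt;

```python
def estimated_distinct_values(n_distinct, reltuples):
    """Turn pg_stats.n_distinct into an absolute count of distinct values."""
    if n_distinct >= 0:
        # Zero or positive: already an absolute count.
        return round(n_distinct)
    # Negative: the magnitude is a fraction of the table's row count.
    return round(-n_distinct * reltuples)


rows = 2_000_000  # hypothetical table size
print(estimated_distinct_values(-1.0, rows))  # 2000000, every row distinct
print(estimated_distinct_values(-0.5, rows))  # 1000000, each value appears about twice
print(estimated_distinct_values(4, rows))     # 4, a small fixed set of values
```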
&lt;p&gt;After a bulk load, always run &lt;code&gt;ANALYZE&lt;/code&gt; explicitly before the new data receives production query traffic:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;           &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- whole table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);  &lt;/span&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- specific column only&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Autovacuum’s analyze pass uses &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; (default: 0.1) and &lt;code&gt;autovacuum_analyze_threshold&lt;/code&gt; (default: 50). It has the same structural problem as the vacuum thresholds: on a 50-million row table, autovacuum will not trigger &lt;code&gt;ANALYZE&lt;/code&gt; until roughly 5 million rows have changed. For large bulk loads, waiting for autovacuum is not safe.&lt;/p&gt;
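&lt;p&gt;The trigger point is simple arithmetic: the base threshold plus the scale factor times the table’s row count. The sketch below works it through for a few table sizes using PostgreSQL’s documented defaults of 0.1 and 50; the function name and the per-table override of 0.01 are assumptions chosen for illustration.&lt;/p&gt;

```python
def analyze_trigger_rows(reltuples, scale_factor=0.1, threshold=50):
    """Rows that must change before autovacuum schedules an ANALYZE."""
    return round(threshold + scale_factor * reltuples)


for rows in (10_000, 1_000_000, 50_000_000):
    print(rows, analyze_trigger_rows(rows))

# A hypothetical per-table override tightens the trigger considerably.
print(analyze_trigger_rows(50_000_000, scale_factor=0.01))
```

&lt;p&gt;The fixed percentage is the issue: the larger the table, the more rows can change before the planner’s view of it is refreshed.&lt;/p&gt;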
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s query planner documentation (postgresql.org/docs/current/planner-stats.html) describes exactly how the planner uses &lt;code&gt;pg_statistic&lt;/code&gt; data: selectivity estimator functions read the statistics to produce row count estimates, and the planner chooses the lowest-cost plan based on those estimates combined with &lt;code&gt;seq_page_cost&lt;/code&gt;, &lt;code&gt;random_page_cost&lt;/code&gt;, and table and index size from &lt;code&gt;pg_class&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The correlation value in &lt;code&gt;pg_stats&lt;/code&gt; is particularly actionable: if &lt;code&gt;correlation&lt;/code&gt; for an indexed column is near 1.0 (data is physically sorted by that column), the planner will heavily favor index scans because random I/O effectively becomes sequential. If correlation is near 0 (random physical order), the planner may correctly prefer a sequential scan even for a highly selective query on a large table, because fetching scattered heap pages costs more than scanning the whole table with sequential I/O. Knowing this prevents incorrect index-forcing interventions.&lt;/p&gt;
&lt;p&gt;PostgreSQL’s extended statistics documentation describes how &lt;code&gt;CREATE STATISTICS&lt;/code&gt; (available since PostgreSQL 10) lets the planner model correlations between columns, which solves the multi-column selectivity problem that single-column histograms cannot handle. Without it, when a query filters on two correlated columns (e.g., &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;city&lt;/code&gt;), the planner multiplies the single-column selectivities as if they were independent, producing severely underestimated row counts.&lt;/p&gt;
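&lt;p&gt;A worked example makes the size of the underestimate concrete. The numbers below are invented for illustration: a table where 10% of rows match the country filter, 1% match the city filter, and every matching city row also matches the country.&lt;/p&gt;

```python
# Hypothetical table: 1,000,000 rows, 10% have country = 'FR',
# 1% have city = 'Paris', and every Paris row is also an FR row.
total_rows = 1_000_000
sel_country = 0.10
sel_city = 0.01

# Independence assumption: multiply the per-column selectivities.
independent_estimate = round(total_rows * sel_country * sel_city)

# Fully dependent columns: the city filter already implies the country.
actual_rows = round(total_rows * sel_city)

print(independent_estimate)                 # 1000
print(actual_rows)                          # 10000
print(actual_rows // independent_estimate)  # 10 times too low
```

&lt;p&gt;In PostgreSQL, a statement along the lines of &lt;code&gt;CREATE STATISTICS addr_dep (dependencies) ON country, city FROM addresses;&lt;/code&gt; followed by &lt;code&gt;ANALYZE addresses;&lt;/code&gt; gives the planner the information to correct this (the statistics and table names here are hypothetical).&lt;/p&gt;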
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Bulk insert without subsequent ANALYZE&lt;/td&gt;&lt;td&gt;Planner uses row counts from before the load; index scans may be abandoned for sequential scans on newly large tables&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_class.reltuples&lt;/code&gt; is refreshed only by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX, and column statistics only by ANALYZE; autovacuum’s analyze pass may not run for hours&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Correlated columns with single-column statistics&lt;/td&gt;&lt;td&gt;Row estimates for multi-column filters are too low; wrong join strategy chosen&lt;/td&gt;&lt;td&gt;Planner multiplies per-column selectivities independently, ignoring correlation between columns&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Partial index with no matching statistics&lt;/td&gt;&lt;td&gt;Planner cannot use the partial index’s selectivity correctly when the WHERE clause of the query partially matches the index predicate&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stats&lt;/code&gt; does not store per-partial-index statistics; planner falls back to whole-table estimates&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Stale statistics after bulk loads cause the planner to choose wrong execution plans — sequential scans where index scans are needed, or nested loops where hash joins would be correct.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Run &lt;code&gt;ANALYZE&lt;/code&gt; explicitly after every bulk load, reduce &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; on large tables, and raise the statistics target (per column with &lt;code&gt;SET STATISTICS&lt;/code&gt;, or globally with &lt;code&gt;default_statistics_target&lt;/code&gt;) on highly selective or skewed columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Use &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; before and after &lt;code&gt;ANALYZE&lt;/code&gt; on a query affected by a bulk load — the estimated row counts in the plan should converge toward actual row counts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;SELECT relname, last_analyze, last_autoanalyze, n_live_tup FROM pg_stat_user_tables ORDER BY last_analyze ASC NULLS FIRST LIMIT 20;&lt;/code&gt; and identify tables where statistics are old relative to write volume.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>failures</category></item><item><title>Backups Are Not Recovery: The DBA Rule Everyone Learns Late</title><link>https://rajivonai.com/blog/2022-11-14-backups-are-not-recovery/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-11-14-backups-are-not-recovery/</guid><description>A backup file proves you captured data. Recovery is the process of producing a running, consistent database on a different system inside your RTO. They are not the same thing, and confusing them is how incidents get worse.</description><pubDate>Mon, 14 Nov 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A backup file is not proof of recoverability. It is proof that data was written to storage at a point in time. Recovery is the separate process of taking that file and producing a running, consistent database on a different system within your RTO. Engineers who conflate the two discover the gap during an actual incident — the worst possible time to find it.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most teams running production databases configure some form of backup. Nightly &lt;code&gt;pg_dump&lt;/code&gt; jobs, Aurora snapshots, &lt;code&gt;xtrabackup&lt;/code&gt; runs around low-traffic windows — the mechanics are straightforward. Monitoring confirms the job completed without error.&lt;/p&gt;
&lt;p&gt;That confirmation covers one half of the contract. It says data left the system. It says nothing about restore time, or whether WAL segments and encryption keys are available in the same failure scenario that just took down the primary.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The documented failure mode: a team runs nightly &lt;code&gt;pg_dump&lt;/code&gt;, stores output to S3, and considers their backup strategy complete. During a corruption event, they initiate a restore and discover that &lt;code&gt;pg_dump&lt;/code&gt; replays every row as SQL against a cold instance — on a large database, hours of work. With no WAL archives stored, there is no PITR capability either.&lt;/p&gt;
&lt;p&gt;The backup was real. The recovery was not viable within their RTO.&lt;/p&gt;
&lt;p&gt;The question every team must answer before an incident: have you timed a full restore on target hardware, and does that number fit inside your recovery time objective?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;RPO and RTO are different constraints governed by different mechanics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RPO (Recovery Point Objective)&lt;/strong&gt; is how much data loss is acceptable. A nightly backup gives an RPO of up to 24 hours. An RPO of minutes requires continuous WAL archiving (PostgreSQL) or binary log shipping (MySQL). Aurora documents this explicitly — PITR to any second within the retention window is only possible because Aurora streams redo logs continuously, not because snapshots run frequently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RTO (Recovery Time Objective)&lt;/strong&gt; is how long you can be down. It is determined by restore speed, not backup frequency.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Primary Database] --&gt;|Writes data| B[Base Backup]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt;|Streams changes| C[WAL Archive]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; D[Disaster Recovery Target]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|Replays until PITR| D&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E[Recovered Database]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Backup type&lt;/th&gt;&lt;th&gt;Restore speed&lt;/th&gt;&lt;th&gt;PITR capable&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Logical — &lt;code&gt;pg_dump&lt;/code&gt;, &lt;code&gt;mysqldump&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Slow — replays SQL row by row&lt;/td&gt;&lt;td&gt;No, without WAL or binlog archiving&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Physical — &lt;code&gt;pg_basebackup&lt;/code&gt;, &lt;code&gt;xtrabackup&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Fast — copies raw data files&lt;/td&gt;&lt;td&gt;Yes, when WAL or binlog archiving is configured&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cloud snapshot — Aurora, RDS&lt;/td&gt;&lt;td&gt;Fast — clones at storage layer&lt;/td&gt;&lt;td&gt;Yes, when continuous backup is enabled&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;PostgreSQL’s documentation for &lt;code&gt;pg_basebackup&lt;/code&gt; describes its output as a binary copy of the data directory that a new instance can start from directly — bypassing the replay overhead that makes logical restores slow. For large databases, the difference is not marginal.&lt;/p&gt;
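&lt;p&gt;A back-of-the-envelope calculation shows why the backup type dominates RTO. The sketch below divides database size by an effective restore throughput; the size, both throughput figures, and the 4-hour RTO are placeholder assumptions, not measurements, and only a timed drill gives real numbers.&lt;/p&gt;

```python
def restore_hours(db_size_gb, throughput_mb_per_s):
    """Estimated hours to restore db_size_gb at a sustained throughput."""
    seconds = (db_size_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


# Hypothetical figures; replace them with values from your own restore drill.
size_gb = 2_000
logical_mb_s = 30    # assumed rate for replaying SQL and rebuilding indexes
physical_mb_s = 400  # assumed rate for copying raw data files
rto_hours = 4

for label, rate in (("logical", logical_mb_s), ("physical", physical_mb_s)):
    hours = restore_hours(size_gb, rate)
    verdict = "fits RTO" if rto_hours >= hours else "misses RTO"
    print(label, round(hours, 1), verdict)
```

&lt;p&gt;The estimate ignores WAL replay and application cutover, so a measured drill will usually come out longer than this arithmetic suggests.&lt;/p&gt;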
&lt;p&gt;Three additional gaps complete the trap:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Same-region backup storage.&lt;/strong&gt; A regional disruption takes out both the database and the S3 bucket if they share a region. A backup unavailable during the failure it is meant to cover is not a recovery asset.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Logical backup without WAL archiving.&lt;/strong&gt; A &lt;code&gt;pg_dump&lt;/code&gt; taken at 2:00 AM returns you to the 2:00 AM state. If corruption happens at 11:58 PM that day, nearly 22 hours of data are gone. PITR requires WAL archiving in PostgreSQL or binary logging in MySQL, both of which must be enabled explicitly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Encryption key in the failed system.&lt;/strong&gt; If the key lives in the same environment that just failed or was compromised, the backup cannot be decrypted. Key management must be independent of the system being protected.&lt;/p&gt;
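&lt;p&gt;The data-loss window for a dump-only strategy is just the gap between the last backup and the incident. A minimal sketch, using the 2:00 AM dump and 11:58 PM corruption from the example above on an arbitrary, hypothetical date:&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical date; only the times of day come from the example in the text.
last_backup = datetime(2022, 11, 14, 2, 0)    # nightly pg_dump at 2:00 AM
corruption = datetime(2022, 11, 14, 23, 58)   # incident at 11:58 PM the same day

# Without WAL archiving, everything written after the dump is lost.
data_loss_window = corruption - last_backup
print(data_loss_window)  # 21:58:00
```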
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s &lt;code&gt;pg_basebackup&lt;/code&gt; documentation notes that WAL files generated during and after the backup are required for consistency — WAL archiving is the prerequisite for any PITR capability in self-managed PostgreSQL.&lt;/p&gt;
&lt;p&gt;Percona’s XtraBackup documentation describes a hot physical backup that does not block writes. It records the binary log position at the backup’s end — the anchor required for point-in-time recovery in MySQL and MariaDB.&lt;/p&gt;
&lt;p&gt;Amazon Aurora’s PITR documentation states that restores create a new DB cluster, not an in-place restoration. Applications must re-point to the new endpoint after a PITR restore — a step that surprises engineers who have never run the procedure under pressure.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;






























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Untested restore&lt;/td&gt;&lt;td&gt;RTO is unknown until the incident&lt;/td&gt;&lt;td&gt;Restore time was assumed, never measured on comparable hardware&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Same-region backup storage&lt;/td&gt;&lt;td&gt;Backup unavailable during regional failure&lt;/td&gt;&lt;td&gt;S3 bucket and database instance share the same AWS region&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Logical backup without WAL archiving&lt;/td&gt;&lt;td&gt;No PITR capability&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_dump&lt;/code&gt; is a point-in-time snapshot; intermediate recovery requires WAL or binlog&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Encryption key in the same environment&lt;/td&gt;&lt;td&gt;Cannot decrypt backup during recovery&lt;/td&gt;&lt;td&gt;Key management system is part of the failed or compromised system&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: A backup job completing successfully does not mean recovery is possible within your RTO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Treat backup and recovery as separate contracts — configure WAL archiving for PITR, store backups cross-region, and time a full restore on comparable hardware.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A timed restore drill producing a running, queryable database at a point in time before a simulated event, completed inside your documented RTO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, identify your largest production database and determine how long a full restore would take with your current backup type. If you have never timed it, schedule the drill now.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The backup proves data was written somewhere. The only thing that proves recovery is doing it.&lt;/p&gt;</content:encoded><category>databases</category><category>failures</category><category>checklist</category></item><item><title>Redis Memory Eviction Policies Explained</title><link>https://rajivonai.com/blog/2022-10-10-redis-memory-eviction-policies-explained/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-10-10-redis-memory-eviction-policies-explained/</guid><description>Redis has eight eviction policies and a maxmemory limit. The policy you pick determines whether your cache degrades safely or silently corrupts your hit rate under load.</description><pubDate>Mon, 10 Oct 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Redis does not manage memory for you.&lt;/strong&gt; You set a &lt;code&gt;maxmemory&lt;/code&gt; limit, choose an eviction policy, and Redis enforces both mechanically. Skip those settings and Redis will grow until the OS kills it, reject every write when the limit is hit, or silently evict keys you expected to stay cached. That is not a tuning detail — it is the difference between a cache that degrades gracefully and one that breaks applications under load.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;A typical Redis cache deployment sets keys with TTLs, adds a &lt;code&gt;maxmemory&lt;/code&gt; directive, and moves on. The assumption is that Redis will handle the rest.&lt;/p&gt;
&lt;p&gt;Redis exposes eviction policy as an explicit operator decision because different workloads have different requirements for which keys are safe to drop. A session store, a product catalog cache, and a rate-limiter all need different behavior at the eviction boundary. Redis gives you control, but that control requires a deliberate choice.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure modes appear only under sustained write pressure. When &lt;code&gt;maxmemory&lt;/code&gt; is not set, Redis accepts all writes until the host runs out of memory and the OOM killer terminates the process. When &lt;code&gt;noeviction&lt;/code&gt; is set and the limit is reached, Redis returns &lt;code&gt;OOM command not allowed when used memory &gt; &apos;maxmemory&apos;&lt;/code&gt; on every write. When &lt;code&gt;volatile-lru&lt;/code&gt; is configured but no keys have TTLs, Redis cannot find eligible keys and silently falls back to &lt;code&gt;noeviction&lt;/code&gt; behavior.&lt;/p&gt;
&lt;p&gt;Which policy fits your workload, and where does each one fail?&lt;/p&gt;
&lt;h2 id=&quot;how-eviction-works&quot;&gt;How Eviction Works&lt;/h2&gt;
&lt;p&gt;When a write arrives and memory is at the limit, Redis runs eviction logic before accepting the write. The policy determines which key is dropped.&lt;/p&gt;
&lt;p&gt;Redis 7.x documents eight policies:&lt;/p&gt;



























































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Policy&lt;/th&gt;&lt;th&gt;Key pool&lt;/th&gt;&lt;th&gt;Algorithm&lt;/th&gt;&lt;th&gt;Use case&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;noeviction&lt;/code&gt;&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;Rejects writes&lt;/td&gt;&lt;td&gt;Persistent stores where data loss is unacceptable&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;allkeys-lru&lt;/code&gt;&lt;/td&gt;&lt;td&gt;All keys&lt;/td&gt;&lt;td&gt;Least recently used&lt;/td&gt;&lt;td&gt;General-purpose cache&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;volatile-lru&lt;/code&gt;&lt;/td&gt;&lt;td&gt;TTL keys only&lt;/td&gt;&lt;td&gt;LRU from TTL set&lt;/td&gt;&lt;td&gt;Mixed store where permanent keys must survive&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;allkeys-lfu&lt;/code&gt;&lt;/td&gt;&lt;td&gt;All keys&lt;/td&gt;&lt;td&gt;Least frequently used&lt;/td&gt;&lt;td&gt;Skewed access patterns with a hot key set&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;volatile-lfu&lt;/code&gt;&lt;/td&gt;&lt;td&gt;TTL keys only&lt;/td&gt;&lt;td&gt;LFU from TTL set&lt;/td&gt;&lt;td&gt;Mixed store with skewed access&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;allkeys-random&lt;/code&gt;&lt;/td&gt;&lt;td&gt;All keys&lt;/td&gt;&lt;td&gt;Random&lt;/td&gt;&lt;td&gt;Almost never correct in production&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;volatile-random&lt;/code&gt;&lt;/td&gt;&lt;td&gt;TTL keys only&lt;/td&gt;&lt;td&gt;Random from TTL set&lt;/td&gt;&lt;td&gt;Rarely useful&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;volatile-ttl&lt;/code&gt;&lt;/td&gt;&lt;td&gt;TTL keys only&lt;/td&gt;&lt;td&gt;Shortest TTL first&lt;/td&gt;&lt;td&gt;When expiry order should drive eviction&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;For a general-purpose cache where a subset of keys is accessed far more often than the rest, &lt;code&gt;allkeys-lru&lt;/code&gt; is the documented starting recommendation in the Redis memory management documentation, and a reasonable pick when you are unsure. It requires no TTL discipline and evicts based on recency.&lt;/p&gt;
&lt;p&gt;For workloads with a stable hot key set, such as recommendations, trending content, or rate-limit counters, &lt;code&gt;allkeys-lfu&lt;/code&gt; is a better fit. LFU tracks frequency rather than recency, so a key that has been accessed hundreds of times is not dropped just because it was briefly idle. LFU support arrived in Redis 4.0.&lt;/p&gt;
&lt;p&gt;One detail matters for both: Redis does not maintain a true LRU or LFU data structure. It samples &lt;code&gt;maxmemory-samples&lt;/code&gt; keys (default: 5) and evicts the best candidate from that sample. This is an approximation; larger sample sizes improve accuracy at the cost of CPU.&lt;/p&gt;
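&lt;p&gt;The sampling idea can be sketched in a few lines. This is a simplified model, not Redis’s actual implementation: the function name, the keyspace, and the seeded random generator are assumptions for illustration, and real Redis keeps additional state between evictions.&lt;/p&gt;

```python
import random


def evict_sampled_lru(last_access, samples=5, rng=random):
    """Pick a victim the way an approximated LRU would: sample a few keys
    and evict the least recently used key among that sample."""
    sample_size = min(samples, len(last_access))
    candidates = rng.sample(sorted(last_access), sample_size)
    return min(candidates, key=last_access.get)


# Hypothetical keyspace: key -> last access time (larger means more recent).
last_access = {f"key:{i}": i for i in range(1000)}

rng = random.Random(42)  # seeded so repeated runs pick the same samples
for samples in (5, 10, 100):
    victim = evict_sampled_lru(last_access, samples, rng)
    print(samples, victim, last_access[victim])
```

&lt;p&gt;The true least recently used key here is &lt;code&gt;key:0&lt;/code&gt;. A small sample will usually pick something older than average but rarely the oldest key; a larger sample gets closer at the cost of more work per eviction.&lt;/p&gt;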
&lt;p&gt;Set the policy in &lt;code&gt;redis.conf&lt;/code&gt; or apply it at runtime without a restart:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;ini&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# redis.conf — set once, survives restart&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;maxmemory 2gb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;maxmemory-policy allkeys-lru&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;maxmemory-samples 10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# Apply at runtime without restart&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;redis-cli&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; CONFIG&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; maxmemory-policy&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; allkeys-lru&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;redis-cli&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; CONFIG&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; SET&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; maxmemory-samples&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;volatile-*&lt;/code&gt; policies only touch keys with a TTL set. If the application writes any keys without TTLs, those keys are never eligible for eviction. As non-TTL keys accumulate, the eviction pool shrinks, and under write pressure Redis exhausts eligible keys and falls back to &lt;code&gt;noeviction&lt;/code&gt; behavior without any configuration change.&lt;/p&gt;
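&lt;p&gt;The shrinking pool can be modelled directly. In this sketch, which simplifies &lt;code&gt;volatile-lru&lt;/code&gt; to an exact LRU over TTL keys, the exception class, function name, and sample keys are all hypothetical stand-ins rather than anything from the Redis codebase.&lt;/p&gt;

```python
class OutOfMemoryError(Exception):
    """Stands in for the OOM error Redis returns when nothing can be evicted."""


def pick_volatile_victim(keys):
    """keys maps name -> (last_access, ttl_seconds or None).
    Only keys that carry a TTL are eligible under a volatile-* policy."""
    eligible = {name: meta for name, meta in keys.items() if meta[1] is not None}
    if not eligible:
        raise OutOfMemoryError("no keys with a TTL; write rejected")
    # volatile-lru flavour: least recently used among the TTL keys.
    return min(eligible, key=lambda name: eligible[name][0])


keys = {
    "session:1": (100, 3600),
    "session:2": (50, 3600),
    "config:flags": (1, None),  # no TTL, so never eligible
}
print(pick_volatile_victim(keys))  # session:2

no_ttl_keys = {"config:flags": (1, None), "catalog:all": (2, None)}
try:
    pick_volatile_victim(no_ttl_keys)
except OutOfMemoryError as err:
    print("write fails:", err)
```

&lt;p&gt;Note that &lt;code&gt;config:flags&lt;/code&gt; is the least recently used key overall, yet it is never chosen. Once only keys like it remain, every write that needs memory fails.&lt;/p&gt;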
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The Redis eviction policies reference at redis.io explicitly documents the &lt;code&gt;noeviction&lt;/code&gt; fallback when &lt;code&gt;volatile-*&lt;/code&gt; policies find no eligible keys. This is designed behavior. The practical consequence: &lt;code&gt;volatile-lru&lt;/code&gt; is safe only when TTL discipline is enforced at the application layer, not assumed.&lt;/p&gt;
&lt;p&gt;For diagnosis, &lt;code&gt;INFO memory&lt;/code&gt; returns &lt;code&gt;mem_fragmentation_ratio&lt;/code&gt;. The Redis documentation flags ratios above 1.5 as significant — the process RSS exceeds what Redis counts as &lt;code&gt;used_memory&lt;/code&gt;. Eviction uses &lt;code&gt;used_memory&lt;/code&gt;, not RSS, so high fragmentation means the host can approach OOM before Redis triggers any eviction.&lt;/p&gt;
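&lt;p&gt;The ratio is the resident set size divided by &lt;code&gt;used_memory&lt;/code&gt;. The sketch below uses invented values to show how an instance can sit under its &lt;code&gt;maxmemory&lt;/code&gt; limit while the operating system sees a much larger footprint; the function name and all three memory figures are assumptions.&lt;/p&gt;

```python
def fragmentation_ratio(used_memory_rss, used_memory):
    """The same ratio INFO memory reports as mem_fragmentation_ratio."""
    return used_memory_rss / used_memory


GIB = 1024 ** 3
maxmemory = 2 * GIB          # hypothetical configured limit
used_memory = 1.9 * GIB      # just under the limit, so nothing is evicted yet
used_memory_rss = 3.2 * GIB  # what the operating system actually sees

ratio = fragmentation_ratio(used_memory_rss, used_memory)
print(round(ratio, 2))                  # 1.68
print(maxmemory >= used_memory)         # True: eviction has not started
print(round(used_memory_rss / GIB, 1))  # 3.2 GiB resident
```

&lt;p&gt;On a host with 3 GiB available to Redis, this instance would be at risk of an OOM kill even though eviction has not evicted a single key.&lt;/p&gt;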
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;volatile-lru&lt;/code&gt; with no TTL keys&lt;/td&gt;&lt;td&gt;Writes fail under load; Redis behaves as &lt;code&gt;noeviction&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Eviction pool is empty; documented Redis fallback behavior&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;LRU or LFU with &lt;code&gt;maxmemory-samples 5&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Hot keys can be evicted by chance&lt;/td&gt;&lt;td&gt;Redis samples 5 keys, not the full keyspace; approximation only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High &lt;code&gt;mem_fragmentation_ratio&lt;/code&gt; with tight &lt;code&gt;maxmemory&lt;/code&gt;&lt;/td&gt;&lt;td&gt;RSS exceeds RAM before eviction triggers&lt;/td&gt;&lt;td&gt;Eviction uses &lt;code&gt;used_memory&lt;/code&gt;, not RSS; fragmentation is invisible to eviction logic&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Unset or mismatched eviction policy causes write failures, hit-rate degradation, or OOM kills under load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code&gt;maxmemory&lt;/code&gt; explicitly; use &lt;code&gt;allkeys-lru&lt;/code&gt; for general caches, &lt;code&gt;allkeys-lfu&lt;/code&gt; for skewed workloads; avoid &lt;code&gt;volatile-*&lt;/code&gt; unless TTL discipline is enforced at the application layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After a load test that writes more data than &lt;code&gt;maxmemory&lt;/code&gt; allows, &lt;code&gt;redis-cli INFO stats | grep evicted_keys&lt;/code&gt; should report a non-zero count and &lt;code&gt;used_memory&lt;/code&gt; should hold at about &lt;code&gt;maxmemory&lt;/code&gt; instead of growing past it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;redis-cli CONFIG GET maxmemory &amp;#x26;&amp;#x26; redis-cli CONFIG GET maxmemory-policy&lt;/code&gt; across production instances; any instance returning &lt;code&gt;0&lt;/code&gt; for &lt;code&gt;maxmemory&lt;/code&gt; is unprotected.&lt;/li&gt;
&lt;/ul&gt;
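&lt;p&gt;As a rough sketch of the Action step, the two &lt;code&gt;CONFIG GET&lt;/code&gt; replies can be checked in a small script. The function name &lt;code&gt;auditEviction&lt;/code&gt; and the warning strings are hypothetical, and the parsing assumes the two-line name/value output that &lt;code&gt;redis-cli&lt;/code&gt; prints when its output is piped rather than shown in a terminal.&lt;/p&gt;

```javascript
// Hypothetical audit helper: takes the raw text of two piped redis-cli replies.
// Assumes each reply is "name\nvalue", the non-interactive redis-cli format.
function auditEviction(maxmemoryReply, policyReply) {
  const maxmemory = Number(maxmemoryReply.trim().split("\n")[1]);
  const policy = policyReply.trim().split("\n")[1].trim();
  const warnings = [];
  if (maxmemory === 0) {
    warnings.push("maxmemory is 0: no memory limit is set");
  }
  if (policy === "noeviction") {
    warnings.push("noeviction: writes fail once maxmemory is reached");
  }
  if (policy.startsWith("volatile-")) {
    warnings.push(policy + ": only keys with a TTL can be evicted");
  }
  return warnings;
}

// Sample replies are invented for illustration.
console.log(auditEviction("maxmemory\n0\n", "maxmemory-policy\nvolatile-lru\n"));
```

&lt;p&gt;An empty array means neither check fired. It does not prove the instance is sized correctly for its workload.&lt;/p&gt;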
&lt;p&gt;Eviction policy is one of the few Redis settings where the wrong default does not produce an immediate visible failure — it surfaces only when the cache fills up, which is exactly when you need it most.&lt;/p&gt;</content:encoded><category>databases</category><category>failures</category></item><item><title>MongoDB Query Performance Workflow</title><link>https://rajivonai.com/blog/2022-09-26-mongodb-query-performance-workflow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-09-26-mongodb-query-performance-workflow/</guid><description>A systematic runbook for diagnosing slow MongoDB queries — from explain output through COLLSCAN, index selectivity, in-memory sort, and WiredTiger cache pressure.</description><pubDate>Mon, 26 Sep 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A MongoDB query showing COLLSCAN in explain output is not always the root cause of a performance problem — but it is always the first place to look.&lt;/strong&gt; When Atlas Performance Advisor flags a query or &lt;code&gt;currentOp&lt;/code&gt; shows sessions running for seconds, the diagnostic sequence from explain output to index design to cache pressure determines whether you spend 15 minutes or 2 hours finding the fix.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The alert fires or the monitoring dashboard shows elevated read latency. Atlas Performance Advisor has flagged one or more queries lacking index coverage. Operations that normally return in single-digit milliseconds are now taking hundreds of milliseconds or seconds. The collection has grown significantly since the last schema review.&lt;/p&gt;
&lt;p&gt;MongoDB query execution follows a straightforward path: the query planner selects a plan based on available indexes and statistics, executes it, and reports the winning plan with execution statistics. When no suitable index exists, the planner chooses COLLSCAN — a sequential scan of every document in the collection. For large collections, COLLSCAN latency scales linearly with collection size regardless of how selective the query predicate is.&lt;/p&gt;
&lt;p&gt;The diagnostic starting point is the same in every case: understand what the query planner is actually doing, then determine whether it is doing the right thing.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;queryPlanner.winningPlan.stage: COLLSCAN&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;explain()&lt;/code&gt; output&lt;/td&gt;&lt;td&gt;No index used; full collection scan&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High &lt;code&gt;totalDocsExamined&lt;/code&gt; vs &lt;code&gt;nReturned&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Index exists but selectivity is low, or part of the filter is applied after the index scan&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;SORT&lt;/code&gt; stage in winningPlan&lt;/td&gt;&lt;td&gt;&lt;code&gt;explain()&lt;/code&gt; output&lt;/td&gt;&lt;td&gt;In-memory sort; may hit the 100 MB sort limit on large result sets&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;totalKeysExamined &gt;&gt; nReturned&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Index scan reads many keys, most of which are filtered out afterward&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Ops flagged in Atlas Performance Advisor&lt;/td&gt;&lt;td&gt;Atlas UI, Performance Advisor tab&lt;/td&gt;&lt;td&gt;Atlas detected slow queries without index coverage&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Growing &lt;code&gt;opcounters.query&lt;/code&gt; with flat application throughput&lt;/td&gt;&lt;td&gt;&lt;code&gt;db.serverStatus().opcounters&lt;/code&gt;&lt;/td&gt;&lt;td&gt;The server is executing more queries without the application completing more work&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Currently running slow operations&lt;/strong&gt; — Check what is active before looking at historical patterns:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;currentOp&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  active: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;true&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  secs_running: { $gt: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;})&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any operation running longer than 1 second is a candidate. Note the &lt;code&gt;ns&lt;/code&gt; (namespace), the &lt;code&gt;op&lt;/code&gt; type, and the &lt;code&gt;command&lt;/code&gt; field (named &lt;code&gt;query&lt;/code&gt; on releases before MongoDB 3.6). If you see the same query pattern repeatedly, it is a systemic issue, not a one-off.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Explain the slow query with execution statistics&lt;/strong&gt; — Get the actual execution plan and row counts:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;explain&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;executionStats&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;12345&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  status: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;pending&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;})&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key fields in the output:&lt;/p&gt;
&lt;ul&gt;
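&lt;li&gt;&lt;code&gt;winningPlan.stage&lt;/code&gt;: &lt;code&gt;COLLSCAN&lt;/code&gt; (full scan) or &lt;code&gt;IXSCAN&lt;/code&gt; (index used); an &lt;code&gt;IXSCAN&lt;/code&gt; usually appears as the &lt;code&gt;inputStage&lt;/code&gt; of a &lt;code&gt;FETCH&lt;/code&gt; stage rather than at the top of the plan&lt;/li&gt;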
&lt;li&gt;&lt;code&gt;executionStats.nReturned&lt;/code&gt;: documents returned to the client&lt;/li&gt;
&lt;li&gt;&lt;code&gt;executionStats.totalDocsExamined&lt;/code&gt;: documents MongoDB had to read&lt;/li&gt;
&lt;li&gt;&lt;code&gt;executionStats.totalKeysExamined&lt;/code&gt;: index keys scanned&lt;/li&gt;
&lt;li&gt;&lt;code&gt;executionStats.executionTimeMillis&lt;/code&gt;: actual query duration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A healthy query has &lt;code&gt;nReturned ≈ totalDocsExamined&lt;/code&gt;. A poorly indexed query has &lt;code&gt;totalDocsExamined &gt;&gt; nReturned&lt;/code&gt;.&lt;/p&gt;
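&lt;p&gt;The comparison above can be reduced to two ratios. The following is a minimal sketch, not part of any MongoDB tooling: &lt;code&gt;scanEfficiency&lt;/code&gt; is a hypothetical helper, and the sample numbers are invented. It takes the &lt;code&gt;executionStats&lt;/code&gt; section of an explain result as a plain object.&lt;/p&gt;

```javascript
// Hypothetical helper: how many documents and index keys were read
// for each document the query actually returned.
function scanEfficiency(executionStats) {
  // Guard against division by zero when the query returns nothing.
  const divisor = Math.max(executionStats.nReturned, 1);
  return {
    docsPerReturned: executionStats.totalDocsExamined / divisor,
    keysPerReturned: executionStats.totalKeysExamined / divisor,
  };
}

// Invented sample: 5000 documents read to return 20.
const sampleStats = { nReturned: 20, totalDocsExamined: 5000, totalKeysExamined: 5000 };
console.log(scanEfficiency(sampleStats));
```

&lt;p&gt;A &lt;code&gt;docsPerReturned&lt;/code&gt; value near 1 corresponds to the healthy case; a value in the hundreds corresponds to the poorly indexed case.&lt;/p&gt;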
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;List existing indexes&lt;/strong&gt; — Understand what index coverage already exists:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;getIndexes&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check whether an index exists on the query fields. If an index exists but &lt;code&gt;explain()&lt;/code&gt; still shows COLLSCAN, the index may not match the query predicate (wrong field order in a compound index, mismatched types, or low cardinality causing the planner to prefer COLLSCAN).&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Enable slow query profiling&lt;/strong&gt; — Capture slow queries for pattern analysis:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Set profiling level 1 — log queries slower than 100ms&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;setProfilingLevel&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, { slowms: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Read recent slow queries&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.system.profile.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sort&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ ts: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;limit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;5&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pretty&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The profiler output includes full query shape, execution plan, and timing. On Atlas, the Query Profiler in the UI exposes the same data without manual profiling setup.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check server-level query rate trends&lt;/strong&gt; — Determine if this is a new regression or a gradual growth issue:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;serverStatus&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;().opcounters&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compare the &lt;code&gt;query&lt;/code&gt; counter between two calls 60 seconds apart to get a queries-per-second rate. If that rate has been growing while application-level throughput stays flat, the queries are getting slower as the collection grows, which is a classic missing-index signature.&lt;/p&gt;
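&lt;p&gt;The two-snapshot comparison is simple arithmetic. A minimal sketch, assuming each snapshot is the object returned by &lt;code&gt;db.serverStatus().opcounters&lt;/code&gt;; the helper name &lt;code&gt;queryRatePerSecond&lt;/code&gt; and the sample counts are hypothetical:&lt;/p&gt;

```javascript
// Hypothetical helper: average queries per second between two opcounters snapshots.
// Returns null if the counter went backwards, which suggests a server restart.
function queryRatePerSecond(first, second, elapsedSeconds) {
  if (first.query > second.query) return null;
  return (second.query - first.query) / elapsedSeconds;
}

// Invented snapshots taken 60 seconds apart.
const before = { query: 120000 };
const after = { query: 126000 };
console.log(queryRatePerSecond(before, after, 60));
```

&lt;p&gt;The counters are cumulative since server start, so a single rate means little on its own. The useful signal is how the rate trends over several samples.&lt;/p&gt;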
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Slow MongoDB query] --&gt; B{explain shows COLLSCAN?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| C{Index exists on query fields?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|no| D[Create index on query predicate fields]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|yes| E{Cardinality low — many duplicate values?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| F[Consider compound index with higher-cardinality field first]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|no| G[Check field type match — query type must match schema type]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| H{totalDocsExamined much larger than nReturned?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|yes| I[Compound index needed — add filter fields in ESR order]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt;|no| J{SORT stage in winningPlan?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| K[Add sort key to index — create covering compound index]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| L{WiredTiger cache fill above 90%?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|yes| M[Cache pressure — increase wiredTigerCacheSizeGB or upgrade instance]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    L --&gt;|no| N[Check write contention — concurrent writes to same documents]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
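&lt;p&gt;The same tree can be written as a function for use in a triage script. This is a sketch under stated assumptions: the input shape, the names &lt;code&gt;nextStep&lt;/code&gt;, &lt;code&gt;collectStages&lt;/code&gt;, and &lt;code&gt;cacheFillRatio&lt;/code&gt;, and the flags &lt;code&gt;indexExistsOnQueryFields&lt;/code&gt; and &lt;code&gt;lowCardinality&lt;/code&gt; are all hypothetical and must be supplied by the caller. The 10:1 threshold is borrowed from the checklist at the end of this runbook, and the plan walk assumes the classic &lt;code&gt;inputStage&lt;/code&gt; nesting in explain output.&lt;/p&gt;

```javascript
// Collects every stage name in a winningPlan tree (inputStage / inputStages).
function collectStages(plan, found = []) {
  if (!plan) return found;
  found.push(plan.stage);
  collectStages(plan.inputStage, found);
  for (const child of plan.inputStages || []) collectStages(child, found);
  return found;
}

// Cache fill from db.serverStatus().wiredTiger.cache, passed in as a plain object.
function cacheFillRatio(cache) {
  return cache["bytes currently in the cache"] / cache["maximum bytes configured"];
}

// Hypothetical encoding of the flowchart; returns the suggested next action.
function nextStep(input) {
  const stages = collectStages(input.winningPlan);
  const stats = input.executionStats;
  if (stages.includes("COLLSCAN")) {
    if (!input.indexExistsOnQueryFields) return "create index on query predicate fields";
    if (input.lowCardinality) return "consider compound index with higher-cardinality field first";
    return "check field type match";
  }
  if (stats.totalDocsExamined > 10 * Math.max(stats.nReturned, 1)) {
    return "add filter fields to a compound index in ESR order";
  }
  if (stages.includes("SORT")) return "add sort key to index";
  if (input.cacheFillRatio > 0.9) return "cache pressure";
  return "check write contention";
}

// Invented example: index scan followed by an in-memory sort.
const example = {
  winningPlan: { stage: "SORT", inputStage: { stage: "FETCH", inputStage: { stage: "IXSCAN" } } },
  executionStats: { nReturned: 50, totalDocsExamined: 60 },
  indexExistsOnQueryFields: true,
  lowCardinality: false,
  cacheFillRatio: cacheFillRatio({ "bytes currently in the cache": 4, "maximum bytes configured": 10 }),
};
console.log(nextStep(example));
```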
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Create a targeted index&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For a query doing COLLSCAN with no existing index on the predicate fields:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Single-field index&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;createIndex&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Compound index following ESR rule (Equality, Sort, Range)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Query: find({ customer_id: X, status: &quot;pending&quot; }, sort by created_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;createIndex&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, status: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, created_at: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The ESR rule from MongoDB documentation: place equality predicates first, sort fields second, and range predicates last in a compound index. This ordering maximizes the portion of the index that can be used for both filtering and sorting.&lt;/p&gt;
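&lt;p&gt;To make the ordering concrete, here is a minimal sketch that assembles an index key document in Equality, Sort, Range order. The helper &lt;code&gt;buildEsrIndex&lt;/code&gt; is hypothetical, and treating &lt;code&gt;total&lt;/code&gt; as a range-filtered field is an assumption made only for this example.&lt;/p&gt;

```javascript
// Hypothetical helper: builds a compound index key spec in ESR order.
// equalityFields and rangeFields are arrays of field names;
// sortSpec maps a field name to 1 (ascending) or -1 (descending).
function buildEsrIndex(equalityFields, sortSpec, rangeFields) {
  const keys = {};
  for (const field of equalityFields) keys[field] = 1;
  for (const [field, direction] of Object.entries(sortSpec)) {
    if (!(field in keys)) keys[field] = direction;
  }
  for (const field of rangeFields) {
    if (!(field in keys)) keys[field] = 1;
  }
  return keys;
}

// Equality on customer_id and status, sort by created_at descending,
// range filter on total (assumed for illustration).
const esrSpec = buildEsrIndex(["customer_id", "status"], { created_at: -1 }, ["total"]);
console.log(JSON.stringify(esrSpec));
```

&lt;p&gt;The resulting object can be passed to &lt;code&gt;createIndex&lt;/code&gt;. The key order in the object is the key order in the index, which is why the helper inserts fields strictly in E, S, R sequence.&lt;/p&gt;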
&lt;p&gt;After index creation, re-run &lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt; to confirm the plan switched from COLLSCAN to IXSCAN and &lt;code&gt;totalDocsExamined&lt;/code&gt; dropped to match &lt;code&gt;nReturned&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 2 — Covered query with projection&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If a query frequently returns only a subset of fields and those fields plus the query predicate can all fit in an index, a covered query avoids fetching documents entirely:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Index covers query + projection&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;createIndex&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, status: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, created_at: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, total: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Covered query — returns only indexed fields, no document fetch&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  { customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;12345&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, status: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;pending&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  { customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, status: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, created_at: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, total: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, _id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;explain()&lt;/code&gt; output, a covered query shows &lt;code&gt;IXSCAN&lt;/code&gt; with no &lt;code&gt;FETCH&lt;/code&gt; stage. &lt;code&gt;totalDocsExamined&lt;/code&gt; will be 0.&lt;/p&gt;
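&lt;p&gt;That check can be automated against a saved explain document. A minimal sketch, assuming the classic explain shape in which stages nest through &lt;code&gt;inputStage&lt;/code&gt;; the helper names &lt;code&gt;hasStage&lt;/code&gt; and &lt;code&gt;isCoveredQuery&lt;/code&gt; and the sample document are hypothetical.&lt;/p&gt;

```javascript
// Returns true if any stage in the plan tree has the given name.
function hasStage(plan, name) {
  if (!plan) return false;
  if (plan.stage === name) return true;
  if (hasStage(plan.inputStage, name)) return true;
  return (plan.inputStages || []).some((child) => hasStage(child, name));
}

// Hypothetical check: index scan present, no FETCH, and no documents examined.
function isCoveredQuery(explainResult) {
  const plan = explainResult.queryPlanner.winningPlan;
  if (!hasStage(plan, "IXSCAN")) return false;
  if (hasStage(plan, "FETCH")) return false;
  return explainResult.executionStats.totalDocsExamined === 0;
}

// Invented explain output for a covered query.
const coveredExplain = {
  queryPlanner: {
    winningPlan: { stage: "PROJECTION_COVERED", inputStage: { stage: "IXSCAN" } },
  },
  executionStats: { nReturned: 3, totalKeysExamined: 3, totalDocsExamined: 0 },
};
console.log(isCoveredQuery(coveredExplain));
```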
&lt;p&gt;&lt;strong&gt;Option 3 — Resolve in-memory sort&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An in-memory SORT stage appears when no index covers the sort key. MongoDB limits in-memory sorts to 100 MB by default; queries that would exceed this limit fail with an error unless the sort is allowed to spill to disk with &lt;code&gt;allowDiskUse&lt;/code&gt;. Adding the sort key to the index eliminates the SORT stage:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Before: COLLSCAN or IXSCAN followed by SORT stage&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;12345&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sort&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ created_at: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Add compound index covering filter and sort&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;createIndex&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ customer_id: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, created_at: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// After: IXSCAN with no SORT stage — sort is satisfied by index order&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Index creation:&lt;/strong&gt; Indexes can be dropped without data loss: &lt;code&gt;db.orders.dropIndex(&quot;index_name&quot;)&lt;/code&gt;. The index name is visible in &lt;code&gt;db.orders.getIndexes()&lt;/code&gt;. The drop takes effect immediately, and query plans revert to their pre-index behavior.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Profiling level change:&lt;/strong&gt; &lt;code&gt;db.setProfilingLevel(0)&lt;/code&gt; disables profiling. The &lt;code&gt;system.profile&lt;/code&gt; collection is capped (1 MB by default), so it overwrites its oldest entries rather than growing without bound, and existing entries remain after profiling is turned off. To clear it, disable profiling first and then run &lt;code&gt;db.system.profile.drop()&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No rollback needed for explain or currentOp&lt;/strong&gt; — these are read-only diagnostic commands with no side effects.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Atlas Performance Advisor automatically surfaces index recommendations for queries it detects as slow. For self-managed deployments, the same signal is available by querying the profiler collection on a schedule:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;// Find query shapes taking longer than 200ms in the last hour&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.system.profile.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ts: { $gt: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;new&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; Date&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(Date.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 3600000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  millis: { $gt: &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;200&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  op: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;query&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;}).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sort&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ millis: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; }).&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;limit&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this as a scheduled job and alerting when new slow query shapes appear gives early warning before a growing collection converts a borderline index miss into a hard COLLSCAN under production load.&lt;/p&gt;
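&lt;p&gt;The &quot;new slow query shape&quot; comparison can be sketched as plain post-processing of the profiler documents. This is an illustration, not a finished job: &lt;code&gt;summarizeSlowShapes&lt;/code&gt; is a hypothetical helper, the sample entries are invented, and it assumes each profiler entry carries &lt;code&gt;ns&lt;/code&gt;, &lt;code&gt;millis&lt;/code&gt;, and a &lt;code&gt;queryHash&lt;/code&gt; field, which older server versions may not emit.&lt;/p&gt;

```javascript
// Hypothetical helper: groups profiler entries by namespace and queryHash,
// then returns only the shapes that are not already in the known set.
function summarizeSlowShapes(entries, knownShapes) {
  const byShape = new Map();
  for (const entry of entries) {
    const key = entry.ns + ":" + entry.queryHash;
    const current = byShape.get(key) || { shape: key, count: 0, maxMillis: 0 };
    current.count += 1;
    current.maxMillis = Math.max(current.maxMillis, entry.millis);
    byShape.set(key, current);
  }
  return [...byShape.values()].filter((summary) => !knownShapes.has(summary.shape));
}

// Invented profiler entries; one shape is already known from a previous run.
const profilerEntries = [
  { ns: "shop.orders", queryHash: "A1", millis: 250 },
  { ns: "shop.orders", queryHash: "A1", millis: 900 },
  { ns: "shop.orders", queryHash: "B2", millis: 300 },
];
const alreadySeen = new Set(["shop.orders:B2"]);
console.log(summarizeSlowShapes(profilerEntries, alreadySeen));
```

&lt;p&gt;Persisting the set of known shapes between runs is left to the scheduler. Whatever the function returns is the list that should trigger an alert.&lt;/p&gt;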
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What broke:&lt;/strong&gt; MongoDB read latency spiked as collection growth exposed queries running without index coverage. Full collection scans were taking seconds on collections that had grown beyond their original index planning assumptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What was done:&lt;/strong&gt; Used &lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt; to identify COLLSCAN queries, applied compound indexes following the ESR rule, and verified plans switched from COLLSCAN to IXSCAN with &lt;code&gt;totalDocsExamined&lt;/code&gt; matching &lt;code&gt;nReturned&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What prevents recurrence:&lt;/strong&gt; Atlas Performance Advisor monitoring surfaces new missing-index patterns automatically. A scheduled profiler query provides equivalent coverage on self-managed deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;db.currentOp({active: true, secs_running: {$gt: 1}})&lt;/code&gt; — identify active slow operations&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt; on the flagged query — note &lt;code&gt;winningPlan.stage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;totalDocsExamined&lt;/code&gt; vs &lt;code&gt;nReturned&lt;/code&gt; — ratio above 10:1 indicates poor selectivity or missing index&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;db.collection.getIndexes()&lt;/code&gt; — confirm which indexes exist and their field order&lt;/li&gt;
&lt;li&gt;Check for &lt;code&gt;SORT&lt;/code&gt; stage in winningPlan — if present, sort key is not covered by the index&lt;/li&gt;
&lt;li&gt;If COLLSCAN with no index: create a targeted index using ESR rule for compound predicates&lt;/li&gt;
&lt;li&gt;If IXSCAN but high &lt;code&gt;totalDocsExamined&lt;/code&gt;: consider adding remaining filter fields to the compound index&lt;/li&gt;
&lt;li&gt;Re-run &lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt; after index creation — verify plan switches to IXSCAN&lt;/li&gt;
&lt;li&gt;Check WiredTiger cache fill ratio via &lt;code&gt;db.serverStatus().wiredTiger.cache&lt;/code&gt; — rule out cache pressure&lt;/li&gt;
&lt;li&gt;Enable profiler at &lt;code&gt;slowms: 100&lt;/code&gt; if the slow query pattern is not yet fully characterized&lt;/li&gt;
&lt;/ol&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>MongoDB Index Basics: Why Your Query Became Slow</title><link>https://rajivonai.com/blog/2022-09-12-mongodb-index-basics-why-your-query-became-slow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-09-12-mongodb-index-basics-why-your-query-became-slow/</guid><description>MongoDB&apos;s default behavior is a full collection scan when no index supports the query. Here is what you need to know about single-field, compound, and multikey indexes before your collection grows past 10K documents.</description><pubDate>Mon, 12 Sep 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If a query runs fine at 10,000 documents and becomes slow at 100,000, the most likely cause is a missing index — not a MongoDB bug, not a schema problem, not a driver issue.&lt;/strong&gt; MongoDB’s query planner defaults to a full collection scan (COLLSCAN) when no suitable index exists. That scan touches every document in the collection regardless of how selective the filter is. Understanding how MongoDB builds and uses indexes is the operational knowledge that separates a collection that stays fast from one that degrades linearly with data volume.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Engineers moving to MongoDB from a relational background often expect the optimizer to behave like PostgreSQL or MySQL: add a column and the planner will figure the rest out. MongoDB does use indexes when they exist, but apart from the default &lt;code&gt;_id&lt;/code&gt; index there is no implicit index creation. Without an explicit index on a field, every query that filters, sorts, or aggregates on that field will scan the entire collection.&lt;/p&gt;
&lt;p&gt;The rate of degradation is what surprises engineers: a COLLSCAN at 10K documents takes milliseconds; the same scan at 1M documents takes seconds. The collection felt fast during development because the data volume was too small for the problem to be visible.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure mode is predictable: somewhere between 50K and 200K documents, a query that returns a single record starts taking seconds. The engineer adds an index — but adds it on the field they notice in the filter, not on the field the planner needs. Latency improves slightly or not at all. The problem is that they did not know how to read the query planner output, and they did not understand how compound index ordering affects whether an index can be used for both filtering and sorting. The core question: given a query with a filter, a sort, and a range condition, how do you build an index the planner will actually use?&lt;/p&gt;
&lt;h2 id=&quot;how-mongodb-indexes-work&quot;&gt;How MongoDB Indexes Work&lt;/h2&gt;
&lt;p&gt;MongoDB uses B-tree indexes on individual fields or combinations of fields. Three index types matter for most applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Single-field indexes&lt;/strong&gt; are the starting point. An index on &lt;code&gt;{ status: 1 }&lt;/code&gt; lets the planner use IXSCAN for any query filtering on &lt;code&gt;status&lt;/code&gt;. If your query also sorts on &lt;code&gt;createdAt&lt;/code&gt;, the index handles the filter but leaves the sort as an in-memory operation. If that sort needs more than the in-memory limit (100MB since MongoDB 4.4, 32MB in earlier versions) and disk use is not allowed, MongoDB aborts it with an error.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compound indexes&lt;/strong&gt; cover multiple fields in a declared order. The order matters because of the &lt;strong&gt;prefix rule&lt;/strong&gt;: an index on &lt;code&gt;{ status: 1, userId: 1, createdAt: -1 }&lt;/code&gt; supports queries on &lt;code&gt;status&lt;/code&gt;, on &lt;code&gt;status + userId&lt;/code&gt;, and on all three. It does not support a query filtering only on &lt;code&gt;userId&lt;/code&gt; — the prefix must be respected.&lt;/p&gt;
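&lt;p&gt;The prefix rule can be sketched as a small check. This is an illustrative JavaScript helper, not MongoDB’s planner logic: &lt;code&gt;usablePrefixLength&lt;/code&gt; is a hypothetical name, and it only models equality filters on the leading index fields.&lt;/p&gt;

```javascript
// Hypothetical helper: models the prefix rule only, not the real query planner.
// indexSpec is an index key pattern such as { status: 1, userId: 1, createdAt: -1 }.
// filterFields lists the fields the query filters on with equality.
function usablePrefixLength(indexSpec, filterFields) {
  const wanted = new Set(filterFields);
  let length = 0;
  for (const field of Object.keys(indexSpec)) {
    if (!wanted.has(field)) break; // the prefix ends at the first index field the query skips
    length += 1;
  }
  return length;
}

const index = { status: 1, userId: 1, createdAt: -1 };
console.log(usablePrefixLength(index, ['status']));           // 1
console.log(usablePrefixLength(index, ['status', 'userId'])); // 2
console.log(usablePrefixLength(index, ['userId']));           // 0
```

&lt;p&gt;A result of 0 means the filter skips the leading field, so the index cannot bound the scan for that query.&lt;/p&gt;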
&lt;p&gt;For compound indexes that combine equality filters, sort conditions, and range filters, MongoDB’s documentation describes the &lt;strong&gt;ESR rule&lt;/strong&gt; as the recommended ordering: &lt;strong&gt;Equality fields first, then Sort fields, then Range fields&lt;/strong&gt;. The rationale is mechanical: placing equality conditions first narrows the index scan to exact key matches before any range traversal or sort is applied. Putting a range field before the sort field means the matching index entries are no longer stored in sort order, so the planner must sort the results in memory even though an index was used. The ESR rule is documented in the “Indexing Strategies” section of the MongoDB manual.&lt;/p&gt;
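&lt;p&gt;As a rough illustration of the ESR ordering, the sketch below builds an index key pattern from a description of a query. The helper name &lt;code&gt;esrIndexSpec&lt;/code&gt; and the input shape with &lt;code&gt;equality&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, and &lt;code&gt;range&lt;/code&gt; properties are assumptions made for this example; the &lt;code&gt;total&lt;/code&gt; field is also hypothetical.&lt;/p&gt;

```javascript
// Hypothetical helper: orders fields by the ESR guideline (Equality, Sort, Range).
// The input shape { equality, sort, range } is an assumption made for this sketch.
function esrIndexSpec(query) {
  const spec = {};
  for (const field of query.equality) spec[field] = 1;
  for (const [field, direction] of Object.entries(query.sort)) {
    if (!(field in spec)) spec[field] = direction; // keep the sort direction in the index
  }
  for (const field of query.range) {
    if (!(field in spec)) spec[field] = 1;
  }
  return spec;
}

const spec = esrIndexSpec({
  equality: ['status', 'userId'],
  sort: { createdAt: -1 },
  range: ['total'],
});
console.log(JSON.stringify(spec));
// {"status":1,"userId":1,"createdAt":-1,"total":1}
```

&lt;p&gt;In mongosh the resulting object could be passed to &lt;code&gt;db.orders.createIndex(spec)&lt;/code&gt;. Treat the output as a starting point and confirm it with &lt;code&gt;explain()&lt;/code&gt;, since selectivity can justify deviating from the guideline.&lt;/p&gt;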
&lt;p&gt;&lt;strong&gt;Multikey indexes&lt;/strong&gt; handle array fields. If a document has a field &lt;code&gt;tags: [&quot;mongodb&quot;, &quot;indexes&quot;, &quot;performance&quot;]&lt;/code&gt;, an index on &lt;code&gt;{ tags: 1 }&lt;/code&gt; creates one index entry per array element. Queries for any single tag value use IXSCAN. The constraint is that a compound index cannot have two multikey fields: MongoDB will reject index creation on &lt;code&gt;{ tags: 1, categories: 1 }&lt;/code&gt; if both are array fields in the same document.&lt;/p&gt;
&lt;p&gt;The diagnostic tool is &lt;code&gt;explain()&lt;/code&gt;. Appending &lt;code&gt;.explain(&quot;executionStats&quot;)&lt;/code&gt; returns the plan the planner chose along with execution statistics. The critical fields: &lt;code&gt;queryPlanner.winningPlan&lt;/code&gt; (look for an IXSCAN stage versus COLLSCAN; the IXSCAN is often nested as the &lt;code&gt;inputStage&lt;/code&gt; of a FETCH stage), &lt;code&gt;executionStats.totalDocsExamined&lt;/code&gt; versus &lt;code&gt;executionStats.nReturned&lt;/code&gt; (a large ratio means poor selectivity or the wrong index), and &lt;code&gt;executionStats.executionTimeMillis&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;js&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;db.orders.&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;find&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ status: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;pending&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, userId: &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;u123&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;sort&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;({ createdAt: &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; })&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         .&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;explain&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&quot;executionStats&quot;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;COLLSCAN means no index supports the query. IXSCAN with &lt;code&gt;totalDocsExamined&lt;/code&gt; far exceeding &lt;code&gt;nReturned&lt;/code&gt; means the index exists but the wrong fields or order were used.&lt;/p&gt;
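&lt;p&gt;The reading steps above can be sketched as a small triage function over an explain result held as a plain object. The function names and the 10x ratio threshold are assumptions for illustration, not MongoDB recommendations; the field names follow the &lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt; output.&lt;/p&gt;

```javascript
// Sketch: summarize an explain("executionStats") result held as a plain object.
// The 10x ratio threshold is an arbitrary assumption, not a MongoDB recommendation.
function collectStages(plan, stages = []) {
  if (!plan) return stages;
  stages.push(plan.stage);
  collectStages(plan.inputStage, stages); // most plans nest a single child stage
  for (const child of plan.inputStages || []) collectStages(child, stages);
  return stages;
}

function triageExplain(explain) {
  const stages = collectStages(explain.queryPlanner.winningPlan);
  const { totalDocsExamined, nReturned } = explain.executionStats;
  const ratio = totalDocsExamined / Math.max(nReturned, 1);
  if (stages.includes('COLLSCAN')) return 'no index used: full collection scan';
  if (stages.includes('SORT')) return 'index used, but the sort ran in memory';
  if (ratio > 10) return 'index used, but it examines far more documents than it returns';
  return 'index looks well matched to the query';
}

// Hand-written sample with made-up numbers, shaped like real explain output.
const sample = {
  queryPlanner: { winningPlan: { stage: 'FETCH', inputStage: { stage: 'IXSCAN' } } },
  executionStats: { totalDocsExamined: 48000, nReturned: 12 },
};
console.log(triageExplain(sample));
```

&lt;p&gt;The checks run from worst case to best, so a plan with both a collection scan and an in-memory sort is reported as a missing index first.&lt;/p&gt;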
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;MongoDB’s documentation covers the ESR rule and its rationale in the “Indexing Strategies” section of the manual. The prefix rule for compound indexes follows directly from how WiredTiger (MongoDB’s default storage engine since 3.2) walks the B-tree key space — behavior documented in the WiredTiger storage engine reference. The documented diagnostic pattern is: run &lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt;, confirm IXSCAN versus COLLSCAN, check &lt;code&gt;totalDocsExamined&lt;/code&gt; against &lt;code&gt;nReturned&lt;/code&gt;, and verify the compound index matches the ESR order for the query’s filter, sort, and range fields. This behavior has been consistent across MongoDB versions since 3.x.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Two array fields in a compound index&lt;/td&gt;&lt;td&gt;Index creation is rejected with a MongoServerError (cannot index parallel arrays)&lt;/td&gt;&lt;td&gt;MongoDB does not allow a compound multikey index across two array fields in the same document, because it would need one index entry for every combination of elements from both arrays&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Low-cardinality field as the leading equality key&lt;/td&gt;&lt;td&gt;Index exists but does not improve performance meaningfully&lt;/td&gt;&lt;td&gt;A field with five distinct values produces large index buckets; the planner scans a large fraction of the index even with IXSCAN&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sort on a field not in the index&lt;/td&gt;&lt;td&gt;In-memory sort is triggered; aborts if it exceeds the in-memory sort limit (100MB since MongoDB 4.4, 32MB before) and disk use is not allowed&lt;/td&gt;&lt;td&gt;When the sort field is absent from the index, the planner cannot use the index ordering and must buffer and sort the result in memory&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: A MongoDB collection that performs acceptably at development scale will degrade to COLLSCAN latency in production if indexes are not built to match query shapes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Run &lt;code&gt;.explain(&quot;executionStats&quot;)&lt;/code&gt; on every slow query, verify the winning plan uses IXSCAN, then build or rebuild compound indexes following the ESR rule — equality fields first, sort fields second, range fields last.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After adding the correctly ordered compound index, re-run &lt;code&gt;explain(&quot;executionStats&quot;)&lt;/code&gt; and confirm the winning plan contains an IXSCAN stage and &lt;code&gt;totalDocsExamined&lt;/code&gt; drops close to &lt;code&gt;nReturned&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;.explain(&quot;executionStats&quot;)&lt;/code&gt; on the three slowest queries in your application and check whether any of them are using COLLSCAN.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The query planner cannot use an index it was not given. Once you can read &lt;code&gt;explain()&lt;/code&gt; output, the path from slow query to correct index is mechanical.&lt;/p&gt;</content:encoded><category>databases</category><category>failures</category></item><item><title>DynamoDB Single-Table Design: When It Works and When It Hurts</title><link>https://rajivonai.com/blog/2022-07-25-dynamodb-single-table-design-when-it-works-and-when-it-hurts/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-07-25-dynamodb-single-table-design-when-it-works-and-when-it-hurts/</guid><description>Single-table design in DynamoDB is an operational bet that access patterns are stable enough to encode into partition and sort keys — when the approach pays off, and when evolving query requirements turn it into a migration project.</description><pubDate>Mon, 25 Jul 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Single-table design is not a clever schema trick; it is an operational bet that your access patterns are stable enough to encode into keys.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;DynamoDB rewards teams that know exactly how their application reads and writes data. It gives predictable latency at large scale, managed replication, automatic partitioning, streams, TTL, conditional writes, transactions, and global secondary indexes. In exchange, it asks a hard question early: what are the queries?&lt;/p&gt;
&lt;p&gt;That tradeoff is why single-table design exists. Instead of creating one table per entity, a team stores multiple entity types in one table and uses composite primary keys to place related items together. An order, its line items, payment events, fulfillment records, and audit entries may all share the same partition key and differ by sort key prefixes.&lt;/p&gt;
&lt;p&gt;The result can be excellent. A request that would require joins in a relational database can become one partition query. A service can fetch an aggregate view with one call, keep latency stable under load, and avoid distributed transactions across multiple tables.&lt;/p&gt;
&lt;p&gt;But the pattern gets oversold. Single-table design is not automatically more scalable than multi-table design. It is more scalable when the shape of the workload matches the shape of the keys.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The failure usually starts after launch, not during the first schema review.&lt;/p&gt;
&lt;p&gt;A team models the happy-path access pattern: get customer dashboard, list orders by account, fetch order detail, append events. The key design works. The service is fast. Costs are reasonable.&lt;/p&gt;
&lt;p&gt;Then product behavior changes. Support wants to find all failed payments by provider. Finance wants reconciliation by settlement date. Operations wants open orders by warehouse and priority. Analytics wants historical exports. A new feature needs to query relationships in the opposite direction from the original aggregate.&lt;/p&gt;
&lt;p&gt;The table still contains the data, but it no longer contains the access path.&lt;/p&gt;
&lt;p&gt;Now the team has bad options. Add a global secondary index and backfill it. Overload an existing index with another entity shape and hope the naming convention remains understandable. Duplicate data into another item type. Stream changes into OpenSearch, S3, or a relational store. Run scans for rare workflows and accept cost spikes. Or migrate the model while production traffic continues.&lt;/p&gt;
&lt;p&gt;The core question is: &lt;strong&gt;when is DynamoDB single-table design an architecture advantage, and when does it become accumulated coupling disguised as performance?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;The answer is to treat single-table design as an access-pattern contract, not as a default modeling style.&lt;/p&gt;
&lt;p&gt;Use it when the service has bounded, high-volume operational queries. Avoid it when the service is still discovering its query surface, when ad hoc investigation is central to the workflow, or when many teams will independently add new entity relationships over time.&lt;/p&gt;
&lt;p&gt;A healthy single-table design starts with the request paths, not the nouns.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[product request — fetch account workspace] --&gt; B[access pattern inventory]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[partition key — account scope]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D[sort key — entity and time ordering]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[primary query — account aggregate]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; F[index query — status queue]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; G[index query — user lookup]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; H[service response — bounded read]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; I[worker response — bounded queue]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; J[support response — bounded lookup]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The design is good when each important request maps to a bounded key condition. The design is weak when important requests require scans, client-side filtering over broad partitions, or fragile conventions that only one engineer understands.&lt;/p&gt;
&lt;p&gt;A practical test: write the production questions as code comments before writing the entity model.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Get account workspace by account id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;List open tasks by account id and status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Fetch task detail by account id and task id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;List tasks assigned to user id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Append task event if version matches&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Expire invitation after ttl&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Those statements tell you whether the table needs a primary key only, one global secondary index, a sparse index, duplicated lookup items, or a separate read model.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;h3 id=&quot;context&quot;&gt;Context&lt;/h3&gt;
&lt;p&gt;Amazon’s DynamoDB documentation and public talks describe single-table design as a pattern for known access patterns, especially workloads that need high scale and low-latency key-value or document access. The documented pattern is to model item collections around partition keys, use sort keys for hierarchy and ordering, and add secondary indexes for alternate access paths.&lt;/p&gt;
&lt;p&gt;This is not a relational modeling exercise. DynamoDB does not optimize arbitrary joins later. The schema is physical from the beginning: partition key choice affects distribution, sort key shape affects query behavior, and index definitions affect write amplification.&lt;/p&gt;
&lt;h3 id=&quot;action&quot;&gt;Action&lt;/h3&gt;
&lt;p&gt;The strong version of the pattern is deliberate denormalization.&lt;/p&gt;
&lt;p&gt;For an ecommerce workflow, an account partition might contain profile metadata, active carts, orders, order items, and order events. Sort keys encode stable query order:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;PK = ACCOUNT#123&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SK = PROFILE#123&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;PK = ACCOUNT#123&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SK = ORDER#2022-07-25#9001&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;PK = ACCOUNT#123&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SK = ORDER#2022-07-25#9001#ITEM#1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;PK = ACCOUNT#123&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SK = ORDER#2022-07-25#9001#EVENT#2022-07-25T10:30:00Z&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A sparse global secondary index might project only open fulfillment work:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;GSI1PK = FULFILLMENT#OPEN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;GSI1SK = WAREHOUSE#DAL#PRIORITY#HIGH#ORDER#9001&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The application writes extra fields because the read path matters more than normalization. Conditional writes protect versioned updates. Transactions are reserved for small, critical multi-item changes. Streams can publish changes into downstream projections for search, analytics, or auditing.&lt;/p&gt;
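&lt;p&gt;The key layout above can be sketched as a handful of key builders plus the key condition for the order detail read. The function names are hypothetical and not part of any AWS SDK; &lt;code&gt;KeyConditionExpression&lt;/code&gt; and &lt;code&gt;ExpressionAttributeValues&lt;/code&gt; are the standard &lt;code&gt;Query&lt;/code&gt; parameters, shown here with document-client style plain values.&lt;/p&gt;

```javascript
// Hypothetical key builders for the layout above; the function names are
// illustrative and not part of any AWS SDK.
const accountPk = (accountId) => `ACCOUNT#${accountId}`;
const orderSk = (date, orderId) => `ORDER#${date}#${orderId}`;
const orderItemSk = (date, orderId, itemNo) => `${orderSk(date, orderId)}#ITEM#${itemNo}`;
const orderEventSk = (date, orderId, isoTime) => `${orderSk(date, orderId)}#EVENT#${isoTime}`;

// Key condition for "fetch order detail": one partition, one sort key prefix.
// Assumes fixed-width order ids; otherwise the prefix for order 9001 would also
// match order 90010, and the prefix should end with a delimiter instead.
function orderDetailCondition(accountId, date, orderId) {
  return {
    KeyConditionExpression: 'PK = :pk AND begins_with(SK, :prefix)',
    ExpressionAttributeValues: {
      ':pk': accountPk(accountId),
      ':prefix': orderSk(date, orderId),
    },
  };
}

console.log(orderItemSk('2022-07-25', '9001', 1));
// ORDER#2022-07-25#9001#ITEM#1
console.log(orderDetailCondition('123', '2022-07-25', '9001'));
```

&lt;p&gt;Keeping key construction in one module like this gives the access-pattern tests a single place to assert against, instead of scattering string concatenation across services.&lt;/p&gt;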
&lt;h3 id=&quot;result&quot;&gt;Result&lt;/h3&gt;
&lt;p&gt;The result is operationally strong when the workload stays inside those paths.&lt;/p&gt;
&lt;p&gt;The account view is a partition query. The fulfillment queue is an index query. The order detail is a bounded range query. The service avoids joins at request time and keeps predictable latency because the database is doing exactly the work the keys describe.&lt;/p&gt;
&lt;p&gt;The result is operationally weak when the table becomes a dumping ground for every future question. Overloaded indexes become difficult to reason about because GSIs project different attributes for different entity types, forcing generic attribute names (&lt;code&gt;Data1&lt;/code&gt;, &lt;code&gt;Data2&lt;/code&gt;) and increasing storage costs. Backfills become risky because every item type has different attributes. Hot partitions appear when one tenant, status, or queue key receives disproportionate traffic. Cost shifts from read latency to write amplification and migration complexity.&lt;/p&gt;
&lt;h3 id=&quot;learning&quot;&gt;Learning&lt;/h3&gt;
&lt;p&gt;The documented pattern is not “put everything in one table.” The pattern is “put items that serve the same operational access patterns in one table.”&lt;/p&gt;
&lt;p&gt;That distinction matters. A single table can be a clean aggregate store. It can also become an undocumented protocol where every key prefix is a hidden API. The difference is whether the team maintains an access-pattern registry, capacity assumptions, ownership rules, and test coverage for key construction.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it hurts&lt;/th&gt;&lt;th&gt;Better response&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Unknown query surface&lt;/td&gt;&lt;td&gt;New product questions do not match existing keys&lt;/td&gt;&lt;td&gt;Start with multi-table or relational storage until access patterns stabilize&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Ad hoc investigation&lt;/td&gt;&lt;td&gt;Scans become normal operating procedure&lt;/td&gt;&lt;td&gt;Export to S3, index into OpenSearch, or use a relational read model&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Hot partitions&lt;/td&gt;&lt;td&gt;One tenant, queue, or status exceeds the per-partition throughput limits (3,000 RCU or 1,000 WCU per second), or the 10GB item collection limit when the table has local secondary indexes&lt;/td&gt;&lt;td&gt;Add write sharding, redesign queue keys, or isolate the workload&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Index overloading without discipline&lt;/td&gt;&lt;td&gt;Key prefixes become tribal knowledge; GSI write amplification explodes&lt;/td&gt;&lt;td&gt;Maintain a key catalog and tests for every access pattern&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Excessive denormalization&lt;/td&gt;&lt;td&gt;Every write updates many item shapes&lt;/td&gt;&lt;td&gt;Separate read models by workflow and accept asynchronous projection&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cross-aggregate transactions&lt;/td&gt;&lt;td&gt;Business invariants span many partitions&lt;/td&gt;&lt;td&gt;Reconsider whether DynamoDB is the system of record for that workflow&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multi-team ownership&lt;/td&gt;&lt;td&gt;Independent features mutate one physical table&lt;/td&gt;&lt;td&gt;Define table ownership or split bounded contexts&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
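&lt;p&gt;The write sharding response in the table can be sketched as follows. This is a minimal illustration under stated assumptions: the shard count, the simple string hash, and the key format with a numeric suffix are all choices made for this example, not DynamoDB features.&lt;/p&gt;

```javascript
// Sketch of write sharding for a hot key such as FULFILLMENT#OPEN.
// SHARD_COUNT, the hash, and the key format are assumptions for illustration.
const SHARD_COUNT = 8;

function shardFor(id, shardCount = SHARD_COUNT) {
  let hash = 0;
  for (const ch of String(id)) {
    hash = (hash * 31 + ch.codePointAt(0)) % 1000003; // small deterministic string hash
  }
  return hash % shardCount;
}

// Writers derive the shard from the item id, so the same item always lands on the same key.
const shardedKey = (baseKey, id) => `${baseKey}#${shardFor(id)}`;

// Readers must query every shard key and merge the results.
const allShardKeys = (baseKey, shardCount = SHARD_COUNT) =>
  Array.from({ length: shardCount }, (_, i) => `${baseKey}#${i}`);

console.log(shardedKey('FULFILLMENT#OPEN', '9001'));
console.log(allShardKeys('FULFILLMENT#OPEN'));
```

&lt;p&gt;The tradeoff is explicit: writes spread across several partition key values, while each queue read becomes one &lt;code&gt;Query&lt;/code&gt; per shard key.&lt;/p&gt;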
&lt;p&gt;The most dangerous failure is not a bad key name. It is a table whose operational contract is implicit.&lt;/p&gt;
&lt;p&gt;Once multiple services write different item types into the same table, the schema lives in application code, migration scripts, dashboards, and engineer memory. That can work for a disciplined platform team. It is painful for a fast-moving product surface without strong ownership.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; If your team cannot list the top access patterns, single-table design will force premature decisions into the physical schema.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Model requests first, then map each request to a primary key, sort key, index, or external projection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Verify every critical workflow with bounded &lt;code&gt;Query&lt;/code&gt; operations, conditional write tests, backfill rehearsal, and partition hot-spot analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Use single-table design for stable operational aggregates; use separate tables or read models when query discovery, analytics, or independent team ownership matters more than one-call retrieval.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>databases</category><category>cloud</category></item><item><title>MySQL EXPLAIN: Reading the Plan Without Guessing</title><link>https://rajivonai.com/blog/2022-06-06-mysql-explain-reading-the-plan/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-06-06-mysql-explain-reading-the-plan/</guid><description>How to read MySQL EXPLAIN output systematically — type column, key column, rows estimate, and Extra flags — so you stop adding indexes blindly.</description><pubDate>Mon, 06 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The most common mistake engineers make with &lt;code&gt;EXPLAIN&lt;/code&gt; is treating &lt;code&gt;type: ALL&lt;/code&gt; as an alarm that requires an index. It is a data point, not a verdict.&lt;/strong&gt; Whether a full scan is a problem depends on the rows estimate, the Extra flags, and what the optimizer decided to do with the indexes that already exist. Reading the plan systematically takes two minutes.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Every engineer who has investigated a slow query has seen &lt;code&gt;EXPLAIN&lt;/code&gt; output. Most can recognize the column names — &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;rows&lt;/code&gt;, &lt;code&gt;Extra&lt;/code&gt; — but not how to read them as a system.&lt;/p&gt;
&lt;p&gt;The common workflow is: see &lt;code&gt;type: ALL&lt;/code&gt;, add an index. That misses the reason the optimizer chose the plan it chose, and misses the cases where the new index will be ignored anyway. MySQL 8.0 added &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, which executes the query and returns actual row counts alongside estimates. The gap between those two numbers is often the real story.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Indexes do not guarantee the optimizer will use them. MySQL’s cost-based optimizer weighs index access cost against cardinality estimates. If those estimates suggest the index returns a large fraction of the table, the optimizer may choose a full scan instead. This behavior is documented: MySQL uses index dive estimates and InnoDB’s persistent statistics, stored in &lt;code&gt;mysql.innodb_table_stats&lt;/code&gt; and &lt;code&gt;mysql.innodb_index_stats&lt;/code&gt;, to make that call.&lt;/p&gt;
&lt;p&gt;When statistics are stale — after bulk loads, large deletes, or fast-growing tables — the optimizer’s row estimates can be wrong by an order of magnitude. A plan that looks safe in &lt;code&gt;EXPLAIN&lt;/code&gt; may be running against a table ten times larger.&lt;/p&gt;
&lt;p&gt;What does each column actually mean, and how do you read them together to know whether the optimizer’s choice was reasonable?&lt;/p&gt;
&lt;h2 id=&quot;how-to-read-explain-output&quot;&gt;How to Read EXPLAIN Output&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;EXPLAIN&lt;/code&gt; returns one row per table in the query, in the join order the optimizer chose. The columns that carry diagnostic weight are &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;rows&lt;/code&gt;, and &lt;code&gt;Extra&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;type&lt;/code&gt; column&lt;/strong&gt; describes the access method. From best to worst: &lt;code&gt;const&lt;/code&gt; (single-row primary key match), &lt;code&gt;eq_ref&lt;/code&gt; (one matching row per join from a unique index), &lt;code&gt;ref&lt;/code&gt; (non-unique index lookup), &lt;code&gt;range&lt;/code&gt; (bounded index scan), &lt;code&gt;index&lt;/code&gt; (full index scan), &lt;code&gt;ALL&lt;/code&gt; (full table scan). The useful breakpoint is between &lt;code&gt;range&lt;/code&gt; and &lt;code&gt;index&lt;/code&gt; — anything at &lt;code&gt;index&lt;/code&gt; or &lt;code&gt;ALL&lt;/code&gt; with a high &lt;code&gt;rows&lt;/code&gt; estimate is worth investigating.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;key&lt;/code&gt; column&lt;/strong&gt; shows which index the optimizer actually chose. If &lt;code&gt;key&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt; and &lt;code&gt;possible_keys&lt;/code&gt; lists candidates, the optimizer decided the available indexes were not selective enough to be worth using. That is the cardinality problem — not a missing index.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;rows&lt;/code&gt; column&lt;/strong&gt; is the optimizer’s estimate of how many rows it will examine to satisfy the query. For &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; (MySQL 8.0+), the output also shows &lt;code&gt;actual rows&lt;/code&gt; — the count from the real execution. A large gap between estimated and actual rows means statistics are stale. Run &lt;code&gt;ANALYZE TABLE tablename;&lt;/code&gt; to refresh them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;Extra&lt;/code&gt; column&lt;/strong&gt; carries execution flags. &lt;code&gt;Using filesort&lt;/code&gt; means MySQL sorted the result after retrieval — no index covers the &lt;code&gt;ORDER BY&lt;/code&gt;, and on large result sets this spills to disk. &lt;code&gt;Using temporary&lt;/code&gt; means an internal temp table was created, common with &lt;code&gt;GROUP BY&lt;/code&gt; on non-indexed columns. &lt;code&gt;Using index&lt;/code&gt; is a positive signal — a covering index served the query without touching table rows.&lt;/p&gt;
&lt;p&gt;Reading these together: &lt;code&gt;type: ALL&lt;/code&gt;, &lt;code&gt;rows: 4000000&lt;/code&gt;, &lt;code&gt;Extra: Using temporary; Using filesort&lt;/code&gt; means the optimizer scanned four million rows, built a temp table, and sorted it. That is not a statistics problem — that is a schema problem.&lt;/p&gt;
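&lt;p&gt;Reading the columns together can be sketched as a small triage function over one row of tabular &lt;code&gt;EXPLAIN&lt;/code&gt; output, represented as a plain object. The function name, the finding messages, and the 100,000-row threshold are assumptions for illustration; the column names match the &lt;code&gt;EXPLAIN&lt;/code&gt; output.&lt;/p&gt;

```javascript
// Sketch: triage one row of tabular EXPLAIN output, given as a plain object.
// The 100000-row threshold is an arbitrary assumption; tune it to your tables.
const ACCESS_TYPES = ['const', 'eq_ref', 'ref', 'range', 'index', 'ALL'];

function triageExplainRow(row, rowThreshold = 100000) {
  const findings = [];
  // "index" and "ALL" sit past the range/index breakpoint described above.
  const isFullScan = ACCESS_TYPES.indexOf(row.type) >= ACCESS_TYPES.indexOf('index');
  if (isFullScan) {
    if (row.rows > rowThreshold) {
      findings.push(`${row.type} access over an estimated ${row.rows} rows`);
    }
  }
  if (row.key === null) {
    if (row.possible_keys) findings.push('candidate indexes exist but the optimizer rejected them');
  }
  const extra = row.Extra || '';
  if (extra.includes('Using temporary')) findings.push('internal temporary table');
  if (extra.includes('Using filesort')) findings.push('sort not served by an index');
  return findings;
}

console.log(triageExplainRow({
  type: 'ALL',
  possible_keys: null,
  key: null,
  rows: 4000000,
  Extra: 'Using temporary; Using filesort',
}));
```

&lt;p&gt;An empty result does not prove the plan is good; it only means none of these coarse signals fired.&lt;/p&gt;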
&lt;p&gt;A concrete example with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on MySQL 8.0:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN ANALYZE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user_id, created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2022-01-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;\G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-&gt; Filter: ((orders.status = &apos;pending&apos;) and (orders.created_at &gt; &apos;2022-01-01&apos;))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;   (cost=48213.45 rows=45823)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;   (actual time=0.112..842.361 rows=12847 loops=1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;   -&gt; Table scan on orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;      (cost=48213.45 rows=458230)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;      (actual time=0.089..721.903 rows=458230 loops=1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;rows&lt;/code&gt; estimate (458,230 for the table scan) matches actual rows — statistics are current. But &lt;code&gt;actual time=842ms&lt;/code&gt; for a filter that returns 12,847 rows confirms the full scan is the problem: no index covers &lt;code&gt;(status, created_at)&lt;/code&gt;. Adding &lt;code&gt;idx_status_created (status, created_at)&lt;/code&gt; would reduce the scan to an index range lookup.&lt;/p&gt;
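&lt;p&gt;The estimate-versus-actual comparison can be automated with a rough parser. This sketch assumes the &lt;code&gt;(cost=... rows=N)&lt;/code&gt; and &lt;code&gt;(actual time=... rows=M loops=K)&lt;/code&gt; layout shown above and that every plan node was executed; &lt;code&gt;rowEstimateGaps&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```javascript
// Sketch: compare estimated and actual row counts in EXPLAIN ANALYZE text.
// Assumes every node reports both a cost and an actual section, in the same order.
function rowEstimateGaps(planText) {
  const estimates = [...planText.matchAll(/\(cost=[\d.]+ rows=(\d+)\)/g)]
    .map((m) => Number(m[1]));
  const actuals = [...planText.matchAll(/\(actual time=[\d.]+ rows=(\d+) loops=\d+\)/g)]
    .map((m) => Number(m[1]));
  return estimates.map((estimated, i) => {
    const actual = actuals[i];
    // Ratio of the larger count to the smaller one; 1 means the estimate was exact.
    const ratio = Math.max(estimated, actual) / Math.max(Math.min(estimated, actual), 1);
    return { estimated, actual, ratio };
  });
}

const plan = [
  "-> Filter: ((orders.status = 'pending') and (orders.created_at > '2022-01-01'))",
  '   (cost=48213.45 rows=45823)',
  '   (actual time=0.112..842.361 rows=12847 loops=1)',
  '   -> Table scan on orders',
  '      (cost=48213.45 rows=458230)',
  '      (actual time=0.089..721.903 rows=458230 loops=1)',
].join('\n');

console.log(rowEstimateGaps(plan));
```

&lt;p&gt;For the plan above, the table scan node has a ratio of 1, which matches the conclusion that statistics are current. A ratio in the tens or hundreds would point to &lt;code&gt;ANALYZE TABLE&lt;/code&gt; before any index change.&lt;/p&gt;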
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;Per the MySQL 8.0 Reference Manual, the optimizer chooses between an index range scan and a full table scan using InnoDB’s persistent statistics, which are stored in &lt;code&gt;mysql.innodb_table_stats&lt;/code&gt; and &lt;code&gt;mysql.innodb_index_stats&lt;/code&gt;. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, introduced in MySQL 8.0.18, returns both estimated and actual row counts per step. A large gap between the two is the clearest signal of stale statistics: estimated 500, actual 2,400,000 means the plan was optimized for a table that no longer exists.&lt;/p&gt;
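&lt;p&gt;To see what the optimizer is working from, you can read the persisted statistics directly. This is a sketch that assumes persistent statistics are enabled (the default) and uses the placeholder names &lt;code&gt;your_db&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Estimated row count and when it was last recalculated&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT table_name, n_rows, last_update&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM mysql.innodb_table_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE database_name = &apos;your_db&apos; AND table_name = &apos;orders&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Per-index statistics, including how many pages were sampled&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT index_name, stat_name, stat_value, sample_size, last_update&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM mysql.innodb_index_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE database_name = &apos;your_db&apos; AND table_name = &apos;orders&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A &lt;code&gt;last_update&lt;/code&gt; that predates your last bulk load, or an &lt;code&gt;n_rows&lt;/code&gt; far from the real row count, points to stale statistics rather than a missing index.&lt;/p&gt;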
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Stale statistics after bulk load&lt;/td&gt;&lt;td&gt;&lt;code&gt;rows&lt;/code&gt; estimate is far below actual; optimizer picks a plan sized for the old table&lt;/td&gt;&lt;td&gt;&lt;code&gt;innodb_stats_auto_recalc&lt;/code&gt; threshold (more than 10% of rows changed) was not met, or the background recalculation has not run yet; run &lt;code&gt;ANALYZE TABLE&lt;/code&gt; manually&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;JOIN order surprises&lt;/td&gt;&lt;td&gt;&lt;code&gt;type: ALL&lt;/code&gt; appears on a table you expected to be driven by an index&lt;/td&gt;&lt;td&gt;The optimizer’s cost model may reorder joins; the top-to-bottom order of rows in &lt;code&gt;EXPLAIN&lt;/code&gt; output, not the &lt;code&gt;id&lt;/code&gt; column, shows the actual join order&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Index ignored due to low cardinality&lt;/td&gt;&lt;td&gt;&lt;code&gt;possible_keys&lt;/code&gt; lists the index; &lt;code&gt;key&lt;/code&gt; is NULL&lt;/td&gt;&lt;td&gt;Column has few distinct values (boolean, status enum); optimizer’s index dive concluded the full scan was cheaper&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
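&lt;p&gt;For the low-cardinality row in particular, it can help to look at the real value distribution before deciding whether an index is worth having. A rough sketch, assuming the running &lt;code&gt;orders.status&lt;/code&gt; example (this query is itself a full scan, so a replica is the safer place to run it):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- How many rows does each status value cover?&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT status, COUNT(*) AS row_count&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;GROUP BY status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY row_count DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the value you filter on covers a large share of the table, the optimizer skipping the index is probably the right call; if it covers a small fraction, an index on that column is more likely to be used.&lt;/p&gt;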
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Engineers add indexes without confirming the optimizer will use them, because they read &lt;code&gt;type: ALL&lt;/code&gt; without reading &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;rows&lt;/code&gt;, and &lt;code&gt;Extra&lt;/code&gt; together.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Treat EXPLAIN output as a system — check &lt;code&gt;key&lt;/code&gt; first, then &lt;code&gt;rows&lt;/code&gt;, then &lt;code&gt;Extra&lt;/code&gt;, before drawing any conclusion about what is wrong.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on MySQL 8.0.18+. If actual rows diverge significantly from estimated rows, the statistics behind the plan are probably stale: run &lt;code&gt;ANALYZE TABLE&lt;/code&gt; and re-check before adding any index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, take one slow query your team has been discussing and run plain &lt;code&gt;EXPLAIN&lt;/code&gt; on it (the tabular form; &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; prints a tree without these columns). Read &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;rows&lt;/code&gt;, &lt;code&gt;Extra&lt;/code&gt; in order. Write one sentence describing what the optimizer decided. That sentence is more useful than a blind &lt;code&gt;CREATE INDEX&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>MySQL Slow Query Playbook: From Slow Log to Fix</title><link>https://rajivonai.com/blog/2022-05-23-mysql-slow-query-playbook-from-slow-log-to-fix/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-05-23-mysql-slow-query-playbook-from-slow-log-to-fix/</guid><description>A repeatable workflow for diagnosing MySQL slow queries — from enabling the slow log through reading EXPLAIN output to committing a safe fix.</description><pubDate>Mon, 23 May 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Most MySQL slowdowns have a short list of root causes: a missing index, a lock wait, or stale optimizer statistics. The hard part is not the fix — it is getting from “p99 alert fired” to “I know which query, why it is slow, and what the safe remediation is” without wasting an hour looking at the wrong thing.&lt;/strong&gt; This playbook gives you that path as a repeatable workflow. Run these checks in order, and you will have a diagnosis before you start guessing.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;The alert fires. Maybe it is a CloudWatch &lt;code&gt;SlowQueries&lt;/code&gt; metric spike on RDS, a p99 latency alarm from your application APM, or a PagerDuty page from a long-running query threshold. You open a terminal, connect to the database, and face the standard problem: MySQL is running dozens of queries per second, and you need to identify the one that is costing you.&lt;/p&gt;
&lt;p&gt;MySQL gives you several places to look — the slow query log, Performance Schema digest tables, &lt;code&gt;SHOW PROCESSLIST&lt;/code&gt;, and InnoDB status — and the right place to start depends on whether the problem is active right now or a pattern you are trying to reconstruct after the fact. This runbook covers both: active incidents where queries are blocking or running hot, and post-incident analysis where you need to find the pattern in aggregated data.&lt;/p&gt;
&lt;p&gt;The version context matters. MySQL 8.0.18 added &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, which gives actual row counts alongside estimated ones. If you are on MySQL 5.7 or a 5.7-compatible Aurora MySQL release (version 2), the same diagnostic steps apply, but you will use &lt;code&gt;EXPLAIN FORMAT=JSON&lt;/code&gt; without &lt;code&gt;ANALYZE&lt;/code&gt; for the execution plan.&lt;/p&gt;
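&lt;p&gt;A quick way to find out which of those paths applies is to check the server version before you start. A minimal sketch:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- 8.0.18 or later: EXPLAIN ANALYZE is available&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- 5.7.x: fall back to EXPLAIN FORMAT=JSON&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT VERSION();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;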
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;








































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Query_time&lt;/code&gt; &gt;&gt; &lt;code&gt;Lock_time&lt;/code&gt; in slow log entry&lt;/td&gt;&lt;td&gt;&lt;code&gt;slow_query_log_file&lt;/code&gt; or &lt;code&gt;mysqldumpslow&lt;/code&gt; output&lt;/td&gt;&lt;td&gt;Query is executing slowly independent of locking — likely index or scan issue&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High &lt;code&gt;Lock_time&lt;/code&gt; in slow log&lt;/td&gt;&lt;td&gt;Same source&lt;/td&gt;&lt;td&gt;Transaction waiting on a row lock before it can execute&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;rows_examined&lt;/code&gt; far exceeds &lt;code&gt;rows_sent&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Slow log entry or &lt;code&gt;events_statements_summary_by_digest&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Full or partial table scan — index not covering the WHERE clause&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Thread in &lt;code&gt;Waiting for table metadata lock&lt;/code&gt; state&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW PROCESSLIST&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Another connection holds a metadata lock, usually from an open transaction or an ALTER TABLE&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High &lt;code&gt;SUM_TIMER_WAIT&lt;/code&gt; for a specific digest&lt;/td&gt;&lt;td&gt;&lt;code&gt;performance_schema.events_statements_summary_by_digest&lt;/code&gt;&lt;/td&gt;&lt;td&gt;A specific query pattern accounts for most DB wall-clock time&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;LATEST DETECTED DEADLOCK&lt;/code&gt; section present&lt;/td&gt;&lt;td&gt;&lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Two transactions deadlocked; one was rolled back&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
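&lt;p&gt;For the metadata lock row, &lt;code&gt;SHOW PROCESSLIST&lt;/code&gt; tells you who is waiting but not who is blocking. One way to get both sides is the &lt;code&gt;sys.schema_table_lock_waits&lt;/code&gt; view. This sketch assumes the &lt;code&gt;sys&lt;/code&gt; schema is installed and metadata lock instrumentation is enabled in Performance Schema (on by default in 8.0, off by default in 5.7):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- One row per waiting session, paired with the session that blocks it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT object_schema, object_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;       waiting_pid, waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;       blocking_pid, sql_kill_blocking_connection&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM sys.schema_table_lock_waits;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;sql_kill_blocking_connection&lt;/code&gt; column contains a ready-made &lt;code&gt;KILL&lt;/code&gt; statement; treat it as a suggestion to review, not something to run blindly.&lt;/p&gt;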
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enable the slow query log and read it&lt;/strong&gt; — If the slow log is not already running, turn it on without a restart:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; GLOBAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; slow_query_log &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; GLOBAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; long_query_time &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; GLOBAL&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; log_output &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;FILE&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW VARIABLES &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIKE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;slow_query_log_file&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
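&lt;p&gt;A changed global &lt;code&gt;long_query_time&lt;/code&gt; applies only to connections opened after the change, so long-lived pooled connections keep the old threshold until they reconnect. Once entries accumulate, use &lt;code&gt;mysqldumpslow&lt;/code&gt; to aggregate them. The &lt;code&gt;-s t&lt;/code&gt; flag sorts by total time, which surfaces the queries with the most cumulative cost rather than just the single longest run:&lt;/p&gt;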
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;mysqldumpslow&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -s&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; t&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; -t&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/lib/mysql/hostname-slow.log&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each entry shows &lt;code&gt;Query_time&lt;/code&gt;, &lt;code&gt;Lock_time&lt;/code&gt;, &lt;code&gt;Rows_sent&lt;/code&gt;, and &lt;code&gt;Rows_examined&lt;/code&gt;. A &lt;code&gt;rows_examined / rows_sent&lt;/code&gt; ratio above 100 is a strong signal of a full or near-full table scan.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Find top queries by total time in Performance Schema&lt;/strong&gt; — For RDS or environments where you cannot read the log file directly, Performance Schema digest tables give the same aggregate view:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  DIGEST_TEXT,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  COUNT_STAR,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SUM_TIMER_WAIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000000000000&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; total_sec,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  AVG_TIMER_WAIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000000000000&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; avg_sec,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SUM_ROWS_EXAMINED,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SUM_ROWS_SENT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;events_statements_summary_by_digest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUM_TIMER_WAIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;DIGEST_TEXT&lt;/code&gt; column normalizes literals to &lt;code&gt;?&lt;/code&gt; placeholders, so you see the query pattern regardless of parameter values. Focus on rows where &lt;code&gt;SUM_ROWS_EXAMINED&lt;/code&gt; greatly exceeds &lt;code&gt;SUM_ROWS_SENT&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check current lock waits&lt;/strong&gt; — If the incident is active and threads are blocked, identify the blocking transaction immediately. On MySQL 8.0, use &lt;code&gt;performance_schema.data_lock_waits&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_trx_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_thread,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_trx_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_thread,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;data_lock_waits&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; w&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INNER JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;innodb_trx&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; b&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; b&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; w&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;blocking_engine_transaction_id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INNER JOIN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;innodb_trx&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; r&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; r&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;trx_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; w&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;requesting_engine_transaction_id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
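&lt;p&gt;The &lt;code&gt;blocking_query&lt;/code&gt; column often shows &lt;code&gt;NULL&lt;/code&gt;. This means the blocking transaction has already executed its statement and is sitting idle with an open transaction, holding row locks. Check &lt;code&gt;b.trx_started&lt;/code&gt; to see how long it has been open. On MySQL 5.7, where &lt;code&gt;data_lock_waits&lt;/code&gt; does not exist, the same join works against &lt;code&gt;information_schema.innodb_lock_waits&lt;/code&gt; using its &lt;code&gt;blocking_trx_id&lt;/code&gt; and &lt;code&gt;requesting_trx_id&lt;/code&gt; columns.&lt;/p&gt;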
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check index usage for the affected table&lt;/strong&gt; — The &lt;code&gt;sys&lt;/code&gt; schema surfaces unused indexes, which are candidates for removal, and lets you quickly see what indexes exist:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Indexes that have never been used since last server restart&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; sys&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schema_unused_indexes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; object_schema &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;your_db&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- All indexes on the table with cardinality&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INDEX&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; your_table;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Low &lt;code&gt;Cardinality&lt;/code&gt; on a column you are filtering by is a sign the index may not help the optimizer — or that statistics are stale and need updating. A &lt;code&gt;Cardinality&lt;/code&gt; of 1 on a column with millions of rows is usually wrong.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Get EXPLAIN for the slow query&lt;/strong&gt; — Once you have identified the query pattern, capture its execution plan. On MySQL 8.0, &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; runs the query and returns actual row counts alongside estimates:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- MySQL 8.0.18+: runs the query and returns actual vs estimated rows&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN ANALYZE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user_id, created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2022-01-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- All versions — returns JSON with full cost estimates&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN FORMAT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=JSON&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user_id, created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2022-01-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
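&lt;p&gt;In tabular &lt;code&gt;EXPLAIN&lt;/code&gt; output, look for &lt;code&gt;type: ALL&lt;/code&gt; (full table scan), &lt;code&gt;type: index&lt;/code&gt; (full index scan), &lt;code&gt;Extra: Using filesort&lt;/code&gt;, and &lt;code&gt;Extra: Using temporary&lt;/code&gt;; &lt;code&gt;FORMAT=JSON&lt;/code&gt; reports the same facts as &lt;code&gt;access_type&lt;/code&gt;, &lt;code&gt;using_filesort&lt;/code&gt;, and &lt;code&gt;using_temporary_table&lt;/code&gt;. Any of these signals a query that is doing more work than it needs to. The &lt;code&gt;rows&lt;/code&gt; column shows the optimizer’s estimate; &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; prints a tree instead, in which each step shows the estimated &lt;code&gt;rows&lt;/code&gt; next to an &lt;code&gt;actual time=... rows=...&lt;/code&gt; group recording what really happened.&lt;/p&gt;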
&lt;/li&gt;
&lt;/ol&gt;
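&lt;p&gt;For the lock-wait check in step 3, the &lt;code&gt;sys&lt;/code&gt; schema also ships a view that wraps the version-specific tables, which can be easier to remember during an incident. A sketch, assuming the &lt;code&gt;sys&lt;/code&gt; schema is installed (it is bundled with MySQL 5.7 and 8.0):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Longest-waiting sessions first, each paired with its blocker&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT wait_started, wait_age, locked_table,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;       waiting_pid, waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;       blocking_pid, blocking_query, blocking_trx_started,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;       sql_kill_blocking_connection&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM sys.innodb_lock_waits&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY wait_age_secs DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;blocking_pid&lt;/code&gt; is the processlist ID you would pass to &lt;code&gt;KILL&lt;/code&gt;, and &lt;code&gt;blocking_trx_started&lt;/code&gt; shows how long the blocker has held its transaction open.&lt;/p&gt;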
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Slow query alert fires] --&gt; B{rows_examined far exceeds rows_sent?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| C[Check EXPLAIN for full scan or wrong index]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; D{type=ALL or index in EXPLAIN?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| E[Add or modify index based on WHERE clause]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|no| F[Check for filesort or temporary table in Extra]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| G{lock_time high in slow log?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|yes| H[Query data_lock_waits for blocking thread]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt; I[Kill blocking thread or wait for commit]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|no| J{Query recently regressed?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|yes| K{Cardinality looks wrong in SHOW INDEX?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|yes| L[Run ANALYZE TABLE to refresh statistics]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|no| M[Check for schema change or data distribution shift]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    J --&gt;|no| N{I/O bound — buffer pool hit rate low?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    N --&gt;|yes| O[Check innodb_buffer_pool hit rate and increase if possible]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    N --&gt;|no| P[Profile with Performance Schema events_stages_summary]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this diagram shows:&lt;/strong&gt; A MySQL slow query decision tree — starting with the rows_examined/rows_sent ratio to detect full scans, then lock_time for blocking threads, cardinality estimates for stale statistics, and buffer pool hit rate for I/O saturation — each branch leads to a specific actionable fix.&lt;/p&gt;
&lt;/blockquote&gt;
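&lt;p&gt;The last branch of the tree depends on the buffer pool hit rate, which MySQL does not expose as a single value. One common approximation, sketched here from two status counters, is the share of logical read requests that did not need a disk read:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Counters are cumulative since server start, so this is a long-run average&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT 1 - (r.VARIABLE_VALUE / NULLIF(q.VARIABLE_VALUE, 0)) AS buffer_pool_hit_rate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM performance_schema.global_status r&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;CROSS JOIN performance_schema.global_status q&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE r.VARIABLE_NAME = &apos;Innodb_buffer_pool_reads&apos;           -- reads that had to go to disk&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  AND q.VARIABLE_NAME = &apos;Innodb_buffer_pool_read_requests&apos;;  -- all logical read requests&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the counters accumulate from startup, a recent regression can hide behind a good lifetime average; sampling the two counters twice and comparing the deltas gives a more honest picture of the incident window.&lt;/p&gt;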
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Add or modify an index based on EXPLAIN output&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When &lt;code&gt;EXPLAIN&lt;/code&gt; shows &lt;code&gt;type: ALL&lt;/code&gt; or the optimizer is choosing an index that does not cover the WHERE clause, the fix is usually a covering index that includes all columns referenced in the WHERE, ORDER BY, and SELECT list. In MySQL 8.0, &lt;code&gt;ALTER TABLE ... ADD INDEX&lt;/code&gt; uses online DDL by default, which means reads and writes continue while the index is built; the statement still needs a brief metadata lock at the start and end, so a long-running transaction on the table can delay it:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Add a covering index for the query above&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ADD&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_status_created_user (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, created_at, user_id);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify the optimizer uses it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user_id, created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;2022-01-01&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Column order in the index matters. MySQL’s B-tree indexes support leftmost prefix matching — the optimizer can use &lt;code&gt;(status, created_at)&lt;/code&gt; for a filter on &lt;code&gt;status&lt;/code&gt; alone, but it cannot use &lt;code&gt;(created_at, status)&lt;/code&gt; for a filter on &lt;code&gt;status&lt;/code&gt; alone. Put the equality predicates first, range predicates last.&lt;/p&gt;
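&lt;p&gt;You can see the leftmost-prefix rule at work by comparing two plans. This sketch assumes the &lt;code&gt;idx_status_created_user&lt;/code&gt; index from the example above already exists; the comments describe what you would typically see, not guaranteed output:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Leading column matched: typically key = idx_status_created_user, type = ref&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;EXPLAIN SELECT user_id FROM orders WHERE status = &apos;pending&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Leading column skipped: typically a full index scan (type = index) or,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- on 8.0.13+, possibly a skip scan, but not an ordinary range seek on created_at&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;EXPLAIN SELECT user_id FROM orders WHERE created_at &gt; &apos;2022-01-01&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;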
&lt;p&gt;&lt;strong&gt;Option 2 — Update statistics with ANALYZE TABLE&lt;/strong&gt;&lt;/p&gt;
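&lt;p&gt;When the optimizer chooses a bad plan despite a suitable index, the cause is often stale statistics. This happens after large data loads or bulk deletes, or when a table has grown significantly since the last statistics update. On InnoDB, &lt;code&gt;ANALYZE TABLE&lt;/code&gt; samples index pages rather than scanning the whole table and does not block reads or writes while it runs, so it is generally safe in production. On older MySQL versions it can cause new queries on the table to queue behind a long-running one, so check for long-running queries first:&lt;/p&gt;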
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify cardinality updated&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INDEX&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;According to the MySQL 8.0 Reference Manual, InnoDB calculates index statistics by sampling random index pages. With persistent statistics (the default), &lt;code&gt;innodb_stats_persistent_sample_pages&lt;/code&gt; controls the sample size (default 20); it replaces the older &lt;code&gt;innodb_stats_sample_pages&lt;/code&gt;. If your table has an extremely skewed data distribution, increasing this value can improve plan quality at the cost of more I/O during the statistics update.&lt;/p&gt;
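&lt;p&gt;A minimal sketch of raising the sample size for a single skewed table instead of changing it server-wide. The value 64 is illustrative only, and &lt;code&gt;orders&lt;/code&gt; is the example table used above; the right value depends on table size and skew.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Server-wide sample size used for persistent statistics&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT @@innodb_stats_persistent_sample_pages;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Per-table override (64 is an assumed, illustrative value)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ALTER TABLE orders STATS_SAMPLE_PAGES = 64;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Recalculate statistics using the new sample size&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ANALYZE TABLE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A per-table setting keeps the extra sampling I/O confined to the table that needs it.&lt;/p&gt;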
&lt;p&gt;&lt;strong&gt;Option 3 — Kill the blocking transaction&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When lock waits are causing the slowdown, the fastest resolution is to identify and kill the blocking thread. Use the blocking thread ID from the lock wait query in Check 3:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Show full information about the blocking thread&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; information_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;processlist&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;blocking_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Kill it (this rolls back the blocking transaction)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;KILL&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;blocking_thread_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;KILL&lt;/code&gt; (shorthand for &lt;code&gt;KILL CONNECTION&lt;/code&gt;) flags the thread for termination. The thread notices the flag at its next check, rolls back its current transaction, and closes the connection; rolling back a transaction that modified many rows can itself take time. This is the correct tool for a long-running idle transaction holding row locks, rather than a hard connection reset. After killing, verify that the waiting queries resume with &lt;code&gt;SHOW PROCESSLIST&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adding an index&lt;/strong&gt; — Reversible at any time with &lt;code&gt;DROP INDEX&lt;/code&gt;. The online DDL used in MySQL 8.0 InnoDB means the add is also reversible mid-execution by canceling the ALTER (though partial progress is lost and the operation must restart). To remove: &lt;code&gt;ALTER TABLE orders DROP INDEX idx_status_created_user;&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ANALYZE TABLE&lt;/strong&gt;: No rollback needed. &lt;code&gt;ANALYZE TABLE&lt;/code&gt; updates statistics but does not change data. If the new statistics produce a worse plan, you can hint the optimizer with &lt;code&gt;USE INDEX (index_name)&lt;/code&gt; as a temporary workaround while investigating the plan regression. With &lt;code&gt;innodb_stats_auto_recalc&lt;/code&gt; enabled (the default), InnoDB also recalculates persistent statistics on its own after roughly 10% of a table’s rows have changed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;KILL thread&lt;/strong&gt;: The killed transaction is rolled back. There is no undo for the kill itself, so the work that transaction had done is lost. Before killing, check &lt;code&gt;trx_query&lt;/code&gt; and &lt;code&gt;trx_rows_modified&lt;/code&gt; in &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; to understand what the transaction was doing. For a long-running OLAP query that was only reading, the only cost is rerunning the query. For a transaction in the middle of writes, the application will see a lost connection error and should retry.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
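&lt;p&gt;The pre-kill check in item 3 can be sketched as a single query against &lt;code&gt;information_schema.innodb_trx&lt;/code&gt;. The column alias &lt;code&gt;age_sec&lt;/code&gt; and the limit of 10 are illustrative choices, not part of the playbook:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Oldest open InnoDB transactions first; trx_mysql_thread_id is the id that KILL expects&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  trx_mysql_thread_id,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  trx_state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  trx_started,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_sec,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  trx_rows_modified,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  trx_rows_locked,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  trx_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM information_schema.innodb_trx&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY trx_started&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A large &lt;code&gt;trx_rows_modified&lt;/code&gt; suggests the rollback after &lt;code&gt;KILL&lt;/code&gt; will take a while; a &lt;code&gt;NULL&lt;/code&gt; &lt;code&gt;trx_query&lt;/code&gt; on an old transaction usually means the session is idle inside an open transaction.&lt;/p&gt;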
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;The diagnosis steps in this playbook can be partially automated with two tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Percona Toolkit’s &lt;code&gt;pt-query-digest&lt;/code&gt;&lt;/strong&gt; processes slow log files and produces an aggregated report sorted by total time, showing query patterns and execution statistics (and EXPLAIN output when run with &lt;code&gt;--explain&lt;/code&gt;). It is widely used for batch slow log analysis:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;bash&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pt-query-digest&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/lib/mysql/hostname-slow.log&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; digest_report.txt&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#B392F0&quot;&gt;pt-query-digest&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; --since=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;1h&apos;&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; /var/lib/mysql/hostname-slow.log&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Percona Toolkit is open-source and documented at &lt;a href=&quot;https://www.percona.com/software/database-tools/percona-toolkit&quot;&gt;percona.com/software/database-tools/percona-toolkit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Trending with Performance Schema&lt;/strong&gt; — The digest table retains aggregated data across the server’s uptime. A scheduled query that snapshots &lt;code&gt;SUM_TIMER_WAIT&lt;/code&gt; and &lt;code&gt;COUNT_STAR&lt;/code&gt; into a monitoring table every 5 minutes gives you a trend line for query cost over time, which is more useful than a point-in-time alert:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Snapshot top 20 digests into a monitoring table every 5 minutes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; perf_snapshots (captured_at, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;digest&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, total_sec, call_count)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  NOW&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  DIGEST_TEXT,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  SUM_TIMER_WAIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000000000000&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  COUNT_STAR&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;events_statements_summary_by_digest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; SUM_TIMER_WAIT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
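&lt;p&gt;The snapshot above needs a target table and a scheduler. One possible setup, assuming the MySQL event scheduler is enabled (&lt;code&gt;SELECT @@event_scheduler;&lt;/code&gt;), is sketched below. The column types and the event name &lt;code&gt;snapshot_top_digests&lt;/code&gt; are assumptions for illustration; a cron job running the same &lt;code&gt;INSERT&lt;/code&gt; would work equally well.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Target table for the snapshots (column types are assumed, not prescribed)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;CREATE TABLE IF NOT EXISTS perf_snapshots (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  captured_at DATETIME NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  digest      LONGTEXT,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  total_sec   DECIMAL(20,6),&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  call_count  BIGINT UNSIGNED,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  KEY idx_captured_at (captured_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Run the snapshot every 5 minutes via the event scheduler&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;CREATE EVENT IF NOT EXISTS snapshot_top_digests&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  ON SCHEDULE EVERY 5 MINUTE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  DO&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;    INSERT INTO perf_snapshots (captured_at, digest, total_sec, call_count)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;    SELECT NOW(), DIGEST_TEXT, SUM_TIMER_WAIT / 1000000000000, COUNT_STAR&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;    FROM performance_schema.events_statements_summary_by_digest&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;    ORDER BY SUM_TIMER_WAIT DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;    LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the digest counters are cumulative, the trend for a query is the difference between consecutive snapshots, not the raw values.&lt;/p&gt;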
&lt;p&gt;On RDS, the &lt;code&gt;SlowQueries&lt;/code&gt; CloudWatch metric counts queries exceeding &lt;code&gt;long_query_time&lt;/code&gt; per minute. Set an alarm at a threshold above your baseline (e.g., more than 5 slow queries per minute) to trigger early before p99 latency is customer-visible.&lt;/p&gt;
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;A database query exceeded the response time threshold, causing elevated p99 latency visible in application monitoring.&lt;/li&gt;
&lt;li&gt;The slow query was identified using Performance Schema digest tables and the slow query log; root cause was a missing index causing a full table scan. The index was added using online DDL with no downtime.&lt;/li&gt;
&lt;li&gt;Automated slow query alerting via CloudWatch and a scheduled Performance Schema snapshot prevents undetected regressions going forward.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Confirm &lt;code&gt;slow_query_log = ON&lt;/code&gt; and &lt;code&gt;long_query_time&lt;/code&gt; is set to a meaningful threshold (the server default is 10 seconds; 1 second is a common production choice, 0.5 on high-volume OLTP).&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;mysqldumpslow -s t -t 10&lt;/code&gt; on the slow log to identify the top queries by total time.&lt;/li&gt;
&lt;li&gt;Query &lt;code&gt;performance_schema.events_statements_summary_by_digest&lt;/code&gt; sorted by &lt;code&gt;SUM_TIMER_WAIT DESC&lt;/code&gt; to confirm the same pattern.&lt;/li&gt;
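&lt;li&gt;Check &lt;code&gt;sys.innodb_lock_waits&lt;/code&gt; (backed by &lt;code&gt;performance_schema.data_lock_waits&lt;/code&gt; on MySQL 8.0, where &lt;code&gt;information_schema.innodb_lock_waits&lt;/code&gt; no longer exists) for any active lock waits involving the slow query’s table.&lt;/li&gt;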
&lt;li&gt;Run &lt;code&gt;SHOW INDEX FROM &amp;#x3C;table&gt;&lt;/code&gt; and check &lt;code&gt;Cardinality&lt;/code&gt; values — anomalously low values indicate stale statistics.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;EXPLAIN FORMAT=JSON&lt;/code&gt; (or &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on MySQL 8.0.18+) on the identified query and look for &lt;code&gt;type: ALL&lt;/code&gt;, &lt;code&gt;Using filesort&lt;/code&gt;, and &lt;code&gt;Using temporary&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If a full scan is confirmed, design a covering index that places equality predicates first and range predicates last, then test with &lt;code&gt;EXPLAIN&lt;/code&gt; before adding.&lt;/li&gt;
&lt;li&gt;If lock contention is confirmed, identify the blocking thread using &lt;code&gt;sys.innodb_lock_waits&lt;/code&gt; and decide whether to kill it based on transaction age and &lt;code&gt;trx_rows_modified&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If plan is bad despite good indexes, run &lt;code&gt;ANALYZE TABLE&lt;/code&gt; to refresh InnoDB statistics.&lt;/li&gt;
&lt;li&gt;After adding an index, re-run the original query under load and verify &lt;code&gt;rows_examined&lt;/code&gt; drops to near &lt;code&gt;rows_sent&lt;/code&gt; in the slow log.&lt;/li&gt;
&lt;li&gt;Set up a CloudWatch alarm on &lt;code&gt;SlowQueries&lt;/code&gt; above baseline, or configure a Performance Schema snapshot job to trend query cost over time.&lt;/li&gt;
&lt;li&gt;Document the root cause, the index added, and the cardinality values before and after for the incident record.&lt;/li&gt;
&lt;/ol&gt;
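&lt;p&gt;Steps 1 and 4 of the checklist can be sketched as two queries. This assumes the &lt;code&gt;sys&lt;/code&gt; schema is installed (it ships with MySQL 5.7 and 8.0); the column aliases are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Step 1: confirm slow query logging is on and see the active threshold&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  @@global.slow_query_log      AS slow_log_on,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  @@global.long_query_time     AS threshold_sec,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  @@global.slow_query_log_file AS log_file;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Step 4: list current InnoDB lock waits with the blocking session for each&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  wait_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  locked_table,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  waiting_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  waiting_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  blocking_trx_rows_modified&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM sys.innodb_lock_waits;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An empty result from the second query means no session is currently waiting on an InnoDB row lock, so lock contention can be ruled out for the moment.&lt;/p&gt;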
&lt;h2 id=&quot;what-this-post-does-not-cover&quot;&gt;What This Post Does Not Cover&lt;/h2&gt;
&lt;p&gt;This post covers identifying and resolving an active slow query in MySQL or Aurora MySQL. It does not cover: InnoDB full-text search tuning, ProxySQL query routing and query cache invalidation, Aurora Serverless v2 capacity scaling behavior during query spikes, or MySQL Group Replication lag as a driver of secondary read slowness. Those are distinct triage paths.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: When a slow query alert fires, engineers waste time looking at the wrong signal — checking instance CPU when the real cause is a missing index, or tuning configuration when lock contention is blocking a single thread.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Run the five checks in order — slow log, Performance Schema digest, lock waits, index cardinality, EXPLAIN — before touching any configuration or schema. Each check either confirms the cause or narrows it to the next step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After applying the fix, &lt;code&gt;rows_examined&lt;/code&gt; drops to within 2× of &lt;code&gt;rows_sent&lt;/code&gt; in the slow log and &lt;code&gt;SUM_TIMER_WAIT&lt;/code&gt; for the affected digest falls out of the top-10 list.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, confirm &lt;code&gt;slow_query_log = ON&lt;/code&gt; and &lt;code&gt;long_query_time &amp;#x3C;= 1&lt;/code&gt; on every production MySQL instance, and set a CloudWatch &lt;code&gt;SlowQueries&lt;/code&gt; alarm above your normal baseline so the next regression is detected before it reaches p99 latency.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item><item><title>MySQL InnoDB Buffer Pool: The First Thing to Check</title><link>https://rajivonai.com/blog/2022-05-09-mysql-innodb-buffer-pool-the-first-thing-to-check/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-05-09-mysql-innodb-buffer-pool-the-first-thing-to-check/</guid><description>The InnoDB buffer pool hit ratio and size are the first metrics to verify on any MySQL server — a default 128MB pool on a 32GB machine sends every query to disk.</description><pubDate>Mon, 09 May 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The InnoDB buffer pool is MySQL’s most important tuning knob, and it ships with a default that is wrong for almost every production server.&lt;/strong&gt; On a dedicated 32 GB database host, the default &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; is 128 MB. Every page that does not fit in that 128 MB goes to disk. The result is predictable: IOPS saturate, query latency climbs, and the server looks overloaded even at modest traffic levels.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;InnoDB is a disk-based storage engine. It caches data pages, index pages, and undo information in the buffer pool — a region of RAM managed entirely by the engine. When a query reads a row, InnoDB first checks the buffer pool. A hit means the row is returned from memory. A miss means InnoDB issues a read from the underlying block device, which costs orders of magnitude more time.&lt;/p&gt;
&lt;p&gt;On a freshly provisioned MySQL server, &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; defaults to 128 MB. That number was chosen for embedded and low-memory deployments. It has nothing to do with what a production workload needs. Engineers who inherit a server and do not check this setting often spend weeks chasing index problems, connection pool tuning, and query rewrites that cannot fix a fundamentally undersized memory tier.&lt;/p&gt;
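&lt;p&gt;Checking the setting takes one query. A minimal sketch (the aliases are illustrative):&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Configured buffer pool size in MB, plus the number of pool instances&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  @@innodb_buffer_pool_instances          AS instances;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A result of 128 MB on a host with tens of gigabytes of RAM means the server is still running with the default.&lt;/p&gt;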
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;When the buffer pool is too small for the active working set, InnoDB continuously evicts pages to make room for new reads. Every evicted page that is needed again becomes a physical disk read. At high request rates, that eviction cycle saturates storage I/O, drives up query latency, and eventually limits throughput entirely.&lt;/p&gt;
&lt;p&gt;The failure is not subtle. IOPS on the storage volume spike to near its limit. Query latency climbs. CPU stays moderate because the bottleneck is I/O wait, not compute. SHOW ENGINE INNODB STATUS reports high physical reads per second. The standard diagnostic path — look at slow query log, add indexes, tune joins — does not help because the bottleneck is upstream of query execution.&lt;/p&gt;
&lt;p&gt;The core question is simple: does the buffer pool hold your working set, or is MySQL going to disk for a large share of its page reads?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;InnoDB divides the buffer pool into pages (16 KB by default). It manages those pages using a modified LRU algorithm: newly read pages are inserted at a midpoint of the list rather than at the head, pages that are accessed again move toward the head, and pages that have not been touched are evicted from the tail when space is needed. A read-ahead mechanism pre-fetches sequential pages during full scans. That is useful for analytics queries, but it becomes a source of unnecessary eviction pressure when it floods the pool with pages that will not be reused.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Query[Client Query] --&gt; Engine[InnoDB Storage Engine]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Engine --&gt; Check{Page in Buffer Pool}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Check --&gt;|Hit| HitNode[Return Row from Memory]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Check --&gt;|Miss| MissNode[Read Page from Disk]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    MissNode --&gt; Load[Insert Page at LRU Midpoint]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Load --&gt; Evict[Evict Page from LRU Tail if Full]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    Evict --&gt; HitNode&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
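&lt;p&gt;The LRU and read-ahead behavior described above is exposed through a few server variables and status counters. A quick way to inspect them, as a sketch:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- LRU tuning: share of the pool kept for not-yet-reused pages, and how long&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- a page must stay there before a second access can promote it&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SHOW GLOBAL VARIABLES LIKE &apos;innodb_old_blocks%&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Read-ahead trigger threshold&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SHOW GLOBAL VARIABLES LIKE &apos;innodb_read_ahead_threshold&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Pages pre-fetched by read-ahead, and pages pre-fetched but evicted unused&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SHOW GLOBAL STATUS LIKE &apos;Innodb_buffer_pool_read_ahead%&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;Innodb_buffer_pool_read_ahead_evicted&lt;/code&gt; grows quickly relative to &lt;code&gt;Innodb_buffer_pool_read_ahead&lt;/code&gt;, read-ahead is loading pages that are thrown away before anything uses them.&lt;/p&gt;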
&lt;p&gt;&lt;strong&gt;Checking hit ratio and sizing:&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Buffer pool metrics&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;STATUS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LIKE&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;Innodb_buffer_pool%&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key metrics:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;What it measures&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Innodb_buffer_pool_read_requests&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Logical reads attempted from the pool&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Innodb_buffer_pool_reads&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Physical reads from disk (pool misses)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Innodb_buffer_pool_pages_data&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Pages currently holding data&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Innodb_buffer_pool_pages_free&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Pages available for new data&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Hit ratio formula:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  (&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; -&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    variable_value &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; variable_value &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;global_status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;     WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; variable_name &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;Innodb_buffer_pool_read_requests&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  )) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; buffer_pool_hit_ratio_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; performance_schema&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;global_status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; variable_name &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;Innodb_buffer_pool_reads&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A healthy server runs above 99%. Below 95% is a strong signal that the pool is undersized for the workload.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sizing guidance from MySQL InnoDB documentation:&lt;/strong&gt; set &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; to 70–80% of available RAM on a dedicated MySQL server. On a 32 GB server, that is 22–25 GB. On a 64 GB server, 45–50 GB.&lt;/p&gt;
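&lt;p&gt;The percentage guideline says how much memory to give the pool; it does not say whether that is enough. A rough sanity check, sketched below, compares total InnoDB data and index size with the configured pool. Total size is only an upper bound on the working set, so treat the result as a starting point; the aliases are illustrative.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Total InnoDB data + index size versus the configured buffer pool, in GB&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS innodb_data_gb,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  ROUND(@@innodb_buffer_pool_size / POW(1024, 3), 1)       AS buffer_pool_gb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM information_schema.tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE engine = &apos;InnoDB&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the data is many times larger than the pool, the hit ratio is the better guide to whether the hot subset fits.&lt;/p&gt;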
&lt;p&gt;&lt;strong&gt;Multiple instances:&lt;/strong&gt; For multi-core servers where the buffer pool is 1 GB or larger, MySQL documentation recommends choosing &lt;code&gt;innodb_buffer_pool_instances&lt;/code&gt; so that each instance is at least 1 GB (the maximum is 64 instances; the MySQL 8.0 default is 8). Multiple instances reduce internal mutex contention on the pool itself.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;ini&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;# /etc/mysql/mysql.conf.d/mysqld.cnf&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;innodb_buffer_pool_size&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 24G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;innodb_buffer_pool_instances&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; = 24&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
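&lt;p&gt;Changing &lt;code&gt;innodb_buffer_pool_instances&lt;/code&gt; requires a server restart. On MySQL 5.7.5 and later, &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; can be resized online with &lt;code&gt;SET GLOBAL&lt;/code&gt;; the resize happens in units of &lt;code&gt;innodb_buffer_pool_chunk_size&lt;/code&gt; and can briefly stall access to the pool. For large changes, a coordinated restart is safer.&lt;/p&gt;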
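&lt;p&gt;A sketch of an online resize on MySQL 5.7.5 or later, using the 24 GB target from the configuration above. MySQL rounds the requested size to a multiple of &lt;code&gt;innodb_buffer_pool_chunk_size&lt;/code&gt; × &lt;code&gt;innodb_buffer_pool_instances&lt;/code&gt;, so the final value may differ slightly from the request.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Request a 24 GB pool without restarting the server&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SET GLOBAL innodb_buffer_pool_size = 24 * 1024 * 1024 * 1024;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Follow the progress of the resize operation&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SHOW STATUS LIKE &apos;Innodb_buffer_pool_resize_status&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Confirm the size that was actually applied, in GB&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT @@innodb_buffer_pool_size / POW(1024, 3) AS buffer_pool_gb;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SET GLOBAL&lt;/code&gt; does not survive a restart. Update the configuration file as well, or use &lt;code&gt;SET PERSIST&lt;/code&gt; on MySQL 8.0.&lt;/p&gt;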
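&lt;p&gt;&lt;strong&gt;SHOW ENGINE INNODB STATUS&lt;/strong&gt; provides additional diagnostics in the &lt;code&gt;BUFFER POOL AND MEMORY&lt;/code&gt; section, including pages read, pages written, pending reads, and the buffer pool hit rate. The hit rate is printed as a fraction of 1000 (for example &lt;code&gt;998 / 1000&lt;/code&gt;) and covers only the interval since the previous status output, not the whole uptime.&lt;/p&gt;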
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The MySQL 8.0 Reference Manual (chapter “InnoDB Buffer Pool”) describes the buffer pool as the primary memory structure controlling InnoDB I/O performance. It notes that on dedicated servers up to about 80% of physical memory is often assigned to the buffer pool, which is the basis for the 70–80% guideline used here. The 128 MB default is a conservative starting value, not a production recommendation.&lt;/p&gt;
&lt;p&gt;Buffer pool undersizing shows up directly in &lt;code&gt;SHOW STATUS&lt;/code&gt; output and in &lt;code&gt;performance_schema.global_status&lt;/code&gt;: the ratio of &lt;code&gt;Innodb_buffer_pool_reads&lt;/code&gt; to &lt;code&gt;Innodb_buffer_pool_read_requests&lt;/code&gt; reflects how often the server falls through to disk. If more than 1–2% of read requests become physical reads, compare the pool size against the working set.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Working set grows beyond pool size&lt;/td&gt;&lt;td&gt;Hit ratio drops; IOPS spike&lt;/td&gt;&lt;td&gt;Eviction cycle exceeds storage bandwidth&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Buffer pool sized too large on a shared host&lt;/td&gt;&lt;td&gt;OS swap pressure; latency spikes&lt;/td&gt;&lt;td&gt;MySQL takes memory the OS needed for file cache&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Many small short-lived transactions&lt;/td&gt;&lt;td&gt;Pool fragmented with small dirty pages&lt;/td&gt;&lt;td&gt;Checkpoint pressure increases; write amplification grows&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: The buffer pool is left at the default 128 MB on a production server, so most page reads miss the cache, go to disk, and saturate storage I/O.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; to 70–80% of RAM on dedicated servers; choose &lt;code&gt;innodb_buffer_pool_instances&lt;/code&gt; so that each instance is at least 1 GB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run &lt;code&gt;SHOW STATUS LIKE &apos;Innodb_buffer_pool%&apos;&lt;/code&gt; before and after the resize and verify the hit ratio climbs above 99%. &lt;code&gt;Innodb_buffer_pool_reads&lt;/code&gt; is a cumulative counter, so watch its rate of increase fall toward zero.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, calculate the current hit ratio using the formula above. If it is below 99%, check the configured pool size and compare it against the server’s total RAM.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The buffer pool is not a performance optimization — it is the baseline. Everything else in InnoDB tuning assumes the working set fits in memory. If it does not, no amount of index work or query rewriting closes the gap.&lt;/p&gt;</content:encoded><category>databases</category></item><item><title>PostgreSQL Autovacuum: What Every Engineer Should Know</title><link>https://rajivonai.com/blog/2022-04-11-postgresql-autovacuum-what-every-engineer-should-know/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-04-11-postgresql-autovacuum-what-every-engineer-should-know/</guid><description>Autovacuum is not optional maintenance — it is the mechanism that prevents table bloat and transaction ID wraparound from taking your database offline.</description><pubDate>Mon, 11 Apr 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Autovacuum is not a background nicety. It is the process that keeps PostgreSQL’s MVCC machinery from accumulating dead tuples until the table is unreadable, and the process that prevents transaction ID wraparound — a condition where PostgreSQL freezes all writes and forces an emergency vacuum on the entire cluster.&lt;/strong&gt; Treating autovacuum as optional, throttling it too hard on OLTP servers, or simply not knowing what its thresholds mean is one of the most common ways production PostgreSQL clusters degrade over months before anyone notices.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL uses multi-version concurrency control (MVCC). When a row is updated or deleted, PostgreSQL does not overwrite it in place — it marks the old row version as dead and writes a new version. The dead row versions (dead tuples) accumulate on disk and remain visible to old transactions that might still need them. This is what makes non-blocking reads possible: readers never block writers, and writers never block readers.&lt;/p&gt;
&lt;p&gt;But dead tuples cost disk space, and they slow down sequential scans because the storage engine has to skip over them. At the extreme end, transaction IDs are 32-bit integers — after about 2 billion transactions, PostgreSQL will wrap around and enter a state where it cannot guarantee which data is old and which is new. To prevent corruption, PostgreSQL will refuse all writes and force a full-cluster VACUUM FREEZE.&lt;/p&gt;
&lt;p&gt;Autovacuum is the background daemon that reclaims dead tuples and advances the freeze horizon before either of these problems becomes a crisis.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The default autovacuum thresholds are designed for small-to-medium tables. Autovacuum vacuums a table once its dead tuple count exceeds:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; and &lt;code&gt;autovacuum_vacuum_threshold = 50&lt;/code&gt; (the defaults), autovacuum triggers a VACUUM once dead tuples exceed 20% of the live row count plus 50. On a table with 1,000 rows, this fires after about 250 dead tuples, which is reasonable. On a table with 50 million rows, it fires only after roughly 10 million dead tuples have accumulated. That is a lot of bloat before the cleanup runs.&lt;/p&gt;
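&lt;p&gt;To see how far each table is from that trigger point, a rough sketch like the following can help. It assumes the server-wide settings apply (per-table overrides are ignored) and uses &lt;code&gt;n_live_tup&lt;/code&gt; as a stand-in for the row estimate autovacuum actually consults, so treat &lt;code&gt;approx_trigger_point&lt;/code&gt; (a column name chosen here for illustration) as an approximation:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  -- threshold + scale_factor * live rows, using the global settings only&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  round(current_setting(&apos;autovacuum_vacuum_threshold&apos;)::numeric&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;    + current_setting(&apos;autovacuum_vacuum_scale_factor&apos;)::numeric * n_live_tup) AS approx_trigger_point&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY n_dead_tup DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tables where &lt;code&gt;n_dead_tup&lt;/code&gt; sits far below a very large trigger point are the ones carrying bloat that autovacuum has not yet been asked to clean.&lt;/p&gt;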
&lt;p&gt;High-write tables — event logs, audit trails, queues, sessions — accumulate dead tuples faster than autovacuum can clear them at the default settings. The table grows. Indexes bloat. Query plans drift toward sequential scans. The system appears slow without an obvious cause, and the only way to recover is an explicit VACUUM or, worse, a VACUUM FULL (which rewrites the entire table and requires an exclusive lock).&lt;/p&gt;
&lt;p&gt;The core question: how do you tune autovacuum before table bloat becomes a production incident?&lt;/p&gt;
&lt;h2 id=&quot;how-autovacuum-threshold-and-cost-throttling-work&quot;&gt;How Autovacuum Threshold and Cost Throttling Work&lt;/h2&gt;
&lt;p&gt;Autovacuum has two independently important levers: &lt;strong&gt;when it runs&lt;/strong&gt; and &lt;strong&gt;how fast it runs&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When it runs&lt;/strong&gt; is controlled by the threshold formula above. For large, high-write tables, you almost always need to override &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; at the table level rather than globally:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; events &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_scale_factor &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;01&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  autovacuum_vacuum_threshold &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells autovacuum to trigger after 1% of rows become dead (plus a baseline of 1,000 dead tuples), rather than 20%. For a 50 million row table, that fires after about 501,000 dead tuples (1,000 + 0.01 × 50,000,000) instead of roughly 10 million.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How fast it runs&lt;/strong&gt; is controlled by &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt; (default: 2ms in PG12+, 20ms in older versions) together with &lt;code&gt;autovacuum_vacuum_cost_limit&lt;/code&gt; (default: -1, which falls back to &lt;code&gt;vacuum_cost_limit&lt;/code&gt;, 200 by default). This is a cost-based throttle: each page autovacuum touches adds to a running cost, and once that cost reaches the limit, the worker sleeps for &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt; milliseconds before continuing. The intent is to prevent autovacuum from overwhelming I/O on a shared server. The side effect is that on OLTP servers with continuous high write throughput, autovacuum can be so throttled that it never catches up.&lt;/p&gt;
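&lt;p&gt;As an illustration only, the current throttle settings can be read from &lt;code&gt;pg_settings&lt;/code&gt;, and a single hot table can be given a larger budget without touching the global values. The table name &lt;code&gt;events&lt;/code&gt; is carried over from the example above, and the numbers are placeholders rather than recommendations:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Inspect the server-wide throttle settings&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT name, setting, unit&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_settings&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE name IN (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  &apos;autovacuum_vacuum_cost_delay&apos;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  &apos;autovacuum_vacuum_cost_limit&apos;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  &apos;vacuum_cost_limit&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Let autovacuum work faster on one table only (placeholder values)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ALTER TABLE events SET (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  autovacuum_vacuum_cost_delay = 1,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  autovacuum_vacuum_cost_limit = 1000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scoping the change to one table keeps the I/O protection in place for the rest of the cluster while the table that is falling behind gets more room to catch up.&lt;/p&gt;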
&lt;p&gt;You can observe the current autovacuum state per-table in &lt;code&gt;pg_stat_user_tables&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autovacuum,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autoanalyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A table with a high &lt;code&gt;n_dead_tup&lt;/code&gt; relative to &lt;code&gt;n_live_tup&lt;/code&gt; and a stale &lt;code&gt;last_autovacuum&lt;/code&gt; timestamp is a table where autovacuum is not keeping up.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;autovacuum_max_workers&lt;/code&gt; (default: 3) controls how many autovacuum processes can run simultaneously. On clusters with many high-write tables, this can become the binding constraint — all workers are busy on large tables and smaller tables go unvacuumed.&lt;/p&gt;
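&lt;p&gt;One way to check whether workers are the constraint is to look at what they are doing right now. The sketch below assumes PostgreSQL 10 or later (for the &lt;code&gt;backend_type&lt;/code&gt; column) and only lists tables in the database you are connected to; the alias &lt;code&gt;running_for&lt;/code&gt; is chosen here for illustration:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  a.pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  p.relid::regclass AS table_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  p.phase,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  p.heap_blks_scanned,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  p.heap_blks_total,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  now() - a.query_start AS running_for&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_stat_progress_vacuum p&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;JOIN pg_stat_activity a ON a.pid = p.pid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE a.backend_type = &apos;autovacuum worker&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  AND p.datname = current_database()&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY a.query_start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the cluster-wide number of autovacuum workers regularly equals &lt;code&gt;autovacuum_max_workers&lt;/code&gt; and the same large tables dominate the list, the worker pool is likely saturated.&lt;/p&gt;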
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
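&lt;p&gt;PostgreSQL’s routine vacuuming documentation (postgresql.org/docs/current/routine-vacuuming.html) documents the wraparound risk directly: when a table’s &lt;code&gt;relfrozenxid&lt;/code&gt; age exceeds &lt;code&gt;autovacuum_freeze_max_age&lt;/code&gt; (default: 200 million transactions), PostgreSQL forces an anti-wraparound vacuum on that table. This vacuum runs even when autovacuum is disabled and is not cancelled automatically when another session requests a conflicting lock, but it is still subject to cost throttling. Only the failsafe added in PostgreSQL 14 (&lt;code&gt;vacuum_failsafe_age&lt;/code&gt;, default: 1.6 billion) drops the cost delay. A heavily throttled configuration is therefore overridden only at the failsafe stage, long after the forced vacuums have started competing for I/O and worker slots at times you did not choose.&lt;/p&gt;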
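&lt;p&gt;A minimal sketch for watching this from the catalog, assuming you run it in each database you care about: it lists the tables with the oldest &lt;code&gt;relfrozenxid&lt;/code&gt; next to the configured limit. The aliases &lt;code&gt;xid_age&lt;/code&gt; and &lt;code&gt;freeze_max_age&lt;/code&gt; are chosen here for illustration:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  c.oid::regclass AS table_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  age(c.relfrozenxid) AS xid_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  current_setting(&apos;autovacuum_freeze_max_age&apos;)::bigint AS freeze_max_age&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_class c&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- ordinary tables, materialized views, and TOAST tables carry a relfrozenxid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE c.relkind IN (&apos;r&apos;, &apos;m&apos;, &apos;t&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY xid_age DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;LIMIT 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tables whose &lt;code&gt;xid_age&lt;/code&gt; is closing in on &lt;code&gt;freeze_max_age&lt;/code&gt; are the next candidates for a forced vacuum, so they are worth vacuuming on your own schedule first.&lt;/p&gt;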
&lt;p&gt;The &lt;code&gt;pg_stat_user_tables&lt;/code&gt; view is the documented interface for observing autovacuum behavior per table. The columns &lt;code&gt;n_dead_tup&lt;/code&gt;, &lt;code&gt;last_autovacuum&lt;/code&gt;, &lt;code&gt;last_autoanalyze&lt;/code&gt;, and &lt;code&gt;autovacuum_count&lt;/code&gt; give the observable signal for whether thresholds are tuned correctly.&lt;/p&gt;
&lt;p&gt;The pattern documented under storage parameters in PostgreSQL’s CREATE TABLE reference is that per-table settings (&lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt;, &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt;) override the server-level &lt;code&gt;postgresql.conf&lt;/code&gt; values. This is the correct mechanism for table-level tuning without changing global behavior.&lt;/p&gt;
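&lt;p&gt;Per-table overrides are easy to forget once they are set. As a small sketch, the overrides currently in place can be listed from &lt;code&gt;pg_class.reloptions&lt;/code&gt;, and an individual one removed with &lt;code&gt;RESET&lt;/code&gt;. The table name &lt;code&gt;events&lt;/code&gt; is the same illustrative table used earlier:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Tables that carry any storage-parameter override&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT c.oid::regclass AS table_name, c.reloptions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_class c&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE c.relkind = &apos;r&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  AND c.reloptions IS NOT NULL;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Remove one override so the table falls back to the server-wide value&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ALTER TABLE events RESET (autovacuum_vacuum_scale_factor);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;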
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Autovacuum disabled explicitly (&lt;code&gt;autovacuum = off&lt;/code&gt;)&lt;/td&gt;&lt;td&gt;Dead tuples accumulate unbounded; once a table passes &lt;code&gt;autovacuum_freeze_max_age&lt;/code&gt;, PostgreSQL launches anti-wraparound vacuums anyway, at a time you did not choose&lt;/td&gt;&lt;td&gt;The only thing preventing unbounded table bloat is operator-run VACUUM; one missed cycle compounds&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cost delay set too high on OLTP servers&lt;/td&gt;&lt;td&gt;Autovacuum runs slower than dead tuples accumulate; table bloat grows continuously&lt;/td&gt;&lt;td&gt;Each worker sleeps too long between batches of work; on high-write tables the math never closes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;XID age forces anti-wraparound vacuum&lt;/td&gt;&lt;td&gt;Workers are tied up on the aging tables and are not cancelled by conflicting lock requests; other tables can go unvacuumed&lt;/td&gt;&lt;td&gt;Anti-wraparound vacuum takes priority to protect data integrity; from PostgreSQL 14 the failsafe also drops cost throttling once XID age becomes critical&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: On large, high-write tables the default 20% scale factor lets millions of dead tuples accumulate before autovacuum triggers, causing progressive table and index bloat.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Override &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; at the table level (set to 0.01–0.05 for tables over 1M rows) and reduce &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt; on servers where autovacuum is falling behind.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Query &lt;code&gt;pg_stat_user_tables&lt;/code&gt; and confirm &lt;code&gt;n_dead_tup&lt;/code&gt; on your high-write tables stays at or below the scale factor you configured (for example, under 5% of &lt;code&gt;n_live_tup&lt;/code&gt; with a scale factor of 0.05) over a 24-hour window.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, run &lt;code&gt;SELECT relname, n_dead_tup, n_live_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;&lt;/code&gt; and identify which tables have not been vacuumed recently or have high dead tuple ratios — those are the candidates for per-table threshold tuning.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category></item><item><title>PostgreSQL Slow Query Triage Workflow</title><link>https://rajivonai.com/blog/2022-03-21-postgresql-slow-query-triage-workflow/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-03-21-postgresql-slow-query-triage-workflow/</guid><description>A structured runbook for diagnosing slow query root causes in PostgreSQL — missing indexes, stale statistics, lock contention, and I/O saturation — in the order that wastes the least time.</description><pubDate>Mon, 21 Mar 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;When p95 latency spikes and the on-call alert fires, most engineers open the slow query log and immediately jump to the biggest query by average execution time. That is the wrong move. The query that shows up longest in &lt;code&gt;pg_stat_statements&lt;/code&gt; is often not the query that caused the spike — it is the query that was already slow. The blocking transaction, the missing index on a newly-deployed code path, or autovacuum being interrupted mid-table are the usual culprits. This runbook gives you the order to check that actually closes incidents.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;A p95 latency spike lands in monitoring. The graphs show it clearly: something changed in the last five to fifteen minutes. The application is returning slow responses. Your first instinct is to check the dashboard, which shows elevated CPU and read latency on the database host. &lt;code&gt;pg_stat_activity&lt;/code&gt; has more active connections than usual. The alert threshold on slow queries crossed.&lt;/p&gt;
&lt;p&gt;At this point, engineers split into two groups. The first opens the slow query log, picks the worst query, and starts trying to add an index or rewrite the SQL. The second checks what PostgreSQL is actually doing right now — what is blocked, what is waiting, and what happened to statistics or autovacuum in the last hour. The second group resolves the incident faster because they are reading system state rather than historical averages.&lt;/p&gt;
&lt;p&gt;The problem with jumping straight to historical data is that it is cumulative: &lt;code&gt;pg_stat_statements&lt;/code&gt; aggregates every execution since its last reset. A query that has always been slow looks exactly like a query that only just became slow because it lost an index scan it used to get. You need the current state first, then the cumulative data as context.&lt;/p&gt;
&lt;p&gt;PostgreSQL exposes the information you need through its statistics views and the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension. The triage workflow below uses five queries, in order, to eliminate root causes before you start making changes.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Where to see it&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Active query count above baseline&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt;, CloudWatch connections metric&lt;/td&gt;&lt;td&gt;Connection pressure or query backup — check for lock waits first&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Queries appearing in slow query log with new query shapes&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt;, auto_explain log output&lt;/td&gt;&lt;td&gt;New code path or table growth crossed a plan-change threshold&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sequential scan on a large table in explain output&lt;/td&gt;&lt;td&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; output&lt;/td&gt;&lt;td&gt;Missing index or statistics too stale to use an existing one&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;wait_event_type = &apos;Lock&apos;&lt;/code&gt; for multiple queries&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_activity&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Lock contention — one transaction is blocking others&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High read I/O on the database host&lt;/td&gt;&lt;td&gt;CloudWatch read latency, Datadog disk metrics&lt;/td&gt;&lt;td&gt;Table or index bloat forcing extra page reads; autovacuum may be behind&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;last_autoanalyze&lt;/code&gt; timestamp hours or days old on active table&lt;/td&gt;&lt;td&gt;&lt;code&gt;pg_stat_user_tables&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Stale statistics — planner is working from outdated row estimates&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Find currently running slow queries&lt;/strong&gt; — This is always first. Before looking at anything historical, see what PostgreSQL is doing right now. Queries held open for more than five seconds are either blocked, doing real work, or stuck. The &lt;code&gt;state&lt;/code&gt; column tells you whether a session is executing or sitting idle inside a transaction, and &lt;code&gt;wait_event_type&lt;/code&gt; tells you what an executing session is waiting on.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wait_event_type,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  wait_event,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; !=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;5 seconds&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look at &lt;code&gt;wait_event_type&lt;/code&gt;. If it reads &lt;code&gt;Lock&lt;/code&gt;, you have a lock contention issue. If it reads &lt;code&gt;IO&lt;/code&gt;, the query is waiting on disk. If it is null, the query is actively executing — check the plan next.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Find top queries by cumulative execution time&lt;/strong&gt; — Once you know what is running now, pull the historical picture from &lt;code&gt;pg_stat_statements&lt;/code&gt;. This extension is documented in the PostgreSQL &lt;code&gt;pg_stat_statements&lt;/code&gt; module reference and accumulates statistics since the last reset. Sort by &lt;code&gt;total_exec_time&lt;/code&gt; (named &lt;code&gt;total_time&lt;/code&gt; before PostgreSQL 13) to find queries that are expensive in aggregate, not just occasionally slow.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  calls,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  total_exec_time &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;/&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; calls &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; avg_ms,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  total_exec_time,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  rows&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_statements&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; total_exec_time &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A query with high &lt;code&gt;avg_ms&lt;/code&gt; but low &lt;code&gt;calls&lt;/code&gt; is an outlier. A query with moderate &lt;code&gt;avg_ms&lt;/code&gt; but millions of &lt;code&gt;calls&lt;/code&gt; is a throughput problem. Both need attention, but the right fix differs.&lt;/p&gt;
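&lt;p&gt;Because the counters are cumulative, one way to isolate what is expensive during the incident itself is to reset them and look again after a few minutes. This is a sketch under assumptions: it needs superuser rights or an explicit grant, it uses the PostgreSQL 13+ column names, and the reset discards history that dashboards or other tools may depend on, so check before running it:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Clear the accumulated counters&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT pg_stat_statements_reset();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- A few minutes later: rank by what ran inside that window only&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT query, calls, mean_exec_time, total_exec_time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_stat_statements&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY total_exec_time DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;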
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check for lock waits&lt;/strong&gt; — If check 1 showed any &lt;code&gt;wait_event_type = &apos;Lock&apos;&lt;/code&gt; rows, this query identifies which session is blocking each waiter. &lt;code&gt;pg_blocking_pids()&lt;/code&gt; is a PostgreSQL built-in (9.6 and later) that returns the PIDs of sessions blocking a given session. If a blocker is itself waiting, it also appears in the output as a blocked session, so you can follow the chain to its root.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;wait_event_type&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_query,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;  blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query_start&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_duration&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity blocked&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity blocking&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANY(pg_blocking_pids(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;blocking_query&lt;/code&gt; column often reveals the transaction holding the lock. An idle-in-transaction connection is a common culprit: a transaction that opened, ran one query, and then paused while the application did something else — holding its lock the whole time.&lt;/p&gt;
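&lt;p&gt;To list idle-in-transaction sessions directly, a minimal sketch follows. The aliases &lt;code&gt;xact_age&lt;/code&gt;, &lt;code&gt;idle_for&lt;/code&gt;, and &lt;code&gt;last_query&lt;/code&gt; are chosen here for illustration, and the PID in the second statement is a placeholder for one you have confirmed is safe to end:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  now() - xact_start AS xact_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  now() - state_change AS idle_for,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  query AS last_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE state = &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY xact_start;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- End one session after confirming it is safe (12345 is a placeholder pid)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT pg_terminate_backend(12345);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For these sessions &lt;code&gt;query&lt;/code&gt; holds the last statement that ran, not one in progress, which is why the transaction age matters more than the query duration.&lt;/p&gt;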
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Check table statistics age&lt;/strong&gt; — If lock waits are not the issue, check whether the planner is working from stale statistics. PostgreSQL uses statistics collected by &lt;code&gt;ANALYZE&lt;/code&gt; to estimate row counts and choose access paths. When statistics fall behind the actual table state — after a large data load, a batch delete, or a period when autovacuum was interrupted — the planner can choose a sequential scan where an index would be far faster.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  schemaname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_analyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autoanalyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup::&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;float&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; NULLIF&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(n_live_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; dead_ratio&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_dead_tup &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A table with a &lt;code&gt;last_autoanalyze&lt;/code&gt; timestamp more than a few hours old on a high-write workload, or a &lt;code&gt;dead_ratio&lt;/code&gt; above 10–20%, is a candidate. The autovacuum capacity implications of this pattern are covered in depth in &lt;a href=&quot;https://rajivonai.com/blog/2025-09-13-autovacuum-is-a-capacity-problem-not-a-maintenance-task/&quot;&gt;Autovacuum Is a Capacity Problem, Not a Maintenance Task&lt;/a&gt;.&lt;/p&gt;
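&lt;p&gt;When a table turns out to be a candidate, the immediate fix and the longer-term one might look like the sketch below. The table name &lt;code&gt;events&lt;/code&gt; is hypothetical and the 0.02 value is a placeholder, not a recommendation:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Refresh planner statistics now, then recheck the plan&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ANALYZE VERBOSE events;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Have autoanalyze fire after about 2% of rows change on this table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;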
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Get EXPLAIN ANALYZE for the slow query&lt;/strong&gt; — Once you have identified the specific query from checks 1 or 2, pull the execution plan with buffer statistics. &lt;code&gt;BUFFERS&lt;/code&gt; output shows how many shared buffer hits versus disk reads the query required, which distinguishes a missing index (high shared hits, no index scan) from an I/O problem (high disk reads).&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS, FORMAT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TEXT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;paste slow query here&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for: &lt;code&gt;Seq Scan&lt;/code&gt; on a table with high &lt;code&gt;rows=&lt;/code&gt; estimates, &lt;code&gt;rows=1&lt;/code&gt; estimates on nodes where the actual rows are in the thousands (stale statistics), and &lt;code&gt;Buffers: shared read=&lt;/code&gt; values that are high relative to table size.&lt;/p&gt;
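&lt;p&gt;One caution worth illustrating: &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; actually executes the statement. For a data-modifying query, a common pattern is to wrap it in a transaction and roll back. The table and column names below are hypothetical:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;BEGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- The UPDATE really runs, so its row locks are held until ROLLBACK&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;UPDATE events SET processed_at = now() WHERE id = 42;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ROLLBACK;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rollback undoes the data change, but the statement still takes locks and generates I/O while it runs, so on a system that is already struggling it may be safer to start with plain &lt;code&gt;EXPLAIN&lt;/code&gt;.&lt;/p&gt;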
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Slow query alert fires] --&gt; B{pg_stat_activity — queries waiting on Lock?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|yes| C[Check blocking chain — kill or wait out blocker]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|no| D{EXPLAIN shows Seq Scan on large table?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt;|yes| E{Index exists for this predicate?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|no| F[Add index with CREATE INDEX CONCURRENTLY]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|yes| G{Statistics stale — last_autoanalyze old?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|yes| H[Run ANALYZE on table — recheck plan]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    G --&gt;|no| I{High Buffers: shared read in EXPLAIN?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|yes| J[Check table bloat and autovacuum lag]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|no| K{Connection count near pool limit?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|yes| L[Check pool settings and idle-in-transaction connections]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt;|no| M[Profile query logic — may be algorithmic]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this diagram shows:&lt;/strong&gt; A decision tree for PostgreSQL slow query triage — starting with active lock waits, then sequential scans on large tables, missing indexes, stale statistics (last_autoanalyze), high shared buffer reads indicating bloat, and connection pool saturation — in the order that eliminates the most common root causes first.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Option 1 — Add a missing index&lt;/strong&gt;&lt;/p&gt;
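&lt;p&gt;When &lt;code&gt;EXPLAIN&lt;/code&gt; shows a sequential scan on a large table and no index covers the query predicate, create one online. &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; builds the index without blocking reads or writes. It takes longer than a standard index build, it cannot run inside a transaction block, and it has to wait for older open transactions to finish, so a long-running transaction can stall it. If the build fails or is cancelled, it leaves behind an invalid index that must be dropped before retrying. It is still the safe choice for production.&lt;/p&gt;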
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; CONCURRENTLY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; idx_orders_customer_created&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (customer_id, created_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; !=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;cancelled&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
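&lt;p&gt;Partial indexes (the &lt;code&gt;WHERE&lt;/code&gt; clause above) reduce size and improve selectivity when the query always filters on a stable condition. The planner only considers a partial index when the query’s own &lt;code&gt;WHERE&lt;/code&gt; clause implies the index predicate, so the query has to carry the same &lt;code&gt;status&lt;/code&gt; filter. After creation, run &lt;code&gt;EXPLAIN&lt;/code&gt; again to confirm the planner picks up the new index. If it does not, check that the statistics are current: run &lt;code&gt;ANALYZE orders;&lt;/code&gt; and re-examine.&lt;/p&gt;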
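&lt;p&gt;A concurrent build can be watched while it runs and verified once it finishes. The sketch below assumes PostgreSQL 12 or later for the progress view and reuses the &lt;code&gt;orders&lt;/code&gt; table from the example above. An index reported with &lt;code&gt;indisvalid = false&lt;/code&gt; after the build has ended is a leftover from a failed build and should be dropped before retrying.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Progress of index builds currently running (PostgreSQL 12+)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid, phase, blocks_done, blocks_total&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_progress_create_index;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Validity of every index on the table after the build&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT c.relname AS index_name, i.indisvalid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_index i&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_class c ON c.oid = i.indexrelid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE i.indrelid = &apos;orders&apos;::regclass;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;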
&lt;p&gt;&lt;strong&gt;Option 2 — Refresh stale statistics&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When &lt;code&gt;EXPLAIN&lt;/code&gt; shows row estimates that are far off from actual rows — typically &lt;code&gt;rows=1&lt;/code&gt; or a small number where the actual is thousands — and &lt;code&gt;pg_stat_user_tables&lt;/code&gt; shows a stale &lt;code&gt;last_autoanalyze&lt;/code&gt;, run &lt;code&gt;ANALYZE&lt;/code&gt; manually.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VERBOSE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;ANALYZE&lt;/code&gt; is safe to run during an incident. It takes a &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;/code&gt; lock, which does not block reads or writes; it conflicts only with schema changes and other maintenance commands on the same table. It samples the table rather than reading all of it, so it completes quickly on most tables. After it finishes, run &lt;code&gt;EXPLAIN&lt;/code&gt; again. If the plan does not change, the statistics were not the issue, so move to the next check.&lt;/p&gt;
&lt;p&gt;If autovacuum is consistently falling behind on this table, the default &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; of 0.1 (10% of the table’s rows must change before an automatic &lt;code&gt;ANALYZE&lt;/code&gt; runs) is too coarse for large or frequently-modified tables. Lower it per-table:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (autovacuum_analyze_scale_factor &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;01&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
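&lt;p&gt;To see what the change means in practice: autoanalyze runs once the rows modified since the last analyze exceed &lt;code&gt;autovacuum_analyze_threshold&lt;/code&gt; plus &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; times the table’s row count. With the default threshold of 50, a hypothetical table of 50 million rows needs about 5 million modifications at a scale factor of 0.1, and about 500,000 at 0.01. The query below is a minimal sketch for checking how far a table is from that trigger and whether the per-table override has been applied.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Rows changed since the last analyze, plus any per-table overrides&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT s.relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       s.n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       s.n_mod_since_analyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       s.last_analyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       s.last_autoanalyze,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       c.reloptions  -- shows autovacuum_analyze_scale_factor if set per table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables s&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_class c ON c.oid = s.relid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE s.relname = &apos;orders&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;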
&lt;p&gt;&lt;strong&gt;Option 3 — Resolve lock contention&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When the blocking chain query from check 3 shows a long-running transaction holding a lock that others are waiting on, you have two choices: wait for it to finish, or terminate it.&lt;/p&gt;
&lt;p&gt;Terminate with care. &lt;code&gt;pg_terminate_backend()&lt;/code&gt; sends SIGTERM to the backend process; the transaction rolls back and its locks are released immediately. Use it when the blocking transaction has been idle for longer than your incident SLA, or when it is clearly stuck.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_terminate_backend(blocking_pid)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  SELECT DISTINCT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; blocking_pid&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity blocked&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity blocking&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; ANY(pg_blocking_pids(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;blocked&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;pid&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;  WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; blocking&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;query_start&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;2 minutes&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) sub;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
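&lt;p&gt;Before running the termination statement, it can help to preview which sessions it would hit and how many others each one is blocking. The query below is a read-only sketch built on the same &lt;code&gt;pg_blocking_pids()&lt;/code&gt; join; the choice of output columns is an assumption about what is useful during triage, not part of the original runbook.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- List blockers with transaction age and the number of sessions they block&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT blocking.pid                  AS blocking_pid,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       blocking.usename,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       blocking.application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       blocking.state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - blocking.xact_start   AS xact_age,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       now() - blocking.state_change AS time_in_state,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       count(DISTINCT blocked.pid)   AS sessions_blocked,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       left(blocking.query, 80)      AS last_query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_activity blocked&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;JOIN pg_stat_activity blocking&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY blocking.pid, blocking.usename, blocking.application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         blocking.state, blocking.xact_start, blocking.state_change,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;         blocking.query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY sessions_blocked DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;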
&lt;p&gt;After terminating, investigate why the transaction stayed open. Idle-in-transaction connections usually point to application-side connection pool misconfiguration or missing error handling that closes transactions on exception.&lt;/p&gt;
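&lt;p&gt;One possible server-side guard, while the application fix is pending, is &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; (available since PostgreSQL 9.6), which ends sessions that sit idle inside a transaction for longer than the configured time. The sketch below assumes a hypothetical application role named &lt;code&gt;app_user&lt;/code&gt; and a 60-second limit; both are illustrative and should be tuned to the workload.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- app_user is a hypothetical role name; applies to new sessions only&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER ROLE app_user SET idle_in_transaction_session_timeout = &apos;60s&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check the value in effect for the current session&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW idle_in_transaction_session_timeout;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;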
&lt;p&gt;&lt;strong&gt;Option 4 — Address bloat and autovacuum lag&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When &lt;code&gt;EXPLAIN&lt;/code&gt; shows high &lt;code&gt;Buffers: shared read=&lt;/code&gt; values disproportionate to the query’s logical data needs, and &lt;code&gt;pg_stat_user_tables&lt;/code&gt; shows high &lt;code&gt;n_dead_tup&lt;/code&gt; on the relevant table, dead row versions are inflating the table and causing unnecessary disk reads.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check bloat on a specific table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autovacuum,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  last_autoanalyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Force vacuum manually during the incident&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;VACUUM (&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;VERBOSE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, ANALYZE) orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
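&lt;p&gt;Standard &lt;code&gt;VACUUM&lt;/code&gt;, as opposed to &lt;code&gt;VACUUM FULL&lt;/code&gt;, does not block reads or writes. It marks the space held by dead tuples as reusable for new rows, although it rarely shrinks the file on disk, and with the &lt;code&gt;ANALYZE&lt;/code&gt; option it also refreshes planner statistics. &lt;code&gt;VACUUM FULL&lt;/code&gt; takes an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock and rewrites the table, blocking all reads and writes until it finishes; it should not be used on production tables during an incident.&lt;/p&gt;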
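&lt;p&gt;To find which tables carry the most dead rows and to follow a vacuum that is already running, something like the following can be used. It is a sketch: the dead-row percentage is only a rough proxy for bloat, and the limit of ten rows is arbitrary.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Tables ranked by dead tuples, with dead rows as a share of all rows&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT relname,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       n_live_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       n_dead_tup,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY n_dead_tup DESC&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;LIMIT 10;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Progress of vacuums currently running&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT pid, relid::regclass AS table_name, phase,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       heap_blks_scanned, heap_blks_total&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM pg_stat_progress_vacuum;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;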
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Created index with &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;&lt;/strong&gt; — Drop it with &lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt;. The drop is also online and does not block queries on the table. Dropping an index, partial or not, never changes table data; only plans that used the index are affected.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ran &lt;code&gt;ANALYZE&lt;/code&gt;&lt;/strong&gt; — No rollback needed. &lt;code&gt;ANALYZE&lt;/code&gt; updates planner statistics only and does not touch table data. The new statistics stay in effect until the next manual or automatic &lt;code&gt;ANALYZE&lt;/code&gt; replaces them, so plans track the current state of the table. There is no simple way to put the previous statistics back, and rarely a reason to.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Killed a blocking transaction&lt;/strong&gt; — The killed transaction rolls back automatically. Any work it had done is undone. Monitor &lt;code&gt;pg_stat_activity&lt;/code&gt; to confirm the blocked queries resume. If they do not, check for a new blocking chain.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ran &lt;code&gt;VACUUM&lt;/code&gt;&lt;/strong&gt; — No rollback needed. Vacuum is non-destructive: it makes dead-row space reusable but does not modify live rows. Re-enable autovacuum if it was disabled during the incident.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
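&lt;p&gt;For the steps above that do have an undo, the statements might look like the following. This is a sketch that reuses the index and table names from the remediation examples. Resetting the per-table autovacuum setting is not one of the numbered steps and is included only as an assumption about what a team may want to revert.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Undo Option 1: drop the index online (cannot run inside a transaction block)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_created;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Undo the per-table override from Option 2: fall back to the server default&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ALTER TABLE orders RESET (autovacuum_analyze_scale_factor);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;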
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Two automation patterns are worth implementing before the next incident rather than after.&lt;/p&gt;
&lt;p&gt;The first is continuous slow query capture. PostgreSQL’s &lt;code&gt;auto_explain&lt;/code&gt; module logs execution plans automatically when a query exceeds a duration threshold. The block below is the session-level form, which is useful for testing and needs superuser rights. To enable it for every session, add &lt;code&gt;auto_explain&lt;/code&gt; to &lt;code&gt;session_preload_libraries&lt;/code&gt; (or &lt;code&gt;shared_preload_libraries&lt;/code&gt;, which requires a restart) and set the same parameters in &lt;code&gt;postgresql.conf&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Load the module for the current session only (no restart needed)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LOAD&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;auto_explain&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; auto_explain&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;log_min_duration&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;1s&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; auto_explain&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;log_analyze&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; auto_explain&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;log_buffers&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; true;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
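&lt;p&gt;With &lt;code&gt;auto_explain&lt;/code&gt; active, every query over one second logs its plan to the PostgreSQL log. Be aware that &lt;code&gt;log_analyze&lt;/code&gt; adds timing overhead to every statement, not only the slow ones; turning off &lt;code&gt;auto_explain.log_timing&lt;/code&gt; or lowering &lt;code&gt;auto_explain.sample_rate&lt;/code&gt; reduces that cost. Feed those logs to a log aggregator and you will have plan history before the incident rather than needing to reconstruct it after.&lt;/p&gt;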
&lt;p&gt;The second is a scheduled &lt;code&gt;pg_stat_activity&lt;/code&gt; snapshot. Use &lt;code&gt;pg_cron&lt;/code&gt; to capture long-running queries every minute to a local table. This gives you a timeline to review post-incident that &lt;code&gt;pg_stat_statements&lt;/code&gt; alone cannot provide, since &lt;code&gt;pg_stat_statements&lt;/code&gt; aggregates across time but does not record when queries were running.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Requires pg_cron extension&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; cron&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;schedule&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;capture-slow-queries&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;  &apos;* * * * *&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    INSERT INTO&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; slow_query_log (captured_at, pid, duration, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, query)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(), pid, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_start, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;    WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; !=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;      AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; query_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&amp;#x3C;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;10 seconds&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  $$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
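&lt;p&gt;The job above writes to a &lt;code&gt;slow_query_log&lt;/code&gt; table that has to exist first, in the database where &lt;code&gt;pg_cron&lt;/code&gt; runs its jobs. The definition below is an assumed minimal schema matching the columns in the &lt;code&gt;INSERT&lt;/code&gt;. The purge job, its name, and the seven-day retention are likewise illustrative choices to keep the table from growing without bound.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Assumed schema: one row per long-running session per snapshot&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE TABLE IF NOT EXISTS slow_query_log (&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  captured_at timestamptz NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  pid         integer     NOT NULL,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  duration    interval,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  state       text,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  query       text&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Time-range lookups during post-incident review&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;CREATE INDEX IF NOT EXISTS slow_query_log_captured_at_idx&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  ON slow_query_log (captured_at);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Illustrative retention: purge snapshots older than 7 days, daily at 03:00&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT cron.schedule(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  &apos;purge-slow-query-log&apos;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  &apos;0 3 * * *&apos;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  $$DELETE FROM slow_query_log WHERE captured_at &amp;#x3C; now() - interval &apos;7 days&apos;$$&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;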
&lt;p&gt;Alert on this table when row counts spike: that is an early signal that something is blocking normal query throughput before the application-side p95 alert fires.&lt;/p&gt;
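&lt;p&gt;A query along these lines could feed such an alert. It counts captured sessions per minute over the last 15 minutes; the window and whatever threshold the alert uses are assumptions that need tuning against normal traffic.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Long-running sessions captured per minute, most recent 15 minutes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT date_trunc(&apos;minute&apos;, captured_at) AS minute,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       count(*)                          AS long_running_sessions&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM slow_query_log&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE captured_at &gt;= now() - interval &apos;15 minutes&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;GROUP BY 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ORDER BY 1;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;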
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What broke&lt;/strong&gt;: Queries slowed because of lock contention from a long-running transaction, or because the query planner chose a sequential scan after table statistics fell out of date.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What was done&lt;/strong&gt;: Identified the root cause using PostgreSQL’s built-in statistics views (&lt;code&gt;pg_stat_activity&lt;/code&gt;, &lt;code&gt;pg_stat_statements&lt;/code&gt;), terminated the blocking connection or added a missing index, and ran &lt;code&gt;ANALYZE&lt;/code&gt; to refresh planner statistics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What prevents recurrence&lt;/strong&gt;: &lt;code&gt;auto_explain&lt;/code&gt; now captures slow query plans automatically; per-table autovacuum thresholds are set for high-write tables; a &lt;code&gt;pg_cron&lt;/code&gt; job snapshots long-running queries every minute for post-incident review.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;checklist&quot;&gt;Checklist&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Pull currently running queries from &lt;code&gt;pg_stat_activity&lt;/code&gt; — check &lt;code&gt;wait_event_type&lt;/code&gt; before anything else&lt;/li&gt;
&lt;li&gt;Identify any sessions with &lt;code&gt;wait_event_type = &apos;Lock&apos;&lt;/code&gt; and trace the blocking chain&lt;/li&gt;
&lt;li&gt;Pull top queries by &lt;code&gt;total_exec_time&lt;/code&gt; from &lt;code&gt;pg_stat_statements&lt;/code&gt; — distinguish outliers from throughput problems&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; on the specific slow query — look for Seq Scan and row estimate mismatches&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;pg_stat_user_tables&lt;/code&gt; for tables with stale &lt;code&gt;last_autoanalyze&lt;/code&gt; or high &lt;code&gt;n_dead_tup&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;If lock contention: terminate idle-in-transaction connections blocking others for more than two minutes&lt;/li&gt;
&lt;li&gt;If missing index: create with &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; — confirm plan change with &lt;code&gt;EXPLAIN&lt;/code&gt; afterward&lt;/li&gt;
&lt;li&gt;If stale statistics: run &lt;code&gt;ANALYZE&lt;/code&gt; on the affected table — always safe, non-blocking&lt;/li&gt;
&lt;li&gt;If bloat: run &lt;code&gt;VACUUM (VERBOSE, ANALYZE)&lt;/code&gt; — do not use &lt;code&gt;VACUUM FULL&lt;/code&gt; during an incident&lt;/li&gt;
&lt;li&gt;After resolving: lower &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; on high-write tables to prevent recurrence&lt;/li&gt;
&lt;li&gt;Enable &lt;code&gt;auto_explain&lt;/code&gt; with &lt;code&gt;log_min_duration&lt;/code&gt; set to your slow query threshold&lt;/li&gt;
&lt;li&gt;Schedule a &lt;code&gt;pg_cron&lt;/code&gt; job to snapshot &lt;code&gt;pg_stat_activity&lt;/code&gt; for future post-incident timelines&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;what-this-post-does-not-cover&quot;&gt;What This Post Does Not Cover&lt;/h2&gt;
&lt;p&gt;This post covers triage of an active slow query incident. It does not cover: &lt;code&gt;pg_partman&lt;/code&gt; partition pruning for large tables, physical replication lag as a source of slow reads on replicas, connection pooler (PgBouncer) saturation that precedes the slow query symptom, or schema migration locking analysis. Each of those is a distinct failure mode with its own triage path.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: A slow query alert fires and the on-call engineer spends 30 minutes checking the wrong root cause — stale statistics were the issue, not the query they were tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Work through the five checks in order: current activity first, then historical aggregates, then lock contention, then statistics age, then the execution plan.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Running &lt;code&gt;pg_stat_activity&lt;/code&gt; before touching anything else shows whether the incident is lock-driven within 60 seconds — that confirmation eliminates half the possible root causes immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Add &lt;code&gt;pg_stat_statements&lt;/code&gt; and &lt;code&gt;auto_explain&lt;/code&gt; to your PostgreSQL configuration this week; validate they are collecting data; add the five check queries to your team’s runbook.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>checklist</category><category>failures</category></item></channel></rss>