<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Engineering Fundamentals | RajivOnAI</title><description>Core engineering principles, debugging workflows, observability, performance basics, reviews, and practical operating habits.</description><link>https://rajivonai.com/topics/fundamentals/</link><item><title>AI Cost Observability Dashboard: LangSmith vs Helicone</title><link>https://rajivonai.com/blog/2026-04-15-ai-cost-observability-dashboard/</link><guid isPermaLink="true">https://rajivonai.com/blog/2026-04-15-ai-cost-observability-dashboard/</guid><description>How to build an AI FinOps dashboard and choose between proxy-based and instrumentation-based observability.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If you cannot map an unexpected $500 Anthropic API spike to a specific PR, developer, or infinite agent loop within five minutes, your AI engineering team is flying blind.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Engineering teams are deploying AI not just as chatbots, but as embedded agents within continuous integration pipelines, IDEs, and local terminal workflows. As organizations shift from flat-rate seat licenses to metered API consumption, the primary operational risk shifts from “uptime” to “runaway cloud spend.”&lt;/p&gt;
&lt;p&gt;Platform engineering teams are tasked with bringing this spend under control. They need a dashboard. However, the AI observability tooling market has split into two fundamentally different architectural patterns: &lt;strong&gt;Proxy-Based Gateways&lt;/strong&gt; and &lt;strong&gt;Deep Agent Instrumentation&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Most platform teams choose their observability tool based on marketing rather than their actual engineering bottleneck.&lt;/p&gt;
&lt;p&gt;If you use a deep instrumentation tool when all you need is a budget cutoff, you waste weeks fighting SDK integrations. If you use a simple proxy gateway when you are trying to debug a complex multi-stage agent, you will see a massive token spike on your dashboard but have absolutely no idea &lt;em&gt;why&lt;/em&gt; the agent decided to ingest the entire repository.&lt;/p&gt;
&lt;p&gt;You need to track critical metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cost by user, team, and repository.&lt;/li&gt;
&lt;li&gt;Tokens per session and average session duration.&lt;/li&gt;
&lt;li&gt;Retry loops (identifying agents stuck in failure states).&lt;/li&gt;
&lt;li&gt;Cost per merged PR.&lt;/li&gt;
&lt;li&gt;Monthly burn rate and forecasted overrun.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choosing between LangSmith and Helicone dictates whether you can actually extract these metrics without suffocating your developers.&lt;/p&gt;
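&lt;p&gt;As a rough illustration of the metrics listed above, the sketch below aggregates cost by team, flags sessions stuck in retry loops, and computes cost per merged PR. The record fields (&lt;code&gt;team&lt;/code&gt;, &lt;code&gt;session&lt;/code&gt;, &lt;code&gt;cost_usd&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;) are hypothetical and are not the export schema of LangSmith or Helicone; map them to whatever your tool actually emits.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical per-request log records; the field names are assumptions,
# not the export schema of any specific observability tool.
REQUESTS = [
    {"team": "payments", "session": "s1", "cost_usd": 0.42, "status": "ok"},
    {"team": "payments", "session": "s1", "cost_usd": 0.40, "status": "error"},
    {"team": "payments", "session": "s1", "cost_usd": 0.41, "status": "error"},
    {"team": "payments", "session": "s1", "cost_usd": 0.43, "status": "error"},
    {"team": "search", "session": "s2", "cost_usd": 0.10, "status": "ok"},
]


def cost_by(records, key):
    """Sum cost per distinct value of `key` (for example team or session)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return {name: round(total, 2) for name, total in totals.items()}


def sessions_with_retry_loops(records, max_consecutive_errors=3):
    """Return sessions that contain a run of consecutive failed calls."""
    streaks = defaultdict(int)
    flagged = set()
    for record in records:
        session = record["session"]
        if record["status"] == "error":
            streaks[session] += 1
            # The streak passes through this value on its way up, so an
            # equality check is enough to flag the session once.
            if streaks[session] == max_consecutive_errors:
                flagged.add(session)
        else:
            streaks[session] = 0
    return sorted(flagged)


def cost_per_merged_pr(total_cost_usd, merged_prs):
    """Cost per merged PR; None when nothing merged in the period."""
    if merged_prs == 0:
        return None
    return round(total_cost_usd / merged_prs, 2)
```

&lt;p&gt;The same grouping function covers cost by user or repository once those fields are present on each record, which is why attribution metadata matters more than the choice of dashboard.&lt;/p&gt;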
&lt;h2 id=&quot;the-architecture-of-observability&quot;&gt;The Architecture of Observability&lt;/h2&gt;
&lt;p&gt;Your dashboard architecture depends entirely on your primary goal: Cost Control vs. Lifecycle Debugging.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App[AI Application / CLI]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    subgraph Proxy Architecture&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        Helicone[Helicone API Gateway]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        Helicone --&gt;|Cache — Rate Limit| API1[Provider API]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    end&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    subgraph Instrumentation Architecture&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        LangChain[LangChain — LiteLLM — SDK]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        LangSmith[LangSmith Tracing Backend]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        LangChain -.-&gt;|Async Trace — OTel| LangSmith&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;        LangChain --&gt; API2[Provider API]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    end&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App --&gt; Helicone&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    App --&gt; LangChain&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;1-the-proxy-gateway-pattern-helicone--openmeter&quot;&gt;1. The Proxy Gateway Pattern (Helicone / OpenMeter)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt; Operational cost monitoring, strict budget enforcement, and zero-instrumentation setups.&lt;/p&gt;
&lt;p&gt;Helicone acts as an API gateway. You change the &lt;code&gt;baseURL&lt;/code&gt; in your Anthropic or OpenAI client to point to Helicone, and it immediately starts logging traffic. It sits between your application and the provider, making it perfect for caching repeated prompts and enforcing hard rate limits.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Advantage:&lt;/strong&gt; It “just works.” You can cut off a team’s API access the second they hit a $500 monthly limit, regardless of how complex their code is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Drawback:&lt;/strong&gt; It only sees the HTTP request and response. If a LangGraph agent makes 15 calls in a row, the proxy sees 15 isolated calls; it doesn’t understand the conceptual “chain” that connects them.&lt;/li&gt;
&lt;/ul&gt;
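&lt;p&gt;A minimal sketch of the base URL swap, assuming an OpenAI-compatible client. The gateway URL and the &lt;code&gt;Helicone-Auth&lt;/code&gt;, &lt;code&gt;Helicone-User-Id&lt;/code&gt;, and &lt;code&gt;Helicone-Property-Team&lt;/code&gt; header names reflect Helicone’s public documentation as best understood here and should be verified before use; the function name and its parameters are illustrative only.&lt;/p&gt;

```python
# Assumed values: verify the gateway URL and header names against the
# current Helicone documentation before relying on them.
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def proxied_client_config(helicone_api_key, user_id, team):
    """Build the settings an OpenAI-compatible client needs to route via the proxy."""
    if not user_id or not team:
        # Refuse to build a client that would produce unattributed spend.
        raise ValueError("user_id and team are required for cost attribution")
    return {
        "base_url": HELICONE_OPENAI_BASE_URL,
        "default_headers": {
            "Helicone-Auth": f"Bearer {helicone_api_key}",
            "Helicone-User-Id": user_id,
            "Helicone-Property-Team": team,
        },
    }
```

&lt;p&gt;With the official &lt;code&gt;openai&lt;/code&gt; Python package, the returned dictionary can be unpacked into the client constructor alongside the provider key, since that constructor accepts &lt;code&gt;base_url&lt;/code&gt; and &lt;code&gt;default_headers&lt;/code&gt;. Failing fast on missing attribution keeps every logged request tied to a user and a team.&lt;/p&gt;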
&lt;h3 id=&quot;2-the-agent-lifecycle-pattern-langsmith&quot;&gt;2. The Agent Lifecycle Pattern (LangSmith)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt; Complex agent debugging, evaluation pipelines, and multi-step trace visibility.&lt;/p&gt;
&lt;p&gt;LangSmith requires SDK integration. It hooks directly into the logic of your code. If an agent executes a plan, makes three tool calls, does a vector search, and then formats a response, LangSmith traces that entire hierarchy. LangSmith supports LangChain/LangGraph natively and also accepts OpenTelemetry (OTel) traces from non-LangChain frameworks via its REST ingest API.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Advantage:&lt;/strong&gt; Unmatched depth. You can click into a trace and see exactly which node in your agent graph caused the 100,000-token context explosion. Evaluation pipelines (“Evals”) let you measure whether a prompt change actually improved output quality.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Drawback:&lt;/strong&gt; Requires instrumentation code changes; each framework has different integration depth. Budget and per-developer spend reporting requires custom aggregation — the tool is optimized for trace debugging, not FinOps dashboards.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;In practice, these two architectures serve different audiences inside the same organization.&lt;/p&gt;
&lt;p&gt;The platform engineering and FinOps teams rely on the &lt;strong&gt;Proxy Pattern&lt;/strong&gt;. The standard enterprise practice of routing all external API traffic through a centralized gateway — enforcing per-service quotas and attribution — applies directly to AI. Platform teams provision Helicone to manage the organizational budget, ensuring that a single runaway script cannot drain the corporate card.&lt;/p&gt;
&lt;p&gt;Conversely, AI product engineers rely on the &lt;strong&gt;Instrumentation Pattern&lt;/strong&gt;. When building highly autonomous agents, developers use LangSmith to run “Evals” (LLM-as-a-judge) to measure whether a new prompt actually improved output quality, trading the simplicity of a proxy for deep execution traces.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;If you implement the wrong observability layer, your FinOps dashboard will fail.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Dashboard Failure&lt;/th&gt;&lt;th align=&quot;left&quot;&gt;Trigger&lt;/th&gt;&lt;th align=&quot;left&quot;&gt;Impact&lt;/th&gt;&lt;th align=&quot;left&quot;&gt;Mitigation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;strong&gt;The Opaque Spike&lt;/strong&gt;&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;Using a proxy to monitor a complex multi-agent system.&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;The dashboard shows a $50 spike, but engineers cannot figure out which agent logic triggered it.&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;Use LangSmith to trace the specific execution nodes of complex agents.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;strong&gt;The SDK Tax&lt;/strong&gt;&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;Forcing LangSmith on a team writing simple Python scripts.&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;Developers spend more time configuring traces than writing the actual business logic.&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;Use Helicone for a zero-instrumentation gateway integration.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;strong&gt;Unattributed Spend&lt;/strong&gt;&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;Using an API gateway but failing to pass custom headers.&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;You know you spent $1,000, but you don’t know which team or user spent it.&lt;/td&gt;&lt;td align=&quot;left&quot;&gt;Enforce a strict policy that all proxy requests must include a user-identifying header (for Helicone, &lt;code&gt;Helicone-User-Id&lt;/code&gt;).&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Transitioning to usage-based AI developer tools creates a critical blind spot for platform teams managing organizational budgets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Deploy an AI observability dashboard that aligns with your engineering bottleneck—Helicone for budget proxies, LangSmith for deep agent debugging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Enforcing hard spending limits and request caching at the gateway prevents runaway API charges from unconstrained developer keys. Every retried call that reaches the provider and returns tokens is billed, and without a gateway layer those retry loops stay invisible until the invoice arrives.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Immediately provision an API proxy (like Helicone) and issue internal keys to your developers. Refuse to fund direct Anthropic or OpenAI API keys that bypass this observability layer.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>ai-engineering</category><category>architecture</category><category>checklist</category></item><item><title>Alert Fatigue Engineering: How to Build Fewer, Better, Actionable Alerts</title><link>https://rajivonai.com/blog/2025-10-21-alert-fatigue-engineering-actionable-alerts/</link><guid isPermaLink="true">https://rajivonai.com/blog/2025-10-21-alert-fatigue-engineering-actionable-alerts/</guid><description>A dashboard is not observability, and an alert without a specific action is just operational debt masquerading as monitoring.</description><pubDate>Tue, 21 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If an engineer’s first instinct when their pager goes off is to mute it and go back to sleep, your entire observability stack has failed its primary purpose.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;As teams migrate from monolithic infrastructure to microservices and cloud databases, they tend to over-monitor. They instrument every container, queue, and database instance, and map an alert to every available metric. In theory, this provides comprehensive coverage. In reality, it creates a crushing wave of noise.&lt;/p&gt;
&lt;p&gt;Alert fatigue is the silent killer of engineering culture. When a platform team receives 500 alerts in a week, the human brain stops processing them as signals and starts treating them as background static. This leads to the most dangerous state in systems engineering: a legitimate, catastrophic failure alert is ignored because it looks exactly like the 499 false positives that preceded it.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The root of alert fatigue is a misunderstanding of what an alert is. A dashboard is meant for exploration and context. An alert is meant to demand immediate human action.&lt;/p&gt;
&lt;p&gt;Most teams configure “informational alerts”—pages that fire to tell an engineer that a queue is slightly full, or that CPU is running a bit hot, even though no user impact is occurring and no action is required. These informational pages dilute the urgency of the alerting system. Furthermore, alerts are often created without clear ownership or runbooks, leaving the paged engineer guessing what they are supposed to do to mitigate the issue.&lt;/p&gt;
&lt;h2 id=&quot;actionable-alert-engineering&quot;&gt;Actionable Alert Engineering&lt;/h2&gt;
&lt;p&gt;A mature observability system treats every alert as a formal contract between the system and the engineer. Every alert must strictly adhere to the following framework:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Owner:&lt;/strong&gt; The team responsible for maintaining the alert and resolving the underlying issue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Impact:&lt;/strong&gt; The specific business or user impact (e.g., “Checkout service is failing”).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Severity:&lt;/strong&gt; The urgency of the response (e.g., SEV1 means immediate page, SEV3 means Slack notification during business hours).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Runbook:&lt;/strong&gt; A direct link to the exact steps required to triage and mitigate the issue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Threshold Rationale:&lt;/strong&gt; A documented explanation of &lt;em&gt;why&lt;/em&gt; the threshold is set where it is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Suppression Logic:&lt;/strong&gt; Rules that silence the alert during known maintenance windows or downstream outages.&lt;/li&gt;
&lt;/ol&gt;
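&lt;p&gt;One way to make this contract enforceable is to encode it as a typed record and reject any alert definition that leaves a field empty. The sketch below is illustrative: the class name, the &lt;code&gt;review_alert&lt;/code&gt; function, the intermediate &lt;code&gt;SEV2&lt;/code&gt; level, and the rule that runbooks must be https links are all assumptions, not part of any alerting product.&lt;/p&gt;

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AlertContract:
    """One alert definition; each field mirrors an item in the framework."""
    name: str
    owner: str
    impact: str
    severity: str
    runbook_url: str
    threshold_rationale: str
    suppression_logic: str


# SEV2 is an assumed intermediate level between the two named in the text.
ALLOWED_SEVERITIES = ("SEV1", "SEV2", "SEV3")


def review_alert(alert):
    """Return a list of reasons to reject the alert; empty means it passes."""
    problems = [
        f"missing {field.name}"
        for field in fields(alert)
        if not getattr(alert, field.name).strip()
    ]
    if alert.severity and alert.severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity {alert.severity}")
    if alert.runbook_url and not alert.runbook_url.startswith("https://"):
        problems.append("runbook must be a direct https link")
    return problems
```

&lt;p&gt;Running a check like this in CI against the alert definitions turns the framework from a guideline into a merge gate.&lt;/p&gt;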
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented pattern for surviving alert fatigue involves aggressive alert bankruptcy and continuous pruning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Google’s Site Reliability Engineering book describes alert fatigue as a direct consequence of alerts that require no human action, documenting the principle that every page must be actionable and that systems should not generate pages the engineer can resolve by doing nothing (&lt;a href=&quot;https://sre.google/sre-book/practical-alerting/&quot;&gt;Google SRE Book: Practical Alerting from Time-Series Data&lt;/a&gt;). The SRE book states: “if humans are required to read an email or message more than twice a week to determine whether action is needed, that’s a symptom of a monitoring problem.”&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The documented operational practice is to review pager history and delete any alert that was consistently acknowledged and resolved without engineer action. Evaluating alerts over a sustained window (“condition must be true for 5 consecutive minutes”) rather than triggering on a single anomalous data point absorbs transient spikes, a common source of false-positive pages in busy database environments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The same SRE principles recommend a regular alert review cadence — sometimes called “alert bankruptcy” — where the team asks: if we deleted this alert and something bad happened, would we catch it through another signal? If yes, the alert is noise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; An alert that auto-resolves before the engineer logs in should never have paged. Delay-based evaluation (sustained condition, not instantaneous breach) is the mechanical fix; runbook discipline is the organizational fix.&lt;/p&gt;
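&lt;p&gt;Delay-based evaluation can be sketched in a few lines. The function below assumes one metric sample per minute, oldest first, and pages only when every one of the most recent samples breaches the threshold; the function name and parameters are illustrative rather than taken from a specific monitoring system.&lt;/p&gt;

```python
def should_page(samples, threshold, sustained_points=5):
    """Page only if the last `sustained_points` samples all breach the threshold.

    `samples` is a list of per-minute metric values, oldest first.
    """
    if not sustained_points > 0:
        raise ValueError("sustained_points must be positive")
    if sustained_points > len(samples):
        return False  # not enough history to call the condition sustained
    recent = samples[-sustained_points:]
    return all(value > threshold for value in recent)
```

&lt;p&gt;A single spike to 99% CPU inside an otherwise quiet window returns &lt;code&gt;False&lt;/code&gt;, while five consecutive minutes above the threshold return &lt;code&gt;True&lt;/code&gt;. The cost of this design is detection latency: a real outage pages a few minutes later than it would with instantaneous evaluation.&lt;/p&gt;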
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;Implementing strict alert governance comes with organizational friction:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;Advantage&lt;/th&gt;&lt;th&gt;Disadvantage&lt;/th&gt;&lt;th&gt;Failure Mode&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Broad Infrastructure Alerts&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Easy to set up; catches any anomaly on any host.&lt;/td&gt;&lt;td&gt;Generates massive noise; low correlation to user pain.&lt;/td&gt;&lt;td&gt;Engineers ignore the pager, missing real outages.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Strict SLO/User-Impact Alerts&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Extremely high signal-to-noise ratio; pages only when users suffer.&lt;/td&gt;&lt;td&gt;Requires deep instrumentation of the application stack.&lt;/td&gt;&lt;td&gt;A database fills its disk silently until it hard-crashes, causing a massive outage.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Alert fatigue is not a volume problem — it’s a contract problem. Alerts that fire without a clear required action train engineers to ignore pages, making the one alert that matters indistinguishable from the noise.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Require every alert to pass an actionability review before deployment: who owns it, what specific runbook step executes when it fires, what threshold justification exists — alerts failing this review are rejected, not tuned.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Identify your top-firing alert from the past month, delete it, and monitor for two weeks — if no business impact occurs, it was noise. If impact occurs, the condition should have been caught upstream by an SLO-based alert, not this threshold.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Run a pager review meeting this week. For every alert that fired and was resolved without action, either delete it or document why it deserved a page. The goal is to cut weekly alert volume by at least 50% before the next on-call rotation.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>failures</category><category>checklist</category><category>architecture</category></item><item><title>Cost Observability: Build Dashboards That Show Waste Before Finance Finds It</title><link>https://rajivonai.com/blog/2024-11-19-cost-observability-database-dashboards/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-11-19-cost-observability-database-dashboards/</guid><description>How to expand monitoring beyond uptime by building dashboards that expose underutilized RDS instances, EBS io2 waste, and backup retention drift.</description><pubDate>Tue, 19 Nov 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;If the first time engineering hears about a database cost spike is during a monthly finance review, your observability stack is fundamentally incomplete.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Database engineering traditionally focuses on two metrics: availability and latency. As long as the database is up and queries are fast, the system is considered healthy. However, in the cloud era, infrastructure is elastic, and cost is the hidden third metric. Managed database services like Amazon RDS, Aurora, and DynamoDB make it incredibly easy to spin up massive, highly available clusters. They also make it incredibly easy to bleed tens of thousands of dollars in hidden waste.&lt;/p&gt;
&lt;p&gt;Most monitoring dashboards ignore cost entirely. Engineers look at CPU utilization to ensure it isn’t too high, but they rarely look at CPU utilization to ensure it isn’t too low. When observability is decoupled from cost, teams routinely run development environments on &lt;code&gt;db.r6g.4xlarge&lt;/code&gt; instances, leave obsolete manual snapshots sitting in S3 for years, and over-provision EBS IOPS for workloads that no longer need them.&lt;/p&gt;
&lt;h2 id=&quot;symptoms&quot;&gt;Symptoms&lt;/h2&gt;
&lt;p&gt;Cost inefficiency in cloud databases rarely triggers an immediate outage. Instead, it manifests as silent financial degradation. The symptoms include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Idle Giant:&lt;/strong&gt; A massive database instance sits at 2% CPU utilization and 5% memory usage 24/7.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The IOPS Over-Provision:&lt;/strong&gt; A database is running on an &lt;code&gt;io2&lt;/code&gt; Block Express volume provisioned for 20,000 IOPS, but CloudWatch shows it has never exceeded 1,000 IOPS in the past month.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Snapshot Hoard:&lt;/strong&gt; The AWS bill shows RDS backup storage costs exceeding the actual running instance costs due to years of manual, un-expired snapshots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Multi-AZ Dev Environment:&lt;/strong&gt; Non-production environments are running with Multi-AZ redundancy enabled, doubling the compute cost for workloads that can tolerate an hour of downtime.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;first-five-checks&quot;&gt;First Five Checks&lt;/h2&gt;
&lt;p&gt;To integrate cost into your operational posture, build a dedicated “Cost Triage” dashboard with these five checks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check Peak CPU and Connection Counts (30-Day Window):&lt;/strong&gt;
If an instance has not exceeded 20% CPU utilization and 10% connection pool usage during its highest peak over a 30-day window, it is a prime candidate for downsizing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Evaluate Provisioned IOPS vs. Consumed IOPS:&lt;/strong&gt;
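For RDS instances, compare the CloudWatch &lt;code&gt;ReadIOPS&lt;/code&gt; and &lt;code&gt;WriteIOPS&lt;/code&gt; metrics against the provisioned IOPS limit; for self-managed databases on EBS, use &lt;code&gt;VolumeReadOps&lt;/code&gt; and &lt;code&gt;VolumeWriteOps&lt;/code&gt; divided by the period length in seconds, since those metrics report operation counts per period. If peak consumption is a small fraction of the limit, migrate from &lt;code&gt;io2&lt;/code&gt; to &lt;code&gt;gp3&lt;/code&gt; or lower the provisioned &lt;code&gt;io2&lt;/code&gt; ceiling.&lt;/p&gt;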
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Audit Multi-AZ Deployments by Environment Tag:&lt;/strong&gt;
Query your infrastructure state (via AWS Config or your IaC state file) to find any instance tagged &lt;code&gt;env:dev&lt;/code&gt; or &lt;code&gt;env:staging&lt;/code&gt; that has &lt;code&gt;MultiAZ&lt;/code&gt; set to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review Manual Snapshot Age:&lt;/strong&gt;
List all manual RDS snapshots without an expiration tag. Automated backups age out with the retention period; manual snapshots taken “just in case” before a migration persist until someone deletes them and accrue backup storage charges the whole time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Track CloudWatch Log Ingestion and Retention:&lt;/strong&gt;
Database audit logs, slow query logs, and error logs pushed to CloudWatch Logs can become extremely expensive. Check the retention policies—logs kept indefinitely instead of aging out to S3 Glacier drive up costs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
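&lt;p&gt;The first three checks reduce to simple comparisons once the metrics are collected. The sketch below assumes the 30-day peaks and instance metadata have already been pulled from CloudWatch and your inventory; the function names and the &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;env&lt;/code&gt;, and &lt;code&gt;multi_az&lt;/code&gt; dictionary keys are hypothetical.&lt;/p&gt;

```python
def is_downsize_candidate(peak_cpu_pct, peak_connections, max_connections,
                          cpu_limit_pct=20.0, connection_limit_pct=10.0):
    """Check 1: flag an instance whose 30-day peaks stay within both limits."""
    connection_pct = 100.0 * peak_connections / max_connections
    return cpu_limit_pct >= peak_cpu_pct and connection_limit_pct >= connection_pct


def iops_utilization(peak_consumed_iops, provisioned_iops):
    """Check 2: fraction of the provisioned IOPS ceiling used at peak."""
    return peak_consumed_iops / provisioned_iops


def multi_az_non_production(instances, non_prod_envs=("dev", "staging")):
    """Check 3: identifiers of non-production instances with Multi-AZ enabled."""
    return [
        instance["id"]
        for instance in instances
        if instance["env"] in non_prod_envs and instance["multi_az"]
    ]
```

&lt;p&gt;For the volume described under Symptoms, a peak of 1,000 consumed IOPS against 20,000 provisioned gives a utilization of 0.05, meaning 95% of the provisioned capacity is paid for and never used.&lt;/p&gt;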
&lt;h2 id=&quot;decision-tree&quot;&gt;Decision Tree&lt;/h2&gt;
&lt;p&gt;When evaluating a database for cost optimization, use this triage flow to determine the safest remediation path.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Database Identified as High Cost] --&gt; B{Is it Production?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|No| C[Check High-Availability Config]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; C1{Is Multi-AZ Enabled?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|Yes| C2[Disable Multi-AZ]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C1 --&gt;|No| C3[Check Uptime Needs]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C3 --&gt;|Can be stopped| C4[Implement Nightly Stop/Start Schedule]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    &lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt;|Yes| D[Check Utilization Metrics]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; D1{Is Peak CPU &amp;#x3C; 20%?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D1 --&gt;|Yes| D2[Downsize Instance Type]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D1 --&gt;|No| D3[Check Storage Configuration]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D3 --&gt; D4{Using Provisioned IOPS io1/io2?}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D4 --&gt;|Yes| D5[Evaluate Migration to gp3]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;remediation-options&quot;&gt;Remediation Options&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Instance Downsizing (High Impact, Low Risk):&lt;/strong&gt;
Each step down within an instance family (for example, from &lt;code&gt;4xlarge&lt;/code&gt; to &lt;code&gt;2xlarge&lt;/code&gt;) roughly halves the compute cost.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; This requires a brief interruption of service (failover). Ensure the application is resilient to connection drops.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Migrating &lt;code&gt;io1/io2&lt;/code&gt; to &lt;code&gt;gp3&lt;/code&gt; (High Impact, Zero Downtime):&lt;/strong&gt;
Modern &lt;code&gt;gp3&lt;/code&gt; volumes offer baseline performance of 3,000 IOPS and can be scaled up to 16,000 IOPS, which covers most database workloads at a fraction of the cost of &lt;code&gt;io2&lt;/code&gt;. Storage type modifications can be done online.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; Modifying a large volume can take days to complete in the background, during which performance may be slightly degraded.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automated Start/Stop for Dev Environments (Medium Impact, Zero Cost Risk):&lt;/strong&gt;
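Using Instance Scheduler on AWS to shut down dev databases at 6 PM and start them at 8 AM cuts billable compute hours by about 58% when applied every day, and by about 70% if the databases also stay off on weekends. Storage is still billed while an instance is stopped.&lt;/p&gt;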
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Tradeoff:&lt;/em&gt; Engineers working off-hours will need self-service access to manually restart their environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
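&lt;p&gt;The savings from a stop/start schedule depend on how many hours per week the instance still runs. A rough sketch of the arithmetic, assuming compute is billed per running hour and ignoring storage, which continues to be billed while an instance is stopped:&lt;/p&gt;

```python
HOURS_PER_WEEK = 24 * 7


def compute_savings_fraction(running_hours_per_day, running_days_per_week):
    """Fraction of weekly compute hours avoided by a stop/start schedule."""
    running_hours = running_hours_per_day * running_days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 8 AM to 6 PM is 10 running hours per day.
every_day = compute_savings_fraction(10, 7)
weekdays_only = compute_savings_fraction(10, 5)
```

&lt;p&gt;An 8 AM to 6 PM schedule applied seven days a week leaves 70 of 168 weekly hours running, which saves about 58% of compute. Keeping the instance off on weekends as well leaves 50 hours running and saves about 70%.&lt;/p&gt;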
&lt;h2 id=&quot;rollback-plan&quot;&gt;Rollback Plan&lt;/h2&gt;
&lt;p&gt;When downsizing a database, always monitor application latency immediately following the cutover. If the smaller instance lacks the CPU or memory to serve queries efficiently, the rollback plan is to immediately issue another modify-instance command to scale back up. Because scaling up also requires a reboot or failover, expect a second disruption: typically one to two minutes for a Multi-AZ failover, and longer for a Single-AZ instance.&lt;/p&gt;
&lt;h2 id=&quot;automation-opportunity&quot;&gt;Automation Opportunity&lt;/h2&gt;
&lt;p&gt;Deploy a Lambda function triggered by EventBridge that runs weekly. The function should scan all RDS snapshots, identify any manual snapshot older than 90 days that does not have a &lt;code&gt;Compliance&lt;/code&gt; or &lt;code&gt;LegalHold&lt;/code&gt; tag, and automatically delete it. This prevents the “snapshot hoard” from silently inflating the AWS bill over time.&lt;/p&gt;
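&lt;p&gt;The selection logic for such a function can be kept separate from the AWS calls so it is easy to test. The sketch below assumes each snapshot is a dictionary shaped like an entry in the boto3 &lt;code&gt;describe_db_snapshots&lt;/code&gt; response; the function name and the dry-run approach are illustrative, not a prescribed implementation.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

PROTECTED_TAG_KEYS = frozenset({"Compliance", "LegalHold"})


def snapshots_to_delete(snapshots, now, max_age_days=90):
    """Return identifiers of manual snapshots past the age limit with no protective tag.

    Each snapshot is assumed to carry DBSnapshotIdentifier, SnapshotType,
    SnapshotCreateTime (timezone-aware), and an optional TagList.
    """
    cutoff = now - timedelta(days=max_age_days)
    expired = []
    for snapshot in snapshots:
        if snapshot["SnapshotType"] != "manual":
            continue  # automated backups age out on their own
        tag_keys = {tag["Key"] for tag in snapshot.get("TagList", [])}
        if tag_keys.intersection(PROTECTED_TAG_KEYS):
            continue  # never touch snapshots held for compliance
        if cutoff > snapshot["SnapshotCreateTime"]:
            expired.append(snapshot["DBSnapshotIdentifier"])
    return expired
```

&lt;p&gt;In a Lambda handler, the input would come from paginated &lt;code&gt;describe_db_snapshots&lt;/code&gt; calls and each returned identifier would be passed to &lt;code&gt;delete_db_snapshot&lt;/code&gt;. Logging the list for a few weeks before enabling deletion is a cheap safeguard against an unexpected tagging gap.&lt;/p&gt;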
&lt;h2 id=&quot;leadership-summary&quot;&gt;Leadership Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost is an Engineering Metric:&lt;/strong&gt; Do not treat cost as an external business constraint. Expose cloud costs directly alongside CPU and memory on your engineering dashboards.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tagging is Operations:&lt;/strong&gt; You cannot optimize what you cannot identify. Strict enforcement of &lt;code&gt;Environment&lt;/code&gt;, &lt;code&gt;Team&lt;/code&gt;, and &lt;code&gt;Service&lt;/code&gt; tags is the prerequisite for all cost observability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Cloud is Elastic, Use It:&lt;/strong&gt; A database that runs 24/7 at 5% utilization is a failure of cloud architecture. Build your environments to scale down or shut off entirely when not in use.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; When observability is decoupled from cost, teams routinely over-provision dev environments on &lt;code&gt;db.r6g.4xlarge&lt;/code&gt;, hoard manual snapshots for years, and leave &lt;code&gt;io2&lt;/code&gt; volumes provisioned at 20,000 IOPS for workloads that never exceed 1,000 — none of which triggers an availability alert until the finance review.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Build a “Database Waste” dashboard ranking instances by lowest peak CPU and highest storage cost, then automate weekly scans for Multi-AZ dev environments and snapshots older than 90 days without a compliance tag.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Identify one non-production database with Multi-AZ enabled, disable it via Terraform, and show the projected yearly savings — this is the first concrete signal that cost observability is surfacing real waste before finance does.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Run the five checks above against your current RDS fleet this week. Any dev instance at sub-20% peak CPU with Multi-AZ enabled is an immediate win: disable Multi-AZ and schedule a nightly stop/start via Instance Scheduler.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>cloud</category><category>architecture</category><category>checklist</category></item><item><title>Consistency Models Your Application Actually Needs</title><link>https://rajivonai.com/blog/2024-03-12-consistency-models-your-application-actually-needs/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-12-consistency-models-your-application-actually-needs/</guid><description>The difference between read committed, repeatable read, and serializable isolation in operational terms — and why most applications are running with weaker guarantees than engineers assume.</description><pubDate>Tue, 12 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Most applications are running on Read Committed isolation. Most engineers assume Serializable. The gap between these two assumptions is where race conditions, double-bookings, and phantom reads live in production — problems that appear intermittently and are nearly impossible to reproduce in testing.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL supports four isolation levels: Read Uncommitted (aliased to Read Committed in PostgreSQL), Read Committed, Repeatable Read, and Serializable. MySQL InnoDB supports the same four. The ANSI SQL standard defines these levels by which anomalies they prevent.&lt;/p&gt;
&lt;p&gt;Most applications use the database default without explicitly choosing it: Read Committed in PostgreSQL, Repeatable Read in MySQL InnoDB. Most engineers do not know which anomalies their default allows.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;An application manages event ticket inventory. Two users request the last ticket simultaneously. Each transaction reads the remaining count (1), decides it can proceed, and issues its insert. Both succeed. The event is now oversold. This is the lost update pattern (a check-then-act race), and it happens at Read Committed because each transaction read the committed count before the other’s write committed, and nothing forced either one to re-check.&lt;/p&gt;
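&lt;p&gt;A minimal sketch of that interleaving, to be read as two separate sessions rather than one script. It assumes a hypothetical &lt;code&gt;tickets(event_id, quantity)&lt;/code&gt; table that starts with a quantity of 1 and a hypothetical &lt;code&gt;orders(event_id, user_id)&lt;/code&gt; table; the user ids are illustrative only.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical schema: tickets(event_id, quantity), orders(event_id, user_id)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Session A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT quantity FROM tickets WHERE event_id = 42;  -- A reads 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Session B, before A writes anything&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT quantity FROM tickets WHERE event_id = 42;  -- B also reads 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Both sessions decide in application code that a ticket is available&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;INSERT INTO orders (event_id, user_id) VALUES (42, 1001);  -- Session A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;INSERT INTO orders (event_id, user_id) VALUES (42, 1002);  -- Session B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;  -- Session A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;  -- Session B: two orders now exist for one ticket&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Neither insert touches a row the other session wrote, so the database has nothing to block on and no error to raise.&lt;/p&gt;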
&lt;p&gt;Read Committed is not wrong. It is the right choice for most workloads. But using it for inventory, financial balances, or any counter where two concurrent writers can conflict requires explicit application-level locking to compensate.&lt;/p&gt;
&lt;p&gt;What does each isolation level actually prevent, and how do you know which one your application needs?&lt;/p&gt;
&lt;h2 id=&quot;the-isolation-levels&quot;&gt;The Isolation Levels&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Read Committed&lt;/strong&gt; (PostgreSQL default): each statement in a transaction reads the latest committed data at the moment that statement executes. A second SELECT in the same transaction may return different rows than the first if another transaction committed between them. Prevents: dirty reads. Does NOT prevent: non-repeatable reads, phantom reads, lost updates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Repeatable Read&lt;/strong&gt;: every statement in a transaction reads from the same snapshot, established by the first query in the transaction. A second SELECT returns the same rows as the first, even if another transaction committed between them. Prevents: non-repeatable reads. Does NOT prevent: phantom reads in standard SQL (PostgreSQL’s snapshot-based implementation does prevent them). Lost updates are engine-dependent: in PostgreSQL, a transaction that tries to modify a row changed by a concurrent committed transaction fails with a serialization error and must be retried, while MySQL InnoDB can still lose updates unless the read takes a lock. Does NOT prevent: write skew.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Serializable&lt;/strong&gt; (SSI): transactions execute as if they ran one at a time, in some serial order. If the read/write dependencies among concurrent transactions could produce a result that no serial order would, PostgreSQL aborts one of them with a serialization failure. Prevents: all standard anomalies including phantoms and write skew. Cost: serialization failures require application retry logic.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Set isolation level for a transaction&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ISOLATION&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LEVEL&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; REPEATABLE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; READ&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- or&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ISOLATION&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LEVEL&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SERIALIZABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check current transaction isolation&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SHOW transaction_isolation;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Ticket inventory pattern with explicit locking at Read Committed:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; quantity &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tickets &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; event_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 42&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; FOR&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; UPDATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Only one transaction at a time holds the row lock past this point; others wait here&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;UPDATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tickets &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; quantity &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; quantity &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; event_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 42&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; quantity &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;COMMIT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; adds an explicit row lock — it is the correct pattern for counter decrement operations at Read Committed isolation, because it prevents the lost update anomaly that Read Committed otherwise allows.&lt;/p&gt;
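&lt;p&gt;Where the decision depends only on the counter itself, a single conditional &lt;code&gt;UPDATE&lt;/code&gt; is a common alternative, sketched here against the same illustrative &lt;code&gt;tickets&lt;/code&gt; table. At Read Committed, PostgreSQL waits for a concurrent writer on the row and then re-checks the &lt;code&gt;WHERE&lt;/code&gt; clause against the latest committed version, so a second transaction matches no row once the quantity reaches zero. Treating an empty result as “sold out” is an application-side assumption.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Decrement only if stock remains; a row comes back only when the decrement happened&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UPDATE tickets&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SET quantity = quantity - 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE event_id = 42 AND quantity &gt; 0&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;RETURNING quantity;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Zero rows returned: nothing was decremented, so the application should not issue a ticket&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This form only covers a check that lives in the same row as the write. When the check spans other rows or tables, explicit locking or Serializable isolation is still needed.&lt;/p&gt;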
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL implements Serializable with Serializable Snapshot Isolation (SSI), which uses non-blocking predicate locks and dependency tracking to detect serialization conflicts. Conflicting transactions are not blocked; instead, one of them fails, either on a later statement or at COMMIT. The application must catch &lt;code&gt;ERROR: could not serialize access&lt;/code&gt; (SQLSTATE &lt;code&gt;40001&lt;/code&gt;) and retry the whole transaction.&lt;/p&gt;
&lt;p&gt;The documented anomalies that SSI prevents but Repeatable Read does not are write skew (two transactions each read a condition that the other’s write will violate) and related read/write dependency cycles, including ones where the conflicting write inserts rows that the other transaction’s query would have matched. The canonical write skew example: two doctors each check whether at least one other doctor is on call, find yes, and both go off call, leaving no coverage. At Repeatable Read, both succeed. At Serializable, one is aborted.&lt;/p&gt;
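&lt;p&gt;The on-call example can be sketched as two concurrent sessions. The &lt;code&gt;doctors(name, on_call)&lt;/code&gt; table and the two names are hypothetical, and the sketch assumes both doctors start on call.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Hypothetical table: doctors(name text, on_call boolean)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Session A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN ISOLATION LEVEL REPEATABLE READ;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT count(*) FROM doctors WHERE on_call;  -- A sees two doctors on call&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Session B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;BEGIN ISOLATION LEVEL REPEATABLE READ;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT count(*) FROM doctors WHERE on_call;  -- B sees two doctors on call&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Each session updates a different row, so there is no row-level conflict&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UPDATE doctors SET on_call = false WHERE name = &apos;Alice&apos;;  -- Session A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;UPDATE doctors SET on_call = false WHERE name = &apos;Bob&apos;;  -- Session B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;  -- Session A&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;COMMIT;  -- Session B: both commit, nobody is on call&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- With BEGIN ISOLATION LEVEL SERIALIZABLE in both sessions, one of them&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- fails with SQLSTATE 40001 and has to be retried by the application&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;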
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Anomaly&lt;/th&gt;&lt;th&gt;Isolation level needed&lt;/th&gt;&lt;th&gt;Pattern&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Lost update (concurrent increment/decrement)&lt;/td&gt;&lt;td&gt;Read Committed + &lt;code&gt;FOR UPDATE&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Explicit locking on the row being modified&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Non-repeatable read (read same row twice, get different value)&lt;/td&gt;&lt;td&gt;Repeatable Read&lt;/td&gt;&lt;td&gt;Long read transactions that must see consistent data&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Write skew (two transactions each invalidate the other’s assumption)&lt;/td&gt;&lt;td&gt;Serializable&lt;/td&gt;&lt;td&gt;Doctor on-call, seat booking, any “check then act” pattern&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Phantom read (new rows appear in range query)&lt;/td&gt;&lt;td&gt;Repeatable Read (PostgreSQL)&lt;/td&gt;&lt;td&gt;Reporting queries with range conditions&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Applications running at Read Committed default isolation are exposed to lost updates and non-repeatable reads that appear as intermittent data inconsistencies under concurrent load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Identify the data entities where concurrent writes conflict (counters, balances, inventory, slots) and add &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; or switch to Serializable isolation with retry logic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After adding &lt;code&gt;FOR UPDATE&lt;/code&gt; to your inventory decrement pattern, the oversell scenario cannot occur — the second transaction blocks until the first commits, then re-evaluates the quantity condition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Find the one place in your application where two concurrent users can write to the same row without coordination — that is your lost update risk — and verify whether you have explicit locking or rely on application-level checks that the database does not enforce.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>SIMD vs SIMT Explained for Database Engineers</title><link>https://rajivonai.com/blog/2024-03-03-simd-vs-simt-for-database-engineers/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-03-simd-vs-simt-for-database-engineers/</guid><description>A DBA-friendly explanation of SIMD and SIMT using query execution, vectorized processing, and GPU mental models instead of hardware jargon.</description><pubDate>Sun, 03 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A lot of GPU and vectorized execution discussions get confusing because people jump straight into terms like lanes, warps, thread blocks, and vector units, leaving database engineers to translate hardware jargon into query plans.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;As analytical workloads grow and latency SLAs shrink, relying solely on row-by-row CPU execution is no longer viable. The industry has firmly shifted toward hardware acceleration for query execution. Systems are increasingly utilizing both CPU vector extensions (like AVX-512) and GPU offloading to process massive datasets faster. A lot of CPU-side gains in modern analytical engines come from vectorized execution and cache-friendly data layouts, while GPUs drive high throughput by maintaining massive thread pools for regular operations.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;When teams transition to hardware-accelerated databases, they often struggle to predict which workloads will actually benefit. A query that screams on a GPU might crawl if slightly modified, and CPU vectorization sometimes fails to engage at all due to data layout or branch-heavy logic. This unpredictability stems from treating “acceleration” as a black box without understanding the fundamental differences in how CPUs and GPUs parallelize work. If we don’t understand the execution model—specifically what gets parallelized and how branching affects the pipeline—how can we design schemas and write queries that actually leverage the hardware?&lt;/p&gt;
&lt;h2 id=&quot;core-concept&quot;&gt;Core Concept&lt;/h2&gt;
&lt;p&gt;To understand the mechanics, we need to look at how a single operation is applied over large amounts of data. If you already understand vectorized query execution, row-at-a-time vs batch-at-a-time processing, and scan-heavy analytics, you already understand most of SIMD and SIMT.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[Query Operator] --&gt; B[SIMD CPU Execution]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A --&gt; C[SIMT GPU Execution]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; D[Single worker — Wide vector registers]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E[Several rows processed per instruction]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt; F[Thousands of lightweight workers]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; G[Each thread handles a slice concurrently]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SIMD (Single Instruction, Multiple Data):&lt;/strong&gt; This is vertical widening inside the CPU. A single CPU worker uses wide vector registers to apply one instruction across several values simultaneously. Where a standard engine evaluates a filter one row at a time, a SIMD-enabled vectorized executor works through a batch (for example, 1024 rows) in a tight loop in which each instruction handles a full register of values; a 512-bit AVX-512 register holds sixteen 32-bit integers. SIMD usually helps with vectorized scans, arithmetic-heavy expressions, and batched comparisons.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SIMT (Single Instruction, Multiple Threads):&lt;/strong&gt; This is horizontal scaling inside a GPU. The hardware runs the same logical program across thousands of threads simultaneously, scheduling them in lockstep groups (warps of 32 threads on NVIDIA GPUs). Instead of widening one worker, SIMT launches a massive grid of lightweight workers, each applying the same operation to a different data slice. SIMT usually helps with large scans, parallel filtering, aggregations, and vector similarity calculations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you remember one principle, remember this: SIMD widens a worker, whereas SIMT multiplies workers.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;These execution models show up directly in how production systems behave: the same query can perform very differently depending on how the execution engine maps onto the underlying hardware.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example 1: CPU-friendly vectorized query (SIMD)&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(price)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; fact_sales&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; date_key &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BETWEEN&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20240101&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 20240131&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;ClickHouse and SIMD:&lt;/em&gt; ClickHouse makes heavy use of SIMD instructions (such as SSE4.2 and AVX-512) for this type of query. Because data is stored in contiguous columnar blocks, column values can be loaded straight into vector registers. A single core evaluates the predicate on several integers per instruction, relying on vectorized predicate evaluation and batched accumulation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example 2: GPU-friendly scan and aggregate (SIMT)&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; country, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(revenue)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; events&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; country;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;HEAVY.AI and SIMT:&lt;/em&gt; For GPU-native systems like HEAVY.AI (formerly OmniSci), the engine compiles SQL queries into LLVM IR and then to PTX code for NVIDIA GPUs. The SIMT model excels here because the massive scan volume and repeated per-row work map well onto thousands of GPU threads computing partial aggregates in parallel, which are then combined per group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example 3: Bad acceleration candidate&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; users&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; user_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 42&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;PostgreSQL and Row-at-a-Time:&lt;/em&gt; PostgreSQL historically processes queries row-by-row. While ideal for tiny indexed lookups where latency dominates, applying hardware acceleration here is counterproductive. Neither SIMD nor SIMT helps with single-row lookups because there is no batched data to widen and no parallel work to distribute.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;p&gt;Both models improve performance but have strict constraints, particularly around branching. Scalar CPU code handles irregular control flow well, thanks to branch prediction, but both vector units and GPUs lose efficiency when the logic diverges from row to row.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Execution Model&lt;/th&gt;&lt;th&gt;Strength&lt;/th&gt;&lt;th&gt;Failure Mode&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;SIMD (CPU)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Highly efficient for contiguous columnar scans with simple, repetitive predicates.&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Branch Divergence:&lt;/strong&gt; Performance collapses if the data requires complex, unpredictable &lt;code&gt;IF — ELSE&lt;/code&gt; branching. The vector pipeline must evaluate both sides and mask out unused lanes, wasting CPU cycles.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;SIMT (GPU)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Massive throughput for large aggregations, parallel joins, and heavy vector math.&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Thread Divergence:&lt;/strong&gt; If threads in the same hardware group take different execution paths, the GPU serializes execution, destroying performance. Additionally, tiny indexed lookups suffer heavily due to PCIe data transfer latency.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
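&lt;p&gt;To make the divergence rows in the table concrete, here is a hedged sketch contrasting two query shapes over the &lt;code&gt;events&lt;/code&gt; table from Example 2. The &lt;code&gt;revenue&lt;/code&gt; and &lt;code&gt;country&lt;/code&gt; columns come from that example; &lt;code&gt;event_type&lt;/code&gt;, &lt;code&gt;discount_code&lt;/code&gt;, and the 0.9 multiplier are hypothetical additions for illustration. Whether a given engine vectorizes either form depends on its executor, so treat this as an illustration of workload shape, not a benchmark.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Uniform work per row: the same comparison and addition for every value&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT SUM(revenue)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM events&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;WHERE country = &apos;DE&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Divergent work per row: neighbouring rows may take different CASE arms,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- so vector lanes or GPU threads in one group follow different paths&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;SELECT SUM(&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  CASE&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WHEN event_type = &apos;refund&apos; THEN -revenue&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    WHEN discount_code IS NOT NULL THEN revenue * 0.9&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    ELSE revenue&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  END)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;FROM events;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Short arms like these can often be evaluated for every row and blended with a mask, which costs extra arithmetic but keeps the lanes in step. The penalty grows as the arms become longer or call into row-at-a-time logic.&lt;/p&gt;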
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Unpredictable performance when migrating standard analytical workloads to accelerated database engines due to a mismatch between query logic and hardware execution models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Map the workload shape to the hardware—use SIMD-optimized columnar stores for general, batch-oriented analytics, and SIMT-based GPU engines for massive, regular, math-heavy scans.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Systems like ClickHouse achieve their speed through rigorous SIMD utilization on contiguous columnar data, while GPU databases like HEAVY.AI leverage SIMT to brute-force billion-row aggregates through parallel thread pools.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Audit slow analytical queries for heavy branching or scattered memory access. Refactor schema layouts to be columnar and contiguous, and replace row-at-a-time loop logic with vector-friendly bulk operations.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>cpu</category><category>gpu</category><category>performance</category></item><item><title>CPU vs GPU vs TPU Explained for Database Engineers</title><link>https://rajivonai.com/blog/2024-03-02-cpu-vs-gpu-vs-tpu-for-database-engineers/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-03-02-cpu-vs-gpu-vs-tpu-for-database-engineers/</guid><description>How CPU, GPU, and TPU architectures differ in ways that matter for databases and AI workloads — and which compute class to reach for when adding vector search, embedding generation, or GPU-accelerated analytics.</description><pubDate>Sat, 02 Mar 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Database infrastructure conversations are breaking down the moment hardware enters the room because engineers are asking the wrong question.&lt;/strong&gt; “Which is faster — CPU, GPU, or TPU?” is the wrong frame. The right question is the same one you already apply to query plans: what execution pattern does this workload need, and what hardware is optimized for that pattern?&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;OLTP systems are adding vector similarity, analytical aggregates, and AI inference to their workloads. Infrastructure teams are being asked to provision GPU instances without a framework for deciding when a GPU is the right choice versus a larger CPU instance or a purpose-built accelerator. The same confusion that once surrounded row-store vs column-store has returned at the hardware layer.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Engineers who treat CPU, GPU, and TPU as a linear performance hierarchy make the wrong call in both directions: they over-provision GPUs for workloads that remain CPU-bound (transactions, connection management, control flow), and they under-provision accelerators for workloads that are genuinely scan-heavy or tensor-heavy. The result is either wasted capacity or incorrect assumptions that “the GPU is faster” without a workload-specific basis.&lt;/p&gt;
&lt;p&gt;If you already understand OLTP vs OLAP, row vs column execution, and latency vs throughput, you already have the right mental model for this hardware decision.&lt;/p&gt;
&lt;h2 id=&quot;matching-execution-patterns-to-hardware&quot;&gt;Matching Execution Patterns to Hardware&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://rajivonai.com/diagrams/accelerated-data-systems/cpu-vs-gpu-vs-tpu-for-dbas.svg&quot; alt=&quot;CPU vs GPU vs TPU mental model&quot;&gt;&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Hardware&lt;/th&gt;&lt;th&gt;DBA Mental Model&lt;/th&gt;&lt;th&gt;Best At&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;td&gt;OLTP execution brain&lt;/td&gt;&lt;td&gt;Branching, coordination, transactions, mixed workloads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPU&lt;/td&gt;&lt;td&gt;Parallel analytics engine&lt;/td&gt;&lt;td&gt;Scans, filters, joins, aggregations, vector math&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;TPU&lt;/td&gt;&lt;td&gt;Matrix math appliance&lt;/td&gt;&lt;td&gt;Dense AI tensor operations and model inference/training&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;What a CPU Is&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A CPU is designed to be general-purpose. It handles many instruction types efficiently: branching, pointer chasing, transaction logic, conditional execution, scheduling and interrupts, complex control flow.&lt;/p&gt;
&lt;p&gt;Think of a CPU as a traditional relational engine running OLTP traffic.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; *&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; customer_id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 123&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AND&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;SHIPPED&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is CPU-friendly because it involves index lookups, branching, and low-latency response patterns.&lt;/p&gt;
&lt;p&gt;CPUs win when the workload is transactional, branch-heavy, latency-sensitive, coordination-heavy, or dominated by smaller irregular queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What a GPU Is&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A GPU is not a faster CPU. It is built for repeating the same operation across massive data volumes in parallel.&lt;/p&gt;
&lt;p&gt;Think of a GPU as a massively parallel analytics engine optimized for huge scans, repeated arithmetic, columnar execution, vector operations, and parallel filtering.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; SUM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(price &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; quantity)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; sales;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With billions of rows, this operation is repetitive and parallelizable — it maps well to GPU threads. GPUs win when the workload is scan-heavy, arithmetic-heavy, batch-oriented, highly parallelizable, or throughput-driven.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What a TPU Is&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A TPU is more specialized than CPU or GPU. It is designed for dense matrix and tensor math used heavily in neural networks. Think of a TPU as a purpose-built model-math execution appliance.&lt;/p&gt;
&lt;p&gt;TPUs are not general database accelerators. They are strongest when model computation itself is the bottleneck: neural network training, large-scale inference, dense tensor operations, and repeated matrix multiplications with regular shapes.&lt;/p&gt;
&lt;table class=&quot;compare-table&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Dimension&lt;/th&gt;
      &lt;th&gt;&lt;span class=&quot;hw-pill hw-cpu&quot;&gt;CPU&lt;/span&gt;&lt;/th&gt;
      &lt;th&gt;&lt;span class=&quot;hw-pill hw-gpu&quot;&gt;GPU&lt;/span&gt;&lt;/th&gt;
      &lt;th&gt;&lt;span class=&quot;hw-pill hw-tpu&quot;&gt;TPU&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Flexibility&lt;/td&gt;
      &lt;td&gt;Highest&lt;/td&gt;
      &lt;td&gt;Medium&lt;/td&gt;
      &lt;td&gt;Lowest&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Best workload&lt;/td&gt;
      &lt;td&gt;Mixed/general-purpose&lt;/td&gt;
      &lt;td&gt;Parallel analytics&lt;/td&gt;
      &lt;td&gt;AI tensor math&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Latency&lt;/td&gt;
      &lt;td&gt;Strong&lt;/td&gt;
      &lt;td&gt;Moderate&lt;/td&gt;
      &lt;td&gt;Workload-specific&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Throughput&lt;/td&gt;
      &lt;td&gt;Moderate&lt;/td&gt;
      &lt;td&gt;Very high&lt;/td&gt;
      &lt;td&gt;Very high for AI&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Branch-heavy logic&lt;/td&gt;
      &lt;td&gt;Excellent&lt;/td&gt;
      &lt;td&gt;Weak&lt;/td&gt;
      &lt;td&gt;Poor fit&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;OLTP&lt;/td&gt;
      &lt;td&gt;Best&lt;/td&gt;
      &lt;td&gt;Poor&lt;/td&gt;
      &lt;td&gt;Poor&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Analytics&lt;/td&gt;
      &lt;td&gt;Decent&lt;/td&gt;
      &lt;td&gt;Excellent&lt;/td&gt;
      &lt;td&gt;General mismatch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ML inference&lt;/td&gt;
      &lt;td&gt;Decent&lt;/td&gt;
      &lt;td&gt;Strong&lt;/td&gt;
      &lt;td&gt;Excellent&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Matrix multiplication&lt;/td&gt;
      &lt;td&gt;Okay&lt;/td&gt;
      &lt;td&gt;Strong&lt;/td&gt;
      &lt;td&gt;Best&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s execution model runs on CPUs: its buffer manager, lock manager, and MVCC machinery are built around sequential per-backend processing with branching logic. When you add a GPU-accelerated extension (such as PG-Strom for vectorized scan offload), the optimizer continues to handle query planning on the CPU while the GPU handles the data-parallel scan and aggregation phases. This division of labor, CPU for control and GPU for data-parallel computation, is the common design pattern for heterogeneous database systems.&lt;/p&gt;
&lt;p&gt;NVIDIA’s RAPIDS cuDF library (Apache 2.0, documented at &lt;a href=&quot;https://developer.nvidia.com/rapids&quot;&gt;developer.nvidia.com/rapids&lt;/a&gt;) runs pandas-like DataFrame operations on the GPU. The key design constraint is that data transfer between CPU memory and GPU memory (limited by PCIe bandwidth) is the dominant latency cost for small-to-medium datasets, so GPU acceleration does not pay off until the working set is large enough to amortize the transfer overhead.&lt;/p&gt;
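&lt;p&gt;The amortization argument can be written as a small cost model. This is a rough sketch under stated assumptions, not a benchmark: the function names are hypothetical, every rate below is a placeholder rather than a measurement, and the model assumes the PCIe copy and the GPU compute do not overlap.&lt;/p&gt;

```python
def cpu_seconds(n_bytes: float, cpu_bytes_per_s: float) -> float:
    """Time to process the working set in place on the CPU."""
    return n_bytes / cpu_bytes_per_s


def gpu_seconds(n_bytes: float, pcie_bytes_per_s: float,
                gpu_bytes_per_s: float, fixed_overhead_s: float) -> float:
    """Fixed launch cost, plus the PCIe copy, plus GPU processing time."""
    return fixed_overhead_s + n_bytes / pcie_bytes_per_s + n_bytes / gpu_bytes_per_s


def break_even_bytes(cpu_bytes_per_s: float, pcie_bytes_per_s: float,
                     gpu_bytes_per_s: float, fixed_overhead_s: float):
    """Working-set size above which the GPU path wins, or None if it never does."""
    gain_per_byte = 1 / cpu_bytes_per_s - 1 / pcie_bytes_per_s - 1 / gpu_bytes_per_s
    if not gain_per_byte > 0:
        return None  # the copy alone costs at least as much as the CPU scan
    return fixed_overhead_s / gain_per_byte


# Placeholder rates, not measurements: replace with numbers from your own hardware.
CPU, PCIE, GPU, OVERHEAD = 5e9, 25e9, 200e9, 1e-3
threshold = break_even_bytes(CPU, PCIE, GPU, OVERHEAD)
print(f"GPU path wins above roughly {threshold / 1e6:.1f} MB")
```

&lt;p&gt;The useful part is the shape, not the numbers: if the CPU can process bytes about as fast as PCIe can move them, the per-byte gain is zero or negative and no working-set size makes the offload worthwhile.&lt;/p&gt;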
&lt;p&gt;Google’s TPU documentation is explicit that TPUs are optimized for matrix multiplications with regular, statically shaped tensors, and that workloads dominated by irregular control flow, sparse memory access, or dynamic shapes are a better fit for CPUs or GPUs. This is analogous to a boundary a DBA already knows: the difference between a full table scan (GPU-friendly) and a complex multi-join query plan (CPU-friendly).&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;GPU for OLTP&lt;/td&gt;&lt;td&gt;Latency increases, no throughput gain&lt;/td&gt;&lt;td&gt;GPU launch overhead and PCIe transfer cost exceed the per-request compute savings&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CPU for large scans&lt;/td&gt;&lt;td&gt;Query can run an order of magnitude or more slower than a GPU equivalent&lt;/td&gt;&lt;td&gt;A CPU has tens of cores, not thousands, so it cannot spread the same scan across as many parallel lanes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;TPU for database workloads&lt;/td&gt;&lt;td&gt;Misfit: most DB operations are not dense tensor math&lt;/td&gt;&lt;td&gt;TPU lacks general-purpose branching and irregular memory access support&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Heterogeneous system with small working set&lt;/td&gt;&lt;td&gt;GPU transfer overhead dominates&lt;/td&gt;&lt;td&gt;PCIe bandwidth makes GPU offload slower than in-memory CPU execution until data volume is large enough&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Assuming GPU = faster for all AI workloads&lt;/td&gt;&lt;td&gt;Inference latency spikes at low concurrency&lt;/td&gt;&lt;td&gt;TPU tends to be faster for batched dense inference; GPU tends to win at moderate concurrency; CPU can win for single-request light inference&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Adding GPU or TPU infrastructure without a workload-to-hardware mapping wastes capacity on the wrong execution pattern.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Classify hot paths by execution pattern before choosing hardware — transactions and coordination stay on CPU, scan-heavy analytics move to GPU, dense model math goes to TPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run your heaviest analytical query on a GPU-enabled instance with a GPU execution engine (RAPIDS cuDF or a GPU database) and compare elapsed time and I/O throughput against the same query on your current CPU-only setup, ideally with a columnar engine such as DuckDB as the CPU baseline. Expect a clear gap for scan-heavy query shapes; the gap narrows or disappears for branch-heavy, CPU-bound query shapes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: This week, identify the three highest-CPU-cost queries in your monitoring dashboard and classify each as branch-heavy (CPU-bound) or scan-heavy (GPU candidate). That classification determines whether GPU provisioning is justified.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>architecture</category><category>ai-engineering</category></item><item><title>CAP Theorem in Operational Terms</title><link>https://rajivonai.com/blog/2024-01-09-cap-theorem-in-operational-terms/</link><guid isPermaLink="true">https://rajivonai.com/blog/2024-01-09-cap-theorem-in-operational-terms/</guid><description>What CAP theorem actually says about distributed database tradeoffs, why the CP vs AP framing is more useful than the theory, and what it means for your system when the network fails.</description><pubDate>Tue, 09 Jan 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;CAP theorem is not an academic curiosity. It tells you what your distributed database will do when the network between its nodes fails — and that is exactly when the wrong answer causes data loss or an outage. Most engineers have heard of CAP and most have the wrong mental model for applying it.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;CAP theorem, stated by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002, says that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. In practice, network partitions happen — so every distributed system must choose between consistency and availability when a partition occurs.&lt;/p&gt;
&lt;p&gt;This is the trade-off that matters operationally: when two nodes in your database cluster cannot communicate, what does the system do?&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Engineers designing distributed systems often say “we chose a CP database” or “we chose an AP database” without being able to answer a concrete operational question: if two of your five Cassandra nodes lose connectivity to the other three, what happens to reads and writes? What does a “consistent” or “available” choice mean in practice during a partial outage?&lt;/p&gt;
&lt;p&gt;CAP is only useful if you can translate it into a failure scenario answer.&lt;/p&gt;
&lt;h2 id=&quot;cp-vs-ap-in-operational-terms&quot;&gt;CP vs AP in Operational Terms&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;CP (Consistency + Partition Tolerance)&lt;/strong&gt;: During a partition, the system refuses to serve reads or writes that could return stale data or lose acknowledged writes. This means the system becomes unavailable for some or all operations during the partition. Correctness is preserved; availability is sacrificed.&lt;/p&gt;
&lt;p&gt;Examples of CP systems: PostgreSQL with synchronous replication (commits on the primary block while the synchronous standby is unreachable), etcd, ZooKeeper, HBase (when configured conservatively).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AP (Availability + Partition Tolerance)&lt;/strong&gt;: During a partition, the system continues to serve reads and writes from whichever nodes are reachable, accepting that different nodes may diverge and return different data. After the partition heals, the system reconciles the divergent state (using last-write-wins, vector clocks, or application-level conflict resolution). Availability is preserved; consistency is sacrificed temporarily.&lt;/p&gt;
&lt;p&gt;Examples of AP systems: Cassandra (by default, with eventual consistency), DynamoDB (with eventually consistent reads), CouchDB.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Partition occurs between Node A and Node B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;CP system:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  - Node A: &quot;I cannot confirm my data is consistent — refusing reads/writes&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  - Clients: receive errors or timeouts&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;AP system:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  - Node A: &quot;I&apos;ll serve what I have&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  - Node B: &quot;I&apos;ll serve what I have&quot;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  - Clients: may get different answers from A and B&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  - After partition heals: A and B reconcile (last-write-wins or merge)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s documented behavior during replication failure depends on the &lt;code&gt;synchronous_commit&lt;/code&gt; and &lt;code&gt;synchronous_standby_names&lt;/code&gt; settings. With &lt;code&gt;synchronous_commit = on&lt;/code&gt; and a synchronous standby configured, the primary does not acknowledge a commit until the standby has confirmed it; this is CP behavior. If the standby disconnects, &lt;code&gt;wal_sender_timeout&lt;/code&gt; only terminates the dead replication connection; it does not release waiting commits. Commits keep waiting until a synchronous standby reconnects or an operator (or failover tooling) removes it from &lt;code&gt;synchronous_standby_names&lt;/code&gt;. During that wait, writes are blocked: the system chooses consistency over availability.&lt;/p&gt;
&lt;p&gt;Cassandra’s documented consistency levels operationalize the tradeoff explicitly. &lt;code&gt;QUORUM&lt;/code&gt; reads and writes require a majority of the replicas for a key to respond; this provides a stronger consistency guarantee, but the operation fails when fewer than a majority of those replicas are reachable. &lt;code&gt;ONE&lt;/code&gt; reads and writes require only one replica to respond, maximizing availability at the cost of potentially reading stale data.&lt;/p&gt;
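&lt;p&gt;The arithmetic behind these consistency levels can be sketched in a few lines. This is a minimal illustration, not Cassandra client code: the function names and the replication factor of 3 are assumptions chosen for the example, and the majority is computed as floor(RF / 2) + 1.&lt;/p&gt;

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def can_serve(required_replicas: int, reachable_replicas: int) -> bool:
    """An operation succeeds only if enough replicas for the key respond."""
    return reachable_replicas >= required_replicas


def overlap_guaranteed(read_replicas: int, write_replicas: int,
                       replication_factor: int) -> bool:
    """A read set always intersects the latest write set when R + W exceeds RF."""
    return read_replicas + write_replicas > replication_factor


RF = 3  # assumed replication factor for the example
q = quorum(RF)

# During a partition that leaves only one replica of a key reachable:
print("QUORUM succeeds:", can_serve(q, reachable_replicas=1))
print("ONE succeeds:", can_serve(1, reachable_replicas=1))

# Consistency side of the tradeoff:
print("QUORUM/QUORUM overlap:", overlap_guaranteed(q, q, RF))
print("ONE/ONE overlap:", overlap_guaranteed(1, 1, RF))
```

&lt;p&gt;In this model, the same partition that makes a &lt;code&gt;QUORUM&lt;/code&gt; operation fail leaves a &lt;code&gt;ONE&lt;/code&gt; operation available, and the overlap check shows what that availability costs: with one replica per read and per write, nothing guarantees that a read touches a replica holding the latest write.&lt;/p&gt;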
&lt;p&gt;The practical insight from Brewer’s later work (CAP Twelve Years Later, 2012): most distributed systems are not purely CP or AP — they allow the tradeoff to be tuned per-operation. This is the more useful mental model.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;CP choice&lt;/th&gt;&lt;th&gt;AP choice&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Payment processing&lt;/td&gt;&lt;td&gt;Correct: cannot accept double-spend or lost payment&lt;/td&gt;&lt;td&gt;Dangerous: inconsistent state during partition&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;User session data&lt;/td&gt;&lt;td&gt;Usually unnecessary: stale session is acceptable&lt;/td&gt;&lt;td&gt;Correct: availability matters more than freshness&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Inventory count&lt;/td&gt;&lt;td&gt;Depends: over-selling may be acceptable; negative inventory is not&lt;/td&gt;&lt;td&gt;Risky without application-level conflict resolution&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Distributed counter&lt;/td&gt;&lt;td&gt;Expensive: a centralized counter pays coordination cost on every increment&lt;/td&gt;&lt;td&gt;Requires conflict resolution; use a CRDT counter&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Distributed databases make different choices during network partitions, and engineers must understand those choices before selecting a database for a use case — not after a partition happens in production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: For each data entity in your system, ask: during a 60-second network partition, is it acceptable for two nodes to return different answers? If no, you need CP semantics for that entity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: Run a partition test in staging — use &lt;code&gt;tc netem&lt;/code&gt; to drop packets between nodes — and observe whether your database returns errors (CP) or potentially stale data (AP).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Identify the one table in your system where a consistency failure would cause the most business harm, and verify that your database’s consistency configuration matches the requirement you assumed it had.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category><category>architecture</category></item><item><title>Caches, Queues, and Databases: When to Use Each</title><link>https://rajivonai.com/blog/2023-11-14-caches-queues-databases-when-to-use-each/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-11-14-caches-queues-databases-when-to-use-each/</guid><description>The decision framework for choosing between a cache, a queue, and a database — including the failure modes that appear when engineers use the wrong one for the job.</description><pubDate>Tue, 14 Nov 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A cache is not a database. A queue is not a cache. These three structures have different guarantees about durability, ordering, and access patterns — and using the wrong one for the job produces failure modes that are hard to diagnose because the system works correctly under normal load.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most production systems use all three: a relational database (PostgreSQL, MySQL) as the system of record, a cache (Redis, Memcached) for hot read paths, and a queue (Kafka, SQS, RabbitMQ) for asynchronous processing. Engineers frequently reach for a cache when they should use a queue, or use a database where a queue would serve better.&lt;/p&gt;
&lt;p&gt;The confusion is understandable: Redis can act as both a cache and a queue; PostgreSQL can be used as a queue with &lt;code&gt;SKIP LOCKED&lt;/code&gt;; and a log-based queue can replay events to rebuild state, which makes it look like a cache. But the operational guarantees differ, and those differences matter at failure time.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A system uses Redis as a work queue: tasks are pushed to a list, workers pop and process them. Under normal load, it works. During a Redis restart, all in-flight tasks are lost — because Redis’s default persistence does not guarantee durability across restarts, and “pop” removes the item before the worker confirms it processed successfully. The engineers chose a cache for a job that required queue semantics.&lt;/p&gt;
&lt;p&gt;What are the actual guarantees each structure provides, and when does each one break?&lt;/p&gt;
&lt;h2 id=&quot;the-decision-framework&quot;&gt;The Decision Framework&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use a cache when&lt;/strong&gt;: you need to accelerate reads of data that already exists in a durable store, and the cost of a cache miss is a slower read (not a lost operation). Caches are explicitly lossy by design — eviction, expiry, and cold restarts all produce misses. The system must work (slower) without the cache.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use a queue when&lt;/strong&gt;: you need work items to survive producer/consumer failures, be processed at least once (or effectively once, when consumers are idempotent), and be consumed in order or at a controlled rate. Queues guarantee delivery in the face of consumer failures: a message that is consumed but not acknowledged is redelivered. This is fundamentally different from a cache’s eviction behavior.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use a database when&lt;/strong&gt;: you need durable, queryable state with transactional consistency. Databases provide ACID guarantees, support complex queries, and allow multiple processes to read and write shared state correctly.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Cache:    READ-HEAVY, TOLERATE MISS, LOSSY OK&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Queue:    WRITE-ONCE, CONSUME-ONCE, DURABILITY REQUIRED&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Database: SHARED MUTABLE STATE, QUERYABLE, ACID REQUIRED&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL supports queue-like patterns with &lt;code&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Dequeue pattern using PostgreSQL as a job queue&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;BEGIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id, payload &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; job_queue&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; created_at&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 1&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FOR&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; UPDATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; SKIP&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; LOCKED;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After processing:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;UPDATE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; job_queue &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;done&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; id &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; $&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;COMMIT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives ACID guarantees for job dequeue: the row stays locked by &lt;code&gt;FOR UPDATE&lt;/code&gt; while the worker holds the transaction open, and if the worker crashes, the transaction rolls back, the lock is released, and the job becomes visible to the next worker. PostgreSQL works well as a job queue at low-to-moderate throughput (on the order of thousands of jobs/sec). Kafka or SQS are more appropriate for high-throughput, high-fan-out, or replay-required patterns.&lt;/p&gt;
&lt;p&gt;Redis used as a queue requires AOF persistence (&lt;code&gt;appendonly yes&lt;/code&gt;) and careful handling of the race between &lt;code&gt;RPOP&lt;/code&gt; and worker failure, typically by moving each item to a processing list with &lt;code&gt;LMOVE&lt;/code&gt; (&lt;code&gt;RPOPLPUSH&lt;/code&gt; before Redis 6.2) instead of popping it outright. Without these, messages are lost on crash. Redis Streams (&lt;code&gt;XADD&lt;/code&gt;, &lt;code&gt;XREADGROUP&lt;/code&gt;) provide consumer-group semantics with acknowledgment, which is closer to a proper queue, but they still lack the transactional guarantees of a relational database.&lt;/p&gt;
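&lt;p&gt;The race between popping an item and a worker crash can be shown with an in-memory model. This is a sketch only: it uses Python’s &lt;code&gt;collections.deque&lt;/code&gt; in place of Redis lists, the function names are hypothetical, and the second function imitates the processing-list pattern (Redis &lt;code&gt;LMOVE&lt;/code&gt;, or &lt;code&gt;RPOPLPUSH&lt;/code&gt; before 6.2), not any real client API.&lt;/p&gt;

```python
from collections import deque


def pop_then_process(queue: deque, worker_crashes: bool) -> None:
    """RPOP-style dequeue: the item leaves the queue before work is confirmed."""
    job = queue.pop()
    if worker_crashes:
        return  # the job existed only in the worker's memory and is now gone
    print("processed", job)


def move_then_ack(queue: deque, processing: deque, worker_crashes: bool) -> None:
    """Processing-list dequeue: the item is parked until the worker acknowledges it."""
    job = queue.pop()
    processing.appendleft(job)
    if worker_crashes:
        return  # the job is still in `processing` and can be requeued
    print("processed", job)
    processing.remove(job)  # acknowledgment after successful processing


def requeue_unacked(queue: deque, processing: deque) -> None:
    """Recovery step: return every unacknowledged job to the main queue."""
    while processing:
        queue.append(processing.pop())


lossy = deque(["job-1"])
pop_then_process(lossy, worker_crashes=True)
print("jobs left after lossy crash:", len(lossy))

safe, in_flight = deque(["job-1"]), deque()
move_then_ack(safe, in_flight, worker_crashes=True)
requeue_unacked(safe, in_flight)
print("jobs left after recovered crash:", len(safe))
```

&lt;p&gt;The design point is that acknowledgment is a separate step from removal. Any recovery process also needs a rule for deciding when an in-flight job is abandoned (for example a timeout), which this model leaves out.&lt;/p&gt;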
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Anti-pattern&lt;/th&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Correct tool&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Cache used as queue (Redis list + RPOP)&lt;/td&gt;&lt;td&gt;Items lost on crash or before worker acks&lt;/td&gt;&lt;td&gt;Proper queue (Kafka, SQS) or PostgreSQL with SKIP LOCKED&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Database used as message bus for high throughput&lt;/td&gt;&lt;td&gt;Lock contention and table bloat under load&lt;/td&gt;&lt;td&gt;Dedicated queue&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Queue used as state store&lt;/td&gt;&lt;td&gt;No queryability; ordering not preserved for concurrent consumers&lt;/td&gt;&lt;td&gt;Database&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cache without TTL on mutable data&lt;/td&gt;&lt;td&gt;Stale reads served indefinitely; no invalidation&lt;/td&gt;&lt;td&gt;Add TTL; or use cache-aside with explicit invalidation&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Using a cache for work items or a database for high-throughput messaging produces failure modes that only appear under load or during restarts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Apply the framework: durable work items require a queue; hot read acceleration requires a cache; shared mutable state with queries requires a database.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After switching from Redis list to PostgreSQL SKIP LOCKED or a proper queue, job loss during worker restarts disappears from your error monitoring.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Audit your current Redis usage today — identify any Redis list or set being used as a work queue, and verify that AOF persistence is enabled and that worker failures cannot lose items.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category><category>architecture</category></item><item><title>Cardinality Estimation: Why the Query Planner Gets It Wrong</title><link>https://rajivonai.com/blog/2023-09-12-cardinality-estimation-why-the-query-planner-gets-it-wrong/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-09-12-cardinality-estimation-why-the-query-planner-gets-it-wrong/</guid><description>How PostgreSQL estimates row counts, why those estimates are wrong for correlated columns and skewed distributions, and what engineers can do when the planner picks a bad plan.</description><pubDate>Tue, 12 Sep 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The query planner is a cost-based optimizer, and its cost estimates are only as good as its row count estimates. When the planner picks the wrong join strategy or uses the wrong index, the root cause is almost always a cardinality estimation error — not a missing index.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s query planner uses statistics — stored in &lt;code&gt;pg_statistic&lt;/code&gt; and surfaced via &lt;code&gt;pg_stats&lt;/code&gt; — to estimate how many rows each condition will match. These estimates drive the choice of join algorithm (hash join vs nested loop vs merge join), the order of joins, and the index selection decision. Bad estimates produce bad plans.&lt;/p&gt;
&lt;p&gt;The planner makes estimates using histograms, most-common-value lists, and correlation statistics collected by &lt;code&gt;ANALYZE&lt;/code&gt;. For a single table with a single condition, estimates are usually accurate. For multiple conditions on the same table, or joins across multiple tables, estimation errors compound.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A query joins three tables and filters on two columns in the same table. The query is slow. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; shows that the planner estimated 12 rows from one step but got back 450,000 rows, a roughly 37,500x underestimate. The hash join sized from that estimate was catastrophically undersized and spilled to disk.&lt;/p&gt;
&lt;p&gt;Why did the planner get it so wrong, and what can engineers actually do about it?&lt;/p&gt;
&lt;h2 id=&quot;how-estimation-fails&quot;&gt;How Estimation Fails&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Column correlation&lt;/strong&gt;: PostgreSQL’s default statistics assume predicate conditions on different columns are independent. If you filter &lt;code&gt;WHERE region = &apos;West&apos; AND product_category = &apos;Electronics&apos;&lt;/code&gt;, the planner multiplies the selectivity of each condition separately. If region and category are correlated (all Electronics orders come from West), the actual row count is much higher than the product of individual selectivities would suggest. This is the most common source of large estimation errors.&lt;/p&gt;
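&lt;p&gt;A worked example makes the size of the error concrete. The sketch below only reproduces the independence arithmetic; it is not the planner’s actual code, and the table size and selectivities are assumed values chosen for illustration.&lt;/p&gt;

```python
def independent_estimate(total_rows: int, selectivities: list[float]) -> float:
    """Row estimate when every predicate is treated as independent."""
    estimate = float(total_rows)
    for selectivity in selectivities:
        estimate *= selectivity
    return estimate


total_rows = 1_000_000   # assumed table size
sel_region = 0.10        # assumed: 10% of orders have region = 'West'
sel_category = 0.05      # assumed: 5% of orders are Electronics

planner_rows = independent_estimate(total_rows, [sel_region, sel_category])

# If every Electronics order comes from West, the category predicate implies
# the region predicate, so the true match count is just the Electronics rows.
actual_rows = total_rows * sel_category

print("independence estimate:", round(planner_rows))
print("actual rows:", round(actual_rows))
print("underestimate factor:", round(actual_rows / planner_rows))
```

&lt;p&gt;With fully correlated columns, the underestimate factor is the inverse of the redundant predicate’s selectivity, so the more selective the implied filter looks on its own, the worse the estimate becomes.&lt;/p&gt;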
&lt;p&gt;&lt;strong&gt;Stale statistics&lt;/strong&gt;: After bulk inserts, large updates, or schema changes, the statistics in &lt;code&gt;pg_statistic&lt;/code&gt; no longer reflect the actual data distribution. Autovacuum runs &lt;code&gt;ANALYZE&lt;/code&gt; automatically, but if writes are faster than autovacuum can keep up, the statistics become stale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skewed distributions&lt;/strong&gt;: The statistics target (default: 100) caps both the number of histogram buckets and the number of entries in the most-common-values list for each column. If a value appears in 40% of rows, the most-common-values list captures it well. But if values are extremely skewed, for example 0.001% of rows match a specific condition, that value is unlikely to appear in the sample or the list, and the histogram bucket resolution may be too coarse to estimate it accurately.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check statistics freshness&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; relname, last_analyze, last_autoanalyze, n_mod_since_analyze&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_user_tables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_mod_since_analyze &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;&gt;&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; n_mod_since_analyze &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- View column statistics&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; attname, n_distinct, correlation, most_common_vals, most_common_freqs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Force fresh statistics&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Increase statistics target for a skewed column&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; TABLE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ALTER&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; COLUMN region &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SET&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATISTICS&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 500&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented PostgreSQL fix for correlated column estimation errors is extended statistics, available since PostgreSQL 10:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Create extended statistics for correlated columns&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; STATISTICS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders_region_category &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; region, product_category &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;ANALYZE orders;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Verify the stats object exists&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; stxname, stxkeys, stxkind &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_statistic_ext;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Extended statistics teach the planner that &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;product_category&lt;/code&gt; are correlated, allowing it to estimate multi-column conditions accurately. Without extended statistics, the independence assumption produces systematically wrong estimates for correlated columns.&lt;/p&gt;
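&lt;p&gt;One way to confirm the new statistics are being used is to compare the planner’s estimate with the real count for a two-column filter. This is only a sketch: the literal values &lt;code&gt;&apos;EU&apos;&lt;/code&gt; and &lt;code&gt;&apos;books&apos;&lt;/code&gt; are placeholders, not values from a real dataset.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Hypothetical filter values; substitute a pair that is correlated in your data&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;EXPLAIN (ANALYZE)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT count(*) FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE region = &apos;EU&apos; AND product_category = &apos;books&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Compare rows= (estimate) with actual rows= on the scan node beneath the Aggregate&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it before and after &lt;code&gt;CREATE STATISTICS&lt;/code&gt; plus &lt;code&gt;ANALYZE&lt;/code&gt;; the estimate on the scan node should move closer to the actual count.&lt;/p&gt;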
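&lt;p&gt;The &lt;code&gt;default_statistics_target&lt;/code&gt; parameter (default: 100) sets the maximum number of most-common values and histogram buckets that &lt;code&gt;ANALYZE&lt;/code&gt; collects per column. Raising the target to 500 for columns with highly skewed distributions, either globally or per column with &lt;code&gt;ALTER TABLE ... ALTER COLUMN ... SET STATISTICS&lt;/code&gt;, improves estimation accuracy at the cost of slower &lt;code&gt;ANALYZE&lt;/code&gt; runs and slightly more planning time.&lt;/p&gt;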
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Estimation failure&lt;/th&gt;&lt;th&gt;Symptom in EXPLAIN ANALYZE&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Correlated columns&lt;/td&gt;&lt;td&gt;&lt;code&gt;rows=5 actual rows=200000&lt;/code&gt; on multi-column filter&lt;/td&gt;&lt;td&gt;Create extended statistics on the correlated columns&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Stale statistics&lt;/td&gt;&lt;td&gt;&lt;code&gt;rows=1000 actual rows=9000000&lt;/code&gt; after bulk load&lt;/td&gt;&lt;td&gt;Run &lt;code&gt;ANALYZE&lt;/code&gt; manually; tune autovacuum for high-write tables&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Skewed distribution&lt;/td&gt;&lt;td&gt;Planner ignores partial index that should be selective&lt;/td&gt;&lt;td&gt;Raise the column’s statistics target with &lt;code&gt;ALTER TABLE ... ALTER COLUMN ... SET STATISTICS&lt;/code&gt;, then re-run &lt;code&gt;ANALYZE&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Join order wrong&lt;/td&gt;&lt;td&gt;Outer join processes more rows than inner&lt;/td&gt;&lt;td&gt;&lt;code&gt;SET join_collapse_limit = 1&lt;/code&gt; and reorder joins manually to test&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Cardinality estimation errors cause the planner to pick wrong join strategies and wrong indexes, and the errors are invisible without reading &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output carefully.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Compare estimated vs actual row counts in &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; — any 10x divergence is a signal to investigate statistics quality.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After adding extended statistics on correlated columns, re-run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; — the estimated rows should match actual rows within a factor of 2–3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Find your slowest query, run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;, and find the node where the estimated row count diverges most from the actual row count. That node is where the plan went wrong.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>Index Selectivity: Why Cardinality Changes Everything</title><link>https://rajivonai.com/blog/2023-07-11-index-selectivity-why-cardinality-changes-everything/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-07-11-index-selectivity-why-cardinality-changes-everything/</guid><description>Why a low-cardinality index is often worse than no index, how the query planner uses selectivity estimates, and when to build a partial index instead.</description><pubDate>Tue, 11 Jul 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;An index on a boolean column does not help. An index on a status column with three values probably does not help either. Index selectivity — how many distinct values a column has relative to the total row count — determines whether the planner will choose the index or ignore it entirely.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Database engineers add indexes to slow queries by instinct — the query filters on &lt;code&gt;status&lt;/code&gt;, so create an index on &lt;code&gt;status&lt;/code&gt;. When the index does not improve performance or is ignored by the planner, the engineer is confused. The planner is not wrong. A low-selectivity index is genuinely worse than a sequential scan for most queries, and the planner knows it.&lt;/p&gt;
&lt;p&gt;Selectivity is the fraction of rows a condition matches. A condition that matches 1% of rows has high selectivity (the index is useful). A condition that matches 60% of rows has low selectivity (a sequential scan is likely faster).&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A table has 10 million orders. Engineers add an index on &lt;code&gt;status&lt;/code&gt; to speed up a query filtering for &lt;code&gt;status = &apos;pending&apos;&lt;/code&gt;. The query uses the index in development (where the table has 1,000 rows and 200 are pending). In production (where 7 million of 10 million orders are pending), the query ignores the index and does a sequential scan. The planner is right both times.&lt;/p&gt;
&lt;p&gt;How does the planner decide whether an index is worth using, and when is a low-cardinality index harmful?&lt;/p&gt;
&lt;h2 id=&quot;selectivity-and-the-cost-model&quot;&gt;Selectivity and the Cost Model&lt;/h2&gt;
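&lt;p&gt;To a first approximation, the planner estimates the cost of an index scan as (rows matched by the condition) × (random page read cost). The real formula also accounts for per-tuple CPU cost and for the column’s physical correlation, but the page-read term dominates when many rows match. If the matched row count is large, random reads add up quickly. Sequential scans read data in order and benefit from operating system read-ahead; random index lookups do not.&lt;/p&gt;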
&lt;p&gt;For &lt;code&gt;status = &apos;pending&apos;&lt;/code&gt; on a table where 70% of rows are pending:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Estimated index scan cost: 7,000,000 × 4 (random_page_cost) = 28,000,000 cost units&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Estimated seq scan cost:   table_pages × 1 (seq_page_cost)  ≈ 50,000 cost units&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The sequential scan wins by a large margin. Adding the index did not slow the query — but it did add write overhead and storage cost for zero benefit.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check distinct values and cardinality for a column&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;as&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; row_count,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;       round&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 100&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;0&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; /&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; sum&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;count&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;(&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;*&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;over&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; (), &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;as&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pct&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;GROUP BY&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; row_count &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- What statistics does the planner have?&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; attname, n_distinct, correlation&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; tablename &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;orders&apos;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; AND&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; attname &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;status&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
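&lt;p&gt;&lt;code&gt;n_distinct = 3&lt;/code&gt; means the planner estimates 3 distinct status values. Spread evenly across 10 million rows, that would be ~3.3 million rows per value, far too many for an index to help. The planner does not assume an even spread, though: it reads per-value frequencies from the most-common-values list, so it knows that &lt;code&gt;pending&lt;/code&gt; matches about 70% of this table. An index on &lt;code&gt;status&lt;/code&gt; only pays off for a value that matches a small fraction of rows.&lt;/p&gt;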
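&lt;p&gt;To see how rows are distributed across those values from the planner’s point of view, the same &lt;code&gt;pg_stats&lt;/code&gt; view exposes the most-common-values list. A minimal sketch, reusing the &lt;code&gt;orders&lt;/code&gt; table from the example above:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Per-value frequencies the planner uses for equality estimates&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT most_common_vals, most_common_freqs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_stats&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE tablename = &apos;orders&apos; AND attname = &apos;status&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each entry in &lt;code&gt;most_common_freqs&lt;/code&gt; is the fraction of rows the planner expects for the value at the same position in &lt;code&gt;most_common_vals&lt;/code&gt;.&lt;/p&gt;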
&lt;h2 id=&quot;when-low-cardinality-indexes-work&quot;&gt;When Low-Cardinality Indexes Work&lt;/h2&gt;
&lt;p&gt;A partial index solves this by indexing only the rare values that are actually selective:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Instead of a full index on status:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;CREATE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; INDEX&lt;/span&gt;&lt;span style=&quot;color:#B392F0&quot;&gt; idx_orders_pending&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; ON&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders (created_at)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; status&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;pending&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
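&lt;p&gt;The partial index pays off in the opposite distribution. If only 0.5% of orders are pending at any given time, this partial index covers a small fraction of rows and is highly selective. The planner can use it for queries whose &lt;code&gt;WHERE&lt;/code&gt; clause includes &lt;code&gt;status = &apos;pending&apos;&lt;/code&gt;, and because it is keyed on &lt;code&gt;created_at&lt;/code&gt; it also serves range filters and ordering on that column. It is smaller and cheaper to maintain than a full index on &lt;code&gt;status&lt;/code&gt;.&lt;/p&gt;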
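&lt;p&gt;A quick way to check whether a query can use the partial index is to run &lt;code&gt;EXPLAIN&lt;/code&gt; on it. The sketch below assumes the &lt;code&gt;idx_orders_pending&lt;/code&gt; index above exists; the 7-day window is an arbitrary example.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- The WHERE clause must imply the index predicate (status = &apos;pending&apos;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;EXPLAIN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT id, created_at FROM orders&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE status = &apos;pending&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  AND created_at &gt; now() - interval &apos;7 days&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the planner judges the index cheaper than a sequential scan, the plan names &lt;code&gt;idx_orders_pending&lt;/code&gt; in an index or bitmap scan node.&lt;/p&gt;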
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s documented statistics collection (&lt;code&gt;ANALYZE&lt;/code&gt;) builds histograms and most-common-value lists for each column. The planner uses these to estimate how many rows a condition will return. When statistics are stale — because a table has had many inserts or updates since the last ANALYZE — estimates are wrong and the planner may make a bad choice. PostgreSQL’s autovacuum runs ANALYZE automatically, but on very high-write tables it may not keep up.&lt;/p&gt;
&lt;p&gt;The correlation value in &lt;code&gt;pg_stats&lt;/code&gt; ranges from −1 to +1 and measures how well the physical order of rows in the heap matches the sort order of the column. A correlation near +1 or −1 means the column’s values are physically ordered and index range scans are efficient; a correlation near 0 means index scans require many random reads.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Problem&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Index on low-cardinality column&lt;/td&gt;&lt;td&gt;Planner ignores the index; write overhead remains&lt;/td&gt;&lt;td&gt;Drop index; use partial index on the rare, selective values&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Stale statistics on skewed data&lt;/td&gt;&lt;td&gt;Planner underestimates matching rows; bad plan&lt;/td&gt;&lt;td&gt;Run &lt;code&gt;ANALYZE&lt;/code&gt; manually; tune &lt;code&gt;default_statistics_target&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Indexed column has low physical correlation&lt;/td&gt;&lt;td&gt;Index used but causes excessive random I/O&lt;/td&gt;&lt;td&gt;Run &lt;code&gt;CLUSTER&lt;/code&gt; on the table (a one-time rewrite that holds an exclusive lock); or accept the random I/O as the cost of index use&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Low-cardinality indexes add write overhead and storage cost without improving read performance for queries that match a large fraction of rows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Check &lt;code&gt;pg_stats.n_distinct&lt;/code&gt; before creating an index; for low-cardinality columns, consider a partial index on the selective values only.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: A partial index on pending orders will appear in &lt;code&gt;EXPLAIN&lt;/code&gt; output for &lt;code&gt;WHERE status = &apos;pending&apos;&lt;/code&gt; queries and be ignored for &lt;code&gt;WHERE status = &apos;shipped&apos;&lt;/code&gt; queries — exactly the right selectivity-aware behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes ORDER BY idx_scan ASC LIMIT 20;&lt;/code&gt; today and find your least-used indexes, which are candidates for review or removal.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>Reading a Query Plan Without Getting Lost</title><link>https://rajivonai.com/blog/2023-05-09-reading-a-query-plan/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-05-09-reading-a-query-plan/</guid><description>How to read PostgreSQL EXPLAIN output, what seq scan vs index scan actually means in practice, and the three numbers that matter most in any query plan.</description><pubDate>Tue, 09 May 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The query plan is the database’s answer to a question you did not explicitly ask: given the data distribution I know about and the resources available, what is the cheapest path to your result? Reading that answer correctly means knowing which nodes cost the most, not which nodes appear first.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s &lt;code&gt;EXPLAIN&lt;/code&gt; and &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; are the primary tools for diagnosing slow queries. Every engineer who works with databases reads query plans eventually. Most read them wrong — scanning from top to bottom, treating the first node as the first operation, and ignoring the difference between estimated and actual row counts.&lt;/p&gt;
&lt;p&gt;The plan is a tree. Execution starts at the leaf nodes (innermost indentation) and flows up toward the root. The root node produces the final output.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A query is slower than expected. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; shows a plan with a Seq Scan, an Index Scan, a Hash Join, and a Sort. Which node is the problem? Without understanding how to read the plan, the engineer focuses on the Seq Scan — which may be entirely appropriate for a small table — while missing the Hash Join that is processing 10 million rows due to a bad row count estimate.&lt;/p&gt;
&lt;p&gt;What are the three numbers that matter in every query plan, and how do you use them to find the slow node?&lt;/p&gt;
&lt;h2 id=&quot;the-three-numbers&quot;&gt;The Three Numbers&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;1. Rows (estimated vs actual)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Every node in the plan shows &lt;code&gt;rows=N&lt;/code&gt; in the EXPLAIN output and, after ANALYZE, the actual row count alongside it. When these diverge significantly, the query planner made a bad estimate — which usually means a subsequent join or aggregation was sized incorrectly, causing it to use the wrong strategy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Cost&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The cost is expressed as &lt;code&gt;cost=startup..total&lt;/code&gt; where both numbers are in abstract “cost units” (proportional to disk page reads). The startup cost is the cost before the first row is returned; the total cost is the cost to return all rows. A node’s total cost includes the cost of its children, so subtract the children’s totals to see how much cost each node adds on its own.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Actual time (from ANALYZE)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;actual time=startup..total&lt;/code&gt; in milliseconds, reported per loop: multiply by &lt;code&gt;loops&lt;/code&gt; to get the node’s full contribution. Like cost, it includes the time spent in child nodes. This is the real measurement. A node with a high estimated cost but a low actual time is fine. A node with a low estimated cost but a high actual time indicates a bad estimate or a resource problem (I/O, locking, network).&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Always use ANALYZE BUFFERS for real diagnosis&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;EXPLAIN (ANALYZE, BUFFERS, FORMAT &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;TEXT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; o&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;id&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;o&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;status&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, &lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;name&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; orders o&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;JOIN&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; customers c &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ON&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; o&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;customer_id&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; c&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; o&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;.&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt;created_at&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; &gt;&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; interval &lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt;&apos;30 days&apos;&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
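&lt;p&gt;The &lt;code&gt;BUFFERS&lt;/code&gt; option shows how many pages each node found in shared buffers (&lt;code&gt;hit&lt;/code&gt;) versus had to read in (&lt;code&gt;read&lt;/code&gt;). A node with &lt;code&gt;shared read=10000&lt;/code&gt; and &lt;code&gt;shared hit=0&lt;/code&gt; found nothing in PostgreSQL’s own cache and fetched every page from the operating system, whether from its page cache or from disk. That is a cache miss problem, not an index problem.&lt;/p&gt;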
&lt;h2 id=&quot;reading-the-plan&quot;&gt;Reading the Plan&lt;/h2&gt;
&lt;p&gt;In the plan output, each node shows its operation (Seq Scan, Index Scan, Hash Join, Sort, etc.) and its target. Read from the most-indented line outward:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;Hash Join  (cost=1200..5600 rows=4500 width=48) (actual time=45.2..89.3 rows=4312 loops=1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  -&gt;  Seq Scan on customers c  (cost=0..350 rows=12000 width=24) (actual time=0.1..8.2 rows=12000 loops=1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  -&gt;  Hash  (cost=900..900 rows=24000 width=24) (actual time=38.1..38.1 rows=23890 loops=1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;        -&gt;  Index Scan using orders_created_at_idx on orders o  (actual time=0.2..22.4 rows=23890 loops=1)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
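&lt;p&gt;Execution starts with the most-indented node: the &lt;code&gt;Index Scan on orders&lt;/code&gt; runs first, and the &lt;code&gt;Hash&lt;/code&gt; node loads its 23,890 rows into an in-memory hash table. The &lt;code&gt;Seq Scan on customers&lt;/code&gt; then streams its 12,000 rows, and the &lt;code&gt;Hash Join&lt;/code&gt; probes each one against the hash table to produce the result. Actual times are cumulative, so the Hash node’s 38.1 ms includes the 22.4 ms spent in the Index Scan beneath it. After subtracting children, the Index Scan (22.4 ms) and the join itself (89.3 − 38.1 − 8.2 ≈ 43 ms) account for most of the runtime; the Seq Scan on customers (8.2 ms) is cheap because the table is small and every row is needed.&lt;/p&gt;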
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s query planner documentation describes the cost model as based on sequential page reads (cost unit ≈ 1 seq page read) with random reads costing &lt;code&gt;random_page_cost&lt;/code&gt; times more (default: 4). An SSD changes this ratio significantly — &lt;code&gt;random_page_cost = 1.1&lt;/code&gt; is appropriate for SSDs and often causes the planner to prefer index scans that it would otherwise avoid.&lt;/p&gt;
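&lt;p&gt;The effect of a lower &lt;code&gt;random_page_cost&lt;/code&gt; can be tried in a single session before changing any server configuration. A minimal sketch, reusing the &lt;code&gt;orders&lt;/code&gt; query from earlier; the value 1.1 is the commonly suggested SSD setting, not a measured one:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Session-only change: lasts until RESET or disconnect&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SET random_page_cost = 1.1;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;EXPLAIN&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT o.id FROM orders o&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE o.created_at &gt; now() - interval &apos;30 days&apos;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Restore the server default for this session&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;RESET random_page_cost;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the plan changes for the better, the setting can be persisted, for example with &lt;code&gt;ALTER SYSTEM SET random_page_cost = 1.1&lt;/code&gt; followed by &lt;code&gt;SELECT pg_reload_conf()&lt;/code&gt;.&lt;/p&gt;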
&lt;p&gt;A common signal for a missing index: a &lt;code&gt;Seq Scan&lt;/code&gt; on a large table with a &lt;code&gt;Filter: (condition)&lt;/code&gt; line and a &lt;code&gt;Rows Removed by Filter&lt;/code&gt; count far larger than the number of rows the node returns. The database is scanning the whole table to find a few rows, which makes the filter column a clear candidate for an index.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Plan symptom&lt;/th&gt;&lt;th&gt;What it means&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;rows=1 actual rows=50000&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Severe row count underestimate; bad join strategy&lt;/td&gt;&lt;td&gt;&lt;code&gt;ANALYZE&lt;/code&gt; the table; check for stale statistics&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Seq Scan&lt;/code&gt; on large table with filter&lt;/td&gt;&lt;td&gt;No index on filter column, or index not used&lt;/td&gt;&lt;td&gt;Create index; or lower &lt;code&gt;random_page_cost&lt;/code&gt; for SSD&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Sort Method: external merge&lt;/code&gt; with a &lt;code&gt;Disk:&lt;/code&gt; figure&lt;/td&gt;&lt;td&gt;Sort spilled to disk; &lt;code&gt;work_mem&lt;/code&gt; too small&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;work_mem&lt;/code&gt; per session for large queries&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;Nested Loop&lt;/code&gt; with millions of rows&lt;/td&gt;&lt;td&gt;Planner underestimated join size&lt;/td&gt;&lt;td&gt;Force join strategy with &lt;code&gt;SET enable_nestloop = off&lt;/code&gt; for testing&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Slow queries cannot be diagnosed without reading the plan, and most plans are misread because engineers focus on node type rather than actual time and row estimate accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Always use &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; for slow query diagnosis; find the node with the highest actual time; check if actual rows match estimated rows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After running EXPLAIN ANALYZE on your five slowest queries, at least one will show a row count divergence that explains the poor plan choice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Take your slowest query today and run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)&lt;/code&gt; — find the node where actual rows diverges most from estimated rows, then run &lt;code&gt;ANALYZE table_name&lt;/code&gt; on the relevant table.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>Connection Pooling Explained</title><link>https://rajivonai.com/blog/2023-03-14-connection-pooling-explained/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-03-14-connection-pooling-explained/</guid><description>Why PostgreSQL connections are expensive, what a connection pool actually does, and the difference between session mode, transaction mode, and statement mode in PgBouncer.</description><pubDate>Tue, 14 Mar 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Every PostgreSQL connection spawns a process, allocates memory, and holds shared resources. A web application that opens a connection per request is not slow because of network latency — it is slow because it is paying the cost of process creation on every HTTP request. Connection pooling solves this, but the mode you choose changes what SQL you can run.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL uses a process-per-connection model. Each client connection forks a backend process that consumes 5–10MB of memory for its own stack, buffers, and per-session state. On a server with 8GB of RAM dedicated to PostgreSQL, this limits you to roughly 800 concurrent connections before memory pressure begins — and most production systems become resource-constrained well before that.&lt;/p&gt;
&lt;p&gt;Web applications under load open and close connections constantly. At 500 requests per second, establishing a new PostgreSQL connection for each request adds 1–10ms of connection setup time per request — a latency floor that cannot be optimized away without pooling.&lt;/p&gt;
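&lt;p&gt;To see how close a server is to its connection ceiling, the configured limit can be compared with what is currently connected. A minimal sketch; the &lt;code&gt;backend_type&lt;/code&gt; filter assumes PostgreSQL 10 or later:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Configured ceiling&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SHOW max_connections;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;-- Current client connections grouped by state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;SELECT state, count(*) AS connections&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;FROM pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;WHERE backend_type = &apos;client backend&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;GROUP BY state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;ORDER BY connections DESC;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A large share of &lt;code&gt;idle&lt;/code&gt; connections suggests the application is holding backends it is not using, which is the case pooling helps most.&lt;/p&gt;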
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A production database receiving connection errors under load is often not at its query processing limit — it is at its connection count limit. The fix is not always “increase &lt;code&gt;max_connections&lt;/code&gt;” because that consumes more memory and can destabilize the database. The correct fix is a connection pool between the application and the database.&lt;/p&gt;
&lt;p&gt;What does a connection pool actually do, and why does the pooling mode matter?&lt;/p&gt;
&lt;h2 id=&quot;what-a-pool-does&quot;&gt;What a Pool Does&lt;/h2&gt;
&lt;p&gt;A connection pool maintains a set of long-lived PostgreSQL connections and lends them to application requests. The application connects to the pool (which is fast — TCP to a local process), and the pool forwards queries over an existing backend connection. When the application is done, the connection returns to the pool rather than being closed.&lt;/p&gt;
&lt;p&gt;PgBouncer is the standard choice for PostgreSQL. It operates in three modes that differ in when the connection is returned to the pool:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Session mode&lt;/strong&gt;: the backend connection is held for the entire application session. Equivalent to a direct connection — no query-level multiplexing. Useful for applications that rely on session-level state (&lt;code&gt;SET&lt;/code&gt;, &lt;code&gt;LISTEN&lt;/code&gt;, prepared statements that persist across transactions).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transaction mode&lt;/strong&gt;: the backend connection is returned to the pool after each transaction. One backend connection can serve multiple application sessions sequentially. Most OLTP applications work in this mode.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Statement mode&lt;/strong&gt;: the backend connection is returned after each individual statement. Incompatible with multi-statement transactions. Rarely used.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;# PgBouncer config (pgbouncer.ini)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;[databases]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;mydb = host=127.0.0.1 port=5432 dbname=mydb&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;[pgbouncer]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;pool_mode = transaction&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;max_client_conn = 1000&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;default_pool_size = 25&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;min_pool_size = 5&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;server_idle_timeout = 600&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this config, up to 1,000 client connections share at most 25 backend connections per database/user pair, in transaction mode.&lt;/p&gt;
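&lt;p&gt;A rough way to sanity-check a pool size is Little’s law: the average number of busy backend connections is roughly transactions per second times average transaction duration. The sketch below is illustrative arithmetic only; the function names and the workload numbers are assumptions, not PgBouncer settings or measurements.&lt;/p&gt;

```python
def backends_needed(tx_per_second, avg_tx_seconds):
    """Little's law: average number of backends busy at the same time."""
    return tx_per_second * avg_tx_seconds


def max_throughput(pool_size, avg_tx_seconds):
    """Upper bound on transactions/second a pool of this size can sustain."""
    return pool_size / avg_tx_seconds


# Hypothetical workload: 2,000 transactions/s at 5 ms per transaction.
print(round(backends_needed(2000, 0.005)))   # backends busy on average
print(round(max_throughput(25, 0.005)))      # ceiling with default_pool_size = 25
```

&lt;p&gt;If the average number of busy backends sits close to the pool size, clients start queueing inside the pooler, which is when long-running transactions become the first thing to look for.&lt;/p&gt;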
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PgBouncer’s documented transaction mode limitation is that session-level PostgreSQL features break: prepared statements created with SQL-level &lt;code&gt;PREPARE&lt;/code&gt;, session-level advisory locks, session-level &lt;code&gt;SET&lt;/code&gt;, and &lt;code&gt;LISTEN&lt;/code&gt;. (&lt;code&gt;SET LOCAL&lt;/code&gt; is safe because it only persists for the current transaction, and &lt;code&gt;NOTIFY&lt;/code&gt; works because it needs no session state.) Applications that use &lt;code&gt;SET search_path&lt;/code&gt; outside a transaction will find their setting lost, or leaked to another client, when the backend connection is returned to the pool. These are documented constraints, not bugs: transaction-mode pooling cannot preserve session state between pool handoffs.&lt;/p&gt;
&lt;p&gt;The common production pattern for applications using an ORM: switch from session mode to transaction mode, then fix the resulting errors one by one. The errors typically involve prepared statement handling (some ORMs cache prepared statements per connection) and search path assumptions.&lt;/p&gt;
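&lt;p&gt;Why session state gets lost can be shown with a toy model. This is a minimal sketch, not PgBouncer’s actual scheduling logic: the class names, the first-in-first-out backend choice, and the &lt;code&gt;tenant_a&lt;/code&gt; value are all assumptions made for illustration.&lt;/p&gt;

```python
from collections import deque


class Backend:
    def __init__(self, name):
        self.name = name
        self.session = {"search_path": "public"}  # per-connection session state


class ToyTransactionPool:
    """Toy model of transaction pooling: borrow a backend per transaction."""

    def __init__(self, size):
        self.idle = deque(Backend(f"backend-{i}") for i in range(size))

    def run(self, transaction):
        backend = self.idle.popleft()      # borrow for exactly one transaction
        try:
            return transaction(backend)
        finally:
            self.idle.append(backend)      # hand back at transaction end


def set_path(backend):
    backend.session["search_path"] = "tenant_a"   # like a session-level SET
    return backend.name


def read_path(backend):
    return backend.name, backend.session["search_path"]


pool = ToyTransactionPool(size=2)
print(pool.run(set_path))    # backend-0
print(pool.run(read_path))   # ('backend-1', 'public')
```

&lt;p&gt;The second transaction lands on a different backend, so the setting made by the first is not visible. A third transaction would land back on &lt;code&gt;backend-0&lt;/code&gt; and inherit a setting it never asked for, which is the leak side of the same problem.&lt;/p&gt;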
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure&lt;/th&gt;&lt;th&gt;Cause&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;ERROR: prepared statement does not exist&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Prepared statement created in a previous transaction on a now-different backend&lt;/td&gt;&lt;td&gt;Disable prepared statements in the ORM; or use session mode&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Advisory lock released unexpectedly&lt;/td&gt;&lt;td&gt;Session-level advisory lock held on a backend that was returned to the pool&lt;/td&gt;&lt;td&gt;Use transaction-scoped advisory locks or session mode&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;SET&lt;/code&gt; variables lost between queries&lt;/td&gt;&lt;td&gt;Session state not preserved across pool handoffs&lt;/td&gt;&lt;td&gt;Use &lt;code&gt;SET LOCAL&lt;/code&gt; inside the transaction; or use session mode for that use case&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Pool exhausted under load&lt;/td&gt;&lt;td&gt;&lt;code&gt;default_pool_size&lt;/code&gt; too small&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;default_pool_size&lt;/code&gt;; also check for long-running transactions that delay returning connections to the pool&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Applications that open a PostgreSQL connection per request pay process-creation cost on every request and hit &lt;code&gt;max_connections&lt;/code&gt; under load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Put PgBouncer in front of PostgreSQL in transaction mode; set &lt;code&gt;default_pool_size&lt;/code&gt; to 20–50 depending on core count and query duration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After adding PgBouncer, &lt;code&gt;SELECT count(*) FROM pg_stat_activity&lt;/code&gt; should show a stable, small number of backend connections even under peak load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;SELECT count(*), state FROM pg_stat_activity GROUP BY state;&lt;/code&gt; today — if &lt;code&gt;idle&lt;/code&gt; connections exceed 20% of &lt;code&gt;max_connections&lt;/code&gt;, you are holding connections open unnecessarily and a pool would immediately free that capacity.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>Replication Lag Explained</title><link>https://rajivonai.com/blog/2023-01-10-replication-lag-explained/</link><guid isPermaLink="true">https://rajivonai.com/blog/2023-01-10-replication-lag-explained/</guid><description>What replication lag actually measures in PostgreSQL, the three distinct lag components that most monitoring tools conflate, and which one matters for your RPO.</description><pubDate>Tue, 10 Jan 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Replication lag is not one number — it is three. Write lag, flush lag, and replay lag measure different things, fail in different ways, and require different interventions. Monitoring only total lag means you cannot tell whether the standby is slow to receive, slow to confirm, or slow to apply.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s &lt;code&gt;pg_stat_replication&lt;/code&gt; view exposes three lag components for each connected standby: &lt;code&gt;write_lag&lt;/code&gt;, &lt;code&gt;flush_lag&lt;/code&gt;, and &lt;code&gt;replay_lag&lt;/code&gt;. Most monitoring systems expose only the largest — typically &lt;code&gt;replay_lag&lt;/code&gt; — and alert on it as a single number. That number is correct but incomplete.&lt;/p&gt;
&lt;p&gt;Replication lag is the delay between a change being committed on the primary and being available on the standby. But “available” means different things depending on what you are protecting against.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;An alert fires: replication lag on the standby has reached 45 seconds. The on-call engineer does not know: is the primary sending WAL slowly? Is the standby receiving but not flushing? Is the standby flushing but not replaying? Each has a different root cause and a different fix. Without understanding the three components, you cannot triage the alert correctly.&lt;/p&gt;
&lt;p&gt;What do the three lag components actually measure, and which one is relevant to your RPO?&lt;/p&gt;
&lt;h2 id=&quot;the-three-components&quot;&gt;The Three Components&lt;/h2&gt;
&lt;p&gt;PostgreSQL measures lag as the time between a change being committed on the primary and each stage completing on the standby:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Write lag&lt;/strong&gt;: time between commit on primary and the standby confirming it has written the WAL record out to its operating system (received, but not yet flushed to disk). This measures network latency and standby receive throughput.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Flush lag&lt;/strong&gt;: time between commit on primary and the standby confirming it has flushed the WAL record to disk. This measures the standby’s I/O performance for WAL writes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Replay lag&lt;/strong&gt;: time between commit on primary and the standby confirming it has applied the WAL record to its data files. This measures the standby’s ability to apply changes, which can fall behind under high write volume or when long-running queries on the standby conflict with WAL replay.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the primary: all three lag components per standby&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; application_name,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       write_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       flush_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       replay_lag,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;       state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       sync_state&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replay_lag &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; NULLS&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; LAST&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- On the standby: time since the last replayed transaction (also grows while the primary is idle)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_last_xact_replay_timestamp() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; replication_lag;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
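&lt;p&gt;For RPO purposes, &lt;code&gt;flush_lag&lt;/code&gt; is the closest measure of how much committed data could be lost if the primary fails right now: WAL the standby has already flushed survives the failure and is replayed before promotion completes. &lt;code&gt;replay_lag&lt;/code&gt; governs how stale reads on the standby are and how long promotion takes, which makes it a read-consistency and RTO concern.&lt;/p&gt;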
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;The documented PostgreSQL behavior for physical streaming replication is that &lt;code&gt;write_lag&lt;/code&gt; and &lt;code&gt;flush_lag&lt;/code&gt; are typically small (milliseconds in a well-connected environment) and &lt;code&gt;replay_lag&lt;/code&gt; is the dominant component. Replay lag grows when: the standby is I/O constrained applying data pages; the standby has long-running read queries that block recovery (hot standby conflict); or the primary is generating WAL faster than the standby can replay.&lt;/p&gt;
&lt;p&gt;With a synchronous standby configured, &lt;code&gt;synchronous_commit = remote_apply&lt;/code&gt; makes the primary wait until the standby has replayed the commit before acknowledging it, at the cost of commit latency that includes the standby’s replay time. &lt;code&gt;synchronous_commit = remote_write&lt;/code&gt; waits only until the standby has written the WAL to its operating system, providing weaker durability guarantees but lower commit latency.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Lag component growing&lt;/th&gt;&lt;th&gt;Root cause&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Write lag&lt;/td&gt;&lt;td&gt;Network congestion or bandwidth saturation&lt;/td&gt;&lt;td&gt;Investigate network path; consider WAL compression&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Flush lag&lt;/td&gt;&lt;td&gt;Standby I/O pressure (disk writes slow)&lt;/td&gt;&lt;td&gt;Upgrade standby storage; move WAL to a faster device&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replay lag&lt;/td&gt;&lt;td&gt;Long-running queries on standby causing hot standby conflicts&lt;/td&gt;&lt;td&gt;Lower &lt;code&gt;max_standby_streaming_delay&lt;/code&gt; so conflicting queries are cancelled sooner&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;All three&lt;/td&gt;&lt;td&gt;Primary generating WAL faster than standby can process&lt;/td&gt;&lt;td&gt;Scale up the standby; reduce primary write throughput&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
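&lt;p&gt;The triage in the table can be sketched as a small function. It assumes the three values have already been read from &lt;code&gt;pg_stat_replication&lt;/code&gt; and converted to seconds, and that each lag includes the one before it, so the cost of a stage is the difference from the previous stage. The function name and stage labels are hypothetical.&lt;/p&gt;

```python
def dominant_lag_stage(write_lag, flush_lag, replay_lag):
    """Return the stage that contributes most to total lag (inputs in seconds)."""
    stages = {
        "network/receive": write_lag,
        "standby WAL flush": max(flush_lag - write_lag, 0.0),
        "replay": max(replay_lag - flush_lag, 0.0),
    }
    # Pick the stage with the largest individual contribution.
    return max(stages, key=stages.get)


print(dominant_lag_stage(0.02, 0.03, 45.0))   # replay
print(dominant_lag_stage(12.0, 12.5, 13.0))   # network/receive
```

&lt;p&gt;Attaching the dominant stage to the alert text tells the on-call engineer which row of the table to start from.&lt;/p&gt;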
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Monitoring a single lag number does not distinguish between a network problem, a standby I/O problem, and a replay conflict — three very different operational responses.&lt;/li&gt;
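&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Monitor all three components separately; alert on &lt;code&gt;flush_lag &gt; RPO_threshold&lt;/code&gt; for durability (WAL already flushed on the standby survives a primary failure); alert on &lt;code&gt;replay_lag&lt;/code&gt; for read staleness and failover time; alert on &lt;code&gt;flush_lag &gt; write_lag * 5&lt;/code&gt; to detect standby I/O problems specifically.&lt;/li&gt;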
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After adding per-component monitoring, lag spikes will clearly show which component is growing, cutting triage time from minutes to seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run the &lt;code&gt;pg_stat_replication&lt;/code&gt; query above right now on your primary and capture the three lag values as your baseline — if you have never looked at them before, you likely do not know which component your standby’s lag comes from.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>Checkpoint and Flush: What Your Database Does Before It Can Rest</title><link>https://rajivonai.com/blog/2022-10-11-checkpoint-and-flush-what-your-database-does-before-it-can-rest/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-10-11-checkpoint-and-flush-what-your-database-does-before-it-can-rest/</guid><description>What a checkpoint actually does in PostgreSQL, why dirty page flush matters for recovery time, and what engineers should monitor to avoid checkpoint pressure.</description><pubDate>Tue, 11 Oct 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;A checkpoint is not a pause — it is the database settling its accounts. Everything written to the buffer cache since the last checkpoint must be flushed to disk so that crash recovery has a known starting point. Getting checkpoint timing wrong turns a 30-second restart into a 20-minute recovery.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;PostgreSQL and most other ACID databases use checkpoints to bound crash recovery time. Between checkpoints, the database accumulates dirty pages in the buffer cache — pages that have been modified in memory but not yet written to their data files on disk. At a checkpoint, all dirty pages are flushed.&lt;/p&gt;
&lt;p&gt;After a crash, the database only needs to replay WAL records that were written after the last successful checkpoint. If checkpoints are frequent, less WAL needs to be replayed. If checkpoints are infrequent, recovery takes longer.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Engineers often observe I/O spikes on their database hosts that correlate with checkpoint activity and assume something is wrong. The database is not misbehaving — it is doing its job. But poorly tuned checkpoints create two distinct problems: if too frequent, the database constantly flushes dirty pages and saturates I/O; if too infrequent, crash recovery takes too long and dirty pages accumulate in the buffer cache past useful limits.&lt;/p&gt;
&lt;p&gt;What is actually happening during a checkpoint, and what parameters control it?&lt;/p&gt;
&lt;h2 id=&quot;what-a-checkpoint-does&quot;&gt;What a Checkpoint Does&lt;/h2&gt;
&lt;p&gt;When PostgreSQL triggers a checkpoint, it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Records the current WAL position as the checkpoint LSN.&lt;/li&gt;
&lt;li&gt;Identifies all dirty pages in the shared buffer cache.&lt;/li&gt;
&lt;li&gt;Writes those pages to their data files on disk, spread across the checkpoint interval.&lt;/li&gt;
&lt;li&gt;Flushes the WAL up to the checkpoint LSN.&lt;/li&gt;
&lt;li&gt;Updates &lt;code&gt;pg_control&lt;/code&gt; to record the checkpoint as complete.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The spreading is controlled by &lt;code&gt;checkpoint_completion_target&lt;/code&gt; (default: 0.9 since PostgreSQL 14, 0.5 before), which at 0.9 tells PostgreSQL to spread dirty page writes over 90% of the checkpoint interval. This prevents a large I/O burst at the start of each checkpoint.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Checkpoint activity since the last stats reset (PostgreSQL 16 and earlier; 17+ moved the checkpoint counters to pg_stat_checkpointer)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; checkpoints_timed, checkpoints_req,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       buffers_checkpoint, buffers_clean, buffers_backend,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;       checkpoint_write_time, checkpoint_sync_time&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_bgwriter;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- checkpoints_req being high means checkpoints are being forced by WAL volume,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- not by time — usually means max_wal_size is too small&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;checkpoints_req&lt;/code&gt; being significantly higher than &lt;code&gt;checkpoints_timed&lt;/code&gt; is a signal that &lt;code&gt;max_wal_size&lt;/code&gt; is too small and the database is triggering emergency checkpoints to prevent WAL from exceeding the limit.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s documented guidance is that &lt;code&gt;checkpoint_timeout&lt;/code&gt; should be long enough that checkpoint I/O does not saturate the storage system, but short enough that recovery after a crash completes within the acceptable window. The relationship: worst-case WAL to replay ≈ &lt;code&gt;checkpoint_timeout&lt;/code&gt; × WAL generation rate, and recovery time scales with that volume. For a database writing 500MB/min of WAL with a 10-minute checkpoint timeout, recovery could replay up to 5GB of WAL.&lt;/p&gt;
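&lt;p&gt;The arithmetic behind that example, as a minimal sketch. The replay rate of 1,000 MB/min is an assumed number for illustration; the real rate depends on the hardware and workload and has to be measured, for example in a staging crash test.&lt;/p&gt;

```python
def wal_to_replay_mb(wal_mb_per_min, checkpoint_timeout_min):
    """Worst-case WAL volume accumulated over one checkpoint interval."""
    return wal_mb_per_min * checkpoint_timeout_min


def replay_minutes(wal_mb, replay_mb_per_min):
    """Time to replay that WAL at an assumed replay rate."""
    return wal_mb / replay_mb_per_min


wal_mb = wal_to_replay_mb(500, 10)
print(wal_mb)                         # 5000 MB, the 5GB from the example
print(replay_minutes(wal_mb, 1000))   # minutes, at the assumed replay rate
```

&lt;p&gt;Halving &lt;code&gt;checkpoint_timeout&lt;/code&gt; halves the worst-case volume, which is why the timeout is the first knob to check when recovery overruns its window.&lt;/p&gt;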
&lt;p&gt;&lt;code&gt;buffers_backend&lt;/code&gt; in &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; counts pages that were written directly by backend processes rather than by the checkpointer or the background writer. A high &lt;code&gt;buffers_backend&lt;/code&gt; count means the background writer is not keeping up with dirty page accumulation: backends are being forced to write out dirty pages themselves to free buffers. This creates latency spikes for application queries.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Symptom&lt;/th&gt;&lt;th&gt;Cause&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;I/O spike every N minutes&lt;/td&gt;&lt;td&gt;Checkpoint spreading not working; &lt;code&gt;checkpoint_completion_target&lt;/code&gt; too low&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;checkpoint_completion_target&lt;/code&gt; to 0.9&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code&gt;checkpoints_req&lt;/code&gt; high&lt;/td&gt;&lt;td&gt;WAL volume exceeds &lt;code&gt;max_wal_size&lt;/code&gt; limit&lt;/td&gt;&lt;td&gt;Increase &lt;code&gt;max_wal_size&lt;/code&gt;; or reduce write throughput&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;High &lt;code&gt;buffers_backend&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Background writer not keeping up&lt;/td&gt;&lt;td&gt;Tune &lt;code&gt;bgwriter_lru_maxpages&lt;/code&gt; and &lt;code&gt;bgwriter_delay&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Long crash recovery&lt;/td&gt;&lt;td&gt;Checkpoint interval too long&lt;/td&gt;&lt;td&gt;Reduce &lt;code&gt;checkpoint_timeout&lt;/code&gt; to 5 minutes&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Checkpoint timing that is either too aggressive or too infrequent creates I/O spikes or long recovery windows — both are preventable with correct parameter tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code&gt;checkpoint_timeout = 5min&lt;/code&gt;, &lt;code&gt;checkpoint_completion_target = 0.9&lt;/code&gt;, and &lt;code&gt;max_wal_size&lt;/code&gt; to a value that allows at least 2–3 checkpoint intervals of WAL accumulation without forcing early checkpoints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After tuning, &lt;code&gt;checkpoints_req&lt;/code&gt; should stop growing under normal load, and disk write activity should look smooth and gradual rather than spiking at each checkpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;&lt;/code&gt; today — if &lt;code&gt;checkpoints_req&lt;/code&gt; is more than 20% of &lt;code&gt;checkpoints_timed&lt;/code&gt;, your &lt;code&gt;max_wal_size&lt;/code&gt; is undersized.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>Redo vs Undo: How Databases Recover from Crashes</title><link>https://rajivonai.com/blog/2022-08-09-redo-vs-undo-how-databases-recover-from-crashes/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-08-09-redo-vs-undo-how-databases-recover-from-crashes/</guid><description>The two mechanisms databases use to survive crashes — redo brings committed changes forward, undo rolls back uncommitted ones — and why the distinction matters operationally.</description><pubDate>Tue, 09 Aug 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;When a database crashes mid-transaction, it has two problems: replay every committed change that did not make it to disk, and remove every uncommitted change that did. These are solved by redo and undo, and conflating them is how engineers misread crash recovery timelines.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Every ACID database must survive a crash and return to a consistent state. After a crash, some committed transactions may not have flushed their data pages to disk (they were in the buffer cache). Some uncommitted transactions may have partially written data pages. The recovery process must handle both cases.&lt;/p&gt;
&lt;p&gt;The standard model — used by PostgreSQL, Oracle, MySQL InnoDB, and SQL Server — divides recovery into two phases: redo and undo.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Engineers monitoring a database restart after a crash often see recovery take longer than expected and cannot explain why. They see log messages about “replaying WAL” or “applying redo records” and assume that means the database is restoring from backup. It is not. It is doing normal crash recovery — and understanding the two phases explains why the timeline is what it is.&lt;/p&gt;
&lt;p&gt;How long should crash recovery take, and what is the database actually doing during that time?&lt;/p&gt;
&lt;h2 id=&quot;redo-bring-committed-changes-forward&quot;&gt;Redo: Bring Committed Changes Forward&lt;/h2&gt;
&lt;p&gt;Redo uses the write-ahead log (WAL in PostgreSQL, redo log in Oracle/MySQL) to replay every change since the last checkpoint, in log sequence order. The checkpoint is a known consistent point — all data pages at the checkpoint are guaranteed to be on disk.&lt;/p&gt;
&lt;p&gt;After a crash, the database scans forward from the last checkpoint and replays each WAL record: insert a row here, update a column there, allocate a page. This brings data files forward to the state they would have been in if the crash had not happened. Redo does not distinguish between committed and uncommitted transactions — it applies all log records first.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL: see recovery progress during startup (from another session or log)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Check pg_waldump for log record analysis post-crash:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- pg_waldump -p /var/lib/postgresql/data/pg_wal -s 0/1234ABCD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- After recovery, confirm the database recovered to the right LSN:&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_current_wal_lsn();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Redo is deterministic and bounded: it replays records from the checkpoint LSN to the end of the WAL. Recovery time is proportional to how far the WAL advanced past the last checkpoint — which is controlled by &lt;code&gt;checkpoint_timeout&lt;/code&gt; and &lt;code&gt;max_wal_size&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;undo-roll-back-uncommitted-changes&quot;&gt;Undo: Roll Back Uncommitted Changes&lt;/h2&gt;
&lt;p&gt;After redo, the database contains a mix of committed and uncommitted changes. Undo scans the log in reverse and removes every change made by transactions that were not committed at the time of the crash. In PostgreSQL, this is handled implicitly by MVCC: row versions written by uncommitted transactions are simply invisible to new readers because the transaction recorded in their &lt;code&gt;xmin&lt;/code&gt; was never marked committed. In InnoDB and Oracle, a separate undo log stores the before-images of rows that were modified by uncommitted transactions.&lt;/p&gt;
&lt;p&gt;The operational implication: in InnoDB, recovery time includes the undo phase, which can be significant if a long-running uncommitted transaction modified many rows. PostgreSQL’s MVCC approach means undo is lazy — the dead rows persist and are cleaned up by vacuum later, trading immediate undo cost for deferred cleanup cost.&lt;/p&gt;
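&lt;p&gt;The two phases can be shown with a toy model. This is a simplified sketch of the undo-log style described above, not the actual recovery code of any database, and not how PostgreSQL handles uncommitted rows. The record layout and the &lt;code&gt;recover&lt;/code&gt; function are hypothetical.&lt;/p&gt;

```python
def recover(disk, log):
    """Toy redo/undo recovery.

    Each log record is ("write", txn, key, before, after) or ("commit", txn).
    """
    db = dict(disk)
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo: replay every write in log order, committed or not.
    for rec in log:
        if rec[0] == "write":
            _, txn, key, before, after = rec
            db[key] = after

    # Undo: walk the log backwards and restore the before-image of
    # every write made by a transaction that never committed.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, txn, key, before, after = rec
            if before is None:
                db.pop(key, None)   # the row did not exist before
            else:
                db[key] = before
    return db


log = [
    ("write", "T1", "a", 1, 2),
    ("write", "T2", "b", None, 9),
    ("commit", "T1"),
    ("write", "T2", "a", 2, 3),
]
print(recover({"a": 1}, log))   # {'a': 2}
```

&lt;p&gt;T1 committed, so its write survives. T2 never committed, so both of its writes are rolled back, and the cost of that rollback grows with the number of rows T2 touched.&lt;/p&gt;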
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s documented recovery model confirms that crash recovery replays WAL records from the last checkpoint. The time to recover is bounded by &lt;code&gt;checkpoint_timeout&lt;/code&gt; (default: 5 minutes) and how aggressively the database was writing past the checkpoint. Oracle’s documented recovery model uses a dedicated undo tablespace where before-images are stored for rollback; the undo tablespace must be sized for the longest running uncommitted transaction.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure&lt;/th&gt;&lt;th&gt;Cause&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Crash recovery takes 20+ minutes&lt;/td&gt;&lt;td&gt;Long checkpoint interval; heavy WAL generation past last checkpoint&lt;/td&gt;&lt;td&gt;Lower &lt;code&gt;checkpoint_timeout&lt;/code&gt;; ensure checkpoints complete before the next starts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;InnoDB recovery stuck on undo&lt;/td&gt;&lt;td&gt;Large uncommitted transaction at time of crash&lt;/td&gt;&lt;td&gt;Cannot be accelerated; undo must complete before DB opens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL bloat after crash&lt;/td&gt;&lt;td&gt;Uncommitted dead tuples not cleaned up&lt;/td&gt;&lt;td&gt;Normal — autovacuum will reclaim after recovery; no action needed&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Long crash recovery is almost always a checkpoint tuning problem — the database is redoing too much WAL because checkpoints were too infrequent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code&gt;checkpoint_timeout&lt;/code&gt; to 5 minutes or less; monitor &lt;code&gt;pg_stat_bgwriter.checkpoints_timed&lt;/code&gt; vs &lt;code&gt;checkpoints_req&lt;/code&gt; to confirm checkpoints complete on schedule.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After tuning, crash recovery tests in staging should complete in under 2 minutes for typical OLTP loads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Check your current &lt;code&gt;checkpoint_timeout&lt;/code&gt; and measure how much WAL has accumulated since the last checkpoint’s redo point: &lt;code&gt;SHOW checkpoint_timeout; SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn)) FROM pg_control_checkpoint();&lt;/code&gt;. Sampled under peak load shortly before a checkpoint, this approximates the WAL that crash recovery would have to replay.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>B-tree vs LSM Tree: The Storage Engine Tradeoff</title><link>https://rajivonai.com/blog/2022-06-14-btree-vs-lsm-tree-the-storage-engine-tradeoff/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-06-14-btree-vs-lsm-tree-the-storage-engine-tradeoff/</guid><description>Why PostgreSQL and MySQL use B-trees while Cassandra and RocksDB use LSM trees — the read/write tradeoff that determines which storage engine fits your workload.</description><pubDate>Tue, 14 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The storage engine is the most consequential architectural decision in a database, and the core tradeoff has not changed in fifty years: B-trees are fast to read; LSM trees are fast to write. Your workload determines which penalty you can afford.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most engineers working with relational databases have never chosen a storage engine: PostgreSQL stores tables as heaps with B-tree indexes by default, and the choice was made for them. Engineers working with Cassandra, RocksDB, or HBase are using LSM trees, often without knowing why the database was designed that way.&lt;/p&gt;
&lt;p&gt;The two structures dominate modern database storage: B-trees (balanced tree indexes used in PostgreSQL, MySQL InnoDB, Oracle) and LSM trees (log-structured merge trees used in Cassandra, LevelDB, RocksDB, and HBase). Each trades read performance for write performance in a different direction.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Choosing or operating a database without understanding the storage engine’s read/write tradeoffs leads to predictable operational failures. A B-tree database under sustained high-write workloads shows write amplification and fragmentation. An LSM-tree database that is read-heavy shows read amplification as the engine scans multiple levels of sorted files. You cannot tune your way out of the wrong structural choice.&lt;/p&gt;
&lt;p&gt;What is the actual tradeoff, and when does each structure win?&lt;/p&gt;
&lt;h2 id=&quot;the-structures&quot;&gt;The Structures&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;B-trees&lt;/strong&gt; store data in a balanced tree of fixed-size pages, typically 8KB in PostgreSQL. An UPDATE modifies the page in place after finding it via the tree. Reads are efficient: traverse from root to leaf, read the page. Writes require finding the right page, potentially splitting it (causing write amplification), and updating parent pointers. B-trees are random-write structures — every update touches disk in place.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LSM trees&lt;/strong&gt; never update in place. Writes go to an in-memory buffer (memtable), which is periodically flushed to an immutable sorted file (SSTable) on disk. Reads must check the memtable and potentially multiple SSTable levels to find the current version. Background compaction merges SSTables, reclaiming space and reducing the number of levels to check. LSM trees are sequential-write structures — disk writes are always sequential appends.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;B-tree read:  O(log n) — traverse tree, read page&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;B-tree write: O(log n) — find page, modify in place (random I/O)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;LSM write:    O(1) amortized — append to memtable, flush sequentially&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;LSM read:     O(L) — check L levels of SSTables for latest version&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
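&lt;p&gt;A minimal Python sketch of the LSM write and read paths described above may make the asymmetry concrete. It is a toy under stated assumptions, not how Cassandra or RocksDB are implemented: &lt;code&gt;TinyLSM&lt;/code&gt;, &lt;code&gt;memtable_limit&lt;/code&gt;, and the run count returned by &lt;code&gt;get&lt;/code&gt; are hypothetical names chosen for illustration, runs are kept in memory rather than on disk, and Bloom filters, leveled compaction, and tombstones are left out.&lt;/p&gt;

```python
from bisect import bisect_left


class TinyLSM:
    """Toy LSM tree: a dict memtable plus a list of immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        # Newest run first; each run is a sorted list of (key, value) pairs.
        self.sstables = []

    def put(self, key, value):
        # Writes only touch memory until the memtable fills up.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A flush emits one sorted, immutable run (a sequential write on a real disk).
        if self.memtable:
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Return (value, runs_checked); runs_checked stands in for read amplification."""
        if key in self.memtable:
            return self.memtable[key], 0
        for checked, run in enumerate(self.sstables, start=1):
            keys = [k for k, _ in run]
            i = bisect_left(keys, key)
            if i != len(keys) and keys[i] == key:
                return run[i][1], checked
        return None, len(self.sstables)

    def compact(self):
        # Merge every run into one, letting newer values overwrite older ones.
        merged = {}
        for run in reversed(self.sstables):  # oldest run first
            merged.update(run)
        self.sstables = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for n in range(6):
    db.put(f"k{n}", n)
before = db.get("k0")  # oldest key: every run is checked
db.compact()
after = db.get("k0")   # one merged run remains
```

&lt;p&gt;With a limit of two entries, six writes leave three runs, so looking up the oldest key checks all three. After &lt;code&gt;compact()&lt;/code&gt; the same lookup checks one. That is the read cost compaction exists to bound.&lt;/p&gt;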
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Attribute&lt;/th&gt;&lt;th&gt;B-tree&lt;/th&gt;&lt;th&gt;LSM tree&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Write path&lt;/td&gt;&lt;td&gt;Random in-place page modification&lt;/td&gt;&lt;td&gt;Sequential append to memtable → SSTable flush&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Read path&lt;/td&gt;&lt;td&gt;Tree traversal, one disk read at leaf&lt;/td&gt;&lt;td&gt;Multi-level SSTable scan (read amplification)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Write throughput&lt;/td&gt;&lt;td&gt;Good for balanced workloads&lt;/td&gt;&lt;td&gt;Excellent; consistently low write latency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Read throughput&lt;/td&gt;&lt;td&gt;Excellent for point lookups and range scans&lt;/td&gt;&lt;td&gt;Moderate; degrades as SSTable level count grows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Space overhead&lt;/td&gt;&lt;td&gt;Fragmentation accumulates; autovacuum reclaims&lt;/td&gt;&lt;td&gt;Space amplification during compaction windows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Background work&lt;/td&gt;&lt;td&gt;Autovacuum, checkpoint, bgwriter&lt;/td&gt;&lt;td&gt;Compaction (CPU and I/O intensive at peak)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Best workload&lt;/td&gt;&lt;td&gt;OLTP: balanced reads/writes, point lookups, range scans&lt;/td&gt;&lt;td&gt;Write-heavy: IoT, time-series, event streams&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Databases&lt;/td&gt;&lt;td&gt;PostgreSQL, MySQL InnoDB, Oracle, SQLite&lt;/td&gt;&lt;td&gt;Cassandra, RocksDB, LevelDB, HBase&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s documented design uses heap files with B-tree indexes. The B-tree is a good fit for OLTP workloads with balanced reads and writes, point lookups, and range scans. PostgreSQL’s MVCC model writes a new tuple version on every update and leaves the old one behind as a dead tuple, so writes also accumulate bloat in heap and index pages that autovacuum must reclaim. That bloat is the write tax of this design.&lt;/p&gt;
&lt;p&gt;Cassandra’s documented design uses an LSM tree (via SSTables). Cassandra is optimized for write-heavy workloads: time-series, IoT, event streams, and any pattern where writes vastly outnumber reads. The tradeoff is that reads are more expensive (scanning multiple SSTables), and compaction consumes I/O bandwidth during which read latency can increase.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Workload&lt;/th&gt;&lt;th&gt;B-tree result&lt;/th&gt;&lt;th&gt;LSM result&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;High write throughput&lt;/td&gt;&lt;td&gt;Write amplification; page splits; fragmentation&lt;/td&gt;&lt;td&gt;Sequential append; consistent write latency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Point lookups (read-heavy)&lt;/td&gt;&lt;td&gt;Fast; single tree traversal&lt;/td&gt;&lt;td&gt;Slower; must check multiple SSTable levels&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Range scans&lt;/td&gt;&lt;td&gt;Fast; sorted pages&lt;/td&gt;&lt;td&gt;Moderate; sorted within SSTables, merge across levels&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Compaction pressure&lt;/td&gt;&lt;td&gt;Autovacuum reclaims dead tuples continuously&lt;/td&gt;&lt;td&gt;Background compaction spikes I/O; read latency degrades&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Operating a write-heavy workload on a B-tree engine or a read-heavy workload on an LSM engine produces predictable performance degradation that cannot be tuned away.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Classify your workload by read/write ratio, access pattern (point vs range), and acceptable latency variance before selecting an engine.&lt;/li&gt;
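&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: On PostgreSQL, approximate write amplification by comparing the buffer-write counters in &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; (checkpoint counters moved to &lt;code&gt;pg_stat_checkpointer&lt;/code&gt; in PostgreSQL 17) with the volume of rows changed; on an LSM database, measure read amplification via SSTable level counts in the engine’s metrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Identify your top three most write-intensive tables today and measure their dead tuple ratio, for example &lt;code&gt;n_dead_tup&lt;/code&gt; against &lt;code&gt;n_live_tup&lt;/code&gt; in &lt;code&gt;pg_stat_user_tables&lt;/code&gt;. That ratio is the B-tree engine’s write tax showing up as storage overhead.&lt;/li&gt;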
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category><category>architecture</category></item><item><title>Read-After-Write Consistency: The UX Bug That Becomes a Database Bug</title><link>https://rajivonai.com/blog/2022-04-26-read-after-write-consistency-the-ux-bug-that-becomes-a-database-bug/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-04-26-read-after-write-consistency-the-ux-bug-that-becomes-a-database-bug/</guid><description>Acknowledging a write before the system knows where the next read will land turns a clean product experience into a staleness bug that looks like data loss — how read-after-write consistency works and where it breaks under replica lag.</description><pubDate>Tue, 26 Apr 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The fastest way to turn a clean product experience into an incident is to acknowledge a write before the system knows where the next read will land.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Modern applications rarely read from the same place they write.&lt;/p&gt;
&lt;p&gt;A user updates a profile, changes a permission, uploads a document, or submits a payment method. The write goes to the primary database, an event stream, a cache invalidation queue, a search indexer, a read replica, and sometimes a regional projection. The UI receives &lt;code&gt;200 OK&lt;/code&gt;, closes the modal, and immediately asks for the updated screen.&lt;/p&gt;
&lt;p&gt;That second request is where the architecture is exposed.&lt;/p&gt;
&lt;p&gt;If it reads from a lagging replica, a stale cache, or a denormalized projection that has not consumed the event yet, the user sees the old value. They retry. They refresh. They submit again. Support calls it a UX bug. Product calls it confusing. Engineering eventually discovers that the interface made a stronger consistency promise than the storage path could honor.&lt;/p&gt;
&lt;p&gt;Read-after-write consistency is not a database feature you either have or lack. It is a contract between a mutation path, a read path, and a user session.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The common failure is treating all reads as equivalent.&lt;/p&gt;
&lt;p&gt;A homepage feed can tolerate eventual freshness. A billing confirmation page cannot. A search result can lag behind a create operation if the UI says indexing is pending. A permission check after an admin change cannot quietly read old state from a replica and let the wrong access decision through.&lt;/p&gt;
&lt;p&gt;The bug appears when the system does not distinguish these cases. The write path says, “committed.” The read router says, “nearest healthy replica.” The cache says, “still inside TTL.” The UI says, “saved.” Each component is locally reasonable, but the composition violates the user’s mental model.&lt;/p&gt;
&lt;p&gt;The hard question is not, “Should every read be strongly consistent?” That answer is usually no. The better question is: &lt;strong&gt;which user-visible workflows require monotonic session reads, and how does the system prove that the next read observes the write it just acknowledged?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;session-causal-read-path&quot;&gt;Session-Causal Read Path&lt;/h2&gt;
&lt;p&gt;A practical architecture starts by carrying causality across the request boundary. The write response should return a commit marker: a database LSN, version, timestamp, entity revision, or application sequence number. The client or backend session stores the highest marker it has observed. Subsequent reads include that marker, and the read path must choose a source that has caught up.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[client mutation — save settings] --&gt; B[write gateway — validate command]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[primary store — commit new version]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; D[commit marker — session version]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[client session — remember marker]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt; F[replication stream — apply changes]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; G[read replica — report replay position]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; H[read gateway — require observed version]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  H --&gt; I{replica caught up}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt;|yes| J[replica read — normal latency]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt;|no| K[primary read — consistency fallback]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  H --&gt; L[cache policy — bypass stale entry]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  J --&gt; M[response — shows committed state]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  K --&gt; M&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern keeps most reads cheap while making the consistency requirement explicit. The gateway does not need to serialize the whole application. It only needs to answer a narrow question: can this read source prove it has observed at least the version the session already saw?&lt;/p&gt;
&lt;p&gt;There are several implementation variants.&lt;/p&gt;
&lt;p&gt;For single-primary relational systems, the marker can be the primary’s log position. For Dynamo-style systems, it can be an item version or vector-derived revision. For event-driven projections, it can be the event offset applied by the projection. For caches, it can be a versioned key or a rule that bypasses cache entries older than the session marker.&lt;/p&gt;
&lt;p&gt;The important design choice is that “read your own write” becomes a routed behavior, not a hope.&lt;/p&gt;
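&lt;p&gt;The routing decision can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: &lt;code&gt;Replica&lt;/code&gt;, &lt;code&gt;Session&lt;/code&gt;, &lt;code&gt;choose_read_source&lt;/code&gt;, and the integer markers are hypothetical, and a real gateway would read replay positions from the database (a log position, an event offset, or an item version) rather than from an in-memory list.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    replayed_position: int  # highest commit marker this replica has applied


class Session:
    """Tracks the highest commit marker this session has observed."""

    def __init__(self):
        self.marker = 0

    def observe(self, commit_marker):
        # Markers only move forward, which keeps session reads monotonic.
        self.marker = max(self.marker, commit_marker)


def choose_read_source(session_marker, replicas, primary="primary"):
    """Return the name of a source that has applied at least session_marker."""
    caught_up = [r for r in replicas if r.replayed_position >= session_marker]
    if caught_up:
        # Any caught-up replica is safe; this sketch picks the most advanced one.
        return max(caught_up, key=lambda r: r.replayed_position).name
    # No replica can prove it is current enough, so fall back to the primary.
    return primary


replicas = [Replica("replica-a", 100), Replica("replica-b", 90)]
session = Session()
session.observe(95)   # the write response carried marker 95
first = choose_read_source(session.marker, replicas)
session.observe(120)  # a later write the replicas have not replayed yet
second = choose_read_source(session.marker, replicas)
```

&lt;p&gt;Here the first read can be served by &lt;code&gt;replica-a&lt;/code&gt;, while the second falls back to the primary because no replica has reached marker 120. A session that has written nothing keeps a marker of zero and can read from any replica.&lt;/p&gt;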
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;h3 id=&quot;context&quot;&gt;Context&lt;/h3&gt;
&lt;p&gt;Amazon’s Dynamo paper describes a system designed for high availability, where updates are propagated asynchronously and conflicts are handled using object versioning and application-assisted resolution. The documented pattern is explicit: the data store exposes versions because the application may have the semantic knowledge required to merge divergent updates. See &lt;a href=&quot;https://www.amazon.science/publications/dynamo-amazons-highly-available-key-value-store&quot;&gt;Dynamo: Amazon’s Highly Available Key-value Store&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;action&quot;&gt;Action&lt;/h3&gt;
&lt;p&gt;Dynamo’s lesson is not that every product should accept stale reads. It is that consistency policy has to be part of the application contract. If the domain is a shopping cart, preserving writes and resolving conflicts later may be acceptable. If the domain is access control, inventory reservation, or payment confirmation, conflict surfacing is not enough. The read path must either go to an authoritative source or wait until the replica can prove it is current enough.&lt;/p&gt;
&lt;p&gt;Amazon DynamoDB exposes this tradeoff directly. Its documentation says eventually consistent reads are the default and may not reflect a recently completed write, while strongly consistent reads can be requested for tables and local secondary indexes. It also documents that global secondary indexes and streams are eventually consistent. See &lt;a href=&quot;https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html&quot;&gt;DynamoDB read consistency&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;result&quot;&gt;Result&lt;/h3&gt;
&lt;p&gt;The result is a useful rule: a successful write acknowledgement is not the same thing as global read visibility. DynamoDB can durably accept a write and still require the caller to choose the correct read mode for the next operation. That is not a contradiction; it is a contract boundary.&lt;/p&gt;
&lt;p&gt;PostgreSQL shows another version of the same issue. With synchronous replication and &lt;code&gt;synchronous_commit = remote_apply&lt;/code&gt;, commits wait until synchronous standbys have replayed the transaction, making it visible to standby queries. The PostgreSQL documentation notes that this can allow load balancing with causal consistency in simple cases. See &lt;a href=&quot;https://www.postgresql.org/docs/current/warm-standby.html&quot;&gt;PostgreSQL log-shipping standby servers&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;learning&quot;&gt;Learning&lt;/h3&gt;
&lt;p&gt;The learning is that read-after-write consistency can be purchased in different currencies: higher write latency, higher read latency, reduced replica choice, more expensive read modes, or more application complexity.&lt;/p&gt;
&lt;p&gt;Google Spanner makes a more global tradeoff. Its external consistency model uses TrueTime and replication protocols so transaction ordering respects real-time ordering across distributed infrastructure. The documented architecture spends coordination and clock uncertainty management to make the database provide a stronger contract. See &lt;a href=&quot;https://research.google/pubs/pub39966&quot;&gt;Spanner: Google’s Globally-Distributed Database&lt;/a&gt; and &lt;a href=&quot;https://cloud.google.com/spanner/docs/true-time-external-consistency&quot;&gt;Spanner TrueTime and external consistency&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Most systems do not need Spanner’s full contract for every request. But they do need to name which requests depend on that contract.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;Works Well For&lt;/th&gt;&lt;th&gt;Failure Mode&lt;/th&gt;&lt;th&gt;Operational Cost&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Always read from primary after writes&lt;/td&gt;&lt;td&gt;Account settings, billing, admin workflows&lt;/td&gt;&lt;td&gt;Primary becomes read bottleneck under broad use&lt;/td&gt;&lt;td&gt;Higher primary load and cross-region latency&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sticky session to primary for a short window&lt;/td&gt;&lt;td&gt;User-facing confirmation flows&lt;/td&gt;&lt;td&gt;Session affinity breaks across devices or services&lt;/td&gt;&lt;td&gt;Routing state and fallback logic&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Version-aware replica reads&lt;/td&gt;&lt;td&gt;High-read systems with measurable replica lag&lt;/td&gt;&lt;td&gt;Requires reliable replay position reporting&lt;/td&gt;&lt;td&gt;More gateway complexity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cache bypass after mutation&lt;/td&gt;&lt;td&gt;Pages with aggressive caching&lt;/td&gt;&lt;td&gt;Bypass rules drift from mutation semantics&lt;/td&gt;&lt;td&gt;Cache policy ownership burden&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Projection pending state&lt;/td&gt;&lt;td&gt;Search, analytics, feeds, async enrichment&lt;/td&gt;&lt;td&gt;Users may see incomplete state longer&lt;/td&gt;&lt;td&gt;Product must expose honest state&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Strong read mode per request&lt;/td&gt;&lt;td&gt;DynamoDB-style point reads&lt;/td&gt;&lt;td&gt;Unsupported on some indexes or projections&lt;/td&gt;&lt;td&gt;Higher read cost and explicit call-site discipline&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Global external consistency&lt;/td&gt;&lt;td&gt;Cross-region transactional systems&lt;/td&gt;&lt;td&gt;Overkill for low-value freshness paths&lt;/td&gt;&lt;td&gt;Coordination cost and vendor constraints&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Find the workflows where the UI says “saved” and then immediately reads the same entity, permission, balance, or derived view.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Add a session-visible commit marker to mutation responses and make read routing honor that marker with replica catch-up, cache bypass, or primary fallback.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Test with forced replica lag, delayed cache invalidation, and slow projection consumers. The confirmation path should still show the committed state or an explicit pending state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Classify reads as stale-tolerant, session-causal, or globally consistent. Make that classification visible in code so future engineers cannot accidentally route a confirmation read through an eventually consistent path.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>system-design</category><category>cloud</category></item><item><title>Rate Limiting Is a Product Contract, Not Just a Redis Counter</title><link>https://rajivonai.com/blog/2022-04-11-rate-limiting-is-a-product-contract-not-just-a-redis-counter/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-04-11-rate-limiting-is-a-product-contract-not-just-a-redis-counter/</guid><description>Rate limiting fails when the platform enforces one behavior while the product promised another to clients. The technical mechanism matters less than treating rate limits as a documented contract with defined scope, limits, and error semantics.</description><pubDate>Mon, 11 Apr 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The failure mode is not that too many requests reached Redis. The failure mode is that the product promised one behavior, the platform enforced another, and clients learned the difference in production.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Rate limiting usually enters the design review as an infrastructure problem. Someone draws a gateway, a Redis cluster, a token bucket, and a &lt;code&gt;429 Too Many Requests&lt;/code&gt; response. That is a useful mechanism, but it is not the architecture.&lt;/p&gt;
&lt;p&gt;The architecture starts earlier: who is entitled to do what, at what cost, under which plan, from which identity, and with what recovery semantics when they exceed the boundary. A free user sending ten expensive export jobs is not the same as an enterprise tenant sending ten cheap metadata reads. A customer retrying after a timeout is not the same as a bot scanning every endpoint. A batch integration that can wait is not the same as a checkout path that must preserve latency.&lt;/p&gt;
&lt;p&gt;Modern APIs are product surfaces. Their limits shape customer onboarding, billing, abuse protection, fairness between tenants, and incident blast radius. Once customers automate against the limit, the limit becomes part of the contract whether the team wrote it down or not.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The common implementation is deceptively simple: increment a key in Redis, set an expiry, reject when the count crosses a threshold. It works for a single endpoint, a single identity model, and a single failure budget. It collapses when the system needs to express product reality.&lt;/p&gt;
&lt;p&gt;The first break is identity. Is the unit of fairness an API key, OAuth app, user, tenant, IP address, organization, workload, or billing account? If the limiter uses the wrong key, one noisy integration can starve an entire customer, or one customer can bypass protection by fanning out credentials.&lt;/p&gt;
&lt;p&gt;The second break is cost. One request is not one unit of work. A cache hit, a paginated search, a graph expansion, and a report generation path may all look like HTTP requests while consuming radically different CPU, database, queue, and third-party quota.&lt;/p&gt;
&lt;p&gt;The third break is communication. If clients only receive &lt;code&gt;429&lt;/code&gt;, they do not know whether to retry in one second, one hour, with a smaller page size, with a different credential, or never. Bad limit responses create retry storms. Good limit responses create coordinated backpressure.&lt;/p&gt;
&lt;p&gt;The fourth break is operations. During an incident, teams need to lower limits for one route, exempt one tenant, shed one class of work, and observe which contracts are being enforced. A hard-coded Redis counter gives the operator a knob. A contract-oriented limiter gives the operator a control plane.&lt;/p&gt;
&lt;p&gt;The question is not “which rate limiting algorithm should we use?” The question is: &lt;strong&gt;what product contract should the platform enforce when demand exceeds safe capacity?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;make-the-limit-a-contract&quot;&gt;Make the Limit a Contract&lt;/h2&gt;
&lt;p&gt;A rate limit contract has five parts: identity, budget, scope, response, and observability.&lt;/p&gt;
&lt;p&gt;Identity defines who owns the budget. Budget defines the allowed cost over time. Scope defines where the budget applies: route, method, feature, tenant, region, or dependency. Response defines what the client can rely on when it is throttled. Observability proves whether the contract is fair, effective, and safe.&lt;/p&gt;
&lt;p&gt;The implementation can still use token buckets, leaky buckets, fixed windows, sliding windows, or distributed counters. Those are enforcement details. The durable design decision is to separate policy from enforcement.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[product plan — entitlement] --&gt; B[policy compiler — routes and budgets]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[edge gateway — cheap rejection]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; D[global limiter — shared quota]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; E[service guardrail — expensive work]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|allow| F[request handler — business path]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt;|allow| F&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt;|allow| F&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|deny| G[limit response — status and reset]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt;|deny| G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt;|deny| G&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt; H[response contract — headers and retry]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  G --&gt; H&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|events| I[observability — tenant and route]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt;|events| I&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt;|events| I&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The edge gateway should reject obviously over-budget traffic before it consumes expensive resources. The global limiter should coordinate shared tenant or account budgets across regions and workers. The service guardrail should protect the scarce dependency the gateway cannot understand: a database connection pool, a model inference queue, an export worker, or a search cluster.&lt;/p&gt;
&lt;p&gt;The response contract matters as much as the rejection. Clients need stable status codes, remaining budget headers where appropriate, reset information, and retry guidance. Some limits should be documented as hard product limits. Others should be documented as protective limits that may vary during abuse or incidents.&lt;/p&gt;
&lt;p&gt;The contract should also admit hierarchy. A platform may need an account-level daily quota, a per-route burst limit, a concurrency cap for expensive jobs, and an emergency regional drain rule. Treating all of that as “requests per minute” hides the product decision inside infrastructure syntax.&lt;/p&gt;
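&lt;p&gt;As a rough sketch of how budget, cost, and response fit together, the Python below charges each request by cost class and returns retry guidance on denial. Everything named here is an assumption for illustration: &lt;code&gt;CostBucket&lt;/code&gt;, &lt;code&gt;Decision&lt;/code&gt;, &lt;code&gt;PLAN_BUDGETS&lt;/code&gt;, and &lt;code&gt;ROUTE_COSTS&lt;/code&gt; are hypothetical, the numbers are made up, and the clock is passed in explicitly so the behavior is deterministic. A token bucket is only one possible enforcement mechanism.&lt;/p&gt;

```python
import math
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    remaining: float
    retry_after_seconds: int


class CostBucket:
    """Token bucket where each request spends a cost, not a flat count of one."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated_at = now

    def take(self, cost, now):
        # Refill for the elapsed time, capped at the bucket capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return Decision(True, self.tokens, 0)
        # Tell the client when enough budget will exist so retries can be paced.
        wait = math.ceil((cost - self.tokens) / self.refill_per_second)
        return Decision(False, self.tokens, wait)


# Hypothetical policy data: (capacity, refill per second) by plan, cost by route class.
PLAN_BUDGETS = {"free": (10, 1.0), "enterprise": (100, 10.0)}
ROUTE_COSTS = {"metadata_read": 1, "export_job": 8}

bucket = CostBucket(*PLAN_BUDGETS["free"])
first_export = bucket.take(ROUTE_COSTS["export_job"], now=0.0)
second_export = bucket.take(ROUTE_COSTS["export_job"], now=0.0)
cheap_read = bucket.take(ROUTE_COSTS["metadata_read"], now=0.0)
```

&lt;p&gt;On the free plan the first export is admitted, the second is denied with a wait of six seconds, and a cheap metadata read still succeeds from the remaining budget. A gateway could surface &lt;code&gt;retry_after_seconds&lt;/code&gt; as a &lt;code&gt;Retry-After&lt;/code&gt; header on the &lt;code&gt;429&lt;/code&gt; response. Keeping the policy tables separate from the bucket is the point: plans and costs can change without touching enforcement code.&lt;/p&gt;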
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; GitHub’s REST API documentation describes primary rate limits, secondary rate limits, response headers such as remaining quota, and &lt;code&gt;403&lt;/code&gt; or &lt;code&gt;429&lt;/code&gt; behavior when limits are exceeded. The documented pattern is that client-visible limits are not just counters; they are part of the API behavior clients must code against. &lt;a href=&quot;https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api&quot;&gt;GitHub REST API rate limits&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; A contract-oriented design copies that separation. Primary limits express the normal entitlement. Secondary limits protect platform health when behavior is abusive, highly concurrent, or expensive even if the primary quota is not exhausted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The client can reason about normal consumption while the provider keeps room for protective enforcement. That is a better contract than pretending every unsafe behavior can be captured by a single remaining counter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Publish the steady-state budget, but reserve an explicitly documented protective layer for overload and abuse. If the protective layer is invisible, customers experience it as randomness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; AWS API Gateway usage plans associate API keys with throttling and quota settings. AWS documents that these limits apply per API key, aggregated across all API stages within a usage plan, and that usage plans also support method-level throttling. &lt;a href=&quot;https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-api-usage-plans.html&quot;&gt;API Gateway usage plans&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The useful pattern is plan-driven policy, not merely gateway-side rejection. Product packaging, API identity, route-level cost, and operational throttling meet in one control surface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Teams can express different budgets for different customers and methods without forcing every backend service to rediscover the commercial model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Put product policy in a place where product, platform, and operations can all inspect it. If the policy only exists as scattered constants, no one owns the contract.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Kubernetes API Priority and Fairness controls API server behavior under overload by classifying requests and managing fairness between flows. The documented pattern is load shedding with priority, not undifferentiated rejection. &lt;a href=&quot;https://kubernetes.io/docs/concepts/cluster-administration/flow-control/&quot;&gt;Kubernetes API Priority and Fairness&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Apply the same idea to product APIs. Separate interactive reads, background sync, admin operations, and bulk exports into classes with different queues, concurrency, and rejection behavior.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; A batch customer job can be slowed without taking down a latency-sensitive operational path. The system fails by policy instead of by accident.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Fairness is a product and reliability decision. A limiter that cannot distinguish work classes will eventually protect the wrong thing.&lt;/p&gt;
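&lt;p&gt;A minimal sketch of the work-class idea, assuming three hypothetical classes with independent concurrency caps. The class names, path prefixes, and limits are made up for illustration, and Kubernetes API Priority and Fairness is considerably more elaborate than this.&lt;/p&gt;

```python
# Hypothetical class table: each work class gets its own concurrency cap,
# so a flood of bulk work cannot consume slots reserved for interactive reads.
CLASS_LIMITS = {"interactive": 8, "background": 4, "bulk": 1}


def classify(method: str, path: str) -> str:
    # Illustrative rules only; a real classifier would use route metadata.
    if path.startswith("/exports"):
        return "bulk"
    if path.startswith("/sync"):
        return "background"
    return "interactive"


class ClassLimiter:
    """Single-threaded sketch; a real limiter would need locking or atomics."""

    def __init__(self, limits: dict[str, int]) -> None:
        self.limits = dict(limits)
        self.in_flight = {name: 0 for name in limits}

    def try_acquire(self, work_class: str) -> bool:
        # Reject when this class is at its cap; other classes are unaffected.
        if self.in_flight[work_class] >= self.limits[work_class]:
            return False
        self.in_flight[work_class] += 1
        return True

    def release(self, work_class: str) -> None:
        self.in_flight[work_class] -= 1
```

&lt;p&gt;With these limits, a second concurrent export is rejected while interactive requests continue to be admitted, which is the policy-driven failure the text describes.&lt;/p&gt;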
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;













































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;What happens&lt;/th&gt;&lt;th&gt;Design response&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Wrong identity key&lt;/td&gt;&lt;td&gt;One integration starves a tenant, or one tenant bypasses limits&lt;/td&gt;&lt;td&gt;Model budgets around the accountable product entity&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Flat request pricing&lt;/td&gt;&lt;td&gt;Cheap reads and expensive jobs consume the same quota&lt;/td&gt;&lt;td&gt;Charge budget by cost class, not only request count&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Hidden protective limits&lt;/td&gt;&lt;td&gt;Clients see random throttling and retry harder&lt;/td&gt;&lt;td&gt;Document secondary limits and retry behavior&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Single enforcement point&lt;/td&gt;&lt;td&gt;Gateway allows work that later melts a dependency&lt;/td&gt;&lt;td&gt;Add service-level guardrails near scarce resources&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;No emergency controls&lt;/td&gt;&lt;td&gt;Incident response requires code deploys&lt;/td&gt;&lt;td&gt;Keep runtime policy overrides with audit trails&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Poor observability&lt;/td&gt;&lt;td&gt;Operators cannot explain who was throttled or why&lt;/td&gt;&lt;td&gt;Emit decision events by tenant, route, class, and rule&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Over-strict consistency&lt;/td&gt;&lt;td&gt;Limiter becomes a global latency dependency&lt;/td&gt;&lt;td&gt;Use approximate distributed enforcement where exactness is not worth the availability cost&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; A Redis counter answers “how many requests arrived,” but the product needs to answer “which customer, plan, route, and work class is allowed to consume scarce capacity.”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Define the rate limit contract first: identity, budget, scope, response, and observability. Then choose enforcement algorithms that fit each layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Public systems such as GitHub, AWS API Gateway, and Kubernetes expose the same pattern in different forms: documented limits, plan-aware throttling, and fairness under overload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Inventory every public and internal API limit. For each one, write down the accountable identity, the cost model, the client response, the operational override, and the dashboard that proves enforcement is behaving as intended.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>system-design</category><category>cloud</category></item><item><title>Consistent Hashing: What It Solves and What It Does Not</title><link>https://rajivonai.com/blog/2022-03-27-consistent-hashing-what-it-solves-and-what-it-does-not/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-03-27-consistent-hashing-what-it-solves-and-what-it-does-not/</guid><description>Consistent hashing is a damage-control mechanism for cluster membership change, not a general scalability strategy — what it limits during node additions and removals, and the tradeoffs that make it unsuitable as a universal sharding approach.</description><pubDate>Sun, 27 Mar 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Consistent hashing is not a scalability strategy by itself; it is a damage-control mechanism for membership change.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Distributed systems keep getting pushed toward elastic capacity. Databases add nodes. Caches scale out during traffic spikes. Storage clusters replace failed machines. Multi-tenant platforms rebalance load as customers grow unevenly.&lt;/p&gt;
&lt;p&gt;The simple answer is to partition data. Take a key, hash it, choose a machine, and route the request. When the number of machines is stable, this works well enough. The system has deterministic placement, every client can compute where a key belongs, and no central router has to remember every object.&lt;/p&gt;
&lt;p&gt;The problem starts when the fleet changes.&lt;/p&gt;
&lt;p&gt;With naive modulo partitioning, placement usually looks like this:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;text&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;node = hash(key) mod number_of_nodes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That line is attractive because it is simple. It is also operationally brutal. If the cluster grows from 10 nodes to 11, roughly 10 in 11 keys (about 91% under a uniform hash) now map to a different node, because a key keeps its owner only when its hash gives the same remainder for both node counts. The cluster does not just add capacity; it creates a large data movement event. Caches go cold. Databases rebalance huge ranges. Storage systems saturate disks and networks. Tail latency rises exactly when the team is trying to recover or scale.&lt;/p&gt;
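&lt;p&gt;The scale of that reshuffle is easy to check with a small simulation. The sketch below assumes synthetic keys of the form &lt;code&gt;user:N&lt;/code&gt; and uses SHA-256 only as a convenient stable hash; both are illustrative choices.&lt;/p&gt;

```python
import hashlib


def stable_hash(key: str) -> int:
    # Use a stable digest; Python's built-in hash() is salted per process for strings.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def moved_fraction(keys: list[str], old_n: int, new_n: int) -> float:
    # Count keys whose modulo owner changes when the node count changes.
    moved = sum(1 for k in keys if stable_hash(k) % old_n != stable_hash(k) % new_n)
    return moved / len(keys)


keys = [f"user:{i}" for i in range(100_000)]
fraction = moved_fraction(keys, 10, 11)
print(f"{fraction:.1%} of keys changed owner")  # expected to land near 10/11
```

&lt;p&gt;A key keeps its owner only if its hash has the same remainder modulo 10 and modulo 11, which holds for about one hash value in eleven.&lt;/p&gt;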
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The operational failure is not that hashing distributes keys. It does. The failure is that the placement function is tightly coupled to cluster size.&lt;/p&gt;
&lt;p&gt;A small membership change should cause small data movement. Adding one node should move roughly that node’s fair share of keys. Removing one node should move the keys owned by that node, not reshuffle the world. Operators need a placement scheme where the blast radius of change is proportional to the change itself.&lt;/p&gt;
&lt;p&gt;That requirement matters because real systems change under pressure. A node fails while traffic is high. A cache tier scales out during a launch. A database cluster adds capacity after a customer import. A storage system replaces hardware during maintenance. In each case, the routing algorithm becomes part of the incident response path.&lt;/p&gt;
&lt;p&gt;The core question is: how do you distribute keys across a changing set of nodes without turning every membership change into a full-cluster migration?&lt;/p&gt;
&lt;h2 id=&quot;the-answer-is-bounded-reassignment&quot;&gt;The Answer Is Bounded Reassignment&lt;/h2&gt;
&lt;p&gt;Consistent hashing solves the reassignment problem by separating key placement from the raw count of nodes.&lt;/p&gt;
&lt;p&gt;Instead of mapping a key to &lt;code&gt;hash(key) mod N&lt;/code&gt;, both keys and nodes are hashed into the same token space. You can picture that token space as a ring. A key belongs to the first node encountered clockwise from the key’s token. When a node joins, it takes responsibility for nearby token ranges. When a node leaves, its ranges move to neighboring owners.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;A[request key] --&gt; B[hash key to token]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;B --&gt; C[token ring]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;C --&gt; D[first owning node clockwise]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;D --&gt; E[replica set by preference list]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;F[membership change] --&gt; G[move affected token ranges]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;G --&gt; H[rebalance data]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The important property is not the ring shape. The important property is bounded reassignment. A membership change only affects adjacent ownership ranges in the token space.&lt;/p&gt;
&lt;p&gt;In practice, production systems rarely use one token per physical node. That can produce uneven load because the random placement of nodes on the ring may leave some nodes with larger ranges than others. Systems usually use virtual nodes or many tokens per physical node. A physical node owns multiple smaller ranges, which smooths distribution and makes rebalancing more granular.&lt;/p&gt;
&lt;p&gt;This is where consistent hashing earns its keep:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It limits key movement during membership change.&lt;/li&gt;
&lt;li&gt;It lets clients or routers compute placement deterministically.&lt;/li&gt;
&lt;li&gt;It supports incremental rebalancing instead of global reshuffling.&lt;/li&gt;
&lt;li&gt;It gives operators a vocabulary for ownership ranges, replicas, and repair.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But it does not make the rest of the system correct. It only answers one question: given this membership view and this key, which node or replica set should own it?&lt;/p&gt;
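&lt;p&gt;The ring and the bounded-reassignment property can be sketched in a few dozen lines. The class below is a toy, not any particular system's implementation: the &lt;code&gt;HashRing&lt;/code&gt; name, the 128 virtual nodes per host, and the synthetic keys are all assumptions made for illustration.&lt;/p&gt;

```python
import bisect
import hashlib


def token(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 128) -> None:
        self.vnodes = vnodes
        self._tokens: list[int] = []      # sorted ring positions
        self._owner: dict[int, str] = {}  # ring position -> physical node
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node claims many positions (virtual nodes) on the ring.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def remove(self, node: str) -> None:
        self._tokens = [t for t in self._tokens if self._owner[t] != node]
        self._owner = {t: n for t, n in self._owner.items() if n != node}

    def owner(self, key: str) -> str:
        # First token clockwise from the key's token, wrapping at the end of the ring.
        i = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]


keys = [f"user:{i}" for i in range(20_000)]
ring = HashRing([f"node-{i}" for i in range(10)])
before = {k: ring.owner(k) for k in keys}
ring.add("node-10")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.1%} of keys moved")
```

&lt;p&gt;Every key that moves in this run moves to the new node, and the moved share should sit near one eleventh rather than near ten elevenths. Removing the node again restores the earlier ownership, since the remaining tokens are unchanged.&lt;/p&gt;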
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;h3 id=&quot;context&quot;&gt;Context&lt;/h3&gt;
&lt;p&gt;The documented pattern appears in the Amazon Dynamo paper, which describes using consistent hashing to distribute load across storage hosts and reduce disruption when nodes join or leave. Dynamo also uses virtual nodes so each physical host can own multiple points in the token space, improving distribution and recovery behavior.&lt;/p&gt;
&lt;p&gt;Apache Cassandra inherited a related token-ring model. Cassandra’s architecture assigns data to nodes by partitioner tokens and replicates data according to a configured replication strategy. Its public documentation describes token ownership, vnode configuration, and operational procedures such as repair and bootstrap. The important lesson is that consistent hashing is part of a larger data placement system, not the whole database architecture.&lt;/p&gt;
&lt;p&gt;Distributed cache clients have used the same pattern for years. Memcached client libraries commonly support consistent hashing so adding or removing cache servers does not invalidate nearly the entire cache keyspace. The result is not zero cache churn; it is bounded cache churn.&lt;/p&gt;
&lt;h3 id=&quot;action&quot;&gt;Action&lt;/h3&gt;
&lt;p&gt;The architectural action is to replace cluster-size-dependent placement with token-range ownership.&lt;/p&gt;
&lt;p&gt;A system adopting the pattern typically does four things.&lt;/p&gt;
&lt;p&gt;First, it defines a stable hash space for keys. The hash must be deterministic and well distributed, because placement quality depends on it.&lt;/p&gt;
&lt;p&gt;Second, it assigns nodes to many positions in that space. Those positions may be random tokens, calculated tokens, or operator-controlled ranges.&lt;/p&gt;
&lt;p&gt;Third, it routes each key to an owner and, in replicated systems, to a replica set. This requires a membership view. If clients disagree about membership, they may route the same key to different owners.&lt;/p&gt;
&lt;p&gt;Fourth, it builds operational workflows around movement. Bootstrap, decommission, repair, anti-entropy, hinted handoff, cache warming, and backpressure become the mechanisms that make the placement scheme survivable.&lt;/p&gt;
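&lt;p&gt;The third step, routing a key to a replica set, is often done by walking the ring clockwise until enough distinct physical nodes have been collected. The sketch below shows that walk under simplifying assumptions: the function names are hypothetical, membership is a static list, and rack or zone awareness is ignored.&lt;/p&gt;

```python
import bisect
import hashlib


def token(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(nodes: list[str], vnodes: int = 64) -> list[tuple[int, str]]:
    # Sorted (token, physical node) pairs; each node appears at many positions.
    return sorted((token(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))


def preference_list(ring: list[tuple[int, str]], key: str, replicas: int) -> list[str]:
    # Walk clockwise from the key's token and keep the first `replicas`
    # distinct physical nodes, skipping extra virtual nodes of the same host.
    start = bisect.bisect_right([t for t, _ in ring], token(key))
    chosen: list[str] = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
            if len(chosen) == replicas:
                break
    return chosen
```

&lt;p&gt;Two callers that build the ring from different node lists can get different preference lists for the same key, which is the membership-disagreement risk the third step warns about.&lt;/p&gt;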
&lt;h3 id=&quot;result&quot;&gt;Result&lt;/h3&gt;
&lt;p&gt;The result is controlled disruption. Adding a node moves only some ranges. Removing a node transfers ownership rather than forcing a complete rehash. Cache hit rates degrade locally instead of collapsing globally. Storage systems can stream bounded ranges instead of rewriting the entire cluster.&lt;/p&gt;
&lt;p&gt;But the result is not perfect balance. Hot keys can still overload one partition. Large tenants can still dominate a range. Replication can still be misconfigured. A bad membership view can still route traffic incorrectly. A slow rebalance can still compete with foreground reads and writes.&lt;/p&gt;
&lt;p&gt;Consistent hashing reduces one class of operational failure. It does not remove the need for admission control, observability, repair, load shedding, or capacity planning.&lt;/p&gt;
&lt;h3 id=&quot;learning&quot;&gt;Learning&lt;/h3&gt;
&lt;p&gt;The documented pattern is that consistent hashing is most useful when membership changes are common and object movement is expensive.&lt;/p&gt;
&lt;p&gt;It is less valuable when the data set is small, the cluster rarely changes, or a central coordinator already owns placement decisions. It can also be the wrong abstraction when placement must account for hardware tiers, tenant isolation, compliance boundaries, or workload shape. In those cases, range assignment or directory-based placement may be easier to reason about.&lt;/p&gt;
&lt;p&gt;The staff-engineering lesson is to treat consistent hashing as a primitive. It is a good primitive, but it is still a primitive.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;













































&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why consistent hashing does not solve it&lt;/th&gt;&lt;th&gt;What the architecture still needs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Hot keys&lt;/td&gt;&lt;td&gt;A popular key maps to one owner or replica set&lt;/td&gt;&lt;td&gt;Request coalescing, caching, sharding inside the value, or workload-specific routing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Uneven node capacity&lt;/td&gt;&lt;td&gt;The ring assumes comparable nodes unless weighted&lt;/td&gt;&lt;td&gt;Weighted tokens, capacity-aware placement, or separate pools&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Membership disagreement&lt;/td&gt;&lt;td&gt;Different clients may compute different owners&lt;/td&gt;&lt;td&gt;Gossip convergence, strongly managed membership, or routing through coordinators&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rebalance overload&lt;/td&gt;&lt;td&gt;Moving less data can still saturate disks and networks&lt;/td&gt;&lt;td&gt;Throttling, scheduling, progress tracking, and rollback plans&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Replica inconsistency&lt;/td&gt;&lt;td&gt;Placement does not guarantee write agreement&lt;/td&gt;&lt;td&gt;Quorums, read repair, anti-entropy, and conflict handling&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tenant isolation&lt;/td&gt;&lt;td&gt;Hashing spreads keys without understanding business boundaries&lt;/td&gt;&lt;td&gt;Placement constraints, quotas, and tenant-aware partitioning&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Disaster recovery&lt;/td&gt;&lt;td&gt;A ring does not define regional failure behavior&lt;/td&gt;&lt;td&gt;Replication topology, failover policy, and recovery objectives&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; If node changes cause widespread cache misses or data movement, inspect whether placement depends directly on the number of nodes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use consistent hashing or token-range ownership to bound reassignment during membership change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Validate with a simulation before production: add one node, remove one node, measure key movement, range size distribution, and hot partition behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Design the operational layer around the hash ring: membership, throttled rebalancing, repair, observability, and explicit failure drills.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>system-design</category><category>cloud</category></item><item><title>WAL Explained for Database Engineers</title><link>https://rajivonai.com/blog/2022-03-15-wal-explained-for-database-engineers/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-03-15-wal-explained-for-database-engineers/</guid><description>What write-ahead logging is, why every ACID database uses it, and what engineers need to know about LSN ordering, crash recovery, and replication lag.</description><pubDate>Tue, 15 Mar 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Most database failures are not storage failures — they are sequence failures. The write-ahead log is the mechanism that enforces the right sequence, survives crashes, and underpins every form of replication.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Every write to a PostgreSQL, MySQL (InnoDB), or Oracle database passes through a write-ahead log before touching any data file. In PostgreSQL it is called the WAL. In Oracle and in MySQL’s InnoDB engine it is called the redo log. These are not backups. They are an ordered, append-only record of every change the database intends to make, written before the change is applied to data pages.&lt;/p&gt;
&lt;p&gt;The WAL exists because durable writes and fast writes are in tension. Flushing a modified data page to disk on every commit is slow because pages are scattered across disk. Flushing a sequential log record is fast. The WAL lets the database acknowledge a commit once the log record is flushed, then write data pages asynchronously.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Engineers who manage production databases often treat the WAL as a background detail — something that creates disk pressure and replication lag but is otherwise invisible. That assumption fails at the worst time: during crash recovery, when a replica falls behind, or when a restore from backup fails because the WAL sequence is incomplete.&lt;/p&gt;
&lt;p&gt;Why does the WAL exist at the level of protocol, not just implementation — and what does a database engineer actually need to understand to reason about durability and replication?&lt;/p&gt;
&lt;h2 id=&quot;the-durability-contract&quot;&gt;The Durability Contract&lt;/h2&gt;
&lt;p&gt;The WAL is a promise: if the log record is flushed to disk, the change survives any subsequent crash. The database can lose the in-memory copy and the unflushed data page. The log record is enough to reconstruct both.&lt;/p&gt;
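&lt;p&gt;The contract can be shown with a toy key-value store that appends and fsyncs a log record before touching its in-memory state. This is a deliberately small sketch: the &lt;code&gt;TinyWal&lt;/code&gt; class and its JSON record format are invented for illustration, and it omits checksums, torn-write handling, and checkpoints that a real engine needs.&lt;/p&gt;

```python
import json
import os


class TinyWal:
    """Toy key-value store: append and fsync a log record before applying it."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.data: dict[str, str] = {}
        self._recover()
        self._log = open(path, "a", encoding="utf-8")

    def _recover(self) -> None:
        # Replay records in log order to rebuild state after a restart.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key: str, value: str) -> None:
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # the write is durable once this returns
        self.data[key] = value        # apply to in-memory state afterwards

    def close(self) -> None:
        self._log.close()
```

&lt;p&gt;Discarding a &lt;code&gt;TinyWal&lt;/code&gt; object and constructing a new one on the same path stands in for a crash and restart: the in-memory dictionary is lost, and replaying the log rebuilds it.&lt;/p&gt;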
&lt;p&gt;Each record in the WAL has a position. PostgreSQL calls it the LSN (log sequence number); Oracle’s closest counterpart is the SCN (system change number). Every logged change is ordered by this position. Crash recovery replays WAL records in LSN order to bring data files forward from the last checkpoint to the point of failure.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- PostgreSQL: current WAL write position&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_current_wal_lsn();&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Gap between what has been written and what has been flushed&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_wal_lsn_diff(pg_current_wal_lsn(), pg_current_wal_flush_lsn()) &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; unflushed_bytes;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Replication lag for each standby (on the primary)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; application_name, write_lag, flush_lag, replay_lag&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_replication;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replication works because the WAL is a complete, ordered record of every change. Physical streaming replication ships WAL records from primary to standby, where they are replayed in LSN order. Logical replication decodes those records into SQL operations for cross-version or filtered replication.&lt;/p&gt;
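&lt;p&gt;Because an LSN is a byte position, lag between a primary and a standby can be computed by subtraction, which is what &lt;code&gt;pg_wal_lsn_diff&lt;/code&gt; does on the server. The helper below mirrors that arithmetic on the client side; the function names and the &lt;code&gt;budget_bytes&lt;/code&gt; threshold are hypothetical.&lt;/p&gt;

```python
def parse_lsn(lsn: str) -> int:
    # PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash:
    # the high 32 bits and the low 32 bits of a 64-bit byte position.
    high, low = lsn.split("/")
    return int(high, 16) * 2**32 + int(low, 16)


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    # Positive when the standby is behind the primary.
    return parse_lsn(primary_lsn) - parse_lsn(standby_lsn)


def over_budget(primary_lsn: str, standby_lsn: str, budget_bytes: int) -> bool:
    # budget_bytes is a hypothetical alert threshold chosen by the operator.
    return lag_bytes(primary_lsn, standby_lsn) > budget_bytes
```

&lt;p&gt;A monitoring job could feed this with the LSN columns from &lt;code&gt;pg_stat_replication&lt;/code&gt;, although computing the difference in SQL is usually simpler when the server is reachable.&lt;/p&gt;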
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s documented behavior confirms that the WAL flush, not the data page flush, is what makes a commit durable. The &lt;code&gt;synchronous_commit&lt;/code&gt; parameter controls this tradeoff explicitly: at &lt;code&gt;on&lt;/code&gt;, a commit waits for the local WAL flush and, when synchronous standbys are configured, for a standby to confirm that it has flushed the record as well; at &lt;code&gt;local&lt;/code&gt;, it waits only for the local flush; at &lt;code&gt;off&lt;/code&gt;, it returns before the flush, accepting that a crash can lose a small window of recently acknowledged commits. Amazon Aurora’s architecture, as AWS describes it, sidesteps data page shipping: the primary sends only log records to the shared distributed storage layer, which handles durability across six copies without requiring physical standbys to apply full pages.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;

























&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure&lt;/th&gt;&lt;th&gt;Cause&lt;/th&gt;&lt;th&gt;Fix&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Replication lag grows&lt;/td&gt;&lt;td&gt;WAL produced faster than standby replays&lt;/td&gt;&lt;td&gt;Tune standby I/O; investigate long-running transactions on primary&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Disk full on primary&lt;/td&gt;&lt;td&gt;Inactive replication slot retaining WAL&lt;/td&gt;&lt;td&gt;Drop or advance the stale slot: &lt;code&gt;SELECT pg_drop_replication_slot(&apos;name&apos;)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Crash recovery takes hours&lt;/td&gt;&lt;td&gt;Checkpoint interval too long&lt;/td&gt;&lt;td&gt;Lower &lt;code&gt;checkpoint_timeout&lt;/code&gt;; verify &lt;code&gt;checkpoint_completion_target&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: WAL accumulation and replication lag are the same upstream pressure: writes that the WAL pipeline cannot drain fast enough.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Monitor LSN delta between primary and each standby; alert when the gap exceeds your RPO budget in bytes or time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: After adding WAL lag monitoring, lag spikes will correlate with bulk loads, ETL jobs, and autovacuum catch-up cycles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run &lt;code&gt;SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained FROM pg_replication_slots;&lt;/code&gt; today and confirm no inactive slot is silently accumulating WAL on your primary.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>databases</category><category>fundamentals</category></item><item><title>Idempotency Keys: The Small Table That Saves Distributed Systems</title><link>https://rajivonai.com/blog/2022-03-12-idempotency-keys-the-small-table-that-saves-distributed-systems/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-03-12-idempotency-keys-the-small-table-that-saves-distributed-systems/</guid><description>The most reliable distributed systems depend on an unimpressive table with a unique constraint and a saved response — how idempotency keys prevent double charges, duplicate events, and retry amplification at the database layer.</description><pubDate>Sat, 12 Mar 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The most reliable distributed systems often depend on an unimpressive table with a unique constraint, a request hash, and a saved response.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Distributed systems no longer fail as single, clean transactions. A client submits a payment, the API times out, the load balancer retries, the worker restarts, the message broker redelivers, and the user refreshes the page. Each component is doing something reasonable. Together, they can charge twice, create duplicate orders, send duplicate emails, or enqueue the same downstream workflow more than once.&lt;/p&gt;
&lt;p&gt;Retries are now part of the contract. Cloud SDKs retry transient failures. Queue consumers retry failed messages. Frontends retry after ambiguous network errors. Operators replay jobs after incidents. The system has to assume that a request may arrive again even after the original request succeeded.&lt;/p&gt;
&lt;p&gt;This is why idempotency is not a payment feature. It is a control plane pattern for uncertainty.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The dangerous failure is not a clean error. The dangerous failure is an unknown result.&lt;/p&gt;
&lt;p&gt;A client sends &lt;code&gt;POST /charges&lt;/code&gt;. The service submits the charge to the payment processor. Before the response reaches the client, the connection drops. From the client’s point of view, the outcome is unknown. From the service’s point of view, the side effect may already be committed.&lt;/p&gt;
&lt;p&gt;If the client retries a normal &lt;code&gt;POST&lt;/code&gt;, the service cannot tell whether this is a new business action or the same action arriving again. Timestamps do not solve it. Request bodies do not solve it by themselves. “Check whether a similar row exists” usually becomes a race condition under concurrency.&lt;/p&gt;
&lt;p&gt;The core question is: how can a service make retries safe when it cannot know whether the previous attempt succeeded?&lt;/p&gt;
&lt;h2 id=&quot;the-idempotency-ledger&quot;&gt;The Idempotency Ledger&lt;/h2&gt;
&lt;p&gt;The answer is to turn each client intent into a named operation.&lt;/p&gt;
&lt;p&gt;An idempotency key is a caller-provided identifier for one logical command. The server records that key before or during execution, associates it with a canonical request hash, and returns the same final result for repeated attempts with the same key.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[client sends command — idempotency key] --&gt; B[api validates request — canonical hash]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C[idempotency table — unique key]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|new key| D[execute side effect — payment order message]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  D --&gt; E[store final response — status and body]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; F[return cached response — same key]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|seen key, same hash| F&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|seen key, different hash| G[reject mismatch — same key different request]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this diagram shows:&lt;/strong&gt; The client sends a command with an idempotency key. The API hashes it and checks the idempotency table. A new key executes the side effect and caches the response. A duplicate key returns the cached response without re-executing. A mismatched key — same idempotency key, different request body — is rejected, preventing the subtle class of double-execution bugs that occur when clients change payloads on retry.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The table is small, but the contract is strong:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;idempotency_key&lt;/code&gt;: unique per caller scope.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;request_hash&lt;/code&gt;: canonical representation of the intended command.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status&lt;/code&gt;: &lt;code&gt;processing&lt;/code&gt;, &lt;code&gt;succeeded&lt;/code&gt;, or &lt;code&gt;failed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;response_code&lt;/code&gt; and &lt;code&gt;response_body&lt;/code&gt;: what the caller should receive on replay.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;resource_id&lt;/code&gt;: optional pointer to the created domain object.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;expires_at&lt;/code&gt;: retention boundary for operational cleanup.&lt;/li&gt;
&lt;/ul&gt;
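&lt;p&gt;One way to build the &lt;code&gt;request_hash&lt;/code&gt; column is to serialize the command in a canonical form and hash the result. The sketch below assumes JSON-compatible request bodies and a hypothetical &lt;code&gt;request_hash&lt;/code&gt; helper; which fields belong in the hash is a design decision for each API.&lt;/p&gt;

```python
import hashlib
import json


def request_hash(method: str, path: str, body: dict) -> str:
    # Sort keys and fix separators so equivalent payloads serialize identically.
    canonical = json.dumps(
        {"method": method.upper(), "path": path, "body": body},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

&lt;p&gt;Key order in the payload does not affect the hash, while a changed amount does. Values with several valid encodings, such as floating-point numbers, would need their own normalization rule before hashing.&lt;/p&gt;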
&lt;p&gt;The important detail is that idempotency is not deduplication after the fact. It is a write path protocol. The service must reserve the key with an atomic operation, usually an insert guarded by a unique constraint, before it executes the side effect, so that a duplicate request can never run the command a second time.&lt;/p&gt;
&lt;p&gt;A typical flow looks like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Validate the request enough to build a stable hash.&lt;/li&gt;
&lt;li&gt;Insert the key into the idempotency table.&lt;/li&gt;
&lt;li&gt;If insert succeeds, execute the command.&lt;/li&gt;
&lt;li&gt;Persist the final response against the key.&lt;/li&gt;
&lt;li&gt;If insert conflicts, compare the stored hash.&lt;/li&gt;
&lt;li&gt;If the hash matches, return the stored result or wait for the in-flight operation.&lt;/li&gt;
&lt;li&gt;If the hash differs, reject the request as a key reuse error.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This lets the client retry until it receives a response. The system stops treating retry as a suspicious event and starts treating it as normal recovery behavior.&lt;/p&gt;
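&lt;p&gt;The seven steps can be sketched against an in-memory SQLite table, where the primary key plays the role of the unique constraint. The &lt;code&gt;Ledger&lt;/code&gt; class, the &lt;code&gt;HashMismatch&lt;/code&gt; exception, and the 409 response for an in-flight key are illustrative assumptions, and the sketch does not handle a command that crashes while the row is still marked &lt;code&gt;processing&lt;/code&gt;.&lt;/p&gt;

```python
import json
import sqlite3


class HashMismatch(Exception):
    """Same idempotency key, different request."""


class Ledger:
    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE idempotency ("
            " tenant_id TEXT NOT NULL, idempotency_key TEXT NOT NULL,"
            " request_hash TEXT NOT NULL, status TEXT NOT NULL,"
            " response_code INTEGER, response_body TEXT,"
            " PRIMARY KEY (tenant_id, idempotency_key))"
        )

    def run(self, tenant: str, key: str, req_hash: str, command):
        try:
            # Step 2: reserve the key; the primary key makes this atomic.
            with self.db:
                self.db.execute(
                    "INSERT INTO idempotency"
                    " (tenant_id, idempotency_key, request_hash, status)"
                    " VALUES (?, ?, ?, 'processing')",
                    (tenant, key, req_hash),
                )
        except sqlite3.IntegrityError:
            # Steps 5 to 7: the key was seen before, so compare hashes.
            stored_hash, status, code, body = self.db.execute(
                "SELECT request_hash, status, response_code, response_body"
                " FROM idempotency WHERE tenant_id = ? AND idempotency_key = ?",
                (tenant, key),
            ).fetchone()
            if stored_hash != req_hash:
                raise HashMismatch(key) from None
            if status == "processing":
                return 409, {"error": "request in progress"}
            return code, json.loads(body)
        # Step 3: only the request that won the insert executes the command.
        code, body = command()
        # Step 4: persist the final response so replays return the same result.
        with self.db:
            self.db.execute(
                "UPDATE idempotency SET status = ?, response_code = ?, response_body = ?"
                " WHERE tenant_id = ? AND idempotency_key = ?",
                ("failed" if code >= 400 else "succeeded", code, json.dumps(body), tenant, key),
            )
        return code, body
```

&lt;p&gt;Calling &lt;code&gt;run&lt;/code&gt; twice with the same tenant, key, and hash executes &lt;code&gt;command&lt;/code&gt; once and returns the stored response the second time. Scoping the primary key by tenant lets two callers use the same key string without colliding.&lt;/p&gt;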
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Stripe documents idempotency keys for &lt;code&gt;POST&lt;/code&gt; requests and stores the status code and body of the first request made with a key, whether it succeeded or failed. Their public guidance says subsequent requests with the same key return the same result, that keys should be unique, and that keys can be pruned once they are at least 24 hours old.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The architectural pattern is to bind the key to the parameters of the original request. Stripe’s documentation says the idempotency layer compares incoming parameters with the original request and errors if they differ. That prevents a client from accidentally reusing &lt;code&gt;order-123&lt;/code&gt; for a different charge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The retry contract becomes simple. If the original request succeeded but the response was lost, a retry receives the original success. If the original request ran and its failure response was stored, the retry receives the same failure. The client no longer has to guess whether it should issue a second business command.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; The key is not just a cache key. It is evidence of caller intent. A good implementation protects both sides: the client can retry safely, and the server can reject ambiguous reuse.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; AWS APIs commonly expose client tokens for idempotent requests. The Amazon EC2 API documentation describes client tokens as a way to make mutating calls idempotent, so retries do not create duplicate resources when the original result is unknown.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The caller supplies a token when creating resources such as instances. The service uses that token to identify retries of the same operation within the idempotency scope defined by the API.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Resource creation becomes safer under network failures, SDK retries, and operator replays. The caller can repeat the same command with the same token instead of building custom duplicate detection around resource names, tags, or timing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Idempotency belongs at the API boundary because only the caller can reliably name the logical command. The server can enforce uniqueness, but the caller supplies intent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; PostgreSQL unique constraints and &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt; provide the database behavior needed for an idempotency ledger. The documented behavior is that a unique index prevents two committed rows from holding the same key.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Use a unique constraint on &lt;code&gt;(tenant_id, idempotency_key)&lt;/code&gt; and reserve the key in the same transaction that records the command’s request hash and processing status.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Concurrent duplicate requests collapse into one winner and one conflict path. Without the unique constraint, two workers can both observe “no existing request” and execute the side effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Idempotency is only as strong as the atomicity of the reservation. A table without a uniqueness guarantee is an audit log, not a concurrency control mechanism.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;Why it happens&lt;/th&gt;&lt;th&gt;Mitigation&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Key reused for a different command&lt;/td&gt;&lt;td&gt;Client generates predictable or coarse keys&lt;/td&gt;&lt;td&gt;Store a canonical request hash and reject mismatches&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Duplicate side effect before key reservation&lt;/td&gt;&lt;td&gt;Service performs work before the atomic insert&lt;/td&gt;&lt;td&gt;Reserve the key before side effects&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;In-flight retry sees &lt;code&gt;processing&lt;/code&gt; forever&lt;/td&gt;&lt;td&gt;Worker crashes after reserving the key&lt;/td&gt;&lt;td&gt;Add leases, heartbeats, timeout recovery, or reconciliation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Response body changes across deployments&lt;/td&gt;&lt;td&gt;Replay recomputes the response from current code&lt;/td&gt;&lt;td&gt;Persist the original response or stable resource reference&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Retention window too short&lt;/td&gt;&lt;td&gt;Client retries after cleanup&lt;/td&gt;&lt;td&gt;Align expiration with retry policies, queue retention, and dispute windows&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Downstream system is not idempotent&lt;/td&gt;&lt;td&gt;Your boundary is safe but the next one is not&lt;/td&gt;&lt;td&gt;Pass idempotency keys downstream or create a local outbox&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Global key namespace collision&lt;/td&gt;&lt;td&gt;Multiple tenants or clients use the same key&lt;/td&gt;&lt;td&gt;Scope uniqueness by tenant, account, or caller&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Treating all failures as final&lt;/td&gt;&lt;td&gt;Transient infrastructure failure gets cached as a permanent response&lt;/td&gt;&lt;td&gt;Decide which failures are stored and which keep the operation retryable&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The hardest case is the gap between reserving the key and committing the external side effect. If the service calls a payment provider and crashes before recording the response, the ledger may say &lt;code&gt;processing&lt;/code&gt; while the payment may exist. That is not solved by idempotency alone. It needs reconciliation: query the downstream provider by its own idempotency key, repair the local state, and then complete the original response.&lt;/p&gt;
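&lt;p&gt;A reconciliation pass for that gap might look like the sketch below. It assumes each ledger row carries a lease expiry set at reservation time, and that &lt;code&gt;lookup_downstream&lt;/code&gt; is a hypothetical callable that asks the provider what happened under a given idempotency key. The row shape and the &lt;code&gt;retryable&lt;/code&gt; status are illustrative names, not part of any specific system.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LedgerRow:
    key: str
    status: str                      # 'processing', 'completed', or 'retryable'
    lease_expires_at: float          # epoch seconds, set when the key is reserved
    response: Optional[dict] = None


def reconcile(rows, lookup_downstream, now):
    """Repair rows whose worker died between reservation and completion.

    lookup_downstream(key) is a hypothetical callable. It returns the
    provider's response dict, or None if the provider never saw the key.
    """
    repaired = []
    for row in rows:
        if row.status != "processing":
            continue
        if row.lease_expires_at > now:
            continue                  # lease still valid; the worker may be alive
        outcome = lookup_downstream(row.key)
        if outcome is None:
            row.status = "retryable"  # side effect never happened; safe to re-run
        else:
            row.status = "completed"  # side effect exists; adopt the provider's result
            row.response = outcome
        repaired.append(row.key)
    return repaired
```

&lt;p&gt;The lease is what separates a slow worker from a dead one. Without it, the reconciler cannot tell whether it is repairing an abandoned operation or racing a live one.&lt;/p&gt;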
&lt;p&gt;For message-driven systems, pair the idempotency table with an outbox. The command handler writes the business change and the outgoing message to durable tables in one transaction, and a separate relay publishes from the outbox table. Consumers also need idempotency at their boundary, because brokers usually promise at-least-once delivery, not exactly-once business effects.&lt;/p&gt;
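&lt;p&gt;A compact sketch of the outbox pairing follows. It assumes SQLite and, for brevity, puts producer and consumer tables in one database; in a real system they would live in separate services. All table and function names are hypothetical.&lt;/p&gt;

```python
import json
import sqlite3
import uuid


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER NOT NULL);
        CREATE TABLE outbox (
            message_id TEXT PRIMARY KEY,
            payload    TEXT NOT NULL,
            published  INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
        """
    )
    return db


def place_order(db, order_id, total):
    """Write the business row and the outgoing message in one transaction."""
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (message_id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()), json.dumps({"order_id": order_id, "total": total})),
        )


def relay(db, publish):
    """Publish unpublished rows. A crash after publish() but before the UPDATE
    commits causes a re-send, which is why delivery is at-least-once."""
    rows = db.execute(
        "SELECT message_id, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for message_id, payload in rows:
        publish(message_id, json.loads(payload))
        with db:
            db.execute(
                "UPDATE outbox SET published = 1 WHERE message_id = ?", (message_id,)
            )


def consume(db, message_id, payload, apply):
    """Consumer-side idempotency: apply each message_id at most once."""
    try:
        with db:
            db.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
            apply(payload)
    except sqlite3.IntegrityError:
        pass  # duplicate delivery; the message was already applied
```

&lt;p&gt;The consumer records the message ID and applies the effect in the same transaction, so a redelivered message hits the primary key and is skipped.&lt;/p&gt;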
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Retries turn ambiguous outcomes into duplicate side effects when a service cannot distinguish a new command from a repeated one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Require idempotency keys on mutating API calls, reserve them with a unique constraint, bind them to a request hash, and replay the stored result.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Stripe’s idempotency-key contract, AWS client-token APIs, and PostgreSQL uniqueness behavior all support the same pattern: name the intent, reserve it atomically, and make retries converge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Add an idempotency ledger to the write paths where duplicate execution is expensive, externally visible, or difficult to reverse. Start with payments, orders, provisioning, notifications, and workflow launches.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>system-design</category><category>cloud</category></item><item><title>MVCC Explained Like a Database Engineer</title><link>https://rajivonai.com/blog/2022-02-14-mvcc-explained-like-a-database-engineer/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-02-14-mvcc-explained-like-a-database-engineer/</guid><description>How multi-version concurrency control lets readers and writers run without blocking each other — and why misunderstanding it causes table bloat, undo log growth, and stalled vacuums.</description><pubDate>Mon, 14 Feb 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Most engineers know that MVCC means “readers don’t block writers.” What they miss is the operational consequence: those non-blocking reads are paid for with storage, and if you stop collecting the debt, the database starts degrading in ways that look nothing like a concurrency problem.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;MVCC (Multi-Version Concurrency Control) is the concurrency model used by PostgreSQL, MySQL InnoDB, Oracle, CockroachDB, and most other production-grade relational databases. Inside a transaction, the database does not show you the current physical state of the rows; it shows a consistent snapshot. Depending on the isolation level, that snapshot is taken when the transaction performs its first read or when each statement begins.&lt;/p&gt;
&lt;p&gt;Engineers rely on this without thinking about it. The property they care about — “I can run a long analytical query on a busy OLTP table without blocking inserts” — comes directly from MVCC. But few have thought through what has to be true at the storage level for that property to hold.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The concrete failure mode is table bloat in PostgreSQL after a heavy &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; workload. Engineers see a table that is 40 GB on disk with only 8 GB of live data and conclude something is wrong with storage. The actual cause is MVCC: every &lt;code&gt;UPDATE&lt;/code&gt; leaves the old version in place; every &lt;code&gt;DELETE&lt;/code&gt; marks the row dead without removing it. Old versions accumulate until &lt;code&gt;VACUUM&lt;/code&gt; reclaims them.&lt;/p&gt;
&lt;p&gt;The less visible failure is more dangerous: anything that pins an old snapshot, such as a reporting query left open or a replication slot that fell behind, prevents &lt;code&gt;VACUUM&lt;/code&gt; from removing dead tuples that snapshot might still need. If cleanup is held back long enough, PostgreSQL can eventually approach transaction ID wraparound, an emergency in which the server stops accepting new write transactions until old rows are frozen.&lt;/p&gt;
&lt;p&gt;Where is the cost of “free” snapshot isolation actually hidden?&lt;/p&gt;
&lt;h2 id=&quot;how-mvcc-works&quot;&gt;How MVCC Works&lt;/h2&gt;
&lt;p&gt;When a transaction writes a row, the database does not overwrite the existing bytes. It writes a new version stamped with the writer’s transaction ID, leaving the old version in place. Concurrent readers see the version that was current when their snapshot was taken. The result is snapshot isolation without read locks, but the major engines store those versions very differently, and the difference shapes every operational concern that follows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; stores all versions, live and dead, directly in the heap files alongside current rows. &lt;code&gt;UPDATE&lt;/code&gt; leaves the old version in the page; &lt;code&gt;DELETE&lt;/code&gt; flags it dead but does not remove it. &lt;code&gt;VACUUM&lt;/code&gt; (run manually or by the autovacuum daemon) scans the heap and marks the space held by dead tuples as reusable. It cannot remove any row version that might still be visible to an open transaction.&lt;/p&gt;
&lt;p&gt;You can inspect the version metadata directly. &lt;code&gt;xmin&lt;/code&gt; is the transaction ID that created the row version; &lt;code&gt;xmax&lt;/code&gt; is the transaction ID that deleted or updated it (0 if no transaction has done so, though a non-zero value can also appear on a visible row when the deleting transaction rolled back or has not yet committed). &lt;code&gt;ctid&lt;/code&gt; is the physical location in the heap file:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D&quot;&gt;-- Inspect row versions in PostgreSQL&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xmin, xmax, ctid, id&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; your_table&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;LIMIT&lt;/span&gt;&lt;span style=&quot;color:#79B8FF&quot;&gt; 10&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
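&lt;p&gt;A plain &lt;code&gt;SELECT&lt;/code&gt; returns only the versions visible to your snapshot, so after a series of updates you will see the same logical row with a newer &lt;code&gt;xmin&lt;/code&gt; and a different &lt;code&gt;ctid&lt;/code&gt;. The superseded versions remain in the heap with a non-zero &lt;code&gt;xmax&lt;/code&gt; (the &lt;code&gt;pageinspect&lt;/code&gt; extension can display them). These are the dead tuples VACUUM is responsible for reclaiming.&lt;/p&gt;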
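&lt;p&gt;To make the role of &lt;code&gt;xmin&lt;/code&gt; and &lt;code&gt;xmax&lt;/code&gt; concrete, here is a toy model in Python. It is deliberately simplified and is not PostgreSQL’s actual visibility algorithm: it assumes every transaction commits instantly and in XID order, and the class and method names are invented for the illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class RowVersion:
    value: str
    xmin: int          # transaction that created this version
    xmax: int = 0      # transaction that superseded it; 0 means none


class ToyHeap:
    """Toy model: every transaction commits instantly, in XID order."""

    def __init__(self):
        self.versions = {}   # logical row id maps to a list of RowVersion
        self.next_xid = 1

    def write(self, row_id, value):
        xid = self.next_xid
        self.next_xid += 1
        chain = self.versions.setdefault(row_id, [])
        if chain:
            chain[-1].xmax = xid          # old version stays in place, marked dead
        chain.append(RowVersion(value, xmin=xid))
        return xid

    def snapshot(self):
        """In this toy, a snapshot is just the newest XID committed so far."""
        return self.next_xid - 1

    def read(self, row_id, snap):
        for v in self.versions.get(row_id, []):
            created = snap >= v.xmin
            still_live = v.xmax == 0 or v.xmax > snap
            if created and still_live:
                return v.value
        return None

    def vacuum(self, oldest_open_snapshot):
        """Drop versions no open snapshot can see; return how many were removed."""
        removed = 0
        for row_id, chain in self.versions.items():
            keep = [v for v in chain if v.xmax == 0 or v.xmax > oldest_open_snapshot]
            removed += len(chain) - len(keep)
            self.versions[row_id] = keep
        return removed
```

&lt;p&gt;Take a snapshot, write the same row twice more, and call &lt;code&gt;vacuum&lt;/code&gt; with that old snapshot: nothing is removed, because the old reader can still see the first version. Pass the newest snapshot instead and both superseded versions are reclaimed. That is the long-running-reader problem in miniature.&lt;/p&gt;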
&lt;p&gt;&lt;strong&gt;MySQL InnoDB&lt;/strong&gt; keeps only the current version in the clustered index. Old versions go to the undo log; when a reader needs an older snapshot, InnoDB reconstructs it by applying undo entries in reverse. A background purge thread reclaims undo space once no active transaction needs those versions. The same pressure applies: long-running reads block the purge thread.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Oracle&lt;/strong&gt; uses a dedicated undo tablespace. The &lt;code&gt;undo_retention&lt;/code&gt; parameter sets a target retention window for undo data. Cleanup is simpler, at the cost of a hard failure when a query needs undo that has already been overwritten (&lt;code&gt;ORA-01555: snapshot too old&lt;/code&gt;).&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Database&lt;/th&gt;&lt;th&gt;Where old versions live&lt;/th&gt;&lt;th&gt;Cleanup mechanism&lt;/th&gt;&lt;th&gt;Risk when cleanup stalls&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;PostgreSQL&lt;/td&gt;&lt;td&gt;Heap files (table data)&lt;/td&gt;&lt;td&gt;VACUUM — explicit or autovacuum&lt;/td&gt;&lt;td&gt;Table bloat, transaction ID wraparound&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MySQL InnoDB&lt;/td&gt;&lt;td&gt;Undo log segments&lt;/td&gt;&lt;td&gt;Background purge thread&lt;/td&gt;&lt;td&gt;Undo log growth, purge lag&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Oracle&lt;/td&gt;&lt;td&gt;Undo tablespace&lt;/td&gt;&lt;td&gt;Automatic undo management&lt;/td&gt;&lt;td&gt;ORA-01555 snapshot too old&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;PostgreSQL’s MVCC documentation (chapter 13, “Concurrency Control”) states directly that dead tuples are not reclaimed until &lt;code&gt;VACUUM&lt;/code&gt; runs, and that &lt;code&gt;VACUUM&lt;/code&gt; cannot remove a dead tuple while any open transaction might still need to see it. This is the documented mechanism behind bloat from long-running transactions.&lt;/p&gt;
&lt;p&gt;MySQL’s InnoDB documentation (“InnoDB Multi-Versioning”) states that the purge thread deletes undo log records no longer needed by any consistent read, and that history list length — in &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; — grows when the purge thread falls behind.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;What breaks&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Long-running read in PostgreSQL&lt;/td&gt;&lt;td&gt;Table bloat; VACUUM cannot advance past the open snapshot&lt;/td&gt;&lt;td&gt;PostgreSQL keeps every row version visible to any active transaction&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Long-running read in MySQL InnoDB&lt;/td&gt;&lt;td&gt;Undo log grows; purge thread stalls&lt;/td&gt;&lt;td&gt;Purge thread cannot remove records still needed by open transactions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Transaction ID wraparound in PostgreSQL&lt;/td&gt;&lt;td&gt;Cluster enters emergency read-only mode&lt;/td&gt;&lt;td&gt;32-bit XID wraps after ~2 billion transactions; VACUUM must freeze rows before the counter laps&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Long-running transactions block VACUUM and the InnoDB purge thread, causing table bloat and undo log growth that degrades the database without any concurrency alarm firing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; in PostgreSQL; monitor InnoDB history list length in &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof&lt;/strong&gt;: In PostgreSQL, &lt;code&gt;pg_stat_activity&lt;/code&gt; shows open transactions with &lt;code&gt;state = &apos;idle in transaction&apos;&lt;/code&gt;; in InnoDB, a rising history list length during write traffic is the direct signal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Run this query on your PostgreSQL instances this week to surface any sessions holding open transactions without actively executing:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;SELECT&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pid, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;now&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;() &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;-&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; xact_start &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;AS&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration, &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;state&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;, query&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;FROM&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; pg_stat_activity&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;WHERE&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; state&lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#9ECBFF&quot;&gt; &apos;idle in transaction&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#F97583&quot;&gt;ORDER BY&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt; duration &lt;/span&gt;&lt;span style=&quot;color:#F97583&quot;&gt;DESC&lt;/span&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;MVCC teaches the same lesson as most database internals: reads that look free are paid for somewhere. Knowing where is what lets you diagnose degradation instead of just observing it.&lt;/p&gt;</content:encoded><category>databases</category><category>architecture</category></item><item><title>Caches Do Not Remove Database Load Unless You Design the Miss Path</title><link>https://rajivonai.com/blog/2022-02-10-caches-do-not-remove-database-load-unless-you-design-the-miss-path/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-02-10-caches-do-not-remove-database-load-unless-you-design-the-miss-path/</guid><description>A cache is not a shield around the database — it is a second traffic control system whose failure mode is a synchronized stampede back to the database. How to design the miss path so cache failures don&apos;t become database incidents.</description><pubDate>Thu, 10 Feb 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A cache is not a shield around the database; it is a second traffic control system whose failure mode is often a synchronized stampede back to the database.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most production systems add caching after the database becomes visibly expensive. Read latency climbs, connection pools saturate, replica lag grows, and product teams discover that many requests ask for the same objects repeatedly. The obvious response is to place Redis, Memcached, CDN edge storage, or an application-local cache in front of the hot read path.&lt;/p&gt;
&lt;p&gt;That response is directionally correct. Caches reduce repeated work when the same value is requested many times within a useful freshness window. They also change the shape of the system. The database is no longer serving every read, but it is now serving cache misses, cache refreshes, cold starts, evictions, invalidations, and retry storms.&lt;/p&gt;
&lt;p&gt;The first architecture review usually asks whether the cache hit rate is high enough. The better review asks what happens when the hit rate suddenly drops.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A cache hit is the easy path. The hard path begins when the value is missing, stale, evicted, expired, invalidated, or never warmed.&lt;/p&gt;
&lt;p&gt;If every application instance handles a miss by immediately querying the database, the cache has only moved the load problem. Under normal traffic, a 95 percent hit rate may look excellent. Under correlated expiration, deployment cold start, regional failover, or key eviction, that same system can convert thousands of concurrent user requests into thousands of identical database queries.&lt;/p&gt;
&lt;p&gt;This is why cache-aside implementations often fail under precisely the conditions where the database most needs protection. The cache removes load only when it is warm and healthy. The miss path decides what happens when it is not.&lt;/p&gt;
&lt;p&gt;The core question is not, “Should we cache this?” The core question is, “Who is allowed to miss, how fast may they miss, and what happens while the value is being recovered?”&lt;/p&gt;
&lt;h2 id=&quot;the-answer-is-a-governed-miss-path&quot;&gt;The Answer Is a Governed Miss Path&lt;/h2&gt;
&lt;p&gt;A resilient cache architecture treats misses as a controlled workflow, not as an exception buried inside a request handler.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  A[client request] --&gt; B[application read path]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  B --&gt; C{cache lookup}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|hit| D[return cached value]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  C --&gt;|miss| E[miss coordinator]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; F{refresh already running}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt;|yes| G[wait briefly or serve stale value]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  F --&gt;|no| H[acquire refresh lease]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  H --&gt; I[load from database with budget]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt; J[write cache with jittered ttl]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  J --&gt; K[return fresh value]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  I --&gt;|budget exhausted| L[serve stale value or fail closed]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  E --&gt; M[miss metrics and admission control]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;  M --&gt; N[rate limits and circuit breakers]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The important component is not the cache. It is the miss coordinator.&lt;/p&gt;
&lt;p&gt;At minimum, that coordinator should provide request coalescing, so one cache miss per key becomes one database read, not one read per caller. It should enforce a per-key refresh lease so that only one worker repopulates a hot key at a time. It should use bounded wait times so callers do not pile up indefinitely behind a slow database query. It should support stale serving for values where slightly old data is better than taking the system down. It should apply jitter to expirations so hot keys do not all expire at the same second.&lt;/p&gt;
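&lt;p&gt;Those responsibilities fit in a small in-process sketch. The class below is illustrative only: &lt;code&gt;MissCoordinator&lt;/code&gt;, its parameters, and its defaults are hypothetical, it coordinates threads inside one process rather than across instances, and it uses a plain dictionary where a real system would use Redis or Memcached.&lt;/p&gt;

```python
import random
import threading
import time


class MissCoordinator:
    """Single-flight loader: one database read per key per miss."""

    def __init__(self, loader, ttl_seconds=60.0, jitter_fraction=0.1, wait_seconds=0.5):
        self._loader = loader
        self._ttl = ttl_seconds
        self._jitter = jitter_fraction
        self._wait = wait_seconds
        self._lock = threading.Lock()
        self._cache = {}      # key maps to (value, expires_at)
        self._inflight = {}   # key maps to an Event set when the refresh ends

    def get(self, key):
        with self._lock:
            entry = self._cache.get(key)
            if entry is not None and entry[1] > time.monotonic():
                return entry[0]                      # fresh hit
            done = self._inflight.get(key)
            leader = done is None
            if leader:                               # acquire the refresh lease
                done = threading.Event()
                self._inflight[key] = done
        if leader:
            return self._refresh(key, done)
        done.wait(self._wait)                        # bounded wait behind the leader
        with self._lock:
            entry = self._cache.get(key)
        if entry is None:
            raise TimeoutError(f"no value for {key!r} within {self._wait}s")
        return entry[0]                              # fresh, or stale if refresh is slow

    def _refresh(self, key, done):
        try:
            value = self._loader(key)
            spread = self._ttl * self._jitter
            ttl = self._ttl + random.uniform(-spread, spread)   # jittered expiry
            with self._lock:
                self._cache[key] = (value, time.monotonic() + ttl)
            return value
        finally:
            with self._lock:
                self._inflight.pop(key, None)
            done.set()
```

&lt;p&gt;The first caller to miss becomes the leader and performs the load. Every other caller for the same key waits a bounded time, then takes whatever is in the cache, which may be the stale entry if the leader is slow. Coordinating across application instances needs the lease to live in the shared cache itself, which this sketch does not attempt.&lt;/p&gt;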
&lt;p&gt;The database call itself needs a budget. A miss should not receive unlimited retries simply because the cache missed. Retries on the miss path multiply load exactly when the database is already exposed. Prefer short deadlines, limited attempts, and explicit fallback behavior.&lt;/p&gt;
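&lt;p&gt;A budgeted load can be as small as the sketch below. The function names, default limits, and the &lt;code&gt;BudgetExhausted&lt;/code&gt; exception are assumptions made for illustration. Note that the deadline here is checked between attempts; it does not interrupt a query that is already running, which would need a driver-level timeout.&lt;/p&gt;

```python
import time


class BudgetExhausted(Exception):
    """Raised when the miss path runs out of attempts or time."""


def load_with_budget(query, max_attempts=2, deadline_seconds=0.25,
                     backoff_seconds=0.02, clock=time.monotonic, sleep=time.sleep):
    """Call query() a bounded number of times within a total deadline."""
    give_up_at = clock() + deadline_seconds
    last_error = None
    for attempt in range(max_attempts):
        if clock() >= give_up_at:
            break                         # out of time; do not start another attempt
        try:
            return query()
        except Exception as exc:          # sketch only; narrow this in real code
            last_error = exc
            if attempt + 1 != max_attempts:
                sleep(backoff_seconds)    # brief pause before the next attempt
    raise BudgetExhausted("miss-path budget spent") from last_error


def read_with_fallback(query, stale_value=None, **budget):
    """Explicit fallback: prefer a stale value over hammering the database."""
    try:
        return load_with_budget(query, **budget), "fresh"
    except BudgetExhausted:
        if stale_value is not None:
            return stale_value, "stale"
        raise
```

&lt;p&gt;The caller decides what the fallback is. For data that tolerates staleness, pass the last known value; for data that does not, let &lt;code&gt;BudgetExhausted&lt;/code&gt; propagate and fail the request quickly.&lt;/p&gt;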
&lt;p&gt;This also means cache keys require ownership. A key is not just a string. It has a freshness contract, a rebuild cost, an invalidation source, and a blast radius. Keys that are cheap to rebuild can expire aggressively. Keys that are expensive to rebuild need warming, stale reads, or asynchronous refresh.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context.&lt;/strong&gt; Facebook’s published Memcache architecture describes caches as a distributed system with operational problems around consistency, thundering herds, regional topology, and invalidation. The documented pattern is that large-scale caching requires coordination around misses and invalidations, not merely inserting Memcached between application servers and storage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action.&lt;/strong&gt; The Facebook Memcache design uses mechanisms such as leases to reduce stale sets and control concurrent regeneration. A lease lets the cache tell a client that it has permission to compute and fill a missing value. Other clients do not all independently regenerate the same object at full speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result.&lt;/strong&gt; The documented result is a cache layer that can absorb high read traffic while reducing redundant backend work. The key lesson is not that Memcache is special. The lesson is that the miss path is part of the cache protocol.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning.&lt;/strong&gt; The architectural pattern is request coalescing with ownership of regeneration. Without that ownership, every caller treats itself as responsible for recovery, and the database becomes the coordination mechanism by accident.&lt;/p&gt;
&lt;p&gt;A second documented pattern appears in Amazon’s public guidance on caching and service resilience. The Builders Library discusses cache behavior in terms of timeouts, retries, overload, and dependency protection. The relevant lesson is that retries and cache refreshes must be limited by budgets, because uncontrolled recovery traffic can become worse than the original user traffic.&lt;/p&gt;
&lt;p&gt;PostgreSQL also illustrates the same point at the storage layer. Its buffer cache improves repeated access to pages already in memory, but a cache miss still becomes physical or operating-system-backed I/O. If many sessions miss on the same expensive query shape, PostgreSQL does not magically make that application-level work disappear. The documented behavior is that caching changes where repeated reads are served from; it does not eliminate the need to control concurrency, query cost, or admission.&lt;/p&gt;
&lt;p&gt;The pattern across these systems is consistent: caching is effective when the recovery path is engineered. A cache without miss governance is a performance optimization during calm periods and a load amplifier during incidents.&lt;/p&gt;
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;What happens&lt;/th&gt;&lt;th&gt;Design response&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Cold start&lt;/td&gt;&lt;td&gt;New instances have empty local caches and all query the database&lt;/td&gt;&lt;td&gt;Warm critical keys and use shared cache before local cache&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Correlated expiration&lt;/td&gt;&lt;td&gt;Many hot keys expire together&lt;/td&gt;&lt;td&gt;Add TTL jitter and refresh before expiry&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Hot key miss&lt;/td&gt;&lt;td&gt;One popular key triggers many identical database reads&lt;/td&gt;&lt;td&gt;Use per-key leases and request coalescing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cache outage&lt;/td&gt;&lt;td&gt;All traffic bypasses cache at once&lt;/td&gt;&lt;td&gt;Add database rate limits and fail closed for noncritical reads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow database recovery&lt;/td&gt;&lt;td&gt;Misses wait, retry, and consume application threads&lt;/td&gt;&lt;td&gt;Use short deadlines and bounded retry budgets&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Over-broad invalidation&lt;/td&gt;&lt;td&gt;One write invalidates too much cached data&lt;/td&gt;&lt;td&gt;Use precise keys and versioned invalidation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Silent cache bloat&lt;/td&gt;&lt;td&gt;Low-value keys evict high-value keys&lt;/td&gt;&lt;td&gt;Add admission control and track hit rate by key class&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The uncomfortable tradeoff is that a safer miss path sometimes returns stale data or partial results. That is often the right choice. For many product surfaces, a profile count that is thirty seconds old is better than a database outage caused by thousands of simultaneous refreshes.&lt;/p&gt;
&lt;p&gt;The other tradeoff is complexity. A governed miss path adds leases, metrics, deadlines, fallback rules, and operational runbooks. But that complexity already exists in the system. If it is not explicit in the cache layer, it is implicit in the database, the connection pool, and the incident channel.&lt;/p&gt;
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Measure misses as first-class production events, not as the inverse of hit rate. Break them down by key class, caller, latency, database query, and retry count.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Put a miss coordinator in the read path. Start with per-key request coalescing, refresh leases, TTL jitter, and stale serving for safe data classes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Load test cold cache, hot key expiration, cache outage, and database slowdown. The database query rate during each test is the real measure of cache design quality.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action:&lt;/strong&gt; Pick the ten most expensive cached objects in the system and write down their freshness contract, rebuild cost, invalidation source, and failure behavior. If those answers are unclear, the cache is not yet protecting the database.&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>system-design</category><category>cloud</category></item><item><title>Load Balancers: The Hidden State Machine in Front of Your App</title><link>https://rajivonai.com/blog/2022-01-26-load-balancers-the-hidden-state-machine-in-front-of-your-app/</link><guid isPermaLink="true">https://rajivonai.com/blog/2022-01-26-load-balancers-the-hidden-state-machine-in-front-of-your-app/</guid><description>A load balancer is not a pipe — it is a distributed state machine making routing and health decisions on stale, partial evidence. Its configuration choices propagate directly into application availability and failure modes.</description><pubDate>Wed, 26 Jan 2022 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A load balancer is not a pipe; it is a distributed state machine making safety decisions on stale, partial, and sometimes misleading evidence.&lt;/p&gt;
&lt;h2 id=&quot;situation&quot;&gt;Situation&lt;/h2&gt;
&lt;p&gt;Most application teams treat load balancers as infrastructure furniture. You define a listener, point it at a target group, add a health check, and move on to the application. The mental model is simple: clients arrive, the load balancer picks a backend, bad instances are removed, good instances receive traffic.&lt;/p&gt;
&lt;p&gt;That model works until production starts changing faster than the control plane can agree on what is true.&lt;/p&gt;
&lt;p&gt;Deployments drain connections. Autoscaling adds cold targets. Health checks pass while real requests fail. TLS handshakes saturate a node before CPU alarms fire. A single dependency outage makes every backend return the same error at the same time. Suddenly the component that was supposed to be boring is deciding whether to retry, eject, drain, panic, fail open, or send traffic to a target everyone believes is unhealthy.&lt;/p&gt;
&lt;p&gt;The important shift is this: modern load balancers are not just traffic distributors. They encode policy, memory, timers, thresholds, and recovery behavior. They remember which endpoints were recently bad. They delay removal to avoid flapping. They preserve long connections while moving new requests elsewhere. They may intentionally route to unhealthy hosts when the alternative is a total outage.&lt;/p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The common failure is not that the load balancer makes one wrong routing decision. The failure is that application teams design their services as if the load balancer were stateless.&lt;/p&gt;
&lt;p&gt;A stateless router can be reasoned about request by request. A load balancer cannot. Its current decision depends on previous health checks, previous errors, configured thresholds, slow-start windows, connection draining state, availability zone policy, retry budgets, outlier detection, and how many targets remain eligible.&lt;/p&gt;
&lt;p&gt;That hidden state creates several production traps.&lt;/p&gt;
&lt;p&gt;First, health is sampled, not known. A target can pass &lt;code&gt;/health&lt;/code&gt; while the application path that performs authentication, database access, or queue writes is broken. The load balancer sees green. Users see failure.&lt;/p&gt;
&lt;p&gt;Second, removal is delayed by design. Health thresholds exist to prevent one transient miss from ejecting a healthy server. That same protection means a badly deployed instance may continue receiving traffic for several probe intervals.&lt;/p&gt;
&lt;p&gt;Third, recovery is also delayed. A target is readmitted only after a run of consecutive passing probes, so the health check interval multiplied by the healthy threshold can turn a thirty-second application recovery into a multi-minute traffic recovery.&lt;/p&gt;
&lt;p&gt;Fourth, all-target failure is special. Some systems fail closed, returning an error because no target is safe. Others fail open, sending traffic to all targets because every target being unhealthy may mean the health signal is wrong or the system is in a regional failure mode.&lt;/p&gt;
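&lt;p&gt;To put rough numbers on the second and third traps, here is a minimal Python sketch. The &lt;code&gt;ProbeConfig&lt;/code&gt; type, its field names, and the sample values are assumptions chosen for illustration, not the defaults of any particular product. The estimate ignores probe timeouts and jitter.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hypothetical health check settings (illustrative names and values)."""
    interval_s: float          # seconds between health probes
    unhealthy_threshold: int   # consecutive failures before removal
    healthy_threshold: int     # consecutive successes before readmission


def removal_delay_s(cfg: ProbeConfig) -> float:
    """Approximate time a broken target keeps receiving traffic."""
    return cfg.interval_s * cfg.unhealthy_threshold


def readmission_delay_s(cfg: ProbeConfig) -> float:
    """Approximate time before a recovered target receives traffic again."""
    return cfg.interval_s * cfg.healthy_threshold


cfg = ProbeConfig(interval_s=30, unhealthy_threshold=2, healthy_threshold=5)
print(removal_delay_s(cfg))      # 60
print(readmission_delay_s(cfg))  # 150
```

&lt;p&gt;With these sample values, a bad deploy keeps taking traffic for about a minute, and an instance that recovers in thirty seconds waits roughly two and a half minutes before it is trusted again.&lt;/p&gt;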
&lt;p&gt;So the real question is not “Which load balancing algorithm should we use?” The better question is: what state machine are we placing in front of the application, and have we designed the application to survive its transitions?&lt;/p&gt;
&lt;h2 id=&quot;the-load-balancer-state-machine&quot;&gt;The Load Balancer State Machine&lt;/h2&gt;
&lt;p&gt;A useful architecture starts by making the implicit state explicit. The load balancer has at least six states for a backend: unknown, warming, healthy, suspect, draining, and ejected. Different products use different names, but the operational pattern is consistent.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;mermaid&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;flowchart TD&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    A[client request — arrives] --&gt; B[listener — protocol policy]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    B --&gt; C{route decision — match rules}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    C --&gt;|rule match| D[target group — weighted pool]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    D --&gt; E{endpoint state — healthy enough}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|healthy| F[backend — receive request]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|draining| G[connection draining — finish or timeout]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    E --&gt;|unhealthy| H[outlier set — remove from pool]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    H --&gt; I{panic rule — too few healthy targets}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|fail closed| J[return failure — no safe target]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    I --&gt;|fail open| F&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    F --&gt; K[feedback — latency, errors, resets]&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#E1E4E8&quot;&gt;    K --&gt; D&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The application architecture should treat this state machine as part of the serving path.&lt;/p&gt;
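&lt;p&gt;One way to make the six backend states concrete is a small transition table. The sketch below is a simplified model, not the behavior of any specific load balancer: the event names, the &lt;code&gt;TRANSITIONS&lt;/code&gt; table, and the choice to keep routing to suspect targets are all assumptions.&lt;/p&gt;

```python
from enum import Enum


class BackendState(Enum):
    UNKNOWN = "unknown"
    WARMING = "warming"
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    DRAINING = "draining"
    EJECTED = "ejected"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (BackendState.UNKNOWN, "registered"): BackendState.WARMING,
    (BackendState.WARMING, "probes_passed"): BackendState.HEALTHY,
    (BackendState.HEALTHY, "probe_failed"): BackendState.SUSPECT,
    (BackendState.SUSPECT, "probes_passed"): BackendState.HEALTHY,
    (BackendState.SUSPECT, "threshold_exceeded"): BackendState.EJECTED,
    (BackendState.EJECTED, "probes_passed"): BackendState.WARMING,
    (BackendState.WARMING, "deregistered"): BackendState.DRAINING,
    (BackendState.HEALTHY, "deregistered"): BackendState.DRAINING,
    (BackendState.SUSPECT, "deregistered"): BackendState.DRAINING,
}

# Assumption: suspect targets still take traffic until a threshold is crossed.
ROUTABLE = {BackendState.HEALTHY, BackendState.SUSPECT}


def step(state: BackendState, event: str) -> BackendState:
    """Return the next state; unlisted (state, event) pairs leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def accepts_new_requests(state: BackendState) -> bool:
    """Only routable states receive new requests."""
    return state in ROUTABLE


state = BackendState.UNKNOWN
for event in ["registered", "probes_passed", "probe_failed", "threshold_exceeded"]:
    state = step(state, event)
print(state.value)  # ejected
```

&lt;p&gt;Writing the table down, even informally, forces the questions that matter in an incident: which events move a target out of rotation, and which states still receive traffic.&lt;/p&gt;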
&lt;p&gt;The health endpoint should be intentionally boring, but not meaningless. It should verify that the process can serve the cheapest representative request, not that every dependency in the universe is perfect. A health check that fails on any downstream blip can evacuate the entire fleet during a dependency incident. A health check that only returns “process is alive” can keep broken application instances in rotation.&lt;/p&gt;
&lt;p&gt;Readiness should be separated from liveness. A process can be alive while not ready to receive traffic. During startup, schema migration, cache warmup, model loading, or connection pool initialization, the correct state is not dead. It is warming.&lt;/p&gt;
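&lt;p&gt;The split between liveness and readiness can be sketched as two small functions that return HTTP status codes. The &lt;code&gt;AppStatus&lt;/code&gt; fields and function names are hypothetical, and how they are wired to real endpoints depends on the web framework in use.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class AppStatus:
    """Hypothetical in-process status flags maintained by the application."""
    warmed_up: bool = False               # caches, models, and pools are initialized
    shutting_down: bool = False           # set once the process starts draining
    can_serve_cheap_request: bool = True  # result of a cheap local self-test


def liveness() -> int:
    """Liveness answers one question: should this process be restarted?

    If this code runs at all, the process is alive.
    """
    return 200


def readiness(status: AppStatus) -> int:
    """Readiness answers: should the load balancer send new requests here?"""
    if status.shutting_down or not status.warmed_up:
        return 503
    if not status.can_serve_cheap_request:
        return 503
    return 200
```

&lt;p&gt;Note what the readiness check leaves out: it does not probe every downstream dependency, so a shared dependency blip does not pull the whole fleet out of rotation at once.&lt;/p&gt;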
&lt;p&gt;Draining should be designed as an application behavior, not only an infrastructure setting. When a target is removed from rotation, new requests should stop, but existing work should have a bounded chance to finish. That means request deadlines, idempotency keys, retry-safe handlers, and shutdown hooks that stop accepting work before terminating the process.&lt;/p&gt;
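&lt;p&gt;A minimal sketch of draining as application behavior follows. The &lt;code&gt;DrainableServer&lt;/code&gt; class and its method names are hypothetical. It assumes each request handler calls &lt;code&gt;try_begin&lt;/code&gt; before doing work and &lt;code&gt;finish&lt;/code&gt; afterwards, and that &lt;code&gt;drain&lt;/code&gt; is called from a shutdown hook such as a termination signal handler.&lt;/p&gt;

```python
import threading
import time


class DrainableServer:
    """Tracks in-flight work so shutdown can stop intake, then wait briefly."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Admit a request unless draining has started."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def finish(self) -> None:
        """Mark one admitted request as complete."""
        with self._lock:
            self._in_flight -= 1

    def drain(self, deadline_s: float, poll_s: float = 0.01) -> bool:
        """Stop admitting work; return True if in-flight work finished in time."""
        with self._lock:
            self._draining = True
        give_up_at = time.monotonic() + deadline_s
        while give_up_at > time.monotonic():
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_s)
        with self._lock:
            return self._in_flight == 0
```

&lt;p&gt;The deadline is the important design choice: existing work gets a bounded chance to finish, and a &lt;code&gt;False&lt;/code&gt; result tells the shutdown path that some requests will be cut off and must be safe to retry.&lt;/p&gt;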
&lt;p&gt;Retries must be budgeted against the same pool the load balancer is protecting. If every client retries twice, and the load balancer also retries, a partial outage can become an amplification system. Retry policy belongs in the architecture diagram, not in a library default no one reviews.&lt;/p&gt;
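&lt;p&gt;The amplification risk and one common mitigation, a retry budget, can be sketched as follows. The &lt;code&gt;RetryBudget&lt;/code&gt; class is a simplified assumption: it counts over the lifetime of the object, whereas a production budget would typically decay over a time window. The example assumes the load balancer retries each failed request once.&lt;/p&gt;

```python
def worst_case_attempts(client_retries: int, lb_retries: int) -> int:
    """Attempts that can reach the pool when both layers retry every failure."""
    return (1 + client_retries) * (1 + lb_retries)


class RetryBudget:
    """Allow retries only up to a fixed percentage of first attempts seen."""

    def __init__(self, retry_percent: int) -> None:
        self.retry_percent = retry_percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one first attempt."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Spend one retry from the budget if any remains."""
        allowed = self.requests * self.retry_percent // 100
        if self.retries >= allowed:
            return False
        self.retries += 1
        return True


print(worst_case_attempts(client_retries=2, lb_retries=1))  # 6

budget = RetryBudget(retry_percent=10)
for _ in range(100):
    budget.record_request()
granted = sum(1 for _ in range(50) if budget.try_retry())
print(granted)  # 10
```

&lt;p&gt;Without a budget, one user request can become six backend attempts during a partial outage. With a ten percent budget, the same pool sees at most ten extra attempts per hundred requests, however many callers want to retry.&lt;/p&gt;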
&lt;p&gt;Finally, observability should expose state transitions, not only request totals. You need to see healthy host count, ejection count, target response codes, load balancer generated errors, backend generated errors, connection age, drain duration, and retry attempts. If those signals are split across five dashboards, responders will spend the incident reconstructing the state machine from symptoms.&lt;/p&gt;
&lt;h2 id=&quot;in-practice&quot;&gt;In Practice&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Context.&lt;/strong&gt; AWS documents a specific fail-open behavior for Application Load Balancer target groups: if all targets fail health checks in all enabled Availability Zones, the load balancer routes requests to all targets regardless of health status, using its configured routing algorithm. See the AWS Elastic Load Balancing documentation on &lt;a href=&quot;https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html&quot;&gt;target group health checks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action.&lt;/strong&gt; The architectural action is to treat “all targets unhealthy” as a first-class mode. Health checks should not depend on fragile shared dependencies unless removing every target is genuinely safer than serving degraded traffic. Applications should also emit a clear degraded response when dependency failure is known.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result.&lt;/strong&gt; The documented result is a changed failure mode: the load balancer may prefer attempting service over returning no service. That can be correct during health-check misconfiguration or probe-path failure, and dangerous when every backend is truly unable to serve.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning.&lt;/strong&gt; Do not assume unhealthy means isolated. In a systemic failure, load balancer behavior often shifts from protecting individual hosts to preserving some chance of availability.&lt;/p&gt;
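&lt;p&gt;The fail-open rule described above can be expressed in a few lines. This is a toy model of the decision, not AWS code or an AWS API. The function name and the &lt;code&gt;fail_open&lt;/code&gt; flag are hypothetical.&lt;/p&gt;

```python
def eligible_targets(health: dict[str, bool], fail_open: bool = True) -> list[str]:
    """Return the targets that may receive traffic given per-target health."""
    healthy = [name for name, ok in health.items() if ok]
    if healthy:
        return healthy
    # Every target is failing its probe, so the health signal itself may be wrong.
    # Fail open routes to all targets; fail closed routes to none.
    return list(health) if fail_open else []


print(eligible_targets({"a": True, "b": False}))   # ['a']
print(eligible_targets({"a": False, "b": False}))  # ['a', 'b']
```

&lt;p&gt;The second call is the mode worth rehearsing: the pool looks entirely unhealthy, yet every target is about to receive traffic.&lt;/p&gt;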
&lt;p&gt;&lt;strong&gt;Context.&lt;/strong&gt; Google’s SRE material on &lt;a href=&quot;https://sre.google/sre-book/load-balancing-datacenter/&quot;&gt;load balancing in the datacenter&lt;/a&gt; describes load balancing as a capacity and overload-control problem, not merely a request distribution problem. It discusses health checking, backend overload, and algorithms that avoid sending additional traffic where capacity is already constrained.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action.&lt;/strong&gt; The architectural action is to feed the balancer signals that approximate serving capacity, not just binary process health. Concurrency, queue depth, latency, and overload responses can be better indicators than “port is open.”&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result.&lt;/strong&gt; The documented pattern is that load balancing becomes part of overload prevention. It steers demand away from constrained backends before total failure, but it requires trustworthy feedback from the serving systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning.&lt;/strong&gt; A load balancer cannot invent capacity. It can only allocate demand based on the signals it receives.&lt;/p&gt;
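&lt;p&gt;As an illustration of routing on capacity rather than binary health, the sketch below picks the least utilized backend and refuses to pick one when every backend is full. It is a simplified assumption, not the algorithm from the SRE book: &lt;code&gt;pick_backend&lt;/code&gt;, the in-flight counters, and the per-backend capacity numbers are all hypothetical inputs the serving systems would have to report.&lt;/p&gt;

```python
from typing import Optional


def pick_backend(in_flight: dict[str, int], capacity: dict[str, int]) -> Optional[str]:
    """Pick the backend with the lowest utilization, or None if all are full."""
    candidates = [b for b in in_flight if capacity[b] > in_flight[b]]
    if not candidates:
        # No spare capacity anywhere: the caller should shed load, not queue it.
        return None
    return min(candidates, key=lambda b: in_flight[b] / capacity[b])


capacity = {"a": 10, "b": 10, "c": 10}
print(pick_backend({"a": 8, "b": 3, "c": 10}, capacity))    # b
print(pick_backend({"a": 10, "b": 10, "c": 10}, capacity))  # None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; makes the limit explicit: when every backend is saturated, the only honest options are to reject or to degrade.&lt;/p&gt;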
&lt;p&gt;&lt;strong&gt;Context.&lt;/strong&gt; Envoy documents &lt;a href=&quot;https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier&quot;&gt;outlier detection&lt;/a&gt; as a mechanism for detecting hosts behaving unlike others and ejecting them from the healthy load balancing set, with caveats around panic scenarios and active health checks that do not validate real data-plane behavior.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Action.&lt;/strong&gt; The architectural action is to distinguish active health checks from passive traffic evidence. If live requests fail while active probes pass, passive outlier detection can protect users faster than probe-only health.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Result.&lt;/strong&gt; The documented result is adaptive ejection based on observed behavior. It improves resilience to partial backend failure, but it introduces more state, timers, and re-entry behavior to understand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Learning.&lt;/strong&gt; More intelligent load balancing increases the need for operational literacy. The system is safer only if engineers know when and why it ejects, restores, or panics.&lt;/p&gt;
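&lt;p&gt;A minimal sketch of passive ejection based on consecutive server errors is shown below. It is not Envoy’s implementation. The class name, the threshold, and the rule that each repeat ejection lasts longer are simplified assumptions, meant only to show how much state and timing such a mechanism carries.&lt;/p&gt;

```python
class ConsecutiveErrorEjector:
    """Ejects a host after a run of 5xx responses seen on live traffic."""

    def __init__(self, threshold: int, base_ejection_s: float) -> None:
        self.threshold = threshold
        self.base_ejection_s = base_ejection_s
        self.consecutive_errors: dict[str, int] = {}
        self.ejection_count: dict[str, int] = {}
        self.ejected_until: dict[str, float] = {}

    def record(self, host: str, status: int, now: float) -> None:
        """Update state from one observed response."""
        if status >= 500:
            errors = self.consecutive_errors.get(host, 0) + 1
            self.consecutive_errors[host] = errors
            if errors >= self.threshold:
                count = self.ejection_count.get(host, 0) + 1
                self.ejection_count[host] = count
                # Repeat offenders stay out longer: base time times ejection count.
                self.ejected_until[host] = now + self.base_ejection_s * count
                self.consecutive_errors[host] = 0
        else:
            # Any non-5xx response breaks the run.
            self.consecutive_errors[host] = 0

    def is_ejected(self, host: str, now: float) -> bool:
        """True while the host is inside its ejection window."""
        return self.ejected_until.get(host, 0.0) > now
```

&lt;p&gt;Even this toy version has three pieces of per-host memory and a timer. Knowing which one fired is the operational literacy the learning above refers to.&lt;/p&gt;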
&lt;h2 id=&quot;where-it-breaks&quot;&gt;Where It Breaks&lt;/h2&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Design choice&lt;/th&gt;&lt;th&gt;What it protects&lt;/th&gt;&lt;th&gt;Where it fails&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Simple health check&lt;/td&gt;&lt;td&gt;Removes crashed processes&lt;/td&gt;&lt;td&gt;Misses broken application paths&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Deep dependency health check&lt;/td&gt;&lt;td&gt;Avoids serving known bad requests&lt;/td&gt;&lt;td&gt;Can evacuate the fleet during dependency incidents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Aggressive ejection&lt;/td&gt;&lt;td&gt;Reduces user-visible errors quickly&lt;/td&gt;&lt;td&gt;Can shrink capacity during transient spikes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Slow ejection&lt;/td&gt;&lt;td&gt;Avoids flapping&lt;/td&gt;&lt;td&gt;Sends traffic to bad targets longer&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Fail closed&lt;/td&gt;&lt;td&gt;Prevents known-bad backends from serving&lt;/td&gt;&lt;td&gt;Turns probe failure into total outage&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Fail open&lt;/td&gt;&lt;td&gt;Preserves a chance of service&lt;/td&gt;&lt;td&gt;Sends traffic to unhealthy targets&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Sticky sessions&lt;/td&gt;&lt;td&gt;Preserves cache and session locality&lt;/td&gt;&lt;td&gt;Concentrates failure on unlucky clients&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Client retries&lt;/td&gt;&lt;td&gt;Masks isolated failures&lt;/td&gt;&lt;td&gt;Amplifies load during partial outages&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Connection draining&lt;/td&gt;&lt;td&gt;Protects in-flight work&lt;/td&gt;&lt;td&gt;Extends deploy and rollback windows&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The hardest production incidents happen when several of these choices interact. A deploy adds cold targets. Slow start is missing. Latency rises. Clients retry. Passive detection ejects a few hosts. Remaining hosts take more load. Health checks begin timing out. The balancer enters a different mode. By the time the application team looks at logs, the visible error is a generic gateway failure, but the root cause is a state transition cascade.&lt;/p&gt;
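&lt;p&gt;The arithmetic behind such a cascade is simple enough to sketch. The numbers below are invented for illustration: ten hosts, each assumed to handle about 150 requests per second, with 1,000 requests per second offered.&lt;/p&gt;

```python
def per_host_rps(offered_rps: float, retry_multiplier: float, routable_hosts: int) -> float:
    """Load each remaining host sees once retries and ejections are applied."""
    return offered_rps * retry_multiplier / routable_hosts


# Steady state: 10 hosts, no retries.
print(per_host_rps(1000, 1.0, 10))  # 100.0

# Two hosts ejected and clients retrying half of their requests.
print(per_host_rps(1000, 1.5, 8))   # 187.5
```

&lt;p&gt;Under these assumed numbers, losing two hosts while retries add fifty percent pushes every survivor past its 150 requests per second limit, which is how a partial failure turns into a fleet-wide one.&lt;/p&gt;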
&lt;h2 id=&quot;what-to-do-next&quot;&gt;What to Do Next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Treating the load balancer as stateless hides the real failure modes. Write down the backend states your platform supports (unknown, warming, healthy, suspect, draining, ejected) and whether it fails open or fails closed when every target is unhealthy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Design health, readiness, retries, and draining as one serving contract. The application should know when it is ready, when it is degraded, and when it must stop accepting new work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Test the state machine directly. Kill one target, break the health endpoint, break the main request path while leaving health green, make every target unhealthy, and run a deploy while long requests are active.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Add dashboards and alerts around transitions, not just traffic volume. Healthy target count, ejection events, retry rate, load balancer errors, backend errors, and drain duration should tell one coherent story during an incident.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content:encoded><category>architecture</category><category>system-design</category><category>cloud</category></item></channel></rss>