Cache Incident Workflow: Hit Rate Collapse, Stampede, TTLs, and Database Protection

A cache incident is not a cache problem; it is a database protection failure that happens to start in the cache layer.

Situation

Most production systems treat caching as a performance optimization until the first real incident proves otherwise. A healthy cache hides read amplification, expensive joins, remote API latency, and uneven traffic. When the cache is warm, the database looks calm. When hit rate collapses, the same database is suddenly asked to serve traffic it was never provisioned to absorb directly.

The modern version is worse because cache layers now sit in front of many different backends: relational databases, object stores, search indexes, vector databases, model gateways, feature stores, and third-party APIs. The cache is not only shaving milliseconds. It is often the only thing standing between normal traffic and cascading saturation.

The Problem

Cache incidents rarely begin with a clean outage. They begin with drift: hit rate drops from 96% to 88%, latency widens, backend queue depth rises, retry volume increases, and application workers hold connections longer. Then a TTL boundary, deploy, hot key, regional failover, or eviction event turns the drift into a cliff.
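
The drift is larger than it looks, because the database sees the miss rate, not the hit rate. Here is a minimal sketch of the arithmetic, assuming every miss becomes exactly one backend query and using an illustrative 10,000 requests per second (a made-up figure, not a measurement from any real system):

```python
def backend_qps(total_qps: float, hit_rate: float) -> float:
    """Queries per second reaching the backend, assuming one query per miss."""
    return total_qps * (1.0 - hit_rate)


TOTAL_QPS = 10_000  # illustrative traffic level (assumption)

before = backend_qps(TOTAL_QPS, 0.96)  # 4% of traffic misses
after = backend_qps(TOTAL_QPS, 0.88)   # 12% of traffic misses

print(round(before), round(after), round(after / before, 2))
```

Under those assumptions, an eight-point drop in hit rate triples the load on the database, before any retries are counted.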

The failure modes compound:

Hit rate collapse moves traffic from cache to database.
Stampede causes many workers to recompute the same missing value.
TTL synchronization expires many keys at once.
Retries multiply backend pressure during the worst window.
Eviction churn removes useful keys faster than they can be refilled.
Database saturation turns slow misses into timeouts, which create more retries.

The core question is not “How do we restore the cache?” It is: how do we keep the database alive while the cache is wrong, cold, overloaded, or partially unavailable?

The Answer: Treat Cache Recovery as an Incident Workflow

A reliable cache architecture separates three control loops: request serving, cache regeneration, and database protection. The application should not let every miss become an immediate backend query. The cache layer needs guardrails that decide when to serve stale data, when to coalesce work, when to shed load, and when to slow callers before the database falls over.

flowchart TD
  A[request arrives] --> B{cache lookup}
  B -->|hit| C[return cached value]
  B -->|miss| D{single flight guard}
  D -->|leader exists| E[wait briefly or serve stale]
  D -->|leader elected| F{backend budget available}
  F -->|yes| G[query database]
  F -->|no| H[serve stale or bounded error]
  G --> I[refresh cache with jittered TTL]
  I --> J[return value]
  E --> J
  H --> K[protect database and emit incident signal]

The architecture has four practical requirements.

First, every expensive key path needs request coalescing. In Go this pattern is often called singleflight; in other stacks it appears as per-key locks, lease tokens, or refresh ownership. The point is simple: one worker regenerates a missing value while the rest wait briefly, serve stale, or fail fast. Without coalescing, one expired hot key can become thousands of identical database queries.
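
One way to sketch per-key coalescing is shown below. It is a minimal in-process illustration, not Go's singleflight and not a distributed lease: the `SingleFlight` class, the `load_profile` function, and the `user:42` key are hypothetical names chosen for the example, and the sleep stands in for a slow query.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared between the leader and the followers for one key."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Coalesce concurrent fills for the same key into one execution (in-process only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fill: Callable[[], Any], wait_timeout: float = 1.0) -> Any:
        with self._lock:
            call = self._calls.get(key)
            is_leader = call is None
            if is_leader:
                call = _Call()
                self._calls[key] = call

        if is_leader:
            try:
                call.result = fill()
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        elif not call.done.wait(wait_timeout):
            # Bounded wait: the caller should fall back to stale data or fail fast.
            raise TimeoutError(f"timed out waiting for fill of {key!r}")

        if call.error is not None:
            raise call.error
        return call.result


db_calls = 0


def load_profile() -> str:
    global db_calls
    db_calls += 1    # counts how many fills actually reach the "database"
    time.sleep(0.2)  # stand-in for a slow database query
    return "profile-v1"


flight = SingleFlight()
barrier = threading.Barrier(20)
results = []


def worker() -> None:
    barrier.wait()  # release all workers together to mimic a hot key expiring
    results.append(flight.do("user:42", load_profile, wait_timeout=5.0))


threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(db_calls, len(results))
```

Twenty concurrent callers share one fill. The sketch only coordinates threads inside one process; coordinating across processes or hosts would need ownership held in the cache itself, such as a lease token.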

Second, TTLs need jitter and refresh policy. Fixed TTLs create synchronized expiration. Jitter spreads refreshes over time. Refresh-ahead can help for predictable hot keys, but it must be bounded; an aggressive refresh daemon can become its own incident. The cache should know the difference between a value that is absent, a value that is stale but usable, and a value that must not be served.
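
Jitter and the absent, stale, and unservable states can be sketched in a few lines. The `Entry` fields, the soft and hard TTL split, and the 10% spread below are assumptions made for illustration, not values the pattern prescribes.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Freshness(Enum):
    ABSENT = "absent"              # nothing cached for this key
    FRESH = "fresh"                # serve as-is
    STALE_USABLE = "stale_usable"  # serve if needed, refresh in the background
    EXPIRED = "expired"            # must not be served


@dataclass
class Entry:
    value: Any
    stored_at: float  # seconds, same clock as `now`
    soft_ttl: float   # past this age the value is stale but usable
    hard_ttl: float   # past this age the value must not be served


def jittered_ttl(base_seconds: float, spread: float = 0.1) -> float:
    """Scale the base TTL by a random factor in [1 - spread, 1 + spread]."""
    return base_seconds * random.uniform(1.0 - spread, 1.0 + spread)


def classify(entry: Optional[Entry], now: float) -> Freshness:
    if entry is None:
        return Freshness.ABSENT
    age = now - entry.stored_at
    if age <= entry.soft_ttl:
        return Freshness.FRESH
    if age <= entry.hard_ttl:
        return Freshness.STALE_USABLE
    return Freshness.EXPIRED


now = 1_000.0
entry = Entry(value="catalog", stored_at=now - 400.0,
              soft_ttl=jittered_ttl(300.0), hard_ttl=900.0)
print(classify(entry, now))
```

With a 300-second base and a 10% spread, keys written at the same moment expire across a window of roughly a minute instead of in the same instant.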

Third, the database needs an explicit miss budget. A miss path should pass through a limiter sized to what the backend can survive. That limiter can be per service, per shard, per tenant, or per key class. If the budget is exhausted, the application should serve stale data, return a controlled degraded response, or shed low-priority traffic. It should not keep adding concurrent database work until connection pools collapse.
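
A miss budget can be as small as a non-blocking semaphore around the fill path. The sketch below assumes a single process and a concurrency-based budget; `MissBudget`, `read_through`, and the returned status labels are hypothetical names, and a real limiter might instead be rate-based or shared across hosts.

```python
import threading
from contextlib import contextmanager
from typing import Any, Callable, Iterator, Optional, Tuple


class MissBudget:
    """Cap concurrent backend fills; callers without a slot must degrade."""

    def __init__(self, max_concurrent_fills: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent_fills)

    @contextmanager
    def try_acquire(self) -> Iterator[bool]:
        acquired = self._slots.acquire(blocking=False)  # never queue behind the database
        try:
            yield acquired
        finally:
            if acquired:
                self._slots.release()


def read_through(key: str, budget: MissBudget, load_from_db: Callable[[str], Any],
                 stale_value: Optional[Any] = None) -> Tuple[str, Any]:
    with budget.try_acquire() as has_slot:
        if has_slot:
            return ("fresh", load_from_db(key))
    if stale_value is not None:
        return ("stale", stale_value)  # budget exhausted, stale copy available
    return ("shed", None)              # budget exhausted, nothing safe to serve


budget = MissBudget(max_concurrent_fills=1)

with budget.try_acquire() as held:
    # While the only slot is held, other misses must degrade.
    during = read_through("sku-1", budget, lambda k: "from-db", stale_value="old")
    during_no_stale = read_through("sku-2", budget, lambda k: "from-db")

after_release = read_through("sku-1", budget, lambda k: "from-db", stale_value="old")
print(during, during_no_stale, after_release)
```

The important property is that the limiter fails fast. A blocking acquire would move the queue from the database into the application workers, which is the connection-holding behavior the budget is meant to prevent.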

Fourth, incident response needs cache-specific telemetry. Overall latency is a lagging signal: by the time it moves, the database is already absorbing the misses. Useful signals include cache hit rate by route and key family, miss rate, fill latency, stale serve count, coalescing wait time, backend query rate from cache misses, eviction rate, hot key distribution, TTL age distribution, and database saturation. The incident dashboard should answer: which keys are missing, why they are missing, who is regenerating them, and what the backend is absorbing.
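
To see why hit rate needs to be broken down by key family, consider a small in-memory sketch. It assumes keys follow a `family:id` naming convention and uses made-up counts; `CacheStats` is a hypothetical helper, and a production system would export these counters to its metrics pipeline instead of holding them in a dictionary.

```python
from collections import Counter
from typing import Optional


class CacheStats:
    """Count cache outcomes per key family, assuming keys look like 'family:id'."""

    def __init__(self) -> None:
        self._events: Counter = Counter()

    def record(self, key: str, outcome: str) -> None:
        family = key.split(":", 1)[0]
        self._events[(family, outcome)] += 1

    def hit_rate(self, family: Optional[str] = None) -> Optional[float]:
        """Hit rate for one family, or across all families when family is None."""
        hits = total = 0
        for (fam, outcome), count in self._events.items():
            if family is None or fam == family:
                total += count
                if outcome == "hit":
                    hits += count
        return hits / total if total else None


stats = CacheStats()
for _ in range(97):
    stats.record("session:abc", "hit")
for _ in range(3):
    stats.record("session:abc", "miss")
for _ in range(5):
    stats.record("catalog:sku-1", "hit")
for _ in range(15):
    stats.record("catalog:sku-1", "miss")

print(stats.hit_rate(), stats.hit_rate("session"), stats.hit_rate("catalog"))
```

In this invented sample the aggregate hit rate is 85%, which looks like mild drift, while the catalog family has collapsed to 25%. Only the per-family view points at the keys that are hurting the backend.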

In Practice

Context. The documented pattern from Meta’s memcache architecture is that caching at scale requires more than a key-value store. The NSDI paper “Scaling Memcache at Facebook” describes leases to address stale sets and thundering herd behavior, regional cache deployment, and operational mechanisms for avoiding backend overload. The public lesson is not “use memcache.” It is that large read-heavy systems need cache coordination semantics when many clients share a backend.

Action. Apply the same pattern in service-level design. Add per-key regeneration ownership, stale serving for eligible data, TTL jitter, and a database miss budget. Treat cache fills as controlled backend work, not ordinary request work. For hot objects, separate freshness policy from availability policy: a profile page, product catalog entry, or feature flag snapshot may tolerate seconds or minutes of staleness; a payment authorization result may not.

Result. The expected operational result is reduced peak backend amplification. During a hit rate collapse, only bounded fill work reaches the database. Callers may see stale responses or controlled degradation, but the primary datastore remains available. This is the difference between a cache incident and a full service outage.

Learning. Cache correctness and cache availability are separate concerns. A system can be correct but fragile if every miss synchronously regenerates through the database. A system can also be fast but unsafe if TTLs align and all clients refresh together. Production cache design has to encode contention control, not just expiration.

Another known pattern appears in Amazon DynamoDB Accelerator documentation: DAX is positioned as a write-through and read-through caching layer for DynamoDB workloads that need microsecond read latency. The architecture is useful because it makes the cache part of the data access path rather than a scattered application convention. The broader learning is that centralizing cache behavior can reduce inconsistent miss handling across services, but it does not remove the need for capacity planning, TTL discipline, and fallback behavior.

PostgreSQL and MySQL also demonstrate the backend side of the same pattern. When connection pools saturate, the database does not merely become slower; it starts changing the behavior of the whole system. Transactions hold locks longer, application threads wait longer, retries overlap, and health checks can become noisy. A cache incident workflow must therefore protect database concurrency first, then restore hit rate.

Where It Breaks

| Failure mode | Why it happens | Mitigation | Residual risk |
| --- | --- | --- | --- |
| Hot key expiration | One popular key expires and all workers miss together | Per-key singleflight, stale-while-revalidate, refresh-ahead | Leader refresh can still fail repeatedly |
| TTL cliff | Many keys share the same expiration window | TTL jitter and staged warmup | Bulk deploys can still invalidate too much |
| Cold cache after deploy | New version changes key names or serialization | Versioned rollout and prewarming | Bad prewarm can overload backend |
| Eviction churn | Cache is too small or key distribution changed | Track eviction rate and resize by working set | Large tenants can dominate shared caches |
| Retry amplification | Misses become slow, then callers retry | Retry budgets and circuit breakers | Client libraries may ignore service policy |
| Stale data misuse | Degraded mode serves data that must be fresh | Classify keys by freshness contract | Product requirements may be ambiguous |
| Database collapse | Cache fill traffic exceeds backend capacity | Miss budget and load shedding | User-visible errors may be unavoidable |
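
The retry budget named as a mitigation above can be sketched as a simple credit counter. This is one possible shape, assuming a single-threaded caller and a policy of about one retry per ten first attempts; `RetryBudget` and its parameters are hypothetical, and a shared client would need locking around the counter.

```python
class RetryBudget:
    """Allow roughly one retry per `requests_per_retry` first attempts."""

    def __init__(self, requests_per_retry: int = 10, max_credit: int = 100) -> None:
        self.requests_per_retry = requests_per_retry
        self.max_credit = max_credit  # cap so idle periods cannot bank unlimited retries
        self._credit = 0              # integer units avoid floating-point drift

    def record_request(self) -> None:
        """Call once per first attempt; each one earns a unit of retry credit."""
        self._credit = min(self.max_credit, self._credit + 1)

    def try_retry(self) -> bool:
        """Spend credit for one retry, or refuse if the budget is exhausted."""
        if self._credit >= self.requests_per_retry:
            self._credit -= self.requests_per_retry
            return True
        return False


budget = RetryBudget(requests_per_retry=10)
for _ in range(25):
    budget.record_request()

decisions = [budget.try_retry() for _ in range(4)]
print(decisions)
```

After 25 first attempts the budget permits two retries and refuses the rest, so retry volume stays proportional to real traffic instead of growing with backend slowness.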

What to Do Next

Problem: Your cache is probably measured as a latency tool, not as a database safety boundary. Start by charting hit rate, miss rate, fill latency, stale serves, evictions, and backend queries caused by misses on the same dashboard.
Solution: Put a controlled workflow on every expensive miss: coalesce by key, check backend budget, serve stale when allowed, apply TTL jitter, and emit a structured incident signal when protection logic activates.
Proof: Test the failure directly. Run a game day that expires the top 1,000 keys, disables one cache node, or deploys a changed key prefix in staging. The pass condition is not zero errors; it is that the database remains inside its concurrency and latency budget.
Action: Classify cached data into three contracts: must be fresh, may be briefly stale, and may degrade. Then make the miss path enforce those contracts in code instead of relying on humans to remember them during an incident.
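
As a sketch of what enforcing those contracts in code could look like, the example below maps key families to contracts and decides what a caller receives when a fill is not allowed. The family names, the `degraded_response` function, and the response labels are illustrative assumptions; unknown families are treated as must-be-fresh so that the default fails closed.

```python
from enum import Enum
from typing import Any, Optional, Tuple


class Contract(Enum):
    MUST_BE_FRESH = "must_be_fresh"
    MAY_BE_STALE = "may_be_stale"
    MAY_DEGRADE = "may_degrade"


# Hypothetical classification; a real system would own this mapping per service.
CONTRACTS = {
    "payment_auth": Contract.MUST_BE_FRESH,
    "profile": Contract.MAY_BE_STALE,
    "recommendations": Contract.MAY_DEGRADE,
}


def degraded_response(family: str, stale_value: Optional[Any],
                      fallback: Any = None) -> Tuple[str, Any]:
    """Decide what to return when the backend budget forbids a fresh fill."""
    contract = CONTRACTS.get(family, Contract.MUST_BE_FRESH)  # unknown families fail closed
    if contract is Contract.MUST_BE_FRESH:
        return ("error", None)  # never serve stale data for this class
    if stale_value is not None:
        return ("stale", stale_value)
    if contract is Contract.MAY_DEGRADE:
        return ("degraded", fallback)  # placeholder response instead of an error
    return ("error", None)


print(degraded_response("payment_auth", stale_value="approved"))
print(degraded_response("profile", stale_value={"name": "cached"}))
print(degraded_response("recommendations", stale_value=None, fallback=[]))
```

Keeping the mapping in one place means the decision made during an incident is the one agreed on beforehand, not one improvised under pressure.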

Rajiv
