330 Redundant Data Centers All Failed Simultaneously — Because They Were Identical
Redundancy is a solution to independent failure. It does nothing when the failure is correlated. Cloudflare operates more than 330 data centers. In November 2025, a single auto-generated config file crashed the bot mitigation service at all of them simultaneously. The redundancy was real. The outage was total. Both things were true because every node was running identical code with the same defect, so no healthy replica was left for the redundancy to fall back on.
Situation
Distributed systems reliability engineering has centered on redundancy for two decades. N+1 capacity, geographic distribution, active-active multi-region deployments — the playbook is well-established, and for hardware failures, random software crashes, and localized network partitions, it works. Systems that have internalized this model have materially better uptime than those that have not.
The math behind it is straightforward: if two independent components each have a 0.1% probability of failure on any given day, the probability of both failing simultaneously is 0.0001%. Multiply across enough independent nodes and the reliability numbers become very good.
The word doing the work in that calculation is “independent.”
| | Independent failures | Correlated failures |
|---|---|---|
| Root cause | Separate — hardware variance, random crashes | Shared — same code, same config, same defect |
| Redundancy effectiveness | High — protects directly | None — all nodes fail together |
| Detection | Gradual — partial degradation first | Sudden — full fleet impact at once |
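The arithmetic behind that contrast can be checked with a short sketch. It uses the 0.1% per-day figure from the text and compares two idealized extremes: fully independent nodes and a fleet taken down by one shared defect. The function names are illustrative, and real fleets sit somewhere between these two cases.

```python
def p_all_fail_independent(p: float, n: int) -> float:
    """Probability that all n nodes fail on the same day, assuming independence."""
    return p ** n


def p_all_fail_correlated(p: float, n: int) -> float:
    """Probability that all n nodes fail when one shared defect hits every node.

    With a perfectly shared cause the fleet behaves like a single node,
    so adding nodes does not reduce the probability.
    """
    return p


p = 0.001  # 0.1% per day, the figure used in the text
print(f"independent, 2 nodes:  {p_all_fail_independent(p, 2):.6%}")
print(f"correlated, 330 nodes: {p_all_fail_correlated(p, 330):.6%}")
```

Under independence the joint probability shrinks with every added node. Under full correlation the node count drops out of the result entirely.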
The Problem
Software defects are not independent events. A config change, a dependency update, a new library version — these roll out to all nodes in a fleet, not to a random sample. When the defect lives in code or configuration that every node runs, every node fails at the same moment. The independence assumption collapses, and with it the reliability guarantees that redundancy provides.
Cloudflare’s bot mitigation service used a config file auto-generated from live threat intelligence. Under production load, the file grew past the size limits that had been validated in development and staging. In those environments the file never reached the problematic size: traffic volume was lower, the threat intelligence feed was smaller, and the failing code path was never exercised.
When the file crossed the size limit under real production load, the service crashed. And because every data center was running the same version of the same service consuming the same auto-generated config, every data center crashed at the same time.
| Failure point | What broke | Why it matters |
|---|---|---|
| Auto-generated config with no size enforcement | File grew past validated limit under production load | Generation pipeline produced invalid output without signaling it |
| Staging environment gap | Dev and staging never saw the problematic size | Size-dependent defects are invisible below the threshold |
| Homogeneous fleet | Identical code and config on all 330+ nodes | One defect becomes 330 simultaneous failures with no partial degradation |
The central question this forces: when your redundancy architecture assumes independent failures, what is your actual blast radius for a correlated one?
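One rough way to put a number on that question is to measure how much of the fleet shares each config version. The sketch below assumes a hypothetical inventory mapping node names to the config version each node is running. The node names and version labels are made up for illustration.

```python
from collections import Counter


def version_shares(node_versions: dict[str, str]) -> dict[str, float]:
    """Fraction of the fleet running each config version.

    A defect in a given version can take down, at most, the share of
    nodes running that version.
    """
    total = len(node_versions)
    if total == 0:
        return {}
    counts = Counter(node_versions.values())
    return {version: count / total for version, count in counts.items()}


# Hypothetical inventories: every node on one version vs. a 10-node canary.
homogeneous = {f"dc-{i}": "cfg-v42" for i in range(330)}
staged = {f"dc-{i}": ("cfg-v43" if i < 10 else "cfg-v42") for i in range(330)}

print(version_shares(homogeneous))        # one version holds the whole fleet
print(version_shares(staged)["cfg-v43"])  # the new version reaches about 3%
```

In the homogeneous case a single bad artifact reaches 100% of nodes. In the staged case a defect in the new version is bounded by the canary's share of the fleet.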
Core Concept
flowchart TD
A[threat intelligence feed] --> B[config auto-generation pipeline]
B --> C[config file — identical version distributed to all DCs]
C --> D1[DC 1 — bot mitigation service]
C --> D2[DC 2 — bot mitigation service]
C --> D3[DC 330 — bot mitigation service]
D1 --> E[crash — size limit exceeded]
D2 --> E
D3 --> E
The auto-generation pipeline is the single point of correlation: not a single point of failure in the traditional sense, but the single origin of the defect. A defect in its output is a defect in every consumer simultaneously.
The mitigations that address correlated failure are different from those that address independent failure:
- Validate at generation time, not at runtime. A config file that will crash the service at size N should be caught before it reaches size N. Schema and size validation in the generation pipeline converts a runtime failure into a build-time rejection, which is always preferable.
  Confirm: the generation pipeline rejects configs that exceed defined size or schema constraints before they are distributed.
- Require canary deployment for any auto-generated config. Deploy the new config to a small, representative subset of nodes receiving real production traffic and observe behavior before fleet-wide rollout. If the config crashes the service, the blast radius is bounded.
  Confirm: the canary slice receives production-volume traffic, not synthetic or low-volume testing traffic.
- Add operational diversity where the config update latency budget allows. Running different config versions on different subsets of the fleet means no single generation artifact reaches 100% of nodes simultaneously.
  Confirm: fleet diversity is tracked and maintained as an operational metric, not treated as a one-time configuration decision.
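The first mitigation, rejecting a bad artifact in the generation pipeline, might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the JSON format, the `rules` and `id` field names, and both limits are hypothetical, not Cloudflare's actual config format or thresholds.

```python
import json

MAX_CONFIG_BYTES = 1_000_000  # hypothetical limit; use what the consumer was tested with
MAX_RULES = 200               # hypothetical limit on entries the consumer can load


class ConfigRejected(Exception):
    """Raised when a generated config must not be distributed."""


def validate_generated_config(raw: bytes) -> dict:
    """Check size and a minimal schema before the config leaves the pipeline."""
    if len(raw) > MAX_CONFIG_BYTES:
        raise ConfigRejected(f"{len(raw)} bytes exceeds limit of {MAX_CONFIG_BYTES}")
    try:
        config = json.loads(raw)
    except ValueError as exc:  # covers invalid JSON and invalid UTF-8
        raise ConfigRejected(f"not parseable as JSON: {exc}") from exc
    rules = config.get("rules") if isinstance(config, dict) else None
    if not isinstance(rules, list):
        raise ConfigRejected("expected a top-level 'rules' list")
    if len(rules) > MAX_RULES:
        raise ConfigRejected(f"{len(rules)} rules exceeds limit of {MAX_RULES}")
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict) or not isinstance(rule.get("id"), str):
            raise ConfigRejected(f"rule {index} is missing a string 'id'")
    return config


# Usage: the pipeline publishes only what passes validation.
candidate = json.dumps({"rules": [{"id": "r1"}, {"id": "r2"}]}).encode()
print(len(validate_generated_config(candidate)["rules"]))
```

The limits should match what the consuming service was actually tested against, so that the pipeline refuses exactly the artifacts the service cannot load.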
In Practice
Cloudflare’s incident analysis frames this explicitly as correlated failure and documents it as a distinct reliability category from the independent hardware and network failures that redundancy addresses. Their post-incident work centers on validation at generation time and staged rollout — both of which address the root cause (homogeneous fleet, shared defect) rather than the symptom (100% outage vs. the expected partial degradation).
The staging environment gap is worth examining as a separate pattern. Development and staging environments are routinely configured with lower traffic volumes, smaller datasets, and synthetic inputs. This makes them structurally unable to exercise behaviors that only appear at production scale — size limits, throughput-dependent code paths, resource pressure that doesn’t manifest until the load is real. Teams often treat “passes staging” as a proxy for “safe to deploy.” Cloudflare’s outage is a clear counterexample: the defect was invisible in staging not because staging was poorly designed but because it was a fundamentally different operating environment.
The auto-generation pattern itself is worth auditing. Configs generated from live data feeds have a property that manually authored configs do not: their content can change continuously without a code review or a human approval step. Size, complexity, and schema violations that would be caught in a review can accumulate silently in generated output until the violation crosses a threshold that breaks something.
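Because generated output drifts without review, one possible audit aid is a check that flags an artifact before it hits a hard limit. The sketch below assumes a hypothetical history of artifact sizes, oldest first. The 80% warning ratio and 1.5x growth threshold are arbitrary illustrative defaults, not recommended values.

```python
def headroom_alerts(
    size_history: list[int],
    limit: int,
    warn_ratio: float = 0.8,   # assumed: warn at 80% of the hard limit
    max_growth: float = 1.5,   # assumed: warn on a 1.5x jump between generations
) -> list[str]:
    """Return warnings for the most recent generated artifact."""
    alerts: list[str] = []
    if not size_history:
        return alerts
    latest = size_history[-1]
    if latest >= warn_ratio * limit:
        alerts.append(f"size {latest} is within {1 - warn_ratio:.0%} of limit {limit}")
    if len(size_history) >= 2 and size_history[-2] > 0:
        growth = latest / size_history[-2]
        if growth >= max_growth:
            alerts.append(f"size grew {growth:.2f}x since the previous generation")
    return alerts


print(headroom_alerts([400, 420, 900], limit=1000))
```

A warning of this kind gives a human the review step that generated configs otherwise skip, while there is still headroom left.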
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Canary misses the defect | Canary traffic volume too low to trigger size-dependent failure | Canary must receive production-representative traffic |
| Validation doesn’t cover novel failures | Size limit enforced but schema violation goes unchecked | Schema validation must evolve with the config format |
| Staged rollout delays security response | Threat intelligence update needs immediate propagation | Define explicit fast-path criteria with compensating controls |
| Operational diversity adds complexity | Multiple config versions require support across the fleet | Treat diversity as a cost with a known risk benefit, not an afterthought |
There is a genuine tension between security config velocity and correlated failure risk. Threat intelligence is most valuable when it is current; staged rollouts delay propagation. There is no clean resolution — only an explicit, documented decision about which risk to accept and under what conditions.
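One way to make that decision explicit is to write both rollout paths down as data. The sketch below is a hypothetical policy: the stage names and fleet fractions are illustrative, and `deploy` and `healthy` are stand-in callbacks for whatever a real pipeline uses. The fast path has fewer stages but keeps two compensating controls, mandatory generation-time validation and a canary.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    fleet_fraction: float  # cumulative share of the fleet after this stage


# Hypothetical policies; fractions are illustrative only.
STANDARD = (Stage("canary", 0.01), Stage("early", 0.10), Stage("fleet", 1.0))
FAST_PATH = (Stage("canary", 0.05), Stage("fleet", 1.0))


def plan_rollout(urgent: bool, validated: bool) -> tuple[Stage, ...]:
    """Pick a rollout plan; validation is required on both paths."""
    if not validated:
        raise ValueError("config must pass generation-time validation first")
    return FAST_PATH if urgent else STANDARD


def run_rollout(
    stages: Sequence[Stage],
    deploy: Callable[[Stage], None],
    healthy: Callable[[Stage], bool],
) -> str:
    """Deploy stage by stage and stop at the first unhealthy stage."""
    for stage in stages:
        deploy(stage)
        if not healthy(stage):
            return f"halted at {stage.name}"
    return "complete"


# Usage with stand-in callbacks: the canary reports unhealthy, so rollout stops.
deployed: list[str] = []
result = run_rollout(
    plan_rollout(urgent=True, validated=True),
    deploy=lambda stage: deployed.append(stage.name),
    healthy=lambda stage: stage.name != "canary",
)
print(result, deployed)
```

Writing the fast path as a named policy means the accepted risk is reviewed once in advance instead of being improvised during an incident.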
What to Do Next
- Problem: Auto-generated config that passes staging can silently exceed limits under production load, crashing the service fleet-wide because every node runs the same version.
- Solution: Enforce size and schema constraints at generation time, and require a representative canary stage — with real production traffic — before any auto-generated config reaches the full fleet.
- Proof: Cloudflare’s post-incident analysis documents both the failure mode and the mitigations. The specific pattern — auto-generated config, staging gap, homogeneous fleet — is common enough that auditing your own pipeline is not premature optimization.
- Action: Identify every auto-generated config in your infrastructure. For each: what is the maximum safe size, is that limit enforced before the config reaches production, and does the deployment pipeline require a canary stage with production-representative traffic?
Redundancy and correlated failure resistance are not the same property. Engineering for one does not buy you the other. The teams that discover this through a post-incident review have paid a high price for a lesson that is not actually hard to apply in advance.