330 Redundant Data Centers All Failed Simultaneously — Because They Were Identical

Redundancy is a solution to independent failure. It does nothing when the failure is correlated. Cloudflare operates more than 330 data centers. In November 2025, a single auto-generated config file crashed the bot mitigation service at all of them simultaneously. The redundancy was real. The outage was total. Both things were true because every node was running identical code with the same defect, so there was nothing for the redundancy to protect against.

Situation

Distributed systems reliability engineering has centered on redundancy for two decades. N+1 capacity, geographic distribution, active-active multi-region deployments — the playbook is well-established, and for hardware failures, random software crashes, and localized network partitions, it works. Systems that have internalized this model have materially better uptime than those that have not.

The math behind it is straightforward: if two independent components each have a 0.1% probability of failure on any given day, the probability of both failing on the same day is 0.1% × 0.1% = 0.0001%. Each additional independent node multiplies in another small factor, so the probability of losing the whole fleet at once shrinks geometrically.

The word doing the work in that calculation is “independent.”
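The calculation above can be checked directly, along with what happens when independence is lost. This is a minimal sketch, not a reliability model: the function name and the `shared_defect_prob` parameter are assumptions introduced for illustration, and the shared defect is treated as a single event that takes down every node at once.

```python
def prob_all_fail(per_node_prob: float, nodes: int,
                  shared_defect_prob: float = 0.0) -> float:
    """Probability that every node is down on a given day.

    per_node_prob: chance that one node fails from its own, independent cause.
    shared_defect_prob: chance that a defect common to all nodes fires
    (hypothetical parameter; 0.0 means failures are fully independent).
    """
    independent_all = per_node_prob ** nodes
    # Either the shared defect fires and everything fails, or it does not
    # and every node has to fail on its own.
    return shared_defect_prob + (1.0 - shared_defect_prob) * independent_all


two_independent = prob_all_fail(0.001, nodes=2)
with_shared_defect = prob_all_fail(0.001, nodes=2, shared_defect_prob=0.001)

print(f"independent:        {two_independent:.6%}")
print(f"with shared defect: {with_shared_defect:.6%}")
```

With a shared defect that fires at the same 0.1% rate, the chance of total failure is roughly 0.1%, about a thousand times the independent figure. Adding nodes does not lower it, because the shared term does not depend on the node count.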

	Independent failures	Correlated failures
Root cause	Separate — hardware variance, random crashes	Shared — same code, same config, same defect
Redundancy effectiveness	High — protects directly	None — all nodes fail together
Detection	Gradual — partial degradation first	Sudden — full fleet impact at once

The Problem

Software defects are not independent events. A config change, a dependency update, a new library version — these roll out to all nodes in a fleet, not to a random sample. When the defect lives in code or configuration that every node runs, every node fails at the same moment. The independence assumption collapses, and with it the reliability guarantees that redundancy provides.

Cloudflare’s bot mitigation service used a config file auto-generated from live threat intelligence. Under production load, the file grew past the size limits that had been validated in development and staging. In those environments, the file never reached the problematic size — traffic volume was lower, the threat intelligence feed was smaller, the problematic code path was never exercised.

When the file crossed the size limit under real production load, the service crashed. And because every data center was running the same version of the same service consuming the same auto-generated config, every data center crashed at the same time.
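The mechanism can be reproduced in a few lines. The sketch below is a toy model, not Cloudflare's code: `MAX_ENTRIES`, `Node`, and `push_to_fleet` are hypothetical names, and the limit value is arbitrary. It shows only that an identical check on identical input gives an identical result on every node.

```python
from dataclasses import dataclass

MAX_ENTRIES = 200  # hypothetical hard limit built into the service


class ConfigTooLarge(Exception):
    """Raised when a config exceeds the limit the service was built for."""


@dataclass
class Node:
    name: str
    healthy: bool = True

    def load(self, config_entries: list[str]) -> None:
        # Every node runs the same check against the same limit.
        if len(config_entries) > MAX_ENTRIES:
            self.healthy = False
            raise ConfigTooLarge(f"{self.name}: {len(config_entries)} entries")


def push_to_fleet(nodes: list[Node], config_entries: list[str]) -> int:
    """Push one config to every node; return how many nodes stayed healthy."""
    for node in nodes:
        try:
            node.load(config_entries)
        except ConfigTooLarge:
            pass  # this node is down; continue to see the fleet-wide effect
    return sum(node.healthy for node in nodes)


fleet = [Node(f"dc-{i}") for i in range(1, 331)]
oversized = [f"rule-{i}" for i in range(MAX_ENTRIES + 1)]
print(push_to_fleet(fleet, oversized), "of", len(fleet), "nodes healthy")
```

One entry over the limit leaves zero healthy nodes. No node count in `fleet` changes that outcome, which is the sense in which redundancy had nothing to protect against.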

Failure point	What broke	Why it matters
Auto-generated config with no size enforcement	File grew past validated limit under production load	Generation pipeline produced invalid output without signaling it
Staging environment gap	Dev and staging never saw the problematic size	Size-dependent defects are invisible below the threshold
Homogeneous fleet	Identical code and config on all 330+ nodes	One defect becomes 330 simultaneous failures with no partial degradation

The central question this forces: when your redundancy architecture assumes independent failures, what is your actual blast radius for a correlated one?

Core Concept

```mermaid
flowchart TD
    A[threat intelligence feed] --> B[config auto-generation pipeline]
    B --> C[config file - identical version distributed to all DCs]
    C --> D1[DC 1 - bot mitigation service]
    C --> D2[DC 2 - bot mitigation service]
    C --> D3[DC 330 - bot mitigation service]
    D1 --> E[crash - size limit exceeded]
    D2 --> E
    D3 --> E
```

The auto-generation pipeline is the single point of correlation — not the single point of failure in the traditional sense, but the single origin of defect. A defect in its output is a defect in every consumer simultaneously.

The mitigations that address correlated failure are different from those that address independent failure:

Validate at generation time, not at runtime. A config file that will crash the service at size N should be caught before it reaches size N. Schema and size validation in the generation pipeline converts a runtime failure into a build-time rejection — always preferable.
Confirm: the generation pipeline rejects configs that exceed defined size or schema constraints before they are distributed.
Require canary deployment for any auto-generated config. Deploy the new config to a small, representative subset of nodes receiving real production traffic and observe behavior before fleet-wide rollout. If the config crashes the service, the blast radius is bounded.
Confirm: the canary slice receives production-volume traffic, not synthetic or low-volume testing traffic.
Add operational diversity where the config update latency budget allows. Running different config versions on different subsets of the fleet means no single generation artifact reaches 100% of nodes simultaneously.
Confirm: fleet diversity is tracked and maintained as an operational metric, not treated as a one-time configuration decision.
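The first mitigation, rejecting a bad artifact inside the generation pipeline, might look like the sketch below. It assumes a JSON config with `version` and `rules` keys and a 64 KiB ceiling. The format, the keys, the limit, and the names are all hypothetical, chosen only to make the idea concrete.

```python
import json

MAX_BYTES = 64 * 1024                 # hypothetical size ceiling
REQUIRED_KEYS = {"version", "rules"}  # hypothetical schema


class ConfigRejected(Exception):
    """The generated config must not be distributed."""


def validate_generated_config(raw: bytes) -> dict:
    """Check size and shape before distribution; return the parsed config."""
    if len(raw) > MAX_BYTES:
        raise ConfigRejected(f"size {len(raw)} exceeds limit {MAX_BYTES}")
    try:
        parsed = json.loads(raw)
    except ValueError as exc:  # covers invalid JSON and invalid UTF-8
        raise ConfigRejected(f"not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ConfigRejected("top level must be an object")
    missing = REQUIRED_KEYS - parsed.keys()
    if missing:
        raise ConfigRejected(f"missing keys: {sorted(missing)}")
    if not isinstance(parsed["rules"], list):
        raise ConfigRejected("'rules' must be a list")
    return parsed
```

The point of raising in the pipeline is that a rejected artifact never leaves it: the fleet keeps running the previous config, and the failure shows up as a failed generation job rather than as a crashed service.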

In Practice

Cloudflare’s incident analysis frames this explicitly as correlated failure and documents it as a distinct reliability category from the independent hardware and network failures that redundancy addresses. Their post-incident work centers on validation at generation time and staged rollout — both of which address the root cause (homogeneous fleet, shared defect) rather than the symptom (100% outage vs. the expected partial degradation).
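Staged rollout can be sketched as a loop that widens the audience only while every updated node stays healthy. This is an illustration under assumptions, not a description of Cloudflare's tooling: `apply_config` and `is_healthy` are hypothetical callbacks, the stage fractions are arbitrary, and a real pipeline would wait for a soak period under production traffic before each health check.

```python
from typing import Callable, Sequence


def staged_rollout(nodes: Sequence[str],
                   apply_config: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   stage_fractions: Sequence[float] = (0.01, 0.10, 1.0)) -> list[str]:
    """Apply a config in growing stages and stop at the first unhealthy stage.

    Returns the nodes that received the config.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        target = max(1, int(len(nodes) * fraction))
        batch = nodes[len(updated):target]
        for node in batch:
            apply_config(node)
        updated.extend(batch)
        if not all(is_healthy(node) for node in updated):
            break  # halt here: the blast radius is bounded by this stage
    return updated


# A config that crashes every node it reaches.
crashed: set[str] = set()
fleet = [f"dc-{i}" for i in range(1, 331)]
reached = staged_rollout(fleet, crashed.add, lambda n: n not in crashed)
print(len(reached), "of", len(fleet), "nodes received the bad config")
```

With these fractions, a config that crashes on load stops after the first stage, so only the canary slice is affected instead of the full fleet.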

The staging environment gap is worth examining as a separate pattern. Development and staging environments are routinely configured with lower traffic volumes, smaller datasets, and synthetic inputs. This makes them structurally unable to exercise behaviors that only appear at production scale — size limits, throughput-dependent code paths, resource pressure that doesn’t manifest until the load is real. Teams often treat “passes staging” as a proxy for “safe to deploy.” Cloudflare’s outage is a clear counterexample: the defect was invisible in staging not because staging was poorly designed but because it was a fundamentally different operating environment.

The auto-generation pattern itself is worth auditing. Configs generated from live data feeds have a property that manually authored configs do not: their content can change continuously without a code review or a human approval step. Size, complexity, and schema violations that would be caught in a review can accumulate silently in generated output until the violation crosses a threshold that breaks something.
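One way to make that silent accumulation visible is to report headroom on every generation run instead of only failing at the limit. The function below is a minimal sketch. The name, the 80% warning ratio, and the three status strings are assumptions, not values from the incident.

```python
def size_headroom(current_bytes: int, limit_bytes: int,
                  warn_ratio: float = 0.8) -> str:
    """Classify a generated artifact's size relative to its known limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = current_bytes / limit_bytes
    if ratio > 1.0:
        return "reject"  # over the limit: do not distribute
    if ratio >= warn_ratio:
        return "warn"    # still valid, but the trend needs a human look
    return "ok"


for size in (10_000, 55_000, 70_000):
    print(size, size_headroom(size, limit_bytes=65_536))
```

The "warn" band stands in for the review step that generated output otherwise skips: it surfaces growth while there is still room to act on it.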

Where It Breaks

Failure mode	Trigger	Fix
Canary misses the defect	Canary traffic volume too low to trigger size-dependent failure	Canary must receive production-representative traffic
Validation doesn’t cover novel failures	Size limit enforced but schema violation goes unchecked	Schema validation must evolve with the config format
Staged rollout delays security response	Threat intelligence update needs immediate propagation	Define explicit fast-path criteria with compensating controls
Operational diversity adds complexity	Multiple config versions require support across the fleet	Treat diversity as a cost with a known risk benefit, not an afterthought

There is a genuine tension between security config velocity and correlated failure risk. Threat intelligence is most valuable when it is current; staged rollouts delay propagation. There is no clean resolution — only an explicit, documented decision about which risk to accept and under what conditions.
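Writing the decision down can be as simple as encoding it. The sketch below is one possible policy, not a recommendation and not anything Cloudflare has published: the three inputs, the path names, and the rule that a tested rollback is the compensating control for the fast path are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutDecision:
    path: str    # "fast", "staged", or "blocked"
    reason: str


def choose_rollout_path(threat_is_active: bool,
                        passed_generation_checks: bool,
                        rollback_tested: bool) -> RolloutDecision:
    """Pick a rollout path from explicit, documented criteria."""
    if not passed_generation_checks:
        return RolloutDecision("blocked", "failed generation-time validation")
    if threat_is_active and rollback_tested:
        return RolloutDecision("fast", "active threat; tested rollback compensates")
    if threat_is_active:
        return RolloutDecision("staged", "active threat, but no tested rollback")
    return RolloutDecision("staged", "no urgency; use canary stages")


print(choose_rollout_path(True, True, True))
print(choose_rollout_path(True, True, False))
```

Whatever the actual criteria are, expressing them as code makes the accepted risk reviewable: skipping the canary stage becomes a named branch with a stated reason rather than an ad hoc exception.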

What to Do Next

Problem: Auto-generated config that passes staging can silently exceed limits under production load, crashing the service fleet-wide because every node runs the same version.
Solution: Enforce size and schema constraints at generation time, and require a representative canary stage — with real production traffic — before any auto-generated config reaches the full fleet.
Proof: Cloudflare’s post-incident analysis documents both the failure mode and the mitigations. The specific pattern — auto-generated config, staging gap, homogeneous fleet — is common enough that auditing your own pipeline is not premature optimization.
Action: Identify every auto-generated config in your infrastructure. For each: what is the maximum safe size, is that limit enforced before the config reaches production, and does the deployment pipeline require a canary stage with production-representative traffic?
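The three audit questions map onto a small inventory record. This is a sketch of one way to track the answers. The `GeneratedConfig` fields and the example entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneratedConfig:
    name: str
    max_safe_bytes: Optional[int]     # None means nobody knows the limit
    limit_enforced_before_prod: bool
    canary_with_prod_traffic: bool


def audit_gaps(cfg: GeneratedConfig) -> list[str]:
    """Return the audit questions this config currently fails."""
    gaps = []
    if cfg.max_safe_bytes is None:
        gaps.append("maximum safe size unknown")
    if not cfg.limit_enforced_before_prod:
        gaps.append("limit not enforced before production")
    if not cfg.canary_with_prod_traffic:
        gaps.append("no canary stage with production-representative traffic")
    return gaps


inventory = [
    GeneratedConfig("bot-rules", 65_536, True, True),
    GeneratedConfig("geo-blocklist", None, False, True),
]
for cfg in inventory:
    print(cfg.name, audit_gaps(cfg) or "no gaps")
```

An empty list for every entry is the target state. Any non-empty list is a config that can still reach the whole fleet unchecked.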

Redundancy and correlated failure resistance are not the same property. Engineering for one does not buy you the other. The teams that discover this through a post-incident review have paid a high price for a lesson that is not actually hard to apply in advance.

Rajiv
