On November 18, 2025, Cloudflare's network failed to serve much of its traffic for approximately three hours and ten minutes. The cause, per Cloudflare's published postmortem, was a subtle regression introduced by a routine ClickHouse cluster security improvement that made table access explicit for users. A metadata query that historically returned a clean list of columns from the default database suddenly started returning duplicate rows from the underlying r0 shard databases as well. Those rows fed into a "feature file" used by Cloudflare's Bot Management module. The duplicates ballooned the file from around 60 features to more than 200, exceeding the hard-coded limit of 200 features inside the core proxy. The file was regenerated and propagated every five minutes; when proxies tried to load an oversized version, the Bot Management module failed, and because Bot Management is in the request path for every customer, the proxy returned errors for nearly all of Cloudflare's traffic. The initial diagnostic hypothesis was a hyperscale DDoS attack; correctly identifying the actual cause and rolling back to a known-good feature file took the bulk of the outage window.
Why is a "feature file" in the request path of every customer?
Cloudflare's Bot Management runs an ML model that scores incoming requests for bot likelihood. The model needs a consistent set of input features (request headers, JA3 fingerprints, ASN, behavioral signals), and those features are defined in a feature file generated centrally and propagated to every proxy node in the network. Because Bot Management is enabled by default on most Cloudflare zones and feeds into the core security decision for every HTTP request, the feature file is on the hot path for nearly all traffic. The architectural choice to keep the file lean (around 60 features) and the proxy's hard-coded limit on feature count (200, comfortably above the working size but below what the duplicated metadata produced on November 18) were rational individually: each enforced bounded latency and memory footprint at scale. The interaction between them was not. When the file silently grew past the limit, the proxy's load step failed and the module crashed rather than degrading gracefully.
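The crash pattern is easy to reproduce in miniature. Here is a minimal Rust sketch (Cloudflare's core proxy is written in Rust, but the names, file format, and limit here are illustrative assumptions, not the actual FL2 code):

// Illustrative sketch of the hard-fail load path; the names, file format,
// and limit are assumptions for illustration, not Cloudflare's FL2 code.
const MAX_FEATURES: usize = 200; // preallocated capacity, sized for ~60 features

fn parse_feature_file(contents: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = contents
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(str::to_string)
        .collect();
    if features.len() > MAX_FEATURES {
        // An oversized file is treated as unrecoverable rather than skippable.
        return Err(format!(
            "{} features exceeds hard-coded limit {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(features)
}

fn refresh_features(contents: &str) -> Vec<String> {
    // The fatal pattern: unwrap() turns a well-formed but oversized input
    // into a panic on every proxy node that refreshes the file.
    parse_feature_file(contents).unwrap()
}

fn main() {
    // A file whose entries were duplicated across shards trips the limit
    // and panics the module instead of degrading.
    let duplicated = "some_feature\n".repeat(240);
    let _ = refresh_features(&duplicated);
}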
-- Conceptual root cause: a feature-pipeline metadata query that filtered on
-- table name only, with no database filter
SELECT name, type
FROM system.columns
WHERE table = 'bot_features'
ORDER BY name;
-- Before the permissions change, the querying account could see only the
-- default database, so the missing database filter was harmless. After the
-- change, the account could also see the underlying r0 shard databases, and
-- the same query returned a row for every same-named table there,
-- duplicating every entry and ballooning the generated feature file past the
-- proxy's hard-coded limit.
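The generator side could also have neutralized the duplicates before they ever left the pipeline. A Rust sketch of that guard, with invented names, deduplicating rows and refusing to publish past the consumer's limit:

// Hypothetical generator-side guard (names invented): deduplicate metadata
// rows and refuse to publish a feature file that breaches the consumer limit.
use std::collections::BTreeMap;

const CONSUMER_LIMIT: usize = 200; // must track the proxy's hard-coded capacity

fn build_feature_file(rows: &[(String, String)]) -> Result<String, String> {
    // system.columns can return the same (name, type) more than once when a
    // query matches several databases; keep exactly one entry per name.
    let mut unique: BTreeMap<&str, &str> = BTreeMap::new();
    for (name, dtype) in rows {
        unique.insert(name.as_str(), dtype.as_str());
    }
    if unique.len() > CONSUMER_LIMIT {
        // Fail the pipeline run loudly instead of shipping a poisoned file.
        return Err(format!(
            "refusing to publish: {} features exceeds consumer limit {}",
            unique.len(),
            CONSUMER_LIMIT
        ));
    }
    Ok(unique
        .iter()
        .map(|(name, dtype)| format!("{name}\t{dtype}"))
        .collect::<Vec<_>>()
        .join("\n"))
}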
What was the security improvement that caused this?
The change was a ClickHouse permissions update intended to make table access explicit per user. Before the change, the accounts behind the feature pipeline could see metadata only for the default database; the underlying r0 shard databases were invisible to them, so a system.columns query filtered on table name alone still returned one row per column. After the change, those accounts could also see the r0 databases, and the same unfiltered query returned duplicate rows for every table whose name existed in both the default and r0 namespaces. From a security-posture standpoint the change was correct (explicit table access is the right principle), but the shift in metadata-query results was not surfaced by the change's review, because no one anticipated that downstream pipelines would parse system.columns without a database filter and treat duplicate rows as additive feature definitions. This is a classic case of an intended security hardening producing an unintended downstream operational failure, and the postmortem is candid that the change went through normal review without anyone catching the metadata query's new behavior.
Why did the initial DDoS hypothesis cost so much time?
Cloudflare's incident response is calibrated for very large-scale traffic anomalies, and the early symptom set (error rates climbing globally across many customers simultaneously) pattern-matched onto a hyperscale DDoS. Responders pulled DDoS playbooks first, which is the right first guess given how often Cloudflare faces large attacks. The fluctuating nature of the failure reinforced that theory: because the ClickHouse cluster was only partially updated, good and bad feature files alternated out of the five-minute generation cycle, so errors receded and returned in waves, much as an attack ramping up and down would look. The actual cause was not visible in traffic shape; it was visible only in proxy module logs, which showed the Bot Management module crashing on feature file load. Reconciling those signals took time because the symptom (global error rate) and the cause (a specific module failure inside the proxy) lived in different observability surfaces. Once responders identified the feature file's size as the trigger, halting generation and rolling back to a known-good version restored service. The postmortem frames this as a detection latency problem: the right signal existed, but it was not the signal on-call was watching.
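One way to shorten that reconciliation is a triage check that weighs the competing hypotheses against signals from both observability surfaces at once. A rough Rust sketch, with invented signal names and thresholds:

// Hypothetical triage heuristic; signal names and thresholds are invented.
// A global error spike with flat ingress traffic and rising module crashes
// points at an internal change, not an attack.
struct Signals {
    error_rate_multiple: f64,      // global 5xx rate relative to baseline
    ingress_traffic_multiple: f64, // request volume relative to baseline
    module_panics_per_min: u64,    // crash-loop counter from proxy module logs
}

fn first_hypothesis(s: &Signals) -> &'static str {
    if s.error_rate_multiple > 5.0 && s.ingress_traffic_multiple > 5.0 {
        "volumetric attack: errors track a traffic surge"
    } else if s.error_rate_multiple > 5.0 && s.module_panics_per_min > 0 {
        "internal regression: global errors with flat traffic and crashing modules"
    } else {
        "inconclusive: keep both playbooks open"
    }
}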
What does the "Fail Small" remediation look like?
Cloudflare announced "Code Orange: Fail Small" in late November 2025 as a top-priority initiative to make the network more resilient to changes that affect every customer. Key components include a Health Mediated Deployment (HMD) system where every team responsible for a service defines success and failure indicators, with automatic rollback procedures triggered when those indicators trip. The HMD pattern is essentially staged canary deployment with health-based gates, applied to feature file rollouts and configuration pushes in addition to code deployments. Cloudflare also called out replacing incorrectly applied hard-fail logic across critical data-plane components so systems log errors and default to a known-good state rather than dropping requests outright — directly addressing the Bot Management proxy module's behavior on November 18. Break-glass procedures were reviewed for circular dependencies; the postmortem alluded to break-glass tools that themselves depended on the same network the team was trying to recover, which is a coordination hazard worth eliminating before the next incident.
Why did a December 5 incident follow so quickly?
On December 5, 2025, less than three weeks after the November outage, Cloudflare experienced a second incident in which approximately 28% of applications behind the network failed for about 25 minutes. The December postmortem characterized the trigger as a deployment, intended to mitigate a security issue affecting customers, that instead propagated to the entire network and led to errors for nearly all customers. The pattern across both incidents, a change pushed broadly that altered core proxy behavior, is what motivated the Fail Small framing: the structural cause is that some classes of changes still propagated globally without health-mediated gates. Cloudflare's December postmortem and the November Fail Small announcement together represent a deliberate shift toward narrower deployment scopes, health-gated rollouts, and explicit blast-radius tracking for every config or code change that touches the data plane.
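A minimal Rust sketch of the health-mediated gate idea; the stage progression, indicator shape, and rollback behavior are assumptions, not Cloudflare's published HMD design:

// Hypothetical health-mediated rollout gate; stage names, indicator shape,
// and thresholds are assumptions rather than the actual HMD system.
#[derive(Debug, PartialEq)]
enum Outcome {
    FullyDeployed,
    RolledBackAt(&'static str),
}

fn health_mediated_deploy(
    stages: &[&'static str],          // e.g. canary -> one colo -> region -> global
    health_ok: impl Fn(&str) -> bool, // team-defined success/failure indicator
    rollback: impl Fn(&str),          // restore the previous version in a scope
) -> Outcome {
    for &stage in stages {
        // Deploy to this scope, then hold until the indicator settles.
        if !health_ok(stage) {
            // Fail small: roll back this scope and every scope deployed
            // before it, then stop the rollout entirely.
            for &deployed in stages.iter().take_while(|&&s| s != stage) {
                rollback(deployed);
            }
            rollback(stage);
            return Outcome::RolledBackAt(stage);
        }
    }
    Outcome::FullyDeployed
}

The blast-radius property both postmortems were missing falls out directly: a change that fails at the canary stage never reaches the global scope.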
What can other teams take from this?
Six lessons worth carrying over. First, audit every hard-coded limit inside critical request-path components for whether the rest of the system can produce inputs that exceed it. Second, when introducing a security hardening change, simulate downstream consumers of the changed interface — not just the immediate consumers, but parsers and pipelines several hops downstream — to catch unintended semantic regressions. Third, treat configuration files that propagate to every node as code, and apply staged canary rollouts with health-mediated gates rather than fan-out-everywhere distribution. Fourth, design data-plane modules to degrade gracefully on bad input — fail open with a known-good fallback rather than fail closed when the failure mode would drop legitimate traffic for every customer. Fifth, calibrate incident response so that DDoS, configuration regression, and dependency outage hypotheses can be distinguished in the first 5 minutes — invest in observability that disambiguates them. Sixth, review break-glass tooling for circular dependencies on the very system the tools are meant to recover.
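The first and third lessons can be enforced mechanically. A Rust sketch of a CI-style gate over a generated config artifact, with illustrative limits and an assumed tab-separated shape:

// Hypothetical CI gate for a propagated config artifact; the limits, the
// tab-separated shape, and all names are illustrative assumptions.
use std::collections::HashSet;

const MAX_ENTRIES: usize = 200;     // must match the consumer's hard limit
const MAX_BYTES: usize = 1_000_000; // illustrative byte ceiling

fn validate_artifact(contents: &str) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    if contents.len() > MAX_BYTES {
        errors.push(format!("artifact is {} bytes, limit {}", contents.len(), MAX_BYTES));
    }
    let mut seen = HashSet::new();
    let mut entries = 0;
    for (i, line) in contents.lines().enumerate() {
        entries += 1;
        // Shape assertion: every entry is "name<TAB>type".
        let mut parts = line.splitn(2, '\t');
        let name = parts.next().unwrap_or("");
        if name.is_empty() || parts.next().is_none() {
            errors.push(format!("line {}: malformed entry", i + 1));
        }
        // Duplicate names are the signature of an upstream query regression.
        if !seen.insert(name.to_string()) {
            errors.push(format!("line {}: duplicate feature '{}'", i + 1, name));
        }
    }
    if entries > MAX_ENTRIES {
        errors.push(format!("{} entries exceeds limit {}", entries, MAX_ENTRIES));
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}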
How Safeguard Helps
Safeguard ingests configuration changes across customer cloud and edge platforms and surfaces changes that touch hot-path data structures whose size or schema is consumed by tier-1 production components. Policy gates block changes to security-tier configuration files that lack explicit size and shape assertions in CI. Griffin AI correlates third-party vendor incident pages with customer workloads in real time, identifying within seconds which internal services depend on which underlying vendor product. For Cloudflare customers, Safeguard maps Worker, Pages, and Bot Management dependencies against the vendor's published incident history and remediation SLAs, producing a continuously updated risk profile rather than the static annual questionnaire most TPRM programs settle for.