Cloud Security

Cloudflare Code Orange Fail Small: What the Resilience Plan Actually Changes

After November and December 2025 outages, Cloudflare declared Code Orange and shipped a Health Mediated Deployment system, break-glass dependency audits, and graceful-degradation rewrites.

Aman Khan
Threat Researcher
7 min read

In late November 2025, after a 2-hour-10-minute outage on November 18 and a 25-minute partial outage on December 5, Cloudflare declared "Code Orange: Fail Small" — an internal top-priority initiative to make the network more resilient to changes that could cause major outages. By early 2026, Cloudflare published a follow-up engineering blog confirming the initiative was complete and detailing the structural changes it produced. The framing is useful for any platform engineering team: "fail small" reframes reliability not as "never fail" but as "when we fail, the blast radius is bounded by design." This post walks through what Code Orange actually shipped, how the changes map onto the November and December incidents, and what other teams can adopt from the playbook.

What is a Health Mediated Deployment?

The headline change is the Health Mediated Deployment (HMD) system. Every Cloudflare team responsible for a production service now defines what indicates success or failure during a rollout — error rates, latency percentiles, downstream service health checks, customer-visible metrics — and the deployment tooling enforces those gates automatically. A rollout that trips a failure indicator triggers automatic rollback rather than continuing to fan out. This is a staged canary pattern with health-mediated gates, but the practical change is that it applies uniformly across code deployments, configuration changes, and feature-file pushes. Before Code Orange, some classes of change — particularly configuration files that propagated to every node in the network — bypassed staged rollout entirely and fanned out globally within seconds. After Code Orange, those propagations route through HMD too.

# Conceptual HMD policy applied to a hot-path config file rollout
deployment:
  artifact: bot-management-feature-file
  stages:
    - name: canary
      scope: 1% nodes
      health_gates: &canary_gates
        - module: bot-management
          metric: load_success_rate
          threshold: 0.999   # gate trips if the rate falls below this
          window: 60s
        - module: proxy
          metric: error_rate
          threshold: 0.001   # gate trips if the rate rises above this
          window: 60s
      rollback_on_failure: true
    - name: partial
      scope: 10% nodes
      health_gates: *canary_gates
      rollback_on_failure: true
    - name: global
      scope: 100% nodes
      health_gates: *canary_gates
      rollback_on_failure: true
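
To make the rollback trigger concrete, here is a minimal sketch of how gate evaluation for one stage could work. The metric readings are faked and the gate shapes simply mirror the YAML above; this is an assumption about how such tooling might be structured, not Cloudflare's implementation.

# Illustrative gate evaluation for a single rollout stage (hypothetical Python sketch)
CANARY_GATES = [
    {"module": "bot-management", "metric": "load_success_rate",
     "threshold": 0.999, "higher_is_better": True},
    {"module": "proxy", "metric": "error_rate",
     "threshold": 0.001, "higher_is_better": False},
]

# Stand-in for a real metrics query over the stage's observation window.
FAKE_READINGS = {
    ("bot-management", "load_success_rate"): 0.9995,
    ("proxy", "error_rate"): 0.004,
}


def stage_healthy(gates, readings):
    """Return True only if every health gate holds; any breach means roll back."""
    for gate in gates:
        value = readings[(gate["module"], gate["metric"])]
        ok = value >= gate["threshold"] if gate["higher_is_better"] else value <= gate["threshold"]
        if not ok:
            return False
    return True


# With the fake readings above, the proxy error rate breaches its gate,
# so the canary stage stops and triggers an automatic rollback.
print("proceed to next stage" if stage_healthy(CANARY_GATES, FAKE_READINGS) else "roll back")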

What about break-glass procedures?

Code Orange explicitly addressed break-glass tooling — the emergency-access mechanisms incident responders use when normal pathways fail. The November and December postmortems revealed cases where break-glass tools themselves depended on parts of the network the team was trying to recover, creating circular dependencies that delayed mitigation. The Fail Small remediation reviewed and removed those dependencies, and where they could not be removed, the tools were rearchitected to fail to a known-good state that still permitted manual recovery. In practical terms, this means break-glass admin consoles, recovery API endpoints, and rollback tooling are now expected to run on infrastructure that does not share fate with the data plane they recover. This is a standard reliability engineering principle that many organizations fail to enforce because break-glass tools are touched rarely and reviewed rarely.
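
The dependency audit itself is amenable to tooling. Below is a minimal sketch of the kind of check a team could run against its own service inventory: walk each break-glass tool's declared dependencies and flag any path that lands back on the data plane the tool is supposed to recover. The tool names, dependency graph, and data-plane set here are hypothetical.

# Hypothetical shared-fate audit for break-glass tooling (illustrative inventory)
from collections import deque

# Declared runtime dependencies per service (assumed inventory data).
DEPENDENCIES = {
    "breakglass-admin-console": ["internal-sso", "edge-proxy"],
    "rollback-cli": ["config-store"],
    "internal-sso": ["edge-proxy"],
    "config-store": [],
    "edge-proxy": [],
}

# Services that belong to the data plane being recovered.
DATA_PLANE = {"edge-proxy"}


def shared_fate_paths(tool):
    """Return every dependency path from a break-glass tool into the data plane."""
    paths, queue = [], deque([[tool]])
    while queue:
        path = queue.popleft()
        for dep in DEPENDENCIES.get(path[-1], []):
            if dep in path:  # avoid cycles
                continue
            if dep in DATA_PLANE:
                paths.append(path + [dep])
            else:
                queue.append(path + [dep])
    return paths


for tool in ("breakglass-admin-console", "rollback-cli"):
    for path in shared_fate_paths(tool):
        print("shared-fate dependency:", " -> ".join(path))

In this toy inventory the admin console fails the audit twice (directly and through internal-sso), which is exactly the kind of finding that forces a rearchitecture or relocation decision before the next incident rather than during it.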

What does "fail to a known-good state" mean in practice?

The Fail Small remediation called out replacing incorrectly applied hard-fail logic across critical Cloudflare data-plane components. Hard-fail means: when input parsing fails or a downstream call errors, return a failure status and drop the request. Hard-fail is appropriate when serving a wrong response would be worse than serving no response — security checks, payment authorization, identity verification. Hard-fail is inappropriate when the failure mode drops legitimate traffic for every customer because of a transient, recoverable issue with a non-essential subsystem. The November 18 outage is the canonical example: the Bot Management module's feature file failed to load because the file was larger than expected, and the proxy returned errors rather than falling back to the previous feature file or to a sane default. After Code Orange, those modules default to a known-good fallback — typically the last successfully loaded version — and surface the failure to operations without taking customer traffic with them.
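
As a rough illustration of that fallback pattern, here is a minimal sketch of a config consumer that keeps the last successfully loaded feature set when a reload fails. The JSON format, size limit, and function names are assumptions for the example, not Cloudflare's actual module.

# Hypothetical "fail to a known-good state" loader for a feature file
import json
import logging

MAX_FEATURES = 200       # upper bound the loader is willing to accept
_last_good = None        # last successfully loaded feature set


def load_feature_file(path):
    """Load a feature file, falling back to the last known-good copy on failure."""
    global _last_good
    try:
        with open(path) as fh:
            features = json.load(fh)
        if len(features) > MAX_FEATURES:
            raise ValueError(f"{len(features)} features exceeds limit {MAX_FEATURES}")
        _last_good = features
        return features
    except (OSError, ValueError) as exc:  # json.JSONDecodeError is a ValueError subclass
        # Surface the failure to operators, but keep serving traffic with the
        # previous configuration instead of dropping requests.
        logging.error("feature file reload failed, keeping last good copy: %s", exc)
        return _last_good

A hard-fail variant would re-raise the exception and start returning errors on every request that needs the module; the difference between those two branches is exactly the blast-radius decision the November 18 postmortem describes.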

How does Code Orange interact with Cloudflare's security posture?

There is a meaningful tradeoff at the heart of Fail Small that's worth surfacing. Some of Cloudflare's most security-sensitive modules — Bot Management, WAF rules, DDoS mitigation — exist precisely to fail closed when they cannot confidently distinguish legitimate traffic from malicious traffic. Replacing hard-fail with graceful degradation in those modules is not unambiguously good: a Bot Management module that defaults to "allow" under load is a less effective security control than one that defaults to "block" under load. Cloudflare's resolution is module-specific. Some modules now degrade to a stale-but-known-good configuration rather than dropping traffic. Others retain hard-fail semantics but with much tighter blast-radius controls (staged rollout, automatic rollback) so that the hard-fail path is exercised only against a 1% canary, not the entire network. The takeaway for other security teams: the right default depends on the module's role in the trust chain, and you should make that decision explicitly per module rather than applying one rule to all components.
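
One way to make that per-module decision explicit, rather than leaving it implicit in each module's error handling, is to record it as data that code review and rollout tooling can check. The sketch below is purely illustrative; the module names and policy assignments are assumptions drawn from the examples in this post, not Cloudflare's actual classification.

# Hypothetical per-module failure-mode register
from enum import Enum


class FailurePolicy(Enum):
    DEGRADE_TO_LAST_GOOD = "serve stale-but-known-good config and alert operators"
    HARD_FAIL = "drop traffic; changes ship only through staged, auto-rollback rollouts"


FAILURE_POLICIES = {
    "bot-management": FailurePolicy.DEGRADE_TO_LAST_GOOD,
    "waf-rules": FailurePolicy.DEGRADE_TO_LAST_GOOD,
    "payment-authorization": FailurePolicy.HARD_FAIL,
    "identity-verification": FailurePolicy.HARD_FAIL,
}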

What changed in incident response itself?

The November 18 postmortem flagged that responders' initial hypothesis was a hyperscale DDoS attack, and reconciling the actual symptom (Bot Management module crashes) with the cause (oversized feature file from a ClickHouse permissions change) took meaningful time. Code Orange added explicit guidance and tooling for distinguishing DDoS, configuration regression, internal-service degradation, and third-party dependency failure within the first five minutes of an incident. The published playbook now includes specific signals to check on each hypothesis class before committing to a mitigation path. This is a process improvement more than a software change, but it reflects a real maturation: at Cloudflare's scale, the cost of pulling the wrong incident playbook is measured in customer-impacting minutes, and the team invested in disambiguation tooling so on-call could pick the right playbook fast.
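
Those specific signals aren't reproduced here, but a team building its own disambiguation aid could start with something as simple as a per-hypothesis checklist that the pager links to. The hypothesis classes below come from the post; the signals under each are illustrative assumptions.

# Hypothetical first-five-minutes disambiguation checklist
DISAMBIGUATION_CHECKS = {
    "ddos": [
        "Did ingress request volume spike before the errors started?",
        "Are errors concentrated on a few targets or spread network-wide?",
    ],
    "configuration regression": [
        "Did any config or feature-file propagation complete just before impact?",
        "Does rolling back the most recent change in one region clear the symptom?",
    ],
    "internal service degradation": [
        "Are internal dependency health checks (auth, storage, control plane) failing?",
        "Is the error rate correlated with one internal service's latency?",
    ],
    "third-party dependency failure": [
        "Are upstream provider status pages or peering links reporting incidents?",
        "Do requests that bypass the third party succeed?",
    ],
}

for hypothesis, signals in DISAMBIGUATION_CHECKS.items():
    print(hypothesis)
    for signal in signals:
        print("  -", signal)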

What should other platform teams adopt from this?

Six concrete patterns are worth carrying over:

1. Define and instrument success and failure indicators for every production service, and gate rollouts on those indicators rather than on operator judgment alone.
2. Treat configuration files that propagate globally with the same rigor as code deployments: staged rollouts, health-mediated gates, automatic rollback.
3. Audit break-glass tooling for circular dependencies on the systems it recovers, and rearchitect or relocate the tools that share fate with the data plane.
4. Classify every fail-closed code path by whether its failure mode actually serves the security goal; sometimes hard-fail is correct, sometimes it just inflates blast radius.
5. Invest in incident-class disambiguation so on-call picks the right playbook within five minutes of paging; even at much smaller scale than Cloudflare, picking wrong wastes the first hour of every incident.
6. Publish your own postmortems against this template: customers, regulators, and your own team learn more from a candid postmortem than from any uptime SLA.

How does Code Orange affect Cloudflare's customers?

The customer-facing impact is mostly invisible by design — that's the point of fail-small architecture. Customers should observe fewer hours of total outage time per year and tighter blast-radius bounding when incidents do happen. There is no API change, no customer-visible configuration to adopt, and no migration. Customers should treat Code Orange as a vendor reliability improvement to factor into their own risk modeling. For platform teams whose own products run on Cloudflare's stack, the reduced correlation between Bot Management, Workers KV, R2, and Access incidents is the most operationally meaningful change — but verifying that correlation reduction is real will take time, and the next major incident will be the test.

How Safeguard Helps

Safeguard's TPRM module continuously tracks resilience-relevant signals for tier-1 cloud and edge vendors, including published incident frequencies, postmortem quality, time-to-public-disclosure, and remediation completion against the vendor's own commitments. The post-Code Orange Cloudflare profile is updated continuously as new postmortems publish, giving customers a real-time view of vendor reliability rather than a snapshot from the last questionnaire. Dependency mapping traces customer workloads through Cloudflare products to underlying Cloudflare or third-party services, surfacing the shared-fate paths that matter for incident planning. Policy gates block production deployments that introduce new dependencies on vendor products without documented fallback paths matching the vendor's stated blast-radius bounds, and Griffin AI correlates multi-vendor status pages in real time to identify the underlying provider behind any visible customer-impacting incident.
