Cloud Security

Cloudflare R2 February 6, 2025 Outage: When Abuse Tooling Took Down Production

A routine phishing-URL takedown triggered the wrong control and disabled R2 globally for 59 minutes. Here is what went wrong and the two-party approval Cloudflare added afterwards.

Pooja Rao
Cloud Security Engineer
7 min read

On February 6, 2025, Cloudflare R2 was unavailable for 59 minutes after an abuse remediation against a phishing site hosted on R2 disabled the entire R2 Gateway service instead of the specific endpoint or bucket associated with the abuse report. According to Cloudflare's published postmortem, the incident was caused by human error and insufficient validation safeguards inside the abuse processing system. No data was lost or corrupted — this was a control-plane failure, not a storage-layer failure — but every R2 customer, plus every Cloudflare product that depends on R2, saw 100% write failures and severely degraded reads for the duration. Six weeks later, on March 21, R2 went down again, this time over a credential rotation error. The pair of incidents drove Cloudflare's November "Code Orange: Fail Small" resilience initiative, and the February root cause is the simpler of the two to internalize as a teaching example.

What actually happened on February 6?

A Cloudflare Trust & Safety analyst was processing an abuse report against a phishing URL hosted on R2. The standard remediation, when an analyst confirms abuse, is to disable the specific endpoint serving the malicious content — a narrow action against a single bucket or hostname. The tooling exposed an "advanced product disablement" control that, when triggered against the report's associated account, disabled the production R2 Gateway service in its entirety rather than the offending endpoint. Because abuse remediations are intentionally fast, single-analyst actions — you do not want a takedown to require a second approval in time-sensitive cases — the action propagated within seconds. R2's Gateway service is the front door for every R2 request globally, and disabling it cut off every customer at once. Recovery required identifying the misapplied action, reversing it through manual coordination between Trust & Safety and the R2 team, and waiting for the Gateway to re-enable across data centers. Total elapsed time: 59 minutes.

Why was the tooling capable of this in the first place?

Two contributing factors emerged in the postmortem. First, the abuse processing system was not configured to identify internal Cloudflare accounts and block product-disablement actions against them. Cloudflare's internal teams operate inside Cloudflare's own account hierarchy — that's how dogfooding works — and the R2 Gateway service itself runs in an internal account. The abuse tooling treated that account the same way it would treat any external customer hosting a phishing site, applying the disable action without recognizing that the target was infrastructure serving every R2 customer. Second, no two-party approval was required for ad-hoc product disablement at the account level. Abuse remediations are typically single-analyst actions for speed, which is appropriate when the action targets a single URL or bucket — but the product disablement workflow extended the same single-analyst trust model to a vastly more destructive action.

# Conceptual remediation policy that Cloudflare added afterwards
abuse_actions:
  endpoint_disable:
    approvers_required: 1
    blast_radius_threshold: single_endpoint
  bucket_disable:
    approvers_required: 1
    blast_radius_threshold: single_bucket
  product_disable_account:
    approvers_required: 2
    blast_radius_threshold: multi_tenant
    forbidden_target_accounts:
      - internal:cloudflare:*
      - internal:rd:*
      - internal:trust:*

How is this incident different from the March 21 outage?

Both incidents took R2 to zero writes globally. Both involved internal tooling that operated correctly within its design envelope but produced catastrophic outcomes when the operator's intent diverged from the action they triggered. The February incident is fundamentally a blast-radius failure: a control that should have been scoped to an endpoint operated at the service level. The March incident is fundamentally a deployment-environment failure: a credential rotation targeted the wrong environment. Both share a structural cause — internal tools whose defaults or affordances do not match the operator's likely intent — and both motivated Cloudflare's broader push to require release-pipeline mediation for high-blast-radius actions. The November "Code Orange" announcement consolidates the lessons from both into the "Fail Small" framing: changes that affect every customer should require approval and validation steps that scale with the blast radius.

What did Cloudflare change after the postmortem?

Cloudflare implemented two-party approval for any ad-hoc product disablement action: an investigator's remediation must now be signed off by a manager or by someone on an approved remediation acceptance list before it takes effect. The abuse processing system was expanded with explicit checks that prevent any product-disablement action against products associated with an internal Cloudflare account, removing the foot-gun that turned a routine takedown into a global outage. Operationally, the Trust & Safety team now coordinates with the affected product team for any account-level disablement, ensuring that the product team can intercept misapplied actions before they propagate. These are simple fixes in hindsight, and that's part of the takeaway: the most expensive incidents often come from controls that were designed for one scope and used at another scope without the controls noticing the difference.
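
To make the shape of those two gates concrete, here is a minimal sketch of how such validation might look. The request structure, account-prefix convention, and function names are hypothetical illustrations, not Cloudflare's actual tooling.

# Hypothetical sketch of the two post-incident gates; the data model and
# account prefixes are illustrative, not Cloudflare's real implementation.
from dataclasses import dataclass, field

# Assumed convention for tagging internal infrastructure accounts.
INTERNAL_ACCOUNT_PREFIXES = ("internal:cloudflare:", "internal:rd:", "internal:trust:")

@dataclass
class RemediationRequest:
    action: str                      # e.g. "endpoint_disable", "product_disable_account"
    target_account: str              # account the remediation will be applied to
    requested_by: str                # analyst who confirmed the abuse report
    approvers: list[str] = field(default_factory=list)

def validate_remediation(req: RemediationRequest) -> None:
    """Raise before a high-blast-radius action can propagate."""
    if req.action != "product_disable_account":
        return  # narrow endpoint/bucket takedowns stay single-analyst for speed
    # Gate 1: product disablement can never target internal infrastructure
    # accounts, the check that would have caught the February 6 action.
    if req.target_account.startswith(INTERNAL_ACCOUNT_PREFIXES):
        raise PermissionError(
            f"product disablement forbidden against internal account {req.target_account}")
    # Gate 2: account-level disablement needs a second, distinct approver.
    if not (set(req.approvers) - {req.requested_by}):
        raise PermissionError("product_disable_account requires a second approver")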

What patterns from this should every team audit?

Five concrete questions every platform team should ask of its own abuse and operational tooling. First, what is the blast radius of every action your support, fraud, abuse, and operations teams can trigger from internal consoles? Map them and rank them. Second, which of those actions require approvals proportional to their blast radius — and which would let a single person take down production? Third, are there account hierarchies in your own infrastructure that should be tagged as off-limits to operational tooling that targets customer accounts? If your CDN runs on your CDN, your storage runs on your storage, or your monitoring runs on your platform, the answer is yes. Fourth, do you have circuit breakers that detect when an operational action affects more than a configured percentage of traffic or customers and refuse to apply without escalation? Fifth, when an internal tool fires an action, do downstream affected teams receive a real-time notification, so they can intercept misapplications within minutes rather than waiting for customer reports?
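
The fourth question lends itself to a small worked example. Below is a rough sketch of a blast-radius circuit breaker; the one-percent threshold and the escalation hook are assumptions chosen for illustration, not a prescription.

# Rough sketch of a blast-radius circuit breaker; threshold and escalation
# mechanism are assumptions for illustration.
BLAST_RADIUS_THRESHOLD = 0.01   # refuse unattended actions touching >1% of customers

def apply_with_circuit_breaker(action, affected: int, total: int, escalate) -> bool:
    """Apply `action` only if its estimated blast radius is under the threshold;
    otherwise hand it to `escalate` (e.g. page the owning product team) unapplied."""
    blast_radius = affected / max(total, 1)
    if blast_radius > BLAST_RADIUS_THRESHOLD:
        escalate(blast_radius)   # large actions wait for a human decision
        return False
    action()                     # small, scoped actions apply immediately
    return True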

How should you think about R2 as a dependency?

If you use R2 as a backing store for your own application, the February and March incidents establish a baseline expectation: assume R2 can be unavailable for up to two hours at a stretch, with no advance notice, and design accordingly. For build artifact storage, that means a fallback or a queued retry. For backups, that means alternate destinations. For customer-facing serving, that means CDN caching with longer TTLs against the origin and graceful degradation when origin fetches fail. For tier-1 traffic, that means a multi-provider strategy that explicitly does not assume any single object storage service will be available through every hour of every day. The harder problem, and one Cloudflare engineers openly acknowledge, is that R2's footprint inside Cloudflare's own product stack is now broad enough that an R2 outage is also a Stream outage, an Images outage, and a Cache Reserve outage. Cloudflare's "Fail Small" remediation aims to reduce that coupling.
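
As one illustration of the write-path fallback, the sketch below tries a primary store first and diverts to a second destination when the write fails. It assumes both destinations speak an S3-compatible API (R2 does); the endpoint, bucket name, and credential handling are placeholders.

# Minimal fallback-write sketch; endpoint, bucket, and credentials are placeholders.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

primary = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")
fallback = boto3.client("s3")  # e.g. a bucket at a second provider or region

def put_with_fallback(bucket: str, key: str, body: bytes) -> str:
    """Write to the primary store; on any failure, divert to the fallback and
    return which destination accepted the object so callers can reconcile later."""
    try:
        primary.put_object(Bucket=bucket, Key=key, Body=body)
        return "primary"
    except (BotoCoreError, ClientError):
        # Primary unavailable (e.g. a control-plane outage): divert the write.
        fallback.put_object(Bucket=bucket, Key=key, Body=body)
        return "fallback"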

How Safeguard Helps

Safeguard's TPRM module continuously scores cloud object storage providers including Cloudflare R2 against published incident histories, postmortem quality, and remediation SLAs, giving customers a real-time risk profile rather than the static questionnaire vendors typically supply. Dependency mapping traces application services to their backing object stores and surfaces the cascade impact of an R2 outage on internal product surface area — distinguishing which services depend on R2 directly, which ride on top of R2-backed Cloudflare products like Stream or Images, and which have no R2 path. Policy gates block production deployments that introduce new single-provider object storage dependencies for tier-1 workloads without an approved fallback design, and Griffin AI correlates third-party status pages with customer workloads to identify affected internal services within seconds of a vendor incident going public.
