Between 21:38 and 22:45 UTC on March 21, 2025, Cloudflare R2 Object Storage experienced 100% write failures and approximately 35% read failures globally. The cause, per Cloudflare's published postmortem, was a credential rotation gone wrong: an engineer running wrangler secret put and wrangler deploy omitted the --env production flag, deploying new credentials to the default environment instead of the production R2 Gateway Worker. When the rotation completed by deleting the old credentials from the storage infrastructure, the production Gateway no longer held any valid credential to authenticate with, and writes began failing immediately. Reads partially survived because some were served from cache, but anything that needed to reach the origin returned errors. This is the type of incident that should never make it past a deploy review, and the postmortem makes clear that the gap was tooling, not skill — the system's defaults were the wrong defaults.
Why does --env matter in Wrangler?
Wrangler is the CLI for deploying Cloudflare Workers and managing their bindings, secrets, and environment configuration. A Worker can be deployed to multiple environments — dev, staging, production — each with its own secrets and resource bindings. The trap is that Wrangler's secret put and deploy subcommands both default to the unnamed "default" environment when --env is not passed. For a Worker that has only ever been deployed to explicit environments, the default environment may not even be the deployment customers actually hit. The Cloudflare R2 Gateway fits this pattern exactly: the production deployment uses --env production, but the unnamed default environment also exists in the account, configured but unrouted. A rotation command without --env writes to the default deployment, which is not the one serving traffic. The subsequent deletion of the old credential from the storage backend then strands the live deployment.
# What was accidentally run (writes to the unnamed default env)
wrangler secret put R2_STORAGE_CREDENTIAL
wrangler deploy

# What should have been run (explicit production target)
wrangler secret put R2_STORAGE_CREDENTIAL --env production
wrangler deploy --env production

# A safer wrangler.toml layout: keep routes and production bindings under a
# named environment so the unnamed default stays unrouted, and have release
# tooling reject any deploy that omits --env.
# main = "src/index.ts"
# [env.production]
# routes = [...]
What did the failure look like?
The Gateway Worker began returning authentication errors against R2's storage backend at 21:38 UTC, the moment the old credential was deleted. Within seconds, every write request arriving at the Gateway failed because the Worker held a credential the storage backend no longer accepted. Reads that could be served from cache continued working; reads that required origin fetches hit the same authentication wall. Cloudflare's incident timeline shows that responders correctly hypothesized a credential issue within the first 10 minutes, but tracing which credential the Gateway was using, and why it had diverged from the storage backend, consumed the rest of the outage. The postmortem flags this as a key visibility gap: there was no live correlation between the credential ID deployed to the production Worker and the credential ID accepted by the storage layer.
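From outside Cloudflare, the split between cached and origin-bound reads is observable in the cf-cache-status response header. A minimal probe, assuming a hypothetical R2-backed URL, would have shown cache hits still succeeding while misses surfaced errors:
# Hypothetical URL; illustrates the read behavior during the window.
curl -sI https://assets.example.com/logo.png | grep -iE '^(HTTP|cf-cache-status)'
# cf-cache-status: HIT  -> answered from cache, the request succeeded
# cf-cache-status: MISS -> required an origin fetch to R2, returned an error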
Which downstream services failed because R2 failed?
R2 sits underneath a surprising number of Cloudflare services. The March 21 outage took down or degraded Cache Reserve, Images, Stream (uploads and delivery), Email Security (attachment storage), Vectorize (embedding storage), Log Delivery, Billing, and Key Transparency Auditor. Customers running their own applications on R2 — backup pipelines, build artifact stores, dataset hosting — also saw failures. This was the second major R2 incident in six weeks, following the February 6 outage where a botched abuse remediation disabled the Gateway service entirely. Together, the two incidents underscore that R2 is now a cross-product dependency inside Cloudflare's stack, not just a customer-facing object store, and any operational error in the Gateway propagates through dozens of products.
What guardrails were missing?
Three categories, all visible in the postmortem. First, Wrangler's defaults are unsafe for production-critical Workers: a CLI that silently targets a different environment than the operator intended is a footgun, and Cloudflare's remediation is to require key rotation to flow through release tooling rather than ad-hoc wrangler commands. Second, there was no out-of-band check between credential issuance and credential deletion. The rotation procedure should have confirmed the new credential was active on the production Gateway before deleting the old one — for example, by verifying that the new credential's last-used timestamp had advanced past the rotation start time. Cloudflare's fix here is explicit confirmation that the suffix of the new token ID matches what the storage infrastructure logs report seeing on incoming requests, before any deletion is permitted (a sketch of such a gate follows this paragraph). Third, the deployment system should have surfaced that two environments existed for the Gateway Worker and that only one — production — actually served traffic. Defaulting deploys to an unrouted environment is a hazard for Wrangler's entire user base.
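A minimal sketch of such a gate, assuming a hypothetical check_credential_in_use helper that queries storage-side auth logs for the new token ID suffix (Cloudflare's actual pipeline internals are not public):
#!/usr/bin/env bash
# Hedged sketch of "verify active before delete". check_credential_in_use
# is a hypothetical helper standing in for a storage-side auth-log query.
set -euo pipefail
NEW_TOKEN_SUFFIX="$1"

# Deploy the new credential to the environment that actually serves traffic.
wrangler secret put R2_STORAGE_CREDENTIAL --env production
wrangler deploy --env production

# Poll until the storage backend reports requests signed with the new token.
for _ in $(seq 1 30); do
  if check_credential_in_use "$NEW_TOKEN_SUFFIX"; then
    echo "New credential confirmed on incoming requests; old one may be deleted."
    exit 0
  fi
  sleep 10
done

echo "New credential never observed on incoming requests; refusing to delete the old one." >&2
exit 1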
What changed after the postmortem?
Cloudflare published three concrete remediations. Rotation tooling now runs through a release pipeline that explicitly sets --env production and rejects commands that target the default environment for Workers tagged as production. The pipeline performs a pre-deletion check that the new credential is actively serving traffic by comparing storage-side logs against the issuance timestamp. And the Gateway Worker itself emits a periodic heartbeat metric that includes the active credential's identifier, giving SRE a live dashboard of which credential the production deployment is using — closing the visibility gap that turned a 5-minute fix into a 67-minute outage.
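A minimal sketch of the resulting divergence check from the SRE side, assuming two hypothetical internal endpoints, neither of which is a real Cloudflare API: one exposing the credential suffix the heartbeat reports, one exposing the token suffix the storage layer last accepted.
#!/usr/bin/env bash
# Hedged sketch; both URLs are hypothetical internal endpoints.
set -euo pipefail

deployed=$(curl -fsS "$GATEWAY_HEARTBEAT_URL" | jq -r '.credential_id_suffix')
accepted=$(curl -fsS "$STORAGE_AUTHLOG_URL" | jq -r '.last_accepted_token_suffix')

if [[ "$deployed" != "$accepted" ]]; then
  echo "ALERT: Gateway reports credential ${deployed}, storage accepts ${accepted}" >&2
  exit 1
fi
echo "OK: production Gateway and storage agree on credential ${deployed}"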
What should other teams take from this incident?
Five lessons that apply far beyond Cloudflare's stack. First, audit every credential rotation procedure in your environment for the "delete old before confirming new is in use" anti-pattern; it is endemic in homegrown rotation scripts and shows up in published vendor procedures too. Second, treat CLIs with environment-scoped defaults as production hazards: either remove the unscoped environment from the configuration, or wrap the CLI in a release pipeline that enforces the scope (a wrapper sketch follows this paragraph). Third, instrument credential identifiers as first-class observability: emit which credential ID is currently active in each deployment and alert on divergence. Fourth, run a tabletop on this exact scenario for any service that has multiple environments and shared credentials — Cloudflare is not alone, and the same pattern bit a major AWS customer in 2024 that we are not at liberty to name. Fifth, classify R2 (and equivalent dependencies in your stack) as part of your control plane and apply the same scrutiny you apply to your IdP and your KMS.
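As a concrete form of the second lesson, a thin PATH shim can refuse unscoped invocations before they reach the real binary. This is an illustrative local policy, not a Wrangler feature, and the install path is an assumption:
#!/usr/bin/env bash
# Hedged sketch: a PATH shim placed ahead of the real wrangler binary.
# It rejects environment-sensitive subcommands that omit an explicit --env.
set -euo pipefail

case "${1:-}" in
  deploy|secret)
    if [[ "$*" != *--env* ]]; then
      echo "refusing: 'wrangler $1' must name an explicit --env" >&2
      exit 1
    fi
    ;;
esac

# Forward compliant commands (and everything else) to the real CLI.
exec /opt/wrangler/bin/wrangler "$@"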
How Safeguard Helps
Safeguard's cloud configuration scanning ingests Cloudflare Wrangler configurations through wrangler.toml introspection and flags Workers that retain an unnamed default environment alongside named production environments — exactly the misconfiguration that turned a routine secret rotation into a global outage. TPRM workflows continuously score critical SaaS dependencies, including object storage providers like Cloudflare R2, against their public incident histories and remediation SLAs, giving customers a real-time risk profile rather than an annual security questionnaire. Policy gates block production deployments where credential rotation procedures do not include verifiable "active before delete" checks, and Griffin AI correlates third-party status pages with customer workloads to identify which internal services will be affected within seconds of a vendor incident page going red. For organizations that depend on R2 for build artifacts or backup storage, Safeguard surfaces the blast radius before the incident, not during it.