Cloud Security

Cloudflare Workers KV June 12, 2025 Outage: A GCP Dependency Story

A 2-hour, 28-minute Workers KV outage rolled into Access, Gateway, WARP, and Turnstile because the central store sat on GCP. Here is the dependency chain and the R2 re-architecture that followed.

Vikram Iyer
Platform Engineer
7 min read

On June 12, 2025, Cloudflare Workers KV — the distributed key-value store used by Workers for configuration, authentication tokens, and asset delivery — was unavailable for 2 hours and 28 minutes. Workers KV's central store at the time was hosted on Google Cloud, and a GCP outage cascaded directly into Cloudflare's stack. Affected products included Workers KV itself, WARP, Access, Gateway, Images, Stream, Workers AI, Turnstile, Challenges, AutoRAG, Zaraz, and parts of the Cloudflare Dashboard. Cloudflare's postmortem and follow-up engineering blog spell out a structural reason for the blast radius: Cloudflare had consolidated from a dual-backend setup to GCP-only for KV's central store to reduce operational complexity, and when GCP's IAM service went down, Workers KV could not authenticate to retrieve configuration data, taking every dependent product with it.

What did the dependency chain look like?

Workers KV is structured as a globally replicated key-value store with a central data store backing per-region caches. The central data store provides authoritative state for keys that are not yet cached locally. Cloudflare ran the central store on GCP, and access to it required GCP IAM to authenticate requests. When GCP IAM became unavailable on June 12 — as part of a broader GCP outage attributed to an API management issue inside Google — Cloudflare's Workers could not retrieve any KV value that was not already cached locally. Many of Cloudflare's own products use KV to store routing configuration, authentication state, ACL data, and short-lived session tokens that are not durably cacheable. As those values aged out of edge caches, the dependent products started failing one by one. Access lost the ability to validate identity provider sessions. Turnstile and Challenges lost their per-site configuration. WARP lost routing policy data. The blast radius was as wide as KV's internal use, which is to say very wide.
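
To make the chain concrete, here is a minimal sketch of that read path, under the simplifying assumption that a cache miss requires a fresh IAM token for every central-store read. The names (EdgeCache, fetchIamToken, centralStoreGet) are illustrative stand-ins, not Cloudflare internals.

// Hypothetical sketch of the pre-incident read path described above.
// None of these names are Cloudflare internals; they only illustrate the
// dependency chain: edge cache -> central store -> provider IAM.

interface EdgeCache {
  get(key: string): Promise<string | null>;                      // local to the colo
  put(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function fetchIamToken(): Promise<string> {
  // Stand-in for GCP IAM; on June 12 this step failed outright.
  throw new Error("provider IAM unavailable");
}

async function centralStoreGet(key: string, _iamToken: string): Promise<string> {
  // Stand-in for the GCP-hosted central store behind the IAM check.
  return `value-for-${key}`;
}

async function readKey(key: string, cache: EdgeCache): Promise<string> {
  // 1. Edge cache hit: served locally, survives a central-store outage
  //    until the entry ages out.
  const cached = await cache.get(key);
  if (cached !== null) return cached;

  // 2. Cache miss: the read now depends on two external hops, and either
  //    one failing turns this into a hard error for the caller.
  const token = await fetchIamToken();
  const value = await centralStoreGet(key, token);
  await cache.put(key, value, 60);
  return value;
}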

Why was KV's central store single-provider?

Cloudflare's pre-incident design ran KV's central store across two backends — Google Cloud and an alternate — for redundancy. The team consolidated to GCP-only because the dual-backend design imposed real operational cost: consistency reconciliation, dual-write debugging, divergent failure modes, and the cognitive load of two distinct storage backplanes. The consolidation was a deliberate engineering tradeoff, and it worked well under normal conditions. June 12 demonstrated the tradeoff's downside: a single-provider central store inherits that provider's incidents at full strength. The postmortem is candid about this — Cloudflare did not regret the consolidation decision on its merits, but reframed it as a reason to bring the central store onto Cloudflare's own infrastructure rather than swing back to a dual-cloud design.

// Conceptual size-based routing policy from the post-incident KV redesign
{
  "router": "kv-storage-router",
  "rules": [
    {
      "match": { "value_size_bytes_lt": 4096 },
      "destination": "cloudflare-distributed-database",
      "rationale": "Majority traffic, median value size 288 bytes"
    },
    {
      "match": { "value_size_bytes_gte": 4096 },
      "destination": "r2-object-storage",
      "rationale": "Large objects, leverage existing R2 throughput"
    }
  ]
}

What changed in the re-architecture?

Cloudflare announced and shipped a redesigned Workers KV backend that combines Cloudflare's own distributed database with R2 object storage through a size-based router. The team's published telemetry showed that the median KV value size is 288 bytes — the workload is dominated by small values that fit naturally into a low-latency distributed database. Values above a configurable threshold get routed to R2 object storage instead, which gives KV access to R2's throughput for the long tail of larger payloads. The InfoQ writeup of the redesign reports a 40x performance improvement against the prior architecture, and the central store now lives entirely on Cloudflare-operated infrastructure rather than GCP. The architecture sidesteps the June 12 failure mode by removing the dependency on a third-party cloud's IAM service, and the size-based router lets Cloudflare evolve the storage backplane without forcing every key through the same hot path.
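
As a rough sketch of how the policy above plays out in a write path, the snippet below routes on value size. The 4096-byte threshold and the 288-byte median come from the published figures; the store interfaces and function names are hypothetical.

// Conceptual write path for the size-based router. The 4096-byte threshold
// mirrors the policy above; the store interfaces are hypothetical.

const SMALL_VALUE_THRESHOLD_BYTES = 4096;

interface BackingStore {
  put(key: string, value: Uint8Array): Promise<void>;
}

async function routeWrite(
  key: string,
  value: Uint8Array,
  distributedDb: BackingStore,   // Cloudflare-operated distributed database
  r2: BackingStore               // R2 object storage for the long tail
): Promise<void> {
  if (value.byteLength < SMALL_VALUE_THRESHOLD_BYTES) {
    // Majority of traffic: median value is 288 bytes, well under the threshold.
    await distributedDb.put(key, value);
  } else {
    // Larger payloads ride R2's existing throughput.
    await r2.put(key, value);
  }
}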

How did dependent products fail differently?

The breakdown is instructive because it shows how cache TTLs, retry behavior, and fallback paths shape the way an outage unfolds. Products with long edge-cache TTLs degraded gradually — Turnstile, for instance, kept serving customers whose site configurations were already cached locally, and failure rates climbed slowly as cache entries aged out. Products that fetched per-request data from KV without local cache, or with very short TTLs, failed immediately. Access fell into the latter category: identity provider validation requires fresh state, and the inability to read from KV took the product down within minutes. WARP and Gateway sat between those extremes. The takeaway for any team building on KV-like systems: design every read against a global key-value store with explicit consideration of "what if the read fails for 90 minutes" — fallback to last-known-good, graceful degradation, or fail-open versus fail-closed semantics depending on the use case. Defaults matter here, and Cloudflare's postmortem implicitly acknowledged that some products had defaults that turned a degraded KV into a hard product outage.
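
One way to make that decision explicit is a read wrapper that refreshes a last-known-good copy on every successful read and picks fail-open or fail-closed per call site. This is a generic sketch, not Cloudflare's implementation; the names are illustrative.

// Making the "what if the read fails for 90 minutes" decision explicit.
// lastKnownGood stands in for any durable local copy (Cache API, per-isolate
// memory, a periodic snapshot); the names here are illustrative.

type FailureMode = "fail-open" | "fail-closed";

async function resilientRead(
  key: string,
  liveRead: (key: string) => Promise<string>,
  lastKnownGood: Map<string, string>,
  mode: FailureMode
): Promise<string | null> {
  try {
    const value = await liveRead(key);
    lastKnownGood.set(key, value);             // refresh the fallback copy
    return value;
  } catch (err) {
    // Central store unreachable: degrade instead of hard-failing.
    const stale = lastKnownGood.get(key);
    if (stale !== undefined) return stale;     // serve stale rather than error

    // No copy at all: the product decision is explicit, not accidental.
    if (mode === "fail-open") return null;     // e.g. skip a non-critical check
    throw err;                                 // e.g. auth must fail closed
  }
}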

What does this say about multi-cloud strategies for security infrastructure?

The June 12 incident is one of the cleanest case studies in the limits of cross-cloud sourcing. Cloudflare is the third-party cloud provider for tens of thousands of organizations — its CDN, WAF, Access, and Turnstile products are themselves dependencies in customers' security stacks. When Cloudflare consolidated KV's central store onto a single hyperscaler, that hyperscaler's reliability became a transitive dependency of every Cloudflare customer using KV-backed services. The lesson is not that consolidation is always wrong, but that any tier-1 dependency on another cloud provider should be visible in your own dependency map and factored into your incident response runbooks. Cloudflare's own response — bringing the central store in-house — is the right answer for them. For customers, the right answer is to know which Cloudflare products lean on KV, which lean on R2, and which lean on neither, so you can assess outage correlation correctly.

What should other teams audit after this incident?

Five questions worth asking of your own stack. First, do you depend on any vendor whose own infrastructure depends on another vendor's IAM service? GCP IAM, AWS IAM, and Entra ID are common transitive dependencies that show up in unexpected places. Second, for any global key-value store you operate or depend on, what is the cold-start failure mode when the central store is unreachable for 30, 60, or 120 minutes? Third, do your edge caches have explicit "serve stale" policies during origin outages, or do they default to returning errors after TTL expiry? Fourth, are your edge product configurations stored in a way that can survive a central store outage, or do they require live reads? Fifth, when a vendor publishes a postmortem like Cloudflare's, do you incorporate its lessons into your own runbooks, or treat it as their problem to solve?
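
On the third question, one concrete way to state a serve-stale policy is through the standard Cache-Control extensions from RFC 5861. Whether a given CDN or edge cache honors stale-if-error varies by provider, so treat this as a sketch to verify against your own edge rather than a guarantee.

// Illustration of question three: declaring a "serve stale" policy explicitly
// instead of inheriting a default. stale-while-revalidate and stale-if-error
// are standard Cache-Control extensions (RFC 5861); support is provider-specific.

function configResponse(body: string): Response {
  return new Response(body, {
    headers: {
      "Content-Type": "application/json",
      // Fresh for 5 minutes; for up to 24 hours after that, a cache may keep
      // serving the stale copy if the origin (or central store) is erroring.
      "Cache-Control":
        "max-age=300, stale-while-revalidate=600, stale-if-error=86400",
    },
  });
}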

How Safeguard Helps

Safeguard's TPRM module continuously tracks the transitive cloud dependencies of critical SaaS vendors, surfacing chains like "Cloudflare KV depends on GCP IAM" so customers see the second-order risk before an incident, not during it. Dependency mapping traces application services to their backing key-value stores and surfaces blast radius cascades when a vendor product fails. Policy gates block production deployments that introduce new dependencies on edge-state services without documented fallback paths, and Griffin AI correlates multi-vendor status pages in real time to identify which internal services are affected by which underlying provider outage. For platform teams designing cross-cloud architectures, Safeguard produces a continuously updated map of which vendors carry transitive dependencies on which hyperscalers, turning the post-June-12 question of "what else depends on GCP underneath?" from a research project into a dashboard.
