On the evening of May 5, 2026, a large slice of Germany's internet stopped resolving. The cause was not an attack, a cable cut, or a routing leak. It was a self-inflicted cryptographic wound near the top of the DNS hierarchy: DENIC, the registry operator for the .de country-code top-level domain, began publishing DNSSEC signatures that did not match its published keys. By the rules of the protocol, every validating resolver that received those signatures was required to reject them and return SERVFAIL: a security mechanism working exactly as designed, turned into an outage by a signing failure upstream.
This belongs in a May 2026 infrastructure-security review for a subtle but important reason. DNSSEC is a security control. Its entire job is to make resolvers refuse data whose cryptographic chain of trust is broken. When the registry that anchors that chain ships bad signatures, the control faithfully converts an integrity failure into an availability failure. The incident is a clean case study in how a deployment defect — code that passed tests and "cold" parallel operation but was still wrong — propagated into a globally enforced trust mechanism, and how downstream operators like Cloudflare had to make a hard call between honoring DNSSEC and keeping users online.
There are two excellent public post-mortems: DENIC's own root-cause analysis, and Cloudflare's account of how its 1.1.1.1 resolver responded. Together they let us reconstruct the full chain from "a code change in April" to "millions of potential lookups failing in May" with unusual clarity. This post walks through that chain, the detection signals, the remediation, and why the supply chain of trust at the TLD level is so unforgiving.
TL;DR
- On May 5, 2026 (~19:30 UTC), DENIC began publishing unvalidatable DNSSEC signatures for the .de zone during a routine key rollover.
- Root cause: a deployment defect introduced in April 2026 in DENIC's third-generation in-house signer made it generate three different key pairs (one per HSM) instead of one shared key pair across three HSMs. Only one public key was in the DNSKEY record, so roughly two-thirds of signatures could not be validated.
- The defect was "not fully covered by the test scenarios" and was not caught in test runs or in "cold" parallel operation before going live. Validation systems flagged anomalies, but "the notifications were not processed correctly," delaying human response.
- Any validating resolver, including Cloudflare's 1.1.1.1, was protocol-bound to reject the bad signatures and return SERVFAIL. About 18 million .de domains exist; ~3.6% are DNSSEC-signed, still hundreds of thousands of domains, plus knock-on effects via NSEC3.
- Cloudflare cushioned impact by serving stale cached records (RFC 8767), keeping ~65% of queries answered, then at 22:17 UTC deployed an override marking .de insecure (a Negative Trust Anchor equivalent) to bypass validation.
- DENIC distributed a corrected zone starting 00:08 UTC May 6; normal operation restored by ~01:15 UTC.
- Lesson: a code change that clears tests can still be catastrophically wrong in production; the fix is provenance, drift detection between intended and live state, and policy gates on changes to trust-critical systems.
What happened
The verified timeline, drawn from DENIC's and Cloudflare's post-mortems (UTC):
- April 2026: During the rollout of a third-generation DNSSEC signing system, a defect is introduced into DENIC's in-house signing software. It passes test runs and cold parallel operation without being flagged.
- May 5, ~19:30: During a routine DNSSEC key rollover, the defective signer produces and distributes signatures that cannot be validated against the published DNSKEY record. Validating resolvers worldwide begin returning SERVFAIL for .de. Cloudflare's 1.1.1.1 sees SERVFAILs spike immediately.
- 19:30–22:30: As cached records expire, SERVFAIL rates climb steadily. Cloudflare's "serving stale" behavior keeps roughly 65% of queries answered from expired cache rather than erroring.
- 22:17: Cloudflare deploys an override rule marking .de as an insecure zone (functionally a Negative Trust Anchor), bypassing DNSSEC validation for .de so resolution resumes. It applies the same mitigation to the internal resolver that serves CDN customers with .de origins.
- May 6, 00:08: DENIC begins distributing a corrected DNS zone with valid signatures.
- May 6, ~01:15: Pre-outage operational status fully restored.
DENIC publicly apologized. ICANN data indicates only about 3.6% of .de domains are DNSSEC-signed, but with close to 18 million .de registrations that still amounts to roughly 650,000 directly affected domains. The impact also reached beyond signed domains: the zone's NSEC3 records carry signatures too, and a validating resolver relies on them to prove that a delegation is unsigned, so unvalidatable NSEC3 signatures disrupted resolution more broadly.
Technical analysis: one key pair, or three?
DNSSEC works by signing DNS records with private keys whose corresponding public keys are published in DNSKEY records, themselves vouched for up the chain to the root. A resolver that validates checks that each record's signature (RRSIG) verifies against a published DNSKEY. If it does not, the resolver must treat the data as bogus and refuse to return it — SERVFAIL. There is no "ignore the broken signature and serve it anyway" in a validating resolver; that refusal is the whole security guarantee.
DENIC's root cause, in its own framing: during deployment of the third-generation signer, a faulty piece of code "was incorporated into the in-house development which was not fully covered by the test scenarios and was therefore not identified as defective during test runs or in 'cold' parallel operation prior to commissioning." The functional defect: DENIC signs across three Hardware Security Modules (HSMs) and the correct behavior is to generate one key pair shared across all three. The defective code instead "generated three different key pairs, one per HSM." Only one of those three public keys was published in the DNSKEY record.
The consequence is arithmetic. Signatures produced by the two HSMs whose public keys were not published could not be validated by any resolver — roughly two-thirds of signature records were unvalidatable. From a resolver's perspective, the .de zone was intermittently and then increasingly serving signatures with no matching key. Correct DNSSEC behavior mandated SERVFAIL.
# Conceptual illustration (not real keys/config).
# Intended: one shared key pair across 3 HSMs; its public key in DNSKEY.
DNSKEY .de -> [ KEY_A ] # one published public key
HSM1 signs with KEY_A -> RRSIG validates against DNSKEY OK
HSM2 signs with KEY_A -> RRSIG validates against DNSKEY OK
HSM3 signs with KEY_A -> RRSIG validates against DNSKEY OK
# Defect: three independent key pairs, only one public key published.
DNSKEY .de -> [ KEY_A ] # still only one published
HSM1 signs with KEY_A -> validates OK
HSM2 signs with KEY_B -> NO matching DNSKEY -> BOGUS -> SERVFAIL
HSM3 signs with KEY_C -> NO matching DNSKEY -> BOGUS -> SERVFAIL
Two secondary factors deepened and prolonged the impact. First, caching cut both ways: it delayed the onset for many users (records cached before 19:30 kept resolving until their TTLs expired), which is why SERVFAIL rates climbed gradually rather than instantly, but it also meant recovery was not instantaneous after the fix. Second, and more troubling operationally, DENIC's validation systems did detect the anomalies — "the notifications were not processed correctly." The technical safety net fired; the human/process layer above it did not act on the alert promptly. A correct detector whose output is dropped is, in effect, no detector.
Cloudflare's side illustrates the downstream operator's dilemma. A validating public resolver cannot simply ignore broken DNSSEC without abandoning the integrity guarantee its users rely on. Cloudflare first leaned on RFC 8767 "serve stale," continuing to answer from expired cache rather than erroring, which preserved roughly 65% of queries. When that was not enough, it made the deliberate, scoped decision to mark .de insecure via an override (a Negative Trust Anchor in spirit), temporarily turning off validation for that one zone to restore service — a controlled reduction of a security guarantee, applied narrowly, to limit availability damage. That is the kind of break-glass tradeoff teams should pre-plan rather than improvise.
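The order of those fallbacks can be sketched as a small decision function. This is a minimal model, not Cloudflare's implementation: `CacheEntry`, `Bogus`, `resolve`, the `lookup(qname, validate)` callable, and the `max_stale` cap are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class CacheEntry:
    answer: str
    expires_at: float  # epoch seconds at which the TTL runs out


class Bogus(Exception):
    """Raised by the (hypothetical) upstream lookup when DNSSEC validation fails."""


def is_under(qname: str, zone: str) -> bool:
    """True if qname equals zone or is a subdomain of it."""
    q = qname.rstrip(".").lower()
    z = zone.strip(".").lower()
    return q == z or q.endswith("." + z)


def resolve(
    qname: str,
    now: float,
    cache: Dict[str, CacheEntry],
    lookup: Callable[[str, bool], str],
    negative_trust_anchors: Set[str],
    max_stale: float = 86400.0,  # assumed cap on how long expired data may be served
) -> str:
    entry = cache.get(qname)
    if entry is not None and now < entry.expires_at:
        return entry.answer  # fresh cache hit, no upstream query needed

    # A negative trust anchor switches validation off for that zone only.
    validate = not any(is_under(qname, zone) for zone in negative_trust_anchors)
    try:
        return lookup(qname, validate)
    except Bogus:
        # Serve-stale in the spirit of RFC 8767: prefer expired data over an error.
        if entry is not None and now - entry.expires_at <= max_stale:
            return entry.answer
        return "SERVFAIL"
```

With an empty `negative_trust_anchors` set, names that have an expired cache entry keep resolving and everything else fails, which matches the partial cushioning described above. Adding `"de"` to the set restores resolution for every name under .de while leaving validation intact for all other zones.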
What detection looks like
For a registry or any DNSSEC signer, the detection lesson is to validate the actual published artifact against the published keys before and after it goes live, and to ensure detector output reaches a human or an automated rollback.
Signals and checks:
- Post-publish validation against published DNSKEY. Independently fetch the zone's DNSKEY set and verify that every RRSIG you publish validates against a published key, from outside the signer, using a different code path than the one that produced the signatures. The defect here was precisely that two-thirds of signatures had no matching published key; an external validator would have caught it immediately.
- Alert delivery is part of the control. DENIC's validators flagged the anomaly but the notification was not processed. Detection is not done when an alert is generated; it is done when the alert is acted on. Test the alerting path itself, and prefer automated halt/rollback on signing anomalies over relying on a human reading a notification.
- Resolver-side signals (for operators consuming DNSSEC). A sharp, sustained rise in SERVFAIL for a specific TLD, correlated across many domains under it, points at a zone-level DNSSEC failure rather than a single misconfigured domain. Cloudflare's 1.1.1.1 saw exactly this for .de at 19:30 UTC.
# Illustrative resolver-side detection intent (pseudo-query)
source=resolver_logs
| where rcode == "SERVFAIL"
| where qname endswith ".de"
| timechart span=1m count
# A zone-wide SERVFAIL spike across many distinct qnames under one TLD
# == upstream DNSSEC/zone failure, not a per-domain problem.
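For the post-publish check in the list above, one cheap first pass is to compare the key tag carried in each RRSIG against the key tags of the published DNSKEY set. The sketch below computes the key tag as specified in RFC 4034 Appendix B; the `orphaned_signatures` helper and its input shapes are assumptions made for illustration. A key-tag match is necessary but not sufficient, so a real validator would also check algorithm and signer name and then verify each signature cryptographically.

```python
from typing import Iterable, List, Tuple


def key_tag(dnskey_rdata: bytes) -> int:
    """Key tag per RFC 4034 Appendix B (for algorithms other than 1).

    dnskey_rdata is the DNSKEY RDATA in wire format:
    flags (2 bytes) + protocol (1) + algorithm (1) + public key.
    """
    acc = 0
    for i, byte in enumerate(dnskey_rdata):
        # Even offsets contribute the high octet, odd offsets the low octet.
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def orphaned_signatures(
    published_dnskeys: Iterable[bytes],
    rrsigs: Iterable[Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Return (owner name, key tag) for every RRSIG whose key tag matches
    no published DNSKEY. The input shapes are hypothetical."""
    published = {key_tag(rdata) for rdata in published_dnskeys}
    return [(owner, tag) for owner, tag in rrsigs if tag not in published]
```

Run from outside the signer against the zone as actually served, a non-empty result is a signal to halt publication. In the scenario DENIC described, signatures from two of the three HSMs would be expected to show up here, because their keys were never in the DNSKEY set.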
What to do Monday morning
If you operate a DNSSEC-signed zone or a signer:
- Validate the live artifact externally, every time. After every rollover or signer change, independently confirm the published zone validates against the published DNSKEY using tooling that does not share code with the signer. Block "commissioning" on a green external validation.
- Make signing anomalies auto-halt, not auto-notify. Wire your validators so a detected signature/key mismatch automatically halts publication or rolls back, rather than emitting a notification someone might not process. Then test that path under fault injection.
- Treat signer deployments as trust-critical changes. A change to signing software is not a routine deploy. Require provenance, review, and a tested rollback for the artifact before it touches production, and prefer staged rollover with validation gates between stages.
- Pre-plan the break-glass. Decide in advance what you do if your own zone is bogus or an upstream zone you depend on is bogus. For resolver operators, document when and how you would serve stale (RFC 8767) and when you would apply a scoped Negative Trust Anchor, so you are not improvising during an outage.
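The "auto-halt, not auto-notify" item can be sketched as a publication gate that fails closed. This is a minimal illustration under assumed interfaces: `gated_publish`, the validator callables (which return `None` when the zone is fine or a string describing the problem), and the `publish` and `halt` hooks are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Validator = Callable[[str], Optional[str]]


def gated_publish(
    candidate_zone: str,
    validators: Sequence[Validator],
    publish: Callable[[str], None],
    halt: Callable[[List[str]], None],
) -> bool:
    """Publish only if every independent validator passes.

    A validator that raises is treated as a failure, so a broken
    check can never silently wave a zone through.
    """
    failures: List[str] = []
    for check in validators:
        try:
            problem = check(candidate_zone)
        except Exception as exc:  # fail closed on validator errors
            problem = f"validator error: {exc!r}"
        if problem is not None:
            failures.append(problem)

    if failures:
        halt(failures)  # stop or roll back; notifying humans is secondary
        return False
    publish(candidate_zone)
    return True
```

The design choice is that the halt happens in the same code path as the detection, so nothing depends on a notification being read. Exercising the gate with a deliberately failing validator is one way to test that path under fault injection.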
If you consume DNS / DNSSEC:
- Know your single points of trust. Inventory which TLDs and zones your critical services depend on, and recognize that a registry-level DNSSEC failure is outside your control but inside your blast radius. Build resolver redundancy and consider serve-stale behavior for resilience.
- Monitor SERVFAIL by zone. Alert on TLD-wide SERVFAIL spikes so you can attribute an outage to an upstream zone quickly and reach for the right mitigation instead of debugging your own DNS.
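A per-zone SERVFAIL monitor can be sketched in a few lines. The function below assumes resolver logs have already been parsed into `(minute, qname, rcode)` tuples; that shape, the function names, and every threshold are hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[int, str]  # (minute bucket, TLD)


def tld_of(qname: str) -> str:
    """Rightmost label of a query name, lowercased."""
    return qname.rstrip(".").lower().rsplit(".", 1)[-1]


def servfail_spikes(
    events: Iterable[Tuple[int, str, str]],
    min_failures: int = 100,
    min_distinct_names: int = 20,
    min_ratio: float = 0.5,
) -> List[Key]:
    """Return (minute, tld) buckets where SERVFAIL is high in volume,
    spread across many distinct names, and a large share of all queries."""
    totals: Dict[Key, int] = defaultdict(int)
    fails: Dict[Key, int] = defaultdict(int)
    names: Dict[Key, Set[str]] = defaultdict(set)

    for minute, qname, rcode in events:
        key = (minute, tld_of(qname))
        totals[key] += 1
        if rcode == "SERVFAIL":
            fails[key] += 1
            names[key].add(qname.rstrip(".").lower())

    return sorted(
        key
        for key in fails
        if fails[key] >= min_failures
        and len(names[key]) >= min_distinct_names
        and fails[key] / totals[key] >= min_ratio
    )
```

The distinct-name condition is what separates a zone-level failure from one misconfigured domain generating many failed lookups. A single broken name can exceed the volume threshold but not the spread threshold.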
Why this keeps happening
DNSSEC sits at a brutal intersection: it is a strict, fail-closed security control anchored by a small number of registry operators, and it is implemented in software that, like all software, has bugs. The protocol gives no quarter — a signature that does not validate is rejected, full stop — so any defect that corrupts signing converts directly into an availability event for everyone who validates. There is no graceful degradation built into the protocol; graceful degradation has to be bolted on downstream (serve-stale, negative trust anchors), which only the most sophisticated resolver operators do well.
The recurring pattern in registry-level DNSSEC incidents is the same as in many supply-chain failures: a change that passed its tests was still wrong, because the tests did not cover the failure mode. DENIC's own account is candid about this — the defect survived test runs and cold parallel operation because the scenarios did not exercise the multi-HSM key-generation behavior that broke. Cold parallel operation looked fine because, presumably, it was not being validated end-to-end against published keys by an independent path. The gap between "our tests are green" and "the artifact we shipped is correct in production" is exactly where these incidents live, and it is widest for trust-critical systems where the failure mode is rare, global, and instantaneous.
The structural fix
This is a software supply-chain-of-trust problem wearing a DNS costume, and the fixes are the same ones that apply to any change touching a trust-critical system: provenance on the artifact, continuous reconciliation of intended versus actual state, and policy gates that block trust-critical changes from shipping unverified. Safeguard's drift-detection is built to surface the gap between what a system is supposed to be publishing and what it is actually publishing — the precise gap (one published key versus three signing keys) that turned a code defect into a global outage. Policy-as-code lets you encode a non-negotiable gate like "no signer deployment is commissioned until an independent validator confirms every published signature verifies against a published key," so a change that clears unit tests still cannot reach production without that end-to-end check. For the integrity of the signing artifacts themselves, SLSA provenance on the signer build ties the deployed code back to a reviewed, attested source. None of this rewrites the DNSSEC protocol's fail-closed behavior, but it shortens the time between "a defect shipped" and "we caught it before, or seconds after, it went live."
What we know we don't know
- Why the alert was not processed. DENIC states notifications "were not processed correctly," but the specifics of that process failure (tooling, on-call, alert routing) were not detailed publicly.
- Precise user-facing impact. The 3.6% DNSSEC-signed figure and ~18 million .de registrations bound the directly affected population, but a precise count of failed lookups or affected end users is not published. Cloudflare's ~65% serve-stale success rate is specific to its resolver, not a global figure.
- Exact code defect. DENIC described the behavior (three key pairs instead of one) but not the line-level defect or why the test scenarios omitted the multi-HSM case.
References
- DENIC — Analysis of the DNS outage on 5 May 2026
- DENIC — Technical issue with .de domains resolved
- Cloudflare — When DNSSEC goes wrong: how we responded to the .de TLD outage
- The Register — DENIC sorry for DNSSEC error that crashed Germany's internet
- StatusGator — Major .de Outage: DNSSEC Failure at DENIC Takes Down German Domains