Differential Testing for Supply Chain Vulns

Differential testing compares the behavior of multiple implementations of the same specification. In supply-chain work, it surfaces the bugs that no single implementation's test suite can see.

Shadab Khan
Security Engineer

There is a class of vulnerability that is almost impossible to find with traditional testing. It does not cause a crash. It does not trigger a sanitizer. The code does exactly what the author intended. The bug is that two implementations of the same specification disagree about what that specification says, and an attacker can use that disagreement to slip past one while convincing the other everything is fine.

This is the territory of differential testing. It is one of the most effective techniques in supply-chain vulnerability research, and it is responsible for some of the most consequential findings of the last decade, from the Cloudflare certificate issues to the recurring XML signature wrapping attacks.

This post walks through what differential testing is, how to run a useful campaign, and what kinds of bugs it is best at finding.

The Core Idea

Differential testing compares the outputs of two or more programs that are supposed to implement the same behavior. You feed the same input to both programs. If their outputs differ, you have either a bug in one of them or an ambiguity in the specification. Either outcome is interesting.
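
To make the loop concrete, here is a minimal harness in Python. The two "implementations" are stand-ins (the stdlib parser in strict and non-strict mode); a real campaign would plug in genuinely independent libraries:

    # Minimal differential harness: run every implementation on the same
    # input and flag any disagreement in outcome.
    import json

    def try_parse(parser, data):
        """Return ("ok", value) or ("error", exception class name)."""
        try:
            return ("ok", parser(data))
        except Exception as e:
            return ("error", type(e).__name__)

    def differential(parsers, data):
        results = {name: try_parse(p, data) for name, p in parsers.items()}
        verdicts = {name: r[0] for name, r in results.items()}
        if len(set(verdicts.values())) > 1:
            print(f"DISAGREEMENT on {data!r}: {verdicts}")
        return results

    parsers = {
        "json_strict": json.loads,
        "json_lax": lambda s: json.loads(s, strict=False),
    }
    # A raw tab inside a string: rejected in strict mode, accepted otherwise.
    differential(parsers, '{"a": "tab\tinside"}')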

William McKeeman coined the term in a 1998 paper, "Differential Testing for Software," where he applied it to compilers. McKeeman's insight was that when two C compilers produce executables that behave differently on the same well-defined source program, at least one of the compilers is wrong. That observation has since been extended to nearly every domain where multiple independent implementations exist.

For supply-chain work, the payoff is that differential testing finds the bugs no pair of implementations has ever been tested for together. Each individual library has its own unit tests. What nobody has is a unit test that says "the output of OpenSSL's X.509 parser should match the output of Go's X.509 parser," and that is exactly where the interesting bugs live.
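
That missing cross-implementation test is cheap to approximate. The sketch below is my illustration, not anything shipped by either project: it compares the openssl CLI's verdict on a PEM certificate against the Python cryptography package's parser, which in recent releases is an independent Rust implementation rather than a wrapper around OpenSSL's:

    # Compare two X.509 parsers' accept/reject verdicts on one certificate.
    # Assumes the openssl CLI is on PATH and `cryptography` is installed.
    import subprocess
    from cryptography import x509

    def openssl_accepts(pem_path):
        # Exit code 0 means OpenSSL managed to parse the certificate.
        proc = subprocess.run(
            ["openssl", "x509", "-noout", "-in", pem_path],
            capture_output=True,
        )
        return proc.returncode == 0

    def cryptography_accepts(pem_path):
        try:
            with open(pem_path, "rb") as f:
                x509.load_pem_x509_certificate(f.read())
            return True
        except Exception:
            return False

    pem = "cert.pem"  # hypothetical input file
    a, b = openssl_accepts(pem), cryptography_accepts(pem)
    if a != b:
        print(f"DISAGREEMENT on {pem}: openssl={a} cryptography={b}")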

Classic Targets

X.509 certificate parsing is the poster child for differential testing. In 2014, Chad Brubaker and collaborators published "Using Frankencerts for Automated Adversarial Testing of Certificate Validation in SSL/TLS Implementations" at IEEE Security and Privacy. They used differential testing to compare OpenSSL, GnuTLS, NSS, and CyaSSL on millions of synthetic certificates, finding dozens of CVE-worthy bugs. Several of the findings involved certificates that one library accepted as valid while another rejected them, which is directly exploitable when the two libraries sit on opposite sides of a trust boundary.

JSON parsing has been another productive target. Nicolas Seriot's 2016 article, "Parsing JSON is a Minefield," documented dozens of inconsistencies between JSON parsers across languages. Some parsers accept trailing commas; others do not. Some handle Unicode surrogate pairs correctly; others produce garbage. These inconsistencies become vulnerabilities when, for example, a WAF and a backend parse the same request differently.
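
A short list of hand-picked edge cases is enough to start measuring these splits. The inputs below are in the spirit of Seriot's corpus; the parser set (stdlib json plus the third-party simplejson) is just an example of what a Python stack might contain:

    # Probe several parsers with contentious JSON inputs and print an
    # accept/reject matrix. Swap in whichever parsers your stack uses.
    import json
    import simplejson

    parsers = {"json": json.loads, "simplejson": simplejson.loads}

    edge_cases = [
        '[1, 2, 3,]',        # trailing comma
        '{"a": 1, "a": 2}',  # duplicate key
        '"\\ud800"',         # lone surrogate half
        '[1e400]',           # number that overflows a double
        '[0x2a]',            # hex literal, invalid per RFC 8259
    ]

    for case in edge_cases:
        verdicts = {}
        for name, parse in parsers.items():
            try:
                parse(case)
                verdicts[name] = "accept"
            except Exception:
                verdicts[name] = "reject"
        flag = "  <-- split" if len(set(verdicts.values())) > 1 else ""
        print(f"{case!r}: {verdicts}{flag}")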

Protocol buffers, YAML, TOML, and XML have all been productive targets. The HTTP Request Smuggling research by James Kettle at PortSwigger, while not strictly differential testing in the academic sense, uses the same core insight: find cases where a front-end proxy and a back-end server parse the same HTTP request differently.

Cryptographic libraries are another classic target. The Wycheproof test suite, published by Google in 2016, systematically exercises crypto libraries with test vectors built around known attacks, and the same vectors double as a differential corpus across implementations. It has found bugs in nearly every major implementation.

Running a Differential Campaign

The mechanics of a differential campaign are simpler than they sound. You need three things: a corpus of inputs, a set of implementations, and a harness that runs all the implementations on each input and compares the outputs.

The input corpus is the hard part. Random bytes rarely produce useful differential results because most implementations reject them outright. You want inputs that are structurally valid but probe edge cases. There are three common approaches.

The first is seed corpora plus mutation. You start with a corpus of valid inputs, mutate them with a structure-aware fuzzer like libprotobuf-mutator or Nautilus, and feed the mutants to all implementations. This is the approach used by the Go team's differential fuzzing of the standard library's crypto implementations.
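
At scale you want a structure-aware tool, but a crude byte-level mutator over valid seeds is enough to illustrate the mechanics (a toy sketch, not how the tools above work internally):

    # Toy byte-level mutator over valid seed inputs.
    import random

    def mutate(seed: bytes, n_edits: int = 3) -> bytes:
        data = bytearray(seed)
        for _ in range(n_edits):
            op = random.choice(("flip", "delete", "duplicate"))
            i = random.randrange(len(data))
            if op == "flip":
                data[i] ^= 1 << random.randrange(8)  # flip one bit
            elif op == "delete" and len(data) > 1:
                del data[i]
            else:
                j = min(len(data), i + random.randint(1, 8))
                data[i:i] = data[i:j]  # duplicate a short chunk in place
        return bytes(data)

    seeds = [b'{"user": "alice", "admin": false}']
    for _ in range(5):
        print(mutate(random.choice(seeds)))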

The second is grammar-based generation. You write a grammar that produces valid inputs for the target format and generate a large number of them. Tools like Gramatron and Grammarinator make this relatively easy. This approach works well for formats with a clean grammar, like JSON or ASN.1.
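
For a format as small as JSON you can encode the grammar by hand. A stripped-down generator, biased toward contentious corners of the spec:

    # Tiny grammar-driven JSON generator. Real tools consume a grammar
    # file; this hand-rolled version just shows the shape of the idea.
    import random

    def gen_value(depth=0):
        if depth > 3:  # cap recursion depth
            return random.choice(["null", "true", "-0", '"\\u0000"'])
        kind = random.choice(["object", "array", "string", "number", "literal"])
        if kind == "object":
            items = [f'"{k}": {gen_value(depth + 1)}'
                     for k in random.sample("abcde", random.randint(0, 3))]
            return "{" + ", ".join(items) + "}"
        if kind == "array":
            return "[" + ", ".join(gen_value(depth + 1)
                                   for _ in range(random.randint(0, 3))) + "]"
        if kind == "string":
            return random.choice(['""', '"\\ud834\\udd1e"', '"\\u2028"'])
        if kind == "number":
            return random.choice(["0", "-0", "1e-999", "1e400",
                                  "9007199254740993"])
        return random.choice(["true", "false", "null"])

    for _ in range(5):
        print(gen_value())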

The third is specification-driven generation. For formats with a formal specification, you can sometimes generate inputs directly from the spec. The Frankencerts work sat between the second and third approaches: rather than generating from the X.509 grammar alone, it decomposed a large corpus of real certificates into parts and recombined them at random.

What Makes Differential Testing Hard

The comparison function is surprisingly tricky. Two implementations might produce structurally different outputs that are semantically equivalent. A parser that round-trips numbers at a different precision is not necessarily buggy, but a naive byte-level comparison would flag it. You need to normalize outputs before comparing.
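
In practice that means comparing a canonical form rather than raw bytes. For JSON, one workable normalization (an illustrative choice, not the only defensible one) is a sorted, whitespace-free re-serialization of the parsed values:

    # Compare parsers on a canonical re-serialization, so key order and
    # whitespace differences do not count as disagreements.
    import json

    def canonical(value):
        return json.dumps(value, sort_keys=True, separators=(",", ":"))

    def outputs_equivalent(a, b):
        return canonical(a) == canonical(b)

    assert outputs_equivalent({"b": 1, "a": [1.0]}, {"a": [1.0], "b": 1})

The trade-off is that normalization can erase real differences, such as two parsers reading the same decimal literal into different values, so decide per format which differences the canonical form is allowed to hide.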

Implementations also fail in different ways. One library might throw an exception on an invalid input, while another returns a default value. Both behaviors might be correct under different readings of the spec. The differential harness needs to decide which differences count.

For supply-chain work, the useful rule is to flag differences that correspond to trust decisions. A JSON parser that accepts an input another parser rejects is interesting because somebody somewhere is going to make a security decision based on that acceptance. A TLS library that considers a certificate valid when another considers it invalid is interesting for the same reason.

A Worked Example: CRLF Smuggling

In 2023, researchers used differential testing to find a class of HTTP header parsing bugs across major web frameworks. The setup was simple: feed the same raw HTTP request to different parsers and check whether they produce the same decomposition into headers and body.

The finding was that several parsers disagreed about how to handle bare newlines versus CRLF line endings, and about how to handle whitespace around header names. The disagreement was exploitable when a reverse proxy normalized headers differently from the application behind it. An attacker could smuggle headers past the proxy that the application would honor.
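
The disagreement is easy to reproduce with two toy parsers, one splitting only on CRLF and one tolerating bare LF. Both are simplified stand-ins, not excerpts from the affected frameworks:

    # Two simplified header parsers that disagree about bare LF.
    def parse_crlf_only(raw: bytes) -> dict:
        headers = {}
        for line in raw.split(b"\r\n"):
            if b":" in line:
                name, _, value = line.partition(b":")
                headers[name.strip().lower()] = value.strip()
        return headers

    def parse_any_newline(raw: bytes) -> dict:
        headers = {}
        for line in raw.replace(b"\r\n", b"\n").split(b"\n"):
            if b":" in line:
                name, _, value = line.partition(b":")
                headers[name.strip().lower()] = value.strip()
        return headers

    # The bare \n hides X-Admin inside the Host value for the strict
    # parser, while the lenient parser honors it as a separate header.
    raw = b"Host: example.com\nX-Admin: true\r\n"
    print(parse_crlf_only(raw))    # one header
    print(parse_any_newline(raw))  # two headers, including x-admin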

This kind of finding is invisible to conventional testing. Each parser has its own unit tests. The bug only emerges when you compare them against each other.

When Differential Testing Fails

Differential testing requires multiple independent implementations. If only one implementation exists, you have nothing to compare against. This sounds obvious, but it matters in modern ecosystems, where a protocol or format often has only one widely used library.

It also struggles with implementations that have diverged intentionally. If two libraries implement slightly different profiles of the same spec, every difference looks like a bug, and the signal-to-noise ratio collapses.

Finally, differential testing is not great at finding bugs that affect all implementations equally. If every parser of a given format has the same integer overflow, differential testing will not find it, because the parsers all agree that the overflowed result is correct.

Fitting Differential Testing into a Program

For a supply-chain security team, differential testing is a specialist tool rather than a daily practice. Most teams do not have the capacity to run a full differential campaign against every dependency. The leverage comes from picking a few high-impact targets and hitting them hard.

Formats that carry security decisions are the right targets. X.509, JWT, SAML, PKCS, JSON Web Encryption, CBOR, and TLS handshakes all qualify. If your dependency graph includes multiple libraries that implement one of these formats, differential testing those libraries against each other is usually productive.

The public test suites and corpora from Google (Wycheproof, Project Zero's fuzzing corpora, OSS-Fuzz) are the first stop. Running an affected library against the relevant corpus is a twenty-minute job and often produces findings.
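
As a sketch of what that job looks like, here is a minimal runner for one Wycheproof ECDSA vector file against the Python cryptography package. The file name and field layout follow the schema published in the Wycheproof repository, but verify them against the version you download:

    # Run one Wycheproof vector file against `cryptography`.
    import json
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import ec
    from cryptography.hazmat.primitives.serialization import load_der_public_key

    with open("ecdsa_secp256r1_sha256_test.json") as f:
        suite = json.load(f)

    for group in suite["testGroups"]:
        key = load_der_public_key(bytes.fromhex(group["keyDer"]))
        for test in group["tests"]:
            try:
                key.verify(bytes.fromhex(test["sig"]),
                           bytes.fromhex(test["msg"]),
                           ec.ECDSA(hashes.SHA256()))
                outcome = "valid"
            except Exception:
                outcome = "invalid"  # bad signature or malformed DER
            # "acceptable" vectors may go either way; the rest must match.
            if test["result"] != "acceptable" and outcome != test["result"]:
                print(f'tcId {test["tcId"]}: expected {test["result"]}, '
                      f'got {outcome} ({test["comment"]})')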

How Safeguard Helps

Safeguard integrates with several public differential test suites, including Wycheproof, to flag dependencies that fail known differential tests. When our research team runs new differential campaigns against common supply-chain targets, findings are published through our zero-day pipeline before the CVEs land. For teams running their own differential campaigns, Safeguard can ingest the resulting findings as structured advisories and propagate them across your dependency graph, so that a bug found in one implementation surfaces automatically against every application that uses it.
