Reachability Noise Reduction: Findings

The Safeguard Research team ran reachability analysis across a large corpus of real codebases. This is what we learned about which CVEs actually matter.

Shadab Khan
Security Engineer
7 min read

Every security engineer who has inherited an SCA tool has faced the same meeting: a screen full of "critical" CVEs that nobody can get to zero, on code that nobody has the appetite to refactor, against dependencies that nobody wants to fork. The pitch for reachability analysis has been consistent for several years now. Only a fraction of CVEs in a dependency graph are actually exercised by the code you wrote. If you can tell which ones, you can spend your engineering time on the small list that matters.

The Safeguard Research team spent three months running reachability analysis against a deliberately diverse corpus of repositories, and in this post we share what we found, how we measured it, and what we think this means for the vulnerability management programs our customers run.

How did the team design the study?

We picked a corpus of roughly 150 production codebases spanning JavaScript, TypeScript, Python, Java, and Go, weighted toward the ecosystems we see most often in customer environments, and ran three variants of analysis against each.

The variants were: classic SCA with no reachability signal, function-level static reachability (can the vulnerable symbol be reached from any application entry point, ignoring runtime configuration), and contextual reachability (function-level reachability plus consideration of framework routing, feature flags, and optional dependency groups). For every reported advisory, we recorded whether the vulnerable symbol existed in the resolved dependency tree, whether it was statically reachable, and whether a reasonable human reviewer would rate it exploitable in context.

The human-review step mattered. Our engineers spent a capped amount of time per finding, with a written rubric, flagging each as exploitable, conditionally exploitable, or not exploitable in the deployed configuration. This gave us a ground-truth column to compare automated signals against.
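The per-finding record this produces can be sketched as a small data structure. The field names below are our illustration, not the study's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    """Rubric outcomes from the capped-time human review."""
    EXPLOITABLE = "exploitable"
    CONDITIONAL = "conditionally exploitable"
    NOT_EXPLOITABLE = "not exploitable"

@dataclass
class FindingRecord:
    advisory_id: str               # e.g. a CVE or GHSA identifier
    symbol_in_tree: bool           # vulnerable symbol exists in the resolved dependency tree
    statically_reachable: bool     # function-level reachability from any entry point
    contextually_reachable: bool   # plus framework routing, feature flags, optional extras
    human_verdict: Verdict         # ground-truth column for comparison
```

Comparing the two automated boolean columns against `human_verdict` is what lets us talk about false positives and false negatives later in this post.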

How much noise does reachability actually remove?

In our corpus, static function-level reachability removed between 70% and 90% of advisories from the actionable queue, depending on ecosystem and advisory style.

The largest reductions were in Java and JavaScript, where the average transitive graph is deep and most advisories target internal utility code that application developers do not call directly. The smallest reductions were in Go, where dependency graphs tend to be flatter and applications more often use libraries in the ways the libraries were built for.

Contextual reachability, layered on top, trimmed another meaningful slice, frequently another 20% to 40% of what remained. These were overwhelmingly cases where the vulnerable code path required a framework feature the application did not use, an optional dependency extra that was not installed, or a configuration flag disabled by policy.
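To make the compounding concrete, here is the arithmetic on a hypothetical queue of 1,000 advisories, using mid-range figures from the bands above:

```python
advisories = 1000

# Static function-level reachability removes ~80% (mid-range of 70-90%).
after_static = advisories * 20 // 100

# Contextual reachability trims ~30% of what remains (mid-range of 20-40%).
after_context = after_static * 70 // 100

print(after_static, after_context)  # 200 then 140 advisories remain
```

The two filters multiply: a queue of a thousand findings becomes a queue of a hundred and forty before any exploit-likelihood signal is applied.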

Those numbers match the broad range of published results from other reachability vendors and academic studies, which we think is a healthy sign that the technique is measuring something real rather than something tool-specific.

Where does reachability get it wrong?

Reachability's error modes are asymmetric. It under-reports when analysis is incomplete, and it over-reports when a reachable symbol is technically callable but never reached by user-controlled input.

We classified each false negative and false positive we could identify in the corpus. The false-negative categories we saw most often were reflective method calls, dynamic imports, gRPC or protobuf-generated code that escaped the call-graph builder, and native code bridges. False positives were dominated by reachable-but-benign paths, especially in logging frameworks and test utilities that sometimes ship in production images.
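A minimal illustration of the dynamic-dispatch false-negative mode: because the target function is resolved at runtime from request data, a call-graph builder that only follows direct calls has no edge from the entry point to whatever handler ends up being invoked. All names here are hypothetical:

```python
import importlib

def handle_request(module_name: str, action: str, payload: str):
    # The handler is looked up reflectively, so no static call edge
    # exists from this function to the symbol that actually runs.
    mod = importlib.import_module(module_name)
    handler = getattr(mod, action)
    return handler(payload)
```

At runtime `handle_request("json", "loads", '"x"')` reaches `json.loads`, but a static analysis sees only `import_module` and `getattr`. Substitute a vulnerable library symbol for `json.loads` and you have a reachable finding the tool reports as not reachable.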

The practical consequence is that a reachable finding is strong signal but not proof of exploitability, and a not-reachable finding is strong signal but not proof of safety. Teams that treat either as gospel will be wrong a measurable fraction of the time.

How does reachability interact with EPSS and KEV?

Reachability is most useful when combined with exploit-likelihood signals. It is least useful when used alone.

In our corpus, roughly 2% to 5% of classic SCA findings corresponded to advisories with high EPSS scores or listings on CISA's KEV catalogue. When we filtered the reachable set by those exploit-likelihood signals, we consistently ended up with a double-digit to low-triple-digit list across the entire corpus, not per repository. That is a list a real security team can work through.

By contrast, filtering by EPSS alone without reachability still left many thousands of findings in aggregate, most of which we then had to discard manually because the vulnerable code was not exercised. Filtering by reachability alone without exploit signal left us investigating reachable-but-obscure findings that the attacker community had not shown any interest in.

The combination is what makes the queue tractable.
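A sketch of the combined filter, assuming each finding carries a reachability flag, an EPSS score, and a KEV-membership flag. The threshold and field names are illustrative, not a recommended policy:

```python
def actionable(findings, epss_threshold=0.1):
    """Keep findings that are both reachable and exploit-likely."""
    return [
        f for f in findings
        if f["reachable"] and (f["epss"] >= epss_threshold or f["kev"])
    ]

queue = actionable([
    {"id": "CVE-A", "reachable": True,  "epss": 0.02, "kev": False},  # reachable, no exploit interest
    {"id": "CVE-B", "reachable": True,  "epss": 0.45, "kev": False},  # reachable and exploit-likely
    {"id": "CVE-C", "reachable": False, "epss": 0.90, "kev": True},   # exploit-likely, not exercised
])
# queue contains only CVE-B
```

Note that the two conditions are joined with "and": EPSS or KEV alone keeps CVE-C in the queue, and reachability alone keeps CVE-A. Only the conjunction produces the short list.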

Does reachability shift developer behaviour?

Yes, and not always in the direction you would hope. The most important behavioural effect we saw, anecdotally across the engineering teams involved, was a reduction in defensive patching churn. When reachability was trusted, teams stopped reflexively bumping every dependency on every advisory and instead planned upgrades around product milestones.

The less healthy effect was a subtle complacency around not-reachable findings. We saw teams close tickets aggressively on "not reachable" status without revisiting them when new code introduced a new call path. Reachability is not a one-time verdict. It is a property of the current state of your codebase, and it must be re-evaluated on every meaningful change.

Our recommendation for teams adopting reachability is to keep a background scan job that re-evaluates previously-not-reachable findings against every merge to the default branch, and to set a short service-level objective for re-triage if the status flips.
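The re-triage loop above can be sketched as a per-merge job. The function names, the injected `rescan` callable, and the 48-hour SLO are our illustration of the idea, not a prescribed implementation:

```python
from datetime import datetime, timedelta

RETRIAGE_SLO = timedelta(hours=48)  # illustrative service-level objective

def retriage_on_merge(previously_not_reachable, rescan):
    """Re-evaluate dormant findings after each merge to the default branch.

    `rescan` runs reachability analysis for one finding against the new
    commit and returns True if the status flipped to reachable.
    """
    flipped = []
    for finding in previously_not_reachable:
        if rescan(finding):
            # Status flipped: put the finding back in the triage queue
            # with a deadline, rather than leaving it silently closed.
            finding["retriage_due"] = datetime.utcnow() + RETRIAGE_SLO
            flipped.append(finding)
    return flipped
```

The important property is that "not reachable" is stored as a dated verdict that expires on code change, not as a permanent resolution.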

What should teams measure to know it is working?

The two metrics we found most useful were actionable backlog size and mean time to remediate on reachable, high-EPSS findings.

Actionable backlog size is the count of findings your team has agreed to treat as work. Before reachability, this number is usually unbounded. After reachability, it should fit in a spreadsheet, and it should move day to day in a way that reflects actual engineering activity. If it does not, either your reachability tool is miscalibrated or your suppression policy is hiding too much.

Mean time to remediate on reachable, high-EPSS findings is the number that connects to risk reduction. It is the one we report to leadership as evidence that vulnerability management is a closed-loop process rather than an ever-growing backlog. In teams that had adopted reachability for at least two quarters, we saw this metric drop substantially, sometimes by an order of magnitude, primarily because the queue was small enough to actually work.
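Both metrics fall out of the same finding records. A sketch, assuming each record carries a status, an actionability flag, and open/close timestamps (the 0.1 EPSS cut-off is illustrative):

```python
from datetime import datetime

def actionable_backlog(findings):
    """Count of open findings the team has agreed to treat as work."""
    return sum(1 for f in findings if f["status"] == "open" and f["actionable"])

def mttr_days(closed_findings, epss_threshold=0.1):
    """Mean time to remediate, in days, over reachable high-EPSS findings."""
    relevant = [
        f for f in closed_findings
        if f["reachable"] and f["epss"] >= epss_threshold
    ]
    if not relevant:
        return None
    total = sum((f["closed_at"] - f["opened_at"]).days for f in relevant)
    return total / len(relevant)
```

Restricting MTTR to the reachable, high-EPSS slice is deliberate: averaging over everything lets a flood of easy, irrelevant fixes mask slow remediation on the findings that carry the risk.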

What are the limits of what we measured?

Our study was limited by corpus composition, a snapshot-in-time view, and the framework coverage of static analysis tooling.

The corpus was weighted toward well-structured backend services and CLIs, and away from plugin-heavy systems and notebook-driven workflows where static reachability is known to struggle. We expect reachability reductions to be smaller in Jupyter or plugin-based environments, and we would not generalise our 70% to 90% figure to those settings without further work.

The snapshot view means our numbers describe the queue at a point in time, not the full history of an application. A reachability tool must maintain accuracy across years of refactors, framework upgrades, and language runtime changes, and that is a harder engineering problem than getting a single run right.

What this means for the rest of your program

Reachability is not a silver bullet, but in our data it is the single most effective noise-reduction technique we measured, ahead of severity filtering, ahead of tag-based ignoring, and ahead of basic EPSS filtering. Combined with exploit-likelihood signals and an honest understanding of its error modes, it turns vulnerability management from a treadmill into a manageable queue.

The teams that get the most value from reachability are the teams that use the freed-up time to harden the small set of things that actually matter: the reachable, exploitable findings on the exposed attack surface.

How Safeguard.sh Helps

Safeguard.sh combines function-level and contextual reachability with EPSS, KEV, and CI-time exploit telemetry to produce a single prioritised queue rather than an unbounded list. We re-evaluate reachability on every merge, surface status flips as alerts, and expose the underlying call path so engineers can verify and disagree. Our customers typically see queue reductions consistent with the ranges in this post within the first two weeks of use, with the remaining work focused on the small list of reachable, exploitable findings that actually move risk.
