
Static Analysis False-Positive Reduction

A technique-by-technique tour of how modern static analyzers cut false positives, from CodeQL's path pruning to Infer's bi-abduction.

Shadab Khan
Security Engineer
8 min read

A static analyzer that produces a ten-thousand-finding report is not a security tool. It is a random number generator that occasionally coincides with a real vulnerability. The cost of triaging those ten thousand findings is so high that developers stop looking after the first few dozen, at which point the analyzer is functionally turned off whether or not it is still running in CI.

The interesting engineering problem in modern static analysis is not finding potential issues. A first-semester compiler student can write a pattern matcher that flags every call to strcpy. The interesting problem is reporting only the issues that matter, in a form developers will actually read, with enough context for them to act. This is the false-positive reduction problem, and it is where the commercial and academic state of the art is concentrated.

This post walks through the techniques that modern static analyzers use to cut false positives, with enough detail that a practitioner can apply the same ideas to a custom analysis.

The Soundness-Precision Tradeoff

Every static analysis lives on a spectrum between sound and precise. A sound analysis reports every possible bug, at the cost of reporting many non-bugs. A precise analysis reports only real bugs, at the cost of missing some. No terminating analysis can be both for a non-trivial semantic property, because such properties are undecidable by Rice's theorem.

In practice, analyzers pick a position in that space and then spend enormous engineering effort pushing it toward the corner where both soundness and precision are high. The specific techniques for doing so are the subject of the rest of this post.

Flanagan and Leino's 2001 paper, "Houdini, an Annotation Assistant for ESC/Java," framed the problem clearly: the analyzer should not emit warnings it cannot substantiate. That framing still guides the design of tools like Facebook's Infer.

Path Sensitivity and Pruning

A flow-sensitive analyzer tracks different facts at different program points. A path-sensitive analyzer tracks different facts along different execution paths. The difference matters because many false positives come from conflating paths that are not actually feasible.

Consider a function with a branch that checks a null argument and returns early. A flow-sensitive but path-insensitive analyzer might warn about a null dereference later in the function, because it does not track which paths reach that later point. A path-sensitive analyzer knows that the dereference is only reachable when the argument is non-null and suppresses the warning.
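As a concrete sketch, here is that guarded function reduced to a three-node control-flow graph and analyzed both ways in Python. The miniature IR and both walkers below are illustrative assumptions, not any real analyzer's internals.

# Function under analysis (pseudocode):
#     def f(p):
#         if p is None:
#             return None
#         return p.value        # dereference happens at node "deref"
#
# Edges are tagged with what the taken branch guarantees about p.
CFG = {
    "entry": [("early_return", {"null"}), ("deref", {"non-null"})],
    "early_return": [],
    "deref": [],
}
ENTRY_FACTS = {"null", "non-null"}  # the caller may pass either

def path_insensitive_nullness(node):
    # Merge whatever can reach the node, ignoring which branch each edge
    # represents: at "deref" we still believe p might be null, so we warn.
    if node == "entry":
        return set(ENTRY_FACTS)
    incoming = set()
    for src, edges in CFG.items():
        for dst, _branch_fact in edges:          # the branch condition is dropped
            if dst == node:
                incoming |= path_insensitive_nullness(src)
    return incoming

def path_sensitive_nullness(node):
    # Track one state per path and intersect it with what each taken branch
    # guarantees: the only path to "deref" establishes that p is non-null.
    if node == "entry":
        return [set(ENTRY_FACTS)]
    states = []
    for src, edges in CFG.items():
        for dst, branch_fact in edges:
            if dst == node:
                for state in path_sensitive_nullness(src):
                    states.append(state & branch_fact)   # prune infeasible values
    return states

print(path_insensitive_nullness("deref"))   # {'null', 'non-null'} -> spurious warning
print(path_sensitive_nullness("deref"))     # [{'non-null'}]       -> warning suppressed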

Full path sensitivity is expensive, so most practical analyzers approximate it. CodeQL uses a combination of flow labels and path problem queries to achieve path sensitivity for the specific properties a query cares about. Semgrep Pro's interprocedural taint tracking does something similar for taint propagation. The design pattern in both cases is to track only the properties relevant to the current check, rather than all program facts.

Sanitizer Specifications

Every injection-class false positive has the same root cause: the analyzer sees data flow from a source to a sink and does not know that the flow passes through a sanitizer. Fixing this is boring but essential work.

The best analyzers ship with extensive sanitizer specifications for common frameworks. CodeQL's ruby-on-rails library knows about ActionController::Parameters#permit. Semgrep's taint mode includes sanitizer rules for Spring's @Valid annotation and Laravel's form request validation. The 2021 OOPSLA paper, "High-Level Abstractions for Giving Users Control Over Flow-Sensitive Type-Based Analyses," by Shixin Tian and colleagues at Purdue, describes a framework for letting users write sanitizer specifications without modifying the analyzer itself.

The practical work for a product security team is to extend the sanitizer list for the codebase's custom helpers. If you have a function called safeInterpolate that escapes HTML, you need to tell the analyzer about it. Leaving this undone is the single largest source of false positives in most SAST deployments.
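A toy illustration of what that registration buys, assuming a hypothetical helper named safeInterpolate and a finding represented as the list of functions its data flow passes through; real tools each have their own specification format.

# Sanitizers the analyzer already knows about, plus the codebase's own helper.
SANITIZERS = {"html.escape"}
SANITIZERS.add("app.util.safeInterpolate")   # the custom helper from the paragraph above

def should_report(trace):
    # A source-to-sink trace is reported only if no step is a known sanitizer.
    return not any(step in SANITIZERS for step in trace)

# Flow: request parameter -> custom escaping helper -> template sink.
trace = ["request.args.get", "app.util.safeInterpolate", "render_template_string"]
print(should_report(trace))   # False once the helper is registered; True before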

Ranking Models

Not all findings are equally likely to be real. A finding that involves two hops through standard library code is more likely to be a false positive than one that involves a direct data flow within a single function. A finding along a path that includes an obvious guard condition is more likely to be a false positive than one with no guards.

Modern analyzers use ranking models to surface the most likely true positives first. The ranking features typically include path length, number of sanitizer candidates along the path, function-call depth, and heuristic signals like whether the source and sink are in the same file.
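A minimal sketch of such a scorer, using the features above with made-up weights; a real deployment would fit the weights to the team's triage history.

def ranking_score(finding):
    # Higher score = more likely to be a true positive, so sort descending.
    score = 0.0
    score -= 0.3 * finding["path_length"]            # long flows tend to be noisier
    score -= 1.0 * finding["sanitizer_candidates"]   # possible but unmodeled cleansing
    score -= 0.2 * finding["call_depth"]
    score += 0.5 if finding["same_file"] else 0.0    # local flows are easier to confirm
    return score

findings = [
    {"id": "sqli-1", "path_length": 2, "sanitizer_candidates": 0, "call_depth": 1, "same_file": True},
    {"id": "xss-7",  "path_length": 9, "sanitizer_candidates": 2, "call_depth": 5, "same_file": False},
]
for finding in sorted(findings, key=ranking_score, reverse=True):
    print(finding["id"], round(ranking_score(finding), 2))   # sqli-1 surfaces first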

GitHub's research on CodeQL findings showed that a simple ranking model based on these features could surface true positives in the top ten findings about eighty percent of the time, even when the overall false-positive rate was much higher. For a developer reviewing findings at the top of a list, this is the difference between "worth my time" and "another SAST report."

Bi-Abduction and Compositional Analysis

Facebook's Infer uses a technique called bi-abduction, developed by Cristiano Calcagno, Dino Distefano, Peter O'Hearn, and Hongseok Yang in a 2009 POPL paper. Bi-abduction computes, for each function, a specification that describes what the function requires of its inputs and guarantees about its outputs. The specifications are composable, so the analysis of a whole program is the composition of the specifications of its parts.

The false-positive reduction benefit comes from the way bi-abduction handles unknown context. If a function's inputs might be null, the specification simply says so, and the analyzer only warns when the function is actually called with a null argument. This avoids the classic false positive where an analyzer warns about every function that dereferences a pointer, regardless of whether any caller ever passes null.
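A cartoon of the compositional idea, reduced to a precondition check: each function carries a summary of what it requires, and a warning fires only at a call site that fails to establish that requirement. The summary table and call-site facts below are assumptions for illustration, not Infer's separation-logic machinery.

# Per-function summaries, inferred from each function body in isolation.
SUMMARIES = {
    # deref_field dereferences its argument, so its inferred precondition is
    # "argument is non-null"; analyzing the function alone produces no warning.
    "deref_field": {"requires": "non-null"},
}

def check_call(caller, callee, known_arg_fact):
    # Warn only when the caller cannot establish the callee's precondition.
    required = SUMMARIES[callee]["requires"]
    if known_arg_fact != required:
        print(f"{caller}: argument to {callee} may violate precondition ({required})")

check_call("handler_a", "deref_field", "non-null")    # no warning
check_call("handler_b", "deref_field", "maybe-null")  # the only warning emitted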

Infer is not the only bi-abductive analyzer, but it is the most widely deployed. Its track record on large codebases at Meta, Uber, and other adopters is a strong argument that compositional analysis scales.

Interprocedural Context Sensitivity

Context-sensitive analysis distinguishes different callers of the same function. A context-insensitive analysis treats foo() as always producing the same result, regardless of who calls it. A context-sensitive analysis produces different results for different call sites.

The classic example is a utility function like format. If one caller passes a constant and another caller passes a request parameter, a context-insensitive analyzer has to assume the worst for both, which leads to false positives in the constant case. A context-sensitive analyzer tracks the two calls separately and only warns about the one with tainted input.
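Reduced to a sketch, the difference looks like this; the call table and the taint marking are illustrative assumptions. The context-insensitive summary forces a warning at both call sites, while the call-site-sensitive version flags only the one receiving request data.

# Every call site of format(), with where its argument comes from.
CALLS_TO_FORMAT = {
    "logging.py:14": "constant",         # format("build %s ok")
    "views.py:88":   "request_param",    # format(request.args["q"])
}

def context_insensitive():
    # One merged summary for all callers: if any argument may be tainted,
    # every call site inherits the warning.
    any_tainted = any(origin == "request_param" for origin in CALLS_TO_FORMAT.values())
    return {site: any_tainted for site in CALLS_TO_FORMAT}

def context_sensitive():
    # One result per call site: only the call that actually receives
    # request data is flagged.
    return {site: origin == "request_param" for site, origin in CALLS_TO_FORMAT.items()}

print(context_insensitive())   # both call sites flagged -> one false positive
print(context_sensitive())     # only views.py:88 flagged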

The state of the art for context sensitivity in Java is k-CFA and object-sensitive variants. Yannis Smaragdakis's group at the University of Athens has published extensively on the tradeoffs, and their 2014 PLDI paper, "Introspective Analysis: Context-Sensitivity, Across the Board," is a good starting point.

User Feedback Loops

The techniques above are all analyzer-side. The other half of false-positive reduction is the feedback loop between developers and the analyzer.

A mature SAST deployment captures developer dispositions (true positive, false positive, accepted risk) and feeds them back into the analyzer. Findings that have been marked false positive in one place should not be re-surfaced in similar places. Findings that correspond to previously confirmed true positives should be ranked higher.
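A minimal sketch of such a loop, assuming each finding can be fingerprinted by rule, file, and sink symbol so that dispositions survive unrelated edits; the fingerprint scheme and the ranking boost are illustrative choices, not any particular vendor's design.

import hashlib

def fingerprint(finding):
    # Stable identity for a finding, independent of line numbers, so a
    # disposition keeps applying after unrelated edits to the file.
    key = f"{finding['rule']}|{finding['file']}|{finding['sink']}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

DISPOSITIONS = {}   # fingerprint -> "true_positive" | "false_positive" | "accepted_risk"

def record_disposition(finding, verdict):
    DISPOSITIONS[fingerprint(finding)] = verdict

def triage_queue(findings):
    queue = []
    for finding in findings:
        verdict = DISPOSITIONS.get(fingerprint(finding))
        if verdict == "false_positive":
            continue                                   # never re-surface a known false positive
        boost = 1.0 if verdict == "true_positive" else 0.0
        queue.append((finding["score"] + boost, finding))
    return [finding for _score, finding in sorted(queue, key=lambda item: item[0], reverse=True)]

known_fp = {"rule": "xss", "file": "views.py", "sink": "render", "score": 0.6}
record_disposition(known_fp, "false_positive")
new_finding = {"rule": "sqli", "file": "db.py", "sink": "execute", "score": 0.3}
print(triage_queue([known_fp, new_finding]))   # only the sqli finding remains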

Commercial tools like Snyk, Checkmarx, and Veracode have all converged on some form of this feedback loop. The open-source options are more limited, but Semgrep's baseline mode provides a basic form of it by suppressing findings that existed before a given commit.

Where Machine Learning Fits

There is a growing body of research on machine-learning-based false-positive filters. The basic idea is to train a classifier on a corpus of triaged findings and use it to predict which new findings are likely to be true positives.
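A minimal sketch of that classifier, assuming scikit-learn is available and that triaged findings have already been reduced to numeric features; the feature set and toy data are illustrative.

from sklearn.linear_model import LogisticRegression

# One row per triaged finding: [path_length, sanitizer_candidates, same_file]
X = [[2, 0, 1], [9, 2, 0], [3, 1, 1], [12, 3, 0]]
y = [1, 0, 1, 0]   # 1 = developer confirmed it was a true positive

classifier = LogisticRegression().fit(X, y)

# Probability that a new finding is a true positive. Low-scoring findings
# should go to a periodic audit rather than being silently dropped.
print(classifier.predict_proba([[4, 1, 1]])[0][1])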

One line of research learns specifications from data rather than requiring them to be written by hand; the paper "Generating Precise Error Specifications for C" does this for error specifications, and the same idea extends to sanitizer specifications. Results vary widely by codebase and by tool, but the direction of travel is clear.

A word of caution: ML-based filters can hide real vulnerabilities if the training data is biased. Responsible deployments include a periodic audit of findings the filter suppressed.

How Safeguard Helps

Safeguard's static analysis pipeline ships with a library of sanitizer specifications for major web frameworks and our customers' most common custom helpers, which cuts the false-positive rate on injection-class findings by the largest margin of any single technique. The ranking models that drive our findings dashboard learn from every disposition your team records, so after a few weeks of use the top findings in your queue are reliably the ones worth investigating. For teams running their own SAST in parallel, Safeguard can ingest the findings and apply the same ranking and sanitizer models, giving you a single prioritized queue across every analyzer you run.
