
The OSV Vulnerability Database API Cookbook

Practical patterns for using the OSV.dev API in production: batch queries, schema gotchas, version range parsing, and how to integrate OSV data into your own vulnerability pipelines.

Daniel Chen
Platform Engineer

OSV.dev has quietly become the most useful open vulnerability data source for anyone building tooling, and most teams underuse it. The API is small, the schema is well-specified, and the data quality across ecosystems is competitive with commercial feeds for the common case. This post covers the patterns we have found durable in production after running OSV in our own pipelines for two years.

Why use OSV when NVD exists?

NVD remains the authoritative CVE feed, but it has been operationally unreliable since the analysis backlog began in early 2024 and has not fully recovered. OSV solves a different problem: it aggregates vulnerability data from ecosystem-native sources (GitHub Advisory Database, PyPI Advisory Database, RustSec, Go vulndb, and roughly two dozen others) into a single schema with consistent version range semantics. For a tool that needs to know whether a specific package version is affected, OSV usually returns a defensible answer faster than NVD combined with per-ecosystem feeds.

The API is free, has no API key requirement for the public tier, and supports both per-package lookups and batched queries. It scales well enough that we have not bothered mirroring it locally for our internal scanning. The data lag from publication to OSV ingestion runs hours, not days, for the major ecosystems.

What does the basic query pattern look like?

The simplest useful query is a POST to the /v1/query endpoint with a package name, ecosystem, and version. OSV returns a list of vulnerabilities affecting that exact version, with affected ranges, severity, and references. The schema is documented at ossf.github.io/osv-schema and worth reading carefully because the version range semantics differ subtly per ecosystem.
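
A minimal sketch of that single-version lookup, using only the Python standard library. The function names build_query and query_osv are illustrative, not part of any OSV client, and the code assumes the public endpoint at api.osv.dev.

```python
import json
import urllib.request

OSV_QUERY_URL = "https://api.osv.dev/v1/query"


def build_query(name: str, ecosystem: str, version: str) -> dict:
    """Build the JSON body for a single package-version lookup."""
    return {"version": version, "package": {"name": name, "ecosystem": ecosystem}}


def query_osv(name: str, ecosystem: str, version: str, timeout: float = 10.0) -> list:
    """POST the query and return the list of matching vulnerability records."""
    body = json.dumps(build_query(name, ecosystem, version)).encode("utf-8")
    request = urllib.request.Request(
        OSV_QUERY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # A version with no known vulnerabilities comes back without a "vulns" key.
    return payload.get("vulns", [])
```

Calling query_osv("jinja2", "PyPI", "2.4.1") would return the records for that exact version. Ecosystem names are case-sensitive, so "PyPI" and "pypi" are not interchangeable.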

For pipeline integration, the more useful endpoint is /v1/querybatch, which accepts up to 1000 package queries per request and returns vulnerability IDs without the full record. You then fetch full records for the small subset that returned hits. This pattern keeps you under the rate limits and reduces per-scan latency meaningfully. A 2000-package SBOM resolves in two batch calls plus typically 10 to 40 detail fetches, total round-trip under two seconds.
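
The batching step might look like the sketch below. It only builds request bodies and reads a batch response; the helper names are hypothetical, and the response shape (a results list aligned with the queries, each entry optionally holding a vulns list of IDs) is assumed from the behavior described above.

```python
from typing import Iterator

MAX_BATCH = 1000  # per-request query limit cited above


def chunked(items: list, size: int = MAX_BATCH) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_bodies(packages: list) -> list:
    """Turn (name, ecosystem, version) tuples into querybatch request bodies."""
    bodies = []
    for chunk in chunked(packages):
        queries = [
            {"version": version, "package": {"name": name, "ecosystem": ecosystem}}
            for name, ecosystem, version in chunk
        ]
        bodies.append({"queries": queries})
    return bodies


def ids_needing_detail(batch_response: dict) -> set:
    """Collect the distinct vulnerability IDs that need a full-record fetch."""
    ids = set()
    for result in batch_response.get("results", []):
        # Packages with no hits come back as empty objects.
        for vuln in result.get("vulns", []):
            ids.add(vuln["id"])
    return ids
```

Collecting IDs into a set before fetching details matters in practice: the same advisory often hits several packages in one SBOM, and each record only needs to be fetched once.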

Which schema gotchas matter in production?

The first gotcha is that the affected field can contain multiple ranges and the version comparator depends on the ecosystem. Semver works for npm, Go, and most others, but PyPI uses PEP 440, Maven uses its own ordering, and Debian uses dpkg version comparison. Each range carries a type (SEMVER, ECOSYSTEM, or GIT), but an ECOSYSTEM range only tells you that the package's own ecosystem defines the ordering. If you write your own range evaluator, you must dispatch on ecosystem or you will produce wrong answers for the edge cases.
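
A sketch of what dispatching on ecosystem can look like. The comparator table and numeric_key are stand-ins invented for this example: numeric_key only understands plain dotted numbers, and real code would register a proper PEP 440, Maven, or dpkg comparator per ecosystem. The event loop follows the introduced / fixed / last_affected evaluation order described in the OSV schema, and GIT ranges are skipped.

```python
import re

_PLAIN_VERSION = re.compile(r"\d+(\.\d+)*")


def numeric_key(version: str) -> tuple:
    """Stand-in comparator: plain dotted numbers only, no pre-release tags."""
    if not _PLAIN_VERSION.fullmatch(version):
        raise ValueError(f"unsupported version syntax: {version!r}")
    return tuple(int(part) for part in version.split("."))


# Hypothetical dispatch table; extend it with one comparator per ecosystem.
VERSION_KEYS = {"npm": numeric_key, "Go": numeric_key}


def key_for(ecosystem: str):
    """Fail loudly rather than guess an ordering for an unknown ecosystem."""
    base = ecosystem.split(":", 1)[0]  # e.g. "Debian:11" -> "Debian"
    if base not in VERSION_KEYS:
        raise ValueError(f"no version comparator registered for {ecosystem!r}")
    return VERSION_KEYS[base]


def _version_of(event: dict):
    for kind in ("introduced", "fixed", "last_affected"):
        if kind in event:
            return event[kind]
    return None  # e.g. "limit" events, ignored in this sketch


def in_range(version: str, events: list, key) -> bool:
    """Walk the sorted events and track whether `version` is affected."""
    v = key(version)
    usable = [e for e in events if _version_of(e) is not None]
    vulnerable = False
    for event in sorted(usable, key=lambda e: key(_version_of(e))):
        if "introduced" in event and v >= key(event["introduced"]):
            vulnerable = True
        elif "fixed" in event and v >= key(event["fixed"]):
            vulnerable = False
        elif "last_affected" in event and v > key(event["last_affected"]):
            vulnerable = False
    return vulnerable


def is_affected(affected: dict, version: str) -> bool:
    """Check one entry of a record's `affected` list against a version."""
    key = key_for(affected["package"]["ecosystem"])
    if version in affected.get("versions", []):
        return True
    ranges = [r for r in affected.get("ranges", []) if r.get("type") != "GIT"]
    return any(in_range(version, r["events"], key) for r in ranges)
```

Raising on an unregistered ecosystem is a deliberate choice here: a missing comparator becomes a visible error instead of a silent false negative.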

The second gotcha is that withdrawn vulnerabilities remain in the database with a withdrawn timestamp. Filter those out unless you specifically want historical context. The third is that aliases tie together CVE, GHSA, and ecosystem-specific IDs, so deduplication across IDs is the consumer's responsibility. We hash on the modified timestamp plus the primary alias to detect actual changes during incremental ingestion.
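
One way to sketch the withdrawn filter, alias-based grouping, and change fingerprint. The article does not define "primary alias", so canonical_id below picks one by an assumed convention (CVE first, then GHSA, then the record's own ID); treat that rule and all function names as illustrative.

```python
import hashlib


def is_active(record: dict) -> bool:
    """Withdrawn records carry a `withdrawn` timestamp; active ones do not."""
    return not record.get("withdrawn")


def canonical_id(record: dict) -> str:
    """Assumed convention: prefer a CVE ID, then a GHSA ID, then the record ID."""
    candidates = [record["id"], *record.get("aliases", [])]
    for prefix in ("CVE-", "GHSA-"):
        matches = sorted(c for c in candidates if c.startswith(prefix))
        if matches:
            return matches[0]
    return record["id"]


def change_fingerprint(record: dict) -> str:
    """Hash of canonical ID plus modified timestamp, for incremental ingestion."""
    material = f"{canonical_id(record)}|{record.get('modified', '')}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def group_by_canonical_id(records: list) -> dict:
    """Map each canonical ID to the active record IDs that describe it."""
    groups = {}
    for record in records:
        if not is_active(record):
            continue
        groups.setdefault(canonical_id(record), []).append(record["id"])
    return groups
```

Storing the fingerprint next to each ingested record lets an incremental run skip anything whose fingerprint has not changed.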

A subtler one: the severity field is not always populated, and when it is, the scoring system is often CVSS v3 even though v4 is available. Do not treat severity as ground truth, and never sort solely on it for prioritization.
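
A small sketch of reading severity defensively. It looks only at the record's top-level severity list, returns None when nothing usable is present, and prefers newer CVSS versions when several are given. The preference order and the function name are choices made for this example.

```python
# Preference order is an assumption of this sketch, newest CVSS first.
PREFERRED_TYPES = ("CVSS_V4", "CVSS_V3", "CVSS_V2")


def pick_severity(record: dict):
    """Return (type, vector_string) for the preferred entry, or None."""
    by_type = {
        entry.get("type"): entry.get("score")
        for entry in record.get("severity", [])
    }
    for severity_type in PREFERRED_TYPES:
        if by_type.get(severity_type):
            return severity_type, by_type[severity_type]
    return None  # absent severity is "unknown", not "low"
```

The score here is a CVSS vector string rather than a number, so any numeric ranking requires a separate scoring step, and records that return None still need a place in the prioritization queue.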

How should you handle freshness and incremental updates?

OSV provides a public GCS bucket with daily snapshots of the entire database, organized by ecosystem and zipped. For tools that need a local mirror, that bucket is the cleanest way to bootstrap: download once, then poll the per-vulnerability modified timestamps via the API for incremental updates. The bucket refreshes daily, so combining the snapshot with API polling gives you both bulk and near-real-time freshness.
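
The bootstrap-then-poll approach can be sketched as follows. The code assumes a per-ecosystem snapshot zip has already been downloaded and that it contains one JSON record per file; both function names are hypothetical. The observed mapping stands for whatever ID and modified pairs the API polling produces.

```python
import io
import json
import zipfile


def index_snapshot(zip_bytes: bytes) -> dict:
    """Build an {id: modified} index from a downloaded snapshot archive."""
    index = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            record = json.loads(archive.read(name))
            index[record["id"]] = record.get("modified", "")
    return index


def stale_ids(index: dict, observed: dict) -> list:
    """IDs that are new, or whose modified timestamp differs from the mirror."""
    return sorted(
        vuln_id
        for vuln_id, modified in observed.items()
        if index.get(vuln_id) != modified
    )
```

Comparing timestamps for inequality rather than ordering sidesteps any differences in timestamp formatting between sources.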

If you cannot run a local mirror, the API is reliable enough to query live on every scan. We have not seen sustained outages in two years of production use. Cache aggressively: a one-hour TTL on individual vulnerability records is safe for most pipelines. Use ETags on the batch endpoint to skip unchanged responses.
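
A minimal in-process cache with a one-hour TTL could look like this. TTLCache is a hypothetical helper, not part of any OSV library, and the fetch callable stands for whatever function retrieves a full record by ID.

```python
import time


class TTLCache:
    """Cache values per key and refetch once they are older than the TTL."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}  # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for `key`, calling fetch(key) if stale."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = fetch(key)
        self._entries[key] = (now, value)
        return value
```

A monotonic clock is used so that wall-clock adjustments cannot extend or shorten an entry's lifetime. A shared cache across scanner workers would need an external store instead of a dictionary.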

What does a useful integration look like end-to-end?

A typical integration takes an SBOM, extracts the package coordinates, batches them into querybatch calls, fetches detail for any hits, normalizes the ranges, and emits a findings record per affected package. Two patterns we have found important: filter to the package versions actually present rather than treating ranges as opaque strings, and preserve the OSV alias graph so you can correlate findings against other feeds like KEV or commercial intelligence.
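
The overall flow can be sketched with the network calls injected as plain callables, which keeps the pipeline logic testable offline. Finding, scan, query_batch, and fetch_detail are names invented for this example; query_batch is assumed to return one result per package in request order, and fetch_detail to return a full record for an ID. Because the batch query is made per exact version, this sketch does not re-evaluate ranges locally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    ecosystem: str
    version: str
    vuln_id: str
    aliases: tuple  # preserved so findings can be joined against other feeds


def scan(packages: list, query_batch, fetch_detail, batch_size: int = 1000) -> list:
    """Resolve (name, ecosystem, version) tuples into per-package findings."""
    findings = []
    details = {}  # vuln_id -> full record, fetched at most once per scan
    for start in range(0, len(packages), batch_size):
        chunk = packages[start:start + batch_size]
        results = query_batch(chunk)
        for (name, ecosystem, version), result in zip(chunk, results):
            for hit in result.get("vulns", []):
                vuln_id = hit["id"]
                if vuln_id not in details:
                    details[vuln_id] = fetch_detail(vuln_id)
                record = details[vuln_id]
                if record.get("withdrawn"):
                    continue  # skip withdrawn advisories
                findings.append(
                    Finding(
                        name=name,
                        ecosystem=ecosystem,
                        version=version,
                        vuln_id=vuln_id,
                        aliases=tuple(record.get("aliases", [])),
                    )
                )
    return findings
```

Keeping aliases on every finding is what makes the later correlation against KEV or commercial feeds a simple join rather than a second lookup.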

The output of this pipeline is a clean per-package vulnerability list that integrates well with downstream prioritization. The thing OSV does not do, and was never designed to do, is rank findings by exploitability or reachability. That work happens in the layer above OSV, against the OSV-sourced data.

How Safeguard Helps

Safeguard uses OSV as one of several data sources alongside NVD, GitHub Advisory, vendor feeds, and commercial intelligence, and Griffin AI reconciles the inevitable disagreements between sources into a single defensible per-CVE record. Our reachability engine runs against the OSV-sourced findings to filter the long tail of unreachable issues, typically reducing the actionable count by an order of magnitude. SBOM ingestion handles the OSV ecosystem-version edge cases we describe above without consumer code, and policy gates can express OSV-aware rules like "block on any GHSA referenced in CISA KEV with reachability greater than zero." Zero-CVE base images are validated against the same OSV feed so the guarantee is reproducible from public data.
