
Safeguard Knowledge Graph Architecture

How Safeguard's knowledge graph unifies components, vulnerabilities, policies, and runtime evidence into a single queryable substrate that powers every product surface.

Shadab Khan
Security Engineer
8 min read

Every Safeguard product surface — vulnerability triage, policy evaluation, reachability, SBOM comparison, the Griffin agent, the dashboards — talks to the same underlying store. That store is a property graph with somewhere north of thirty billion edges for our largest tenant, and it is the single source of truth for everything the platform knows about a customer's software estate. This post is an engineering walkthrough of how we built it, why we chose the shape we did, and the operational lessons from running it at scale.

Why a graph and not a relational warehouse?

The short answer is that the questions we need to answer are graph questions. Which services depend, transitively, on a vulnerable version of libxml2? Which binaries in our fleet share a common build provenance with an artifact flagged by an auditor? Which policies, if enabled, would have blocked the last forty admission decisions? Each of these is a traversal, and writing them as repeated joins over normalized tables is both awkward and slow.

We also evaluated the alternative of storing a denormalized document per artifact. That works for isolated lookups but falls apart the moment you want to compare artifacts or reason about relationships between them. Every time a shared component changes, you have to update every denormalized document that references it. The graph lets us update one node and have every query see the new truth immediately.

That said, we are not dogmatic. About twenty percent of our query load is aggregate analytics that would be painful in a graph engine — "show me the top 50 vulnerable components across the fleet ranked by exposure" is a columnar query, not a traversal. We serve those from a ClickHouse materialization that is kept in sync with the graph through change-data-capture.
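To make the split concrete, here is a rough sketch of the kind of aggregate that goes to the columnar side rather than the graph. The table and column names (component_findings, exposure_score) and the endpoint are illustrative, not our actual ClickHouse schema.

from clickhouse_driver import Client

# Illustrative fleet-wide aggregate served from the ClickHouse materialization.
# Table, column, and host names are made up for this sketch, not the real schema.
client = Client("clickhouse.internal")  # assumed internal endpoint

TOP_VULNERABLE_COMPONENTS = """
    SELECT component_purl,
           uniqExact(artifact_id) AS affected_artifacts,
           max(exposure_score)    AS worst_exposure
    FROM component_findings
    WHERE tenant_id = %(tenant_id)s
    GROUP BY component_purl
    ORDER BY worst_exposure DESC
    LIMIT 50
"""

rows = client.execute(TOP_VULNERABLE_COMPONENTS, {"tenant_id": "example-tenant"})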

What is the node and edge model?

Every entity in Safeguard is a typed node. The core types are Artifact, Component, Version, Vulnerability, Advisory, Policy, Tenant, Identity, Build, Host, and Finding. Edges are also typed and carry a small number of structured properties. Below is a simplified slice of the schema:

  (Tenant) ──owns──▶ (Artifact) ──contains──▶ (Component) ──at──▶ (Version)
                         │                                          │
                         │                                          │
                     produced_by                                affected_by
                         │                                          │
                         ▼                                          ▼
                      (Build) ──signed_by──▶ (Identity)        (Vulnerability)
                                                                    │
                                                                published_by
                                                                    │
                                                                    ▼
                                                               (Advisory)

The Version node is the hub. Everything vulnerability-related hangs off it — advisories, KEV status, EPSS scores, reachability verdicts, patch availability, fix versions. Versions are identified canonically by pURL plus a content hash when available, which removes duplicates even when SBOMs disagree on the version string.
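As a rough illustration of that identity rule, the sketch below models a Version key as a pURL plus an optional content hash; the names and the precedence choice (the hash wins for deduplication when present) are assumptions, not the production node model.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VersionKey:
    purl: str                    # e.g. "pkg:deb/debian/libxml2@2.9.14+dfsg-1"
    content_hash: Optional[str]  # e.g. "sha256:ab12..." when the bytes are known

    def dedup_key(self) -> str:
        # When a hash is present it identifies the bytes directly, so two SBOMs
        # that disagree on the version string still collapse onto one node.
        return self.content_hash or self.purl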

We also carry first-class Finding nodes. A finding is the materialized decision that "this tenant, on this artifact, with this vulnerability, at this reachability verdict, under this policy, has this status." Findings exist as nodes rather than being computed at query time because they carry their own lifecycle: a triage owner, a due date, a remediation plan, a comment thread. Treating them as data (not as a view) is what lets our ticketing integrations and audit trails work without rebuilding state on every request.
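A rough sketch of what a Finding carries as first-class data; the field names are illustrative, not the production schema.

from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Finding:
    # The materialized decision itself.
    tenant_id: str
    artifact_id: str
    vulnerability_id: str        # e.g. "CVE-2024-3094"
    reachability: str            # e.g. "runtime", "static", "unreachable"
    policy_id: str
    status: str                  # e.g. "open", "accepted", "fixed"
    # Lifecycle fields: the reason findings are stored rather than recomputed per query.
    triage_owner: Optional[str] = None
    due_date: Optional[date] = None
    remediation_plan: Optional[str] = None
    comments: list[str] = field(default_factory=list)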

Which graph engine do you use under the hood?

We run a sharded property graph on top of JanusGraph with a ScyllaDB backend, but customers never talk directly to it. Every query goes through a Safeguard-specific query layer that speaks a subset of Gremlin plus our own query DSL called SGQL. SGQL compiles to Gremlin traversals but also carries tenant isolation, authorization predicates, and rate limits that we do not want to rely on the underlying engine to enforce.

A typical SGQL query reads like this:

from artifact a
where tenant = @ctx.tenant
  and a.environment = "production"
traverse a -[:contains*1..5]-> c:Component
where c has_cve("CVE-2024-3094")
  and reachability(a, c) in ["runtime", "static"]
return a.name, c.purl, path_from_entrypoint(a, c)
limit 100

The compiler rewrites this into a Gremlin traversal with the tenant predicate pushed as the first filter (so the planner uses the tenant index), hoists the reachability check into the graph walk to prune early, and attaches the per-query quota header. This is the pattern we settled on after trying pure Gremlin for about a year — without a compilation layer, engineers regularly wrote traversals that worked in isolation but did not compose with authorization or quota.
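As a rough illustration of the compiled shape (not the actual compiler output), the gremlinpython traversal below pushes the tenant predicate first and expresses the bounded containment walk; labels, property names, and edge directions are assumptions, and the reachability predicate is omitted for brevity.

from gremlin_python.process.graph_traversal import GraphTraversalSource, __

def vulnerable_components(g: GraphTraversalSource, tenant_id: str, cve: str):
    # Sketch of the compiled traversal; labels and properties are illustrative.
    return (
        g.V()
        # Tenant predicate first so the planner uses the tenant index.
        .has("Artifact", "tenant", tenant_id)
        .has("environment", "production")
        # Bounded walk mirroring -[:contains*1..5]->, emitting at every depth.
        .repeat(__.out("contains")).emit().times(5)
        .hasLabel("Component")
        # CVE filter as a sub-traversal so non-matching branches prune early.
        .where(__.out("at").out("affected_by").has("cve_id", cve))
        .limit(100)
        .values("purl")
    )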

How is the graph sharded across tenants?

Tenants are the primary sharding dimension. Each tenant's nodes are routed to a partition group based on a salted hash of the tenant ID, and cross-tenant edges are explicitly forbidden at the write layer. This gives us strong isolation — a pathological query from one tenant cannot saturate another tenant's shard — and it makes compliance posture straightforward because partition boundaries map one-to-one onto customer boundaries.
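A minimal sketch of that routing rule, assuming a fixed number of partition groups and a per-deployment salt; all names here are illustrative, not the production router.

import hashlib

PARTITION_GROUPS = 64            # assumed number of partition groups
SALT = b"per-deployment-salt"    # assumed per-deployment constant

def partition_for_tenant(tenant_id: str) -> int:
    # Salted hash of the tenant ID picks the partition group.
    digest = hashlib.sha256(SALT + tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PARTITION_GROUPS

def validate_edge(src_tenant: str, dst_tenant: str) -> None:
    # Cross-tenant edges are rejected at the write layer.
    if src_tenant != dst_tenant:
        raise ValueError("cross-tenant edge rejected")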

Within a tenant, we also shard by artifact family. Services, container images, firmware, model weights, and ML training datasets live on different partition groups. The reasoning is that the access patterns are different. Container findings see a lot of traversal load per scan, while ML model artifacts are read-heavy but written rarely. Keeping them separate means a burst on one side does not degrade the other.

We maintain global nodes for entities that must be shared: the NVD vulnerability catalog, the KEV list, the OSV advisories, the global framework descriptor registry. These are replicated to every partition group as read-only materializations. Writes happen in one place and propagate within seconds through a fan-out service.

How do you keep the graph fresh?

Freshness is the single hardest operational problem. Vulnerability data updates in near real time. A new CVE published against a popular library needs to be reflected on every affected version node, on every component node that points to one of those versions, and on every artifact that transitively contains those components. For a large tenant this can mean millions of nodes touched for a single advisory.

We use a three-tier invalidation model. Tier one is the hot path — advisories that match components already present in a tenant's graph. These fan out immediately through a queue whose work items are (tenant, component, new_advisory) tuples, and we target under 60 seconds from feed ingestion to finding materialization. Tier two is the warm path — components present in the global catalog but not yet seen in this tenant. These get lazy-materialized on the next scan. Tier three is the cold path — full recomputation of derived fields on a schedule, typically nightly, to catch drift.

The queue that drives tier one is idempotent and key-ordered by tenant, which matters because advisory bursts are common (Patch Tuesday dumps 80-200 advisories in an hour). Without key-ordering, two concurrent workers could race on the same finding node and produce a short window of inconsistent status. With it, bursts are absorbed without any special casing.
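A rough sketch of the tier-one work item and an idempotent handler under key-ordering; the store interface and naming are stand-ins, not the real pipeline.

from dataclasses import dataclass

@dataclass(frozen=True)
class InvalidationItem:
    tenant_id: str        # also the ordering key: one tenant's items run in order
    component_purl: str
    advisory_id: str

def ordering_key(item: InvalidationItem) -> str:
    return item.tenant_id

def handle(item: InvalidationItem, store) -> None:
    # Upsert keyed on (tenant, component, advisory): replaying the same item
    # converges to the same finding, so duplicate deliveries are harmless.
    finding_id = f"{item.tenant_id}:{item.component_purl}:{item.advisory_id}"
    store.upsert_finding(finding_id, source_advisory=item.advisory_id, status="open")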

How do you prevent a bad write from corrupting the graph?

Every write to the graph goes through a write intent log before it hits the engine. Writes are structured as idempotent transitions: "node X moves from state A to state B, with the precondition that the current state is A." If the precondition fails, the write is rejected and re-queued. This gives us optimistic concurrency without locking, plus a complete, replayable audit stream. Every change the platform makes to a customer's graph is recoverable from the log.
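A rough sketch of the precondition-checked transition, written as a plain check-then-set for readability; in practice the check has to be atomic with the write, and the log, store, and queue objects here are stand-ins.

from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    node_id: str
    from_state: str   # precondition: the node must currently be in this state
    to_state: str

def apply_transition(t: Transition, intent_log, store, retry_queue) -> bool:
    intent_log.append(t)                    # durable, replayable audit record first
    if store.get_state(t.node_id) != t.from_state:
        retry_queue.push(t)                 # precondition failed: reject and re-queue
        return False
    store.set_state(t.node_id, t.to_state)  # precondition held: apply the transition
    return True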

The write layer also enforces schema invariants that the underlying engine does not understand. Examples: Finding nodes must always have exactly one incoming triaged_by edge from a Tenant; Vulnerability nodes cannot be created without a source advisory reference; Version nodes cannot exist without a Component parent. These invariants are validated at write time and we run a background invariant checker that sweeps the graph looking for violations. When it finds one (which has happened roughly a dozen times in two years, usually due to an import path we did not expect), it files an engineering ticket and quarantines the affected subgraph.
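As an illustration of the kind of rule the checker sweeps for, the predicates below express two of the invariants; the graph accessor is hypothetical.

def finding_has_one_triager(graph, finding_id: str) -> bool:
    # A Finding must have exactly one incoming triaged_by edge from a Tenant.
    tenants = graph.in_neighbors(finding_id, edge_label="triaged_by", node_type="Tenant")
    return len(tenants) == 1

def version_has_component_parent(graph, version_id: str) -> bool:
    # A Version cannot exist without a Component parent.
    parents = graph.in_neighbors(version_id, edge_label="at", node_type="Component")
    return len(parents) >= 1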

How Safeguard.sh Helps

The knowledge graph is the spine of the Safeguard platform — every product decision, every policy evaluation, every agent query ultimately reads from it. Because everything lives in one typed, consistent substrate, we can surface answers that point tools cannot: cross-artifact blast radius, cross-build provenance, cross-vendor exposure. The graph is strongly isolated per tenant, operated with explicit freshness SLAs, and extensible through our SGQL interface for customers with unusual reporting needs. If you want to run federated queries over your own software estate at the graph level, the Safeguard query layer is the path we expose to do it.
