Most security platforms ship policies as a feature. Safeguard runs on policy — it is the substrate that decides what to flag, what to block, what to auto-remediate, and what to escalate. Every customer has hundreds of rules, and some enterprise deployments have thousands. Evaluating all of them against every artifact, for every finding, on every change, within a single-digit-second budget, is an engineering problem that took us several iterations to get right. This post walks through the evaluation engine that we settled on.
Why did we build on top of Rego rather than roll our own DSL?
Rego (OPA's query language) is a well-understood Datalog variant with a healthy ecosystem, and using it means customers who already have OPA policies can port them directly. We considered building our own DSL — it would have been easier to design but harder to adopt. For a platform that lives inside a customer's trust boundary, adoption cost matters more than expressive power. The engineers writing the policies are often platform or security engineers who already write Rego for Kyverno or Gatekeeper, and we wanted to meet them where they are.
That said, we do not run stock OPA. The vanilla OPA evaluator is designed for simple key-value input documents; our inputs are dense graph projections with thousands of nodes per artifact, and stock OPA struggles with that shape. We kept the language but replaced the evaluator.
What is the architecture of the engine?
The engine has three components: a compiler that turns Rego rules into an optimized plan, an evaluator that executes the plan, and a cache layer that memoizes intermediate results. A query flows through them as follows:
Rego Source     Graph Projection     Tenant Context
     │                  │                   │
     ▼                  ▼                   ▼
┌─────────────────────────────────────────────────┐
│                 Policy Compiler                 │
│    (rules → typed plan + index requirements)    │
└───────────────────────┬─────────────────────────┘
                        │ compiled plan
                        ▼
┌─────────────────────────────────────────────────┐
│                    Evaluator                    │
│   (walks plan, issues indexed graph lookups)    │
└────────┬────────────────────────┬───────────────┘
         │                        │
         │                        ▼
         │             ┌─────────────────────┐
         │             │   Decision Cache    │
         │             │ (per-plan per-input)│
         │             └─────────────────────┘
         ▼
┌─────────────────────────────────────────────────┐
│                  Decision Log                   │
│     (structured trail of rule fires + data)     │
└─────────────────────────────────────────────────┘
The compiler runs once per policy version. Its output is a typed plan that names the exact graph indexes each rule requires. This matters because policies are often shaped like "for every component C of type X, check that C.license is in L and C.maintenance_score >= M." Without index hints the evaluator has to scan; with them it can do a direct point lookup.
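In Rego, that shape looks something like the sketch below. The data.allowed_licenses and data.min_maintenance_score references are illustrative, not names the engine requires; the point is that the node-type filter and the attribute checks are exactly what the compiler can turn into index requirements.

package safeguard.components

import future.keywords.in

# Illustrative shape: a node-type filter plus attribute checks, which the
# compiler can serve from point lookups instead of a scan.
deny[msg] {
    some component in input.artifact.components
    component.type == "library"
    not component.license in data.allowed_licenses
    msg := sprintf("component %s uses unapproved license %s",
        [component.purl, component.license])
}

deny[msg] {
    some component in input.artifact.components
    component.type == "library"
    component.maintenance_score < data.min_maintenance_score
    msg := sprintf("component %s is below the maintenance threshold", [component.purl])
}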
How does the compiler optimize rules?
The compiler does four passes. Pass one is type inference — every variable in the policy is given a graph-aware type so the compiler knows whether c.license is a string, a set, or null. Pass two is predicate pushdown — filters on graph nodes are pushed into the graph query rather than evaluated in the rule body. Pass three is common subexpression elimination — if ten rules all compute "the set of components in artifact A," that set is computed once and referenced. Pass four is dependency analysis — the compiler figures out which rules depend on which graph sub-projections so the cache layer can invalidate precisely.
To give a concrete example, a rule like this:
package safeguard.license

import future.keywords.in

deny[msg] {
    some component in input.artifact.components
    component.license in data.blocked_licenses
    not component.license in input.tenant.exceptions
    msg := sprintf("component %s uses blocked license %s",
        [component.purl, component.license])
}
This rule compiles into a plan that pulls the license set via a single graph query rather than iterating the component list, intersects it with the blocked set in memory, and materializes msg strings only for the components that fail. On a typical artifact with 800 components and 30 license rules, the compiled plan is roughly 40x faster than stock OPA evaluation.
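For intuition, the compiled plan for that rule can be pictured roughly like this; the format is illustrative, not the engine's internal representation:

{
  "plan": "safeguard.license/deny",
  "steps": [
    { "op": "index_lookup",   "index": "component_license", "scope": "input.artifact" },
    { "op": "set_intersect",  "with": "data.blocked_licenses" },
    { "op": "set_difference", "with": "input.tenant.exceptions" },
    { "op": "materialize",    "template": "component %s uses blocked license %s" }
  ]
}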
How do you keep latency predictable at scale?
Three techniques. The first is decision caching. Every compiled plan has a stable fingerprint; every input has a content hash. We cache decisions keyed by (plan_fingerprint, input_hash) with a bounded TTL. Cache hits serve in under 5ms, and across our fleet roughly 60-70 percent of evaluations hit the cache, so cached results dominate the latency distribution.
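Conceptually a cache entry is just the decision payload under that composite key; the layout below is illustrative and reuses the fingerprint and hash from the decision log example later in this post:

{
  "key": "plan_8af3..|sha256:41b7..",
  "outcome": "deny",
  "decision_id": "dec_2026_02_07_a41c9",
  "expires_at": "2026-02-07T15:09:11Z"
}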
The second is sub-plan memoization within a single evaluation. If a policy set has ten rules that all need "the reachability verdict for finding F," that lookup happens once and is reused. This is where the compiler's common subexpression elimination pays off.
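As a sketch, the two rules below both reference the same reachability verdict; after common subexpression elimination the evaluator resolves that lookup once per evaluation (the field names are illustrative):

package safeguard.findings

# Both rule bodies reference the same reachability verdict, so the shared
# lookup is hoisted and evaluated once.
deny[msg] {
    input.finding.reachability.verdict == "reachable"
    input.finding.severity == "critical"
    msg := sprintf("finding %s is reachable and critical", [input.finding.id])
}

escalate[msg] {
    input.finding.reachability.verdict == "reachable"
    input.finding.severity == "high"
    msg := sprintf("finding %s is reachable and high severity", [input.finding.id])
}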
The third is eager pre-projection. Before a policy evaluates, we project exactly the subgraph the plan will touch into a dense in-memory representation. This avoids repeated graph engine round trips during evaluation. The projection itself is cached per-artifact-per-policy-version, which is effective because the policies in a tenant change rarely while artifacts change often.
Together these three bring p50 evaluation to around 40ms and p99 to around 250ms for a typical enterprise policy set of 400-600 rules evaluated against a medium-sized artifact. For the few pathological cases where a rule asks something genuinely expensive (for example, a transitive provenance chain check across a 20k-node subgraph), we precompute the answer asynchronously on each ingestion and the policy reads a materialized view.
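From the policy's point of view, reading the materialized view is an ordinary attribute check rather than a graph traversal; the field names here are illustrative:

package safeguard.provenance

# The transitive chain check ran at ingestion time; the rule only reads the
# precomputed verdict.
deny[msg] {
    not input.artifact.materialized.provenance_chain_verified
    msg := sprintf("artifact %s lacks a verified provenance chain", [input.artifact.digest])
}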
What does the decision log look like?
Every evaluation emits a structured decision log that names the policy version, the input, every rule that was evaluated, every rule that fired, and the data that caused it to fire. The log is append-only and indexed by tenant and by decision ID so an auditor can pull the full trail for any blocked artifact.
A simplified entry looks like:
{
  "decision_id": "dec_2026_02_07_a41c9",
  "tenant_id": "tnt_acme",
  "policy_version": "pol-v17",
  "plan_fingerprint": "plan_8af3..",
  "input_hash": "sha256:41b7..",
  "outcome": "deny",
  "rules_evaluated": 612,
  "rules_fired": 3,
  "fired": [
    {
      "package": "safeguard.license",
      "rule": "deny",
      "msg": "component pkg:npm/left-pad@1.3.0 uses blocked license GPL-3.0",
      "witness": {
        "component_node": "cmp_npm_left_pad_130",
        "license": "GPL-3.0"
      }
    }
  ],
  "elapsed_ms": 52,
  "cache_hit": false,
  "timestamp": "2026-02-07T14:09:11Z"
}
The witness field is the thing auditors actually ask for. It points to the specific graph nodes that triggered the rule, so a reviewer can click through to the SBOM, the scan, the source commit, or whatever evidence is needed. Decision logs are retained according to each customer's retention policy (typically 7 years for federal tenants).
How do customers actually author and test policies?
Policies live in git repositories that the customer controls, and Safeguard pulls them on a webhook or schedule. Every pushed version goes through a compilation step that catches type errors, missing data references, and ambiguous rules before the policy ever touches a production artifact. Customers also get a policy playground where they can run candidate policies against real (or synthetic) inputs and see the decision log. This is a big deal for adoption — engineers resist policy systems that feel opaque, and the playground makes the behavior concrete.
We also ship a shadow mode. A new policy can be deployed in shadow for a configurable period, during which every evaluation runs both the new and old policy, compares outcomes, and surfaces divergence. Teams use this to validate migrations, to prove a rule change does not regress existing decisions, and to graduate policies from non-blocking to blocking. Without shadow mode, customers are reluctant to evolve their policy posture because the blast radius of a bad rule is too high.
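A divergence surfaced by shadow mode can be pictured along these lines; the shape is illustrative rather than the exact schema:

{
  "decision_id": "dec_2026_02_07_c3f82",
  "active_policy": "pol-v17",
  "shadow_policy": "pol-v18",
  "active_outcome": "allow",
  "shadow_outcome": "deny",
  "diverged": true,
  "shadow_fired": ["safeguard.license/deny"]
}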
How do you handle policy conflicts and precedence?
Policies can compose in three ways: require-all, require-any, and explicit-precedence. The default is require-all — every policy must allow the artifact. require-any is used for alternative pathways (for example, "signed by key A OR signed by key B"). explicit-precedence is used when a tenant and a parent organization both ship policies and one should win; the winner is declared in the tenant config rather than inferred from rule order.
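The require-any case from that example can be written as two rules contributing to the same decision; if either body holds, the artifact is allowed (the key references are illustrative):

package safeguard.signing

# require-any: either trusted release key is sufficient.
allow {
    input.artifact.signature.key_id == data.trusted_keys.release_key_a
}

allow {
    input.artifact.signature.key_id == data.trusted_keys.release_key_b
}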
When policies conflict — one says allow, another says deny — the engine does not guess. It emits a conflict outcome with the full set of voting rules, and the admission path surfaces this to the caller. The caller's policy then decides what to do; the default is to treat conflict as deny, but some tenants prefer conflict to route to human review.
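A conflict outcome carries the votes so the caller can resolve them; the shape below is illustrative:

{
  "outcome": "conflict",
  "votes": [
    { "package": "safeguard.org_baseline", "rule": "allow", "outcome": "allow" },
    { "package": "safeguard.license", "rule": "deny", "outcome": "deny" }
  ],
  "resolution": "default_deny"
}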
How Safeguard.sh Helps
The policy evaluation engine is how Safeguard gives every customer a precise, auditable, and fast control plane over their supply chain. Whether the decision point is admission into a cluster, ingestion into the SBOM store, or triage of a new finding, the same engine runs the same Rego, emits the same decision log, and respects the same tenant boundaries. Customers write their policy once and have it enforced everywhere consistently. If your organization is already invested in OPA-style policy and needs to scale it across thousands of artifacts with auditable decisions, the Safeguard policy engine is the native home for those rules.