On May 5, 2026, the Center for AI Standards and Innovation (CAISI) at the Department of Commerce's National Institute of Standards and Technology announced pre-deployment evaluation agreements with Google DeepMind, Microsoft, and xAI. The three join Anthropic and OpenAI, which signed comparable agreements roughly two years earlier when the body was still called the US AI Safety Institute. With these additions, five frontier labs now submit models for government evaluation before release, and CAISI reports having completed more than 40 such evaluations to date, including on state-of-the-art models that remain unreleased.
For a security audience, this is easy to file under "policy news" and ignore. That would be a mistake. Pre-deployment evaluation is becoming a supply-chain control point for the models that increasingly sit inside enterprise software. The evaluations cover cybersecurity, biosecurity, and chemical-weapons risk. Some are conducted in classified environments, with evaluators from across government contributing through an interagency group called the TRAINS Taskforce. They produce the kind of capability and safety evidence that downstream consumers have, until now, had to take on faith from vendor safety cards. The question this raises for anyone integrating frontier models is concrete: what evidence of evaluation can you actually obtain, and how does it fit into your own AI governance and risk acceptance?
This post treats the CAISI expansion as what it is from a defender's seat: the emergence of an independent evaluation tier in the model supply chain. We look at what was announced, what the evaluations do and do not cover, and how to translate "the government tested this model" into something an AppSec lead or CISO can actually use.
TL;DR
- On May 5, 2026, NIST's CAISI announced frontier AI national-security testing agreements with Google DeepMind, Microsoft, and xAI.
- They join Anthropic and OpenAI (which signed similar agreements about two years prior), bringing the program to five participating frontier labs.
- CAISI conducts pre-deployment evaluations covering cybersecurity, biosecurity, and chemical-weapons risk; some run in classified environments via the interagency TRAINS Taskforce.
- CAISI reports completing more than 40 evaluations to date, including on unreleased state-of-the-art models.
- The agreements are voluntary and advisory, not a regulatory mandate, but they build the institutional capacity that a future mandate could rely on.
- For consumers, this is the start of an independent evaluation tier in the model supply chain. The practical gap: evaluation results are not yet a standardized, machine-readable artifact you can pull into an AI-BOM or a policy gate.
What happened
CAISI is the successor to the US AI Safety Institute, housed at NIST within the Department of Commerce. On May 5, 2026, it announced new agreements with Google DeepMind, Microsoft, and xAI to conduct pre-deployment evaluations and targeted research on frontier models. Anthropic and OpenAI had signed equivalent agreements roughly two years earlier under the prior administration, when the body operated as the AI Safety Institute. The May 2026 additions bring the participant count to five frontier labs.
The reported scope and mechanics:
- What is evaluated: frontier model capabilities and risks, with a stated focus on cybersecurity, biosecurity, and chemical-weapons concerns. CAISI also conducts "targeted research" to advance evaluation methods.
- When: pre-deployment, meaning before the model is publicly released. CAISI says it has evaluated state-of-the-art models that remain unreleased.
- Who runs it: CAISI, with evaluators from across government participating and providing feedback through the CAISI-convened TRAINS Taskforce, an interagency group focused on AI national-security concerns. Some evaluations are conducted in classified environments.
- Scale so far: more than 40 evaluations completed to date.
- Legal posture: the agreements support testing in classified environments and were drafted with flexibility to respond to rapid AI advances. They are voluntary collaborations, not compliance mandates.
The simplest way to read the announcement is that the United States now has a standing, government-run, pre-release evaluation function covering the major Western frontier labs, and it is expanding rather than contracting.
Why this matters for the model supply chain
Enterprises consume frontier models the way they consume any other dependency: through APIs, through fine-tuning, embedded in vendor products, and increasingly as the reasoning core of internal applications. The difference from a normal dependency is that you cannot read the source. You get weights or an API endpoint, a model card, and the vendor's own safety claims. There has been no independent party attesting to what the model can and cannot do at the capability level that matters for security: can it meaningfully uplift a cyber-offensive task, can it assist with bio or chemical harm, and how robust are its safeguards?
Pre-deployment evaluation by an independent body changes that picture in principle. It introduces a tier in the supply chain analogous to a third-party audit: someone other than the producer has looked at the artifact before it ships. For a CISO building an AI risk program, "this model class was evaluated pre-deployment by CAISI, including in a classified setting, for cyber, biological, and chemical uplift" is a materially different input from "the vendor's safety card says it is safe."
The caveat, and it is a large one, is the gap between the existence of evaluations and the usability of their results. The announcement establishes that evaluations happen. It does not establish a standardized, public, machine-readable evaluation artifact that a downstream consumer can pull into an AI-BOM or feed into a policy gate. Classified evaluation outputs, by definition, do not flow to enterprise risk teams. So the control exists upstream, but the evidence is not yet a supply-chain artifact in the way an SBOM or a SLSA attestation is.
How an evaluation tier fits a defender's model
If you map this onto familiar supply-chain concepts, pre-deployment evaluation occupies the slot that independent attestation occupies for code. The useful mental model:
Code supply chain            Model supply chain (emerging)
-----------------            -----------------------------
source + build               training + fine-tuning
SBOM                         AI-BOM / model card
SLSA provenance              model signing (OpenSSF OMS)
3rd-party audit / pentest    CAISI pre-deployment evaluation
CVE / advisory feed          model risk + capability disclosures
policy gate on the above     policy gate on the above (immature)
The rows that are mature on the code side (SBOM, provenance, audit, policy gating) are at varying stages on the model side. Model signing has a v1.0 specification and library from the OpenSSF. AI-BOM and model cards exist but are inconsistently populated. Independent evaluation, as of May 2026, is operational at CAISI but its outputs are not yet a standardized consumable. The strategic point for a security program is to build the AI governance plumbing now so that when evaluation results, signatures, and AI-BOMs do become reliably available, you can ingest and gate on them rather than scrambling to retrofit.
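One way to start on that plumbing is to measure how much of each signal you can actually obtain today. The sketch below is a minimal Python illustration, not a reference to any real tool or schema: the signal names and the dictionary layout of an inventory entry are assumptions invented for this example.

```python
from collections import Counter

# Hypothetical signal names, loosely mirroring the model-side rows above.
SIGNALS = ("model_card", "signature", "independent_evaluation")


def signal_coverage(inventory: list[dict]) -> dict[str, float]:
    """Return, per signal, the fraction of inventoried models that have it.

    Each inventory entry is assumed to look like:
        {"name": "...", "signals": {"model_card": True, "signature": False}}
    A missing or falsy signal counts as "not obtained".
    """
    if not inventory:
        return {signal: 0.0 for signal in SIGNALS}

    counts = Counter()
    for entry in inventory:
        present = entry.get("signals", {})
        for signal in SIGNALS:
            if present.get(signal):
                counts[signal] += 1

    return {signal: counts[signal] / len(inventory) for signal in SIGNALS}


# Example inventory with placeholder model names.
example_inventory = [
    {"name": "vendor-model-a", "signals": {"model_card": True, "signature": True}},
    {"name": "vendor-model-b", "signals": {"model_card": True}},
]
coverage = signal_coverage(example_inventory)
```

A report like this is mostly useful as a trend line: low coverage for independent evaluation is expected today, and the number only becomes actionable once vendors can share that status in a consistent form.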
What to do Monday morning
- Add "independent evaluation status" to your model intake questionnaire. When onboarding a frontier model or a vendor product built on one, ask whether the underlying model class went through pre-deployment evaluation by CAISI or an equivalent body, and request whatever summary the vendor can share. Treat the answer as one input, not a guarantee.
- Do not treat evaluation as a substitute for your own controls. A model evaluated for cyber, biological, and chemical uplift can still be jailbroken in your application context and can still leak data through your integration. Government evaluation addresses catastrophic-capability risk, not your specific deployment risk.
- Build the AI-BOM now. Inventory every model your organization consumes, with its provider, version, and the provenance signals you can currently obtain (model card, signature, evaluation status). The plumbing is the hard part; populate it before the richer artifacts arrive.
- Stand up a model policy gate. Define the rules you would enforce if you had the data: require a signed model, require a populated model card, prefer evaluated model classes for high-stakes use. Even with partial inputs today, the gate gives you a place to enforce what you can verify.
- Track the regulatory trajectory. These agreements are voluntary today. The same institutional capacity is what a future mandate would be built on. Map which of your AI use cases would be in scope if pre-deployment evaluation became a requirement.
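The AI-BOM and policy-gate items above can be prototyped in a few lines. The following Python sketch is illustrative only: the `ModelRecord` fields, the `gate` function, and the three decision values are assumptions made for this example, not a standard AI-BOM schema or any product's API. It encodes the rules named in the list: require a verified signature and a populated model card, and prefer (rather than require) independently evaluated models for high-stakes use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRecord:
    """One AI-BOM entry. Field names are illustrative assumptions."""
    name: str
    provider: str
    version: str
    has_model_card: bool = False
    signature_verified: bool = False
    # None means "unknown", which is the common case today.
    independently_evaluated: Optional[bool] = None


def gate(record: ModelRecord, high_stakes: bool) -> tuple[str, list[str]]:
    """Return (decision, reasons); decision is 'allow', 'warn', or 'deny'."""
    denials: list[str] = []
    warnings: list[str] = []

    # Hard requirements: these can be verified locally today.
    if not record.signature_verified:
        denials.append("no verified signature")
    if not record.has_model_card:
        denials.append("no populated model card")

    # Soft preference: unknown or negative evaluation status only warns.
    if high_stakes and record.independently_evaluated is not True:
        warnings.append("independent evaluation status not confirmed")

    if denials:
        return "deny", denials + warnings
    if warnings:
        return "warn", warnings
    return "allow", []


# Placeholder entry; the names are invented for the example.
record = ModelRecord(
    name="example-frontier-model",
    provider="ExampleLab",
    version="2026-05",
    has_model_card=True,
    signature_verified=True,
)
decision, reasons = gate(record, high_stakes=True)
```

Keeping evaluation status as a three-state value (true, false, unknown) is a deliberate choice in this sketch: it lets the gate distinguish "the vendor could not tell us" from "the model was not evaluated" once that distinction becomes obtainable.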
Why this keeps happening, the structural view
The recurring pattern in AI supply-chain security is that capability outruns assurance. Models ship and get embedded in products faster than any independent assurance mechanism can attest to them, so the trust model defaults to the producer's own word. That is the same failure mode that made unverified open-source dependencies a systemic risk: consumption is frictionless, verification is not, so verification gets skipped.
CAISI's expansion is an attempt to insert assurance into that gap at the most consequential layer, the frontier models themselves. The structural limitation is that the assurance is being produced in a form (government evaluations, partly classified) that does not naturally flow downstream to the enterprises bearing the deployment risk. Until evaluation results, model signatures, and AI-BOMs converge into a standardized, verifiable, machine-readable artifact, consumers will keep operating on vendor assertions while a richer body of evidence exists upstream but out of reach. Closing that loop is the multi-year project.
The structural fix
Safeguard's role here is on the consumer side of the supply chain: turning whatever assurance signals exist into enforceable policy. The AI-BOM inventories every model in your estate with its provider, version, and available provenance, so when signatures or evaluation summaries become obtainable you have a place to record and reason about them. Policy enforcement lets you express rules such as "prefer evaluated model classes for high-stakes use" and "require a verifiable signature," and SLSA provenance verification handles the chain of custody for internally produced fine-tunes. This does not duplicate what CAISI does: CAISI evaluates the upstream artifact, and Safeguard helps you govern consumption of it, which is the part the enterprise actually controls. For teams formalizing AI risk programs, the AI governance and supply-chain compliance workflows tie these signals into auditable evidence.
What we know we don't know
- The detailed methodology and per-model results of CAISI's evaluations are not public; some are classified, and the announcement summarizes scope rather than findings.
- The "more than 40 evaluations" figure is CAISI's reported count; the breakdown by lab, model, and risk category is not disclosed.
- It is unclear whether or how any evaluation output will become available to downstream enterprise consumers in a standardized form.
- The agreements are voluntary; whether and when pre-deployment evaluation becomes mandatory, and for which model classes, is an open policy question as of May 2026.
References
- NIST CAISI bulletin: Agreements Regarding Frontier AI National Security Testing With Google DeepMind, Microsoft and xAI
- HPCwire: NIST's CAISI Announces New Frontier AI Testing Agreements with Google DeepMind, Microsoft, xAI
- Nextgov/FCW: Commerce AI center will evaluate Google DeepMind, Microsoft and xAI models
- CIO: US government agency to safety test frontier AI models before release
- Safeguard: AI-BOM concept
- Safeguard: Policy enforcement concept
- Safeguard: SLSA provenance concept
- Safeguard: AI governance use case
- Safeguard: Supply-chain compliance use case