The first time a regulator asks "what is in your model?" most engineering teams realise their SBOM does not answer it. A modern production AI feature is not a single artefact. It is a base model pulled from a registry, a tokenizer with its own version drift, a set of fine-tuning datasets each with its own licence and consent status, a quantisation pass, an inference runtime like vllm or tgi, a serving stack, and a handful of guardrail models layered on top. Traditional SBOM tooling captures the Python wheels in the container and stops. The interesting risk lives in the model artefacts and their provenance, and that surface is invisible to a CycloneDX 1.4 emitter that only walks pip freeze. AI-BOM is the answer: a structured, signed, machine-readable record of the model supply chain that sits next to your SBOM and answers the questions auditors, customers, and incident responders will ask in 2026. This post explains what an AI-BOM is, why a traditional SBOM is not enough, the minimum fields that hold up under scrutiny, and the operational patterns that keep the data fresh.
Why SBOM Alone Misses The Model Layer
A typical ML service container has 200-400 Python dependencies and weighs in at 4-8 GB. A standard SBOM captures all of those, and zero of the things that actually matter for AI governance. The base model weights, often 10-70 GB, are mounted from a volume or pulled at startup from a registry; the SBOM never sees them. The training data that shaped the model is upstream of the build entirely. The evaluation suite that demonstrates the model meets safety thresholds runs in a separate pipeline. The guardrail prompts that wrap the model live in a config repo.
When xz-utils 5.6 had its backdoor in 2024, SBOM consumers could answer "do we ship it?" in minutes. When a popular open-weights model was found to have been fine-tuned on a copyrighted corpus in late 2025, almost nobody could answer the equivalent question, "do we ship a derivative?", in less than a week. The data was not in the SBOM, and it did not exist in structured form anywhere else.
AI-BOM closes that gap by treating models, datasets, prompts, and evaluation suites as first-class components with their own identifiers, versions, suppliers, hashes, licences, and relationships.
The Minimum Viable AI-BOM
A useful AI-BOM does not need to capture every gradient. It needs to answer five questions for every model surface you ship:
- What is the model's identity and where did it come from?
- What was it trained or fine-tuned on, and under what licence?
- How was it evaluated, by whom, and when?
- What runtime serves it, and what is the integrity story?
- Who is responsible if it breaks?
CycloneDX 1.6 covers this with its `machine-learning-model` and `data` component types. Twelve fields per model is a reasonable floor (a minimal sketch follows the list):
- purl or model URI (for example `pkg:huggingface/meta-llama/Llama-3.1-8B-Instruct@<sha256>`)
- Base model identifier and version
- Fine-tune lineage (parent model and training dataset references)
- Tokenizer identifier and version
- Quantisation method and bit-width
- Training dataset list with licence, consent status, and PII assessment
- Evaluation suite identifier, version, and result hash
- Inference runtime (`vllm`, `tgi`, `triton`, `onnxruntime`) with version
- Hardware class (for example `H100-80GB`, `MI300X`)
- Weight hash (`sha256` of the safetensors or GGUF artefact)
- Supplier and licence
- Date of last revalidation
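To make that floor concrete, here is a minimal sketch of one model component serialised from Python. The top-level keys (`type`, `name`, `purl`, `hashes`, `licenses`) follow CycloneDX component conventions; the `properties` entries are illustrative names for the fields above, not an official taxonomy, and every value is a placeholder to be filled from your pipeline.

```python
import json

# A minimal sketch of one AI-BOM model component. The "properties" names
# (the "aibom:" prefix) are illustrative, not official CycloneDX taxonomy.
model_component = {
    "type": "machine-learning-model",
    "name": "Llama-3.1-8B-Instruct",
    "version": "<model revision>",
    "purl": "pkg:huggingface/meta-llama/Llama-3.1-8B-Instruct@<sha256>",
    "hashes": [{"alg": "SHA-256", "content": "<sha256 of the safetensors artefact>"}],
    "licenses": [{"license": {"name": "Llama 3.1 Community License"}}],
    "properties": [
        {"name": "aibom:base-model", "value": "meta-llama/Llama-3.1-8B-Instruct"},
        {"name": "aibom:fine-tune-parent", "value": "<parent purl or none>"},
        {"name": "aibom:tokenizer", "value": "<tokenizer id and version>"},
        {"name": "aibom:quantisation", "value": "AWQ, 4-bit"},
        {"name": "aibom:training-datasets", "value": "<dataset snapshot hashes>"},
        {"name": "aibom:evaluation-suite", "value": "<suite id, version, result hash>"},
        {"name": "aibom:inference-runtime", "value": "vllm 0.5.3"},
        {"name": "aibom:hardware-class", "value": "H100-80GB"},
        {"name": "aibom:supplier", "value": "<organisation responsible for the artefact>"},
        {"name": "aibom:last-revalidation", "value": "<date of last re-run>"},
    ],
}

print(json.dumps(
    {"bomFormat": "CycloneDX", "specVersion": "1.6", "components": [model_component]},
    indent=2,
))
```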
If you cannot fill a field, mark it unknown with a remediation date. Auditors react far worse to invented metadata than to honest gaps.
Datasets Are The Hardest Field
Training data provenance is where most AI-BOM implementations either succeed or quietly collapse. Modern instruction-tuned models often have 10-40 fine-tuning datasets. Each one needs licence (cc-by-4.0, mit, proprietary), consent status (whether subjects opted in), PII risk class (none, low, high), and a hash of the snapshot that was actually used. Datasets drift; "the Common Crawl May 2024 snapshot" and "the Common Crawl August 2024 snapshot" are different supply chain inputs.
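A minimal sketch of the per-dataset record those fields imply, with field names that are illustrative rather than prescriptive:

```python
from dataclasses import dataclass
from enum import Enum

class PIIRisk(Enum):
    NONE = "none"
    LOW = "low"
    HIGH = "high"

@dataclass(frozen=True)
class DatasetRecord:
    """One fine-tuning dataset treated as a supply chain input."""
    name: str              # e.g. "support-tickets-2025q3"
    snapshot_sha256: str   # hash of the exact snapshot trained on, not "the dataset"
    licence: str           # "cc-by-4.0", "mit", "proprietary", ...
    consent_status: str    # whether subjects opted in: "opt-in", "opt-out", "unknown"
    pii_risk: PIIRisk      # none / low / high
```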
For derivative models, lineage is recursive. If you fine-tuned Llama-3.1-8B on 30k internal customer support tickets, your AI-BOM has to reference both the upstream model's lineage and your own fine-tune dataset hash. EU AI Act Article 10 makes this explicit for high-risk systems, and US sectoral regulators are following. The Article 10 expectations are detailed in our companion post on AI-BOM and EU AI Act data governance.
A practical pattern: keep dataset records in a content-addressable store keyed by hash, and reference the hash from the AI-BOM rather than embedding the dataset metadata. This keeps the AI-BOM small and lets you answer "every model trained on dataset D" in a single index lookup.
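A sketch of that pattern, assuming hypothetical store and index structures maintained at AI-BOM emission time:

```python
from collections import defaultdict

# Content-addressable store: full dataset metadata keyed by snapshot hash.
# The AI-BOM itself carries only the hash.
dataset_store: dict[str, dict] = {}

# Reverse index: snapshot hash -> identifiers of models trained or fine-tuned
# on it, including lineage inherited from upstream base models.
models_by_dataset: defaultdict[str, set[str]] = defaultdict(set)

def register_model(model_id: str, dataset_hashes: list[str]) -> None:
    """Record a model's training inputs when its AI-BOM is emitted."""
    for h in dataset_hashes:
        models_by_dataset[h].add(model_id)

def models_trained_on(dataset_hash: str) -> set[str]:
    # "Every model trained on dataset D" becomes a single index lookup.
    return models_by_dataset.get(dataset_hash, set())
```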
Evaluations As Supply Chain Evidence
An AI-BOM without evaluation references is a parts list without a quality certificate. For each production model, capture the identifier of the evaluation suite (for example an internal red-team battery, plus a third-party benchmark like MLCommons AILuminate), the version of the suite at run time, the result hash, the date of the run, and the pass/fail status against your published thresholds.
Two practical rules. First, evaluations are time-bound. A model that passed in January is not certified for August unless you re-ran. Set a maximum staleness, typically 30-90 days for production-facing models, and treat anything older as expired in the AI-BOM. Second, evaluation results should be signed by the team that ran them, not by the team that ships the model. Signature separation is what makes the evidence credible to an external auditor.
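A sketch of how the two rules might be encoded, with the 90-day ceiling and the team-name comparison as illustrative inputs rather than a prescribed policy:

```python
from datetime import date, timedelta

MAX_EVAL_AGE = timedelta(days=90)  # staleness ceiling; tighten towards 30 for higher-risk surfaces

def evaluation_status(run_date: date, passed: bool, signer_team: str,
                      shipping_team: str, today: date | None = None) -> str:
    """Classify an evaluation record before it goes into the AI-BOM."""
    today = today or date.today()
    if signer_team == shipping_team:
        # Signature separation: evidence signed by the shipping team is not credible.
        return "invalid: evaluation must be signed by a team other than the shipping team"
    if today - run_date > MAX_EVAL_AGE:
        return "expired: re-run required before the next emission"
    return "pass" if passed else "fail"
```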
Runtime Integrity And The Inference Stack
The same model weights served by vllm 0.5.3 on H100 and by a custom CUDA kernel on consumer hardware are not the same supply chain. Quantisation choices, kernel implementations, and GPU driver versions all change behaviour. Your AI-BOM should record the inference runtime version, the quantisation pass, the GPU driver and CUDA version, and ideally the hash of the runtime container image.
Sign the model weights at registry-push time, and verify the signature at load time inside the inference server. If your serving fleet does not refuse to load an unsigned or mis-signed weights file, the AI-BOM is descriptive rather than enforcing, and adversaries know it.
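As a sketch of the load-time check, assuming a detached Ed25519 signature over the weight digest was produced at registry-push time; your actual signing chain, Sigstore-based or otherwise, will differ in detail:

```python
import hashlib
from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_weights_or_refuse(weights_path: Path, signature: bytes,
                             registry_public_key: bytes) -> str:
    """Hash the weights artefact, verify the detached signature made at
    registry-push time, and refuse to serve on any mismatch."""
    digest = hashlib.sha256()
    with weights_path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream: weights are tens of GB
            digest.update(chunk)
    try:
        Ed25519PublicKey.from_public_bytes(registry_public_key).verify(
            signature, digest.digest()
        )
    except InvalidSignature:
        raise RuntimeError(f"refusing to load {weights_path}: signature verification failed")
    return digest.hexdigest()  # record alongside the load event, and cross-check the AI-BOM
```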
Operational Hygiene
AI-BOM is only useful if it stays current. Three rules keep it honest. Generate AI-BOM at the same point in the pipeline that produces the model artefact, never after the fact. Sign and timestamp every emission. Re-emit on any change to weights, runtime, or evaluation status, not on a fixed schedule. Programmes that emit AI-BOM weekly drift quickly; programmes that emit on change stay accurate.
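One way to make "emit on change" concrete is to fingerprint the facts that force a re-emission and compare against the last emitted record; the function names here are illustrative:

```python
import hashlib
import json

def aibom_fingerprint(weights_sha256: str, runtime: str, eval_result_hash: str) -> str:
    """Fingerprint of the inputs whose change forces a new AI-BOM emission."""
    payload = json.dumps(
        {"weights": weights_sha256, "runtime": runtime, "eval": eval_result_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def should_reemit(current_fingerprint: str, last_emitted_fingerprint: str) -> bool:
    # Re-emit on any change to weights, runtime, or evaluation status,
    # not on a fixed schedule.
    return current_fingerprint != last_emitted_fingerprint
```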
How Safeguard Helps
Safeguard treats AI-BOM as a first-class peer to SBOM. The platform ingests CycloneDX 1.6 machine-learning-model and data components from CI, model registries, and serving stacks, normalises model identifiers across huggingface, ollama, and private registries, and indexes dataset lineage so a single query answers which products use a model derived from a flagged dataset. VEX statements extend to model-level vulnerabilities, letting teams suppress non-applicable findings on guardrail or sandboxed surfaces. Signed attestations cover weights, evaluation results, and runtime images using the same Sigstore-compatible chain Safeguard applies to traditional SBOM artefacts. The outcome is a single defensible record of the entire software-and-model supply chain that holds up to regulators, customers, and incident response in equal measure.