The software supply chain around a text-only language model is complicated enough. Training data, model weights, tokenizers, fine-tunes, adapters, serving frameworks, and the integrations that consume model output all need to be tracked and defended. Multi-modal models add to this picture in ways that feel like a straightforward extension but actually introduce categories of risk that text-only pipelines never had to consider. An image the model processes can contain steganographic instructions. An audio file can encode a prompt injection. A video frame can carry visual prompt attacks that are invisible to human inspection but legible to the model.
This post walks through the supply chain considerations that multi-modality brings, focused on the ones we have observed in production deployments rather than on theoretical possibilities. The practical question is how to extend the security practices that have matured for text-based AI to the wider set of input types that modern multi-modal systems accept.
Why Multi-Modality Is Different
Text input has properties that make it relatively tractable for security review. It is human-readable. It can be logged, searched, and compared. Suspicious content often looks suspicious. Tooling exists for detecting many classes of injection.
Images, audio, and video have none of these properties consistently. A malicious image may be indistinguishable from a benign one to human reviewers. An adversarial audio sample may sound like ordinary speech. A manipulated video frame may pass visual review and still trigger unintended model behavior. The absence of human legibility shifts the defense burden almost entirely to automated systems, and the automated systems for non-text modalities are less mature than their text counterparts.
This has consequences throughout the supply chain. Training data provenance, input validation, logging, and incident response all need to accommodate content that cannot be meaningfully reviewed by a human in the loop.
Training Data Provenance
Text training corpora are large but inspectable at scale. Sampling a random slice and reading it is a meaningful form of review. Image, audio, and video training data do not support this kind of sampling review. A corpus of a billion images cannot be inspected at any useful fraction, and sampled inspection catches problems only in the aggregate. Adversarial content can hide in the tails.
Providers of base models typically do not release their training corpora, which means downstream consumers are trusting the provider's internal processes to have caught problems. This trust is a supply chain decision that deserves explicit consideration. When choosing a multi-modal model, treat the provider's data curation practices as part of the evaluation. Ask what filtering was applied to training data. Ask whether known adversarial patterns were excluded. Ask how the provider would detect and respond to post-training discoveries of problematic training data.
For fine-tuning, the situation is more tractable because the fine-tune corpus is typically smaller and owned by the organization doing the fine-tuning. Apply rigorous curation: document the source of every image or audio sample, verify that consent and rights are in place, and filter for known adversarial patterns before inclusion.
Input Validation at Inference Time
Text-based systems have developed a robust set of input validation patterns: length limits, character set restrictions, encoding normalization, semantic filters for known injection strings. Multi-modal systems need equivalent patterns for each modality, and the toolkit is thinner.
For images, the most useful validation layers include format normalization, resolution limits, EXIF stripping, and checks for known adversarial patterns such as text embedded in regions that humans would not notice. Watermarking and provenance signatures are beginning to appear as a way to authenticate the origin of images, through standards like C2PA, and consuming systems should verify these signatures where possible.
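As a sketch of what the first layers can look like in practice, the following uses Pillow to enforce format and resolution limits and to strip metadata by rebuilding the image from its pixel data. The specific limits and allowed formats are illustrative, not recommendations:

```python
from io import BytesIO
from PIL import Image

MAX_PIXELS = 25_000_000                  # illustrative resolution ceiling
ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP"}

def normalize_image(raw: bytes) -> bytes:
    """Re-encode an untrusted image: enforce format and size limits, and
    drop all embedded metadata (EXIF, XMP) by rebuilding from pixels."""
    img = Image.open(BytesIO(raw))
    if img.format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {img.format}")
    if img.width * img.height > MAX_PIXELS:
        raise ValueError("image exceeds resolution limit")
    # Copying pixel data into a fresh image discards embedded metadata.
    clean = Image.new("RGB", img.size)
    clean.paste(img.convert("RGB"))
    out = BytesIO()
    clean.save(out, format="PNG")
    return out.getvalue()
```

Re-encoding rather than selectively deleting metadata fields is the safer default: anything the new encoder does not explicitly write cannot survive the round trip.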
For audio, normalization to a standard sample rate and format, length limits, and detection of spectrogram-embedded content are the current state of the practice. Audio prompt injection, where instructions are embedded in frequencies not audible to humans but legible to the model's feature extractor, is a real attack vector; filtering that removes inaudible content is a partial defense.
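A minimal normalization sketch, assuming the soundfile and scipy libraries are available; the target rate and length limit are illustrative. Resampling doubles as a blunt inaudible-content filter, since the anti-aliasing step removes everything above the new Nyquist limit (8 kHz here), at the cost of some audible high-frequency content:

```python
import numpy as np
import soundfile as sf                      # assumed available for decoding
from scipy.signal import resample_poly

TARGET_RATE = 16_000      # illustrative; resampling discards ultrasonic content
MAX_SECONDS = 300         # illustrative length limit

def normalize_audio(path: str) -> np.ndarray:
    """Decode untrusted audio, enforce a length limit, downmix to mono,
    and resample to a fixed rate, which also acts as a low-pass filter."""
    data, rate = sf.read(path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)            # downmix multi-channel to mono
    if len(data) / rate > MAX_SECONDS:
        raise ValueError("audio exceeds length limit")
    # resample_poly applies an anti-aliasing filter, so frequencies above
    # TARGET_RATE / 2 are removed rather than aliased into the signal.
    return resample_poly(data, TARGET_RATE, rate)
```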
For video, the inspection burden scales with frame count. Sampling keyframes and applying image-level defenses to them is the practical approach, with full-video inspection reserved for cases flagged by metadata or context.
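A sketch of interval-based frame sampling with OpenCV; the sampling interval is illustrative, and each yielded frame would be handed to the image-level checks described above:

```python
import cv2  # OpenCV, assumed available for frame extraction

def sample_frames(path: str, every_n_seconds: float = 2.0):
    """Yield frames at a fixed interval so image-level defenses can be
    applied without inspecting every frame of the video."""
    cap = cv2.VideoCapture(path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unknown
        step = max(1, int(fps * every_n_seconds))
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                yield frame       # hand off to the image validation layer
            index += 1
    finally:
        cap.release()
```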
Format-Specific Parsing Risks
Multi-modal inputs reach the model through feature extractors that parse file formats before feeding the content to the model. The parsing step is traditional attack surface: image libraries have long histories of vulnerabilities in format handling, and audio and video codecs are notoriously complex parsers with unresolved CVE backlogs.
This is software supply chain territory. The feature extractors are software dependencies of the model-serving stack, and they need the same vulnerability management attention as any other component. An unpatched libpng or ffmpeg in the inference pipeline is a remote code execution vulnerability that happens to be reachable through a model input.
Treat the feature extraction stack as security-sensitive. Track its components in an SBOM. Scan for known vulnerabilities. Patch on a disciplined cadence. Consider running feature extraction in an isolated sandbox separate from the rest of the model runtime, so that parser exploitation does not compromise the inference infrastructure directly.
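One way to sketch that isolation on a POSIX system is to run extraction in a child process with hard CPU and memory limits, so a parser exploit or decompression bomb is contained rather than taking down the inference service. Here `extract_features.py` is a hypothetical script name, and a production deployment would add container- or seccomp-level confinement on top of these process limits:

```python
import resource
import subprocess
import sys

def extract_features_sandboxed(input_path: str, timeout: int = 30) -> bytes:
    """Run the (hypothetical) extractor script in a separate process with
    CPU and memory ceilings applied before it starts."""
    def limit_resources():
        # Applied in the child between fork and exec (POSIX only).
        resource.setrlimit(resource.RLIMIT_CPU, (timeout, timeout))
        one_gib = 1 << 30
        resource.setrlimit(resource.RLIMIT_AS, (one_gib, one_gib))

    proc = subprocess.run(
        [sys.executable, "extract_features.py", input_path],  # hypothetical script
        preexec_fn=limit_resources,
        capture_output=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"extraction failed: {proc.stderr[:200]!r}")
    return proc.stdout
```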
Adversarial Patterns Specific to Each Modality
Each modality has its own catalog of adversarial patterns that attackers use to manipulate multi-modal models.
Images are vulnerable to typographic attacks, where text rendered into the image overrides the semantic content, and to patch attacks, where small regions with specific perturbations cause large shifts in model output. Visual prompt injection, where images contain human-readable or machine-readable instructions intended for the model to follow, is increasingly common in document processing pipelines.
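A simple screening layer for rendered-text injection is to run OCR over the image and flag instruction-like strings before the image reaches the model. The marker list below is illustrative; a production filter would be considerably broader and would also examine low-contrast regions that OCR tuned for normal text can miss:

```python
from PIL import Image
import pytesseract  # assumed available; wraps the Tesseract OCR engine

SUSPECT_MARKERS = ("ignore previous", "system prompt", "instructions:")  # illustrative

def screen_for_rendered_text(img: Image.Image) -> list[str]:
    """Run OCR over an image and flag extracted text that resembles
    instructions aimed at the model rather than content for the user."""
    text = pytesseract.image_to_string(img).lower()
    return [marker for marker in SUSPECT_MARKERS if marker in text]
```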
Audio is vulnerable to phonetic attacks, where samples that sound like one thing to humans are interpreted as another by the model, and to subliminal embedding, where content is hidden in frequency ranges outside the human audible band. Voice cloning presents a distinct risk, not to the model itself but to downstream systems that treat model-generated audio as authenticated.
Video combines the risks of its component modalities with additional ones specific to temporal patterns. Frame-specific perturbations that do not affect human perception can shift model interpretation of a sequence. Synchronization between audio and visual tracks can be exploited when the model processes them jointly.
Staying current with the adversarial catalog for each modality is a research task. Designate someone in the security team to track the literature and update the defenses as new patterns are documented.
Logging and Retention
Logging multi-modal inputs is expensive. A text log entry might be a few kilobytes; an image input is often several megabytes. Videos are orders of magnitude larger. Full retention of all multi-modal inputs is impractical for most deployments.
The compromise is tiered retention. Hashes of all inputs are retained long-term, while the original content is kept for a shorter window. For inputs that trigger security-relevant events, such as content filter hits or downstream incident investigations, the original content is promoted to long-term retention.
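A sketch of that promotion logic, where `store` is a hypothetical object wrapping whatever blob and metadata storage the deployment uses, and the windows are illustrative:

```python
import hashlib
import time

DAY = 86_400
SHORT_WINDOW = 30 * DAY       # illustrative: original content kept 30 days
LONG_WINDOW = 730 * DAY       # illustrative: hashes kept two years

def record_input(content: bytes, store) -> str:
    """Retain a hash of every input long-term and the original content
    only for the short window, unless it is promoted later."""
    digest = hashlib.sha256(content).hexdigest()
    now = time.time()
    store.put_hash(digest, expires=now + LONG_WINDOW)
    store.put_blob(digest, content, expires=now + SHORT_WINDOW)
    return digest

def promote(digest: str, store) -> None:
    """On a security-relevant event (filter hit, incident investigation),
    extend the original content to long-term retention."""
    store.extend_blob(digest, expires=time.time() + LONG_WINDOW)
```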
This tiering has a privacy dimension. Images and audio often contain personally identifiable information, and long-term retention carries regulatory obligations. Coordinate retention policy with privacy and legal teams, and apply access controls on the retained content that reflect its sensitivity.
Incident Response for Multi-Modal Incidents
When an incident involves multi-modal input, the investigation steps differ from text-only incidents. The input itself is usually the single most important artifact, but sharing it during investigation may violate privacy policies. Techniques for redacting or summarizing multi-modal content for shareable reports are less developed than for text, and ad hoc approaches often leak information or destroy the evidence.
Plan for this in advance. Establish procedures for how multi-modal artifacts are handled during incident response, who is allowed to view them, and how they are summarized for wider communication. Practice the procedures on synthetic incidents; the real one is a bad moment to work out the details.
Downstream Trust in Model Outputs
Multi-modal models produce outputs across multiple modalities, not just text. Generated images, synthesized audio, and edited video all carry provenance implications that flow downstream. A document or media asset produced by the model needs to be marked as such, and consuming systems need to handle the mark appropriately.
C2PA and similar provenance standards are the emerging answer. Where the deployment produces media that will be distributed externally, embed provenance metadata that identifies the output as model-generated. Where the deployment consumes media from external sources, verify provenance where present and treat unverified media with appropriate skepticism.
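A sketch of how a consuming system might classify incoming media by provenance state; `read_manifest` and the `Provenance` type are placeholders standing in for a real C2PA toolkit, not its actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Provenance:
    """Minimal stand-in for the result of a real C2PA manifest read."""
    signer: str
    signature_valid: bool
    generated_by_model: bool

def classify_media(path: str,
                   read_manifest: Callable[[str], Optional[Provenance]]) -> str:
    """Map provenance state to a handling tier. Media with no manifest is
    not rejected outright, just treated as unverified downstream."""
    manifest = read_manifest(path)
    if manifest is None:
        return "unverified"
    if not manifest.signature_valid:
        return "reject"           # a tampered provenance chain is a strong signal
    return "model-generated" if manifest.generated_by_model else "verified"
```

The asymmetry is deliberate: absent provenance is the common case today and warrants skepticism, but an invalid signature means someone altered signed media, which is a stronger signal than having no signature at all.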
How Safeguard Helps
Safeguard extends its supply chain inventory to cover multi-modal AI systems, tracking the feature extraction libraries, model weights, fine-tune adapters, and serving runtimes that handle each modality. The platform scans multi-modal dependencies for known vulnerabilities, monitors weights for provenance and integrity, and correlates multi-modal inputs with downstream findings so security teams can investigate incidents involving non-text content. Policy gates can prevent deployment of multi-modal systems whose feature extractors are on vulnerable versions or whose training data lineage cannot be verified, bringing the same supply chain discipline to image, audio, and video pipelines that has matured for text.