AI Security

Llama 4 Release and LlamaFirewall: A Defender's Guide

Meta shipped Llama 4 Scout and Maverick on April 5, 2025, along with Llama Guard 4, LlamaFirewall, and CyberSecEval 4. We unpack what defenders should deploy and what to ignore.

On April 5, 2025, Meta released the Llama 4 family — initially Llama 4 Scout (17B active parameters, 16 experts) and Llama 4 Maverick (17B active, 128 experts), both natively multimodal. The release was accompanied by a refreshed defender stack: Llama Guard 4 (input/output safety classifier), LlamaFirewall (a runtime guardrail system for agentic applications), Llama Prompt Guard 2 (a classifier trained to detect jailbreaks and prompt injections), and CyberSecEval 4 (the benchmark suite for evaluating LLM cybersecurity properties). For enterprises that already self-host or fine-tune open-weight models, this release substantially changes the toolkit available to defenders. This post walks through what each component does, what its limitations are, and how to roll them into an existing security architecture.

What is new in Llama 4 from a security standpoint?

Three operational properties matter. First, Llama 4 is natively multimodal — the model sees images directly rather than via a frozen vision encoder, which changes the prompt-injection threat surface compared to Llama 3-derived stacks. Second, Meta describes a heavier emphasis on driving down refusals to benign-but-borderline prompts, which is a quality win but creates new tuning pressure on downstream safety classifiers. Third, Meta integrated the Generative Offensive Agent Tester throughout training, an adversarial methodology designed to probe LLM susceptibilities and improve baseline robustness. Defenders should not assume the model is "safer" in any absolute sense — the safety profile is different, not strictly better, and the operative question is whether your downstream guardrails match the changed surface.

How should we deploy Llama Guard 4?

Llama Guard 4 is a 12B-parameter safety classifier trained to label both inputs and outputs against a configurable taxonomy. It supports custom categories — you can specify your own harm classes via prompt and the model classifies accordingly. The right deployment pattern is a two-stage gate: Llama Guard 4 inspects user prompts before they reach the main model, and inspects model outputs before they reach the user. Latency overhead is non-trivial (typically 100-400ms per call on H100-class hardware), so for production you batch where possible and consider a lighter classifier for low-risk turns. Critically, Llama Guard 4 is not a prompt-injection detector — it is a content classifier. Stacking it with Prompt Guard 2 covers both axes.

When does LlamaFirewall help versus when does it not?

LlamaFirewall is Meta's runtime guardrail system specifically designed for agentic applications. It implements policy enforcement around tool calls, code execution, and prompt-injection-sensitive flows. The framework supports policy expressions that constrain which tools an agent may call, which file paths it may touch, and which network destinations it may reach. The honest assessment: LlamaFirewall is a meaningful improvement over hand-rolled scope enforcement, but it is not a substitute for proper sandboxing. Treat it as a defense-in-depth layer that catches policy violations the agent itself attempts to execute. The system call boundary (containers, gVisor, microVMs) remains your authoritative enforcement point.

What about Llama Prompt Guard 2?

Prompt Guard 2 is a small classifier (typically deployed as an 86M or 22M parameter model) trained to detect both jailbreak attempts and indirect prompt injection — the latter being content that originates from a document, tool output, or other non-user source and attempts to redirect the model's behavior. The model returns three classes: benign, jailbreak, injection. Meta reports improved precision-recall over Prompt Guard 1 on their internal benchmark. The deployment pattern: run Prompt Guard 2 on every tool-output before it lands in the model's context window. This is the single highest-leverage control for any agent that ingests web content, email, documents, or RAG retrievals.

# Example agentic stack with Llama 4 defender components
agent:
  base_model: meta-llama/Llama-4-Scout-17B-16E
  preflight_guards:
    - name: prompt_guard_2
      mode: classify_user_input
      block_classes: ["jailbreak"]
      flag_classes: ["injection"]
    - name: llama_guard_4
      mode: classify_user_input
      taxonomy: ./harm_taxonomy.json
  tool_output_guards:
    - name: prompt_guard_2
      mode: classify_tool_output
      block_classes: ["injection", "jailbreak"]
  postflight_guards:
    - name: llama_guard_4
      mode: classify_model_output
      taxonomy: ./harm_taxonomy.json
  policy_runtime:
    framework: llamafirewall
    config: ./firewall_policies.yaml
  sandbox:
    type: gvisor
    egress_allowlist: ["api.example-corp.internal"]

How should we use CyberSecEval 4?

CyberSecEval 4 is the benchmark suite — not a runtime tool — and it is the right artifact to use when comparing Llama 4 against your own fine-tunes or against other vendors. The suite covers insecure code generation rates, vulnerability identification, exploit construction limitations, and several agentic-cyber-task evaluations. Meta publishes the harness; you run it against your model and get comparable numbers. Two cautions. First, the benchmark is partially saturated for some sub-tasks at the frontier — Llama 4 Maverick and competing frontier models all do well on the easier categories, so per-category resolution matters. Second, CyberSecEval is a Meta benchmark; pair it with at least one independent suite (CyberSecBench, Anthropic's internal cyber suite if you can access it through a partnership) to avoid overfitting to a single vendor's defining of "cyber capability."

What about the open-weight supply-chain risks?

Llama 4 weights are distributed via the Llama license — a community license with some commercial restrictions. The weight files are large (hundreds of GB) and the Hugging Face mirror is the most common distribution point. Two supply-chain risks deserve mention. First, integrity: pin to the specific Hugging Face revision (commit SHA) rather than the floating reference, and verify SHA-256 against Meta's published checksums. Second, distribution-channel risk: in early 2025 Hugging Face deployments saw repeated pickle-based supply-chain attacks (the nullifAI campaign, JFrog's broader research). Llama 4 itself is distributed in safetensors format, which mitigates the pickle vector, but companion files (tokenizer, configuration, conversion scripts) may still come in pickle format. Scan everything in the model directory, not just the weights.

What about fine-tuning Llama 4 safely?

Llama 4's open-weight nature means a non-trivial fraction of enterprise users will fine-tune it. The defender concerns at fine-tune time differ from inference-time concerns. The training data is the new attack surface: a poisoned fine-tuning corpus can introduce backdoors that are undetectable through normal evaluation but trigger on specific input patterns. Apply the same RAG-corpus hygiene controls to fine-tuning datasets — source allowlisting, prompt-injection scanning, deduplication, and provenance recording. The training run itself should be reproducible: record the base model SHA, the dataset SHAs, the hyperparameters, and the random seed in a training attestation. CycloneDX 1.7 ML-BOM's pedigree block is the right format for this attestation. After fine-tuning, re-run CyberSecEval 4 and your internal safety benchmarks against the resulting checkpoint; do not assume the base-model safety properties transfer cleanly. Multiple 2025 research efforts (the BadLLM family of papers, follow-on work on backdoor-via-fine-tune) demonstrated that fine-tuning can silently erode safety properties without affecting capability benchmarks. The audit trail you need is the AIBOM pedigree plus the post-fine-tune evaluation results.

How Safeguard Helps

Safeguard catalogs Llama 4 model variants, Llama Guard 4, LlamaFirewall, and Prompt Guard 2 as first-class components in your AIBOM, with pinned revisions and continuous SHA verification against Meta's checksums. When Hugging Face ships a new revision of a weight file, Safeguard detects the drift and flags whether the change is approved or unexpected. Griffin AI runs CyberSecEval 4 plus the SecBench independent suite against your fine-tunes and produces a comparison report, so you have defensible numbers for both procurement and audit. Policy gates enforce that any deployment using a Llama 4 derivative also wires in Prompt Guard 2 on tool outputs and Llama Guard 4 on user inputs, blocking PRs that ship a bare agent without guardrails. The result: the defender stack Meta shipped becomes mandatory architecture in your organization, not optional add-ons.

meta llama-4 llamafirewall llama-guard open-weights

Back to all articles