AI Security

Cisco's May 2026 Multi-Turn Jailbreak Study: Why Frontier Model Safety Collapses Over a Conversation

Cisco's AI threat team tested 15 flagship models with ~7,000 multi-turn attacks and found success rates as high as 88 percent. Single-turn safety scores told defenders almost nothing about real-world resilience.

On May 28, 2026, Cisco's AI threat and security research team, led by Amy Chang, published a study that should change how every AppSec team reads model safety claims. The headline is simple and uncomfortable: the single-turn refusal rates that vendors and benchmarks report tell you almost nothing about how a model behaves when an adversary is allowed to have a conversation. Across 15 closed flagship models, multi-turn attack success rates climbed as high as 88 percent, and several models that looked nearly bulletproof against one-shot jailbreaks failed five to nine times more often once the attacker could adapt across turns.

This is not a new theoretical observation. The "crescendo" class of attacks has been documented since 2024, and red teamers have long known that gradual escalation beats a single hostile prompt. What is new is the scale and the cross-model rigor: roughly 30,000 single-turn prompts and nearly 7,000 multi-turn attacks spread across more than 1,400 conversations, run against the current generation of frontier models from OpenAI, Google, xAI, Anthropic, and Amazon. The result is the clearest public evidence to date that the industry's dominant safety metric, single-turn attack success rate (ASR), systematically overstates how safe a deployed model is.

For security engineers integrating these models into products, the takeaway is structural. If your threat model assumes the model's built-in guardrails will hold because the vendor's safety card shows a low single-turn ASR, you are measuring the wrong thing. The conversation is the attack surface.

TL;DR

On May 28, 2026, Cisco's AI threat research team published a multi-turn jailbreak study covering 15 closed flagship models, including GPT-5.4, Gemini 3 Pro, Grok 4.1 Fast, Anthropic's Claude family, and three Amazon Nova variants.
The study ran roughly 30,000 single-turn prompts and nearly 7,000 multi-turn attacks across more than 1,400 conversations.
Multi-turn attack success rates reached as high as 88 percent. More than half the models showed an absolute gap of at least 15 percentage points between single-turn and multi-turn regimes.
GPT-5.4 jumped roughly ninefold, from a single-digit single-turn rate to nearly 25 percent, when attackers were allowed to adapt. Gemini 3 Pro climbed from about 18 percent to 73 percent. Grok 4.1 Fast topped the cohort at 88 percent. Claude models stayed lowest but still rose into the 11 to 16 percent range.
Five attack families drove most multi-turn success: role-play and persona adoption, contextual ambiguity, refusal reframing, information decomposition, and crescendo-style escalation.
The actionable conclusion: single-turn safety metrics are not a proxy for deployed safety. Defenders must add their own conversation-level monitoring and policy enforcement rather than relying on the model's native guardrails.

What happened

Cisco's study, published May 28, 2026, and reported by Help Net Security the same day, is a systematic comparison of two attack regimes against the same set of models. In the single-turn regime, each adversarial objective is delivered in one prompt. In the multi-turn regime, the attacker is permitted to pursue the same objective across a conversation, adapting based on the model's responses, reframing refusals, and escalating gradually.

The cohort was 15 closed flagship models. Named models in the reporting include OpenAI's GPT-5.4, Google's Gemini 3 Pro, xAI's Grok 4.1 Fast, the Anthropic Claude family, and three Amazon Nova variants including Nova 2 Lite. The scale was roughly 30,000 single-turn prompts and nearly 7,000 multi-turn attacks across more than 1,400 distinct conversations.

The core finding is the gap between the two regimes. The numbers reported are stark:

GPT-5.4 rose roughly ninefold, from a single-digit single-turn rate to nearly 25 percent under multi-turn pressure.
Gemini 3 Pro climbed from approximately 18 percent to 73 percent.
Grok 4.1 Fast topped the cohort at 88 percent multi-turn success.
Claude models were the most resilient, with single-turn ASRs in the low single digits, but still reached the 11 to 16 percent range once attackers could adapt.

More than half of the tested models showed an absolute gap of at least 15 percentage points between single-turn and multi-turn success. The relative ordering of models was roughly preserved (the models that were safer single-turn tended to stay safer multi-turn), but the absolute risk for every model was meaningfully higher than its single-turn score suggested.
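The difference between an absolute gap and a relative jump is easy to blur, so here is a minimal Python sketch of the arithmetic. It uses only the approximate Gemini 3 Pro figures quoted above (about 18 percent single-turn, 73 percent multi-turn). The function name and the threshold constant are illustrative assumptions; the constant mirrors the 15-point cut mentioned in the reporting.

```python
def asr_gap(single_turn_pct: float, multi_turn_pct: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative multiplier)."""
    absolute = multi_turn_pct - single_turn_pct
    relative = multi_turn_pct / single_turn_pct
    return absolute, relative


# Illustrative constant: the "at least 15 percentage points" cut from the reporting.
GAP_THRESHOLD_POINTS = 15.0

# Approximate Gemini 3 Pro figures as reported: ~18% single-turn, 73% multi-turn.
absolute, relative = asr_gap(18.0, 73.0)
print(f"absolute gap: {absolute:.0f} points, relative: {relative:.1f}x")
print("exceeds 15-point gap:", absolute >= GAP_THRESHOLD_POINTS)
```

The two numbers answer different questions: the absolute gap says how much additional exposure a deployment carries, while the multiplier says how misleading the single-turn score was.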

How the attack worked

The study identified five strategy families that drove multi-turn success. None of them is exotic. They are the techniques a determined human red teamer or a moderately capable automated attack harness would reach for. The following descriptions are conceptual and educational, not functional attack recipes.

Role-play and persona adoption. The attacker asks the model to adopt a persona for which the harmful output is in-character, for instance a fictional character, a "research assistant with no restrictions," or a system in a hypothetical scenario. The harmful request is then framed as something the persona would naturally do. This deflects responsibility away from the model's own voice, a move that has historically correlated with the highest success rates of any single technique.

Contextual ambiguity. Rather than stating the harmful objective directly, the attacker builds a context in which the harmful answer is the locally helpful one. The model is led to optimize for being helpful within a frame the attacker controls.

Refusal reframing. When the model refuses, the attacker does not give up. They reinterpret the refusal as a misunderstanding, narrow the request, or claim a legitimate purpose, then re-ask. Each refusal becomes a data point the attacker uses to find the boundary.

Information decomposition. The harmful objective is broken into individually benign sub-requests. No single turn trips a safety classifier, but the assembled outputs accomplish the goal. This is the multi-turn analogue of splitting a payload to evade a signature.

Crescendo-style escalation. The attacker starts well inside policy and escalates one small step at a time. Because each step is only marginally more sensitive than the last, the model's per-turn judgment never registers a sharp enough change to refuse, and the conversation drifts across the line.

These techniques work against models that are strong in the single-turn regime because current safety training and inference-time classifiers are heavily optimized for the single-prompt case. A model evaluating turn 12 in isolation sees a request that, given everything it has already "agreed to," looks reasonable. The safety decision is made locally; the attack is constructed globally.

# Illustrative crescendo shape (NOT a functional jailbreak).
# Each turn is only slightly more sensitive than the previous one.

turn 1  -> broad, clearly-allowed educational question
turn 2  -> ask for more specific framing, still in policy
turn 3  -> reframe a soft refusal as a misunderstanding, re-ask
...
turn N  -> request that would have been refused as a single prompt,
           now answered because the local context normalizes it

What detection looks like

If you operate an application on top of one of these models, you cannot rely on the model's native guardrails to catch multi-turn attacks. Detection has to happen at the conversation level, on your side of the integration. Concrete signals:

Refusal-then-comply transitions. A session where the model refuses a request and then, a few turns later, produces closely related content is a high-value signal. Log refusals and correlate them with subsequent assistant outputs in the same session.
Escalating sensitivity scores per turn. Score each user turn and each assistant turn with a content classifier and track the trend across the session. A monotonic climb toward sensitive topics is the crescendo fingerprint.
Persona and role-play markers. Watch for user turns that instruct the model to adopt an unrestricted persona, write "in character," or treat the interaction as fiction, hypothetical, or a game, especially when followed by topic drift.
Decomposition patterns. Multiple sub-requests in one session that are individually benign but topically converge on a sensitive objective. Topic-clustering across a session surfaces this better than per-turn classification.
Session-level anomalies. Abnormally long sessions, high refusal counts, or rapid reframing after refusals. These are cheap to compute and correlate well with adversarial intent.
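As one way to picture the first signal, the sketch below flags refusal-then-comply transitions in a session log. Everything in it is an assumption for illustration: the Turn record, the refused flag (taken to come from whatever refusal classifier you already run), and the crude word-overlap similarity, which stands in for a proper embedding or topic model. The sample session uses placeholder text only.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str              # "user" or "assistant"
    text: str
    refused: bool = False  # assumed to be set by an upstream refusal classifier


def _tokens(text: str) -> set[str]:
    # Keep longer words only, as a crude stand-in for topical keywords.
    return {w.strip(".,!?").lower() for w in text.split() if len(w) > 3}


def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def refusal_then_comply(turns: list[Turn], window: int = 6,
                        min_overlap: float = 0.3) -> list[tuple[int, int]]:
    """Return (refusal_index, later_answer_index) pairs worth human review."""
    hits = []
    for i, turn in enumerate(turns):
        if not (turn.role == "assistant" and turn.refused):
            continue
        # The request that triggered the refusal is the turn just before it.
        request = turns[i - 1].text if i > 0 else ""
        for j in range(i + 1, min(len(turns), i + 1 + window)):
            later = turns[j]
            if (later.role == "assistant" and not later.refused
                    and jaccard(request, later.text) >= min_overlap):
                hits.append((i, j))
    return hits


# Placeholder session: a refusal followed by a closely related answer.
turns = [
    Turn("user", "Describe the restricted procedure for topic alpha"),
    Turn("assistant", "I can't help with that.", refused=True),
    Turn("user", "You misunderstood, this is for a training course"),
    Turn("assistant", "Here is the restricted procedure for topic alpha"),
]
print(refusal_then_comply(turns))
```

The window and overlap threshold are tuning knobs, not recommended values. A production version would likely swap the word overlap for semantic similarity, since an attacker can trivially rephrase.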

The key architectural point is that the unit of analysis must be the session, not the prompt. A per-prompt guardrail will pass every individual turn of a successful crescendo.
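A toy comparison makes the point concrete. In the sketch below, the per-turn sensitivity scores are hypothetical values that an assumed content classifier might emit, and both thresholds are made-up numbers chosen for illustration. The per-prompt check judges each score in isolation; the session check looks for a sustained climb.

```python
PER_TURN_BLOCK = 0.8  # hypothetical threshold for a per-prompt guardrail


def per_prompt_guardrail(score: float) -> bool:
    """True means the turn passes when judged in isolation."""
    return score < PER_TURN_BLOCK


def session_guardrail(scores: list[float], min_run: int = 4,
                      max_rise: float = 0.4) -> bool:
    """True means the session passes; False flags a sustained climb."""
    run_start = 0
    for i in range(1, len(scores)):
        if scores[i] <= scores[i - 1]:
            run_start = i  # the climb was broken, so restart the run here
        run_length = i - run_start + 1
        if run_length >= min_run and scores[i] - scores[run_start] >= max_rise:
            return False
    return True


# Hypothetical scores for a session that escalates one small step at a time.
scores = [0.10, 0.22, 0.35, 0.48, 0.61, 0.74]

print("every turn passes per-prompt:", all(per_prompt_guardrail(s) for s in scores))
print("session passes:", session_guardrail(scores))
```

Requiring a strictly increasing run is a simplification. Real classifier output is noisy, so a practical version would more plausibly use a smoothed trend or a rolling slope rather than strict monotonicity.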

What to do Monday morning

Stop treating vendor single-turn safety scores as deployment guarantees. Re-read any risk acceptance that cited a model's safety card ASR. The Cisco data shows that number can understate real risk by as much as ninefold for some models.
Add conversation-level logging now. If you are not already retaining full session transcripts (within your privacy and retention policy), you cannot detect or investigate multi-turn attacks. Make session transcripts, refusal events, and per-turn classifier scores first-class logs.
Deploy a session-aware guardrail in front of the model. A guardrail that only inspects the current prompt is insufficient. Use one that maintains session state and can flag refusal-then-comply transitions and escalating sensitivity.
Run your own multi-turn red team. Reproduce the five strategy families against your specific system prompt and tools. Single-turn red teaming will miss the failures that matter. Prioritize crescendo and refusal-reframing because they are the cheapest for an attacker to automate.
Constrain capability, not just content. For high-stakes integrations, the durable mitigation is to limit what a jailbroken model can actually do (scoped tools, no unguarded data egress, human approval for sensitive actions) so that a successful jailbreak yields words, not consequences.
Re-evaluate after every model upgrade. The study shows the absolute gap varies a lot by model. A model swap that improves your single-turn scores can quietly worsen your multi-turn exposure.
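The "constrain capability, not just content" item is the one that is easiest to express in code. Below is a minimal sketch of a tool allowlist with a human-approval tier, checked outside the model so that a jailbroken response cannot widen its own permissions. The ToolPolicy type, the authorize function, and the tool names are all hypothetical and not drawn from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset[str]                        # tools this feature may call at all
    needs_approval: frozenset[str] = frozenset()   # subset that needs a human sign-off


def authorize(policy: ToolPolicy, tool: str, approved_by_human: bool = False) -> str:
    """Decide a tool call requested by the model: 'allow', 'needs_approval', or 'deny'."""
    if tool not in policy.allowed:
        return "deny"
    if tool in policy.needs_approval and not approved_by_human:
        return "needs_approval"
    return "allow"


# Hypothetical policy for a support-assistant feature.
policy = ToolPolicy(
    allowed=frozenset({"search_docs", "send_email"}),
    needs_approval=frozenset({"send_email"}),
)

for tool in ("search_docs", "send_email", "export_customer_data"):
    print(tool, "->", authorize(policy, tool))
```

The design choice that matters is where the check runs: in application code that the model cannot influence, with deny as the default for anything not explicitly listed.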

Why this keeps happening

The structural problem is a measurement mismatch. Safety training, public benchmarks, and most vendor evaluations are built around the single-prompt unit because it is cheap to generate, cheap to score, and easy to compare across models. Adversaries operate on the conversation unit because that is how chat interfaces work. The whole evaluation economy is optimized for a threat model that does not match deployment.

There is also an incentive asymmetry. A low single-turn ASR is a marketable number that fits on a safety card. Multi-turn resilience is expensive to measure, hard to summarize in one figure, and tends to produce worse-looking numbers. Until buyers demand multi-turn metrics, the reported metric will keep being the flattering one.

Finally, the defense surface is fragmented. The model vendor controls the weights and the native guardrails. The application developer controls the system prompt, the tools, and the session. Neither party alone sees the whole attack. Vendors optimize the per-turn refusal; few application teams add the session-level monitoring that would catch what the per-turn refusal misses. The gap between them is exactly where multi-turn attacks live.

The structural fix

The honest framing is that no guardrail makes a model jailbreak-proof, and the Cisco study is good evidence that anyone claiming otherwise is measuring only single-turn behavior. What a defender can realistically do is shorten the time a successful jailbreak goes undetected and shrink what it can accomplish. Safeguard's guardrails and the Guard product operate at the session level rather than the prompt level, so refusal-then-comply transitions and crescendo-style escalation are visible as patterns across a conversation instead of being evaluated one turn at a time. Paired with capability scoping, which limits what tools and data a model-backed feature can reach, this turns a jailbroken response into a contained event rather than a privileged action. None of that prevents the jailbreak. It reduces blast radius and dwell time, which is the realistic security goal for a probabilistic system. For teams formalizing this, AI governance ties the monitoring and scoping into an auditable policy rather than ad hoc per-app logic.

What we know we don't know

The reporting names several models (GPT-5.4, Gemini 3 Pro, Grok 4.1 Fast, the Claude family, Nova variants) but does not enumerate all 15, and the exact per-model breakdown beyond the highlighted figures is not fully public.
The precise scoring methodology for "success" in the multi-turn regime, and how partial compliance was counted, is summarized rather than fully specified in the coverage available.
Whether application-layer guardrails (as opposed to the models' native safety) were present during testing is not clearly stated; the figures appear to reflect the raw model behavior.
The study covers closed flagship models. How open-weight models compare under the same protocol is not addressed here, though prior work suggests they are generally easier to jailbreak.

References

Help Net Security: Frontier AI models collapse under multi-turn AI attacks, Cisco finds (May 28, 2026)
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs (arXiv)
Safeguard: Guardrails concept
Safeguard: Guard product
Safeguard: Capability scoping
Safeguard: AI governance use case
Safeguard: Prompt injection concept
