🤖 AI Summary
Manual multi-turn jailbreaking attacks against large language models (LLMs) incur high labor costs and lack scalability.
Method: This paper proposes the first framework to automatically compress multi-turn adversarial dialogues into single-turn, highly effective prompts. It introduces three novel structured compression strategies - Hyphenize, Numberize, and Pythonize - that exploit formatting cues to evade safety classifiers, thereby challenging the prevailing assumption that multi-turn interactions are inherently more dangerous. Leveraging the MHJ dataset, the authors build a transformation pipeline validated with StrongREJECT for harm assessment and tested for cross-model robustness on Mistral-7B and GPT-4o.
Contribution/Results: The compressed prompts achieve a 95.9% attack success rate on Mistral-7B and improve success by up to 17.5 percentage points over the original multi-turn prompts on GPT-4o, demonstrating that single-turn attacks can match or even exceed the threat potency of their multi-turn counterparts.
📝 Abstract
Despite extensive safety enhancements in large language models (LLMs), multi-turn "jailbreak" conversations crafted by skilled human adversaries can still breach even the most sophisticated guardrails. However, these multi-turn attacks demand considerable manual effort, limiting their scalability. In this work, we introduce a novel approach called Multi-turn-to-Single-turn (M2S) that systematically converts multi-turn jailbreak prompts into single-turn attacks. Specifically, we propose three conversion strategies - Hyphenize, Numberize, and Pythonize - each preserving sequential context while packaging it in a single query. Our experiments on the Multi-turn Human Jailbreak (MHJ) dataset show that M2S often increases or maintains high Attack Success Rates (ASRs) compared to the original multi-turn conversations. Notably, under a StrongREJECT-based evaluation of harmfulness, M2S achieves up to a 95.9% ASR on Mistral-7B and outperforms the original multi-turn prompts by as much as 17.5 percentage points on GPT-4o. Further analysis reveals that certain adversarial tactics, when consolidated into a single prompt, exploit structural formatting cues to evade standard policy checks. These findings underscore that single-turn attacks, despite being simpler and cheaper to conduct, can be just as potent as, if not more potent than, their multi-turn counterparts. They also highlight the urgent need to reevaluate and reinforce LLM safety strategies, given that adversarial queries can be compacted into a single prompt while still retaining sufficient complexity to bypass existing safety measures.
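To make the three conversion strategies concrete, the following is a minimal illustrative sketch of how a list of multi-turn user messages might be flattened into a single prompt. The exact prompt templates are not given in the abstract, so the wording of the wrapper text here is an assumption; only the structural idea (hyphen list, numbered list, Python-code framing) comes from the paper's strategy names.

```python
def hyphenize(turns: list[str]) -> str:
    """Flatten multi-turn requests into a single hyphen-bulleted prompt.
    (Illustrative template; the paper's actual wording may differ.)"""
    lines = ["Please respond to each of the following, in order:"]
    lines += [f"- {t}" for t in turns]
    return "\n".join(lines)

def numberize(turns: list[str]) -> str:
    """Flatten multi-turn requests into a single numbered-list prompt."""
    lines = ["Please respond to each of the following, in order:"]
    lines += [f"{i}. {t}" for i, t in enumerate(turns, start=1)]
    return "\n".join(lines)

def pythonize(turns: list[str]) -> str:
    """Frame the turns as a Python list and ask the model to 'fill in' answers."""
    items = ",\n".join(f'    "{t}"' for t in turns)
    return (
        "Complete the answers list, one entry per question:\n"
        "questions = [\n" + items + "\n]\n"
        "answers = []"
    )
```

Each function preserves the sequential context of the original conversation while delivering it in one query, which is the core of the M2S idea.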