The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing multi-turn jailbreak attacks: explicit harmful triggers are increasingly detected by context-aware alignment mechanisms, and successful attacks depend on finely tuned, model-specific contexts, which limits their practicality. The authors identify "Salami Slicing Risk", in which a chain of individually low-risk inputs gradually accumulates malicious intent without triggering defenses, ultimately inducing large language models to generate high-risk content. They introduce Salami Attack, an automated, general-purpose framework that does not depend on complex pre-designed contexts, and pair it with a context-aware dynamic defense strategy. Experiments show attack success rates above 90% against GPT-4o and Gemini, while the proposed defense reduces Salami Attack success by at least 44.8% and blocks up to 64.8% of other multi-turn jailbreak attacks.

📝 Abstract

Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models into bypassing built-in security constraints and generating unethical or unsafe content. Among jailbreak techniques, multi-turn attacks are more covert and persistent than their single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that reduce their impact in real-world scenarios: (a) as models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose Salami Slicing Risk, which chains numerous low-risk inputs that individually evade alignment thresholds but together accumulate harmful intent and ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy that reduces the success of Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.
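
To make the cumulative-risk idea concrete from the defender's side, here is a minimal Python sketch of a conversation-level monitor. It is not the defense proposed in the paper, whose details are not given on this page. The class name `CumulativeRiskMonitor`, the thresholds, the decay factor, and the per-turn risk scores are all hypothetical, and the scores are assumed to come from some external risk classifier. The sketch only shows why a per-turn threshold alone can miss a chain of low-risk turns while a running total can catch it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CumulativeRiskMonitor:
    """Toy conversation-level monitor (hypothetical; not the paper's defense)."""
    per_turn_threshold: float = 0.5    # assumed: a single turn at or above this is blocked
    cumulative_threshold: float = 1.0  # assumed: a running total at or above this is blocked
    decay: float = 0.9                 # assumed: older turns count slightly less
    running_risk: float = 0.0
    history: List[float] = field(default_factory=list)

    def check(self, turn_risk: float) -> str:
        """Return 'block_turn', 'block_cumulative', or 'allow' for one turn."""
        # A conventional per-turn filter: only looks at the current input.
        if turn_risk >= self.per_turn_threshold:
            return "block_turn"
        # Decay the accumulated risk, then add the current turn's score.
        candidate = self.running_risk * self.decay + turn_risk
        if candidate >= self.cumulative_threshold:
            return "block_cumulative"
        # Only accepted turns contribute to the conversation state.
        self.running_risk = candidate
        self.history.append(turn_risk)
        return "allow"


if __name__ == "__main__":
    monitor = CumulativeRiskMonitor()
    # Four turns, each scored 0.3 by the (assumed) classifier: all are
    # below the per-turn threshold, but the running total crosses 1.0.
    decisions = [monitor.check(0.3) for _ in range(4)]
    print(decisions)  # ['allow', 'allow', 'allow', 'block_cumulative']
```

With these illustrative values, every turn passes the per-turn check, yet the fourth turn is blocked because the decayed running total reaches about 1.03. The decay factor is a design choice in this sketch: it keeps long, benign conversations from being blocked just because they are long.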

Problem

Research questions and friction points this paper is trying to address.

jailbreaking

multi-turn attacks

cumulative risks

LLM security

alignment bypass

Innovation

Methods, ideas, or system contributions that make the work stand out.

Salami Slicing Risk

multi-turn jailbreak

cumulative risk