🤖 AI Summary
Multimodal large language models (MLLMs) are vulnerable to text-driven multimodal jailbreak attacks, and existing defenses remain limited because they fail to localize the token-level origins of these vulnerabilities. This paper proposes SafePTR (Safe Prune-then-Restore), a training-free, inference-time defense: leveraging multimodal attention analysis, it identifies the fewer than 1% of harmful visual tokens in early-middle layers that dominate jailbreak behavior, prunes them at the vulnerable layers, and restores benign features at subsequent layers to disrupt the jailbreak pathway. The method requires no safety fine-tuning, incurs no additional computational overhead, and avoids over-defensive behavior. Evaluated across three mainstream MLLMs and five benchmarks, it significantly reduces jailbreak success rates while preserving performance on standard tasks. To the authors' knowledge, this is the first approach to combine token-level interpretability, efficiency, and robustness in multimodal jailbreak defense.
📝 Abstract
By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet, they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibit over-defensive behavior, and impose heavy training overhead. To bridge this gap, we present a comprehensive analysis of where, how, and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that fewer than 1% of tokens in early-middle layers are responsible for inducing unsafe behaviors, suggesting that precisely removing this small subset of harmful tokens, without any safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers. Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.
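
The following is a minimal, self-contained sketch of the prune-then-restore mechanism described above, not the authors' implementation: the toy transformer block, the attention-mass scoring rule, and the layer indices (`prune_at`, `restore_at`) are all illustrative assumptions, since the paper's exact token-scoring and layer-selection procedures are not reproduced here.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """A single self-attention + MLP block standing in for one MLLM layer."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        # attn_w has shape [batch, tgt_len, src_len] (averaged over heads)
        attn_out, attn_w = self.attn(x, x, x, need_weights=True)
        x = x + attn_out
        return x + self.mlp(x), attn_w

def prune_then_restore(blocks, x, vis_idx, prune_at=2, restore_at=5, ratio=0.01):
    """Run the layer stack over x. At layer `prune_at`, zero out the top
    `ratio` fraction of visual tokens ranked by received attention mass
    (a stand-in for the paper's token scoring) and cache their features;
    at layer `restore_at`, write the cached features back so benign
    content still reaches the output."""
    cached, suspect = None, None
    for i, blk in enumerate(blocks):
        x, attn_w = blk(x)
        if i == prune_at:
            # Harmfulness proxy: total attention each visual token receives.
            mass = attn_w.mean(0).sum(0)[vis_idx]
            k = max(1, int(ratio * len(vis_idx)))
            suspect = vis_idx[mass.topk(k).indices]
            cached = x[:, suspect].clone()  # remember features for later restore
            x[:, suspect] = 0.0             # prune: cut the jailbreak pathway
        elif i == restore_at and cached is not None:
            x[:, suspect] = cached          # restore benign features
    return x

# Toy usage: 100 tokens, the first 64 of which are visual.
with torch.no_grad():
    blocks = nn.ModuleList([ToyBlock() for _ in range(8)])
    out = prune_then_restore(blocks, torch.randn(1, 100, 64), torch.arange(64))
```

In a real MLLM the same idea would presumably be applied via hooks on the vulnerable layers of a pretrained model, with the scoring rule taken from the paper's multimodal attention analysis rather than the raw attention-mass proxy used here.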