Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current safety prompts in vision-language models (VLMs) exhibit limited efficacy against jailbreak attacks, as they often fail to activate the models’ latent safety-aligned structures. This work proposes Safety-Potential Pruning, a novel approach that introduces the hypothesis of a “safety subnetwork” and leverages one-shot structured pruning to remove weights with weak responses to safety prompts, thereby explicitly eliciting the model’s intrinsic safety capabilities without requiring fine-tuning. The method performs parameter sparsity analysis based on safety prompt responsiveness and applies structural intervention accordingly. Evaluated across three mainstream VLM architectures and three jailbreak benchmarks, it reduces attack success rates by up to 22% while preserving strong performance on standard tasks.

📝 Abstract
Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts, without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.
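The pipeline the abstract describes — score each weight by how strongly it responds to safety prompts relative to benign use, then zero out the least responsive fraction in one shot, with no retraining — can be sketched as below. The paper does not publish its exact scoring rule; the Wanda-style |weight| × activation-scale score reweighted by a safety-responsiveness ratio, along with the function name and all arguments, are illustrative assumptions.

```python
import numpy as np

def safety_potential_prune(W, safety_acts, benign_acts, sparsity=0.2):
    """One-shot pruning sketch for a single linear layer (assumed formulation).

    W            : (out, in) weight matrix.
    safety_acts  : (n, in) input activations recorded under safety prompts.
    benign_acts  : (n, in) input activations recorded under benign prompts.
    sparsity     : fraction of weights to remove.
    """
    # Per-input-channel safety responsiveness: how much more active the
    # channel is under safety prompts than during benign use.
    safety_scale = np.abs(safety_acts).mean(axis=0)
    benign_scale = np.abs(benign_acts).mean(axis=0) + 1e-8
    responsiveness = safety_scale / benign_scale            # shape (in,)

    # Importance score: |weight| scaled by the safety responsiveness of the
    # input channel it reads from (a Wanda-style score, reweighted).
    score = np.abs(W) * responsiveness[None, :]

    # One-shot structural intervention: drop the lowest-scoring weights.
    k = int(sparsity * W.size)
    threshold = np.partition(score.ravel(), k)[k]
    mask = score >= threshold
    return W * mask, mask
```

In a full VLM this would be applied layer by layer after a single calibration pass that records activations for a small set of safety and benign prompts; the sketch above only shows the per-layer scoring and masking step.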
Problem

Research questions and friction points this paper is trying to address.

safety prompts
vision-language models
jailbreak attacks
model robustness
alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety-Potential Pruning
Vision-Language Models
Jailbreak Defense
Safety Subnetwork
Model Pruning
Chongxin Li
School of Computer Engineering and Science, Shanghai University
Hanzhang Wang
School of Computer Engineering and Science, Shanghai University
Lian Duan
Associate Professor of Information Systems, Hofstra University
Data Mining
Machine Learning