🤖 AI Summary
Current safety prompts in vision-language models (VLMs) exhibit limited efficacy against jailbreak attacks, as they often fail to activate the models’ latent safety-aligned structures. This work proposes Safety-Potential Pruning, a novel approach that introduces the hypothesis of a “safety subnetwork” and leverages one-shot structured pruning to remove weights with weak responses to safety prompts, thereby explicitly eliciting the model’s intrinsic safety capabilities without requiring fine-tuning. The method performs parameter sparsity analysis based on safety prompt responsiveness and applies structural intervention accordingly. Evaluated across three mainstream VLM architectures and three jailbreak benchmarks, it reduces attack success rates by up to 22% while preserving strong performance on standard tasks.
📝 Abstract
Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that strengthens safety-relevant activations by removing weights that respond weakly to safety prompts, with no additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.
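The core idea of pruning by safety-prompt responsiveness can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's exact formulation: the function name, the Wanda-style |W|·activation scoring rule, and the safety-vs-benign ratio are all assumptions chosen to make the idea concrete for a single linear layer.

```python
import numpy as np

def safety_potential_prune(W, safety_acts, benign_acts, sparsity=0.25):
    """Toy sketch of one-shot pruning by safety-prompt responsiveness.

    W           : (out, in) weight matrix of one linear layer
    safety_acts : (n_safety, in) layer inputs recorded under safety prompts
    benign_acts : (n_benign, in) layer inputs recorded under benign prompts
    sparsity    : fraction of weights to remove (those least responsive
                  to safety prompts)
    """
    # Hypothetical responsiveness score: weight magnitude scaled by how
    # much more strongly each input channel activates under safety
    # prompts than under benign use (a Wanda-style |W| * ||x|| score,
    # adapted with a safety/benign ratio -- an assumption, not the
    # paper's method).
    safety_norm = np.linalg.norm(safety_acts, axis=0)          # (in,)
    benign_norm = np.linalg.norm(benign_acts, axis=0) + 1e-8   # (in,)
    responsiveness = np.abs(W) * (safety_norm / benign_norm)   # (out, in)

    # Zero out the `sparsity` fraction of weights with the weakest
    # safety response, leaving the putative safety subnetwork intact.
    k = int(sparsity * W.size)
    threshold = np.partition(responsiveness.ravel(), k)[k]
    mask = responsiveness >= threshold
    return W * mask
```

In a real VLM the activations would come from forward passes over safety-prompted versus benign calibration inputs, and the pruning would be applied per layer across the network; the one-shot character means the score is computed once and no gradient updates follow.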