Reimagining Safety Alignment with An Image

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) and multimodal large language models (MLLMs) face a dual security dilemma: vulnerability to jailbreak attacks and excessive rejection of benign queries. Conventional supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) also struggle to align a single model with diverse value systems. This paper introduces Magic Image, the first framework to employ **parameter-free visual prompt optimization** for MLLM safety alignment: it performs fine-grained safety preference alignment by optimizing image prompts on both harmful and benign samples, without modifying model weights. The method simultaneously strengthens jailbreak resistance and mitigates false rejections, and it supports flexible switching across multiple value systems. Experiments on several cross-modal benchmarks show that Magic Image significantly reduces over-rejection rates (an average decrease of 32.7%) while preserving original task performance, achieving a better trade-off between safety and usability.

📝 Abstract
Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusing benign queries due to rigid safety mechanisms. These issues are further complicated by the need to accommodate different value systems and precisely align with given safety preferences. Traditional methods like SFT and RLHF lack this capability because of their costly parameter-tuning requirements and inability to support multiple value systems within a single model. These problems are more pronounced in multimodal large language models (MLLMs), which exhibit heightened over-refusal in cross-modal tasks and new security risks from an expanded attack surface. We propose Magic Image, an optimization-driven visual prompt framework that enhances security while reducing over-refusal. By optimizing image prompts using harmful and benign samples, our method enables a single model to adapt to different value systems and better align with given safety preferences without parameter updates. Experiments demonstrate an improved safety-effectiveness balance across diverse datasets while preserving model performance, offering a practical solution for deployable MLLM safety alignment.
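The core idea — gradient-optimizing an input prompt against a frozen model so that harmful samples are refused and benign ones are answered — can be sketched with a toy stand-in. The paper's actual method operates on image pixels fed to an MLLM; here a frozen linear "safety head" over feature vectors plays the model's role, and only an additive prompt vector is updated. All names, dimensions, and the toy model itself are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "model": a linear safety head mapping 8-dim visual features to a
# refusal logit. Its weights are never updated (parameter-free alignment).
W = rng.normal(size=8)
b = 0.0

def refusal_prob(x):
    """Sigmoid of the frozen head's logit for input features x."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

# Toy data: harmful samples should be refused (label 1),
# benign samples should be answered (label 0).
harmful = rng.normal(loc=+1.0, size=(16, 8))
benign = rng.normal(loc=-1.0, size=(16, 8))
X = np.vstack([harmful, benign])
y = np.concatenate([np.ones(16), np.zeros(16)])

# The "magic image": a single prompt vector added to every input.
# Only this vector is optimized; W and b stay frozen.
prompt = np.zeros(8)

def loss_and_grad(prompt):
    p = refusal_prob(X + prompt)
    # Binary cross-entropy over both harmful and benign samples, so the
    # prompt is pushed to refuse the former without over-refusing the latter.
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # d(loss)/d(prompt): standard sigmoid + BCE gradient, averaged.
    grad = ((p - y)[:, None] * W[None, :]).mean(axis=0)
    return loss, grad

initial_loss, _ = loss_and_grad(prompt)
for _ in range(200):
    _, grad = loss_and_grad(prompt)
    prompt -= 0.1 * grad  # gradient step on the prompt only

final_loss, _ = loss_and_grad(prompt)
# The optimized prompt alone lowers the safety-preference loss,
# with the model's parameters untouched throughout.
```

Because the model here is linear, a shared additive prompt can only shift the refusal logit uniformly; in the real setting the MLLM's nonlinear processing lets one optimized image steer refusals in a context-dependent way.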
Problem

Research questions and friction points this paper is trying to address.

Addressing harmful content generation and excessive refusal in LLMs
Enabling single-model adaptation to diverse value systems
Reducing cross-modal over-refusal and security risks in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes image prompts for safety alignment
Enables a single model to support multiple value systems
No parameter updates required for adaptation