DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models

📅 2025-04-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) face compound safety risks arising from vision–language fusion that are difficult to disentangle, and conventional alignment methods often over-prioritize safety at the expense of task performance. Method: This paper proposes the first interpretable multimodal risk disentanglement framework, which explicitly separates input-level risk factors through fine-grained cross-modal risk discrimination. It combines supervised fine-tuning with iterative Reinforcement Learning from AI Feedback (RLAIF) to achieve precise safety alignment without compromising standard task performance. Contribution/Results: Experiments show that DREAM improves the SIUO safe&effective score by 16.17% over GPT-4V, substantially enhancing risk awareness and safety robustness during both inference and training.
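
To make the disentanglement idea concrete, here is a minimal sketch of input-level risk disentanglement phrased as a judge prompt, assuming an OpenAI-compatible vision endpoint; the model name, prompt wording, and the three risk factors are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of input-level risk disentanglement via a judge prompt.
# Assumption: an OpenAI-compatible vision API; the prompt text, the choice
# of judge model, and the three risk factors below are hypothetical.
from openai import OpenAI

client = OpenAI()

DISENTANGLE_PROMPT = """Assess the risks of this image-text input step by step:
1. Image-only risk: is the image unsafe on its own?
2. Text-only risk: is the instruction unsafe on its own?
3. Cross-modal risk: does combining them create a risk absent from either alone?
Answer each with SAFE or UNSAFE and a one-sentence reason."""

def disentangle_risks(image_url: str, instruction: str) -> str:
    """Ask a judge model for per-factor risk labels instead of one verdict."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{DISENTANGLE_PROMPT}\n\nInstruction: {instruction}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

The point of the per-factor breakdown is that a cross-modal risk (a safe image and safe text that combine into an unsafe request) becomes explicit rather than hidden inside a single safety verdict.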

📝 Abstract
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. Leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (i.e., without inducing oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The data and code are available at https://github.com/Kizna1ver/DREAM.
Problem

Research questions and friction points this paper is trying to address.

Disentangling the compound risks that arise from vision–language fusion in multimodal inputs
Aligning MLLM safety without over-refusing benign requests (oversafety)
Improving risk awareness during both inference and training without degrading normal-task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretable multimodal risk disentanglement via step-by-step cross-modal reasoning
Supervised fine-tuning on risk-disentangled data for safety alignment
Iterative RLAIF that boosts safety without sacrificing normal-task performance (sketched below)
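
As referenced above, here is a schematic of one iterative RLAIF round, assuming a DPO-style preference update; `generate`, `judge_safety`, and `dpo_update` are hypothetical stubs standing in for the paper's actual components, not its API.

```python
# Schematic of one iterative RLAIF round, assuming a DPO-style preference
# update. All callables are hypothetical stubs for the paper's components.
from typing import Callable, List, Tuple

def rlaif_round(
    policy,                          # current MLLM after supervised fine-tuning
    prompts: List[Tuple[str, str]],  # (image_url, instruction) pairs
    generate: Callable,              # samples k candidate responses per prompt
    judge_safety: Callable,          # AI judge scoring safety & helpfulness
    dpo_update: Callable,            # one preference-optimization step
):
    """One round: sample, rank with AI feedback, prefer safe-and-helpful outputs."""
    pairs = []
    for image, text in prompts:
        candidates = generate(policy, image, text, k=4)
        ranked = sorted(candidates, key=lambda r: judge_safety(image, text, r))
        # Best-scored response becomes "chosen", worst becomes "rejected".
        pairs.append((image, text, ranked[-1], ranked[0]))
    return dpo_update(policy, pairs)
```

Iterating this round lets the judge's risk-disentangled feedback progressively sharpen safety while the helpfulness term in the score guards against collapsing into over-refusal.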
👥 Authors
Jianyu Liu (Alibaba Group; Zhejiang University)
Hangyu Guo (Alibaba Group)
Ranjie Duan (Alibaba Group)
Xingyuan Bu (Alibaba Group)
Yancheng He (Alibaba Group)
Shilong Li (University of California, Irvine)
Hui Huang (Alibaba Group)
Jiaheng Liu (Alibaba Group)
Yucheng Wang (ETH Zürich)
Chenchen Jing (Zhejiang University)
Xingwei Qu (M-A-P)
Xiao Zhang (Alibaba Group)
Yingshui Tan (Alibaba Group)
Yanan Wu (Alibaba Group)
Jihao Gu (University College London)
Yangguang Li (CUHK)
Jianke Zhu (Zhejiang University)