Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

📅 2025-03-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the dual challenges of safety alignment difficulty and the trade-off between helpfulness and safety in multimodal large language models (MLLMs), this paper introduces Safe RLHF-V, the first safety alignment framework designed specifically for MLLMs. Methodologically, it establishes BeaverTails-V, the first multimodal preference dataset with dual annotations for helpfulness and safety; proposes Beaver-Guard-V, a multi-level guardrail system that applies five rounds of filtering and re-generation; and integrates Lagrangian-constrained reinforcement learning with decoupled multimodal reward and cost models and active defense against adversarial queries. Evaluated across multiple mainstream MLLMs, Safe RLHF-V improves safety by 34.2% and helpfulness by 34.3%, and the guardrail raises upstream model safety by an average of 40.9%. All datasets, models, and code are publicly released.

πŸ“ Abstract
Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? Going further, we need to explore how to fine-tune MLLMs to enhance reasoning performance while ensuring they satisfy safety constraints. Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety using separate multimodal reward and cost models within a Lagrangian-based constrained optimization framework. Given the lack of preference datasets that separate helpfulness and safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). Additionally, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. By applying Beaver-Guard-V moderation for five rounds of filtering and re-generation on the precursor model, the overall safety of the upstream model improves by an average of 40.9%. Experimental results demonstrate that fine-tuning different MLLMs with Safe RLHF-V effectively enhances model helpfulness while ensuring improved safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All datasets, models, and code are available at https://github.com/SafeRLHF-V to support the safe development of MLLMs and reduce potential societal risks.
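The Lagrangian-based constrained optimization the abstract describes can be sketched as a simple dual-ascent update: maximize the reward-model score while penalizing cost-model scores above a safety budget, and adapt the multiplier to the observed constraint violation. The function name, the budget `d`, and the learning rate below are illustrative assumptions, not the paper's actual training code.

```python
# Illustrative sketch of a Lagrangian-constrained update (hypothetical
# helper; the real Safe RLHF-V training loop lives in the authors' repo).

def lagrangian_step(reward, cost, lam, d=0.0, lr_lambda=0.05):
    """One dual-ascent step: maximize reward subject to expected cost <= d.

    reward, cost: batch-average reward-model and cost-model scores.
    lam: current Lagrange multiplier (kept >= 0).
    d: safety budget (expected-cost threshold).
    Returns the shaped policy objective and the updated multiplier.
    """
    # Policy objective: helpfulness minus lambda-weighted safety violation.
    shaped_objective = reward - lam * (cost - d)
    # Dual update: raise lambda when the constraint is violated,
    # shrink it (down to 0) when the policy is safely within budget.
    new_lam = max(0.0, lam + lr_lambda * (cost - d))
    return shaped_objective, new_lam

obj, lam = lagrangian_step(reward=1.2, cost=0.4, lam=0.5, d=0.1)
```

This min-max structure is what lets the framework trade helpfulness against safety adaptively instead of fixing a penalty weight by hand.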
Problem

Research questions and friction points this paper addresses.

How to align MLLMs safely so they avoid undesired behaviors such as discrimination and misinformation
How to jointly optimize helpfulness and safety instead of trading one for the other
The lack of multimodal preference datasets and guardrails that separate helpfulness from safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal safety alignment with separate reward models
Lagrangian-based constrained optimization framework
Multi-level Guardrail System for proactive defense
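The guardrail's filter-and-regenerate mechanism (five rounds in the paper) can be sketched as a moderation loop; `moderate` and `regenerate` here are hypothetical stand-ins for the Beaver-Guard-V judge and the upstream model, not a real API.

```python
# Sketch of multi-round moderation: screen a response and ask the upstream
# model to regenerate on failure, up to `rounds` times (the paper uses 5).
# `moderate` and `regenerate` are hypothetical callables, not a real API.

def guarded_response(query, moderate, regenerate, rounds=5):
    response = regenerate(query, feedback=None)
    for _ in range(rounds):
        verdict = moderate(query, response)  # e.g. "safe" or an issue label
        if verdict == "safe":
            return response
        # Feed the guard's verdict back so the next draft can avoid it.
        response = regenerate(query, feedback=verdict)
    return None  # refuse if still unsafe after all rounds

# Toy usage: a guard that flags drafts containing the word "unsafe".
resp = guarded_response(
    "hi",
    moderate=lambda q, r: "safe" if "unsafe" not in r else "harmful",
    regenerate=lambda q, feedback: "ok" if feedback else "unsafe draft",
)
```

Bounding the loop at a fixed number of rounds caps latency while still giving the upstream model several chances to self-correct before the system falls back to a refusal.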
Authors

Jiaming Ji (Peking University)
Xinyu Chen (Peking University)
Rui Pan (Peking University)
Han Zhu (Hong Kong University of Science and Technology)
Conghui Zhang (Peking University)
Jiahao Li (Peking University)
Donghai Hong (Peking University)
Boyuan Chen (Peking University)
Jiayi Zhou (Peking University)
Kaile Wang (Peking University)
Juntao Dai (Peking University)
Chi-Min Chan (Hong Kong University of Science and Technology)
Sirui Han (The Hong Kong University of Science and Technology)
Yike Guo (Hong Kong University of Science and Technology)
Yaodong Yang (Peking University)