Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

📅 2025-03-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the dual challenges of safety alignment difficulty and the trade-off between helpfulness and safety in multimodal large language models (MLLMs), this paper introduces Safe RLHF-V, the first safety alignment framework designed specifically for MLLMs. Methodologically, it establishes BeaverTails-V, the first multimodal preference dataset with dual annotations for helpfulness and safety; proposes Beaver-Guard-V, a multi-level guardrail system that applies five rounds of filtering and re-generation; and integrates Lagrangian-constrained reinforcement learning with decoupled multimodal reward and cost models and active defense against adversarial queries. Evaluated across multiple mainstream MLLMs, Safe RLHF-V improves safety by 34.2% and helpfulness by 34.3%, and the guardrail raises upstream model safety by an average of 40.9%. All datasets, models, and code are publicly released.

πŸ“ Abstract
Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? Going further, we need to explore how to fine-tune MLLMs to enhance reasoning performance while ensuring they satisfy safety constraints. Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety using separate multimodal reward and cost models within a Lagrangian-based constrained optimization framework. Given the lack of preference datasets that separate helpfulness and safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). Additionally, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. By applying Beaver-Guard-V moderation for five rounds of filtering and re-generation on the precursor model, the overall safety of the upstream model improves by an average of 40.9%. Experimental results demonstrate that fine-tuning different MLLMs with Safe RLHF-V effectively enhances model helpfulness while ensuring improved safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All datasets, models, and code are available at https://github.com/SafeRLHF-V to support the safe development of MLLMs and reduce potential societal risks.
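The Lagrangian-based constrained optimization the abstract describes can be sketched as a simple dual-ascent update: maximize the reward-model score while penalizing cost-model scores above a safety budget, and adapt the multiplier to the observed constraint violation. The function name, the budget `d`, and the learning rate below are illustrative assumptions, not the paper's actual training code.

```python
# Illustrative sketch of a Lagrangian-constrained update (hypothetical
# helper; the real Safe RLHF-V training loop lives in the authors' repo).

def lagrangian_step(reward, cost, lam, d=0.0, lr_lambda=0.05):
    """One dual-ascent step: maximize reward subject to expected cost <= d.

    reward, cost: batch-average reward-model and cost-model scores.
    lam: current Lagrange multiplier (kept >= 0).
    d: safety budget (expected-cost threshold).
    Returns the shaped policy objective and the updated multiplier.
    """
    # Policy objective: helpfulness minus lambda-weighted safety violation.
    shaped_objective = reward - lam * (cost - d)
    # Dual update: raise lambda when the constraint is violated,
    # shrink it (down to 0) when the policy is safely within budget.
    new_lam = max(0.0, lam + lr_lambda * (cost - d))
    return shaped_objective, new_lam

obj, lam = lagrangian_step(reward=1.2, cost=0.4, lam=0.5, d=0.1)
```

This min-max structure is what lets the framework trade helpfulness against safety adaptively instead of fixing a penalty weight by hand.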
Problem

Research questions and friction points this paper addresses.

How to align MLLMs safely so they avoid undesired behaviors such as discrimination and misinformation
How to jointly optimize helpfulness and safety instead of trading one for the other
The lack of multimodal preference datasets and guardrails that separate helpfulness from safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal safety alignment with separate reward models
Lagrangian-based constrained optimization framework
Multi-level Guardrail System for proactive defense
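The guardrail's filter-and-regenerate mechanism (five rounds in the paper) can be sketched as a moderation loop; `moderate` and `regenerate` here are hypothetical stand-ins for the Beaver-Guard-V judge and the upstream model, not a real API.

```python
# Sketch of multi-round moderation: screen a response and ask the upstream
# model to regenerate on failure, up to `rounds` times (the paper uses 5).
# `moderate` and `regenerate` are hypothetical callables, not a real API.

def guarded_response(query, moderate, regenerate, rounds=5):
    response = regenerate(query, feedback=None)
    for _ in range(rounds):
        verdict = moderate(query, response)  # e.g. "safe" or an issue label
        if verdict == "safe":
            return response
        # Feed the guard's verdict back so the next draft can avoid it.
        response = regenerate(query, feedback=verdict)
    return None  # refuse if still unsafe after all rounds

# Toy usage: a guard that flags drafts containing the word "unsafe".
resp = guarded_response(
    "hi",
    moderate=lambda q, r: "safe" if "unsafe" not in r else "harmful",
    regenerate=lambda q, feedback: "ok" if feedback else "unsafe draft",
)
```

Bounding the loop at a fixed number of rounds caps latency while still giving the upstream model several chances to self-correct before the system falls back to a refusal.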
Authors

Jiaming Ji (Peking University)
Xinyu Chen (Peking University)
Rui Pan (Peking University)
Han Zhu (Hong Kong University of Science and Technology)
Conghui Zhang (Peking University)
Jiahao Li (Peking University)
Donghai Hong (Peking University)
Boyuan Chen (Peking University)
Jiayi Zhou (Peking University)
Kaile Wang (Peking University)
Juntao Dai (Peking University)
Chi-Min Chan (Hong Kong University of Science and Technology)
Sirui Han (The Hong Kong University of Science and Technology)
Yike Guo (Hong Kong University of Science and Technology)
Yaodong Yang (Peking University)