MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic human preference alignment, with existing efforts focusing narrowly on single dimensions—e.g., hallucination mitigation. To address this, we introduce the first large-scale, fine-grained, high-quality multimodal human preference dataset comprising 120K preference pairs. We propose an interpretable critique-based reward model—generating textual critiques before scoring—and a dynamic reward scaling algorithm that adaptively weights high-quality preference pairs. Furthermore, we systematically extend the Reinforcement Learning from Human Feedback (RLHF) framework to multimodal settings for the first time. Our approach achieves state-of-the-art performance across 27 benchmarks: after fine-tuning, LLaVA-ov-7B shows a 19.5% improvement in dialogue capability and a 60% gain in safety. All data, models, and code are publicly released.

📝 Abstract
Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.
Problem

Research questions and friction points this paper is trying to address.

Aligning Multimodal LLMs with human preferences.
Enhancing MLLM capability through systematic alignment.
Improving reward-model quality and alignment-algorithm efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Critique-Based Reward Model
Dynamic Reward Scaling
Human-annotated preference dataset
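The Critique-Based Reward Model's key structural idea, per the abstract, is to generate a textual critique first and only then assign a scalar score conditioned on it. A minimal interface sketch, with `critic` and `scorer` as hypothetical stand-ins for the underlying MLLM calls:

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CritiqueReward:
    """Critique-then-score sketch: produce an interpretable textual
    critique of a response, then score the response conditioned on
    that critique. Both callables are illustrative stand-ins."""
    critic: Callable[[str, str], str]          # (prompt, response) -> critique
    scorer: Callable[[str, str, str], float]   # (prompt, response, critique) -> score

    def __call__(self, prompt: str, response: str) -> Tuple[str, float]:
        critique = self.critic(prompt, response)   # step 1: explain
        score = self.scorer(prompt, response, critique)  # step 2: score
        return critique, score


# Toy stand-ins for demonstration only (a real system would call an MLLM).
def toy_critic(prompt: str, response: str) -> str:
    return "hallucinated object" if "unicorn" in response else "faithful description"


def toy_scorer(prompt: str, response: str, critique: str) -> float:
    return 0.2 if "hallucinated" in critique else 0.9
```

Returning the critique alongside the score is what gives the reward model its claimed interpretability advantage over a purely scalar head.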