M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high annotation cost and inefficient negative-sample mining in multimodal instruction following for Large Vision-Language Models (LVLMs), this paper proposes M3PO (Multimodal-Model-Guided Preference Optimization), a lightweight preference-based optimization framework. The method jointly leverages multimodal alignment fidelity and model self-consistency/confidence, combined into an M3P-Score, to automatically identify "high-confidence erroneous" hard negatives. Built upon models such as LLaVA, it integrates LoRA-based fine-tuning, candidate response pool construction, and Direct Preference Optimization (DPO), enabling end-to-end optimization without hand-crafted reward modeling. Extensive evaluation on MME-Bench, POPE, IFT, and human preference scoring benchmarks demonstrates significant improvements over supervised fine-tuning (SFT), RLHF, standard DPO, and reward-model-augmented DPO (RM-DPO), confirming both the effectiveness and generalizability of M3PO in enhancing multimodal instruction-following capabilities.

📝 Abstract
Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following, yet their development is often hindered by the high cost and inconsistency of human annotation required for effective fine-tuning and preference alignment. Traditional supervised fine-tuning (SFT) and existing preference optimization methods like RLHF and DPO frequently struggle to efficiently leverage the model's own generation space to identify highly informative "hard negative" samples. To address these challenges, we propose Multimodal-Model-Guided Preference Optimization (M3PO), a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO intelligently selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates. This selection is driven by a sophisticated mechanism that integrates two crucial signals: a Multimodal Alignment Score (MAS) to assess external quality and the model's Self-Consistency / Confidence (log-probability) to gauge internal belief. These are combined into a novel M3P-Score, which specifically identifies preferred responses and challenging dispreferred responses that the model might confidently generate despite being incorrect. These high-quality preference pairs are then used for efficient Direct Preference Optimization (DPO) fine-tuning on base LVLMs like LLaVA-1.5 (7B/13B) using LoRA. Our extensive experiments demonstrate that M3PO consistently outperforms strong baselines, including SFT, simulated RLHF, vanilla DPO, and RM-DPO, across a comprehensive suite of multimodal instruction following benchmarks (MME-Bench, POPE, IFT, Human Pref. Score).
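The abstract describes combining a Multimodal Alignment Score (MAS) with the model's own confidence (log-probability) into an M3P-Score that picks the preferred response and a "confidently wrong" hard negative for DPO. The paper does not give the exact formula, so the weighting, the confidence mapping, and the hard-negative criterion below are illustrative assumptions, not the authors' implementation:

```python
import math

def m3p_score(mas: float, mean_log_prob: float, lam: float = 0.5) -> float:
    """Hypothetical M3P-Score: blend external quality (MAS, assumed in [0, 1])
    with internal confidence, mapped from a mean token log-probability to
    (0, 1] via the exponential. The 50/50 weight lam is an assumption."""
    confidence = math.exp(mean_log_prob)
    return lam * mas + (1 - lam) * confidence

def select_preference_pair(candidates):
    """candidates: list of (response, mas, mean_log_prob) for one instruction.
    Preferred: highest M3P-Score (aligned AND confidently generated).
    Dispreferred: the hard negative -- high confidence but low alignment,
    approximated here as the largest (confidence - MAS) gap."""
    preferred = max(candidates, key=lambda c: m3p_score(c[1], c[2]))
    hard_negative = max(candidates, key=lambda c: math.exp(c[2]) - c[1])
    return preferred[0], hard_negative[0]

# Toy candidate pool from one LVLM prompt (scores are made up).
candidates = [
    ("A red car parked by a tree.", 0.9, -0.7),  # well aligned, confident
    ("A blue truck on a highway.", 0.2, -0.3),   # confidently wrong
    ("An object in a scene.", 0.5, -2.0),        # vague, low confidence
]
chosen, rejected = select_preference_pair(candidates)
```

With these toy numbers, the well-aligned confident answer is chosen and the confident-but-misaligned one becomes the DPO rejected response; the vague low-confidence candidate is skipped as carrying little learning signal.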
Problem

Research questions and friction points this paper is trying to address.

Optimizing multimodal instruction following in large vision-language models
Addressing high cost and inconsistency of human annotation for fine-tuning
Identifying challenging hard negative samples from model-generated candidates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal-Model-Guided Preference Optimization method
Selects learning-valuable pairs from LVLM-generated candidates
Uses M3P-Score combining alignment and self-consistency signals
👥 Authors
Ruirui Gao (University of Massachusetts, Amherst)
Emily Johnson (University of Massachusetts, Amherst)
Bowen Tan (Carnegie Mellon University)
Yanfei Qian (University of Massachusetts, Amherst)