🤖 AI Summary
To address the high annotation cost and inefficient negative-sample mining in multimodal instruction following for Large Vision-Language Models (LVLMs), this paper proposes Multimodal-Model-Guided Preference Optimization (M3PO), a lightweight, data-efficient preference-optimization framework. The method combines multimodal alignment fidelity with the model's own self-consistency/confidence into an M3P-Score that automatically surfaces “high-confidence erroneous” hard negatives. Built upon models such as LLaVA, it integrates candidate-response-pool construction, M3P-Score-based pair selection, and Direct Preference Optimization (DPO) with LoRA-based fine-tuning, enabling end-to-end preference alignment without hand-crafted reward modeling. Extensive evaluation on MME-Bench, POPE, IFT, and human preference scoring benchmarks shows consistent gains over supervised fine-tuning (SFT), RLHF, vanilla DPO, and reward-model-augmented DPO (RM-DPO), confirming both the effectiveness and the generalizability of M3PO for multimodal instruction following.
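The selection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formula: the linear blend, the weight `alpha`, and the `Candidate` fields are assumptions; the paper only states that an external Multimodal Alignment Score and the model's internal log-probability confidence are combined into an M3P-Score that pairs a preferred response with a "confidently wrong" hard negative.

```python
# Hedged sketch of M3P-Score-style pair selection. The linear blend and
# the field names are illustrative assumptions, not the paper's formula.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    mas: float       # Multimodal Alignment Score (external quality, in [0, 1])
    logprob: float   # mean token log-probability (model's internal confidence)

def m3p_score(c: Candidate, alpha: float = 0.5) -> float:
    """Blend external alignment with internal confidence (assumed form)."""
    return alpha * c.mas + (1 - alpha) * c.logprob

def select_pair(pool: list[Candidate]) -> tuple[Candidate, Candidate]:
    """Preferred: best blended score. Hard negative: a lower-alignment
    candidate that the model nonetheless generated with high confidence."""
    preferred = max(pool, key=m3p_score)
    lower_quality = [c for c in pool if c is not preferred and c.mas < preferred.mas]
    hard_negative = max(lower_quality, key=lambda c: c.logprob)
    return preferred, hard_negative
```

The key design point is the hard-negative rule: among the worse-aligned candidates, it picks the one the model was *most* confident about, which is exactly the "high-confidence erroneous" case the summary highlights as most informative for preference learning.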
📝 Abstract
Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following, yet their development is often hindered by the high cost and inconsistency of the human annotation required for effective fine-tuning and preference alignment. Traditional supervised fine-tuning (SFT) and existing preference optimization methods such as RLHF and DPO frequently fail to exploit the model's own generation space to identify highly informative "hard negative" samples. To address these challenges, we propose Multimodal-Model-Guided Preference Optimization (M3PO), a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates. This selection integrates two signals: a Multimodal Alignment Score (MAS) that assesses external quality, and the model's self-consistency/confidence (log-probability) that gauges internal belief. These are combined into a novel M3P-Score, which identifies preferred responses alongside challenging dispreferred responses that the model generates confidently despite their being incorrect. The resulting high-quality preference pairs are then used for efficient Direct Preference Optimization (DPO) fine-tuning of base LVLMs such as LLaVA-1.5 (7B/13B) using LoRA. Our extensive experiments demonstrate that M3PO consistently outperforms strong baselines, including SFT, simulated RLHF, vanilla DPO, and RM-DPO, across a comprehensive suite of multimodal instruction following benchmarks (MME-Bench, POPE, IFT, Human Pref. Score).