🤖 AI Summary
Traditional DPO applies binary preference signals only at the full-response level, including in multimodal settings, neglecting fine-grained correctness within a response and leading to insufficient supervision and alignment bias. To address this, the authors propose Adaptive Sentence-level Preference Optimization (ASPO), a parameter-free, dynamically adaptive sentence-level reward mechanism that shifts preference modeling from holistic responses to semantic units. ASPO integrates multimodal feature alignment with fine-grained response decomposition and evaluation, enabling precise sentence-level supervision within the DPO framework. Evaluated on multimodal benchmarks including MMMU, MME, and OCRBench, ASPO improves factual accuracy, logical coherence, and cross-modal consistency. The results indicate that fine-grained supervision yields substantial gains in the alignment capability of multimodal large language models, supporting its role in grounded, reliable multimodal reasoning.
📝 Abstract
Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.
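To make the contrast concrete, here is a minimal sketch of standard DPO's response-level loss next to a hypothetical sentence-level variant in the spirit of ASPO. The abstract does not specify how the adaptive sentence rewards are computed, so the softmax-confidence weighting below (derived only from the policy model's own per-sentence log-probabilities, with no extra models or parameters) is an illustrative assumption, not the paper's actual formula.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    # Standard DPO: one binary preference signal over the ENTIRE response.
    # logp_* are total policy log-probs, ref_* the reference-model log-probs.
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(sigmoid(margin))


def sentence_dpo_loss(sent_logp_w, sent_ref_w, sent_logp_l, sent_ref_l, beta=0.1):
    # Illustrative sentence-level variant: each response is split into
    # sentences, and each sentence's policy/reference log-ratio is weighted
    # by an adaptive reward. Here the weights are a softmax over the policy's
    # per-sentence log-probs (an assumption) -- no auxiliary model needed.
    def weighted_ratio(logps, refs):
        exps = [math.exp(lp) for lp in logps]
        z = sum(exps)
        weights = [e / z for e in exps]  # adaptive, model-derived weights
        return sum(w * (lp - r) for w, lp, r in zip(weights, logps, refs))

    margin = beta * (
        weighted_ratio(sent_logp_w, sent_ref_w)
        - weighted_ratio(sent_logp_l, sent_ref_l)
    )
    return -math.log(sigmoid(margin))
```

The key difference is where the preference signal attaches: `dpo_loss` rewards or penalizes the whole response uniformly, while `sentence_dpo_loss` lets individually correct sentences inside a dispreferred response (or weak sentences inside a preferred one) contribute with different weight.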