🤖 AI Summary
Traditional DPO applies binary preference signals only at the full-response level, including in multimodal settings, neglecting fine-grained correctness within a response and leading to insufficient supervision and alignment bias. To address this, the authors propose Adaptive Sentence-level Preference Optimization (ASPO), a parameter-free, dynamically adaptive sentence-level reward mechanism that shifts preference modeling from holistic responses to semantic units. ASPO integrates multimodal feature alignment with fine-grained response decomposition and evaluation, enabling precise sentence-level supervision within the DPO framework. Evaluated on multimodal benchmarks including MMMU, MME, and OCRBench, ASPO improves factual accuracy, logical coherence, and cross-modal consistency. The results indicate that fine-grained supervision yields substantial gains in the alignment capability of multimodal large language models, supporting its role in grounded, reliable multimodal reasoning.
📝 Abstract
Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.
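To make the contrast concrete, here is a minimal sketch of standard DPO's response-level loss next to a hypothetical sentence-level variant in the spirit of ASPO. The abstract does not specify how the adaptive sentence rewards are computed, so the softmax-confidence weighting below (derived only from the policy model's own per-sentence log-probabilities, with no extra models or parameters) is an illustrative assumption, not the paper's actual formula.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    # Standard DPO: one binary preference signal over the ENTIRE response.
    # logp_* are total policy log-probs, ref_* the reference-model log-probs.
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(sigmoid(margin))


def sentence_dpo_loss(sent_logp_w, sent_ref_w, sent_logp_l, sent_ref_l, beta=0.1):
    # Illustrative sentence-level variant: each response is split into
    # sentences, and each sentence's policy/reference log-ratio is weighted
    # by an adaptive reward. Here the weights are a softmax over the policy's
    # per-sentence log-probs (an assumption) -- no auxiliary model needed.
    def weighted_ratio(logps, refs):
        exps = [math.exp(lp) for lp in logps]
        z = sum(exps)
        weights = [e / z for e in exps]  # adaptive, model-derived weights
        return sum(w * (lp - r) for w, lp, r in zip(weights, logps, refs))

    margin = beta * (
        weighted_ratio(sent_logp_w, sent_ref_w)
        - weighted_ratio(sent_logp_l, sent_ref_l)
    )
    return -math.log(sigmoid(margin))
```

The key difference is where the preference signal attaches: `dpo_loss` rewards or penalizes the whole response uniformly, while `sentence_dpo_loss` lets individually correct sentences inside a dispreferred response (or weak sentences inside a preferred one) contribute with different weight.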