🤖 AI Summary
Multimodal large language models (MLLMs) suffer from error propagation due to perceptual inaccuracies in vision–language joint reasoning, and existing reinforcement learning (RL)-based fine-tuning methods fail to mitigate semantic misalignment between visual grounding and symbolic reasoning. To address this, we propose CapPO, a perception-consistent RL framework that explicitly aligns visual content with reasoning chains via image-caption-driven consistency regularization and KL-weighted advantage estimation. CapPO integrates response distribution alignment constraints with conditional consistency modeling to suppress perceptual biases. Evaluated on five mathematical reasoning and five general-purpose reasoning benchmarks, it improves accuracy by 6.0% and 2.4%, respectively, while significantly reducing perception-related errors. This work is the first to systematically incorporate perception consistency modeling into RL-based MLLM fine-tuning, establishing a novel paradigm for robust multimodal reasoning.
📝 Abstract
While multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning, their performance is often undermined by a critical vulnerability: perception-induced errors that propagate through the reasoning chain. Current reinforcement learning (RL) fine-tuning methods, while enhancing reasoning abilities, largely fail to address the underlying misalignment between visual grounding and the subsequent reasoning process. To address this challenge, we propose **Caption-Regularized Policy Optimization (CapPO)**, a novel RL framework that explicitly enforces perceptual consistency during policy optimization. CapPO integrates two key mechanisms: (1) a caption-based consistency regularizer, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, thereby anchoring reasoning to semantically faithful visual content; and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories while suppressing spurious correlations. Extensive experiments on five math-focused and five general reasoning benchmarks demonstrate that CapPO achieves competitive performance, yielding gains of +6.0% accuracy on math-related tasks and +2.4% on general reasoning tasks over the base Qwen2.5-VL-7B model. Ablation studies confirm the effectiveness of each component, and error analysis shows that CapPO significantly reduces perception-related mistakes compared with baselines. Overall, CapPO provides a simple yet effective framework for improving multimodal reasoning.
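The two mechanisms can be illustrated with a minimal sketch. Note that the paper's exact formulas are not given here, so the function names, the `exp(-beta * KL)` weighting form, and the hyperparameters `beta` and `lam` are all assumptions for illustration, not the authors' implementation:

```python
# Hypothetical sketch of CapPO's two mechanisms (forms assumed, not from the paper):
# (1) a caption-based consistency regularizer penalizing divergence between
#     response distributions conditioned on the image vs. its caption, and
# (2) KL-weighted advantages that down-weight perceptually inconsistent rollouts.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def cappo_terms(pi_image, pi_caption, advantages, beta=1.0, lam=0.1):
    """
    pi_image:   per-rollout response distributions conditioned on the raw image
    pi_caption: the same distributions conditioned on the image's caption
    advantages: per-rollout scalar advantages (e.g., from a GRPO/PPO estimator)
    Returns (consistency_penalty, reweighted_advantages).
    """
    kls = np.array([kl_divergence(p, q) for p, q in zip(pi_image, pi_caption)])
    # (1) consistency regularizer: penalize image/caption conditional divergence
    consistency_penalty = lam * kls.mean()
    # (2) KL-weighted advantages: shrink the reinforcement signal for rollouts
    #     whose image-conditioned distribution drifts from the caption-grounded one
    weights = np.exp(-beta * kls)
    return consistency_penalty, weights * np.asarray(advantages, dtype=float)
```

Under this assumed form, a rollout whose image-conditioned distribution matches the caption-conditioned one keeps its full advantage (weight 1), while a divergent rollout is exponentially suppressed.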