Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from error propagation due to perceptual inaccuracies in vision–language joint reasoning, and existing reinforcement learning (RL)-based fine-tuning methods fail to mitigate semantic misalignment between visual grounding and symbolic reasoning. To address this, we propose CapPO, a perception-consistent RL framework that explicitly aligns visual content with reasoning chains via image-caption-driven consistency regularization and KL-weighted advantage estimation. CapPO integrates response distribution alignment constraints with conditional consistency modeling to suppress perceptual biases. Evaluated on five mathematical reasoning and five general-purpose reasoning benchmarks, it improves accuracy by 6.0% and 2.4%, respectively, over the base Qwen2.5-VL-7B model, while significantly reducing perception-related errors. This work is the first to systematically incorporate perception consistency modeling into RL-based MLLM fine-tuning, establishing a novel paradigm for robust multimodal reasoning.

📝 Abstract
While multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning, their performance is often undermined by a critical vulnerability: perception-induced errors that propagate through the reasoning chain. Current reinforcement learning (RL) fine-tuning methods, while enhancing reasoning abilities, largely fail to address the underlying misalignment between visual grounding and the subsequent reasoning process. To address this challenge, we propose Caption-Regularized Policy Optimization (CapPO), a novel RL framework that explicitly enforces perceptual consistency during policy optimization. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, thereby anchoring reasoning to semantically faithful visual content; and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories while suppressing spurious correlations. Extensive experiments on five math-focused and five general reasoning benchmarks demonstrate that CapPO achieves competitive performance, yielding gains of +6.0% accuracy on math-related tasks and +2.4% on general reasoning tasks over the base Qwen2.5-VL-7B model. Moreover, ablation studies further confirm the effectiveness of each component, while error analysis reveals that CapPO significantly reduces perception-related mistakes compared with baselines. Overall, CapPO provides a simple yet effective framework for improving multimodal reasoning.
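The first mechanism, caption-based consistency regularization, can be illustrated with a minimal sketch: the penalty is the divergence between the policy's next-token distribution conditioned on the raw image and the distribution conditioned on a caption of that image. The function names, the per-token interface, and the use of KL(p_caption || p_image) here are illustrative assumptions; the paper's exact formulation may differ.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def caption_consistency_loss(logits_image, logits_caption):
    """Hypothetical per-token consistency penalty: divergence between the
    policy's next-token distribution given the raw image and the one
    given its caption. A loss of zero means the two views of the input
    induce identical response distributions."""
    p_img = softmax(logits_image)
    p_cap = softmax(logits_caption)
    return kl_divergence(p_cap, p_img)
```

In a full training loop this penalty would be averaged over tokens and added to the RL objective, so that reasoning anchored to the caption's semantics (rather than spurious visual cues) is rewarded.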
Problem

Research questions and friction points this paper is trying to address.

Addresses perception errors in multimodal reasoning models
Aligns visual grounding with reasoning to reduce mistakes
Improves accuracy on math and general reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Caption-based consistency regularization for perceptual alignment
KL-weighted advantage estimation to scale reinforcement signals
Minimizing divergence between image and caption conditioned responses
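The second mechanism, KL-weighted advantage estimation, can be sketched as follows: group-relative advantages (in the GRPO style common to recent MLLM RL fine-tuning) are rescaled by a weight derived from each trajectory's image/caption divergence, amplifying perceptually consistent rollouts and suppressing inconsistent ones. The exponential weighting and the `beta` temperature are illustrative assumptions, not the paper's exact scheme.

```python
import math

def kl_weighted_advantages(rewards, kl_divs, beta=1.0):
    """Sketch of KL-weighted advantage estimation.

    rewards: scalar rewards for a group of sampled rollouts.
    kl_divs: per-rollout divergence between image- and caption-
             conditioned response distributions (low = consistent).
    beta:    hypothetical temperature controlling how sharply
             inconsistent rollouts are down-weighted.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) + 1e-8
    # Group-relative (GRPO-style) normalized advantages.
    base_adv = [(r - mean) / std for r in rewards]
    # Down-weight trajectories whose responses diverge under captioning.
    weights = [math.exp(-beta * kl) for kl in kl_divs]
    return [a * w for a, w in zip(base_adv, weights)]
```

The effect is that two rollouts with the same reward receive different reinforcement signals if one of them relied on perception that a caption of the image does not support.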