Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reduced faithfulness of multimodal language models in visual spatial reasoning, often caused by inconsistencies between chain-of-thought rationales and final answers, as well as insufficient grounding in visual evidence. To mitigate these issues, the authors propose Faithful GRPO, which for the first time explicitly incorporates logical consistency and visual grounding as constraints within a Group Relative Policy Optimization framework. These constraints are dynamically weighted via Lagrangian dual ascent during training. Evaluated on seven spatial reasoning benchmarks using Qwen2.5-VL-7B and Qwen2.5-VL-3B, the method reduces inconsistency rates from 24.5% to 1.7%, improves visual grounding scores by 13%, and achieves higher answer accuracy than standard GRPO, thereby simultaneously enhancing both reasoning quality and correctness.
📝 Abstract
Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of the constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over standard GRPO, demonstrating that faithful reasoning enables better answers.
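The abstract describes the core mechanism: a group-relative advantage computed from a task reward shaped by consistency and grounding terms, whose weights are Lagrange multipliers updated by dual ascent against batch-level constraint thresholds. The following is a minimal sketch of that idea, not the paper's actual implementation; the function name, the exact shaping form, and the threshold and step-size values (`tau_c`, `tau_g`, `eta`) are all hypothetical.

```python
from statistics import mean, pstdev

def fgrpo_advantages(task_rewards, consistency, grounding,
                     lam_c, lam_g, tau_c=0.95, tau_g=0.85, eta=0.05):
    """Sketch of a Faithful-GRPO-style advantage step for one rollout group.

    task_rewards : verifiable answer rewards for G rollouts of one prompt.
    consistency  : per-rollout logical-consistency scores in [0, 1].
    grounding    : per-rollout visual-grounding scores in [0, 1].
    lam_c, lam_g : current Lagrange multipliers for the two constraints.
    tau_c, tau_g : batch-level constraint thresholds (hypothetical values).
    eta          : dual-ascent step size (hypothetical value).
    """
    # Lagrangian-shaped reward: task reward plus constraint scores weighted
    # by the current multipliers (the paper's exact form may differ).
    shaped = [r + lam_c * c + lam_g * g
              for r, c, g in zip(task_rewards, consistency, grounding)]

    # Group-relative advantage: normalize within the group, as in GRPO.
    mu, sd = mean(shaped), pstdev(shaped)
    adv = [(s - mu) / (sd + 1e-8) for s in shaped]

    # Dual ascent: raise a multiplier when the batch-level constraint is
    # violated (mean score below its threshold), decay it otherwise,
    # clipping at zero to keep the multiplier valid.
    lam_c = max(0.0, lam_c + eta * (tau_c - mean(consistency)))
    lam_g = max(0.0, lam_g + eta * (tau_g - mean(grounding)))
    return adv, lam_c, lam_g
```

Because the multipliers only grow while a constraint is unmet, the consistency and grounding terms dominate the advantage exactly when the batch falls short of the thresholds, which matches the paper's description of adaptively adjusting the relative importance of the constraints during training.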
Problem

Research questions and friction points this paper is trying to address.

visual spatial reasoning
multimodal language models
Chain-of-Thought
logical consistency
visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Faithful GRPO
visual grounding
logical consistency
constrained policy optimization
multimodal reasoning