Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models

📅 2026-03-28
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing large vision-language models often lack explicit modeling of visual information during reinforcement learning, creating a disconnect between visual representations and the optimization process that limits multimodal reasoning performance. To address this, the work proposes KAWHI, a plug-and-play reward reweighting mechanism that hierarchically aggregates geometric cues to localize critical visual regions, identifies vision-sensitive attention heads via structured attribution, and reallocates credit at the paragraph level so that spatial visual evidence aligns with key reasoning steps. KAWHI is the first framework to explicitly integrate structured visual information into a unified reward-based optimization paradigm, tightly coupling visual representation learning with reinforcement learning. Compatible with general algorithms such as GRPO and GSPO, KAWHI consistently achieves significant performance gains across multiple multimodal reasoning benchmarks, demonstrating its effectiveness and generality as a universal enhancement module.
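
The summary above outlines a three-step pipeline: localize key regions, attribute them to vision-sensitive attention heads, and reallocate reward at the paragraph level. As a rough sketch of that last step only, the snippet below redistributes a verifiable scalar reward across reasoning paragraphs according to per-paragraph visual-alignment scores. Everything here is an assumption for illustration: the function name, the `visual_saliency` inputs, the coefficient `alpha`, and the uniform/saliency blend standing in for the paper's unspecified harmonic incentive.

```python
import numpy as np

def reweight_paragraph_rewards(base_reward, visual_saliency, alpha=0.5):
    """Hypothetical sketch of paragraph-level reward reweighting.

    base_reward:     verifiable scalar reward for the whole response.
    visual_saliency: assumed per-paragraph alignment scores in [0, 1],
                     e.g. attention mass of vision-sensitive heads over
                     the localized key regions.
    alpha:           how strongly visual evidence reshapes credit.
    """
    saliency = np.asarray(visual_saliency, dtype=float)
    uniform = np.full_like(saliency, 1.0 / len(saliency))  # equal credit
    weighted = saliency / (saliency.sum() + 1e-8)          # saliency-proportional credit
    weights = (1 - alpha) * uniform + alpha * weighted     # blend of the two
    return base_reward * weights  # per-paragraph reward shares

# Example: a correct answer (reward 1.0) whose middle paragraph barely
# touches the image receives less of the credit than the grounded ones.
print(reweight_paragraph_rewards(1.0, [0.9, 0.1, 0.8]))
```

Because the weights form a probability distribution, the total reward is preserved; only its allocation across reasoning steps changes.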
📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive empirical evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module, consistently improving the performance of various uniform reward optimization methods. Project page: KAWHI (https://kawhiiiileo.github.io/KAWHI_PAGE/)
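
Since the abstract positions KAWHI as plug-and-play on top of GRPO and GSPO, one plausible integration point is between GRPO's group-normalized advantage and per-token credit assignment. The sketch below is only a guess at that wiring, not the paper's formulation; `rewards`, `saliencies`, and the scaling that keeps each response's mean paragraph advantage unchanged are all assumptions.

```python
import numpy as np

def paragraph_level_advantages(rewards, saliencies, alpha=0.5):
    """Hypothetical sketch: GRPO-style group normalization followed by
    paragraph-level credit reallocation.

    rewards:    verifiable scalar reward per sampled response in the group.
    saliencies: per response, assumed per-paragraph visual-alignment scores.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Standard GRPO step: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    per_paragraph = []
    for a, sal in zip(adv, saliencies):
        sal = np.asarray(sal, dtype=float)
        w = (1 - alpha) / len(sal) + alpha * sal / (sal.sum() + 1e-8)
        # Scale so the mean over a response's paragraphs still equals its
        # advantage `a`; visually grounded paragraphs get more of it.
        per_paragraph.append(a * w * len(sal))
    return per_paragraph  # each paragraph's tokens would use its entry

# Example: two sampled responses, the first correct and well grounded.
advs = paragraph_level_advantages([1.0, 0.0], [[0.9, 0.2], [0.1, 0.1]])
```

Under this scheme the group-level learning signal is unchanged; only its distribution within each response shifts toward visually grounded steps, which matches the abstract's claim of aligning spatial evidence with decisive reasoning steps.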
Problem

Research questions and friction points this paper is trying to address.

Visual Representation
Reinforcement Learning
Large Vision-Language Models
Multimodal Reasoning
Reward Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning from Verifiable Rewards
Large Vision-Language Models
Visual Representation Alignment
Reward Reweighting
Structured Attribution