🤖 AI Summary
This work addresses a challenge in multimodal large language models: during reinforcement learning, reasoning chains often become decoupled from visual evidence because visual information is not effectively fused into the reasoning process. To bridge this gap, the authors propose Trajectory-Guided Reinforcement Learning (TGRL), which introduces expert reasoning trajectories into the multimodal Reinforcement Learning with Verifiable Rewards (RLVR) framework for the first time. TGRL employs a fine-grained guidance strategy to align the policy model with visual inputs, and combines token-level reweighting with trajectory filtering to stabilize training dynamics. This approach achieves deep integration of visual perception and logical reasoning, significantly outperforming prior methods on multiple multimodal reasoning benchmarks and overcoming the limitations of conventional approaches that focus narrowly on visual grounding without holistic reasoning alignment.
📝 Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.
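The abstract does not spell out the training objective, but the two mechanisms it names, token-level reweighting and trajectory filtering, can be illustrated with a minimal REINFORCE-style sketch. Everything below is an assumption for illustration: the trajectory fields (`logprobs`, `weights`, `reward`), the reward threshold, and the per-token weighting scheme are hypothetical stand-ins, not the paper's actual formulation.

```python
def filter_trajectories(trajectories, reward_threshold=0.5):
    """Trajectory filtering (hypothetical sketch): keep only rollouts
    whose verifiable reward clears a threshold, so low-quality
    trajectories do not contribute gradient signal."""
    return [t for t in trajectories if t["reward"] >= reward_threshold]


def reweighted_pg_loss(trajectories):
    """Token-level reweighting (hypothetical sketch): a REINFORCE-style
    loss where each token's -logprob * reward term is scaled by a
    per-token weight, e.g. upweighting tokens that align with an
    expert reasoning trajectory."""
    total, n_tokens = 0.0, 0
    for t in trajectories:
        for logp, w in zip(t["logprobs"], t["weights"]):
            total += -w * t["reward"] * logp  # weighted policy-gradient term
            n_tokens += 1
    return total / max(n_tokens, 1)  # mean over surviving tokens


# Toy rollouts: the second has zero verifiable reward and is filtered out.
trajs = [
    {"logprobs": [-0.1, -0.3], "weights": [1.0, 2.0], "reward": 1.0},
    {"logprobs": [-2.0, -1.5], "weights": [1.0, 1.0], "reward": 0.0},
]
kept = filter_trajectories(trajs)
loss = reweighted_pg_loss(kept)  # 0.35 on this toy example
```

In a real RLVR pipeline these terms would sit inside a clipped policy-gradient objective over batched tensors; the sketch only shows where the per-token weights and the trajectory-level filter enter the loss.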