🤖 AI Summary
This work addresses the credit-assignment challenge that cross-modal attention poses for reinforcement learning in multimodal large language models. The study reveals that visual reasoning performance hinges primarily on the alignment quality of a few highly connected anchor tokens across modalities, rather than on the total number of participating tokens. Building on this insight, the authors propose AT-RL, a lightweight Anchor-Token Reinforcement Learning framework that applies graph clustering to the attention topology to identify critical anchor tokens and selectively reinforce them. Coupled with a verifiable reward mechanism, AT-RL trains efficiently with only 1.2% additional computational overhead. On MathVista, a 32B-parameter model trained with AT-RL attains a score of 80.2, surpassing the 72B-Instruct baseline, and it shows consistent gains across STEM, video, and general multimodal tasks.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet how visual evidence is integrated during reasoning remains poorly understood. We examine multimodal RLVR through the lens of cross-modal attention connectivity and find that only a small fraction of tokens (approximately 15%) exhibit strong visual-textual coupling. These high-connectivity tokens act as anchors that ground reasoning in the image, while the majority follow linguistic patterns. During RLVR training, credit assignment naturally concentrates on these anchors, sharpening their visual grounding over time. Building on this insight, we propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens via graph-based clustering of attention topology. Evaluated across a model series spanning 3B to 32B parameters, AT-RL introduces only 1.2% computational overhead yet enables the 32B model to surpass the 72B-Instruct baseline on MathVista (80.2), with consistent gains across STEM, video, and general multimodal tasks. Conversely, training solely on low-connectivity tokens causes severe degradation, confirming that effective multimodal RL hinges on precise credit assignment to visual anchors. Our work reveals that reasoning quality is governed not by the quantity of participating tokens but by the fidelity of cross-modal anchoring.
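The core idea of identifying anchor tokens from cross-modal attention can be illustrated with a minimal sketch. The paper's actual method uses graph-based clustering of the attention topology; the toy version below is a simplified proxy (an assumption, not the authors' implementation) that scores each text token by its total attention mass on visual tokens and keeps the top ~15% as anchors. All names (`select_anchor_tokens`, the toy matrix) are hypothetical.

```python
def select_anchor_tokens(cross_attn, frac=0.15):
    """Simplified anchor-token selection (illustrative, not the paper's algorithm).

    cross_attn: list of rows; row i holds attention weights from text token i
    to each visual token (e.g. averaged over heads and layers).
    Returns indices of the top-`frac` most visually connected text tokens.
    """
    # Connectivity score: total attention mass a text token places on the image.
    connectivity = [sum(row) for row in cross_attn]
    k = max(1, round(frac * len(cross_attn)))
    # Rank text tokens by connectivity and keep the k strongest as anchors.
    ranked = sorted(range(len(connectivity)),
                    key=lambda i: connectivity[i], reverse=True)
    return ranked[:k]

# Toy example: 10 text tokens, 4 visual tokens; tokens 3 and 6 attend
# strongly to the image, mimicking visually grounded "anchor" tokens.
attn = [[0.01] * 4 for _ in range(10)]
attn[3] = [0.30] * 4
attn[6] = [0.25] * 4
print(sorted(select_anchor_tokens(attn, frac=0.15)))  # → [3, 6]
```

In AT-RL these selected indices would then receive concentrated credit assignment during RLVR updates, while the remaining, linguistically driven tokens are left largely untouched.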