🤖 AI Summary
This work addresses the challenge that multimodal large language models often fail to improve, or even degrade, their perceptual capabilities during reinforcement-learning-based post-training, because the verbose reasoning trajectories they generate yield little perceptual benefit. To overcome this limitation, the paper introduces Reinforced Attention Learning (RAL), a paradigm that, for the first time, applies policy-gradient reinforcement learning directly to the model's internal cross-modal attention distributions rather than its output sequences. The approach further incorporates On-Policy Attention Distillation to strengthen modality alignment. Evaluated across multiple image and video benchmarks, RAL significantly outperforms strong baselines such as GRPO, with consistent gains in both perception and reasoning.
📝 Abstract
Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
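The abstract's core idea, treating "where to attend" as the object of policy-gradient optimization instead of the output token sequence, can be illustrated with a toy sketch. This is not the paper's implementation; it is a minimal REINFORCE example in which an attention distribution over input positions is the policy, and reward is earned for attending to positions assumed (hypothetically) to be the relevant ones:

```python
import numpy as np

# Toy sketch of a policy gradient over an attention distribution.
# Everything here is illustrative: positions 2 and 5 are assumed to be
# the "relevant" input positions, and the reward function is a stand-in
# for the task reward used in actual RL post-training.

rng = np.random.default_rng(0)

n_positions = 8
relevant = {2, 5}               # hypothetical ground-truth relevant positions
logits = np.zeros(n_positions)  # attention logits act as policy parameters
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(pos):
    # +1 for attending to a relevant position, 0 otherwise (toy reward)
    return 1.0 if pos in relevant else 0.0

baseline = 0.0
for step in range(300):
    probs = softmax(logits)
    pos = rng.choice(n_positions, p=probs)  # sample where to attend
    r = reward(pos)
    baseline = 0.9 * baseline + 0.1 * r     # running-average baseline
    # grad of log softmax at the sampled position: one_hot(pos) - probs
    grad_logp = -probs
    grad_logp[pos] += 1.0
    logits += lr * (r - baseline) * grad_logp  # REINFORCE update

final = softmax(logits)
print(final[2] + final[5])  # attention mass concentrates on relevant positions
```

The update never touches output tokens; only the attention logits move, which is the shift "from what to generate to where to attend" that the abstract describes, reduced to its simplest form.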