🤖 AI Summary
Multimodal large language models (MLLMs) suffer from limited exploration, saturated visual reasoning capability, and inefficient training during reinforcement learning (RL) fine-tuning, primarily because they rely on policy self-sampling. Method: This paper proposes an external knowledge-injected policy learning framework that, for the first time, incorporates high-quality action sequences generated by auxiliary models as external guidance signals in RL training, thereby explicitly expanding the action space and optimizing reasoning trajectories. Technically, it integrates GRPO-style RL, cross-model knowledge distillation, and action-space guidance mechanisms. Contribution/Results: On the Reason-RFT-CoT benchmark, the method surpasses the state-of-the-art by up to 5%, while significantly accelerating training convergence and improving sample efficiency.
📝 Abstract
Visual reasoning is crucial for understanding complex multimodal data and advancing Artificial General Intelligence. Existing methods enhance the reasoning capability of Multimodal Large Language Models (MLLMs) through Reinforcement Learning (RL) fine-tuning (e.g., GRPO). However, current RL approaches sample action groups solely from the policy model itself, which limits the upper bound of the model's reasoning capability and leads to inefficient training. To address these limitations, this paper proposes a novel RL framework called **Vision-EKIPL**. The core of this framework lies in introducing high-quality actions generated by external auxiliary models during RL training to guide the optimization of the policy model. This knowledge-infused policy learning significantly expands the model's exploration space, effectively raises the reasoning boundary, and substantially accelerates training convergence. Experimental results demonstrate that the proposed Vision-EKIPL achieves up to a 5% performance improvement on the Reason-RFT-CoT Benchmark compared to the state-of-the-art (SOTA). These results indicate that Vision-EKIPL can overcome the limitations of traditional RL methods, significantly enhance the visual reasoning performance of MLLMs, and provide a new, effective paradigm for research in this field.
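The key mechanical change the abstract describes can be illustrated with a minimal sketch: in GRPO-style training, each prompt's sampled responses form a group whose rewards are normalized into relative advantages; here the group additionally contains responses produced by an external auxiliary model. This is an assumption-laden illustration, not the paper's implementation: the function name, the scalar-reward interface, and the choice to score external actions with the same reward function are all hypothetical.

```python
import statistics

def grpo_advantages(policy_rewards, external_rewards):
    # Hypothetical sketch of the Vision-EKIPL idea: the GRPO group mixes
    # rewards of the policy's own samples with rewards of actions drawn
    # from an external auxiliary model (assumed scored identically).
    group = list(policy_rewards) + list(external_rewards)
    mean = statistics.mean(group)
    std = statistics.pstdev(group) or 1.0  # guard against zero variance
    # Group-relative advantage: standardize each reward within the group.
    return [(r - mean) / std for r in group]

# Example: two mediocre self-samples plus one strong external action.
# The external action receives a large positive advantage, so gradient
# updates are pulled toward behavior the policy did not sample itself.
advantages = grpo_advantages([1.0, 1.0], [2.0])
```

Because advantages are mean-centered within the group, a high-reward external action stands out even when all self-samples are uniformly weak, which is one intuition for why injected actions can expand the effective exploration space.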