🤖 AI Summary
This work addresses the tendency of vision-language models trained via reinforcement learning to suffer from diversity collapse: premature convergence onto a limited set of reasoning paths, suboptimal local solutions, and restricted scalability. An analysis of training dynamics reveals a fundamental behavioral difference between RL-trained and base models: the former reason more deeply but more narrowly, while the latter explore a broader and more diverse range of reasoning patterns. To mitigate diversity collapse, the authors propose Multi-Group Policy Optimization (MUPO), a reinforcement learning framework that extends Group Relative Policy Optimization (GRPO) and explicitly incentivizes divergent thinking across multiple solutions. Experimental results show that MUPO significantly improves reasoning diversity, overall performance, and scalability on standard benchmarks.
📝 Abstract
Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the underlying mechanisms that drive the effectiveness of RL-trained models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL-trained and base models: the former engage in deeper yet narrower reasoning, while base models, though less refined along any individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
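The abstract does not spell out the MUPO objective, but the GRPO mechanism it builds on is standard: for each prompt, a group of responses is sampled and each response's reward is normalized against the group's own mean and standard deviation, replacing a learned value critic. A minimal sketch of that group-relative advantage (the function name `grpo_advantages` is our own, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO.

    Each sampled response in a group is scored against the group's
    own mean and std, so no separate value network is required.
    Because all advantages share one baseline, high-reward responses
    in the group are reinforced relative to low-reward ones -- which
    is also why the policy can collapse onto the few strategies that
    currently score well (the diversity collapse discussed above).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One group of 4 sampled responses to the same prompt,
# scored by a binary correctness reward.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct responses get positive advantage, incorrect ones negative,
# and the advantages sum to (approximately) zero within the group.
```

MUPO, per the abstract, extends this single-group scheme across multiple solution groups to reward divergent reasoning; the exact grouping and objective are detailed in the paper itself.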