🤖 AI Summary
This work addresses interaction collapse in reinforcement learning for multimodal agents, where models progressively reduce tool usage and multi-step reasoning, compromising agent autonomy. To mitigate this, the authors propose the PyVision-RL framework, which combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to stabilize training and sustain prolonged interaction. They also introduce an on-demand context construction mechanism that substantially reduces visual token consumption. A unified training pipeline supports both image and video understanding via the PyVision-Image and PyVision-Video models, which achieve significant improvements in performance and efficiency. These results underscore the importance of sustained interaction and on-demand visual processing for effective multimodal agent design.
📝 Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
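To make the rollout strategy concrete, here is a minimal sketch of how an oversampling-filtering-ranking step combined with an accumulative tool reward might look. All names (`Rollout`, `accumulative_tool_reward`, `select_rollouts`) and the specific reward shape (a capped per-call bonus on top of task reward) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    answer_correct: bool
    tool_calls: int  # tool invocations accumulated across turns

def accumulative_tool_reward(r: Rollout,
                             per_call_bonus: float = 0.1,
                             cap: float = 0.5) -> float:
    """Hypothetical reward shaping: base task reward plus a capped
    bonus that accumulates with each tool call, so tool-free
    (collapsed) rollouts earn no bonus."""
    base = 1.0 if r.answer_correct else 0.0
    return base + min(cap, per_call_bonus * r.tool_calls)

def select_rollouts(candidates: list[Rollout], keep: int = 4) -> list[Rollout]:
    """Oversample-filter-rank sketch: generate more rollouts than
    needed, filter out degenerate tool-free ones when alternatives
    exist, then rank by shaped reward and keep the top-k."""
    filtered = [c for c in candidates if c.tool_calls > 0] or candidates
    return sorted(filtered, key=accumulative_tool_reward, reverse=True)[:keep]
```

The filter-then-rank order matters here: filtering first prevents a correct but tool-free rollout from crowding out multi-turn behavior, which is the collapse mode the framework targets.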
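The on-demand context construction for video can likewise be sketched in a few lines: rather than encoding a dense, uniform frame grid up front, the agent requests specific frame windows as reasoning unfolds, so only task-relevant frames consume visual tokens. The function name and window-based request format below are assumptions for illustration.

```python
def on_demand_frames(total_frames: int,
                     requests: list[tuple[int, int]]) -> list[int]:
    """Hypothetical on-demand frame selection: collect only the
    frame indices the agent asks for (as [start, end) windows
    chosen during reasoning), clipped to the video's bounds."""
    selected: set[int] = set()
    for start, end in requests:
        selected.update(range(max(0, start), min(total_frames, end)))
    return sorted(selected)

# A dense baseline would encode all 3000 frames; the agent's
# sparse requests here cover only 20 of them.
frames = on_demand_frames(3000, [(120, 128), (950, 958), (2400, 2404)])
```

Token savings then scale directly with how selective the agent's requests are relative to the dense baseline.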