PyVision-RL: Forging Open Agentic Vision Models via RL

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses interaction collapse in reinforcement learning for agentic multimodal models, where training drives the model to reduce tool usage and multi-step reasoning, undermining agent autonomy. To mitigate this, the authors propose PyVision-RL, a framework that combines an oversampling-filtering-ranking rollout strategy with a cumulative tool reward to stabilize training and sustain prolonged interaction. They further introduce an on-demand context construction mechanism that substantially reduces visual token consumption. A unified training pipeline supports both image and video understanding, yielding the PyVision-Image and PyVision-Video models, which achieve significant gains in performance and efficiency. These results underscore the importance of sustained interaction and on-demand visual processing for effective multimodal agent design.

📝 Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
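The abstract's two anti-collapse mechanisms can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the per-tool-call bonus, its cap, and the filter-on-zero-tool-calls rule are all illustrative assumptions about how "oversampling-filtering-ranking" and an "accumulative tool reward" might fit together:

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    answer_correct: bool  # task outcome
    tool_calls: int       # number of tool invocations in the trajectory

def cumulative_tool_reward(r, task_weight=1.0, tool_bonus=0.1, max_bonus=0.5):
    # Accumulate a small bonus per tool call, capped so the agent
    # cannot inflate reward by calling tools gratuitously (assumed cap).
    bonus = min(r.tool_calls * tool_bonus, max_bonus)
    return task_weight * float(r.answer_correct) + bonus

def oversample_filter_rank(sample_fn, n_oversample=16, k_keep=4):
    # Oversample: draw more rollouts than the batch needs.
    rollouts = [sample_fn() for _ in range(n_oversample)]
    # Filter: drop "collapsed" rollouts that used no tools at all.
    kept = [r for r in rollouts if r.tool_calls > 0]
    if not kept:
        kept = rollouts  # fall back rather than return an empty batch
    # Rank: keep the top-k rollouts by the accumulated reward.
    kept.sort(key=cumulative_tool_reward, reverse=True)
    return kept[:k_keep]
```

The intent of the combination is that filtering removes zero-interaction trajectories from the training signal, while the capped tool bonus keeps multi-turn tool use preferable even when a tool-free rollout happens to answer correctly.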
Problem

Research questions and friction points this paper is trying to address.

interaction collapse
agentic multimodal models
tool usage
multi-turn reasoning
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning
agentic multimodal models
interaction collapse
on-demand context construction
tool usage
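The on-demand context construction listed above can be sketched as follows. The helper below is hypothetical (the paper does not specify this interface): it assumes the agent requests frame windows during reasoning, and only those windows are sampled, rather than densely decoding the whole video:

```python
def on_demand_frames(num_frames, request_windows, stride=30):
    """Return only frame indices inside the windows the agent asked for.

    num_frames: total frames in the video
    request_windows: list of (start, end) frame ranges requested mid-reasoning
    stride: sampling interval within a requested window (assumed fixed here)
    """
    selected = set()
    for start, end in request_windows:
        selected.update(range(start, min(end, num_frames), stride))
    return sorted(selected)
```

Compared with uniformly sampling all `num_frames // stride` frames up front, only the task-relevant windows contribute visual tokens, which is the efficiency argument the abstract makes.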