🤖 AI Summary
This work addresses interaction collapse in reinforcement learning for multimodal agents, where models progressively reduce tool usage and multi-step reasoning, compromising agent autonomy. To mitigate this, the authors propose the PyVision-RL framework, which combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to stabilize training and sustain prolonged interaction. They also introduce an on-demand context construction mechanism that substantially reduces visual token consumption. A unified training pipeline supports both image and video understanding via the PyVision-Image and PyVision-Video models, which achieve significant improvements in performance and efficiency. These results underscore the importance of sustained interaction and on-demand visual processing for effective multimodal agent design.
📝 Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
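To make the rollout strategy concrete, here is a minimal sketch of how an oversampling-filtering-ranking step combined with an accumulative tool reward might look. All names (`Rollout`, `accumulative_tool_reward`, `select_rollouts`) and the specific reward shape (a capped per-call bonus on top of task reward) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    answer_correct: bool
    tool_calls: int  # tool invocations accumulated across turns

def accumulative_tool_reward(r: Rollout,
                             per_call_bonus: float = 0.1,
                             cap: float = 0.5) -> float:
    """Hypothetical reward shaping: base task reward plus a capped
    bonus that accumulates with each tool call, so tool-free
    (collapsed) rollouts earn no bonus."""
    base = 1.0 if r.answer_correct else 0.0
    return base + min(cap, per_call_bonus * r.tool_calls)

def select_rollouts(candidates: list[Rollout], keep: int = 4) -> list[Rollout]:
    """Oversample-filter-rank sketch: generate more rollouts than
    needed, filter out degenerate tool-free ones when alternatives
    exist, then rank by shaped reward and keep the top-k."""
    filtered = [c for c in candidates if c.tool_calls > 0] or candidates
    return sorted(filtered, key=accumulative_tool_reward, reverse=True)[:keep]
```

The filter-then-rank order matters here: filtering first prevents a correct but tool-free rollout from crowding out multi-turn behavior, which is the collapse mode the framework targets.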
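The on-demand context construction for video can likewise be sketched in a few lines: rather than encoding a dense, uniform frame grid up front, the agent requests specific frame windows as reasoning unfolds, so only task-relevant frames consume visual tokens. The function name and window-based request format below are assumptions for illustration.

```python
def on_demand_frames(total_frames: int,
                     requests: list[tuple[int, int]]) -> list[int]:
    """Hypothetical on-demand frame selection: collect only the
    frame indices the agent asks for (as [start, end) windows
    chosen during reasoning), clipped to the video's bounds."""
    selected: set[int] = set()
    for start, end in requests:
        selected.update(range(max(0, start), min(total_frames, end)))
    return sorted(selected)

# A dense baseline would encode all 3000 frames; the agent's
# sparse requests here cover only 20 of them.
frames = on_demand_frames(3000, [(120, 128), (950, 958), (2400, 2404)])
```

Token savings then scale directly with how selective the agent's requests are relative to the dense baseline.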