ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

📅 2026-05-19
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing video reinforcement learning methods support only sequential tool invocation (one call per turn), so a single erroneous crop propagates errors, multi-turn calls contaminate the context, and inference cost grows with the number of turns. This work proposes ParaVT, the first end-to-end RL-trained multi-agent framework that invokes multiple video cropping tools in parallel within a single reasoning turn. To address the format collapse and reward shortcuts that stem from pretrained tool priors, it introduces the PARA-GRPO algorithm, which combines a format reward targeted at structural-token positions with per-prompt frame-budget randomization. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by 7.9% on average, and PARA-GRPO raises training-time format compliance from 0.13 to 0.64.
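To make the contrast with one-call-per-turn dispatch concrete, here is a minimal Python sketch of issuing several time-window crops in a single turn. It is an illustration under stated assumptions, not the paper's implementation: `CropCall`, `crop_video`, and `dispatch_parallel` are hypothetical names, and the "video" is a plain list standing in for decoded frames.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class CropCall:
    """One requested time window, in seconds (hypothetical schema)."""
    start_s: float
    end_s: float


def crop_video(frames, fps, call):
    """Hypothetical crop tool: return the frames inside [start_s, end_s)."""
    lo = max(0, int(call.start_s * fps))
    hi = min(len(frames), int(call.end_s * fps))
    return frames[lo:hi]


def dispatch_parallel(frames, fps, calls):
    """Run every crop call from one turn concurrently.

    pool.map preserves input order, so clip i corresponds to call i.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        return list(pool.map(lambda c: crop_video(frames, fps, c), calls))


# Stand-in for 100 decoded frames at 10 fps (a 10-second video).
frames = list(range(100))
calls = [CropCall(0.0, 1.0), CropCall(4.0, 5.5), CropCall(9.0, 12.0)]
clips = dispatch_parallel(frames, fps=10, calls=calls)
```

All three clips come back together, so a poorly chosen window sits next to its peers in the same context instead of steering a chain of later turns. The last call runs past the end of the video and is simply clipped to the available frames.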
๐Ÿ“ Abstract
Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
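The abstract names the two PARA-GRPO mechanisms without specifying them, so the following Python sketch only shows one way they could look. Everything here is an assumption made for illustration: the budget values in `FRAME_BUDGETS`, the tag names in `STRUCTURAL_TOKENS`, the symmetric bonus and penalty, and both function names are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical frame budgets; the paper's actual values are not given here.
FRAME_BUDGETS = (8, 16, 32, 64)

# Hypothetical structural tokens that delimit a tool call.
STRUCTURAL_TOKENS = {"<tool_call>", "</tool_call>"}


def sample_frame_budget(prompt_id, step, seed=0):
    """Pick a frame budget per prompt, reproducibly for a given step.

    A small budget leaves the model with few frames up front, which is
    one way a tool call could earn more reward than skipping the tool.
    """
    rng = random.Random(f"{seed}:{step}:{prompt_id}")
    return rng.choice(FRAME_BUDGETS)


def positional_format_reward(tokens, parseable, bonus=1.0):
    """Per-token format reward that is nonzero only at structural tokens.

    Assumption: the reward is +bonus when the whole output parses and
    -bonus otherwise; every non-structural position receives 0.
    """
    sign = bonus if parseable else -bonus
    return [sign if tok in STRUCTURAL_TOKENS else 0.0 for tok in tokens]


budget = sample_frame_budget("prompt-001", step=0)
rewards = positional_format_reward(
    ["<tool_call>", "crop", "</tool_call>", "answer"], parseable=True
)
```

Restricting the format signal to the structural positions leaves the reward on all other tokens to the task objective, which matches the abstract's description of a "targeted" reward; how the actual method weights or gates these terms is not stated in the text above.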
Problem

Research questions and friction points this paper is trying to address.

Tool Prior Paradox
Parallel Tool Use
Video Reinforcement Learning
Format Collapse
Long-video Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel Tool Use
Tool Prior Paradox
Reinforcement Learning
Long-Video Understanding
Multi-Agent RL