VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing video reasoning models struggle to reliably ground spatiotemporal evidence and often rely on costly large-scale annotations or external perception tools during inference. To overcome this, we propose VisionCoach, an input-adaptive reinforcement learning framework that incorporates visual prompts as dynamic guidance during training to steer the model toward critical evidence while suppressing distractions. The approach employs an object-aware reward mechanism—combining object consistency and multi-region bounding box overlap—to refine spatiotemporal reasoning, and leverages self-distillation to enable accurate grounding at inference time without external prompts. By integrating visual prompts into the training guidance mechanism for the first time, VisionCoach achieves state-of-the-art performance across six benchmarks, including V-STAR and VideoMME, offering both efficient inference and low resource dependency.

📝 Abstract
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation or computation cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, WorldSense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
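The object-aware grounding reward described above combines two signals: whether the model names the right objects (identity consistency) and how well its boxes overlap the ground truth across regions. The paper does not publish the exact formula here, so the following is a minimal sketch under assumed conventions: boxes as `[x1, y1, x2, y2]`, predictions and ground truth as dicts from object name to per-region box lists, and a hypothetical weighting `alpha` between the two terms.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def object_aware_reward(pred, gold, alpha=0.5):
    """Sketch of an object-aware grounding reward (hypothetical form).

    pred / gold: dict mapping object name -> list of boxes, one per
    grounded region (e.g. per keyframe).  alpha trades off identity
    consistency against multi-region box overlap.
    """
    # Identity consistency: fraction of ground-truth objects the model names.
    matched = set(pred) & set(gold)
    consistency = len(matched) / len(gold) if gold else 0.0
    # Multi-region overlap: mean IoU over paired boxes of matched objects.
    ious = [iou(pb, gb)
            for name in matched
            for pb, gb in zip(pred[name], gold[name])]
    overlap = sum(ious) / len(ious) if ious else 0.0
    return alpha * consistency + (1 - alpha) * overlap
```

In an RL loop with verifiable rewards, a scalar like this would be added to the answer-correctness reward for each rollout; the exact pairing of predicted and ground-truth regions is an assumption of this sketch.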
Problem

Research questions and friction points this paper is trying to address.

video reasoning
spatio-temporal grounding
visual perception
reinforcement learning
object tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-perception prompting
reinforcement learning
spatio-temporal grounding
self-distillation
input-adaptive RL