SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

📅 2025-12-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing long-video understanding approaches struggle to balance inference efficiency with adaptability across diverse video durations. Method: This paper proposes the any-horizon video reasoning paradigm, introducing an adaptive multi-turn video reasoning agent. The agent leverages Gemini-2.5-Flash to generate high-quality synthetic data and is built around the SAGE-MM multimodal large model, combining task-oriented dynamic frame sampling, reinforcement learning (a PPO variant) for post-training, and synthetic data distillation, enabling duration-aware decisions to skip, closely examine, or process frames stepwise. Contribution/Results: On open-ended video question answering, the method achieves a 6.1% overall performance gain and an 8.2% improvement on videos exceeding 10 minutes. The authors also introduce SAGE-Bench, a new benchmark with an average video length exceeding 700 seconds, demonstrating strong effectiveness and generalization in realistic long-video scenarios.
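The summary describes an orchestrator that decides per turn whether to answer immediately, closely examine a segment, or skip ahead. A minimal sketch of such a control loop is below; the action names, the `orchestrator` interface, and the turn budget are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical sketch of an any-horizon multi-turn loop. The orchestrator
# (e.g. a SAGE-MM-style model) is abstracted behind step()/answer() calls;
# these names and the action vocabulary are assumptions for illustration.

@dataclass
class Turn:
    action: str    # "ANSWER", "EXAMINE", or "SKIP" (illustrative labels)
    payload: Any   # answer text, an (start, end) segment, or a skip target

def any_horizon_answer(orchestrator, video_frames: List, question: str,
                       max_turns: int = 8) -> str:
    """Run the orchestrator for up to max_turns, letting it decide per turn
    whether to answer now, densely examine a segment, or skip ahead."""
    context = []   # frames the orchestrator has actually looked at
    cursor = 0
    for _ in range(max_turns):
        turn = orchestrator.step(question, context, cursor, len(video_frames))
        if turn.action == "ANSWER":      # short videos can end here in turn 1
            return turn.payload
        if turn.action == "EXAMINE":     # densely sample a promising segment
            start, end = turn.payload
            context.extend(video_frames[start:end])
            cursor = end
        elif turn.action == "SKIP":      # jump past irrelevant content
            cursor = turn.payload
    # Turn budget exhausted: answer from the evidence gathered so far.
    return orchestrator.answer(question, context)
```

The point of the sketch is the control flow: a single-turn answer falls out naturally for easy or short inputs, while long videos trigger iterative examine/skip turns before answering.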

📝 Abstract
As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.
Problem

Research questions and friction points this paper is trying to address.

Develop any-horizon video reasoning systems like humans
Train agents to reason flexibly across different video durations
Improve performance on long videos with multi-turn reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAGE agent system for multi-turn long video reasoning
Synthetic data pipeline with Gemini-2.5-Flash for orchestrator training
RL post-training recipe to enable any-horizon reasoning ability
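The summary also mentions task-oriented dynamic frame sampling. A toy illustration of duration-aware sampling is below; the frame budget and the uniform-subsampling strategy are assumptions for illustration, not the paper's actual settings.

```python
def sample_frame_indices(num_frames: int, frame_budget: int = 32) -> list:
    """Toy duration-aware sampler: short clips are taken in full, long
    videos are subsampled uniformly down to a fixed frame budget.
    The budget value and strategy are illustrative assumptions."""
    if num_frames <= frame_budget:
        return list(range(num_frames))  # watch short videos in full
    # Spread the fixed budget uniformly across the whole duration.
    step = num_frames / frame_budget
    return [int(i * step) for i in range(frame_budget)]
```

Under this scheme a 10-second clip is processed frame-by-frame, while an hour-long video yields the same fixed number of frames per pass, keeping per-turn cost roughly constant regardless of duration.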