🤖 AI Summary
This work addresses the inefficiency of existing video understanding methods on long videos, their reliance on handcrafted pipelines, and their inability to support query-driven adaptive comprehension. To overcome these limitations, the authors propose EVA, an end-to-end video agent that introduces, for the first time, a planning-first iterative reasoning framework following a “summarize–plan–act–reflect” cycle, enabling it to autonomously decide what, when, and how to watch. EVA is trained via a three-stage paradigm of supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO), each stage backed by high-quality training data. Across six video understanding benchmarks, EVA substantially outperforms general-purpose multimodal large language models by 6–12% and surpasses state-of-the-art adaptive agent approaches by a further 1–3%.
📝 Abstract
Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for an End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline, comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO), that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6–12% over general MLLM baselines and a further 1–3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
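The summary-plan-action-reflection cycle described above can be sketched as a simple agent loop. This is a hedged, toy illustration only: here `video` is a list of per-frame captions, and `plan_next_frame` and `reflect` are hypothetical stubs standing in for the MLLM calls the paper's actual agent would make; none of these names come from the EVA codebase.

```python
# Toy sketch of an iterative "summarize-plan-act-reflect" loop.
# All names here are illustrative, not the paper's API.

def run_agent(query, video, max_rounds=4):
    """Iteratively decide which frames to 'watch' until an answer is found."""
    memory, answer = [], None
    seen = set()
    for _ in range(max_rounds):
        if len(seen) == len(video):              # nothing left to watch
            break
        summary = "; ".join(memory)              # summarize: compress evidence so far
        idx = plan_next_frame(query, summary, video, seen)  # plan: what/when to watch
        seen.add(idx)
        memory.append(video[idx])                # act: perceive the chosen frame
        answer, done = reflect(query, memory)    # reflect: answer or keep watching
        if done:
            break
    return answer

def plan_next_frame(query, summary, video, seen):
    # Toy planner: prefer an unseen frame whose caption shares a query word.
    words = set(query.lower().split())
    for i, caption in enumerate(video):
        if i not in seen and words & set(caption.lower().split()):
            return i
    return next(i for i in range(len(video)) if i not in seen)

def reflect(query, memory):
    # Toy reflection: stop once gathered evidence mentions a query word.
    words = set(query.lower().split())
    for caption in memory:
        if words & set(caption.lower().split()):
            return caption, True
    return None, False
```

In EVA itself these decisions are produced by a trained MLLM policy rather than keyword heuristics; the loop structure, where planning precedes perception and reflection gates further watching, is the point being illustrated.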