EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

📅 2026-05-13

🤖 AI Summary
This work addresses the heavy reliance on manual annotations in video temporal grounding by proposing EvoGround, a framework that learns temporal localization without any human-labeled data. EvoGround introduces two self-evolving agents, a proposer and a solver, that refine each other through reinforcement learning on unlabeled videos, implicitly aligning visual and linguistic semantics without explicit supervision. Trained on only 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple benchmarks and also reaches state-of-the-art performance in fine-grained video captioning.
📝 Abstract
Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. The two agents are initialized from the same backbone and improve each other across iterations of this reinforcement-learning loop. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.
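To make the proposer-solver feedback concrete, the following is a minimal sketch of how a solver's predicted moment could be scored against the moment the proposer attached to its query. The abstract does not state which reward EvoGround uses; temporal intersection-over-union (IoU), the overlap measure commonly used to evaluate VTG, is an assumed stand-in here, and the names `Segment`, `temporal_iou`, and `solver_reward` are hypothetical.

```python
from typing import Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, target: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def solver_reward(pred: Segment, proposed: Segment) -> float:
    """Hypothetical reward: how well the solver's moment matches the
    moment the proposer paired with the query (not the paper's definition)."""
    return temporal_iou(pred, proposed)


# The proposer marks 4s-8s for a query; the solver answers 2s-6s.
reward = solver_reward((2.0, 6.0), (4.0, 8.0))
```

In this example the two moments overlap for 2 seconds and together span 6 seconds, so the reward is one third. A signal of this kind needs no human labels, because the target moment comes from the proposer rather than from an annotator.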
Problem

Research questions and friction points this paper is trying to address.

video temporal grounding
natural-language query
temporal moment localization
manual annotation
untrimmed video
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving agents
video temporal grounding
unsupervised learning
reinforcement learning
fine-grained video captioning