EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the heavy reliance on manual annotations in video temporal grounding by proposing EvoGround, a framework that learns temporal localization without any human-labeled data. EvoGround pairs two self-evolving agents, a proposer and a solver, that refine each other through reinforcement learning on unlabeled videos, aligning visual and linguistic semantics without explicit supervision. Trained on only 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple benchmarks and also reaches state-of-the-art performance in fine-grained video captioning.

📝 Abstract

Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. The two agents are initialized from the same backbone and, through this self-reinforcing reinforcement-learning loop, improve each other across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.
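
The abstract does not say how the solver's prediction is scored against the proposer's moment. As a minimal sketch, assuming the overlap is measured with temporal intersection-over-union (a common VTG metric, not confirmed as EvoGround's reward), the comparison could look like this in Python. The names `Moment` and `temporal_iou` are hypothetical and chosen only for illustration.

```python
from typing import Tuple

# Hypothetical representation of a temporal moment: (start_sec, end_sec).
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, target: Moment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1].

    Assumption: this stands in for the unspecified feedback signal
    between the solver's prediction and the proposer's moment.
    """
    # Overlap is clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    # Guard against degenerate zero-length intervals.
    return inter / union if union > 0 else 0.0


# A solver prediction of 0-10 s against a proposed moment of 5-15 s
# overlaps for 5 s out of a 15 s union.
score = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

A score near 1 would indicate that the solver recovered the proposed moment almost exactly, while a score of 0 would indicate no overlap at all.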

Problem

Research questions and friction points this paper is trying to address.

video temporal grounding

natural-language query

temporal moment localization

manual annotation

untrimmed video

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving agents

video temporal grounding

unsupervised learning

reinforcement learning

fine-grained video captioning

🔎 Similar Papers

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

2024-10-04 | arXiv.org | Citations: 36