EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing video large language models: their reliance on costly human annotations and their difficulty capturing temporal dynamics in videos. The authors propose EvoVid, a temporal-centric self-evolution framework tailored to the video modality, in which a Questioner-Solver self-play mechanism automatically generates temporally grounded questions from raw, unlabeled videos and provides intrinsic supervision. The approach introduces two reward functions: a temporal-aware Questioner reward that favors questions sensitive to temporal perturbation, and a Solver reward based on video segment localization, so training requires no human annotation. Experiments across four base models and six benchmarks show consistent gains over both the base models and existing self-evolving baselines, with performance competitive with supervised methods.

📝 Abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.
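
The two rewards named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the Questioner reward is approximated as the drop in Solver accuracy when the video's temporal order is perturbed, and the Solver reward as the temporal IoU between a predicted segment and the segment the question was generated from. The function names `perturbation_sensitivity` and `temporal_iou` are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(pred: Segment, target: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1].

    Assumed stand-in for a localization-based Solver reward: the target
    is the segment the question was generated from, so no human label
    is needed.
    """
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def perturbation_sensitivity(
    correct_original: Sequence[bool],
    correct_perturbed: Sequence[bool],
) -> float:
    """Drop in Solver accuracy after temporal perturbation, clipped at 0.

    Assumed stand-in for a temporal-aware Questioner reward: a question
    is treated as temporally dependent if the Solver answers it less
    reliably once the video's temporal order is perturbed (for example,
    by shuffling frames).
    """
    if not correct_original or len(correct_original) != len(correct_perturbed):
        raise ValueError("need two non-empty result lists of equal length")
    acc_original = sum(correct_original) / len(correct_original)
    acc_perturbed = sum(correct_perturbed) / len(correct_perturbed)
    return max(0.0, acc_original - acc_perturbed)
```

Under these assumptions, a question whose answer is unaffected by perturbing the video earns the Questioner no reward, while a Solver prediction that overlaps the source segment more closely earns a higher reward.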

Problem

Research questions and friction points this paper is trying to address.

Video Large Language Models

temporal dynamics

self-evolution

video reasoning

unannotated videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal-centric

self-evolution

video reasoning