Video-Zero: Self-Evolution Video Understanding

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video self-evolution approaches yield weak supervision and limited reasoning gains because they rely on static cues or linguistic priors rather than temporal evidence. This work proposes the first Questioner-Solver collaborative self-evolution framework that explicitly focuses on temporally localized evidence: a Questioner module automatically identifies critical video segments and generates evidence-anchored questions, while a Solver learns to align its predictions with that evidence, establishing a closed-loop iterative optimization process. Emphasizing evidence-anchored supervision over mere question difficulty, the method integrates a QA collaboration mechanism, a temporal evidence discovery module, and an evidence-alignment learning strategy to enable high-quality self-supervised training. Evaluated across 13 benchmarks spanning temporal localization, long-form video understanding, and reasoning, the approach consistently improves diverse vision-language models, demonstrating the effectiveness and generalizability of evidence-centric self-evolution.

📝 Abstract

Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.
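
The abstract does not say how the Solver's alignment with supporting evidence is scored. As a rough illustration only, the sketch below uses temporal intersection-over-union (a standard temporal grounding metric) between a predicted segment and a Questioner-proposed evidence segment, blended with answer correctness. The names `temporal_iou`, `grounded_reward`, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
from typing import Tuple

# A segment is (start_sec, end_sec); this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, evidence: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], evidence[1]) - max(pred[0], evidence[0]))
    union = (pred[1] - pred[0]) + (evidence[1] - evidence[0]) - inter
    return inter / union if union > 0 else 0.0


def grounded_reward(answer_correct: bool, pred: Segment, evidence: Segment,
                    weight: float = 0.5) -> float:
    """Hypothetical reward: answer correctness blended with evidence overlap.

    `weight` controls how much the temporal overlap contributes; the
    paper's actual objective may differ.
    """
    return (1.0 - weight) * float(answer_correct) + weight * temporal_iou(pred, evidence)


if __name__ == "__main__":
    # A correct answer whose predicted segment only partly overlaps the evidence.
    print(grounded_reward(True, (0.0, 10.0), (5.0, 15.0)))
```

Under this kind of scoring, a correct answer supported by the wrong segment earns less than a correct answer grounded in the evidence segment, which matches the abstract's emphasis on grounding over difficulty alone.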

Problem

Research questions and friction points this paper is trying to address.

video understanding

self-evolution

temporal grounding

evidence-grounded reasoning

annotation-free learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolution

video understanding

temporal grounding

evidence-grounded learning

annotation-free

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding

2024-02-20 · International Conference on Machine Learning · Citations: 30