Video-Zero: Self-Evolution Video Understanding

📅 2026-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video self-evolution approaches tend to produce weakly grounded supervision: generated questions can look difficult yet be answerable from static cues or language priors rather than temporal evidence. This work proposes Video-Zero, an annotation-free Questioner-Solver co-evolution framework centered on temporally localized evidence. The Questioner identifies informative video segments and generates evidence-grounded questions, while the Solver learns to answer and to align its predictions with the supporting evidence, forming a closed iterative loop. The method prioritizes grounding over question difficulty alone, combining evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, it consistently improves multiple video VLM backbones, indicating that evidence-centered self-evolution is effective and transferable.
📝 Abstract
Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.
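The abstract does not say how the Solver's agreement with the supporting evidence is scored. As a rough illustration only, the sketch below assumes the Questioner hands over an evidence segment as a time interval and that alignment is measured by temporal intersection-over-union (IoU), a common metric in temporal grounding. The names `Segment`, `temporal_iou`, `solver_reward`, and the `alignment_weight` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A time interval in a video, in seconds (hypothetical helper type)."""
    start: float
    end: float


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals; 0.0 when they are disjoint."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def solver_reward(answer_correct: bool, predicted: Segment, evidence: Segment,
                  alignment_weight: float = 0.5) -> float:
    """Assumed reward shape: answer correctness plus weighted evidence overlap.

    This is an illustrative stand-in, not the paper's objective.
    """
    return float(answer_correct) + alignment_weight * temporal_iou(predicted, evidence)
```

Under this assumed scoring, a correct answer that points at the wrong part of the video earns less than a correct answer anchored to the evidence segment, which is the kind of signal that would discourage answering from static cues or language priors.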
Problem

Research questions and friction points this paper is trying to address.

video understanding
self-evolution
temporal grounding
evidence-grounded reasoning
annotation-free learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolution
video understanding
temporal grounding
evidence-grounded learning
annotation-free