Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in

๐Ÿ“… 2025-12-16
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Large video-language models (LVLMs) exhibit weak temporal awareness in grounded video question answering (GVQA), leading to temporal mislocalization and answer hallucination. To address this, we propose Zoom-Zero, a two-stage "coarse localization โ†’ fine-grained focusing" framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for fine-grained visual verification. The method extends Group Relative Policy Optimization (GRPO) with two innovations: a zoom-in accuracy reward that validates the fidelity of temporal grounding predictions, and a token-selective credit assignment strategy that decouples multi-faceted reinforcement signals by attributing each reward to the tokens responsible for it. Evaluated on NExT-GQA and ReXTime, the approach improves temporal localization accuracy by 5.2% and 4.6%, respectively, raises average answer accuracy by 2.4%, and yields an average 6.4% gain on long-video understanding benchmarks. These results demonstrate a significant advance in temporally grounded video comprehension for LVLMs.
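The coarse-to-fine inference described above can be sketched as a frame-sampling routine: keep a sparse uniform sample for global context, then densely re-sample inside the predicted segment. This is a minimal illustration under assumed names (`zoom_in_sampling`, the frame budgets, and the uniform sampling scheme are not from the paper):

```python
# Hypothetical sketch of coarse-to-fine "temporal zoom-in" at inference:
# sparse global coverage for context, dense coverage of the grounded segment.
def zoom_in_sampling(num_frames, fps, segment, global_budget=8, zoom_budget=16):
    """Return sorted frame indices mixing global and zoomed-in samples.

    num_frames: total frames in the video
    fps: frames per second
    segment: (start_sec, end_sec) predicted by the coarse localization stage
    """
    # Stage 1: sparse, uniform global sample preserves overall context.
    step = max(1, num_frames // global_budget)
    global_idx = list(range(0, num_frames, step))[:global_budget]

    # Stage 2: dense, uniform sample inside the grounded segment.
    start = max(0, int(segment[0] * fps))
    end = min(num_frames - 1, int(segment[1] * fps))
    span = max(1, end - start)
    zoom_idx = [start + round(i * span / max(1, zoom_budget - 1))
                for i in range(zoom_budget)]

    # Merge, deduplicate, and keep temporal order.
    return sorted(set(global_idx + zoom_idx))
```

Because the global sample is retained alongside the zoomed frames, the model keeps global context while gaining frame-level detail, which is how the paper attributes its long-video gains.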

๐Ÿ“ Abstract
Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Improves temporal grounding in video question answering
Addresses mislocalization and hallucinations in large video-language models
Enhances fine-grained visual verification via coarse-to-fine zoom-in
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine framework with temporal zoom-in for verification
Zoom-in accuracy reward to validate temporal grounding fidelity
Token-selective credit assignment for multi-faceted reward attribution
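The token-selective credit assignment idea can be illustrated with a small sketch: instead of broadcasting one scalar advantage over every generated token, as in vanilla GRPO, each reward component is attributed only to the tokens that produced it. The function name, the two-way grounding/answer split, and the token-id marking scheme below are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch: route each group-normalized advantage only to the
# tokens responsible for that reward component, rather than all tokens.
def token_selective_advantages(tokens, grounding_adv, answer_adv,
                               grounding_token_ids):
    """Assign a per-token advantage for the policy-gradient update.

    tokens: generated token ids for one rollout
    grounding_adv: advantage from the temporal-grounding reward
    answer_adv: advantage from the answer-accuracy reward
    grounding_token_ids: ids of tokens that encode the predicted segment
    """
    per_token = []
    for tok in tokens:
        if tok in grounding_token_ids:
            per_token.append(grounding_adv)  # credit localization tokens
        else:
            per_token.append(answer_adv)     # credit answer tokens
    return per_token
```

Separating the two signals this way prevents, for example, a correct answer from rewarding a mislocalized timestamp, which is the failure mode the paper attributes to GRPO's single shared advantage.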