Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video understanding methods struggle to model objects that undergo semantically significant changes over time, and they do not reason explicitly over the critical visual evidence. This work proposes a search-guided, progressive object-grounding framework that incrementally anchors task-relevant visual regions, using a search controller optimized via reinforcement learning with a format reward. By rewarding the model for tying each reasoning step to specific visual evidence regions, the approach builds spatially grounded, multi-step reasoning trajectories. The format reward is presented as the mechanism that strengthens grounding capability. The method reports consistent performance gains on the in-domain NExTQA benchmark and on the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks, which the authors take as evidence of accuracy, interpretability, robustness, and cross-domain generalization.
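To make the idea of a format reward concrete, here is a minimal Python sketch. It is not the paper's reward: the `<glimpse>` and `<answer>` tag layout, the function name `format_reward`, and the binary 0/1 scoring are all assumptions chosen for illustration. The sketch only checks that a response is a well-formed sequence of grounded steps (a frame index plus a bounding box inside the frame) followed by an answer.

```python
import re

# Hypothetical step format (an assumption, not the paper's actual template):
#   <glimpse frame=12 box=[x1,y1,x2,y2]>reasoning text</glimpse>
STEP_PATTERN = re.compile(
    r"<glimpse frame=(\d+) box=\[(\d+),(\d+),(\d+),(\d+)\]>(.+?)</glimpse>",
    re.DOTALL,
)
# The response must end with a single answer tag (also hypothetical).
ANSWER_PATTERN = re.compile(r"<answer>(.+?)</answer>\s*$", re.DOTALL)


def format_reward(response: str, num_frames: int, width: int, height: int) -> float:
    """Return 1.0 if the response is a well-formed grounded trace, else 0.0."""
    steps = STEP_PATTERN.findall(response)
    if not steps or ANSWER_PATTERN.search(response) is None:
        return 0.0
    for frame, x1, y1, x2, y2, _text in steps:
        frame, x1, y1, x2, y2 = map(int, (frame, x1, y1, x2, y2))
        # The frame index must refer to a frame that exists in the video.
        if not (0 <= frame < num_frames):
            return 0.0
        # The box must be non-empty and lie inside the frame.
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            return 0.0
    return 1.0


good = "<glimpse frame=3 box=[10,20,50,80]>The cup is on the table.</glimpse><answer>B</answer>"
ungrounded = "The answer is B."
print(format_reward(good, num_frames=16, width=640, height=360))
print(format_reward(ungrounded, num_frames=16, width=640, height=360))
```

In a reinforcement learning setup, a term like this would typically be combined with an answer-correctness reward; how the paper weights or shapes the two is not stated in this summary.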

📝 Abstract
Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.
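The step-by-step process described in the abstract can be pictured as a loop in which a controller either grounds one more evidence region or stops with an answer. The Python sketch below illustrates that control flow only. The names `Glimpse`, `Trace`, `run_chain`, and `toy_controller` are hypothetical, the controller is a scripted stand-in for a learned model, and nothing here reproduces the paper's actual search procedure or training.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple, Union

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Glimpse:
    """One grounded reasoning step: where the model looked and what it concluded."""
    frame: int
    box: Box
    thought: str


@dataclass
class Trace:
    """The spatially grounded reasoning trajectory built so far."""
    question: str
    glimpses: List[Glimpse] = field(default_factory=list)
    answer: Optional[str] = None


# Hypothetical controller interface: given the trace so far, return either
# a new Glimpse (keep searching) or a string (the final answer).
Controller = Callable[[Trace], Union[Glimpse, str]]


def run_chain(question: str, controller: Controller, max_steps: int = 4) -> Trace:
    """Grow the trace one glimpse at a time until the controller answers."""
    trace = Trace(question)
    for _ in range(max_steps):
        out = controller(trace)
        if isinstance(out, str):
            trace.answer = out
            break
        trace.glimpses.append(out)
    return trace


def toy_controller(trace: Trace) -> Union[Glimpse, str]:
    """Scripted stand-in for a learned controller (illustration only)."""
    script = [
        Glimpse(2, (10, 20, 50, 80), "A person picks up a cup."),
        Glimpse(9, (30, 25, 70, 90), "The same cup is now on the shelf."),
    ]
    if len(trace.glimpses) < len(script):
        return script[len(trace.glimpses)]
    return "The person moved the cup to the shelf."


trace = run_chain("What did the person do with the cup?", toy_controller)
for g in trace.glimpses:
    print(g.frame, g.box, g.thought)
print(trace.answer)
```

Keeping every glimpse in the trace is what makes the final decision inspectable: each step records the frame and region it relied on.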
Problem

Research questions and friction points this paper is trying to address.

video understanding
object-grounded reasoning
visual object variation
multi-step reasoning
semantic discrimination
Innovation

Methods, ideas, or system contributions that make the work stand out.

object-grounded reasoning
search-guided controller
reinforcement learning
visual grounding
multi-step decision-making