GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

📅 2025-07-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video benchmarks predominantly rely on keyframe sampling and thus fail to rigorously assess the temporal reasoning capabilities of Large Vision-Language Models (LVLMs). To address this limitation, the authors propose GLIMPSE, a benchmark designed so that every question requires analysis of the full video sequence. It spans 11 highly visual-centric task categories, including trajectory analysis and temporal causal reasoning. Built on 3,269 videos and 4,342 human-crafted questions, GLIMPSE enforces end-to-end temporal reasoning and eliminates shortcuts through static images or sparse frame sampling. Humans achieve 94.82% accuracy, while the best-performing model, GPT-o3, attains only 66.43%, revealing a substantial gap in LVLMs' deep video comprehension. GLIMPSE thus provides a rigorous, temporally grounded evaluation framework that exposes critical weaknesses in existing models' spatiotemporal reasoning.

📝 Abstract
Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context; this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, annotators reach 94.82% accuracy on GLIMPSE, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.
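
To make the evaluation setup concrete, below is a minimal sketch of the multiple-choice scoring loop the abstract implies, where a model must be given the full frame sequence for every question rather than a handful of keyframes. The item fields and the `answer_fn` interface are hypothetical stand-ins for illustration, not GLIMPSE's released code.

```python
from typing import Callable, Sequence

def accuracy(
    items: Sequence[dict],
    answer_fn: Callable[[Sequence, str, Sequence[str]], str],
) -> float:
    """Score a model that sees the full frame sequence for every question.

    Each (hypothetical) item is expected to hold:
      "frames": the densely sampled frames of the whole clip,
      "question": the question text,
      "options": the multiple-choice options,
      "answer": the ground-truth option.
    """
    correct = 0
    for item in items:
        # The model answers from the entire clip, not a sparse keyframe subset.
        pred = answer_fn(item["frames"], item["question"], item["options"])
        correct += int(pred == item["answer"])
    return correct / max(1, len(items))

# A keyframe shortcut would instead pass only a sparse subset, e.g. frames[::32];
# GLIMPSE's questions are written so that such a glimpse is not enough to answer.
```

Under this kind of protocol, the gap the paper reports (94.82% for humans versus 66.43% for the best model) is measured on questions that, by construction, cannot be resolved from a few static frames.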
Problem

Research questions and friction points this paper is trying to address.

Assessing if LVLMs genuinely understand videos beyond key frames
Creating a benchmark for deep temporal reasoning in video analysis
Evaluating LVLMs' ability to process full video context comprehensively
Innovation

Methods, ideas, or system contributions that make the work stand out.

GLIMPSE benchmark for deep video understanding
Human-crafted questions require full video context
Emphasizes temporal reasoning over static frames