An Empirical Study on How Video-LLMs Answer Video Questions

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work on video large language models (Video-LLMs) emphasizes performance improvement while offering little systematic insight into their internal mechanisms. Method: we introduce attention knockouts in three fine-grained variants (temporal, spatial, and language-to-video), combined with global and intra-layer windowed analysis. Contribution/Results: the analysis uncovers a two-stage reasoning process for video question answering: an early stage relying on language-guided cross-modal retrieval, followed by lightweight spatiotemporal integration in later stages. We further identify critical intermediate layers whose ablation most severely degrades performance. Crucially, we find that spatiotemporal modeling is driven predominantly by language signals rather than by computationally expensive self-attention among video tokens. Leveraging this insight, we compress attention computation, substantially reducing inference cost while preserving model accuracy.

📝 Abstract
Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. We then apply these knockouts to different numbers of layers (windows of layers). By carefully controlling the window of layers and the type of knockout, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) the global setting indicates that video information extraction primarily occurs in early layers, forming a clear two-stage process: lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) in the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) in both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter's high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
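The attention knockouts described in the abstract can be illustrated with a minimal sketch: a knockout sets the pre-softmax attention scores from a chosen set of query tokens to a chosen set of key tokens to a large negative value, effectively severing those connections. The function names and the toy token layout below are illustrative assumptions, not the paper's implementation; the mask shown corresponds to the Language-to-Video Knockout (language queries blocked from video keys).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_knockout(Q, K, V, knockout=None):
    """Single-head scaled dot-product attention.

    knockout: boolean [n_q, n_k] matrix; True at (i, j) blocks
    query i from attending to key j (score forced to -1e9).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if knockout is not None:
        scores = np.where(knockout, -1e9, scores)
    return softmax(scores) @ V

# Toy sequence: 4 video tokens followed by 2 language tokens
rng = np.random.default_rng(0)
n_video, n_lang, d = 4, 2, 8
X = rng.normal(size=(n_video + n_lang, d))

# Language-to-Video Knockout: language queries cannot see video keys
mask = np.zeros((n_video + n_lang, n_video + n_lang), dtype=bool)
mask[n_video:, :n_video] = True

out_full = attention_with_knockout(X, X, X)
out_ko = attention_with_knockout(X, X, X, knockout=mask)

# Video-token outputs are untouched (their score rows are unmasked);
# language-token outputs change because video keys are severed.
assert np.allclose(out_full[:n_video], out_ko[:n_video])
assert not np.allclose(out_full[n_video:], out_ko[n_video:])
```

The Video Temporal and Video Spatial Knockouts follow the same pattern with different masked regions: blocking video-to-video attention across frames versus within a frame.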
Problem

Research questions and friction points this paper is trying to address.

Understanding internal mechanisms of Video-LLMs answering video questions
Analyzing spatial-temporal modeling through attention knockout techniques
Identifying critical computational layers for efficient video processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention knockouts analyze Video-LLM internal mechanisms
Layer-specific knockout variants reveal processing stages
Language-guided retrieval dominates spatiotemporal modeling
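The windowed, layer-specific analysis above can be sketched as a sweep: apply a knockout only inside a sliding window of layers and measure how far the output drifts from the unablated run. The toy residual-attention stack and the norm-based deviation metric here are stand-ins I am assuming for illustration; the paper measures question-answering accuracy on real Video-LLMs instead.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_layer(X, mask=None):
    """Toy self-attention layer with a residual connection."""
    d = X.shape[-1]
    s = X @ X.T / np.sqrt(d)
    if mask is not None:
        s = np.where(mask, -1e9, s)  # knockout blocked query->key pairs
    return X + softmax(s) @ X

def run_stack(X, n_layers, window=None, mask=None):
    """Run n_layers; apply the knockout mask only inside window [start, end)."""
    for layer in range(n_layers):
        use = mask if window and window[0] <= layer < window[1] else None
        X = attn_layer(X, use)
    return X

rng = np.random.default_rng(1)
X0 = rng.normal(size=(6, 8))
n_video, n_layers, win = 4, 8, 2

# Video knockout: video tokens cannot attend to each other
mask = np.zeros((6, 6), dtype=bool)
mask[:n_video, :n_video] = True

baseline = run_stack(X0, n_layers)
# Slide a 2-layer knockout window over the stack; large deviation at a
# window flags those layers as critical for video processing.
drops = [np.linalg.norm(run_stack(X0, n_layers, (s, s + win), mask) - baseline)
         for s in range(n_layers - win + 1)]
```

Plotting `drops` against the window start position is the kind of curve from which the paper identifies outlier intermediate layers whose ablation hurts most.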