SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

📅 2026-02-03
🏛️ Neural Information Processing Systems
📈 Citations: 17
Influential: 0
🤖 AI Summary
Existing video large language models struggle to simultaneously preserve frame-level semantic details and capture video-level temporal structure, limiting their fine-grained understanding capabilities. To address this challenge, this work proposes SlowFocus, a mechanism that identifies the query-relevant temporal segment and applies dense sampling to it, integrated with a multi-frequency mixing attention module that fuses local high-frequency visual details with global low-frequency contextual information. This approach significantly raises the equivalent sampling frequency without compromising the quality of frame-level visual tokens. Additionally, the authors introduce a training strategy tailored to fine-grained temporal reasoning and construct a new benchmark, FineAction-CGR. Extensive experiments demonstrate consistent and substantial performance gains on established video understanding benchmarks as well as on FineAction-CGR, confirming the method's strength in fine-grained temporal understanding tasks.

📝 Abstract
Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this innovative mechanism, we introduce a set of training strategies aimed at bolstering both temporal grounding and detailed temporal reasoning capabilities. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks. Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.
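The abstract describes a two-pathway design: sparse uniform sampling for global low-frequency context, dense sampling inside the question-grounded segment for local high-frequency detail, and a mixing attention module that aggregates the two. The paper's actual architecture is not reproduced here; the following is only an illustrative numpy sketch of that idea, with all function names, shapes, and the single-head cross-attention formulation being assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (single head, no projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def slowfocus_sketch(video_features, segment, n_global=8, n_local=16):
    """Toy fusion of sparse global context with a densely sampled segment.

    video_features: (T, d) array of per-frame features
    segment: (start, end) frame range deemed relevant to the question
    """
    T = video_features.shape[0]
    # Low-frequency pathway: uniform sparse sampling over the whole video.
    global_idx = np.linspace(0, T - 1, n_global).astype(int)
    global_feats = video_features[global_idx]
    # High-frequency pathway: dense sampling inside the grounded segment.
    start, end = segment
    local_idx = np.linspace(start, end - 1, n_local).astype(int)
    local_feats = video_features[local_idx]
    # Mixing step: each pathway attends to the other, with residuals,
    # so local detail is contextualized by global structure and vice versa.
    local_mixed = local_feats + attention(local_feats, global_feats, global_feats)
    global_mixed = global_feats + attention(global_feats, local_feats, local_feats)
    return np.concatenate([global_mixed, local_mixed], axis=0)
```

Under these assumptions, a 64-frame video with a grounded segment of frames 20-40 yields 8 global plus 16 local fused tokens, i.e. far more tokens covering the question-relevant interval than uniform sampling alone would provide at the same budget.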
Problem

Research questions and friction points this paper is trying to address.

video LLM
fine-grained temporal understanding
frame-level semantics
temporal information
video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

SlowFocus
fine-grained temporal understanding
video LLM
multi-frequency mixing attention
temporal grounding