Token-Efficient Long Video Understanding for Multimodal LLMs

📅 2025-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video multimodal large language models (MLLMs) lack explicit temporal modeling, so they struggle to capture dynamic patterns in long videos and incur high computational overhead. To address this, we propose STORM, an architecture that inserts a Mamba-based state-space temporal encoder between the image encoder and the large language model (LLM) to fuse inter-frame information into the image tokens. The enriched tokens support two kinds of token reduction: training-based temporal and spatial pooling, and test-time sampling. Both shorten the LLM input without sacrificing key temporal information. Experiments show that STORM improves accuracy by more than 5% on MLVU and LongVideoBench while cutting computation cost by up to 8× and decoding latency by 2.4–2.9× for a fixed number of input frames.

📝 Abstract

Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4-2.9× for the fixed numbers of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm
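The token reduction strategies named in the abstract (temporal pooling, spatial pooling, and test-time sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of average pooling, the square token grid, and the toy sizes are all assumptions made for illustration. It only shows how each operation shrinks the number of tokens that would reach the LLM.

```python
from typing import List

Frame = List[List[float]]  # one frame: tokens x feature_dim
Video = List[Frame]        # frames x tokens x feature_dim


def temporal_avg_pool(video: Video, window: int) -> Video:
    """Average each group of `window` consecutive frames, token by token."""
    pooled = []
    for start in range(0, len(video), window):
        group = video[start:start + window]
        n_tokens, dim = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(f[t][d] for f in group) / len(group) for d in range(dim)]
            for t in range(n_tokens)
        ])
    return pooled


def spatial_avg_pool(video: Video, grid: int, window: int) -> Video:
    """Average non-overlapping window x window patches of a grid x grid token map."""
    out = []
    for frame in video:
        dim = len(frame[0])
        new_frame = []
        for r0 in range(0, grid, window):
            for c0 in range(0, grid, window):
                cells = [frame[r * grid + c]
                         for r in range(r0, min(r0 + window, grid))
                         for c in range(c0, min(c0 + window, grid))]
                new_frame.append(
                    [sum(tok[d] for tok in cells) / len(cells) for d in range(dim)]
                )
        out.append(new_frame)
    return out


def temporal_sample(video: Video, stride: int) -> Video:
    """Test-time reduction: keep the tokens of every `stride`-th frame."""
    return video[::stride]


def token_count(video: Video) -> int:
    return sum(len(frame) for frame in video)


# Toy input (hypothetical sizes): 8 frames, a 4x4 token grid, 2 features per token.
# Every token of frame i holds the value i so the averaging is easy to follow.
video = [[[float(i), float(i)] for _ in range(16)] for i in range(8)]

pooled_t = temporal_avg_pool(video, window=4)            # 8 frames -> 2 frames
pooled_ts = spatial_avg_pool(pooled_t, grid=4, window=2)  # 16 tokens -> 4 per frame
sampled = temporal_sample(video, stride=2)               # 8 frames -> 4 frames

print(token_count(video), token_count(pooled_t), token_count(pooled_ts), token_count(sampled))
# prints: 128 32 8 64
```

The reduction ratios here follow only from the toy window and stride values and are not the paper's reported numbers. In the architecture described above, these operations would be applied to tokens that have already passed through the temporal encoder, so each surviving token is meant to carry information from neighboring frames.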

Problem

Research questions and friction points this paper is trying to address.

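Many existing Video-LLMs treat frames independently in the vision backbone and lack explicit temporal modeling, which limits their ability to capture dynamic patterns.

Long videos place heavy computational demands on the LLM.

Token reduction must lower cost without sacrificing key temporal information.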

Innovation

Methods, ideas, or system contributions that make the work stand out.

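Introduces STORM, which adds a dedicated temporal encoder between the image encoder and the LLM.

Uses the Mamba State Space Model to integrate temporal information into image tokens.

Applies token reduction (test-time sampling and training-based temporal and spatial pooling) to lower the computational cost on the LLM.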
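To give an intuition for how a state-space recurrence can carry information across frames, here is a minimal sketch of a fixed-parameter linear scan. It is a deliberate simplification and not Mamba: Mamba makes its parameters input-dependent (selective), whereas the hypothetical `decay` and `gain` values below are constants. The function name and the one-token-per-frame setup are also assumptions for illustration.

```python
from typing import List


def ssm_scan(tokens: List[List[float]], decay: float, gain: float) -> List[List[float]]:
    """Run h_t = decay * h_{t-1} + gain * x_t per channel and return every h_t."""
    dim = len(tokens[0])
    state = [0.0] * dim
    outputs = []
    for x in tokens:
        # The state mixes the current input with a decayed memory of earlier inputs.
        state = [decay * h + gain * v for h, v in zip(state, x)]
        outputs.append(list(state))
    return outputs


# One feature per frame (hypothetical): an event appears only in the first frame.
frames = [[1.0], [0.0], [0.0]]
enriched = ssm_scan(frames, decay=0.5, gain=1.0)
print(enriched)
# prints: [[1.0], [0.5], [0.25]]
```

Even though the second and third inputs are zero, their outputs still carry a trace of the first frame. This is the property that makes later token reduction less lossy: a token that survives pooling or sampling can still reflect what happened in frames that were dropped.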

🔎 Similar Papers

From Image to Video, what do we need in multimodal LLMs?