MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the token explosion induced in large multimodal models by long, high-density video inputs, this paper proposes the first frame-level spatiotemporal compression framework built on bidirectional state space models (SSMs). The method combines periodically inserted learnable query embeddings, bidirectional SSM blocks, gated skip connections, and learnable weighted-average pooling to achieve hierarchical downsampling, significantly reducing the token count while preserving semantic fidelity. Compared with Transformer-based compression approaches, the method achieves state-of-the-art performance across multiple long-video and dense-video understanding benchmarks, delivers substantial inference speedup, and generalizes well across diverse model architectures and video datasets. The experiments indicate that SSMs are uniquely effective for video feature compression and are not readily replaced by Transformers.
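The compression pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: a fixed-decay linear recurrence stands in for the learned Mamba-style SSM, and the `compress`/`ssm_scan` functions, query placement, and shapes are hypothetical.

```python
import numpy as np

def ssm_scan(x, a=0.9):
    # Toy linear state-space recurrence standing in for a learned SSM block:
    # h[t] = a * h[t-1] + x[t]. A real Mamba-style block uses learned,
    # input-dependent parameters instead of this fixed decay `a`.
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a * h + x[t]
        out[t] = h
    return out

def compress(frame_tokens, queries_per_frame=4):
    # frame_tokens: (T, P, D) — T frames, P patch tokens each, dim D.
    # Periodically insert a few query tokens (random placeholders here;
    # learnable embeddings in the paper), run a bidirectional scan, and
    # keep only the query positions as the compressed output.
    T, P, D = frame_tokens.shape
    rng = np.random.default_rng(0)
    q = rng.standard_normal((queries_per_frame, D))  # stand-in for learned queries
    seq = []
    for t in range(T):
        seq.append(frame_tokens[t])
        seq.append(q)            # queries inserted after each frame's tokens
    seq = np.concatenate(seq)    # (T * (P + Q), D)
    # Bidirectional fusion: forward scan plus a reversed backward scan.
    fused = ssm_scan(seq) + ssm_scan(seq[::-1])[::-1]
    # Gather only query positions: each frame yields Q compressed tokens.
    stride = P + queries_per_frame
    idx = np.concatenate([np.arange(t * stride + P, (t + 1) * stride)
                          for t in range(T)])
    return fused[idx]            # (T * Q, D)

tokens = np.random.default_rng(1).standard_normal((8, 16, 32))  # 8 frames x 16 patches
out = compress(tokens)
print(out.shape)  # (32, 32): 128 visual tokens reduced to 32, a 4x compression
```

Only the query positions survive, so the token budget handed to the language model scales with the number of queries rather than with frames times patches.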

📝 Abstract
We propose an efficient framework to compress multiple video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from long or dense videos. Our design leverages a bidirectional state-space-based block equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging long and dense video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing the overall token budget. Notably, replacing our proposed state-space block with a conventional Transformer results in substantial performance degradation, highlighting the advantages of state-space modeling for effectively compressing multi-frame video data. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.
Problem

Research questions and friction points this paper is trying to address.

Compress video-frame features to reduce token explosion in multimodal models
Enable hierarchical downsampling across spatial and temporal dimensions efficiently
Achieve competitive video understanding with reduced token budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional state-space blocks with gated skip connections
Learnable weighted-average pooling over periodically inserted learned queries
Hierarchical downsampling across spatial and temporal dimensions
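The gated skip connection and learnable weighted-average pooling listed above can be illustrated with a small NumPy sketch. The function names, shapes, and the `tanh` stand-in for the SSM block's output are assumptions made for illustration, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, block_out, W_g):
    # Gated skip connection: a learned gate decides, per position and
    # channel, how much of the block's output to mix with the raw input.
    g = sigmoid(x @ W_g)                 # (N, D) gate values in (0, 1)
    return g * block_out + (1.0 - g) * x

def weighted_avg_pool(x, w, group=4):
    # Learnable weighted-average pooling: softmax-normalized scalar weights
    # over each group of `group` consecutive tokens, generalizing plain
    # average pooling for hierarchical downsampling.
    N, D = x.shape
    x = x.reshape(N // group, group, D)
    s = np.exp(w) / np.exp(w).sum()      # (group,) softmax weights
    return (x * s[None, :, None]).sum(axis=1)  # (N // group, D)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
mixed = gated_skip(x, np.tanh(x), rng.standard_normal((8, 8)))
pooled = weighted_avg_pool(mixed, np.zeros(4))
print(pooled.shape)  # (4, 8): 16 tokens pooled down 4x
```

With the pooling logits `w` initialized to zeros, the softmax weights are uniform and the module reduces exactly to plain average pooling; training can then shift weight toward the more informative positions in each group.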