🤖 AI Summary
Transformer-based models face prohibitive computational overhead in hour-long video understanding due to quadratic complexity in sequence length, while existing token compression methods sacrifice spatiotemporal fidelity and still struggle with scalability.
Method: We propose a hybrid Mamba–Transformer architecture that eliminates token compression entirely. It integrates the linear-complexity Mamba-2 state space model with Transformer layers via a hybrid attention mechanism, enabling efficient encoding of more than 1024 high-resolution frames at native spatiotemporal resolution. Coupled with an end-to-end multimodal encoder and long-sequence optimization strategies, it supports hour-long video encoding on a single GPU.
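The complexity argument behind this design can be illustrated with a toy sketch (not the paper's implementation): a scalar state-space recurrence processes a sequence in O(L) time with O(1) state, whereas self-attention materializes an L×L score matrix. The parameters `a`, `b`, `c` below are hypothetical placeholders, not values from the paper.

```python
# Toy sketch: linear-time state-space scan vs. quadratic attention cost.
# Scalar SSM with hypothetical fixed parameters a, b, c.

def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """O(L) time, O(1) state: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def attention_score_count(seq_len):
    """Self-attention computes a seq_len x seq_len score matrix: O(L^2)."""
    return seq_len * seq_len

ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(ys)                            # impulse response decays as c*b*a^t
print(attention_score_count(1024))   # 1048576 pairwise scores for 1024 tokens
```

This is why the video-token count, not the attention window, becomes the dominant cost term: doubling the frame count doubles the scan cost but quadruples the attention cost.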
Results: Our method reduces training and inference memory consumption by at least 50% and accelerates per-step training throughput by nearly 2×. On LVBench, it achieves a 4.3% accuracy gain over prior efficient video LLMs, while maintaining strong generalization across both long- and short-video understanding tasks.
📝 Abstract
State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640×360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.