Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Transformer-based models face prohibitive computational overhead in hour-long video understanding due to the quadratic complexity of self-attention in sequence length, while existing token compression methods sacrifice spatiotemporal fidelity and still scale poorly. Method: We propose a hybrid Mamba–Transformer architecture that avoids token compression entirely. It integrates linear-complexity Mamba-2 state-space blocks with Transformer layers, enabling efficient encoding of more than 1024 high-resolution frames at native spatiotemporal resolution. Coupled with an end-to-end multimodal encoder and long-sequence optimization strategies, it supports hour-long video encoding on a single GPU. Results: Our method reduces training and inference memory consumption by at least 50% and nearly doubles per-step training throughput. On LVBench, it achieves a 4.3% accuracy gain over prior efficient video LLMs, while maintaining strong generalization across both long- and short-video understanding tasks.
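The efficiency argument above rests on swapping quadratic-cost attention for linear-cost state-space mixing over most of the stack. The toy sketch below illustrates that idea only; it is not the VAMBA implementation, and the layer ratio, the exponential-moving-average recurrence standing in for Mamba-2, and all function names are made up for illustration.

```python
import numpy as np

def ssm_mix(x, decay=0.9):
    # Toy linear-time sequence mixer: a causal exponential moving
    # average stands in for a Mamba-2-style state-space recurrence.
    # x: (seq_len, dim); cost is O(seq_len * dim).
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + (1.0 - decay) * xt
        out[t] = h
    return out

def attn_mix(x):
    # Toy causal self-attention: O(seq_len^2 * dim).
    scores = x @ x.T / np.sqrt(x.shape[1])
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def hybrid_forward(x, n_ssm=3, n_attn=1):
    # Interleave many linear-complexity mixers with a few attention
    # layers (the 3:1 ratio here is an arbitrary illustrative choice).
    for _ in range(n_ssm):
        x = x + ssm_mix(x)   # residual connection around each mixer
    for _ in range(n_attn):
        x = x + attn_mix(x)
    return x

tokens = np.random.default_rng(0).normal(size=(16, 8))
out = hybrid_forward(tokens)
print(out.shape)
```

Because only a minority of layers pay the quadratic attention cost, total compute and memory grow close to linearly in the number of video tokens, which is what lets frame counts scale far beyond what a pure-Transformer stack fits on one GPU.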

📝 Abstract
State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of causal self-attention, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction and build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640×360) on a single GPU, while transformer-based models can only encode 256 frames. On long video inputs, VAMBA achieves at least a 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.
Problem

Research questions and friction points this paper is trying to address.

Quadratic complexity of causal self-attention makes hour-long video inputs prohibitively expensive
Token compression reduces video tokens but loses information and stays inefficient for very long sequences
High GPU memory usage during training and inference limits how many frames fit on a single GPU
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-Transformer model (VAMBA) for video encoding without token reduction
Linear-complexity encoding of video tokens with Mamba-2 blocks
At least 50% lower GPU memory usage and nearly 2× faster training steps