🤖 AI Summary
Transformer-based models face prohibitive computational overhead in hour-long video understanding due to quadratic complexity in sequence length, while existing token compression methods sacrifice spatiotemporal fidelity and still struggle with scalability.
Method: We propose a hybrid Mamba–Transformer architecture that eliminates token compression entirely. It integrates the linear-complexity Mamba-2 state space model with Transformer layers via a hybrid attention mechanism, enabling efficient encoding of more than 1024 high-resolution frames at native spatiotemporal resolution. Coupled with an end-to-end multimodal encoder and long-sequence optimization strategies, it supports hour-long video encoding on a single GPU.
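The complexity argument behind this design can be illustrated with a toy sketch (not the paper's implementation): a scalar state-space recurrence processes a sequence in O(L) time with O(1) state, whereas self-attention materializes an L×L score matrix. The parameters `a`, `b`, `c` below are hypothetical placeholders, not values from the paper.

```python
# Toy sketch: linear-time state-space scan vs. quadratic attention cost.
# Scalar SSM with hypothetical fixed parameters a, b, c.

def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """O(L) time, O(1) state: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def attention_score_count(seq_len):
    """Self-attention computes a seq_len x seq_len score matrix: O(L^2)."""
    return seq_len * seq_len

ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(ys)                            # impulse response decays as c*b*a^t
print(attention_score_count(1024))   # 1048576 pairwise scores for 1024 tokens
```

This is why the video-token count, not the attention window, becomes the dominant cost term: doubling the frame count doubles the scan cost but quadruples the attention cost.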
Results: Our method reduces training and inference memory consumption by at least 50% and accelerates per-step training throughput by nearly 2×. On LVBench, it achieves a 4.3% accuracy gain over prior efficient video LLMs, while maintaining strong generalization across both long- and short-video understanding tasks.
📝 Abstract
State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640×360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.