ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of conventional Transformers in video captioning, which arises from modeling long visual sequences and complex temporal dependencies. To overcome this limitation, the authors propose a fully open-source multimodal large language model that replaces the standard attention mechanism with a deep state space model. Integrated with an aligned hierarchical bidirectional scanning module, the architecture efficiently captures long-range video context across multiple temporal resolutions while maintaining linear computational complexity in sequence length. Evaluated on standard benchmarks such as VATEX and MSR-VTT, the model generates high-quality captions and achieves a throughput three times higher than typical multimodal large language models, substantially enhancing both efficiency and scalability.
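The linear-complexity claim above comes from the state-space recurrence that Mamba-style models use in place of attention. A toy scalar version (a hypothetical simplification; real Mamba uses input-dependent, vector-valued parameters) makes the O(L) cost visible:

```python
def ssm_scan(x, a=0.5, b=1.0, c=1.0):
    """Minimal diagonal state-space scan: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    One pass over the sequence -> O(L) time, versus O(L^2) for full attention.
    Real Mamba makes (a, b, c) input-dependent ("selective") and multi-channel;
    this scalar sketch only illustrates the recurrence."""
    h = 0.0
    ys = []
    for xt in x:
        h = a * h + b * xt   # state update: constant work per step
        ys.append(c * h)     # readout
    return ys

ssm_scan([1.0, 1.0, 1.0])  # → [1.0, 1.5, 1.75]
```

Because each step touches only the current input and a fixed-size state, doubling the number of video tokens merely doubles the work, which is the source of the reported throughput gain.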
📝 Abstract
In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
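The abstract describes scanning videos bidirectionally across multiple temporal resolutions. The paper's exact module is not reproduced here, but the general idea can be sketched as follows: run a linear-time scan forward and backward over frame features, repeat at coarser (pooled) timescales, and fuse the results. The function names, the EMA stand-in for an SSM pass, and the pooling/fusion choices below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ema_scan(x, a=0.5):
    """Causal exponential-moving-average scan over frames: a linear-time
    stand-in for one directional SSM pass (illustrative simplification)."""
    y = np.empty_like(x)
    h = np.zeros(x.shape[1])
    for t in range(len(x)):
        h = a * h + (1.0 - a) * x[t]
        y[t] = h
    return y

def hierarchical_bidirectional_scan(x, levels=2):
    """Scan frame features forward and backward at several temporal
    resolutions (average-pooled by 2 per level), upsample each level
    back to full length, and sum."""
    L, d = x.shape
    out = np.zeros_like(x)
    cur = x
    for lvl in range(levels):
        fwd = ema_scan(cur)                  # forward-in-time pass
        bwd = ema_scan(cur[::-1])[::-1]      # backward-in-time pass
        y = fwd + bwd
        up = np.repeat(y, 2 ** lvl, axis=0)  # nearest-neighbour upsample
        if len(up) < L:                      # pad if pooling truncated a frame
            up = np.vstack([up, np.repeat(up[-1:], L - len(up), axis=0)])
        out += up[:L]
        n = len(cur) // 2
        if n == 0:
            break
        cur = cur[:2 * n].reshape(n, 2, d).mean(axis=1)  # pool frames by 2
    return out
```

Each level still costs linear time in its (halved) length, so the whole hierarchy stays O(L) while mixing both short-range and long-range temporal context.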
Problem

Research questions and friction points this paper is trying to address.

video captioning
multimodal large language models
computational complexity
temporal dependencies
sequence length
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba
Multimodal Large Language Model
Linear Complexity
Video Captioning
Hierarchical Bidirectional Scan
Daichi Yashima
Keio University, 3-14-1, Kohoku Ward, Yokohama, Kanagawa 223-8522, Japan
Shuhei Kurita
National Institute of Informatics
Deep Learning, Large Language Models, Computer Vision
Yusuke Oda
NAIST
Natural Language Processing, Machine Translation, Software Engineering
Shuntaro Suzuki
Keio University, 3-14-1, Kohoku Ward, Yokohama, Kanagawa 223-8522, Japan
Seitaro Otsuki
Keio University, 3-14-1, Kohoku Ward, Yokohama, Kanagawa 223-8522, Japan
Komei Sugiura
Professor, Keio University
Multimodal AI, Robot Learning, Embodied AI, Machine Learning