🤖 AI Summary
Existing video-language models (VLMs) suffer from prohibitive computational overhead, excessive GPU memory consumption, and inadequate joint understanding of actions, objects, and background, particularly for long-range temporal dependencies. To address these limitations, we propose MambaVL, the first VLM to integrate selective state space models (SSMs) into multimodal video-language modeling. MambaVL introduces a cross-modal shared state transition matrix that enables efficient, synergistic representation learning across vision and language modalities. Furthermore, we design a novel action-oriented structured question-answering pretraining paradigm that explicitly encodes ternary semantic relations among actions, objects, and environments. Through multi-stage question-driven alignment and joint video-language embedding optimization, MambaVL achieves state-of-the-art action recognition performance on Epic-Kitchens-100 and significantly outperforms Transformer-based and mainstream VLM baselines in action anticipation.
📝 Abstract
Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges such as quadratic computational complexity, high GPU memory usage, and difficulty modeling long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space models (SSMs) for modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.
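The shared-state-transition idea above can be illustrated with a minimal sketch: a diagonal selective SSM scan in which the state transition matrix `A` is shared between a vision branch and a language branch, while the input-dependent (selective) parameters remain per-modality. All names, shapes, and the random initialization here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_model = 4, 3  # state size N and channel width D (toy values)

# Shared state transition matrix A (diagonal parameterization, kept
# negative for stability) -- used by BOTH modality branches, which is
# the cross-modal sharing described in the abstract.
A = -np.exp(rng.standard_normal((d_model, d_state)))

def selective_scan(x, A, W_dt, W_B, W_C):
    """Run a diagonal selective SSM over a sequence x of shape (T, d_model)."""
    h = np.zeros((d_model, d_state))
    ys = []
    for x_t in x:
        dt = np.logaddexp(0.0, x_t @ W_dt)   # softplus -> positive step sizes, (d_model,)
        B = x_t @ W_B                        # input-dependent input matrix, (d_state,)
        C = x_t @ W_C                        # input-dependent output matrix, (d_state,)
        # Discretize A with dt, then update the hidden state and read it out.
        h = np.exp(dt[:, None] * A) * h + (dt * x_t)[:, None] * B[None, :]
        ys.append(h @ C)
    return np.asarray(ys)

def branch_params():
    # Per-modality selective projections (hypothetical; only A is shared).
    return (rng.standard_normal((d_model, d_model)) * 0.1,
            rng.standard_normal((d_model, d_state)) * 0.1,
            rng.standard_normal((d_model, d_state)) * 0.1)

vision_tokens = rng.standard_normal((10, d_model))    # e.g. frame features
language_tokens = rng.standard_normal((6, d_model))   # e.g. question tokens

y_vis = selective_scan(vision_tokens, A, *branch_params())
y_lang = selective_scan(language_tokens, A, *branch_params())
print(y_vis.shape, y_lang.shape)  # (10, 3) (6, 3)
```

Because both branches evolve under the same `A`, their hidden states share one temporal dynamics model, which is one plausible reading of how the architecture ties action information across modalities.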