Mamba Fusion: Learning Actions Through Questioning

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-language models (VLMs) suffer from prohibitive computational overhead, excessive GPU memory consumption, and inadequate joint understanding of actions, objects, and background—particularly for long-range temporal dependencies. To address these limitations, we propose MambaVL, the first VLM to integrate selective state space models (SSMs) into multimodal video-language modeling. MambaVL introduces a cross-modal shared state transition matrix that enables efficient, synergistic representation learning across vision and language modalities. Furthermore, we design a novel action-oriented structured question-answering pretraining paradigm that explicitly encodes ternary semantic relations among actions, objects, and environments. Through multi-stage question-driven alignment and joint video-language embedding optimization, MambaVL achieves state-of-the-art action recognition performance on Epic-Kitchens-100 and significantly outperforms Transformer-based and mainstream VLM baselines in action prediction.

📝 Abstract
Video Language Models (VLMs) are crucial for generalizing across diverse tasks and for using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges such as quadratic computational complexity, high GPU memory usage, and difficulty modeling long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.
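The abstract's central idea is a state space model whose state transition matrix is shared between the vision and language streams, while input/output projections stay modality-specific. The sketch below illustrates that idea with a toy *linear* SSM recurrence; it is not the paper's actual parameterization (real Mamba blocks use selective, input-dependent parameters and discretization), and all dimensions, the stable transition matrix `A`, and the additive fusion step are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a simple linear state-space recurrence over a sequence.

    h_t = A @ h_{t-1} + B @ x_t,   y_t = C @ h_t
    x: (T, d_in) sequence; returns outputs of shape (T, d_out).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_vis, d_txt, d_out, T = 8, 16, 12, 4, 10

# Shared state transition matrix A couples the two modalities' temporal
# dynamics; the projections B and C remain modality-specific (assumption:
# a toy stable diagonal A, not the paper's learned selective parameters).
A = 0.9 * np.eye(d_state)
B_vis, C_vis = rng.normal(size=(d_state, d_vis)), rng.normal(size=(d_out, d_state))
B_txt, C_txt = rng.normal(size=(d_state, d_txt)), rng.normal(size=(d_out, d_state))

video_feats = rng.normal(size=(T, d_vis))  # placeholder frame features
text_feats = rng.normal(size=(T, d_txt))   # placeholder token features

y_vis = ssm_scan(video_feats, A, B_vis, C_vis)
y_txt = ssm_scan(text_feats, A, B_txt, C_txt)
fused = y_vis + y_txt  # simple additive fusion of the two streams (assumption)
print(fused.shape)     # (10, 4)
```

Because both streams evolve under the same `A`, temporal information propagates with identical dynamics in each branch, which is one way to read the abstract's claim of learning joint representations across modalities.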
Problem

Research questions and friction points this paper is trying to address.

Video Language Models
Long-Term Information Processing
Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

MambaVL
perspective-taking mechanism
action recognition
Zhikang Dong
Georgia Institute of Technology, Stony Brook University
Apoorva Beedu
Georgia Institute of Technology
Jason Sheinkopf
Georgia Institute of Technology
Irfan Essa
Distinguished Professor of Computing, Georgia Tech / Research Scientist, Google
Computer Vision · Artificial Intelligence · Machine Learning · Computer Graphics · Robotics