🤖 AI Summary
Existing video-language models (VLMs) suffer from prohibitive computational overhead, excessive GPU memory consumption, and inadequate joint understanding of actions, objects, and background, particularly for long-range temporal dependencies. To address these limitations, we propose MambaVL, the first VLM to integrate selective state space models (SSMs) into multimodal video-language modeling. MambaVL introduces a cross-modal shared state transition matrix that enables efficient, synergistic representation learning across vision and language modalities. Furthermore, we design a novel action-oriented structured question-answering pretraining paradigm that explicitly encodes ternary semantic relations among actions, objects, and environments. Through multi-stage question-driven alignment and joint video-language embedding optimization, MambaVL achieves state-of-the-art action recognition performance on Epic-Kitchens-100 and significantly outperforms Transformer-based and mainstream VLM baselines in action anticipation.
📝 Abstract
Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges such as quadratic computational complexity, high GPU memory usage, and difficulty modeling long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space models (SSMs) for modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.
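The shared-state-transition idea above can be illustrated with a minimal sketch: a diagonal selective SSM scan in which the state transition matrix `A` is shared between a vision branch and a language branch, while the input-dependent (selective) parameters remain per-modality. All names, shapes, and the random initialization here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_model = 4, 3  # state size N and channel width D (toy values)

# Shared state transition matrix A (diagonal parameterization, kept
# negative for stability) -- used by BOTH modality branches, which is
# the cross-modal sharing described in the abstract.
A = -np.exp(rng.standard_normal((d_model, d_state)))

def selective_scan(x, A, W_dt, W_B, W_C):
    """Run a diagonal selective SSM over a sequence x of shape (T, d_model)."""
    h = np.zeros((d_model, d_state))
    ys = []
    for x_t in x:
        dt = np.logaddexp(0.0, x_t @ W_dt)   # softplus -> positive step sizes, (d_model,)
        B = x_t @ W_B                        # input-dependent input matrix, (d_state,)
        C = x_t @ W_C                        # input-dependent output matrix, (d_state,)
        # Discretize A with dt, then update the hidden state and read it out.
        h = np.exp(dt[:, None] * A) * h + (dt * x_t)[:, None] * B[None, :]
        ys.append(h @ C)
    return np.asarray(ys)

def branch_params():
    # Per-modality selective projections (hypothetical; only A is shared).
    return (rng.standard_normal((d_model, d_model)) * 0.1,
            rng.standard_normal((d_model, d_state)) * 0.1,
            rng.standard_normal((d_model, d_state)) * 0.1)

vision_tokens = rng.standard_normal((10, d_model))    # e.g. frame features
language_tokens = rng.standard_normal((6, d_model))   # e.g. question tokens

y_vis = selective_scan(vision_tokens, A, *branch_params())
y_lang = selective_scan(language_tokens, A, *branch_params())
print(y_vis.shape, y_lang.shape)  # (10, 3) (6, 3)
```

Because both branches evolve under the same `A`, their hidden states share one temporal dynamics model, which is one plausible reading of how the architecture ties action information across modalities.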