H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of modeling long-term temporal dependencies and multi-granularity motion dynamics in multimodal video understanding for autonomous driving under dynamic, complex scenarios, this paper proposes a hierarchical Mamba adaptation framework. The method introduces a cooperative dual-module mechanism, Context Mamba (C-Mamba) and Query Mamba (Q-Mamba), to enable adaptive integration of context across temporal resolutions. Leveraging structured state space models (SSMs), it constructs a hierarchical Mamba architecture and adopts a plug-and-play paradigm to integrate seamlessly with multimodal large language models (MLLMs). Evaluated on risk object detection, the framework achieves a 5.5% improvement in mean Intersection-over-Union (mIoU) over the prior state of the art, demonstrating its effectiveness in jointly modeling dynamic video semantics and motion-semantic coupling.

📝 Abstract
With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos in such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; e.g., for risk object detection, it outperforms the previous SOTA method with a 5.5% mIoU improvement.
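The abstract describes a two-stage data flow: C-Mamba summarizes the frame sequence at several temporal resolutions, and Q-Mamba uses the current frame as a query to attentively select from those multi-scale contexts. The toy NumPy sketch below illustrates only this data flow, using strided mean pooling and softmax attention as hypothetical stand-ins for the actual Mamba/SSM blocks; the shapes, strides, and pooling choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 16, 8                        # frames, feature dim (illustrative)
frames = rng.normal(size=(T, D))    # per-frame video features

# C-Mamba stand-in: build multi-granularity contexts by summarizing the
# sequence at several temporal resolutions (here: strided mean pooling,
# NOT the paper's SSM blocks).
def temporal_context(x, stride):
    # average every `stride` consecutive frames -> one coarser temporal view
    n = x.shape[0] // stride
    return x[: n * stride].reshape(n, stride, -1).mean(axis=1)

contexts = [temporal_context(frames, s) for s in (1, 2, 4)]  # fine -> coarse
bank = np.concatenate(contexts, axis=0)   # all multi-scale context tokens

# Q-Mamba stand-in: treat the current frame as the query and attentively
# select from the multi-scale context bank (scaled dot-product attention).
query = frames[-1]                        # current frame as the query
scores = bank @ query / np.sqrt(D)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over context tokens
fused = weights @ bank                    # adaptively integrated context

print(fused.shape)                        # (8,): one enhanced feature vector
```

In the paper, the fused feature would then be fed to the MLLM through the plug-and-play adapter; here the attention step merely shows how one query can weight contexts from all temporal scales at once.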
Problem

Research questions and friction points this paper is trying to address.

Autonomous Driving
Complex Dynamic Changes
Environmental Prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

H-MBA
Multi-Timescale Information
Risk Object Recognition
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30
Siran Chen
University of Chinese Academy of Sciences
semiconductor, AI model
Yuxiao Luo
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; The Hong Kong Polytechnic University, Hong Kong, China
Yue Ma
Bytedance
NLP, Dialogue System, LLM
Yu Qiao
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Yali Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China