🤖 AI Summary
To address the challenge of modeling long-term temporal dependencies and multi-granularity motion dynamics in multimodal video understanding for autonomous driving under dynamic, complex scenarios, this paper proposes a hierarchical Mamba adaptation framework. The method introduces a cooperative dual-module mechanism, Context Mamba (C-Mamba) and Query Mamba (Q-Mamba), to enable adaptive integration of context across temporal resolutions. Leveraging structured state space models (SSMs), it constructs a hierarchical Mamba architecture and adopts a plug-and-play paradigm to integrate seamlessly with multimodal large language models (MLLMs). Evaluated on risk object detection, the framework achieves a 5.5% improvement in mean Intersection-over-Union (mIoU) over the prior state of the art, demonstrating its effectiveness in jointly modeling dynamic video semantics and motion-semantic coupling.
📝 Abstract
With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos in such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate video contexts across multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; e.g., for risk object detection, it outperforms the previous SOTA method with a 5.5% mIoU improvement.
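The two-stage design described above (multi-granularity temporal scans, then query-based fusion) can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the feature dimension, the temporal strides, and the plain linear state-space recurrence below are all illustrative assumptions, whereas the actual C-Mamba/Q-Mamba modules use selective (Mamba-style) SSM blocks inside an MLLM.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # feature dimension (hypothetical)
T = 16  # number of video frames

# Placeholder per-frame visual features.
frames = rng.normal(size=(T, d))

# Toy linear SSM scan: h_t = A h_{t-1} + B x_t, context = C h_T.
# A stand-in for a Mamba block; the real model uses selective SSMs.
A = 0.9 * np.eye(d)
B = np.eye(d)
C = np.eye(d)

def ssm_scan(xs):
    h = np.zeros(d)
    for x in xs:
        h = A @ h + B @ x
    return C @ h

# C-Mamba-like step: scan the video at several temporal strides to get
# multi-granularity contexts (stride 1 = fine motion, stride 4 = coarse).
strides = [1, 2, 4]
contexts = np.stack([ssm_scan(frames[::s]) for s in strides])  # (3, d)

# Q-Mamba-like step: treat the current frame as a query and
# softmax-attend over the multi-granularity contexts to fuse them.
query = frames[-1]
scores = contexts @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
fused = weights @ contexts  # adaptive multi-scale temporal context, shape (d,)
```

The fused vector plays the role of the enhanced video representation that, in the actual framework, is fed into the MLLM for downstream tasks such as risk object detection.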