Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking

📅 2025-06-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Vision Transformer (ViT)-based methods for RGB-Event multimodal tracking suffer from high computational overhead and weak cross-modal interaction. To address these issues, this paper proposes Mamba-FETrack V2, an efficient multimodal tracking framework built upon the Vision Mamba architecture. Its core contributions are: (1) a lightweight prompt generator coupled with a shared prompt pool that dynamically produces modality-specific, learnable prompt vectors to enable prompt-guided cross-modal feature fusion; and (2) a linear-complexity FEMamba backbone that jointly performs cross-modal feature extraction, interaction, and fusion in a unified manner. Evaluated on standard benchmarks, including COESOT, FE108, and FELT V2, the method achieves state-of-the-art accuracy while reducing floating-point operations (FLOPs) by approximately 62% compared to prior ViT-based approaches, thereby delivering both superior performance and high computational efficiency.
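To make the prompt-pool idea concrete, here is a minimal PyTorch-style sketch of how a lightweight generator might draw modality-specific prompts from a shared learnable pool. This is one plausible reading, not the paper's released code; every name here (`PromptGenerator`, `pool_size`, `num_prompts`) is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGenerator(nn.Module):
    """Hypothetical sketch: produce modality-specific prompts from a
    shared learnable prompt pool, conditioned on the modality's tokens."""
    def __init__(self, dim: int = 256, pool_size: int = 16, num_prompts: int = 4):
        super().__init__()
        # Shared pool of learnable prompt vectors (reused by both modalities).
        self.pool = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        # Lightweight projection turning tokens into a pool query.
        self.query = nn.Linear(dim, dim)
        self.num_prompts = num_prompts

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) embedded tokens of one modality (RGB or event).
        q = self.query(feats.mean(dim=1))            # (B, dim) global descriptor
        attn = F.softmax(q @ self.pool.t(), dim=-1)  # (B, pool_size) pool weights
        prompt = attn @ self.pool                    # (B, dim) mixture of pool entries
        # Repeat the mixture as a short prompt sequence for the backbone.
        return prompt.unsqueeze(1).expand(-1, self.num_prompts, -1)

rgb_feats = torch.randn(2, 196, 256)   # dummy RGB tokens
evt_feats = torch.randn(2, 196, 256)   # dummy event tokens
gen = PromptGenerator()
rgb_prompt, evt_prompt = gen(rgb_feats), gen(evt_feats)
print(rgb_prompt.shape)  # torch.Size([2, 4, 256])
```

Because the pool parameters are shared while the query comes from each modality's own tokens, the same pool yields different prompt mixtures for RGB and event inputs, which matches the "shared pool, modality-specific prompts" description above.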

📝 Abstract
Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including the short-term COESOT dataset and the long-term FE108 and FELT V2 datasets, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack
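The abstract outlines a three-stage pipeline: prompt generation, a unified FEMamba backbone, and a tracking head. The sketch below wires those stages together under the same assumptions as the prompt-generator sketch above. The Mamba block is stubbed with a simple residual mixer, since a real backbone would come from a state-space-model implementation such as the mamba-ssm package; all module names and shapes are hypothetical.

```python
import torch
import torch.nn as nn

class MambaBlockStub(nn.Module):
    """Placeholder standing in for a Vision Mamba (SSM) block."""
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.mix(x)  # token mixing with a residual connection

class MambaFETrackV2Sketch(nn.Module):
    """Hypothetical end-to-end wiring: prompts + tokens -> backbone -> head."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.backbone = nn.ModuleList([MambaBlockStub(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 4)  # toy box-regression head (x, y, w, h)

    def forward(self, rgb_tokens, evt_tokens, rgb_prompt, evt_prompt):
        # Concatenate prompts with both modalities' tokens into one sequence,
        # so the linear-time backbone handles extraction, interaction, and
        # fusion jointly, as the abstract describes.
        x = torch.cat([rgb_prompt, rgb_tokens, evt_prompt, evt_tokens], dim=1)
        for blk in self.backbone:
            x = blk(x)
        fused = x.mean(dim=1)     # pooled fused representation
        return self.head(fused)   # (B, 4) predicted target box

model = MambaFETrackV2Sketch()
box = model(torch.randn(2, 196, 256), torch.randn(2, 196, 256),
            torch.randn(2, 4, 256), torch.randn(2, 4, 256))
print(box.shape)  # torch.Size([2, 4])
```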
Problem

Research questions and friction points this paper is trying to address.

Combining RGB and event cameras for robust object tracking
Reducing computational overhead in multimodal tracking algorithms
Enhancing cross-modal interaction efficiency in visual tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Prompt Generator for dynamic prompts
Vision Mamba-based FEMamba backbone for fusion
Linear-complexity Vision Mamba network for efficiency (see the sketch after this list)
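The efficiency claim in the last item rests on the state space model's recurrent form: each token updates a fixed-size hidden state exactly once, so cost grows linearly with sequence length, unlike the quadratic token-pair interactions of self-attention. Below is a minimal diagonal SSM scan illustrating that recurrence; real Mamba additionally makes A, B, and C input-dependent ("selective") and uses a parallel scan kernel, so this is a conceptual sketch only.

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space recurrence (the mechanism behind
    Mamba's linear complexity), not the paper's actual kernel:
        h_t = A * h_{t-1} + B * x_t,    y_t = C . h_t
    One pass over the sequence => O(N) in sequence length N."""
    seq_len, d_state = x.shape[0], A.shape[0]
    h = torch.zeros(d_state)
    ys = []
    for t in range(seq_len):
        h = A * h + B * x[t]        # elementwise update (diagonal A)
        ys.append(torch.dot(C, h))  # scalar readout of the hidden state
    return torch.stack(ys)

x = torch.randn(10)                # toy 1-D token sequence
A = torch.full((8,), 0.9)          # stable diagonal transition
B = torch.randn(8) * 0.1
C = torch.randn(8)
print(ssm_scan(x, A, B, C).shape)  # torch.Size([10])
```

Doubling the sequence length doubles the loop count but leaves the per-step cost fixed, which is the linear-complexity property the Innovation list refers to.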
👥 Authors
Shiao Wang · Anhui University · Deep Learning
Ju Huang · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Qingchuan Ma · Anhui University · LLM
Jinfeng Gao · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Chunyi Xu · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Xiao Wang · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Lan Chen · Communication University of China · Image/Video generation and editing
Bo Jiang · School of Computer Science and Technology, Anhui University, Hefei 230601, China