🤖 AI Summary
Existing Vision Transformer (ViT)-based methods for RGB-Event multimodal tracking suffer from high computational overhead and weak cross-modal interaction. To address these issues, this paper proposes Mamba-FETrack V2, an efficient multimodal tracking framework built upon the linear-complexity Vision Mamba architecture. Its core contributions are: (1) a lightweight Prompt Generator coupled with a shared prompt pool that dynamically produces modality-specific, learnable prompt vectors to enable prompt-guided cross-modal feature fusion; and (2) a linear-complexity FEMamba backbone that jointly performs cross-modal feature extraction, interaction, and fusion in a unified manner. Evaluated on standard benchmarks, including COESOT, FE108, and FELT V2, the framework achieves state-of-the-art accuracy while reducing floating-point operations (FLOPs) by approximately 62% compared with prior ViT-based approaches, delivering both superior performance and high computational efficiency.
📝 Abstract
Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only incurs substantial computational overhead but also limits the effectiveness of cross-modal interaction. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that uses the embedded features of each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which performs prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experiments on multiple RGB-Event tracking benchmarks, including the short-term COESOT dataset and the long-term FE108 and FELT V2 datasets, demonstrate the superior performance and efficiency of the proposed framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack.
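The abstract does not give the Prompt Generator's internals, but its described dataflow (per-modality embedded features queried against a shared prompt pool to yield modality-specific prompt vectors, which are then concatenated with the features before the backbone) can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions and a simple attention-style pooling; all names, sizes, and the pooling rule are hypothetical, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the abstract).
num_tokens, dim = 64, 32   # embedded tokens per modality, channel width
pool_size, n_prompts = 8, 4

# Shared prompt pool, reused by both the RGB and Event branches.
prompt_pool = rng.standard_normal((pool_size, dim))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def generate_prompts(modality_tokens, n_prompts=n_prompts):
    """Sketch of a lightweight prompt generator: pool the modality's
    embedded tokens into a query, attend over the shared prompt pool,
    and emit n_prompts modality-specific prompt vectors."""
    query = modality_tokens.mean(axis=0)        # (dim,) global descriptor
    scores = softmax(prompt_pool @ query)       # (pool_size,) relevance weights
    prompt = scores @ prompt_pool               # (dim,) weighted pool mixture
    # The real generator would learn distinct prompts; here we simply tile.
    return np.tile(prompt, (n_prompts, 1))      # (n_prompts, dim)

rgb_tokens = rng.standard_normal((num_tokens, dim))
event_tokens = rng.standard_normal((num_tokens, dim))

# Prompts are concatenated with the embedded features before the backbone.
rgb_input = np.concatenate([generate_prompts(rgb_tokens), rgb_tokens], axis=0)
event_input = np.concatenate([generate_prompts(event_tokens), event_tokens], axis=0)
print(rgb_input.shape, event_input.shape)  # (68, 32) (68, 32)
```

Because the pool is shared while the queries are modality-specific, the two branches draw on common learnable context yet receive different prompts, which is the interaction mechanism the abstract attributes to prompt-guided fusion.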