Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking

📅 2025-06-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Vision Transformer (ViT)-based methods for RGB-Event multimodal tracking suffer from high computational overhead and weak cross-modal interaction. To address these issues, this paper proposes Mamba-FETrack V2, an efficient multimodal tracking framework built upon the Vision Mamba architecture. Its core contributions are: (1) a lightweight prompt generator coupled with a shared prompt pool that dynamically produces modality-specific, learnable prompt vectors to enable prompt-guided cross-modal feature fusion; and (2) a linear-complexity FEMamba backbone that jointly performs cross-modal feature extraction, interaction, and fusion in a unified manner. Evaluated on standard benchmarks, including COESOT, FE108, and FELT V2, the method achieves state-of-the-art accuracy while reducing floating-point operations (FLOPs) by approximately 62% compared to prior ViT-based approaches, thereby delivering both superior performance and high computational efficiency.
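To make the prompt-pool idea concrete, here is a minimal PyTorch-style sketch of how a lightweight generator might draw modality-specific prompts from a shared learnable pool. This is one plausible reading, not the paper's released code; every name here (`PromptGenerator`, `pool_size`, `num_prompts`) is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGenerator(nn.Module):
    """Hypothetical sketch: produce modality-specific prompts from a
    shared learnable prompt pool, conditioned on the modality's tokens."""
    def __init__(self, dim: int = 256, pool_size: int = 16, num_prompts: int = 4):
        super().__init__()
        # Shared pool of learnable prompt vectors (reused by both modalities).
        self.pool = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        # Lightweight projection turning tokens into a pool query.
        self.query = nn.Linear(dim, dim)
        self.num_prompts = num_prompts

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) embedded tokens of one modality (RGB or event).
        q = self.query(feats.mean(dim=1))            # (B, dim) global descriptor
        attn = F.softmax(q @ self.pool.t(), dim=-1)  # (B, pool_size) pool weights
        prompt = attn @ self.pool                    # (B, dim) mixture of pool entries
        # Repeat the mixture as a short prompt sequence for the backbone.
        return prompt.unsqueeze(1).expand(-1, self.num_prompts, -1)

rgb_feats = torch.randn(2, 196, 256)   # dummy RGB tokens
evt_feats = torch.randn(2, 196, 256)   # dummy event tokens
gen = PromptGenerator()
rgb_prompt, evt_prompt = gen(rgb_feats), gen(evt_feats)
print(rgb_prompt.shape)  # torch.Size([2, 4, 256])
```

Because the pool parameters are shared while the query comes from each modality's own tokens, the same pool yields different prompt mixtures for RGB and event inputs, which matches the "shared pool, modality-specific prompts" description above.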

📝 Abstract
Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including the short-term COESOT dataset and the long-term FE108 and FELT V2 datasets, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack
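The abstract outlines a three-stage pipeline: prompt generation, a unified FEMamba backbone, and a tracking head. The sketch below wires those stages together under the same assumptions as the prompt-generator sketch above. The Mamba block is stubbed with a simple residual mixer, since a real backbone would come from a state-space-model implementation such as the mamba-ssm package; all module names and shapes are hypothetical.

```python
import torch
import torch.nn as nn

class MambaBlockStub(nn.Module):
    """Placeholder standing in for a Vision Mamba (SSM) block."""
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.mix(x)  # token mixing with a residual connection

class MambaFETrackV2Sketch(nn.Module):
    """Hypothetical end-to-end wiring: prompts + tokens -> backbone -> head."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.backbone = nn.ModuleList([MambaBlockStub(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 4)  # toy box-regression head (x, y, w, h)

    def forward(self, rgb_tokens, evt_tokens, rgb_prompt, evt_prompt):
        # Concatenate prompts with both modalities' tokens into one sequence,
        # so the linear-time backbone handles extraction, interaction, and
        # fusion jointly, as the abstract describes.
        x = torch.cat([rgb_prompt, rgb_tokens, evt_prompt, evt_tokens], dim=1)
        for blk in self.backbone:
            x = blk(x)
        fused = x.mean(dim=1)     # pooled fused representation
        return self.head(fused)   # (B, 4) predicted target box

model = MambaFETrackV2Sketch()
box = model(torch.randn(2, 196, 256), torch.randn(2, 196, 256),
            torch.randn(2, 4, 256), torch.randn(2, 4, 256))
print(box.shape)  # torch.Size([2, 4])
```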
Problem

Research questions and friction points this paper is trying to address.

Combining RGB and event cameras for robust object tracking
Reducing computational overhead in multimodal tracking algorithms
Enhancing cross-modal interaction efficiency in visual tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Prompt Generator for dynamic prompts
Vision Mamba-based FEMamba backbone for fusion
Linear-complexity Vision Mamba network for efficiency (see the sketch after this list)
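The efficiency claim in the last item rests on the state space model's recurrent form: each token updates a fixed-size hidden state exactly once, so cost grows linearly with sequence length, unlike the quadratic token-pair interactions of self-attention. Below is a minimal diagonal SSM scan illustrating that recurrence; real Mamba additionally makes A, B, and C input-dependent ("selective") and uses a parallel scan kernel, so this is a conceptual sketch only.

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space recurrence (the mechanism behind
    Mamba's linear complexity), not the paper's actual kernel:
        h_t = A * h_{t-1} + B * x_t,    y_t = C . h_t
    One pass over the sequence => O(N) in sequence length N."""
    seq_len, d_state = x.shape[0], A.shape[0]
    h = torch.zeros(d_state)
    ys = []
    for t in range(seq_len):
        h = A * h + B * x[t]        # elementwise update (diagonal A)
        ys.append(torch.dot(C, h))  # scalar readout of the hidden state
    return torch.stack(ys)

x = torch.randn(10)                # toy 1-D token sequence
A = torch.full((8,), 0.9)          # stable diagonal transition
B = torch.randn(8) * 0.1
C = torch.randn(8)
print(ssm_scan(x, A, B, C).shape)  # torch.Size([10])
```

Doubling the sequence length doubles the loop count but leaves the per-step cost fixed, which is the linear-complexity property the Innovation list refers to.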
👥 Authors
Shiao Wang · Anhui University · Deep Learning
Ju Huang · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Qingchuan Ma · Anhui University · LLM
Jinfeng Gao · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Chunyi Xu · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Xiao Wang · School of Computer Science and Technology, Anhui University, Hefei 230601, China
Lan Chen · Communication University of China · Image/Video generation and editing
Bo Jiang · School of Computer Science and Technology, Anhui University, Hefei 230601, China