UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing general-purpose multimodal tracking methods struggle to effectively model spatiotemporal cues, limiting cross-modal fusion and tracking performance. This work proposes UBATrack, a novel framework that introduces the Mamba state space model into multimodal tracking for the first time. By leveraging a lightweight Spatio-temporal Mamba Adapter and a Dynamic Multi-modal Feature Mixer within an adapter-based fine-tuning paradigm, UBATrack jointly captures long-range spatiotemporal dependencies and cross-modal interactions without requiring full-model fine-tuning. This approach significantly enhances training efficiency and generalization capability. Extensive experiments demonstrate that UBATrack consistently outperforms state-of-the-art methods across multiple benchmarks, including LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent, achieving leading tracking performance.

📝 Abstract
Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth, or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a Mamba-style state space model, termed UBATrack. Our UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.
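The adapter-tuning pattern described in the abstract — a lightweight, trainable branch wrapped around a frozen backbone, with a state-space scan providing long-range sequence mixing — can be sketched minimally as below. All names (`ssm_scan`, `mamba_adapter`), shapes, and parameter choices here are illustrative assumptions, not the paper's actual STMA design; in particular, real Mamba uses input-dependent (selective) SSM parameters, while this sketch uses fixed per-channel ones.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal linear state-space scan (illustrative only).

    Recurrence per channel: h_t = A * h_{t-1} + B * x_t ;  y_t = C * h_t
    x: (T, d) token sequence; A, B, C: (d,) per-channel parameters.
    """
    T, d = x.shape
    h = np.zeros(d)
    ys = np.empty((T, d))
    for t in range(T):
        h = A * h + B * x[t]   # recurrent state carries long-range context
        ys[t] = C * h          # per-channel readout
    return ys

def mamba_adapter(tokens, W_down, A, B, C, W_up):
    """Adapter-style residual branch around frozen backbone features.

    Down-project to a low-rank bottleneck, mix the sequence with the SSM
    scan, up-project, and add back residually. Only the adapter weights
    (W_down, A, B, C, W_up) would be trained; the backbone stays frozen.
    """
    z = tokens @ W_down        # (T, r) low-rank bottleneck
    z = ssm_scan(z, A, B, C)   # sequence mixing via the state-space scan
    return tokens + z @ W_up   # residual add; zero-init W_up => identity
```

A common adapter trick (also assumed here, not sourced from the paper) is zero-initializing the up-projection so the adapted model starts out exactly equal to the frozen backbone, which stabilizes fine-tuning.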
Problem

Research questions and friction points this paper is trying to address.

multi-modal tracking
spatio-temporal cues
state space model
object tracking
cross-modal dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

state space model
multi-modal tracking
Mamba
spatio-temporal modeling
adapter tuning
Qihua Liang
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China; University Engineering Research Center of Educational Intelligent Technology, Guangxi Normal University, Guilin 541004, China
Liang Chen
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China; University Engineering Research Center of Educational Intelligent Technology, Guangxi Normal University, Guilin 541004, China
Yaozong Zheng
Guangxi Normal University
Visual Tracking; Multimodal Tracking
Jian Nong
Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Wuzhou 543002, China
Zhiyi Mo
Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Wuzhou 543002, China
Bineng Zhong
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China; University Engineering Research Center of Educational Intelligent Technology, Guangxi Normal University, Guilin 541004, China