🤖 AI Summary
This work addresses the challenge of developing a unified, modality-agnostic video tracker capable of handling diverse input modalities—including RGB, RGB-thermal, RGB-depth, and RGB-event—within a single architecture with shared parameters. Methodologically, it introduces video-level sampling and online dense temporal token propagation to jointly model appearance and motion dynamics; designs gated perceivers for adaptive cross-modal representation fusion with parameter sharing; and adopts a one-stage, end-to-end training paradigm. The resulting framework enables "train-once, infer-many" across tracking tasks. Evaluated on visible-light and multimodal benchmarks (VTUAV, RGBT234, GTOT), it achieves state-of-the-art performance, demonstrating superior generalization, inference efficiency, and training scalability, while substantially reducing modeling and optimization complexity in multimodal visual tracking.
📝 Abstract
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called ODTrack). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, using the same model architecture and parameters. Specifically, our model is designed around three core goals. **Video-level Sampling**: we expand the model's inputs to the video-sequence level, allowing it to see a richer video context from a near-global perspective. **Video-level Association**: we introduce two simple yet effective online dense temporal token association mechanisms that propagate the target's appearance and motion-trajectory information in a video-stream manner. **Modality Scalable**: we propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and then compress them into a single set of model parameters through one-shot training for multi-task inference. This new solution brings the following benefits: (i) the purified token sequences can serve as temporal prompts for inference on subsequent video frames, whereby previous information is leveraged to guide future inference; (ii) unlike multi-modal trackers that require independent training per modality, our one-shot training scheme not only alleviates the training burden but also improves the model's representation. Extensive experiments on visible and multi-modal benchmarks show that ODTrack achieves new state-of-the-art (SOTA) performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
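The two core mechanisms in the abstract—gated cross-modal fusion and temporal-token prompting—can be sketched roughly as follows. This is a toy NumPy illustration, not the paper's actual design: the shapes, the random weights, the sigmoid-gate formulation, and the mean-pooling compression of tokens into a temporal prompt are all our own simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(rgb_tokens, aux_tokens, W, b):
    """Illustrative gated fusion: a learned gate decides, per token and
    channel, how much of the auxiliary modality (thermal/depth/event)
    to blend into the RGB representation."""
    gate = sigmoid(np.concatenate([rgb_tokens, aux_tokens], axis=-1) @ W + b)
    return gate * rgb_tokens + (1.0 - gate) * aux_tokens

# Toy shapes: N tokens with C channels each (hypothetical values).
N, C = 4, 8
rgb = rng.standard_normal((N, C))
aux = rng.standard_normal((N, C))
W = rng.standard_normal((2 * C, C)) * 0.1  # untrained stand-in weights
b = np.zeros(C)

fused = gated_fusion(rgb, aux, W, b)

# Temporal token propagation, conceptually: tokens distilled from frame t
# are prepended as a temporal prompt to the input of frame t+1, so past
# appearance/motion information guides future inference.
temporal_prompt = fused.mean(axis=0, keepdims=True)  # crude 1-token compression
next_frame_input = np.concatenate([temporal_prompt, rgb], axis=0)
```

In a real transformer tracker the gate and fusion weights would be trained end-to-end and the temporal prompt would be a set of learned tokens carried across frames; the point here is only the data flow.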