Towards Universal Modal Tracking with Online Dense Temporal Token Learning

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of building a unified, modality-agnostic video tracker that handles diverse input modalities (RGB, RGB-Thermal, RGB-Depth, and RGB-Event) within a single architecture with shared parameters. Methodologically, it introduces video-level sampling and online dense temporal token propagation to jointly model appearance and motion dynamics; designs gated perceivers for adaptive cross-modal representation fusion and parameter sharing; and adopts a one-stage, end-to-end training paradigm. The resulting framework enables "train-once, infer-many" across tracking tasks. Evaluated on visible-light and multimodal benchmarks (VTUAV, RGBT234, GTOT), it achieves state-of-the-art performance, demonstrating strong generalization, inference efficiency, and training scalability, while substantially reducing modeling and optimization complexity in multimodal visual tracking.
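To make the propagation idea concrete, here is a minimal PyTorch-style sketch of online dense temporal token propagation. It is an illustration under my own assumptions, not the authors' implementation; the module name, token count, and layer sizes are invented. Temporal tokens refined on one frame are fed back as prompts for the next frame, carrying appearance and motion context through the video stream.

```python
# Minimal PyTorch-style sketch (illustrative assumptions, not the authors' code):
# a small set of temporal tokens is refined on every frame and then fed back
# as a prompt for the next frame, carrying appearance/motion context online.
import torch
import torch.nn as nn

class TemporalTokenTracker(nn.Module):
    def __init__(self, dim=256, num_temporal_tokens=8, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # learnable initial temporal tokens, used only for the first frame
        self.temporal_init = nn.Parameter(torch.zeros(1, num_temporal_tokens, dim))
        self.num_temporal_tokens = num_temporal_tokens

    def forward(self, template_tokens, search_tokens, temporal_tokens=None):
        # template_tokens: (B, Nt, dim); search_tokens: (B, Ns, dim)
        if temporal_tokens is None:
            temporal_tokens = self.temporal_init.expand(search_tokens.size(0), -1, -1)
        x = torch.cat([temporal_tokens, template_tokens, search_tokens], dim=1)
        x = self.encoder(x)
        new_temporal = x[:, : self.num_temporal_tokens]   # propagated to frame t+1
        search_feat = x[:, -search_tokens.size(1):]       # used for box prediction
        return search_feat, new_temporal

# Online use over a video stream: tokens from frame t prompt frame t+1.
# tracker, template, and frames are placeholders for illustration.
# tokens = None
# for frame_tokens in frames:
#     search_feat, tokens = tracker(template, frame_tokens, tokens)
```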

📝 Abstract
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called {modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, using the same model architecture and parameters. Specifically, our model is designed around three core goals: Video-level Sampling. We expand the model's inputs to the video-sequence level, aiming to capture richer video context from a near-global perspective. Video-level Association. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms that propagate the target's appearance and motion trajectory information in a video-stream manner. Modality Scalable. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism and subsequently compress them into the same set of model parameters through one-shot training for multi-task inference. This new solution brings the following benefits: (i) the purified token sequences can serve as temporal prompts for inference on subsequent video frames, whereby previous information is leveraged to guide future inference; (ii) unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden but also improves the model's representation. Extensive experiments on visible and multi-modal benchmarks show that our {modaltracker} achieves new SOTA performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
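The gated-perceiver description above suggests a simple fusion pattern: cross-attend from RGB tokens to an auxiliary modality (thermal, depth, or event) and gate how much of the fused signal is admitted, so the same parameters also serve pure RGB tracking. The sketch below illustrates that pattern under assumptions of my own; it is not the paper's actual gated perceiver, and all names and sizes are made up for clarity.

```python
# A hedged sketch of gated cross-modal fusion in the spirit of the gated
# perceivers described above; module and parameter names are assumptions,
# not the paper's definitions.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, aux_tokens=None):
        # rgb_tokens: (B, N, dim); aux_tokens: thermal/depth/event tokens or None
        if aux_tokens is None:
            # pure RGB tracking reuses exactly the same parameters
            return rgb_tokens
        fused, _ = self.cross_attn(rgb_tokens, aux_tokens, aux_tokens)
        # the gate decides, per token and channel, how much auxiliary evidence to admit
        g = self.gate(torch.cat([rgb_tokens, fused], dim=-1))
        return self.norm(rgb_tokens + g * fused)
```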
Problem

Research questions and friction points this paper is trying to address.

Develop a universal video tracking model for multiple modalities
Enable online dense temporal token learning for richer temporal context
Adaptive cross-modal learning with one-shot training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online dense temporal token learning
Video-level sampling and association (see the sampling sketch after this list)
Gated attention for cross-modal learning
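As a rough illustration of the video-level sampling idea referenced above, the helper below draws a short clip of frame indices spread across a long temporal window rather than a single template/search pair. The clip length and maximum gap are invented defaults, not values from the paper.

```python
# Rough illustration of video-level sampling: a training sample is a short clip
# of frame indices spread over a long temporal window rather than a single
# template/search pair. clip_len and max_gap are invented defaults.
import random

def sample_video_clip(num_frames, clip_len=8, max_gap=50):
    """Return clip_len frame indices covering a near-global span of the video."""
    start = random.randint(0, max(0, num_frames - clip_len * max_gap - 1))
    indices = [start]
    for _ in range(clip_len - 1):
        step = random.randint(1, max_gap)
        indices.append(min(indices[-1] + step, num_frames - 1))
    return indices  # short sequences may repeat the final frame index
```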
Yaozong Zheng
Guangxi Normal University
Visual Tracking, Multimodal Tracking
Bineng Zhong
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, and the Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China
Qihua Liang
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, and the Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China
Shengping Zhang
Professor, Harbin Institute of Technology, China
Computer Vision, Pattern Recognition, Machine Learning
Guorong Li
University of Chinese Academy of Sciences
Computer Vision, Visual Tracking, Machine Learning
Xianxian Li
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, and the Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China
Rongrong Ji
Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, 361005, China