SUTrack: Towards Simple and Unified Single Object Tracking

📅 2024-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single object tracking (SOT) methods typically design a separate architecture and train a separate model for each input setting — RGB, RGB-Depth, RGB-Thermal, RGB-Event, and RGB-Language tracking — leading to redundant training, repeated technological innovation, and limited cross-modal knowledge sharing. This paper introduces SUTrack, a unified SOT framework that consolidates all five tasks into a single model trained in a single session. SUTrack shows that one model with a unified input representation can handle these common SOT tasks without task-specific designs, and it adds a task-recognition auxiliary training strategy and a soft token type embedding to further improve performance at minimal overhead. Evaluated on 11 benchmarks spanning the five tasks, SUTrack outperforms previous task-specific trackers. It also provides a range of model variants for both edge devices and high-performance GPUs, striking a good trade-off between speed and accuracy.

📝 Abstract
In this paper, we propose a simple yet unified single object tracking (SOT) framework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based, RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model trained in a single session. Due to the distinct nature of the data, current methods typically design individual architectures and train separate models for each task. This fragmentation results in redundant training processes, repetitive technological innovations, and limited cross-modal knowledge sharing. In contrast, SUTrack demonstrates that a single model with a unified input representation can effectively handle various common SOT tasks, eliminating the need for task-specific designs and separate training sessions. Additionally, we introduce a task-recognition auxiliary training strategy and a soft token type embedding to further enhance SUTrack's performance with minimal overhead. Experiments show that SUTrack outperforms previous task-specific counterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a range of models catering to edge devices as well as high-performance GPUs, striking a good trade-off between speed and accuracy. We hope SUTrack can serve as a strong foundation for further compelling research into unified tracking models. Code and models are available at github.com/chenxin-dlut/SUTrack.
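The abstract describes a unified input representation plus a soft token type embedding, but gives no implementation details here. The following is a minimal NumPy sketch of what that idea could look like: per-modality token streams are concatenated into one sequence, and each stream is offset by a probability-weighted ("soft") mixture of learnable token-type embeddings rather than a hard one-hot type. All names (`soft_token_type_embedding`, `unify_streams`), shapes, and the softmax weighting are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def soft_token_type_embedding(tokens, type_logits, type_table):
    """Add a soft (probability-weighted) mixture of token-type
    embeddings to one modality's token stream.

    tokens:      (N, D) token features for one stream
    type_logits: (T,)   unnormalized scores over T token types
    type_table:  (T, D) learnable embedding per token type
    """
    # Softmax over token types: a soft assignment instead of a hard one-hot
    w = np.exp(type_logits - type_logits.max())
    w = w / w.sum()
    # One mixture embedding shared by every token in this stream
    type_emb = w @ type_table            # (D,)
    return tokens + type_emb             # broadcast over the N tokens

def unify_streams(streams, logits_per_stream, type_table):
    """Concatenate per-modality token streams into a single unified
    sequence, each stream offset by its own soft type embedding."""
    return np.concatenate(
        [soft_token_type_embedding(t, l, type_table)
         for t, l in zip(streams, logits_per_stream)],
        axis=0,
    )

rng = np.random.default_rng(0)
D, T = 8, 3                              # feature dim, number of token types
table = rng.normal(size=(T, D))
rgb   = rng.normal(size=(4, D))          # e.g. RGB search-region tokens
depth = rng.normal(size=(2, D))          # e.g. auxiliary depth tokens
seq = unify_streams(
    [rgb, depth],
    [np.array([2.0, 0.1, 0.1]), np.array([0.1, 2.0, 0.1])],
    table,
)
print(seq.shape)                         # one sequence for a shared backbone
```

A shared backbone can then attend over `seq` regardless of which modalities are present, which is the property a unified representation needs.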
Problem

Research questions and friction points this paper is trying to address.

Single-Object Tracking
Cross-Modal Complementarity
Unified Modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Tracking
Cross-Modal Adaptability
Efficient Computation
Xin Chen
Dalian University of Technology
Ben Kang
Dalian University of Technology
computer vision
Wanting Geng
Dalian University of Technology
Jiawen Zhu
Dalian University of Technology
computer vision, object tracking, multi-modal learning
Yi Liu
Baidu Inc.
Dong Wang
Dalian University of Technology
Huchuan Lu
Dalian University of Technology