General Compression Framework for Efficient Transformer Object Tracking

📅 2024-09-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the substantial accuracy degradation, training complexity, and strong architectural dependency involved in compressing Transformer-based visual object trackers, this paper proposes CompressTracker, a general compression framework. Methodologically, it introduces (1) a novel stage-division strategy with stochastic replacement training, in which student stages are randomly swapped for the corresponding teacher stages, and (2) a prediction-guided, stage-wise feature-mimicking knowledge distillation scheme that removes reliance on any specific backbone architecture. Without modifying the original backbone, CompressTracker-4, a 4-layer variant compressed from OSTrack, retains about 96% of the original performance on LaSOT (66.1% AUC) while accelerating inference by 2.17x. These results outperform state-of-the-art compression methods, demonstrating a superior trade-off among accuracy, efficiency, and architectural generality.

📝 Abstract
Transformer-based trackers have established a dominant role in the field of visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging due to inefficiencies. To improve inference efficiency and reduce computation cost, prior approaches have aimed either to design lightweight trackers or to distill knowledge from larger teacher models into more compact student trackers. However, these solutions often sacrifice accuracy for speed. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce the size of a pre-trained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. Additionally, we design a unique replacement training technique that randomly substitutes specific stages of the student model with those from the teacher model, rather than training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior. To further force the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during compression. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4 with 4 transformer layers, compressed from OSTrack, retains about 96% of the original performance on LaSOT (66.1% AUC) while achieving a 2.17x speed-up.
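The replacement training described in the abstract can be sketched in a few lines. The toy stages below are plain callables standing in for groups of transformer layers, and the replacement probability is an illustrative hyper-parameter, not the paper's implementation:

```python
import random

# Toy "stages": each stage is a callable mapping a feature value to a new one.
# In the real tracker these would be groups of transformer layers; plain
# functions keep the sketch self-contained. All names are illustrative.
teacher_stages = [lambda x, k=k: x * (k + 2) for k in range(4)]          # frozen teacher
student_stages = [lambda x, k=k: x * (k + 2) * 0.9 for k in range(4)]    # lighter student

def forward_with_replacement(x, p_replace=0.5, rng=random):
    """Run one forward pass, swapping each student stage for the
    corresponding teacher stage with probability p_replace, so the
    student learns to be interchangeable with the teacher stage-by-stage."""
    for t_stage, s_stage in zip(teacher_stages, student_stages):
        stage = t_stage if rng.random() < p_replace else s_stage
        x = stage(x)
    return x
```

With `p_replace=0` the pass uses only student stages; with `p_replace=1` it reduces to the teacher, and intermediate values mix the two so gradients push each student stage toward its teacher counterpart.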
Problem

Research questions and friction points this paper is trying to address.

Improve tracking efficiency without sacrificing accuracy
Simplify complex training process of student models
Overcome structural limitations in transformer-based tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage division strategy for structural flexibility
Replacement training enhances student model replication
Prediction guidance and feature mimicking supervision
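The last two innovations combine into a single training objective: stage-wise feature mimicking plus prediction guidance on top of the ordinary tracking loss. A minimal sketch, with illustrative weights and plain-list features rather than the paper's actual loss terms:

```python
def mse(a, b):
    """Mean squared error between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_feats, teacher_feats, student_pred, teacher_pred,
                 gt_pred, w_feat=1.0, w_pred=1.0, w_task=1.0):
    """Combine stage-wise feature mimicking with prediction guidance.
    student_feats / teacher_feats: one feature vector per stage.
    The weights are illustrative hyper-parameters, not the paper's values."""
    feat_loss = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    pred_loss = mse(student_pred, teacher_pred)   # prediction guidance
    task_loss = mse(student_pred, gt_pred)        # ordinary tracking loss
    return w_feat * feat_loss + w_pred * pred_loss + w_task * task_loss
```

Because the feature term is applied per stage, it pairs naturally with the stage division strategy: each student stage is supervised by the output of its matching teacher stage, not only by the final prediction.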
Lingyi Hong
Fudan University
Computer Vision
Jinglun Li
Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai, China
Xinyu Zhou
Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Shilin Yan
Fudan University
MLLMs, Computer Vision, Multi-Modal
Pinxue Guo
Fudan University
Multimodal LLM, Video Understanding, Tracking and Segmentation
Kaixun Jiang
Fudan University
Computer Vision, Adversarial Examples
Zhaoyu Chen
TikTok
AI Security, Trustworthy AI, Multimodal AI, Generative AI
Shuyong Gao
Fudan University
Human Visual Attention, Generative Model, Weakly Supervised Learning
Wei Zhang
Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Hong Lu
Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Wenqiang Zhang
Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China; Engineering Research Center of AI & Robotics, Ministry of Education, Academy for Engineering & Technology, Fudan University, Shanghai, China