Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking

πŸ“… 2025-03-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the high parameter count and slow inference of Vision Transformers (ViTs) in real-time UAV tracking, this paper proposes a similarity-guided layer-adaptive pruning mechanism. Specifically, cosine similarity is employed to quantify inter-layer feature representation redundancy; highly similar layers are dynamically disabled, retaining only the optimal single layer for tracking. Additionally, an end-to-end trainable layer adaptation module is introduced to seamlessly integrate lightweight ViT backbones with the single-layer architecture. This work presents the first incorporation of inter-layer representation similarity measurement and dynamic layer selection into ViT-based trackers. The method achieves real-time state-of-the-art speed (>60 FPS) on six mainstream benchmarks while matching the accuracy of advanced multi-layer ViT trackers. Code and models are publicly released.
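The core pruning idea from the summary can be sketched roughly as follows: measure cosine similarity between per-layer feature representations and keep only one layer from each run of highly similar layers. This is a simplified greedy illustration, not the paper's actual method — the threshold `tau`, the helper names, and the "keep the first of each similar run" rule are all assumptions (SGLATrack selects the optimal layer via an end-to-end trainable module).

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two flattened feature maps."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_layers(features, tau=0.95):
    """Greedy sketch of similarity-guided layer selection.

    features: list of per-layer feature arrays (one per ViT block).
    tau: similarity threshold (hypothetical value, not from the paper).
    Returns indices of retained layers; layers whose representation is
    highly similar (>= tau) to the last retained layer are disabled.
    """
    keep = [0]  # always retain the first layer
    for i in range(1, len(features)):
        if cosine_sim(features[keep[-1]], features[i]) < tau:
            keep.append(i)
    return keep
```

In this toy setting, consecutive layers producing near-parallel feature vectors are pruned, so inference only runs the retained blocks; the paper's contribution is making this selection dynamic and trainable rather than a fixed greedy rule.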

πŸ“ Abstract
Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking, which places an extreme emphasis on efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize the structure of ViTs. Our approach dynamically disables a large number of representation-similar layers and selectively retains only a single optimal layer among them, aiming to achieve a better accuracy-speed trade-off. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision. Code and models are available at https://github.com/GXNU-ZhongLab/SGLATrack.
Problem

Research questions and friction points this paper is trying to address.

Optimizes Vision Transformers for UAV tracking efficiency.
Reduces redundant layers in lightweight ViT-based trackers.
Enhances real-time tracking speed without losing precision.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layer disabling for efficiency
Similarity-guided layer adaptation optimization
Real-time UAV tracking with SGLATrack
Chaocan Xue
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China
Bineng Zhong
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China
Qihua Liang
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China
Yaozong Zheng
Guangxi Normal University
Visual Tracking · Multimodal Tracking
Ning Li
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China
Yuanliang Xue
Xi’an Research Institute of High Technology, Xi’an 710025, China
Shuxiang Song
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China