Learning Adaptive and View-Invariant Vision Transformer with Multi-Teacher Knowledge Distillation for Real-Time UAV Tracking

📅 2024-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of achieving real-time, robust, and viewpoint-invariant visual tracking on resource-constrained UAV platforms (e.g., mobile devices), this paper proposes AVTrack, an adaptive computation framework, and AVTrack-MD, a multi-teacher knowledge distillation extension of it. The key contributions are: (1) a dynamic activation mechanism for Vision Transformer (ViT) blocks, integrating adaptive computation with dynamic ViT pruning; (2) a viewpoint-invariant mutual-information-maximization objective that enhances feature discriminability and geometric robustness; and (3) a multi-teacher knowledge distillation framework based on softened feature alignment that improves the generalization of lightweight student models. Evaluated on multiple UAV tracking benchmarks, AVTrack-MD maintains the baseline's accuracy while reducing model parameters by 32% and increasing average tracking speed by 17%, and it demonstrates significantly improved stability under noisy conditions.
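The "dynamic activation mechanism" described above can be pictured as a learned gate that decides, per input, whether each transformer block runs at all. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: `gate_weights`, the pooled-token gating rule, and the 0.5 threshold are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_vit_forward(tokens, blocks, gate_weights, threshold=0.5):
    """Run a stack of transformer blocks, skipping any block whose
    activation gate falls below `threshold`.

    tokens:       (num_tokens, dim) array of token embeddings
    blocks:       list of callables, each mapping tokens -> tokens
    gate_weights: one (dim,) weight vector per block; the gate is a
                  scalar predicted from the mean-pooled tokens
                  (a hypothetical stand-in for a learned Activation Module)
    """
    executed = []
    for i, (block, w) in enumerate(zip(blocks, gate_weights)):
        pooled = tokens.mean(axis=0)   # (dim,) summary of the current tokens
        gate = sigmoid(pooled @ w)     # scalar activation probability
        if gate >= threshold:
            tokens = block(tokens)     # engage this block
            executed.append(i)
        # else: skip the block entirely, saving its compute
    return tokens, executed
```

Blocks whose gate stays below the threshold contribute no compute for that input, which is the source of the speedup the summary reports.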

📝 Abstract
Visual tracking has made significant strides due to the adoption of transformer-based models. However, most state-of-the-art trackers struggle to meet real-time processing demands on mobile platforms with constrained computing resources, particularly for real-time unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we introduce AVTrack, an adaptive computation framework designed to selectively activate transformer blocks for real-time UAV tracking. The proposed Activation Module (AM) dynamically optimizes the ViT architecture by selectively engaging relevant components, thereby enhancing inference efficiency without significantly compromising tracking performance. Furthermore, to tackle the extreme changes in viewing angle often encountered in UAV tracking, the proposed method enhances the effectiveness of ViTs by learning view-invariant representations through mutual information (MI) maximization. These two design principles constitute AVTrack. Building on them, we propose an improved tracker, dubbed AVTrack-MD, which introduces a novel MI-maximization-based multi-teacher knowledge distillation (MD) framework. It harnesses the benefits of multiple teachers, namely the off-the-shelf tracking models from AVTrack, by integrating and refining their outputs to guide the learning of a compact student network. Specifically, we maximize the MI between the softened feature representations of the multi-teacher models and the student model, leading to improved generalization and performance of the student, particularly under noisy conditions. Extensive experiments on multiple UAV tracking benchmarks demonstrate that AVTrack-MD not only achieves performance comparable to the AVTrack baseline but also reduces model complexity, resulting in a significant 17% increase in average tracking speed.
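The distillation step described in the abstract aggregates softened features from several teachers and pulls the student toward the aggregate. The sketch below illustrates that shape with a KL-divergence alignment term as a common tractable stand-in for the paper's MI-maximization objective; the temperature value, the softmax-over-features softening, and the uniform teacher averaging are all assumptions, not the authors' exact formulation.

```python
import numpy as np

def softened(feat, tau=4.0):
    """Temperature-softened feature distribution (softmax over the feature dim)."""
    z = feat / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_distill_loss(student_feat, teacher_feats, tau=4.0):
    """Average the teachers' softened features into one target and
    measure how far the student's softened features are from it.

    KL(target || student) is used here as a stand-in for the MI
    objective; minimizing it aligns the student with the aggregated
    teachers. Shapes: (batch, dim) arrays throughout.
    """
    target = np.mean([softened(t, tau) for t in teacher_feats], axis=0)
    pred = softened(student_feat, tau)
    eps = 1e-12
    kl = np.sum(target * (np.log(target + eps) - np.log(pred + eps)), axis=-1)
    return float(np.mean(kl))
```

When the student already matches every teacher the loss is zero, and it grows as the student's features drift from the teacher consensus; averaging the teachers first is what lets disagreeing teachers partially cancel each other's noise.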
Problem

Research questions and friction points this paper is trying to address.

Real-time tracking
Computational efficiency
Drone vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

AVTrack
Multi-model Integration
Efficiency Improvement
You Wu
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
Yongxin Li
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China; Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541006, China
Mengyuan Liu
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
Xucheng Wang
School of Computer Science, Fudan University, Shanghai 200082, China
Xiangyang Yang
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
Hengzhou Ye
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China; Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541006, China
Dan Zeng
Sun Yat-sen University
Biometrics; computer vision; deep learning
Qijun Zhao
Professor of Computer Science, Sichuan University
Biometrics; 3D Vision; Object Detection and Recognition; Face Recognition; Fingerprint Recognition
Shuiwang Li
Guilin University of Technology