🤖 AI Summary
To address the challenge of achieving real-time, robust, and viewpoint-invariant visual tracking on resource-constrained UAV platforms (e.g., mobile devices), this paper proposes AVTrack—an adaptive computation framework—and AVTrack-MD, a multi-teacher knowledge distillation method. Our key contributions are: (1) the first dynamic activation mechanism for Vision Transformer (ViT) modules, integrating adaptive computation with dynamic ViT pruning; (2) a viewpoint-invariant mutual information maximization objective to enhance feature discriminability and geometric robustness; and (3) a soft feature alignment–based multi-teacher knowledge distillation framework to improve generalization of lightweight models. Evaluated on multiple UAV tracking benchmarks, AVTrack-MD matches the AVTrack baseline's accuracy while reducing model parameters by 32% and increasing average tracking speed by 17%. Moreover, it demonstrates significantly improved stability under noisy conditions.
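Contribution (1) above, dynamically activating only a subset of ViT blocks per input, can be illustrated with a minimal NumPy sketch. All names here (`gate_score`, `adaptive_forward`, the toy residual "blocks", the threshold `tau`) are hypothetical stand-ins for exposition; the paper's actual Activation Module and gating criterion may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def gate_score(tokens, w):
    # Hypothetical gating head: mean-pool the tokens, then apply a
    # sigmoid-activated linear projection to get a scalar in (0, 1).
    pooled = tokens.mean(axis=0)
    return 1.0 / (1.0 + np.exp(-float(pooled @ w)))

def adaptive_forward(tokens, blocks, gate_weights, tau=0.5):
    # Run only the blocks whose gate score exceeds tau; a skipped
    # block acts as the identity, saving its computation entirely.
    executed = []
    for i, (block, w) in enumerate(zip(blocks, gate_weights)):
        if gate_score(tokens, w) >= tau:
            tokens = block(tokens)
            executed.append(i)
    return tokens, executed

# Toy residual "transformer blocks" on (num_tokens, dim) arrays.
dim, n_tokens, n_blocks = 8, 4, 6
mats = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_blocks)]
blocks = [lambda t, M=M: t + np.tanh(t @ M) for M in mats]
gate_weights = [rng.normal(size=dim) for _ in range(n_blocks)]

x = rng.normal(size=(n_tokens, dim))
out, executed = adaptive_forward(x, blocks, gate_weights)
print(f"executed {len(executed)} of {n_blocks} blocks")
```

In a trained tracker the gates would be learned jointly with the backbone (e.g., with a sparsity or efficiency penalty) rather than fixed random projections as in this toy example.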
📝 Abstract
Visual tracking has made significant strides due to the adoption of transformer-based models. However, most state-of-the-art trackers struggle to meet real-time processing demands on mobile platforms with constrained computing resources, particularly for real-time unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we introduce AVTrack, an adaptive computation framework designed to selectively activate transformer blocks for real-time UAV tracking. The proposed Activation Module (AM) dynamically optimizes the ViT architecture by selectively engaging relevant components, thereby enhancing inference efficiency without significantly compromising tracking performance. Furthermore, to tackle the extreme changes in viewing angle often encountered in UAV tracking, the proposed method enhances the ViT's effectiveness by learning view-invariant representations through mutual information (MI) maximization. These two design principles underpin AVTrack. Building on them, we propose an improved tracker, dubbed AVTrack-MD, which introduces a novel MI-maximization-based multi-teacher knowledge distillation (MD) framework. It harnesses the benefits of multiple teachers, specifically the off-the-shelf tracking models from AVTrack, by integrating and refining their outputs to guide the learning of a compact student network. Specifically, we maximize the MI between the softened feature representations of the multi-teacher models and the student model, improving the student's generalization and performance, particularly in noisy conditions. Extensive experiments on multiple UAV tracking benchmarks demonstrate that AVTrack-MD not only achieves performance comparable to the AVTrack baseline but also reduces model complexity, resulting in a significant 17% increase in average tracking speed.
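The multi-teacher distillation step described above, aggregating the teachers' softened features and aligning the student to them, can be sketched as follows. This is a simplified surrogate, not the paper's actual objective: it uses a KL divergence between temperature-softened distributions in place of the MI-maximization loss, and every name (`soften`, `multi_teacher_distill_loss`, temperature `T`) is a hypothetical placeholder.

```python
import numpy as np

def soften(features, T=4.0):
    # Temperature-softened distribution over the feature dimensions
    # (numerically stable softmax at temperature T).
    z = features / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_distill_loss(student_feats, teacher_feats_list, T=4.0):
    # Average the teachers' softened features into a single target,
    # then penalize KL(teacher_avg || student) — a simple stand-in
    # for the MI-maximization alignment used in the paper.
    p_teacher = np.mean([soften(f, T) for f in teacher_feats_list], axis=0)
    q_student = soften(student_feats, T)
    eps = 1e-12
    return float(np.sum(p_teacher * np.log((p_teacher + eps) / (q_student + eps))))

rng = np.random.default_rng(1)
teachers = [rng.normal(size=(16,)) for _ in range(3)]
student = rng.normal(size=(16,))
loss = multi_teacher_distill_loss(student, teachers)
print(f"distill loss: {loss:.4f}")
```

Averaging softened (rather than raw) teacher features smooths out disagreements between teachers, which is one intuition behind the improved robustness to noise reported for AVTrack-MD.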