Learning Adaptive and View-Invariant Vision Transformer with Multi-Teacher Knowledge Distillation for Real-Time UAV Tracking

📅 2024-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of achieving real-time, robust, and viewpoint-invariant visual tracking on resource-constrained UAV platforms (e.g., mobile devices), this paper proposes AVTrack, an adaptive computation framework, and AVTrack-MD, a multi-teacher knowledge distillation extension of it. The key contributions are: (1) a dynamic activation mechanism for Vision Transformer (ViT) blocks, integrating adaptive computation with dynamic ViT pruning; (2) a viewpoint-invariant mutual-information-maximization objective that enhances feature discriminability and geometric robustness; and (3) a multi-teacher knowledge distillation framework based on softened feature alignment that improves the generalization of lightweight student models. Evaluated on multiple UAV tracking benchmarks, AVTrack-MD maintains the baseline's accuracy while reducing model parameters by 32% and increasing average tracking speed by 17%, and it demonstrates significantly improved stability under noisy conditions.
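The "dynamic activation mechanism" described above can be pictured as a learned gate that decides, per input, whether each transformer block runs at all. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: `gate_weights`, the pooled-token gating rule, and the 0.5 threshold are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_vit_forward(tokens, blocks, gate_weights, threshold=0.5):
    """Run a stack of transformer blocks, skipping any block whose
    activation gate falls below `threshold`.

    tokens:       (num_tokens, dim) array of token embeddings
    blocks:       list of callables, each mapping tokens -> tokens
    gate_weights: one (dim,) weight vector per block; the gate is a
                  scalar predicted from the mean-pooled tokens
                  (a hypothetical stand-in for a learned Activation Module)
    """
    executed = []
    for i, (block, w) in enumerate(zip(blocks, gate_weights)):
        pooled = tokens.mean(axis=0)   # (dim,) summary of the current tokens
        gate = sigmoid(pooled @ w)     # scalar activation probability
        if gate >= threshold:
            tokens = block(tokens)     # engage this block
            executed.append(i)
        # else: skip the block entirely, saving its compute
    return tokens, executed
```

Blocks whose gate stays below the threshold contribute no compute for that input, which is the source of the speedup the summary reports.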

📝 Abstract
Visual tracking has made significant strides due to the adoption of transformer-based models. However, most state-of-the-art trackers struggle to meet real-time processing demands on mobile platforms with constrained computing resources, particularly for real-time unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we introduce AVTrack, an adaptive computation framework designed to selectively activate transformer blocks for real-time UAV tracking. The proposed Activation Module (AM) dynamically optimizes the ViT architecture by selectively engaging relevant components, thereby enhancing inference efficiency without significantly compromising tracking performance. Furthermore, to tackle the extreme changes in viewing angle often encountered in UAV tracking, the proposed method enhances the effectiveness of ViTs by learning view-invariant representations through mutual information (MI) maximization. These two design principles constitute AVTrack. Building on them, we propose an improved tracker, dubbed AVTrack-MD, which introduces a novel MI-maximization-based multi-teacher knowledge distillation (MD) framework. It harnesses the benefits of multiple teachers, namely the off-the-shelf tracking models from AVTrack, by integrating and refining their outputs to guide the learning of a compact student network. Specifically, we maximize the MI between the softened feature representations of the multi-teacher models and the student model, leading to improved generalization and performance of the student, particularly under noisy conditions. Extensive experiments on multiple UAV tracking benchmarks demonstrate that AVTrack-MD not only achieves performance comparable to the AVTrack baseline but also reduces model complexity, resulting in a significant 17% increase in average tracking speed.
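The distillation step described in the abstract aggregates softened features from several teachers and pulls the student toward the aggregate. The sketch below illustrates that shape with a KL-divergence alignment term as a common tractable stand-in for the paper's MI-maximization objective; the temperature value, the softmax-over-features softening, and the uniform teacher averaging are all assumptions, not the authors' exact formulation.

```python
import numpy as np

def softened(feat, tau=4.0):
    """Temperature-softened feature distribution (softmax over the feature dim)."""
    z = feat / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_distill_loss(student_feat, teacher_feats, tau=4.0):
    """Average the teachers' softened features into one target and
    measure how far the student's softened features are from it.

    KL(target || student) is used here as a stand-in for the MI
    objective; minimizing it aligns the student with the aggregated
    teachers. Shapes: (batch, dim) arrays throughout.
    """
    target = np.mean([softened(t, tau) for t in teacher_feats], axis=0)
    pred = softened(student_feat, tau)
    eps = 1e-12
    kl = np.sum(target * (np.log(target + eps) - np.log(pred + eps)), axis=-1)
    return float(np.mean(kl))
```

When the student already matches every teacher the loss is zero, and it grows as the student's features drift from the teacher consensus; averaging the teachers first is what lets disagreeing teachers partially cancel each other's noise.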
Problem

Research questions and friction points this paper is trying to address.

Real-time tracking
Computational efficiency
Drone vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

AVTrack
Multi-model Integration
Efficiency Improvement
You Wu
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
Yongxin Li
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China; Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541006, China
Mengyuan Liu
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
Xucheng Wang
School of Computer Science, Fudan University, Shanghai 200082, China
Xiangyang Yang
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
Hengzhou Ye
College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China; Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541006, China
Dan Zeng
Sun Yat-sen University
Biometrics; computer vision; deep learning
Qijun Zhao
Professor of Computer Science, Sichuan University
Biometrics; 3D Vision; Object Detection and Recognition; Face Recognition; Fingerprint Recognition
Shuiwang Li
Guilin University of Technology