Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses computational redundancy in existing Transformer-based visual trackers, which perform fixed-depth inference on every frame of long videos. The authors propose UncL-STARK, a method that enables inference at multiple intermediate depths without altering the network architecture or adding auxiliary modules. Through stochastic-depth training combined with knowledge distillation, the model learns to remain accurate when truncated to shallower depths. At runtime, it estimates uncertainty from the corner heatmaps and, exploiting temporal coherence across frames, dynamically adjusts the encoder and decoder depths for the next frame. This is the first dynamic depth-adaptation mechanism that requires no structural modifications, and it significantly improves inference efficiency: GFLOPs drop by up to 12%, latency by 8.9%, and energy consumption by 10.8% on GOT-10k and LaSOT, with accuracy degradation of at most 0.2%.
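The summary above says uncertainty is estimated directly from the corner localization heatmaps. A minimal sketch of one plausible statistic — one minus the peak of the softmax-normalized heatmap — is shown below. This is an illustration under assumptions; the function name `heatmap_uncertainty` and the choice of statistic are ours, not necessarily the paper's exact formulation.

```python
import numpy as np

def heatmap_uncertainty(heatmap: np.ndarray) -> float:
    """Uncertainty proxy from a corner heatmap.

    Softmax-normalizes the raw heatmap scores and returns 1 minus the
    peak probability: a sharp, confident peak gives a value near 0,
    while a diffuse (uncertain) map gives a value near 1.
    Illustrative only; the paper's exact statistic may differ.
    """
    logits = heatmap.ravel()
    # Numerically stable softmax over all heatmap locations.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(1.0 - probs.max())
```

Because the heatmaps are already produced by the corner head for localization, this estimate adds essentially no overhead, which is consistent with the "lightweight" claim in the abstract.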

📝 Abstract
Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder-decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that, exploiting temporal coherence in video, selects the encoder and decoder depth for the next frame from the current prediction's confidence. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
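The feedback-driven policy described in the abstract can be sketched as a simple hysteresis rule: low uncertainty on the current frame allows shallower inference on the next frame, high uncertainty forces deeper inference. The thresholds `tau_low`/`tau_high` and the one-layer step size are assumed hyperparameters for illustration; the paper's actual policy may be parameterized differently.

```python
def select_depth(uncertainty: float, depth: int,
                 min_depth: int, max_depth: int,
                 tau_low: float = 0.3, tau_high: float = 0.6) -> int:
    """Choose the encoder/decoder depth for the NEXT frame.

    Exploits temporal coherence: confident predictions (low uncertainty)
    justify truncating one layer earlier; uncertain predictions trigger
    one layer deeper, bounded by [min_depth, max_depth].
    Thresholds and step size are illustrative, not from the paper.
    """
    if uncertainty < tau_low:
        return max(min_depth, depth - 1)   # confident -> go shallower
    if uncertainty > tau_high:
        return min(max_depth, depth + 1)   # uncertain -> go deeper
    return depth                           # in between -> keep depth
```

Moving one layer at a time keeps depth changes smooth across temporally coherent frames, so a single noisy uncertainty estimate cannot swing the tracker from its shallowest to its deepest configuration.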
Problem

Research questions and friction points this paper is trying to address.

visual tracking
transformer
inference-time depth adaptation
computational efficiency
temporal coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty-guided inference
dynamic depth adaptation
transformer-based tracking
knowledge distillation
temporal coherence