🤖 AI Summary
To address the fundamental trade-off between speed and accuracy in visual object tracking under resource-constrained conditions, this paper proposes DyTrack, a dynamic Transformer framework. Methodologically, DyTrack introduces the first dynamic inference-path mechanism tailored for tracking: it integrates early-exit branches at intermediate layers, cross-layer feature reuse, and target-aware self-distillation to enable frame-wise, complexity-adaptive computation allocation within a single model. It combines dynamic network routing with sequential decision modeling, implemented via a lightweight Transformer architecture. On the LaSOT benchmark, DyTrack achieves 64.9% AUC while running at 256 FPS, significantly outperforming existing methods at comparable speeds and establishing a new state of the art in the speed-accuracy trade-off for real-time tracking.
📝 Abstract
The speed-precision tradeoff is a critical problem in visual object tracking, which typically demands low latency and is deployed on resource-constrained platforms. Existing solutions for efficient tracking primarily rely on lightweight backbones or modules, which, however, sacrifice precision. In this article, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit varying levels of complexity. We argue that a simple network is sufficient for easy video frames, while more computational resources should be assigned to difficult ones. DyTrack automatically learns to configure proper reasoning routes for different inputs, thereby improving the utilization of the available computational budget and achieving higher performance at the same running speed. We formulate instance-specific tracking as a sequential decision problem and incorporate terminating branches into intermediate layers of the model. Furthermore, we propose a feature recycling mechanism that maximizes computational efficiency by reusing the outputs of predecessor layers. Additionally, a target-aware self-distillation strategy is designed to enhance the discriminative capability of early-stage predictions by mimicking the representation patterns of the deep model. Extensive experiments demonstrate that DyTrack achieves promising speed-precision tradeoffs with only a single model. For instance, DyTrack obtains 64.9% area under the curve (AUC) on LaSOT at a speed of 256 fps.
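The core control flow described above (terminating branches at intermediate layers, with predecessor outputs reused by later stages) can be sketched as follows. This is an illustrative toy, not the authors' implementation: `run_block`, `exit_confidence`, and the fixed threshold are hypothetical stand-ins for DyTrack's transformer blocks and learned early-exit heads.

```python
def run_block(features, depth):
    # Stand-in for one transformer block: deeper blocks refine features more.
    return [f + 0.1 * depth for f in features]

def exit_confidence(features):
    # Stand-in for a terminating branch's halting score in [0, 1].
    return sum(features) / len(features)

def dynamic_forward(features, num_blocks=4, threshold=0.5):
    """Run blocks sequentially; stop at the first terminating branch whose
    confidence clears the threshold, so easy frames exit early and hard
    frames traverse the full depth. Returns (features, depth used)."""
    for depth in range(1, num_blocks + 1):
        # Feature recycling: each block consumes its predecessor's output
        # rather than recomputing from scratch.
        features = run_block(features, depth)
        if exit_confidence(features) >= threshold:
            return features, depth  # early exit on an easy frame
    return features, num_blocks  # full-depth route for a hard frame

# An "easy" frame exits after one block; a "hard" frame needs three.
_, easy_depth = dynamic_forward([0.4, 0.6])
_, hard_depth = dynamic_forward([0.0, 0.0])
```

Per frame, the cost is proportional to the exit depth, which is how a single model trades compute for precision on a frame-by-frame basis.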