Exploring Dynamic Transformer for Efficient Object Tracking

📅 2024-03-26
🏛️ IEEE Transactions on Neural Networks and Learning Systems
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
To address the fundamental trade-off between speed and accuracy in visual object tracking under resource-constrained conditions, this paper proposes DyTrack, a dynamic Transformer framework. Methodologically, DyTrack introduces the first dynamic inference path mechanism tailored for tracking, integrating early-exit branches at intermediate layers, cross-layer feature reuse, and target-aware self-distillation to enable frame-wise, complexity-adaptive computation allocation within a single model. It combines dynamic network routing with sequential decision modeling, implemented via a lightweight Transformer architecture. On the LaSOT benchmark, DyTrack achieves 64.9% AUC while running at 256 FPS, significantly outperforming existing methods at comparable speeds, and establishes a new state of the art in the speed-accuracy trade-off for real-time tracking.

📝 Abstract
The speed-precision tradeoff is a critical problem in visual object tracking, as trackers typically require low latency and are deployed on resource-constrained platforms. Existing solutions for efficient tracking primarily focus on lightweight backbones or modules, which, however, come at a sacrifice in precision. In this article, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit varying levels of complexity. We argue that a simple network is sufficient for easy video frames, while more computational resources should be assigned to difficult ones. DyTrack automatically learns to configure proper reasoning routes for different inputs, thereby improving the utilization of the available computational budget and achieving higher performance at the same running speed. We formulate instance-specific tracking as a sequential decision problem and incorporate terminating branches into intermediate layers of the model. Furthermore, we propose a feature recycling mechanism to maximize computational efficiency by reusing the outputs of predecessors. Additionally, a target-aware self-distillation strategy is designed to enhance the discriminating capabilities of early-stage predictions by mimicking the representation patterns of the deep model. Extensive experiments demonstrate that DyTrack achieves promising speed-precision tradeoffs with only a single model. For instance, DyTrack obtains 64.9% area under the curve (AUC) on LaSOT with a speed of 256 fps.
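The early-exit idea described above can be sketched in a few lines. This is a hypothetical toy illustration, not the authors' implementation: `dynamic_track`, `make_stage`, and the confidence heuristic are invented names, and real DyTrack stages would be Transformer blocks with learned terminating branches.

```python
# Minimal sketch of early-exit dynamic inference (hypothetical, not the
# paper's code): a stack of stages, each with a terminating branch that
# emits a prediction and a confidence; inference stops at the first stage
# whose confidence clears a threshold, and each stage reuses its
# predecessor's output feature (the "feature recycling" idea).

def dynamic_track(frame, stages, exit_threshold=0.8):
    """Run stages sequentially; return (prediction, stages_used)."""
    feature = frame  # the initial "feature" is just the input here
    for depth, (transform, head) in enumerate(stages, start=1):
        feature = transform(feature)            # reuse predecessor output
        prediction, confidence = head(feature)  # terminating branch
        if confidence >= exit_threshold:        # easy frame: exit early
            return prediction, depth
    return prediction, depth                    # hard frame: full depth

# Toy stages: each transform accumulates scalar "evidence"; confidence
# grows with depth, so inputs with strong initial evidence exit sooner.
def make_stage(gain):
    transform = lambda x: x + gain
    head = lambda x: (round(x, 2), min(x / 10.0, 1.0))
    return transform, head

stages = [make_stage(g) for g in (3.0, 3.0, 3.0, 3.0)]

print(dynamic_track(6.0, stages))  # easy input  -> (9.0, 1), early exit
print(dynamic_track(0.0, stages))  # hard input  -> (9.0, 3), deeper route
```

Because all exits live in one model, the per-frame compute budget adapts to input difficulty without maintaining multiple trackers.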
Problem

Research questions and friction points this paper is trying to address.

Balancing speed and precision in object tracking
Adapting computation for varying tracking complexity
Enhancing efficiency without sacrificing tracking accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic transformer framework for efficient tracking
Feature recycling mechanism to reuse outputs
Target-aware self-distillation strategy for early predictions
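The target-aware self-distillation above can be sketched as a masked feature-matching loss. This is a hypothetical illustration under assumed details: the function name, plain-list "feature maps", and MSE objective are stand-ins for the paper's actual loss, and the mask here is a simple binary target-region indicator.

```python
# Hypothetical sketch of target-aware self-distillation: the deepest
# stage acts as the teacher, and each early-exit branch (student) is
# trained to mimic the teacher's representation only at spatial positions
# inside the target region, keeping early predictions discriminative
# on the object itself rather than on background.

def target_aware_distill_loss(student_feat, teacher_feat, target_mask):
    """Mean squared error between student and teacher feature maps,
    restricted to positions covered by the binary target mask."""
    total, count = 0.0, 0
    for s_row, t_row, m_row in zip(student_feat, teacher_feat, target_mask):
        for s, t, m in zip(s_row, t_row, m_row):
            if m:                       # only distill inside the target box
                total += (s - t) ** 2
                count += 1
    return total / max(count, 1)

# Toy 3x3 feature maps; the mask selects the centre cell only, so the
# large student/teacher mismatch on background cells is ignored.
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
teacher = [[9.0, 9.0, 9.0], [9.0, 2.0, 9.0], [9.0, 9.0, 9.0]]
mask    = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

print(target_aware_distill_loss(student, teacher, mask))  # (1-2)^2 = 1.0
```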
Jiawen Zhu
Dalian University of Technology
computer vision, object tracking, multi-modal learning
Xin Chen
School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China
Haiwen Diao
Nanyang Technological University
Computer Vision, Vision-and-Language, Transfer Learning, Multimodal LLM
Shuai Li
Department of Computing, The Hong Kong Polytechnic University, Hong Kong
Jun-Yan He
Tongyi Lab, Alibaba Group
Multimedia Computing, Computer Vision
Chenyang Li
DAMO Academy, Alibaba Group, Shenzhen 518000, China
Bin Luo
DAMO Academy, Alibaba Group, Shenzhen 518000, China
Dong Wang
School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China
Huchuan Lu
School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China