🤖 AI Summary
In aerial remote sensing, lightweight trackers suffer from insufficient decoupling between recognition and localization because they rely on single-stage feature fusion, limiting both robustness and accuracy. To address this, we propose a target-aware Bidirectional Fusion Transformer (BFTrans). Our approach features: (1) a two-stream fusion network based on linear self- and cross-attention that combines shallow and deep features in both forward and backward directions, providing local details for localization and global semantics for recognition; (2) a target-aware positional encoding strategy that helps the fusion model perceive object-related attributes; and (3) a lightweight Transformer design suited to embedded deployment. Evaluated on UAV-123, UAV20L, and UAVTrack112, our method outperforms state-of-the-art trackers in both accuracy and robustness while running at 30.5 FPS on an embedded platform, demonstrating an effective balance between high precision and real-time inference.
📝 Abstract
Trackers based on lightweight neural networks have achieved great success in aerial remote sensing, and most of them aggregate multi-stage deep features to improve tracking quality. However, existing algorithms usually generate only single-stage fused features for state estimation, ignoring the fact that identifying and locating the object require diverse kinds of features, which limits the robustness and precision of tracking. In this paper, we propose a novel target-aware Bidirectional Fusion Transformer (BFTrans) for UAV tracking. Specifically, we first present a two-stream fusion network based on linear self- and cross-attention, which combines shallow and deep features in both forward and backward directions, providing adjusted local details for localization and global semantics for recognition. In addition, a target-aware positional encoding strategy is designed for this fusion model, helping it perceive object-related attributes during the fusion phase. Finally, the proposed method is evaluated on several popular UAV benchmarks, including UAV-123, UAV20L, and UAVTrack112. Extensive experimental results demonstrate that our approach outperforms other state-of-the-art trackers and runs at an average speed of 30.5 FPS on an embedded platform, making it suitable for practical drone deployments.
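The abstract does not spell out the fusion mechanism, but the idea of bidirectional linear cross-attention between shallow and deep feature streams can be sketched roughly as follows. This is a minimal illustration under our own assumptions: the function names, the `elu+1`-style positive feature map, and the plain numpy formulation are ours, not the authors' actual design.

```python
import numpy as np

def linear_cross_attention(q, k, v, eps=1e-6):
    """Kernelized (linear) attention: O(n) in sequence length.

    Instead of softmax(q k^T) v, uses a positive feature map phi so that
    attention factorizes as phi(q) (phi(k)^T v) / (phi(q) sum_k phi(k)).
    """
    phi = lambda x: np.maximum(x, 0.0) + 1.0  # positive feature map (assumed)
    Q, K = phi(q), phi(k)
    kv = K.T @ v                    # (d, d): aggregated key-value context
    z = Q @ K.sum(axis=0)           # (n,): per-query normalization term
    return (Q @ kv) / (z[:, None] + eps)

def bidirectional_fusion(shallow, deep):
    """Fuse two feature stages in both directions.

    Forward: shallow tokens query deep semantics (cues for recognition).
    Backward: deep tokens query shallow details (cues for localization).
    """
    fwd = linear_cross_attention(shallow, deep, deep)
    bwd = linear_cross_attention(deep, shallow, shallow)
    return fwd, bwd
```

The linear-attention factorization is what makes such a fusion block cheap enough for embedded, real-time use: cost grows linearly with the number of tokens rather than quadratically as in standard softmax attention.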