How Far are Modern Trackers from UAV-Anti-UAV? A Million-Scale Benchmark and New Baseline

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in anti-drone research: prior studies focus predominantly on static ground-based platforms and neglect dynamic adversarial tracking between mobile UAVs. We thus introduce UAV-Anti-UAV multimodal visual tracking—a novel task wherein a pursuing UAV must localize and continuously track a hostile target UAV in real-time video streams. The task is severely challenged by dynamic interference induced by the rapid motion of both the pursuing platform and the target. To enable systematic study, we construct the first large-scale, fully annotated dataset for this task, comprising 1,810 video sequences, each accompanied by a natural-language prompt and 15 fine-grained tracking attributes. We further propose MambaSTS, a baseline method integrating the Mamba state-space model with a Transformer architecture to jointly model spatial, temporal, and semantic information over long sequences. Evaluation on our dataset reveals significant room for improvement, establishing a new benchmark and technical foundation for mobile-platform anti-drone tracking.

📝 Abstract
Unmanned Aerial Vehicles (UAVs) offer wide-ranging applications but also pose significant safety and privacy risks in sensitive areas such as airports and critical infrastructure, spurring the rapid development of Anti-UAV technologies in recent years. However, current Anti-UAV research primarily focuses on RGB, infrared (IR), or RGB-IR videos captured by fixed ground cameras, with little attention to tracking target UAVs from another moving UAV platform. To fill this gap, we propose a new multi-modal visual tracking task termed UAV-Anti-UAV, which involves a pursuer UAV tracking a target adversarial UAV in the video stream. Compared to existing Anti-UAV tasks, UAV-Anti-UAV is more challenging due to severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target. To advance research in this domain, we construct a million-scale dataset consisting of 1,810 videos, each manually annotated with bounding boxes, a language prompt, and 15 tracking attributes. Furthermore, we propose MambaSTS, a Mamba-based baseline method for UAV-Anti-UAV tracking, which enables integrated spatial-temporal-semantic learning. Specifically, we employ Mamba and Transformer models to learn global semantic and spatial features, respectively, and leverage the state space model's strength in long-sequence modeling to establish video-level long-term context via a temporal token propagation mechanism. We conduct experiments on the UAV-Anti-UAV dataset to validate the effectiveness of our method. A thorough experimental evaluation of 50 modern deep tracking algorithms demonstrates that there is still significant room for improvement in the UAV-Anti-UAV domain. The dataset and codes will be available at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.
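The abstract's temporal token propagation idea — carrying video-level context forward by folding each frame into a recurrently updated token — can be illustrated with a minimal linear state-space recurrence. This is a hedged sketch, not the authors' MambaSTS implementation: all names, dimensions, and the fixed (untrained) parameters below are hypothetical, chosen only to show the propagation mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 8    # size of the propagated temporal token / hidden state
FEAT_DIM = 16    # size of the per-frame spatial feature vector

# Fixed (untrained) state-space parameters, scaled so the recurrence is stable.
A = 0.9 * np.eye(STATE_DIM)                           # state transition
B = rng.standard_normal((STATE_DIM, FEAT_DIM)) * 0.1  # input projection
C = rng.standard_normal((FEAT_DIM, STATE_DIM)) * 0.1  # output projection

def propagate(frames):
    """Run the recurrence h_t = A h_{t-1} + B x_t over a clip.

    frames: array of shape (T, FEAT_DIM), one feature vector per frame.
    Returns the final temporal token h_T and per-frame outputs y_t = C h_t.
    """
    h = np.zeros(STATE_DIM)
    outputs = []
    for x in frames:
        h = A @ h + B @ x      # fold the current frame into the token
        outputs.append(C @ h)  # decode a context-conditioned feature
    return h, np.stack(outputs)

frames = rng.standard_normal((30, FEAT_DIM))  # a mock 30-frame clip
token, outs = propagate(frames)
print(token.shape, outs.shape)  # (8,) (30, 16)
```

Because the token is updated in constant time per frame, context accumulates over arbitrarily long videos without re-attending to past frames — the long-sequence property of state-space models that the abstract leverages.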
Problem

Research questions and friction points this paper is trying to address.

Existing Anti-UAV research relies on RGB, IR, or RGB-IR video from fixed ground cameras and largely ignores tracking from a moving UAV platform
Tracking a target UAV from a pursuer UAV suffers severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target
No large-scale annotated benchmark previously existed for mobile-platform UAV-Anti-UAV tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces UAV-Anti-UAV multi-modal visual tracking task
Proposes MambaSTS baseline with integrated spatial-temporal-semantic learning
Constructs million-scale dataset with annotated videos and tracking attributes
Chunhui Zhang
Shanghai Jiao Tong University, Shanghai, 200240, China; the Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 511458, China; and CloudWalk Technology Co., Ltd, 201203, China
Li Liu
Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 511458, China
Zhipeng Zhang
School of Artificial Intelligence, Shanghai Jiao Tong University
Computer Vision, Object Tracking and Segmentation
Yong Wang
School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen, 518107, China
Hao Wen
CloudWalk Technology Co., Ltd, 201203, China
Xi Zhou
CloudWalk Technology Co., Ltd, 201203, China
Shiming Ge
Institute of Information Engineering, Chinese Academy of Sciences
Computer Vision, Artificial Intelligence
Yanfeng Wang
Shanghai Jiao Tong University