AI Summary
This work addresses the vulnerability of existing RGB-based UAV tracking methods to detection dropouts in long-term urban surveillance, which lead to temporal inconsistency and tracking breakdown. The study applies SAMURAI, a foundation model, to UAV tracking for the first time and proposes a detector-augmented extension that reduces sensitivity to bounding-box initialization and sequence length. The extension improves on SAMURAI's zero-shot tracking performance, particularly in long sequences and when the target exits and re-enters the scene. Across multiple datasets, it achieves up to a 0.393 increase in success rate and up to a 0.475 reduction in miss rate (false negative rate), supporting its effectiveness in complex urban environments.
Abstract
Robust long-term tracking of drones is a critical requirement for modern surveillance systems, given the increasing threat potential of drones. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not yet been investigated. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences, especially under drone exit and re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and false negative rate (FNR) reductions of up to -0.475.
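The abstract reports results as success rate and FNR but does not define either metric. As a rough illustration only, the sketch below assumes that success rate is the fraction of target-visible frames whose predicted box overlaps the ground truth with IoU at or above a threshold, and that FNR is the fraction of target-visible frames with no prediction at all. The function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are hypothetical choices for this example, not taken from the paper, whose evaluation protocol may differ.

```python
from typing import Optional, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_and_fnr(
    gt: Sequence[Optional[Box]],
    pred: Sequence[Optional[Box]],
    iou_thr: float = 0.5,  # hypothetical threshold, not from the paper
) -> Tuple[float, float]:
    """Return (success rate, FNR) over frames where the target is visible.

    A ground-truth entry of None marks a frame where the target is absent
    (for example, after it has left the scene); such frames are skipped.
    A prediction of None means the tracker reported no box for that frame.
    """
    visible = hits = misses = 0
    for g, p in zip(gt, pred):
        if g is None:
            continue
        visible += 1
        if p is None:
            misses += 1  # target present but nothing reported
        elif iou(g, p) >= iou_thr:
            hits += 1
    if visible == 0:
        return 0.0, 0.0
    return hits / visible, misses / visible


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (0, 0, 10, 10), None, (0, 0, 10, 10)]
    pred = [(0, 0, 10, 10), None, None, (5, 5, 15, 15)]
    print(success_and_fnr(gt, pred))
```

Under these assumed definitions, a detection dropout on a visible frame raises FNR, and a box that drifts off the target lowers success rate without affecting FNR, which is one reason the two numbers are reported separately.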