🤖 AI Summary
To address weak detection performance and the conflict between the detection and association tasks in end-to-end multi-object tracking (MOT), this paper proposes a self-generated detection prior mechanism. It uncovers and exploits the inherently strong detection capability of MOTR-like models, eliminating the need for external detectors: the Transformer decoder itself autonomously generates high-quality detection priors. A dedicated prior fusion strategy then enables joint optimization of detection and association, and ablation studies validate the contribution of each component. On the DanceTrack benchmark, the method achieves performance competitive with recent state-of-the-art end-to-end trackers, improving IDF1 (+2.3%) and MOTA (+1.8%) over prior end-to-end approaches. This demonstrates that the framework mitigates task interference while preserving the architectural simplicity of end-to-end MOT, enhancing tracking robustness and accuracy.
📝 Abstract
Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.
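The abstract does not spell out how self-generated detection priors are selected or fused with track queries, but the general idea (keep the decoder's high-confidence detections as priors, then hand only the non-duplicate ones to the association side) can be illustrated with a minimal, hypothetical sketch. All function names, thresholds, and the greedy-NMS selection below are illustrative assumptions, not the paper's actual algorithm:

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def select_detection_priors(boxes, scores, score_thresh=0.5, iou_thresh=0.7):
    """Hypothetical prior selection: keep high-confidence decoder outputs,
    suppressing near-duplicates with greedy NMS. Returns kept indices."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if scores[i] < score_thresh:
            break  # scores are sorted, so all remaining ones are below threshold
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep


def fuse_with_track_queries(prior_boxes, track_boxes, iou_thresh=0.5):
    """Hypothetical fusion step: a prior becomes a new detection query only if
    it does not overlap an already-tracked object, so detection and association
    queries do not compete for the same target. Returns kept prior indices."""
    return [
        i for i, p in enumerate(prior_boxes)
        if all(iou(p, t) < iou_thresh for t in track_boxes)
    ]
```

For example, two overlapping high-score boxes collapse to one prior, and a prior that coincides with an active track is dropped before fusion. The real SelfMOTR fusion presumably operates on query embeddings rather than raw boxes; this sketch only conveys the duplicate-suppression intuition.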