🤖 AI Summary
To address poor tracking stability and deployment challenges under low-frequency detection (e.g., 1 Hz) in multi-object tracking (MOT), this paper proposes a novel two-stage matching framework. In the first stage, a geometry-based bounding-box distance metric replaces the conventional Mahalanobis distance, reducing reliance on motion models. In the second stage, Re-ID appearance features are jointly optimized with Kalman filter-predicted states to enforce both motion and visual consistency across frames. The method significantly improves trajectory continuity and identity stability under sparse detections. Evaluated on MOT17-val at 1 Hz detection frequency, it achieves a +11.6% HOTA gain over baseline methods. Moreover, it maintains state-of-the-art performance on full-frame benchmarks—including MOT17, MOT20, and DanceTrack—demonstrating strong generalization and practical applicability for resource-constrained, low-frequency MOT scenarios.
📝 Abstract
Multi-object tracking (MOT) is one of the most challenging tasks in computer vision: objects must be correctly detected and those detections associated across frames. Current approaches mainly track objects in every frame of a video stream, which makes them nearly impossible to run under limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes tracking quality on low-frequency detections. Our method introduces a new two-stage matching strategy that improves cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance in place of the conventional Mahalanobis distance, which allows us to match objects effectively using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving an *11.6%* HOTA improvement at *1* Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
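To make the two-stage idea concrete, here is a minimal sketch of such an association step. This is *not* the paper's implementation: the actual Bbox-Based Distance is not specified here, so the sketch stands in a plain `1 - IoU` geometric cost for stage one, and a cosine distance between L2-normalized Re-ID embeddings for stage two, with greedy assignment and invented thresholds (`iou_thresh`, `app_thresh`).

```python
import numpy as np

def bbox_iou(a, b):
    # Boxes as [x1, y1, x2, y2]; returns intersection-over-union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def _greedy(costs, max_cost, matches, un_t, un_d):
    # Greedily take the cheapest (track, detection) pairs under max_cost.
    for cost, t, d in sorted(costs):
        if cost > max_cost:
            break
        if t in un_t and d in un_d:
            matches.append((t, d))
            un_t.remove(t)
            un_d.remove(d)

def two_stage_match(tracks, dets, iou_thresh=0.3, app_thresh=0.5):
    """Hypothetical two-stage association sketch.

    tracks / dets: lists of dicts with 'bbox' ([x1, y1, x2, y2]) and
    'feat' (L2-normalized Re-ID embedding). Stage 1 matches on a
    geometric bbox cost (1 - IoU here); stage 2 matches the leftovers
    on appearance (cosine distance between embeddings).
    """
    matches = []
    un_t, un_d = list(range(len(tracks))), list(range(len(dets)))
    # Stage 1: geometry-based matching.
    geo = [(1 - bbox_iou(tracks[t]['bbox'], dets[d]['bbox']), t, d)
           for t in un_t for d in un_d]
    _greedy(geo, 1 - iou_thresh, matches, un_t, un_d)
    # Stage 2: appearance-based matching on the remainder.
    app = [(1 - float(np.dot(tracks[t]['feat'], dets[d]['feat'])), t, d)
           for t in un_t for d in un_d]
    _greedy(app, 1 - app_thresh, matches, un_t, un_d)
    return matches, un_t, un_d
```

In a full tracker the stage-one cost would come from the paper's Bbox-Based Distance applied to Kalman-filter-predicted boxes, and greedy assignment would typically be replaced by the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`).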