🤖 AI Summary
To address association errors in multi-object tracking (MOT) caused by occlusion and appearance similarity, this paper proposes a training-free, depth-aware tracking framework. Methodologically, it introduces zero-shot monocular depth estimation as an independent geometric cue in the data association stage, yielding 3D spatial priors without supervision. A parameter-free Hierarchical Alignment Score is designed, combining coarse-grained IoU-based spatial matching with fine-grained pixel-level alignment to enable synergistic geometric-appearance modeling. Unsupervised motion modeling is also integrated to improve robustness in dynamic scenes. The framework achieves state-of-the-art performance on challenging benchmarks including MOT17 and MOT20, with no training or fine-tuning required at any stage. All code is publicly released.
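The core idea of using depth as an independent cue in data association can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `w_depth` weight, the relative-depth cost, and the per-object scalar depths are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    # Boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(tracks, dets, track_depths, det_depths, w_depth=0.5):
    """Combine an IoU cost with an independent depth-consistency cost,
    then solve the assignment with the Hungarian algorithm.

    `track_depths` / `det_depths` are hypothetical per-object scalar
    depths (e.g. the median of a zero-shot monocular depth map inside
    each box)."""
    cost = np.zeros((len(tracks), len(dets)))
    for i, (t, td) in enumerate(zip(tracks, track_depths)):
        for j, (d, dd) in enumerate(zip(dets, det_depths)):
            iou_cost = 1.0 - iou(t, d)
            # Relative depth gap: two boxes that overlap in 2D but sit
            # at very different depths are penalized.
            depth_cost = abs(td - dd) / (max(td, dd) + 1e-9)
            cost[i, j] = iou_cost + w_depth * depth_cost
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost
```

The depth term acts as a separate decision signal rather than being fused into appearance features, which is what lets it disambiguate occluding objects that overlap heavily in the image plane.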
📝 Abstract
Current motion-based multiple object tracking (MOT) approaches rely heavily on Intersection-over-Union (IoU) for object association. Without using 3D features, they are ineffective in scenarios with occlusions or visually similar objects. To address this, our paper presents a novel depth-aware framework for MOT. We estimate depth using a zero-shot approach and incorporate it as an independent feature in the association process. Additionally, we introduce a Hierarchical Alignment Score that refines IoU by integrating both coarse bounding box overlap and fine-grained (pixel-level) alignment to improve association accuracy without requiring additional learnable parameters. To our knowledge, this is the first MOT framework to incorporate 3D features (monocular depth) as an independent decision matrix in the association step. Our framework achieves state-of-the-art results on challenging benchmarks without any training or fine-tuning. The code is available at https://github.com/Milad-Khanchi/DepthMOT
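A coarse-to-fine score of this kind could be instantiated as below. This is only a sketch under stated assumptions: the abstract does not give the exact formula, so the equal-weight average and the zero-mean patch correlation used for the fine term are hypothetical choices, chosen to stay parameter-free.

```python
import numpy as np


def box_iou(a, b):
    # Coarse term: plain IoU on (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def pixel_alignment(p, q):
    """Fine term: normalized correlation between two equally sized
    patches (e.g. depth or intensity crops resized to a common shape).
    Returns a value in [-1, 1]; no learnable parameters involved."""
    p = (p - p.mean()) / (p.std() + 1e-9)
    q = (q - q.mean()) / (q.std() + 1e-9)
    return float(np.clip((p * q).mean(), -1.0, 1.0))


def hierarchical_alignment_score(box_a, box_b, patch_a, patch_b):
    """Hypothetical coarse-to-fine combination: the IoU gives box-level
    overlap, the pixel term refines it; both are averaged equally."""
    coarse = box_iou(box_a, box_b)
    fine = 0.5 * (pixel_alignment(patch_a, patch_b) + 1.0)  # map to [0, 1]
    return 0.5 * (coarse + fine)
```

Because both terms are fixed functions of the inputs, the score stays training-free, matching the abstract's claim of requiring no additional learnable parameters.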