🤖 AI Summary
Multi-object tracking (MOT) suffers from heavy reliance on detection accuracy and poor generalization across datasets. To address this, we propose a novel “Tracking by Segmentation” paradigm that bypasses conventional detection-driven pipelines and directly generates tracking bounding boxes from segmentation masks. Our method builds upon the SAM2 architecture and incorporates a trajectory manager, a cross-object interaction module, and a mask-to-bounding-box mapping mechanism to enable end-to-end, zero-shot cross-dataset tracking. The core innovation lies in segmentation-driven tracking modeling, which significantly improves robustness to occlusion and enhances modeling of object lifecycles. Extensive experiments demonstrate state-of-the-art performance on DanceTrack, UAVDT, and BDD100K: on DanceTrack, our approach achieves +2.1 HOTA and +4.5 IDF1 over prior art.
📝 Abstract
Segment Anything Model 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zero-shot generalization, allowing it to work across datasets without fine-tuning, and strong object association, inherited from SAM2. To further improve performance, we integrate a trajectory manager system for precise object addition and removal, and a cross-object interaction module to handle occlusions. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT.
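The mask-to-bounding-box mapping at the heart of Tracking by Segmentation is, in its simplest form, taking the tight axis-aligned box around a binary mask. The paper does not publish its exact mapping code here, so the sketch below is a minimal, assumed illustration of the idea (the function name `mask_to_bbox` is ours):

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Convert a binary segmentation mask of shape (H, W) into a tight
    axis-aligned bounding box (x_min, y_min, x_max, y_max).
    Returns None for an empty mask (e.g. a fully occluded object)."""
    ys, xs = np.nonzero(mask)          # pixel coordinates of the object
    if ys.size == 0:
        return None                    # no pixels: no box to emit
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: a filled rectangle covering rows 1-3 and columns 2-4.
mask = np.zeros((6, 6), dtype=bool)
mask[1:4, 2:5] = True
print(mask_to_bbox(mask))  # (2, 1, 4, 3)
```

A per-frame tracker output would apply this to each object's mask from SAM2; the empty-mask case is where a trajectory manager, as described above, would decide whether to suspend or remove the track.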