🤖 AI Summary
This work addresses two key challenges in visual object tracking (VOT): weak long-term temporal consistency and insufficient cross-modal fusion. To this end, it introduces the Segment Anything Model 2 (SAM2) to VOT for the first time. The proposed method features a dynamic memory prompting mechanism that enables online temporal prompt updating and cross-modal feature alignment. It further integrates contrastive learning-guided mask refinement with multimodal temporal memory enhancement to significantly improve tracking robustness and accuracy. By preserving SAM2's strong generalization capability, the approach effectively models target evolution and modality complementarity. On the 2024 ICPR Multi-modal Object Tracking challenge, the method achieves a first-place AUC of 89.4, substantially outperforming baseline approaches.
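The dynamic memory prompting mechanism is not detailed in this summary; the sketch below is a minimal, hypothetical illustration of what online temporal prompt updating can look like: a bounded memory of recent high-confidence target masks is kept and used to prompt the segmenter on the next frame. The `segment_with_prompts` callable, the memory size, and the confidence threshold are assumptions for illustration, not the authors' implementation.

```python
from collections import deque
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a binary mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())

def track_with_dynamic_memory(frames, init_mask, segment_with_prompts,
                              mem_size=7, conf_thresh=0.8):
    """Hypothetical online tracking loop with a bounded temporal memory.

    segment_with_prompts(frame, memory) -> (mask, confidence) is a stand-in for a
    SAM2-style prompted segmenter; `memory` holds recent reliable (frame, mask)
    pairs that act as temporal prompts for the next frame.
    """
    memory = deque(maxlen=mem_size)
    memory.append((frames[0], init_mask))      # the first-frame annotation always stays in memory
    results = [mask_to_box(init_mask)]

    for frame in frames[1:]:
        mask, conf = segment_with_prompts(frame, list(memory))
        results.append(mask_to_box(mask))
        if conf >= conf_thresh:                 # only trustworthy predictions update the memory
            memory.append((frame, mask))
    return results
```

The key design choice illustrated here is that the memory is updated online but gated by confidence, so drift from unreliable frames is less likely to contaminate future prompts.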
📝 Abstract
We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance on VOT. Combining SAM2 with our proposed optimizations, we achieved a first-place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of existing VOT solutions and the multi-modal nature of the dataset.
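For context on the reported metric: VOT benchmarks typically report AUC as the area under the success curve, i.e. the fraction of frames whose predicted-box IoU with the ground truth exceeds a threshold, averaged over thresholds. Since SAM2 predicts masks, adapting it to a box-based benchmark usually involves converting masks to axis-aligned boxes (as in `mask_to_box` above). The snippet below is a small, self-contained sketch of that evaluation under these assumptions; it is illustrative and not the challenge's official scoring code.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0.0, 1.0, 21)):
    """Area under the success curve: mean fraction of frames whose IoU exceeds each threshold."""
    ious = np.array([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    success = [(ious > t).mean() for t in thresholds]
    return float(np.mean(success))
```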