Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in visual object tracking (VOT): weak long-term temporal consistency and insufficient cross-modal fusion. To this end, it adapts the Segment Anything Model 2 (SAM2) to the VOT task. The proposed method features a dynamic memory prompting mechanism that enables online temporal prompt updating and cross-modal feature alignment, and it further integrates contrastive-learning-guided mask refinement with multimodal temporal memory enhancement to significantly improve tracking robustness and accuracy. By preserving SAM2's strong generalization capability, the approach effectively models target evolution and modality complementarity. Evaluated on the multi-modal tracking track of the 2024 ICPR MMVPR Challenge, the method achieves first-place performance with an AUC of 89.4%, substantially outperforming the baseline approaches.
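The summary above stays at a high level; the sketch below illustrates, in plain Python, what "online temporal prompt updating" could look like in a SAM2-style tracker: the mask predicted on one frame is converted into the box prompt for the next, while a short memory of recent masks is kept around for fallback. Everything here (the `Sam2LikePredictor` stub, `mask_to_box`, the memory size) is a hypothetical illustration under assumed interfaces, not the authors' implementation or the real SAM2 API.

```python
from collections import deque

import numpy as np


class Sam2LikePredictor:
    """Stand-in for a SAM2-style image predictor (hypothetical interface)."""

    def segment(self, frame: np.ndarray, box: np.ndarray) -> np.ndarray:
        # A real predictor would run an image encoder and mask decoder;
        # this stub just returns a mask that fills the prompt box.
        mask = np.zeros(frame.shape[:2], dtype=bool)
        x0, y0, x1, y1 = box.astype(int)
        mask[y0:y1, x0:x1] = True
        return mask


def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Convert a binary mask into an xyxy box to use as the next prompt."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1], dtype=float)


def track(frames, init_box, memory_size=7):
    """Box-prompted tracking loop with online prompt updating."""
    predictor = Sam2LikePredictor()
    memory = deque(maxlen=memory_size)  # short temporal memory of recent masks
    box = np.asarray(init_box, dtype=float)
    boxes = []
    for frame in frames:
        mask = predictor.segment(frame, box)
        if mask.any():
            memory.append(mask)      # remember the mask for potential re-prompting
            box = mask_to_box(mask)  # update the prompt from the fresh prediction
        # on an empty mask (e.g., occlusion) we simply keep the previous box
        boxes.append(box.copy())
    return boxes


if __name__ == "__main__":
    video = [np.zeros((360, 640, 3), dtype=np.uint8) for _ in range(5)]
    print(track(video, init_box=[100, 80, 220, 200]))
```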

📝 Abstract
We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first-place AUC score of 89.4 on the 2024 ICPR Multi-Modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of existing VOT solutions, with particular attention to the multi-modal nature of the dataset.
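The abstract highlights the multi-modal aspect of the dataset, but this page does not spell out the fusion scheme. As one hedged possibility, the sketch below shows simple early fusion: an aligned single-channel auxiliary frame (thermal or depth, say) is blended into the RGB input before it reaches the tracker. The function name and the blending weight `alpha` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def fuse_modalities(rgb: np.ndarray, aux: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend an aligned single-channel auxiliary frame into an RGB frame.

    Assumes both inputs are spatially registered; `alpha` is an assumed
    blending weight, not a value from the paper.
    """
    spread = float(aux.max() - aux.min())
    aux_norm = (aux - aux.min()) / max(spread, 1e-6)        # scale aux to [0, 1]
    rgb_norm = rgb.astype(np.float32) / 255.0
    fused = (1.0 - alpha) * rgb_norm + alpha * aux_norm[..., None]
    return (fused * 255.0).clip(0, 255).astype(np.uint8)    # back to a uint8 image


if __name__ == "__main__":
    rgb = np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)
    thermal = np.random.rand(360, 640).astype(np.float32)
    print(fuse_modalities(rgb, thermal).shape)  # -> (360, 640, 3)
```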
Problem

Research questions and friction points this paper is trying to address.

Adapt SAM2 to the Visual Object Tracking (VOT) task
Enhance SAM2's performance with key techniques
Achieve the top score on the multi-modal tracking challenge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapt SAM2 for Visual Object Tracking
Leverage pre-trained SAM2 capabilities
Enhance SAM2 with key optimizations
Cheng-Yen Yang
University of Washington, Seattle WA, USA
Computer Vision · Deep Learning
Hsiang-Wei Huang
University of Washington, Seattle WA, USA
Computer Vision · Deep Learning · 3D Vision
Pyong-Kun Kim
Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea
Chien-Kai Kuo
University of Washington, Seattle WA, USA
Jui-Wei Chang
University of Washington, Seattle WA, USA
Kwang-Ju Kim
Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea
Computer Vision · Machine Learning
Chung-I Huang
National Center for High-performance Computing, Hsinchu, Taiwan
Jenq-Neng Hwang
University of Washington, Seattle WA, USA