SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation

📅 2025-08-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Referring Video Object Segmentation (RVOS) methods suffer from imprecise language-video semantic alignment because training samples frames indiscriminately and supervises all visible objects, regardless of their relevance to the referring expression. To address this, the paper proposes SAMDWICH, a moment-aware training framework, together with MeViS-M, a re-annotation of the MeViS benchmark with manually labeled temporal moments indicating when each object is referred to by the expression. The method introduces: (1) Moment-guided Dual-path Propagation (MDP), a moment-centric memory mechanism that grounds the expression on relevant frames while training propagation on both relevant and irrelevant ones; and (2) Object-level Selective Supervision (OSS), which applies segmentation loss only to objects temporally aligned with the expression in each clip. Evaluated on the MeViS benchmark, SAMDWICH achieves state-of-the-art performance, particularly excelling in complex referring-expression scenarios.

📝 Abstract
Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training -- regardless of their actual relevance to the expression. To address this, we introduce a moment-aware RVOS framework named SAMDWICH, along with a newly annotated dataset, MeViS-M, built upon the challenging MeViS benchmark. We manually annotate temporal moments indicating when each object is referred to by the expression, enabling semantically grounded supervision that strengthens video-text alignment. SAMDWICH leverages these aligned text-to-clip pairs to guide training, significantly enhancing referential understanding. Building upon this framework, we propose Moment-guided Dual-path Propagation (MDP), a moment-aware propagation strategy that improves both object grounding and tracking by training on both relevant and irrelevant frames through a moment-centric memory mechanism. In addition, we introduce Object-level Selective Supervision (OSS), an object-level filtering strategy that supervises only the objects temporally aligned with the expression in each training clip. This selective supervision reduces semantic noise and reinforces language-conditioned learning. Extensive experiments show that SAMDWICH achieves state-of-the-art performance on the challenging MeViS benchmark, particularly excelling in complex scenarios involving diverse expressions.
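The moment-aware supervision described above replaces random frame sampling with clips drawn from the annotated moment in which the expression actually refers to the object. A minimal sketch of that sampling step, assuming MeViS-M-style (start, end) frame annotations; function and parameter names are illustrative, not from the paper:

```python
# Hypothetical sketch of moment-aware clip sampling: rather than drawing
# frames uniformly at random from the whole video, the training clip is
# centered on a frame inside the annotated moment, so the sampled clip
# overlaps the span where the expression refers to the object.
import random


def sample_moment_clip(num_frames, moment, clip_len, rng=random):
    """Return clip_len consecutive frame indices that overlap `moment`.

    num_frames: total frames in the video.
    moment: (start, end) frame span (end exclusive) from the annotation.
    clip_len: number of frames in the training clip.
    """
    ms, me = moment
    # Pick a center frame inside the annotated moment.
    center = rng.randint(ms, max(ms, me - 1))
    # Clamp the clip so it stays within the video bounds.
    start = min(max(0, center - clip_len // 2), max(0, num_frames - clip_len))
    return list(range(start, min(num_frames, start + clip_len)))
```

By construction the returned clip always intersects the annotated moment, which is the property the paper's moment-aware training relies on; an indiscriminate sampler offers no such guarantee.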
Problem

Research questions and friction points this paper is trying to address.

Addresses semantic misalignment in video-text object segmentation
Introduces moment-aware training for precise video-text alignment
Reduces semantic noise with selective object-level supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Moment-aware RVOS framework SAMDWICH
Moment-guided Dual-path Propagation (MDP)
Object-level Selective Supervision (OSS)
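Of the contributions above, OSS is the most mechanically simple: the segmentation loss is masked so that only objects whose annotated moment overlaps the sampled clip contribute gradients. A minimal sketch of that idea, assuming per-object (start, end) moment annotations; tensor shapes, names, and the choice of binary cross-entropy are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch of Object-level Selective Supervision (OSS):
# compute segmentation loss only for objects whose annotated moment
# overlaps the sampled training clip, suppressing supervision on
# objects irrelevant to the referring expression.
import torch
import torch.nn.functional as F


def oss_loss(pred_logits, gt_masks, moments, clip_span):
    """pred_logits, gt_masks: (num_objects, T, H, W) tensors;
    moments: per-object (start, end) frame spans (end exclusive);
    clip_span: (start, end) of the sampled training clip."""
    cs, ce = clip_span
    losses = []
    for i, (ms, me) in enumerate(moments):
        # Supervise object i only if its referred moment overlaps the clip.
        if ms < ce and me > cs:
            losses.append(F.binary_cross_entropy_with_logits(
                pred_logits[i], gt_masks[i]))
    # If no object is relevant to this clip, contribute zero loss
    # (kept differentiable so the graph stays intact).
    return torch.stack(losses).mean() if losses else pred_logits.sum() * 0.0
```

The overlap test is the whole filter: objects visible in the clip but never referred to during it are simply excluded from the loss, which is how OSS reduces the semantic noise described in the abstract.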
Seunghun Lee
Korea University
Jiwan Seo
DGIST, Daegu, Republic of Korea
Jeonghoon Kim
DGIST, Daegu, Republic of Korea
Siwon Kim
DGIST, Daegu, Republic of Korea
Haeun Yun
DGIST, Daegu, Republic of Korea
Hyogyeong Jeon
DGIST, Daegu, Republic of Korea
Wonhyeok Choi
DGIST, Daegu, Republic of Korea
Jaehoon Jeong
DGIST, Daegu, Republic of Korea
Zane Durante
Stanford University
Machine Learning · Computer Vision · Multi-modal Learning · AI + Healthcare
Sang Hyun Park
Daegu Gyeongbuk Institute of Science & Technology, South Korea
Medical Image Analysis · Computer Vision · Machine Learning
Sunghoon Im
EECS, DGIST
Computer Vision · Deep Learning · Robot Vision