MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RefVOS methods rely on hand-crafted heuristic sampling or external keyframe models and struggle to balance temporal modeling accuracy with architectural simplicity. This paper proposes an end-to-end trainable, LLM-driven framework that addresses this limitation. The method introduces three core innovations: (1) a moment-centric sampling strategy that explicitly models the temporal alignment between language expressions and video segments; (2) a bidirectional anchor-update propagation mechanism that enables precise key-segment localization and preserves motion detail without external models; and (3) a unified temporal similarity matching scheme leveraging [FIND] tokens, combined with dense-sparse hybrid sampling and dynamic anchor optimization. Evaluated on multiple benchmarks, the approach achieves significant improvements in segmentation accuracy (mAP ↑3.2%) and temporal localization (tIoU@0.5 ↑5.7%), while maintaining high inference efficiency and robustness.
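The temporal similarity matching scheme in the summary can be illustrated with a minimal sketch: compare a dedicated [FIND] token embedding against per-frame temporal tokens by cosine similarity, and take the contiguous span of frames above a threshold as the key moment. The function name, threshold value, and thresholding rule here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def locate_moment(find_token, frame_tokens, threshold=0.5):
    """Hypothetical sketch of [FIND]-token temporal matching.

    find_token:   (d,) embedding of the [FIND] token.
    frame_tokens: (T, d) per-frame temporal token embeddings.
    Returns (start, end, similarities) where frames with cosine
    similarity >= threshold delimit the predicted key moment.
    """
    f = find_token / np.linalg.norm(find_token)
    t = frame_tokens / np.linalg.norm(frame_tokens, axis=1, keepdims=True)
    sim = t @ f                         # cosine similarity per frame
    keep = np.where(sim >= threshold)[0]
    if keep.size == 0:                  # fall back to the single best frame
        best = int(sim.argmax())
        return best, best, sim
    return int(keep.min()), int(keep.max()), sim
```

In the paper this matching is learned end to end; the sketch only shows the inference-time comparison, with the threshold standing in for whatever decision rule the model actually uses.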

📝 Abstract
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates the anchor at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
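The dense-sparse split behind Moment-Centric Sampling can be sketched as a simple frame-index schedule: sample every frame (or every few frames) inside the grounded moment and only every N-th frame outside it. The function name and stride values below are illustrative assumptions; the paper's actual sampling policy may differ.

```python
def moment_centric_sampling(num_frames, moment_start, moment_end,
                            dense_stride=1, sparse_stride=8):
    """Hypothetical sketch of Moment-Centric Sampling (MCS).

    Frames inside [moment_start, moment_end] are sampled densely to
    keep motion detail; frames outside are sampled sparsely to retain
    global context at low cost. Returns sorted unique frame indices.
    """
    dense = range(moment_start, moment_end + 1, dense_stride)
    before = range(0, moment_start, sparse_stride)
    after = range(moment_end + 1, num_frames, sparse_stride)
    return sorted(set(before) | set(dense) | set(after))
```

For a 100-frame clip with the moment grounded at frames 40 to 50, this keeps all eleven moment frames but only a handful of context frames on either side.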
Problem

Research questions and friction points this paper is trying to address.

Segmenting video objects using natural language descriptions
Optimizing temporal sentence grounding and video segmentation jointly
Improving sampling strategy for motion details and global context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework jointly optimizes temporal grounding and segmentation
Moment-Centric Sampling strategy balances dense and sparse frame selection
Bidirectional Anchor-updated Propagation enhances tracking stability dynamically
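The propagation idea in the last bullet can be sketched as a control loop: initialize a mask at the anchor frame (the most relevant moment), then propagate it forward and backward through the sampled frames, refreshing the reference mask at each sampled point so drift does not accumulate. `segment_fn` and `propagate_fn` are hypothetical hooks standing in for the model's segmentation and mask-propagation components.

```python
def bidirectional_anchor_propagation(sampled_idx, anchor_idx,
                                     segment_fn, propagate_fn):
    """Hypothetical sketch of Bidirectional Anchor-updated Propagation (BAP).

    segment_fn(i)            -> initial mask for frame i (hook, assumed).
    propagate_fn(mask, i, j) -> mask for frame j given frame i's mask (hook, assumed).
    The most recently produced mask becomes the new anchor, limiting
    accumulated propagation error.
    """
    masks = {anchor_idx: segment_fn(anchor_idx)}
    # Forward pass: propagate from the anchor toward later sampled frames.
    prev = anchor_idx
    for i in sorted(j for j in sampled_idx if j > anchor_idx):
        masks[i] = propagate_fn(masks[prev], prev, i)
        prev = i  # anchor update
    # Backward pass: propagate from the anchor toward earlier sampled frames.
    prev = anchor_idx
    for i in sorted((j for j in sampled_idx if j < anchor_idx), reverse=True):
        masks[i] = propagate_fn(masks[prev], prev, i)
        prev = i  # anchor update
    return masks
```

This only captures the scheduling logic; in the actual system the propagation step is a learned mask tracker, not a generic callable.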
Authors
Ming Dai (Southeast University; MLLM, Visual Grounding, Image Retrieval)
Sen Yang (Baidu VIS)
Boqiang Duan (Baidu VIS)
Wankou Yang (Southeast University)
Jingdong Wang (Baidu VIS)