Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately localizing target segments in long videos using text queries, a task hindered by existing methods’ neglect of caption context and temporal motion dynamics. To overcome this limitation, we propose a two-stage framework: first, a large language model matches relevant subtitle cues to the query, and a text-to-video model generates short auxiliary clips conditioned on this caption context to serve as temporal priors; second, a multimodal controllable Mamba network fuses query semantics with these priors for precise and efficient localization. Our approach integrates caption-guided video generation with temporal prior enhancement and introduces a video-guided gating mechanism that strengthens Mamba’s capacity for modeling long sequences. Evaluated on the TVR benchmark, our method significantly outperforms state-of-the-art approaches, achieving higher recall on long sequences while reducing computational overhead.

📝 Abstract
Text-driven video moment retrieval (VMR) remains challenging because the hidden temporal dynamics of untrimmed videos are only partially captured, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from the high computational costs of Transformer-based architectures. Because existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, which are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, the augmented queries are processed by a multimodal controlled Mamba network that extends text-controlled selection with video-guided gating, efficiently fusing the generated priors with long sequences while filtering noise. Our framework is agnostic to the base retrieval model and widely applicable to multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
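The abstract's video-guided gating can be illustrated with a minimal sketch: features of the generated auxiliary clip produce a per-frame gate that attenuates frames of the long sequence unrelated to the temporal prior. All function names, shapes, and the concatenate-project-sigmoid form below are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def video_guided_gating(seq_feats, prior_feat, W_g, b_g):
    """Hypothetical sketch of video-guided gating.

    seq_feats : (T, d) frame features of the long untrimmed video
    prior_feat: (d,)   pooled feature of the generated auxiliary clip
    W_g, b_g  : (2d, d), (d,) learnable gate parameters (assumed form)
    Returns gated features (T, d) and the gate values in (0, 1).
    """
    T, _ = seq_feats.shape
    # Condition each frame on the temporal prior, then squash to (0, 1).
    fused = np.concatenate([seq_feats, np.tile(prior_feat, (T, 1))], axis=1)
    gate = 1.0 / (1.0 + np.exp(-(fused @ W_g + b_g)))
    # Frames inconsistent with the generated prior are down-weighted
    # before entering the (here omitted) Mamba-style state update.
    return gate * seq_feats, gate

rng = np.random.default_rng(0)
T, d = 8, 16
seq = rng.standard_normal((T, d))
prior = rng.standard_normal(d)
W_g = 0.1 * rng.standard_normal((2 * d, d))
b_g = np.zeros(d)
out, gate = video_guided_gating(seq, prior, W_g, b_g)
```

In the paper's full model this gate would feed the selective state-space update rather than a plain elementwise product; the sketch only shows how a generated-video prior can control which parts of a long sequence are retained.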
Problem

Research questions and friction points this paper is trying to address.

video moment retrieval
temporal grounding
multimodal query
temporal dynamics
long-sequence video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba
video moment retrieval
temporal grounding
text-to-video generation
multimodal fusion
Yunzhuo Sun
Dalian University of Technology
Xinyue Liu
Amazon
Data Mining, Machine Learning
Yanyang Li
The Chinese University of Hong Kong
Natural Language Processing
Nanding Wu
Dalian University of Technology
Yifang Xu
Fudan University
Linlin Zong
Dalian University of Technology
Xianchao Zhang
Dalian University of Technology
Wenxin Liang
Dalian University of Technology