Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

📅 2025-01-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video Moment Retrieval (VMR) typically relies on supervised fine-tuning and suffers from language bias, limiting generalization and requiring extensive annotated data. Method: This paper proposes Moment-GPT, a zero-shot, training-free VMR framework. It freezes a multimodal large language model (MLLM), employs LLaMA-3 for query reformulation to mitigate linguistic bias, and synergistically leverages MiniGPT-v2 and VideoChatGPT for candidate moment generation and video understanding. A lightweight, task-specific span scorer is introduced to achieve semantic alignment and precise ranking. Contribution/Results: Moment-GPT establishes the first end-to-end zero-shot VMR paradigm—requiring neither training data nor parameter updates—by decoupling query refinement, span generation, and video comprehension. It achieves state-of-the-art performance on QVHighlights, ActivityNet-Captions, and Charades-STA, outperforming existing MLLM-based and zero-shot methods in both retrieval accuracy and efficiency.

📝 Abstract
The target of video moment retrieval (VMR) is to predict temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
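The three-stage pipeline described above (query rephrasing, candidate span generation, span scoring and selection) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: every function here is a toy stand-in for a frozen model call (LLaMA-3, MiniGPT-v2 with the span generator, and VideoChatGPT with the span scorer, respectively), and the sliding-window generator and averaged per-second relevance scores are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: float  # seconds
    end: float    # seconds

def rephrase_query(query: str) -> str:
    # Stage 1 (stub): the real pipeline would prompt LLaMA-3 to correct
    # and rephrase the query, mitigating language bias.
    return query.strip().lower()

def generate_candidate_spans(video_len: float, window: float = 10.0) -> list[Span]:
    # Stage 2 (stub): fixed overlapping windows as a toy proxy for the
    # adaptive span generator built on MiniGPT-v2 outputs.
    spans, t = [], 0.0
    while t < video_len:
        spans.append(Span(t, min(t + window, video_len)))
        t += window / 2  # 50% overlap between consecutive windows
    return spans

def score_span(span: Span, query: str, relevance: dict[int, float]) -> float:
    # Stage 3 (stub): average hypothetical per-second query-relevance
    # scores inside the span, standing in for VideoChatGPT + span scorer.
    seconds = range(int(span.start), int(span.end))
    vals = [relevance.get(s, 0.0) for s in seconds]
    return sum(vals) / max(len(vals), 1)

def retrieve_moment(query: str, video_len: float, relevance: dict[int, float]) -> Span:
    q = rephrase_query(query)
    candidates = generate_candidate_spans(video_len)
    return max(candidates, key=lambda s: score_span(s, q, relevance))

# Toy usage: relevance concentrated in seconds 10-20 of a 30 s video.
scores = {s: 1.0 for s in range(10, 20)}
best = retrieve_moment("Person opens the door", 30.0, scores)
print(best)  # Span(start=10.0, end=20.0)
```

The key design point the sketch preserves is decoupling: each stage consumes only the previous stage's output, so every model stays frozen and no parameter updates are needed.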
Problem

Research questions and friction points this paper is trying to address.

Video Moment Retrieval
Pre-trained Language Models
Bias Mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Moment-GPT
Pre-trained Language Models
Video Moment Retrieval
Yifang Xu
Nanjing University
Yunzhuo Sun
Dalian University of Technology
Benxiang Zhai
Nanjing University
Ming Li
Nanjing University of Information Science and Technology
Wenxin Liang
Dalian University of Technology
Yang Li
Nanjing University
Sidan Du
Nanjing University
Image Processing and Control
Machine Learning