Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video understanding faces key bottlenecks in interactive retrieval: limited model diversity, storage redundancy, inaccurate temporal localization, and context-agnostic re-ranking. To address these, we propose an efficient and robust moment retrieval framework for long videos. Our approach features four core innovations: (1) multi-granularity model ensemble search, integrating CLIP and BEiT-3; (2) representative keyframe sampling and deduplicated storage based on TransNetV2; (3) dual-query-driven fine-grained temporal localization; and (4) neighbor-frame-aware contextual re-ranking. Extensive experiments demonstrate significant improvements in retrieval accuracy, response latency, and result interpretability across both known-item retrieval and video question answering tasks. The framework effectively supports real-world interactive long-video retrieval with enhanced efficiency and robustness.
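The dual-query localization idea above can be illustrated with a minimal sketch: given precomputed per-frame similarity scores for a "start" query and an "end" query, pick the frame pair that jointly maximizes both, subject to ordering and a length cap. The function name, score arrays, and `max_len` parameter are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def localize_segment(start_scores, end_scores, max_len=50):
    """Pick the (start, end) frame pair maximizing the combined
    start-query and end-query similarities, with start <= end and
    at most max_len frames between them."""
    best, best_pair = -np.inf, (0, 0)
    for s in range(len(start_scores)):
        # only consider ends within max_len frames after the start
        for e in range(s, min(s + max_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best:
                best, best_pair = score, (s, e)
    return best_pair

# toy example: start query peaks at frame 2, end query at frame 5
start_scores = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.2])
end_scores   = np.array([0.1, 0.1, 0.2, 0.3, 0.4, 0.8])
print(localize_segment(start_scores, end_scores))  # -> (2, 5)
```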

📝 Abstract
Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEiT-3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
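The ensemble search strategy described in the abstract can be sketched as a weighted fusion of cosine similarities from two embedding spaces. This is a minimal illustration under assumed inputs (precomputed query and frame embeddings for each model); the function name and fusion weight `alpha` are hypothetical, not the paper's actual formulation.

```python
import numpy as np

def ensemble_search(query_clip, query_beit, frames_clip, frames_beit,
                    alpha=0.5, top_k=5):
    """Fuse coarse-grained (CLIP) and fine-grained (BEiT-3) retrieval
    by a weighted sum of cosine similarities; returns top-k frame indices."""
    def cosine(q, m):
        q = q / np.linalg.norm(q)
        m = m / np.linalg.norm(m, axis=1, keepdims=True)
        return m @ q
    scores = (alpha * cosine(query_clip, frames_clip)
              + (1 - alpha) * cosine(query_beit, frames_beit))
    return np.argsort(-scores)[:top_k]

# toy example: frame 2 matches the query exactly in both embedding spaces
rng = np.random.default_rng(0)
q_clip, q_beit = rng.normal(size=8), rng.normal(size=8)
frames_clip = rng.normal(size=(4, 8))
frames_beit = rng.normal(size=(4, 8))
frames_clip[2], frames_beit[2] = q_clip, q_beit
print(ensemble_search(q_clip, q_beit, frames_clip, frames_beit, top_k=3)[0])  # -> 2
```

Weighted score fusion is only one way to combine two retrieval models; rank-based fusion (e.g. reciprocal rank fusion) is a common alternative when the two score scales are not comparable.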
Problem

Research questions and friction points this paper is trying to address.

Enhance interactive video retrieval accuracy using multi-granularity models
Optimize storage by reducing redundancy in video keyframes
Improve temporal search stability with context-aware reranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble search with CLIP and BEiT-3 models
Storage optimization via TransNetV2-based keyframe deduplication
Temporal search with dual-query localization
Temporal reranking using neighboring frame context
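The context-aware reranking contribution can be sketched as follows: smooth each candidate frame's similarity score with those of its temporal neighbors before ranking, so that an isolated high score (likely noise) is demoted relative to a frame inside a consistently relevant region. The function name and `window` parameter are hypothetical; this is not the paper's exact method.

```python
import numpy as np

def rerank_with_context(frame_scores, window=1):
    """Re-score each frame as the mean of its own similarity and that
    of its temporal neighbors, then rank by the smoothed score."""
    n = len(frame_scores)
    smoothed = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        smoothed[i] = frame_scores[lo:hi].mean()
    return np.argsort(-smoothed)

# frame 1 is an isolated spike; frames 3-5 form a relevant region
scores = np.array([0.2, 0.9, 0.1, 0.7, 0.8, 0.7])
print(rerank_with_context(scores))  # -> [5 4 2 0 3 1]
```

In this toy example the spiky frame 1 drops to the bottom of the ranking, while frames surrounded by similarly relevant neighbors rise, which is the stabilizing effect the abstract describes.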
Huu-Loc Tran
University of Information Technology, VNU-HCM, Vietnam
Tinh-Anh Nguyen-Nhu
Ho Chi Minh University of Technology, VNU-HCM, Vietnam
Huu-Phong Phan-Nguyen
University of Information Technology, VNU-HCM, Vietnam
Tien-Huy Nguyen
University of Information Technology, VNU-HCM, Vietnam
Nhat-Minh Nguyen-Dich
Hanoi University of Science and Technology, Hanoi, Vietnam
Anh Dao
Undergraduate Student, Michigan State University
Quan Nguyen
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
Hoang M. Le
York University, Canada
Q. Dinh
AI VIETNAM Lab