🤖 AI Summary
This work addresses a limitation of existing video moment retrieval methods: they typically operate under a closed-set assumption and therefore struggle with out-of-distribution (OOD) queries that are unrelated to the video content in open-world scenarios, which can produce false retrievals that are costly in high-risk settings. To tackle this issue, we introduce Open-Set Video Moment Retrieval (OS-VMR), a new setting in which a model first determines whether a query is in-distribution (ID) and only then performs retrieval, explicitly rejecting OOD queries. The proposed model, OpenVMR, uses a normalizing flow to model the ID query distribution, an uncertainty score to separate ID from OOD queries, video-query and frame-query matching for cross-modal interaction, and a positive-unlabeled learning module for moment retrieval. Experiments on three VMR datasets show that OpenVMR retrieves moments for ID queries while rejecting OOD ones.
📝 Abstract
Video Moment Retrieval (VMR) aims to retrieve the specific moment in an untrimmed video that corresponds to a sentence query. Although recent works have made remarkable progress on this task, they implicitly rely on the closed-set assumption that every given query is video-relevant. (In this paper, we treat a "video-relevant query" as an "in-distribution (ID) query" and a "video-irrelevant query" as an "out-of-distribution (OOD) query".) Given an OOD query in an open-set scenario, these methods still return a moment, and the resulting wrong retrieval might lead to irrecoverable losses in high-risk scenarios, e.g., criminal activity detection. To this end, we explore a new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), in which a model should not only retrieve precise moments for ID queries but also reject OOD queries. We make the first attempt toward OS-VMR and propose a novel model, OpenVMR, which first distinguishes ID from OOD queries using a normalizing flow and then conducts moment retrieval for the ID queries. Specifically, we first learn the ID distribution by constructing a normalizing flow, assuming that the ID query distribution follows a multivariate Gaussian distribution. Then, we introduce an uncertainty score to search for the ID-OOD separating boundary. After that, we refine the ID-OOD boundary by pulling ID query features together. In addition, video-query matching and frame-query matching are designed for coarse-grained and fine-grained cross-modal interaction, respectively. Finally, a positive-unlabeled learning module is introduced for moment retrieval. Experimental results on three VMR datasets show the effectiveness of our OpenVMR.
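
The ID/OOD rejection step described above can be illustrated with a minimal sketch. This is not the OpenVMR implementation: it assumes a single per-dimension affine map as a stand-in for a learned normalizing flow, a standard Gaussian base distribution, the negative log-likelihood as the uncertainty score, and a quantile of the ID scores as the separating boundary. All function names, the toy features, and the `keep_fraction` parameter are hypothetical.

```python
import math


def fit_affine_flow(features):
    """Fit a per-dimension shift and scale so z = (x - mean) / std is roughly N(0, 1).

    Assumption: a single affine layer stands in for a learned normalizing flow.
    """
    n = len(features)
    dim = len(features[0])
    means = [sum(x[d] for x in features) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((x[d] - means[d]) ** 2 for x in features) / n
        stds.append(math.sqrt(var) + 1e-6)  # small constant avoids division by zero
    return means, stds


def uncertainty_score(x, means, stds):
    """Negative log-likelihood of x under the flow; higher means more OOD-like."""
    nll = 0.0
    for xi, m, s in zip(x, means, stds):
        z = (xi - m) / s
        # Change of variables: log p(x) = log N(z; 0, 1) - log(s)
        nll += 0.5 * z * z + 0.5 * math.log(2 * math.pi) + math.log(s)
    return nll


def pick_threshold(id_scores, keep_fraction=0.95):
    """Choose the boundary as a quantile of the ID scores (hypothetical rule)."""
    ordered = sorted(id_scores)
    index = min(len(ordered) - 1, int(keep_fraction * len(ordered)))
    return ordered[index]


def is_ood(x, means, stds, threshold):
    """Reject a query whose uncertainty score exceeds the boundary."""
    return uncertainty_score(x, means, stds) > threshold


# Toy query features (hypothetical); real features would come from a text encoder.
id_features = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
ood_feature = [5.0, -4.0]

means, stds = fit_affine_flow(id_features)
id_scores = [uncertainty_score(x, means, stds) for x in id_features]
threshold = pick_threshold(id_scores)

print("OOD query rejected:", is_ood(ood_feature, means, stds, threshold))
```

In this sketch, a query is passed on to moment retrieval only when `is_ood` returns `False`. The boundary refinement and the positive-unlabeled learning module are not modeled here.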