See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video moment retrieval (MR) and highlight detection (HD) methods treat the natural language query and the video clips as a whole, ignoring word-level semantic importance, which leads to coarse-grained contextual understanding. To address this, we propose a fine-grained clip filtering framework built around a keyword-aware mechanism: (1) multimodal large language models (MLLMs) provide joint visual-textual scene understanding; (2) a Feature Enhancement Module (FEM) strengthens keyword-driven cross-modal representations; and (3) a Ranking-based Filtering Module (RFM) iteratively refines the set of candidate clips. The approach explicitly models word-clip alignment, which improves contextual semantic matching. Experiments on major benchmarks, including Charades-STA and QVHighlights, report state-of-the-art results in both moment localization and highlight detection.

📝 Abstract

Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in a video. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: https://github.com/VisualAIKHU/SRF.

Problem

Research questions and friction points this paper is trying to address.

Identifies important words in text queries for video understanding

Filters video clips by relevance to prioritized query words

Enhances moment retrieval and highlight detection via scene analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLMs for scene understanding and semantic enhancement

Introduces FEM to capture important words from queries

Employs RFM for iterative clip filtering and ranking
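
The summary above does not spell out how the FEM and RFM are implemented, so the following is only a toy sketch of the general idea of importance-weighted, iterative clip filtering. Everything in it is an assumption made for illustration: the function name `rank_and_filter`, the parameters `keep_ratio` and `rounds`, the use of a softmax over per-word logits as importance weights, and the dot-product relevance score. It is not the authors' code; see the linked repository for the actual method.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_and_filter(clip_feats, word_feats, word_logits, keep_ratio=0.5, rounds=2):
    """Return the sorted indices of clips that survive the filtering rounds.

    Hypothetical stand-in for keyword-aware filtering:
    - word_logits play the role of learned word-importance scores;
    - each clip is scored by its importance-weighted similarity to the words;
    - every round keeps only the top `keep_ratio` fraction of remaining clips.
    In a trained model the features would typically be re-encoded between
    rounds; here they are fixed, so the rounds only shrink the candidate set.
    """
    weights = softmax(word_logits)
    kept = list(range(len(clip_feats)))
    for _ in range(rounds):
        scores = {
            i: sum(w * dot(clip_feats[i], wf) for w, wf in zip(weights, word_feats))
            for i in kept
        }
        n_keep = max(1, math.ceil(len(kept) * keep_ratio))
        ranked = sorted(kept, key=lambda i: scores[i], reverse=True)
        kept = sorted(ranked[:n_keep])  # restore temporal order
    return kept


# Toy data: 4 clips and 2 query words in a 2-D feature space.
clips = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
words = [[1.0, 0.0], [0.0, 1.0]]
logits = [2.0, 0.0]  # the first word is treated as the important one

survivors = rank_and_filter(clips, words, logits, keep_ratio=0.5, rounds=2)
print(survivors)
```

In this toy setup the first word dominates the weighting, so clips aligned with it rank highest and the clips aligned only with the less important word are dropped first.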
