🤖 AI Summary
Existing methods for fine-grained composed video retrieval, where a query video and a detailed textual modification description are used to retrieve a target video, struggle to model complex temporal actions and dense semantic modifications. To address this, we introduce Dense-WebVid-CoVR, the first large-scale densely modified video retrieval dataset, comprising 1.6 million samples, roughly seven times more than the previous largest benchmark. We further propose a cross-attention fusion model built on a grounded, spatially and temporally aware text encoder, enabling fine-grained alignment between the visual and textual modalities at both frame-level and semantic-level granularity. In the joint vision+text setting, our method achieves 71.3% Recall@1, surpassing the prior state of the art by 3.4 percentage points, and sets new state-of-the-art performance across all metrics.
📝 Abstract
Composed video retrieval is a challenging task that strives to retrieve a target video given a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text, around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting, outperforming the prior state of the art by 3.4%, highlighting its efficacy in leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR
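To make the cross-attention fusion idea concrete, the sketch below shows one common way to fuse text tokens (query caption plus modification text) with per-frame video features: the text tokens act as attention queries over the frame features, and the result is pooled into a joint embedding for retrieval. This is a minimal illustrative example with assumed dimensions and module names (`CrossAttentionFusion`, `dim=256`, etc.), not the paper's actual architecture or the released code.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of text and video features.
    Shapes, sizes, and structure are assumptions for demonstration only."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_tokens: torch.Tensor, video_frames: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (B, T_text, dim)   token features from a text encoder
        # video_frames: (B, T_frames, dim) per-frame visual features
        fused, _ = self.cross_attn(query=text_tokens,
                                   key=video_frames,
                                   value=video_frames)
        fused = self.norm(text_tokens + fused)   # residual connection + layer norm
        return self.proj(fused.mean(dim=1))      # mean-pool tokens -> joint embedding

# Toy usage with random features
fusion = CrossAttentionFusion()
txt = torch.randn(2, 12, 256)   # 2 queries, 12 text tokens each
vid = torch.randn(2, 8, 256)    # 8 frames each
emb = fusion(txt, vid)          # joint embedding, shape (2, 256)
```

The joint embedding can then be compared against target-video embeddings with cosine similarity for retrieval; a real system would train the whole pipeline with a contrastive objective.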