🤖 AI Summary
Existing composed video retrieval (CoVR) methods support only a single interaction round, so they cannot accommodate users' real-world need to refine their search intent iteratively through multi-turn natural-language feedback. This work formalizes the interactive composed video retrieval task and proposes ReCoVR, a dual-pathway, closed-loop architecture built on reflexive perception. The Intent Pathway integrates heterogeneous feedback and routes it to complementary retrieval channels, while the Reflection Pathway diagnoses and corrects the system's own retrieval trajectory across turns. By moving beyond the single-channel, open-loop designs of existing interactive methods, ReCoVR consistently outperforms interactive baselines across multiple benchmarks, reaching 74.30% R@1 on the WebVid-CoVR-Test dataset after just one round of interaction.
📝 Abstract
Composed video retrieval (CoVR) searches for target videos using a reference video and a modification text, but existing methods are restricted to a single interaction round and cannot support the progressive nature of real-world visual search. To bridge this gap, we first formalize interactive composed video retrieval, a multi-turn extension of CoVR, where users progressively refine their search intent through natural-language feedback across turns. Adapting existing interactive retrieval methods to this setting reveals two structural weaknesses: reliance on a single retrieval channel and an open-loop retrieval design that consumes user feedback but does not diagnose whether its own retrieval trajectory is drifting or stagnating. To address these limitations, we propose ReCoVR (Reflexive Composed Video Retrieval), a dual-pathway architecture built on reflexive perception, where the system treats its retrieval history as diagnostic evidence alongside user feedback. Specifically, an Intent Pathway routes heterogeneous feedback to complementary retrieval channels, while a Reflection Pathway performs trajectory-level reflection to monitor result evolution and correct retrieval errors across turns. Experiments on multiple benchmarks show that ReCoVR consistently outperforms interactive baselines, notably achieving 74.30% R@1 after just one interactive round on the WebVid-CoVR-Test dataset.
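For readers unfamiliar with the R@1 figure quoted above, the following is a minimal sketch of how Recall@K is conventionally computed for retrieval: the fraction of queries whose ground-truth target appears among the top K retrieved items. The function name `recall_at_k`, its arguments, and the toy video IDs are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    k: int = 1,
) -> float:
    """Return the fraction of queries whose target is in the top-k results.

    ranked_ids[i] is the ranked list of retrieved video IDs for query i
    (best match first); target_ids[i] is that query's ground-truth target.
    Both names are hypothetical and chosen for this sketch only.
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("ranked_ids and target_ids must have the same length")
    if not target_ids:
        raise ValueError("at least one query is required")
    if k < 1:
        raise ValueError("k must be a positive integer")

    # A query counts as a hit when its target occurs in the first k results.
    hits = sum(
        1
        for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return hits / len(target_ids)


# Toy example with four queries (made-up IDs, not benchmark data).
rankings = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
targets = ["a", "d", "e", "x"]
r_at_1 = recall_at_k(rankings, targets, k=1)
r_at_2 = recall_at_k(rankings, targets, k=2)
```

Under this definition, a reported 74.30% R@1 would mean that the target video is ranked first for 74.30% of the test queries. In the interactive setting the abstract describes, the ranking evaluated would presumably be the one produced after the stated number of feedback rounds.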