🤖 AI Summary
Existing composed video retrieval methods struggle to model the causal and temporal effects implicitly conveyed by textual edits—such as motion or state changes—limiting their retrieval performance. This work proposes a zero-shot, reasoning-first approach that, for the first time, explicitly incorporates causal and temporal reasoning into the task, leveraging large-scale multimodal models to infer the implicit consequences of editing instructions and align them with candidate videos. To evaluate how well models understand and generalize such implicit effects, the authors introduce CoVR-Reason, a new benchmark featuring structured reasoning trajectories and challenging distractors. Experiments demonstrate that the method significantly outperforms strong baselines on CoVR-Reason, particularly on subsets involving implicit effects, with both automatic and human evaluations confirming its superior reasoning consistency and factual accuracy.
📝 Abstract
Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on Recall@K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
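The two-stage pipeline described in the abstract (infer the edit's after-effects, then align the reasoned query to candidates) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `reason_about_edit` stands in for a prompted multimodal model, and the bag-of-words embedding stands in for a real video-text encoder; all names are hypothetical.

```python
# Minimal sketch of a reasoning-first zero-shot retrieval pipeline.
# All function names are hypothetical; the toy embedding replaces a
# real video-text encoder for illustration only.
from collections import Counter
import math


def reason_about_edit(reference_caption: str, edit_text: str) -> str:
    """Stand-in for step (i): a large multimodal model would be prompted
    to spell out causal/temporal after-effects implied by the edit
    (e.g., "drop the glass" -> "glass shatters, liquid spills").
    Here we simply concatenate reference context and edit text."""
    return f"{reference_caption} {edit_text}"


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would encode videos
    # and reasoned queries in a shared vision-language space.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(reference_caption: str, edit_text: str,
             candidate_captions: list[str], k: int = 1) -> list[str]:
    """Step (ii): rank candidates by similarity to the reasoned query,
    with no task-specific finetuning."""
    query = embed(reason_about_edit(reference_caption, edit_text))
    ranked = sorted(candidate_captions,
                    key=lambda c: cosine(query, embed(c)),
                    reverse=True)
    return ranked[:k]
```

The key design point is that the query is expanded with inferred consequences before matching, so candidates that depict the after-effect (rather than merely sharing keywords with the edit) are ranked higher.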