🤖 AI Summary
Existing CoVR benchmarks inadequately test a model’s ability to perceive subtle, fast-paced temporal differences. This work introduces TF-CoVR, the first large-scale benchmark (180K triplets) for temporally fine-grained composed video retrieval, targeting high-speed dynamic domains such as gymnastics and diving. Queries pair a reference video with a text instruction, and each query maps to multiple valid target videos drawn from different source videos rather than a single segment of the same video. Methodologically, the work moves beyond single-video pairwise matching with TF-CoVR-Base, a two-stage framework that combines fine-grained action-classification pretraining for temporally discriminative embeddings with contrastive cross-modal alignment, and it benchmarks multimodal foundation models (e.g., LanguageBind, GME) in both zero-shot and fine-tuned regimes. Experiments show significant gains: zero-shot mAP@50 reaches 7.51 (+1.59), and fine-tuning sets a new SOTA of 25.82 (+5.99), the first systematic evaluation of mainstream vision and multimodal models on this challenging task.
📝 Abstract
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each query by prompting an LLM with the label differences between clips drawn from different videos; every query is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
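To make stage (ii) concrete, here is a minimal NumPy sketch of contrastive alignment between composed queries and candidate video embeddings. The additive `compose_query` fusion and the exact loss form are illustrative assumptions, not the paper's stated implementation; TF-CoVR-Base's actual fusion module and training details are in the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each embedding row to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def compose_query(video_emb, text_emb):
    """Hypothetical fusion of reference-video and modification-text
    embeddings (sum then normalize); the real model learns this fusion."""
    return l2_normalize(video_emb + text_emb)

def info_nce_loss(queries, targets, temperature=0.07):
    """InfoNCE over a batch: each composed query should score its paired
    target video highest among all targets in the batch."""
    q = l2_normalize(queries)
    t = l2_normalize(targets)
    logits = q @ t.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))   # diagonal = matched pairs

# Toy batch: 4 triplets with 16-dim embeddings (random stand-ins for
# the temporally discriminative features from stage (i))
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
text = rng.normal(size=(4, 16))
target = rng.normal(size=(4, 16))
loss = info_nce_loss(compose_query(video, text), target)
```

At retrieval time the same similarity matrix is ranked per query, which is also how mAP@50 would be computed against the multiple valid targets each query has in TF-CoVR.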