🤖 AI Summary
Existing video-text retrieval benchmarks suffer from limited video duration, low caption quality, and coarse-grained annotations, hindering research on fine-grained understanding and retrieval of long videos. To address this, we introduce LoVR—the first fine-grained retrieval benchmark specifically designed for long videos—comprising 467 videos with an average length exceeding one hour and 40,804 high-quality, human-verified temporal segments. We propose a novel vision-language model (VLM)-driven automatic annotation pipeline that integrates caption quality scoring, dynamic refinement, and semantically fused full-video caption generation to achieve scalable, high-precision fine-grained alignment. Extensive experiments show that mainstream multimodal embedding models degrade substantially on LoVR, confirming the benchmark's difficulty. The dataset, source code, and annotation toolkit are fully open-sourced to advance community research on long-video retrieval.
📝 Abstract
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates automatic VLM generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset at https://github.com/TechNomad-ds/LoVR-benchmark.
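The abstract's generate → score → refine annotation loop can be sketched as follows. This is a minimal illustration only: the functions `vlm_generate`, `vlm_score`, and `vlm_refine` are hypothetical stubs standing in for real VLM calls, and the threshold and iteration cap are assumed values, not the paper's actual settings.

```python
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.8   # assumed acceptance threshold for a caption
MAX_REFINEMENTS = 3       # assumed cap on dynamic refinement rounds


@dataclass
class Caption:
    text: str
    score: float


def vlm_generate(clip_id: str) -> str:
    """Stub: a VLM would caption the video clip here."""
    return f"initial caption for {clip_id}"


def vlm_score(caption: str) -> float:
    """Stub: a VLM judge would rate caption quality in [0, 1].

    Toy heuristic so the example runs end to end: refined captions score higher.
    """
    return min(1.0, 0.5 + 0.4 * caption.count("refined"))


def vlm_refine(caption: str) -> str:
    """Stub: a VLM would rewrite the caption using quality feedback."""
    return "refined " + caption


def annotate_clip(clip_id: str) -> Caption:
    """Generate a caption, then iteratively refine it until it passes scoring."""
    text = vlm_generate(clip_id)
    score = vlm_score(text)
    for _ in range(MAX_REFINEMENTS):
        if score >= QUALITY_THRESHOLD:
            break
        text = vlm_refine(text)
        score = vlm_score(text)
    return Caption(text, score)


# Usage: annotate one clip; with the toy scorer, one refinement round suffices.
cap = annotate_clip("clip_0001")
print(cap.score, cap.text)
```

Per-clip captions accepted by such a loop would then feed the semantic fusion step that assembles a coherent full-video caption; that stage is omitted here since the abstract does not detail its mechanics.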