🤖 AI Summary
Existing video-text retrieval benchmarks suffer from limited video duration, low caption quality, and coarse-grained annotations, hindering research on fine-grained understanding and retrieval of long videos. To address this, we introduce LoVR—the first fine-grained retrieval benchmark specifically designed for long videos—comprising 467 videos with an average length exceeding one hour and 40,804 high-quality, human-verified temporal segments. We propose a novel vision-language model (VLM)-driven automatic annotation pipeline that integrates caption quality scoring, dynamic refinement, and semantically fused full-video caption generation to achieve scalable, high-precision fine-grained alignment. Extensive experiments show that mainstream multimodal embedding models degrade substantially on LoVR, confirming the benchmark's difficulty. The dataset, source code, and annotation toolkit are fully open-sourced to advance community research on long-video retrieval.
📝 Abstract
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates automatic VLM generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset at https://github.com/TechNomad-ds/LoVR-benchmark.
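The abstract's generate → score → refine annotation loop can be sketched as follows. This is a minimal illustration only: the functions `vlm_generate`, `vlm_score`, and `vlm_refine` are hypothetical stubs standing in for real VLM calls, and the threshold and iteration cap are assumed values, not the paper's actual settings.

```python
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.8   # assumed acceptance threshold for a caption
MAX_REFINEMENTS = 3       # assumed cap on dynamic refinement rounds


@dataclass
class Caption:
    text: str
    score: float


def vlm_generate(clip_id: str) -> str:
    """Stub: a VLM would caption the video clip here."""
    return f"initial caption for {clip_id}"


def vlm_score(caption: str) -> float:
    """Stub: a VLM judge would rate caption quality in [0, 1].

    Toy heuristic so the example runs end to end: refined captions score higher.
    """
    return min(1.0, 0.5 + 0.4 * caption.count("refined"))


def vlm_refine(caption: str) -> str:
    """Stub: a VLM would rewrite the caption using quality feedback."""
    return "refined " + caption


def annotate_clip(clip_id: str) -> Caption:
    """Generate a caption, then iteratively refine it until it passes scoring."""
    text = vlm_generate(clip_id)
    score = vlm_score(text)
    for _ in range(MAX_REFINEMENTS):
        if score >= QUALITY_THRESHOLD:
            break
        text = vlm_refine(text)
        score = vlm_score(text)
    return Caption(text, score)


# Usage: annotate one clip; with the toy scorer, one refinement round suffices.
cap = annotate_clip("clip_0001")
print(cap.score, cap.text)
```

Per-clip captions accepted by such a loop would then feed the semantic fusion step that assembles a coherent full-video caption; that stage is omitted here since the abstract does not detail its mechanics.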