LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-text retrieval benchmarks suffer from limited video duration, low caption quality, and coarse-grained annotations, hindering research on fine-grained understanding and retrieval of long videos. To address this, we introduce LoVR, the first fine-grained retrieval benchmark specifically designed for long videos, comprising 467 videos with an average length exceeding one hour and 40,804 high-quality, human-verified temporal segments. We propose a novel vision-language model (VLM)-driven automatic annotation pipeline that integrates caption quality scoring, dynamic optimization, and semantically fused full-video caption generation to achieve scalable, high-precision fine-grained alignment. Extensive experiments show substantial performance degradation of mainstream multimodal embedding models on LoVR, confirming that it poses a strong challenge. The dataset, source code, and annotation toolkit are fully open-sourced to foster community advancement in long-video retrieval research.
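The summary describes the annotation pipeline only at a high level. As a rough illustration, the sketch below shows one plausible quality-scoring and refinement loop: caption a segment, score the caption, and regenerate with feedback until a threshold is met. The `generate` and `score` callables, the 0.8 threshold, and the three-round budget are all assumptions for illustration, not the paper's published interface.

```python
# A minimal sketch of a VLM-driven caption quality-scoring and refinement
# loop, assuming hypothetical `generate` and `score` callables; the 0.8
# threshold and three-round budget are illustrative, not from the paper.
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Clip:
    video_id: str
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds

def annotate_clip(
    clip: Clip,
    generate: Callable[[Clip, Optional[str]], str],  # VLM captioner; takes optional feedback
    score: Callable[[Clip, str], float],             # VLM quality scorer, value in [0, 1]
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Caption a clip, score the caption, and regenerate with feedback
    until it clears the threshold or the refinement budget runs out."""
    feedback: Optional[str] = None
    best_caption, best_score = "", -1.0
    for _ in range(max_rounds):
        caption = generate(clip, feedback)
        s = score(clip, caption)
        if s > best_score:
            best_caption, best_score = caption, s
        if s >= threshold:
            break
        feedback = f"Previous caption scored {s:.2f}; add missing visual details."
    return best_caption, best_score
```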

📝 Abstract
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates automatic VLM generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark offers longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. The code and dataset are available at https://github.com/TechNomad-ds/LoVR-benchmark
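The abstract also describes evaluating embedding models on clip-caption retrieval. As a hypothetical illustration of such an evaluation, the sketch below ranks clips by cosine similarity to each caption embedding and reports Recall@K; it is not the benchmark's actual evaluation code, and the pairing convention (row i of each array as a ground-truth pair) is an assumption.

```python
# A hypothetical Recall@K evaluation for text-to-clip retrieval: row i of
# `query_emb` and `clip_emb` form a ground-truth caption-clip pair. This
# is an illustrative sketch, not the benchmark's released evaluation code.
import numpy as np

def recall_at_k(query_emb: np.ndarray, clip_emb: np.ndarray, k: int) -> float:
    """query_emb, clip_emb: (N, D) arrays of caption and clip embeddings."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    sims = q @ c.T                            # (N, N) cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar clips
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())
```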
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack long videos and high-quality captions
Coarse annotation granularity hinders fine-grained video-text retrieval evaluation
Poor machine-generated annotations limit accurate video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient caption generation pipeline combining VLM generation, quality scoring, and dynamic refinement
Semantic fusion method for coherent full-video captions (see the sketch after this list)
Large-scale dataset with fine-grained clips and high-quality, human-verified captions
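As referenced above, here is a minimal sketch of what a semantic fusion step could look like: neighboring clip captions are merged bottom-up until a single full-video caption remains, so long videos never exceed the summarizer's input limit. The `summarize` callable and the group size of 8 are assumptions for illustration; the paper's actual fusion procedure may differ.

```python
# A minimal sketch of bottom-up semantic fusion: neighboring clip captions
# are merged in groups until one full-video caption remains. `summarize`
# is a hypothetical LLM/VLM call, and the group size of 8 is an assumed
# value; the paper's actual fusion procedure may differ.
from typing import Callable, List

def fuse_captions(
    captions: List[str],
    summarize: Callable[[List[str]], str],  # merges a small group of captions into one
    group_size: int = 8,
) -> str:
    """Recursively summarize fixed-size groups of adjacent captions,
    preserving temporal order, until a single caption remains."""
    if not captions:
        return ""
    while len(captions) > 1:
        captions = [
            summarize(captions[i : i + group_size])
            for i in range(0, len(captions), group_size)
        ]
    return captions[0]
```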
👥 Authors
Qifeng Cai
East China Normal University
Hao Liang
Peking University
Hejun Dong
Beihang University
Meiyi Qiang
Beijing Institute of Technology
Ruichuan An
Xi'an Jiaotong University | Peking University
VLM, Data-Centric AI
Zhaoyang Han
Nanjing Forestry University
Zhengzhou Zhu
Peking University
Bin Cui
Peking University
Wentao Zhang
Peking University