HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

📅 2025-01-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models face critical bottlenecks in hour-long video understanding, namely limited long-term analytical capability, inefficient large-model inference, and the absence of large-scale benchmarks. To address this, we introduce HLV-1K, a thousand-scale benchmark for hour-long video understanding comprising 1,009 videos, each exceeding one hour, and 14,847 high-quality time-aware QA and MCQA pairs spanning frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. Each question is aligned to explicit timestamps, enabling evaluation of mainstream multimodal LLMs at multiple temporal granularities. Comprehensive experiments expose the limitations of state-of-the-art models on deep long video understanding and promote fine-grained research in real-world applications such as live-stream analysis, meeting-recording understanding, and movie understanding.
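
To make the benchmark's structure concrete, the sketch below shows what a single time-aware QA/MCQA instance might look like. All field names, level labels, and example values are illustrative assumptions for this summary, not HLV-1K's published schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one time-aware benchmark instance; the field
# names and level labels are illustrative, not HLV-1K's actual format.
@dataclass
class TimeAwareQA:
    video_id: str              # source hour-long video
    level: str                 # "frame", "within_event", "cross_event", or "long_term"
    question: str              # time-aware query anchored to a timestamp span
    start_sec: float           # start of the referenced time span
    end_sec: float             # end of the referenced time span
    answer: str                # free-form answer (QA) or correct option letter (MCQA)
    options: list[str] = field(default_factory=list)  # non-empty for MCQA items

example = TimeAwareQA(
    video_id="hlv_0001",
    level="cross_event",
    question="Between 00:12:30 and 00:47:10, what happens after the speaker "
             "finishes the product demo?",
    start_sec=750.0,
    end_sec=2830.0,
    answer="B",
    options=["A. A Q&A session", "B. A pricing announcement",
             "C. A break", "D. A second demo"],
)
```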

📝 Abstract
Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) the lack of large-scale benchmark datasets. Among these challenges, in this paper we focus on building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1,009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and across various tasks, promoting future long video understanding at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.
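
As a rough sketch of how the zero-shot evaluation described above could be wired up, the loop below scores a model on timestamp-aligned questions. Here `sample_frames` and `model_answer` are hypothetical stand-ins for a frame sampler and a multimodal LLM call, and instances are assumed to be shaped like the TimeAwareQA sketch earlier; none of this is the paper's actual evaluation code.

```python
# Hypothetical zero-shot evaluation loop; `sample_frames` and `model_answer`
# are stand-ins for a real frame sampler and multimodal LLM, and `instances`
# holds objects shaped like the TimeAwareQA sketch above.
def evaluate(model_answer, sample_frames, instances):
    correct = 0
    for inst in instances:
        # Limit sampled frames to the question's timestamp span so the model
        # is probed on the intended segment of the hour-long video.
        frames = sample_frames(inst.video_id, inst.start_sec, inst.end_sec)
        pred = model_answer(frames, inst.question, inst.options)
        # Exact-match scoring; MCQA answers are option letters, so normalize case.
        correct += int(pred.strip().upper() == inst.answer.strip().upper())
    return correct / len(instances)
```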
Problem

Research questions and friction points this paper is trying to address.

Large-scale Multimodal Language Models
Long Video Analysis
Efficiency and Dataset Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

HLV-1K Dataset
Long Video Understanding
Multimodal Language Models