LVSum: A Benchmark for Timestamp-Aware Long Video Summarization

📅 2026-04-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language models struggle to stay temporally consistent over extended durations, and in long video summarization they often produce summaries that lack precise semantic and temporal alignment. To address this, this work introduces LVSum, a new benchmark of diverse long videos spanning 13 domains, each paired with human-annotated summaries featuring exact timestamps, providing the first fine-grained temporally aligned annotations for long-form video. The work proposes novel evaluation metrics grounded in large language models to assess content relevance and modality coherence, and systematically evaluates both open- and closed-source multimodal large models. The analysis reveals systematic deficiencies in current models' temporal reasoning and offers a foundation for time-aware long video summarization.

📝 Abstract
Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.
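
To make the idea of a summary with precise temporal references concrete, here is a minimal sketch of how timestamped summary sentences might be parsed and compared. It rests on assumptions not taken from the paper: the `[MM:SS-MM:SS]` span format, the function names `parse_spans` and `temporal_iou`, and the example sentences are all hypothetical. Temporal IoU is a common overlap measure for time spans and is used here only for illustration. It is not one of the LLM-based metrics that LVSum introduces.

```python
import re

# Assumed (hypothetical) format: each summary sentence carries "[MM:SS-MM:SS]".
SPAN = re.compile(r"\[(\d+):(\d{2})-(\d+):(\d{2})\]")


def parse_spans(summary: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs in seconds for every timestamp span found."""
    return [
        (int(m1) * 60 + int(s1), int(m2) * 60 + int(s2))
        for m1, s1, m2, s2 in SPAN.findall(summary)
    ]


def temporal_iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection-over-union of two time spans given in seconds."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Invented example data, not drawn from the benchmark.
reference = (
    "The host introduces the recipe [00:00-01:30]. "
    "The dough is kneaded [01:30-04:00]."
)
predicted = (
    "An introduction to the recipe [00:00-01:00]. "
    "Kneading the dough [02:00-04:00]."
)

ref_spans = parse_spans(reference)
pred_spans = parse_spans(predicted)

# Pairing sentences by position is a simplifying assumption of this sketch.
scores = [temporal_iou(r, p) for r, p in zip(ref_spans, pred_spans)]
print(scores)
```

A score near 1 means the predicted span closely matches the reference span, and a score near 0 means the two barely overlap. A real evaluation would also need a way to match predicted sentences to reference sentences rather than pairing them by order.
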
Problem

Research questions and friction points this paper is trying to address.

long video summarization
temporal fidelity
multimodal large language models
temporal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

long video summarization
timestamp-aware
temporal alignment
multimodal large language models
human-annotated benchmark