LVSum: A Benchmark for Timestamp-Aware Long Video Summarization

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models struggle to maintain temporal consistency over long durations, and in long video summarization they often produce summaries that lack precise semantic and temporal alignment. To address this, the work introduces LVSum, a benchmark of diverse long videos spanning 13 domains, each paired with human-annotated summaries that carry precise timestamps, giving fine-grained temporal alignment for long-form video. The authors introduce LLM-based evaluation metrics for content relevance and modality coherence, and use them alongside standard metrics to evaluate both open-source and proprietary multimodal large language models. The analysis reveals systematic gaps in current models' temporal understanding and offers a foundation for advancing temporal reasoning in long video summarization.

📝 Abstract

Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.

Problem

Research questions and friction points this paper is trying to address.

long video summarization

temporal fidelity

multimodal large language models

temporal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

long video summarization

timestamp-aware

temporal alignment

multimodal large language models