QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing evaluation methods for video-to-text summarization: they rely on human-written reference summaries, which limits both their practicality and their sensitivity to fine-grained semantic detail. The paper proposes QEVA, a reference-free evaluation metric driven by multimodal question answering, presented as the first QA-based approach to reference-free video summary assessment. QEVA compares candidate summaries directly against source videos along three interpretable dimensions: coverage, factuality, and chronology. The paper also introduces MLVU(VS)-Eval, an annotated benchmark derived from the MLVU dataset, and uses multimodal large language models for fine-grained video-text alignment analysis. Experimental results show that QEVA correlates more strongly with human judgments than existing approaches, as measured by Kendall's τ_b, Kendall's τ_c, and Spearman's ρ.

📝 Abstract
Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $τ_b$, $τ_c$, and Spearman's $ρ$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
Problem

Research questions and friction points this paper is trying to address.

video-to-text summarization
evaluation metric
reference-free
multimodal question answering
summary quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-free evaluation
multimodal question answering
video summarization
factuality assessment
evaluation benchmark
🔎 Similar Papers
2024-05-22 · Annual Meeting of the Association for Computational Linguistics · Citations: 2