ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning

📅 2026-01-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing evaluation metrics struggle to consistently assess the informational coverage of cross-modal video summariesโ€”such as those combining keyframes and text. This work proposes ViSIL, a novel scoring framework that, for the first time, quantifies the degree of information preserved from a source video to its summary through the lens of information loss, leveraging both vision-language models and an information-theoretic foundation. ViSIL enables direct comparison across heterogeneous, cross-modal summarization outputs. It demonstrates strong correlation with human judgments and with vision-language model performance on video question answering (VQA). Notably, without increasing computational overhead, multimodal summaries selected by ViSIL improve VQA accuracy by 7% over text-only summaries and reveal a Pareto-optimal trade-off between information loss and processing efficiency.

๐Ÿ“ Abstract
Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
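The abstract's summary-selection step amounts to a dominance filter over candidate summaries scored on two axes, information loss and processing time, keeping only the Pareto-optimal ones. The sketch below is not the authors' implementation: the candidate formats and their (loss, time) scores are invented purely to illustrate the filtering idea.

```python
# Hypothetical sketch of Pareto-frontier summary selection.
# A candidate is kept unless another candidate is at least as good
# on both axes (lower loss, lower time) and strictly better on one.

def pareto_front(candidates):
    """Return candidates not dominated on (loss, time); lower is better."""
    front = []
    for name, loss, time in candidates:
        dominated = any(
            l2 <= loss and t2 <= time and (l2 < loss or t2 < time)
            for _, l2, t2 in candidates
        )
        if not dominated:
            front.append((name, loss, time))
    return front

# Invented example: four summary formats with made-up scores
# (information loss in [0, 1], processing time in arbitrary units).
candidates = [
    ("text-only",      0.45, 1.0),
    ("keyframes-only", 0.40, 1.5),
    ("text+keyframes", 0.25, 1.3),
    ("dense-frames",   0.24, 5.0),
]
print(pareto_front(candidates))
```

Under these invented scores, "keyframes-only" is dropped because "text+keyframes" beats it on both axes, while the other three formats remain on the frontier, each representing a different loss/speed trade-off.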
Problem

Research questions and friction points this paper is trying to address.

multimodal video captioning
information loss
evaluation metric
vision-language model
video summarization
Innovation

Methods, ideas, or system contributions that make the work stand out.

ViSIL
multimodal video captioning
information loss
vision-language model
summary evaluation
Po-han Li
The University of Texas at Austin, Texas, USA
Shenghui Chen
University of Texas at Austin
Game Theory · Human-Agent Interaction
U. Topcu
The University of Texas at Austin, Texas, USA
Sandeep P. Chinchali
The University of Texas at Austin, Texas, USA