An Experimental Study on Generating Plausible Textual Explanations for Video Summarization

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of generating *plausible and faithful* textual explanations for video summarization. Observing that existing methods conflate plausibility (human-perceived reasonableness) and faithfulness (alignment with the model's actual reasoning over the visual content), the authors extend a multi-granularity explanation framework by integrating the large vision-language model LLaVA-OneVision to produce natural language descriptions of visual explanations. They further propose a plausibility evaluation approach based on semantic overlap, using SBERT and SimCSE sentence embeddings to quantify the similarity between explanation descriptions and summary descriptions. Experiments with CA-SUM on SumMe and TVSum show that the more faithful explanations are not necessarily the more plausible ones, and the study identifies the most suitable configuration for generating plausible textual explanations. This work advances explainable video summarization by separating these two quality dimensions and providing practical tools for evaluating them.

📝 Abstract
In this paper, we present our experimental study on generating plausible textual explanations for the outcomes of video summarization. For the needs of this study, we extend an existing framework for multigranular explanation of video summarization by integrating a SOTA Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the obtained visual explanations. Next, we focus on one of the most desired characteristics of explainable AI, the plausibility of the obtained explanations, which relates to their alignment with human reasoning and expectations. Using the extended framework, we propose an approach for evaluating the plausibility of visual explanations by quantifying the semantic overlap between their textual descriptions and the textual descriptions of the corresponding video summaries, with the help of two methods for creating sentence embeddings (SBERT, SimCSE). Based on the extended framework and the proposed plausibility evaluation approach, we conduct an experimental study using a SOTA method (CA-SUM) and two datasets (SumMe, TVSum) for video summarization, to examine whether the more faithful explanations are also the more plausible ones, and to identify the most appropriate approach for generating plausible textual explanations for video summarization.
Problem

Research questions and friction points this paper is trying to address.

Generating plausible textual explanations for video summarization outcomes
Evaluating plausibility through semantic overlap with video summaries
Identifying optimal approaches for faithful and plausible explanations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrating LLaVA-OneVision model for text generation
Evaluating plausibility via semantic overlap of embeddings
Using SBERT and SimCSE for sentence embedding comparison
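The plausibility metric sketched in the bullets above boils down to comparing two sentence embeddings, one for the textual description of a visual explanation and one for the description of the corresponding summary, via their semantic overlap. A minimal sketch of that comparison is below; the toy vectors stand in for real SBERT or SimCSE embeddings (which the paper obtains from those models; producing them here would require the actual encoders), and the function names are illustrative, not the paper's API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def plausibility_score(explanation_emb: np.ndarray, summary_emb: np.ndarray) -> float:
    # Semantic overlap between the description of a visual explanation
    # and the description of the corresponding video summary: higher
    # overlap is taken as higher plausibility.
    return cosine_similarity(explanation_emb, summary_emb)

# Toy 4-d vectors standing in for real SBERT/SimCSE sentence embeddings.
explanation_emb = np.array([0.2, 0.1, 0.9, 0.4])
summary_emb = np.array([0.3, 0.0, 0.8, 0.5])
print(round(plausibility_score(explanation_emb, summary_emb), 3))
```

In practice the embeddings would come from a sentence encoder (e.g. a pretrained SBERT model), and scores could be averaged over all explanation-summary pairs in a dataset to compare explanation-generation configurations.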