🤖 AI Summary
This study addresses two gaps in evaluating educational-entertainment short videos: the misalignment between surface-level metrics and human judgment, and the lack of causal interpretability. We propose the first interpretable multimodal assessment framework for this setting. Methodologically, we leverage vision-language models (e.g., CLIP) to extract audiovisual features without manual supervision, apply interpretable clustering to derive semantic dimensions, and train a lightweight regression model to predict real-user engagement. Crucially, we jointly model VLM feature importance and human engagement signals, shifting evaluation from static quality assessment toward causal, propagation-driven assessment. On our curated, annotated YouTube Shorts dataset, the framework achieves a correlation coefficient of 0.87 with ground-truth engagement, substantially outperforming conventional metrics such as SSIM and FID, while maintaining high accuracy, strong interpretability, and linear computational cost.
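As a concrete illustration of this pipeline, the sketch below makes several assumptions not specified in the summary: a CLIP ViT-B/32 backbone from Hugging Face `transformers` as the VLM, k-means as the interpretable clustering step, and ridge regression as the lightweight engagement predictor. The names `all_videos` and `engagement_scores` are hypothetical placeholders for the curated dataset, not identifiers from the paper.

```python
# Minimal sketch of the described pipeline, not the authors' released code.
# Assumptions: CLIP ViT-B/32 backbone, k-means clustering, ridge regression;
# frame sampling and the engagement target are placeholders.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def video_embedding(frames: list[Image.Image]) -> np.ndarray:
    """Mean-pool CLIP image embeddings over sampled frames -> one vector per video."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)        # (n_frames, 512)
    feats = feats / feats.norm(dim=-1, keepdim=True)      # unit-normalize per frame
    return feats.mean(dim=0).numpy()

# X: one pooled VLM embedding per video; y: observed engagement signal.
# `all_videos` and `engagement_scores` are hypothetical stand-ins for the dataset.
X = np.stack([video_embedding(frames) for frames in all_videos])
y = np.array(engagement_scores)

# Cluster embeddings into k semantic factors; each video's distances to the
# centroids act as an interpretable factor profile.
kmeans = KMeans(n_clusters=8, random_state=0, n_init=10).fit(X)
factor_features = kmeans.transform(X)                     # (n_videos, 8)

# Lightweight regressor from factor profile to engagement.
regressor = Ridge(alpha=1.0).fit(factor_features, y)
predicted = regressor.predict(factor_features)
```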
📝 Abstract
Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks such as VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract audiovisual features without manual supervision, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator yields more interpretable and scalable assessments than traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.
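The correlation-based evaluation mentioned above could be computed as in the sketch below, assuming per-video arrays of predicted and observed engagement; the abstract does not state whether the reported correlation is Pearson or Spearman, so both are shown, and the example arrays are dummies rather than dataset values.

```python
# Sketch of the correlation check between predicted and observed engagement.
# Which correlation coefficient the paper reports is not stated, so both are computed.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def engagement_correlation(predicted: np.ndarray, actual: np.ndarray) -> dict:
    """Return Pearson and Spearman correlations between predicted and true engagement."""
    pearson_r, _ = pearsonr(predicted, actual)
    spearman_rho, _ = spearmanr(predicted, actual)
    return {"pearson": pearson_r, "spearman": spearman_rho}

# Usage with synthetic stand-in values (real scores come from the held-out split).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    actual = rng.random(100)
    predicted = actual + 0.1 * rng.standard_normal(100)
    print(engagement_correlation(predicted, actual))
```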