AI Summary
Video summarization faces dual challenges: poor generalizability and difficulty in modeling user intent. Existing methods either rely on domain-specific training data or lack support for natural language queries. This paper introduces the first zero-shot, text-queryable video summarization framework. It first segments videos and generates scene descriptions using a video-language model (VidLM); it then employs a large language model (LLM) to assess scene-level importance; finally, it propagates importance scores to frames via a novel dual-criterion mechanism enforcing temporal consistency and content uniqueness. Key contributions include: (1) an LLM-driven scene importance evaluation mechanism; (2) a new score propagation paradigm grounded in both temporal coherence and semantic distinctness; and (3) VidSum-Reason, the first query-driven benchmark supporting long-tail concept recognition and multi-step reasoning. Experiments demonstrate state-of-the-art unsupervised performance on SumMe and TVSum, show results competitive with supervised methods on QFVS, and establish a strong baseline on VidSum-Reason.
Abstract
The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on training datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without any training data, beating all unsupervised methods and matching supervised ones. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally (iv) propagates those scores to the short-segment level via two new metrics, consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data while the competing methods require supervised frame-level importance annotations. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
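The score-propagation step described above can be sketched in code. The abstract does not give the exact formulas for consistency and uniqueness, so the sketch below is a hedged illustration: it assumes frame features are embedding vectors, measures consistency as cosine similarity of a frame to its scene centroid, and uniqueness as normalized distance from the global mean feature. The function name `propagate_scores` and the weighting parameter `alpha` are hypothetical, not from the paper.

```python
import numpy as np

def propagate_scores(scene_scores, scene_frames, frame_feats, alpha=0.5):
    """Illustrative scene-to-frame score propagation (formulas are assumptions).

    scene_scores: list of LLM-assigned importance scores, one per scene.
    scene_frames: list of frame-index lists, one per scene.
    frame_feats:  (num_frames, dim) array of per-frame embeddings.
    alpha:        hypothetical weight between consistency and uniqueness.
    """
    frame_scores = np.zeros(len(frame_feats))
    global_mean = frame_feats.mean(axis=0)
    for score, idxs in zip(scene_scores, scene_frames):
        feats = frame_feats[idxs]
        centroid = feats.mean(axis=0)
        # Consistency proxy: cosine similarity of each frame to its scene centroid
        # (temporal coherency within the scene).
        cons = feats @ centroid / (
            np.linalg.norm(feats, axis=1) * np.linalg.norm(centroid) + 1e-8
        )
        # Uniqueness proxy: distance from the global mean feature (novelty),
        # normalized into [0, 1].
        uniq = np.linalg.norm(feats - global_mean, axis=1)
        uniq = uniq / (uniq.max() + 1e-8)
        # Scene score modulated by the two criteria yields frame importance.
        frame_scores[idxs] = score * (alpha * cons + (1 - alpha) * uniq)
    return frame_scores
```

Frames in highly ranked scenes inherit high scores, but within each scene the two criteria redistribute importance toward frames that are both representative of their scene and distinctive in the full video.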