AI Summary
Video summarization faces dual challenges: poor generalizability and difficulty in modeling user intent. Existing methods either rely on domain-specific training data or lack support for natural language queries. This paper introduces the first zero-shot, text-queryable video summarization framework. It first segments videos and generates scene descriptions using a video-language model (VidLM); it then employs a large language model (LLM) to assess scene-level importance; finally, it propagates importance scores to frames via a novel dual-criterion mechanism enforcing temporal consistency and content uniqueness. Key contributions include: (1) an LLM-driven scene importance evaluation mechanism; (2) a new score propagation paradigm grounded in both temporal coherence and semantic distinctness; and (3) VidSum-Reason, the first query-driven benchmark supporting long-tail concept recognition and multi-step reasoning. Experiments demonstrate state-of-the-art unsupervised performance on SumMe and TVSum, show results competitive with supervised methods on QFVS, and establish a strong baseline on VidSum-Reason.
Abstract
The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on training datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without any training data, beating all unsupervised methods and matching supervised ones. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally (iv) propagates those scores to the short-segment level via two new metrics, consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data while the competing methods require supervised frame-level importance annotations. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
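The score-propagation step described above can be sketched in code. The abstract does not give the exact formulas for consistency and uniqueness, so the sketch below is a hedged illustration: it assumes frame features are embedding vectors, measures consistency as cosine similarity of a frame to its scene centroid, and uniqueness as normalized distance from the global mean feature. The function name `propagate_scores` and the weighting parameter `alpha` are hypothetical, not from the paper.

```python
import numpy as np

def propagate_scores(scene_scores, scene_frames, frame_feats, alpha=0.5):
    """Illustrative scene-to-frame score propagation (formulas are assumptions).

    scene_scores: list of LLM-assigned importance scores, one per scene.
    scene_frames: list of frame-index lists, one per scene.
    frame_feats:  (num_frames, dim) array of per-frame embeddings.
    alpha:        hypothetical weight between consistency and uniqueness.
    """
    frame_scores = np.zeros(len(frame_feats))
    global_mean = frame_feats.mean(axis=0)
    for score, idxs in zip(scene_scores, scene_frames):
        feats = frame_feats[idxs]
        centroid = feats.mean(axis=0)
        # Consistency proxy: cosine similarity of each frame to its scene centroid
        # (temporal coherency within the scene).
        cons = feats @ centroid / (
            np.linalg.norm(feats, axis=1) * np.linalg.norm(centroid) + 1e-8
        )
        # Uniqueness proxy: distance from the global mean feature (novelty),
        # normalized into [0, 1].
        uniq = np.linalg.norm(feats - global_mean, axis=1)
        uniq = uniq / (uniq.max() + 1e-8)
        # Scene score modulated by the two criteria yields frame importance.
        frame_scores[idxs] = score * (alpha * cons + (1 - alpha) * uniq)
    return frame_scores
```

Frames in highly ranked scenes inherit high scores, but within each scene the two criteria redistribute importance toward frames that are both representative of their scene and distinctive in the full video.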