🤖 AI Summary
Video summarization must accommodate the subjective diversity of human annotators, yet existing methods regress only to the mean of multiple human scores, failing to capture individual preference variations. This work introduces diffusion models to video summarization for the first time, reformulating summary generation as a conditional generative task: given the visual context, the model learns the distribution of high-quality summaries and can sample multiple diverse summaries aligned with human preferences. The proposed method, SummDiff, dynamically adapts to visual context and generates candidate summaries conditioned on the input video. In addition, novel metrics derived from an analysis of the knapsack step, an important but previously overlooked final stage of summary generation, provide a deeper evaluation. Extensive experiments on multiple benchmarks demonstrate state-of-the-art performance and closer alignment with individual annotator preferences.
📝 Abstract
Video summarization is the task of shortening a video by selecting a subset of its frames while preserving its essential moments. Despite the task's innate subjectivity, previous works have deterministically regressed to a frame score averaged over multiple raters, ignoring that what constitutes a good summary varies from viewer to viewer. We propose a novel problem formulation that frames video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide deeper insight through novel metrics derived from an analysis of the knapsack step, an important final stage of summary generation that has been overlooked in evaluation.
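The knapsack step mentioned above is the standard final stage in many video summarization pipelines: given per-shot importance scores and shot lengths, select the subset of shots that maximizes total importance while keeping the summary under a length budget. The sketch below is a minimal, illustrative 0/1 knapsack implementation of that selection; the shot scores, lengths, and budget are hypothetical values, not taken from the paper.

```python
def knapsack_select(scores, lengths, budget):
    """0/1 knapsack via dynamic programming.

    scores[i]  -- importance score of shot i (float)
    lengths[i] -- length of shot i in frames (int)
    budget     -- maximum total frames allowed in the summary
    Returns the sorted list of selected shot indices.
    """
    n = len(scores)
    # dp[c] = best total score achievable with capacity c
    dp = [0.0] * (budget + 1)
    # keep[i][c] records whether shot i improved dp[c] when processed
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        # iterate capacities downward so each shot is used at most once
        for c in range(budget, lengths[i] - 1, -1):
            cand = dp[c - lengths[i]] + scores[i]
            if cand > dp[c]:
                dp[c] = cand
                keep[i][c] = True
    # backtrack to recover the chosen shots
    selected, c = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            selected.append(i)
            c -= lengths[i]
    return sorted(selected)

# Illustrative example: four shots, budget of 60 frames.
shot_scores = [0.9, 0.2, 0.7, 0.4]   # hypothetical importance scores
shot_lengths = [30, 45, 25, 20]      # hypothetical shot lengths (frames)
print(knapsack_select(shot_scores, shot_lengths, 60))  # -> [0, 2]
```

With a 60-frame budget, shots 0 and 2 (total length 55, combined score 1.6) beat any other feasible subset, so they form the summary. In common benchmark protocols the budget is a fixed fraction of the video duration, which is why the authors argue this combinatorial step deserves attention in evaluation.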