AI Summary
This work proposes PRISM, a three-stage framework for generating semantically accurate, concise, and contextually coherent summaries of procedural and instructional videos. By integrating adaptive visual sampling, label-driven keyframe anchoring, and contextual verification via a large language model (LLM), PRISM fuses multimodal semantics with procedural structure while suppressing irrelevant or hallucinated content. The method retains 84% of the semantic content while sampling fewer than 5% of the original video frames, outperforms existing baselines by up to 33% in summary fidelity, and delivers consistently high-quality summaries across multiple instructional and procedural video datasets.
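As a rough illustration of the three stages named above, a minimal Python sketch follows. Everything here is an assumption for exposition: the function names (`adaptive_sample`, `anchor_keyframes`, `llm_validate`), the uniform-stride stand-in for the adaptive sampling policy, and the caption-then-judge loop are hypothetical, not the authors' released implementation.

```python
# Hedged sketch of a PRISM-style three-stage pipeline.
# All names and policies below are illustrative assumptions.
from typing import Callable, List, Sequence


def adaptive_sample(frames: Sequence, rate: float = 0.05) -> List:
    """Stage 1: keep a small subset of frames; a uniform stride stands in
    for the paper's adaptive visual sampling policy."""
    stride = max(1, round(1 / rate))
    return list(frames[::stride])


def anchor_keyframes(frames: Sequence, labels: Sequence[str]) -> List:
    """Stage 2: keep frames where the predicted label changes, a stand-in
    for label-driven anchoring at procedural transitions."""
    anchored, prev = [], None
    for frame, label in zip(frames, labels):
        if label != prev:
            anchored.append(frame)
            prev = label
    return anchored


def llm_validate(frames: Sequence, caption: Callable, is_step: Callable) -> List:
    """Stage 3: caption each candidate frame and keep it only if an LLM
    judge deems the caption a distinct, non-generic procedural step,
    filtering out generic or hallucinated content."""
    return [f for f in frames if is_step(caption(f))]
```

In this reading, `caption` would be a vision-language captioner and `is_step` an LLM-backed yes/no judgment; both are left abstract since the paper does not pin them down here.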
Abstract
Video summarization turns long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains such as surgical training. Prior work has progressed from basic visual features such as color, motion, and structural changes to pre-trained vision-language models that better capture semantics and temporal flow, yielding more context-aware summarization. We propose PRISM (Procedural Representation via Integrated Semantic and Multimodal analysis), a three-stage framework that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate PRISM on instructional and activity datasets, using reference summaries for the instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% of the semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance in both semantic alignment and precision.
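One way to make the retention-versus-sampling claim concrete is a simple embedding-based check. The sketch below is an assumption about how such a number could be computed: the `semantic_retention` name, the cosine-similarity-over-mean-embeddings metric, and the synthetic data are all hypothetical, not the paper's evaluation protocol.

```python
# Hedged sketch: measuring semantic retention of a frame-sampled summary.
# The metric (cosine similarity of mean embeddings) is an illustrative
# assumption; the paper's actual protocol may differ.
import numpy as np


def semantic_retention(full: np.ndarray, summary: np.ndarray) -> float:
    """Cosine similarity between the mean embedding of all frames and
    the mean embedding of the summary frames (one row per frame)."""
    a, b = full.mean(axis=0), summary.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic demo of the API shape only: real values depend on the
# embedding model (e.g., a CLIP-style encoder) and the dataset.
rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 512))   # one embedding per frame
summary = full[::25]                  # 4% of frames, i.e. < 5% sampling
print(f"retention: {semantic_retention(full, summary):.2f}")
```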