What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

📅 2025-02-12
🤖 AI Summary
This work addresses the multimodal challenge of automatically generating high-quality textual summaries from scientific presentation videos. To this end, we introduce VISTA, the first large-scale video-summary paired dataset for scientific scenarios, comprising 18,599 AI conference talk videos aligned with their corresponding paper abstracts. We propose an end-to-end summarization framework grounded in explicit planning, which integrates multimodal video understanding, large language model-based generation, and plan-guided decoding to enhance structural coherence and factual consistency. Evaluation combines human assessment with automated metrics (BLEU, ROUGE, and FACTSCORE). Results demonstrate that our planning mechanism improves factual accuracy by 12.3% over strong baselines. However, a substantial performance gap remains relative to human-written abstracts, highlighting fundamental challenges in deep scientific content comprehension and faithful verbalization.

📝 Abstract
Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of scientific video summarization.
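
To make the idea of plan-guided summarization concrete, here is a minimal sketch and not the paper's implementation. It assumes that a plan can be represented as an ordered list of questions and that the video has already been reduced to a transcript. The names `TalkRecord`, `DEFAULT_PLAN`, and `build_plan_prompt` are hypothetical and do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class TalkRecord:
    """Hypothetical video-summary pair: a talk transcript and its paper abstract."""
    transcript: str
    abstract: str


# Assumed plan format: an ordered list of questions the summary should answer.
DEFAULT_PLAN = [
    "What problem does the talk address?",
    "What method or resource is introduced?",
    "What are the main findings?",
]


def build_plan_prompt(record: TalkRecord, plan: list[str]) -> str:
    """Build a prompt asking a model to write a summary that follows the plan in order."""
    steps = "\n".join(f"{i}. {q}" for i, q in enumerate(plan, start=1))
    return (
        "Summarize the talk below as a paper-style abstract.\n"
        "Answer these questions in order, one or two sentences each:\n"
        f"{steps}\n\n"
        f"Talk transcript:\n{record.transcript}"
    )
```

The reference abstract is deliberately kept out of the prompt, so it can serve as the target when comparing a generated summary against the human-written one.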

Problem

Research questions and friction points this paper is trying to address.

Video-to-text summarization
Scientific presentations
Multimodal learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-to-text summarization dataset
Plan-based framework application
Explicit planning enhances summary quality and factual consistency