Paper2Video: Automatic Video Generation from Scientific Papers

📅 2025-10-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key bottlenecks in automated academic video generation: labor-intensive production, difficult multimodal information fusion, and complex cross-modal alignment across slides, subtitles, speech, and a talking-head presenter. We introduce Paper2Video, the first benchmark for academic presentation video generation, together with four evaluation metrics: Meta Similarity, PresentArena, PresentQuiz, and IP Memory. We also propose PaperTalker, a multi-agent framework that integrates slide generation with tree-search-based layout refinement, cursor grounding, subtitle synchronization, text-to-speech (TTS), and talking-head rendering. Experiments on the benchmark's 101 papers indicate that the generated videos achieve higher information completeness and expressive accuracy than baselines while reducing production cost by orders of magnitude, advancing the practical deployment of automated scientific communication.

📝 Abstract

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation, with layout refinement via a novel tree search visual choice, with cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.

Problem

Research questions and friction points this paper is trying to address.

Automating labor-intensive academic presentation video creation from research papers

Handling dense multi-modal content including text, figures, and tables effectively

Coordinating aligned channels like slides, subtitles, speech, and talking-head rendering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework for academic video generation

Tree search visual choice for slide layout refinement

Parallel slide-wise generation for efficient video production
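
The last two points can be illustrated with a minimal sketch. It is not the authors' implementation: the layout parameters, the `score_layout` heuristic (a stand-in for whatever visual judge picks among rendered candidates), and every function name below are hypothetical. It enumerates a small two-level tree of layout variants (font size, then figure scale), keeps the best-scoring one for each slide, and processes slides independently in a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Layout:
    font_size: int       # points (hypothetical parameter)
    figure_scale: float  # fraction of the original figure size (hypothetical)


def candidate_layouts():
    """Enumerate a two-level tree of variants: font size first, then figure scale."""
    return [Layout(f, s) for f, s in product((24, 20, 16), (1.0, 0.75, 0.5))]


def score_layout(text_len: int, layout: Layout) -> float:
    """Toy stand-in for a visual judge: reward filling the slide, penalize overflow."""
    capacity = 1000.0
    used = text_len * layout.font_size / 20 + 400 * layout.figure_scale
    if used > capacity:
        # Overflowing layouts score negative, worse the more they overflow.
        return -(used - capacity) / capacity
    return used / capacity


def choose_layout(text_len: int) -> Layout:
    """Pick the highest-scoring candidate for one slide."""
    return max(candidate_layouts(), key=lambda layout: score_layout(text_len, layout))


def build_slide(item):
    """Build one slide; slides do not depend on each other."""
    index, text = item
    return {"index": index, "layout": choose_layout(len(text)), "text": text}


def build_all(sections, workers: int = 4):
    """Generate all slides in parallel; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(build_slide, enumerate(sections)))


if __name__ == "__main__":
    for slide in build_all(["a" * 100, "b" * 600]):
        print(slide["index"], slide["layout"])
```

Because each slide is chosen and built without reference to the others, the per-slide work can be spread across workers and reassembled in order, which is the property that slide-wise parallelization relies on.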

🔎 Similar Papers

MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

2024-03-03 | arXiv.org | Citations: 21
