VideoAgent: Personalized Synthesis of Scientific Videos

📅 2025-09-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing methods for scientific video generation struggle to support personalized dynamic narration and synchronized multimodal content. This paper introduces the first multi-agent video generation framework tailored to academic papers, enabling end-to-end conversion of scholarly articles into explanatory videos via collaborative parsing, dynamic narrative orchestration, and synchronized cross-modal (text/image/animation/speech) synthesis. The framework lets users customize the narrative logic and introduces SciVidEval, a new evaluation benchmark that combines automated metrics with human assessment via video-based knowledge quizzes to quantify knowledge transfer. Experiments show that the approach significantly outperforms leading commercial tools in scientific accuracy, narrative coherence, and communicative effectiveness, achieving video quality comparable to human experts while substantially improving both the efficiency of scientific knowledge dissemination and the user experience.

πŸ“ Abstract
Automating the generation of scientific videos is a crucial yet challenging task for effective knowledge dissemination. However, existing works on document automation primarily focus on static media such as posters and slides, lacking mechanisms for personalized dynamic orchestration and multimodal content synchronization. To address these challenges, we introduce VideoAgent, a novel multi-agent framework that synthesizes personalized scientific videos through a conversational interface. VideoAgent parses a source paper into a fine-grained asset library and, guided by user requirements, orchestrates a narrative flow that synthesizes both static slides and dynamic animations to explain complex concepts. To enable rigorous evaluation, we also propose SciVidEval, the first comprehensive suite for this task, which combines automated metrics for multimodal content quality and synchronization with a Video-Quiz-based human evaluation to measure knowledge transfer. Extensive experiments demonstrate that our method significantly outperforms existing commercial scientific video generation services and approaches human-level quality in scientific communication.
Problem

Research questions and friction points this paper is trying to address.

Automating personalized scientific video generation
Synchronizing multimodal content for dynamic orchestration
Evaluating knowledge transfer in scientific communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework synthesizes personalized scientific videos
Parses papers into asset library for narrative orchestration
Combines static slides with dynamic animations automatically
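The innovation bullets above describe a three-stage pipeline: parse the paper into an asset library, orchestrate a narrative flow from user requirements, then synthesize scenes. A minimal Python sketch of that flow is shown below; all names (`Asset`, `parse_paper`, `orchestrate`, `synthesize`) and the "concise" preference rule are hypothetical illustrations, not the paper's actual agents or APIs.

```python
from dataclasses import dataclass

# Hypothetical asset extracted from the source paper (e.g. text, figure, equation).
@dataclass
class Asset:
    kind: str
    content: str

def parse_paper(sections: dict) -> list:
    """Parser stage: split the paper into a fine-grained asset library."""
    return [Asset("text", body) for body in sections.values()]

def orchestrate(assets: list, user_pref: str) -> list:
    """Planner stage: order assets into a narrative flow per user preference."""
    # Illustrative rule only: a "concise" preference keeps short assets.
    if user_pref == "concise":
        return [a for a in assets if len(a.content) < 80]
    return assets

def synthesize(flow: list) -> list:
    """Renderer stage: turn each asset into a scene description (slide or animation)."""
    return [f"scene[{i}]: {a.kind} -> {a.content[:40]}" for i, a in enumerate(flow)]

paper = {
    "intro": "Motivation for scientific videos.",
    "method": "A long, detailed description of the multi-agent pipeline. " * 3,
}
scenes = synthesize(orchestrate(parse_paper(paper), "concise"))
print(scenes)
```

In the real framework each stage would be a cooperating agent with cross-modal synchronization between narration, slides, and animations; the sketch only illustrates the parse → orchestrate → synthesize data flow.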
Xiao Liang
School of Computer Science and Technology, Xidian University
Bangxin Li
School of Computer Science and Technology, Xidian University
Zixuan Chen
School of Computer Science and Technology, Xidian University
Hanyue Zheng
School of Computer Science and Technology, Xidian University
Zhi Ma
China Mobile (Hangzhou) Information Technology Co., Ltd.
Di Wang
School of Computer Science and Technology, Xidian University
Cong Tian
Xidian University
Quan Wang
School of Computer Science and Technology, Xidian University