🤖 AI Summary
Existing long-form video generation methods struggle to maintain knowledge consistency and pedagogical narrative coherence in multi-shot STEM instructional videos. This work proposes EduStory, a unified framework that couples pedagogical state modeling, which tracks persistent knowledge states, with script-guided structured narrative control to generate multi-shot videos that stay factually consistent and logically coherent. Its core contributions are EduVideoBench, a diagnostic benchmark with multi-granularity annotations; learning-oriented metrics for assessing knowledge fidelity and constraint satisfaction; and experiments showing that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment between generated videos and instructional intent.
📝 Abstract
Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the importance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and trustworthy long-horizon video generation.