🤖 AI Summary
This work addresses the challenges of background inconsistency, discontinuous multi-character shot transitions, and limited scalability to hour-long narratives in long-form video generation. To tackle these issues, the authors propose a video generation framework tailored for complex multi-agent scenes, featuring a background consistency generation pipeline and a transition-aware synthesis module. This design preserves character identity and enables natural entrance and exit transitions across frames. The framework is trained on a newly constructed synthetic dataset comprising 10,000 multi-agent transition sequences. Experimental results demonstrate significant improvements over existing methods on VBench, achieving scores of 88.94 in background consistency, 82.11 in subject consistency, and an average ranking of 2.80, thereby enhancing spatiotemporal coherence, transition smoothness, and long-range narrative capability.
📝 Abstract
Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and a model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames, going beyond the single-subject limitations of prior work. To support this, we contribute with a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.