🤖 AI Summary
Current text-to-image generation models struggle to maintain scene-level narrative coherence and cross-story scene consistency in multi-image storytelling. In particular, they lack structured scene planning and a way to share scene information over long generation horizons. To address this, we introduce a scene-oriented story generation framework, SceneDecorator, with two components: (1) vision-language model (VLM) guided scene planning that works from global to local, so that the planned scenes stay narratively coherent with one another; and (2) a long-term scene-sharing attention mechanism inside the diffusion model, which keeps the scene consistent across stories without additional training while preserving subject diversity. Experiments show that the framework outperforms existing methods in both scene-level narrative coherence and visual consistency. The approach offers a training-free route to consistent story generation, with potential applications in art, film storyboarding, and game narrative design.
📝 Abstract
Recent text-to-image models have revolutionized image generation, but they still struggle to maintain concept consistency across generated images. Existing works focus on character consistency and often overlook the crucial role of scenes in storytelling, which limits their creative use in practice. This paper introduces scene-oriented story generation and addresses two key challenges: (i) scene planning, where current methods rely solely on text descriptions and therefore fail to ensure scene-level narrative coherence, and (ii) scene consistency, where keeping a scene consistent across multiple stories remains largely unexplored. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in art, film, and games.
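
The abstract does not spell out how Long-Term Scene-Sharing Attention is computed. As a rough illustration only, the sketch below shows one common training-free way to share content across images: each image's attention queries read from its own keys and values plus keys and values taken from a scene reference. The function names (`attend`, `scene_shared_attend`) and the concatenation strategy are assumptions made for this example, not SceneDecorator's actual mechanism.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # One similarity score per key, then normalize to attention weights.
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        )
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def scene_shared_attend(queries, keys, values, scene_keys, scene_values):
    """Hypothetical sharing step (an assumption, not the paper's method):
    append scene-reference keys/values to the image's own, so every query
    can also read features from the shared scene."""
    return attend(queries, keys + scene_keys, values + scene_values)


# Toy example: one query token, one own token, one scene-reference token.
own_only = attend([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]])
shared = scene_shared_attend(
    [[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]],
    scene_keys=[[1.0, 0.0]], scene_values=[[0.0, 1.0]],
)
```

In the toy example, the scene key matches the query as well as the image's own key does, so the shared output mixes the two value vectors equally. No weights are learned or updated, which is why sharing of this kind can be applied to a pretrained diffusion model without additional training.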