🤖 AI Summary
Existing autoregressive visual storytelling methods suffer from high memory overhead, slow inference, and weak contextual modeling, resulting in poor consistency of characters and scenes across image sequences. To address these limitations, we propose an efficient generative framework tailored for visual narrative synthesis. Our approach introduces a Spatially-Enhanced Temporal Attention mechanism for fine-grained spatiotemporal modeling, a Storyline Contextualizer to capture global narrative structure, and a StoryFlow Adapter to explicitly model dynamic scene evolution. Furthermore, we adopt a multi-stage generation architecture integrating diffusion or VAE-based components to improve controllability and fidelity. Evaluated on the PororoSV and FlintstonesSV benchmarks, our method achieves a 23.6% reduction in FID and an 18.4% improvement in CLIP-Score across both story visualization and story continuation tasks, significantly outperforming state-of-the-art approaches.
📝 Abstract
Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for story continuation. ContextualStory utilizes Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, handling significant character movements effectively. Additionally, we introduce a Storyline Contextualizer to enrich the context in storyline embeddings and a StoryFlow Adapter to measure scene changes between frames to guide the model. Extensive experiments on the PororoSV and FlintstonesSV benchmarks demonstrate that ContextualStory significantly outperforms existing methods in both story visualization and story continuation.
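The abstract does not specify how Spatially-Enhanced Temporal Attention is implemented. As a rough intuition only, a common pattern for temporal attention over a frame sequence is to let each spatial location attend across time; the sketch below illustrates that generic pattern in NumPy, with all function and parameter names (`spatial_temporal_attention`, `Wq`, `Wk`, `Wv`) hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_temporal_attention(frames, Wq, Wk, Wv):
    """Illustrative sketch (not the paper's method): every spatial
    location attends over the T frames of one story sequence.

    frames: (T, H, W, C) per-frame feature maps
    Wq, Wk, Wv: (C, C) hypothetical projection matrices
    """
    T, H, W, C = frames.shape
    # Flatten space, then move the spatial axis first: (H*W, T, C).
    x = frames.reshape(T, H * W, C).transpose(1, 0, 2)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Scaled dot-product attention over the time axis: (H*W, T, T).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores) @ v                      # (H*W, T, C)
    # Restore the original frame-sequence layout: (T, H, W, C).
    return out.transpose(1, 0, 2).reshape(T, H, W, C)
```

The output keeps the input layout, so such a block could be dropped between spatial layers of a frame-wise backbone; the actual ContextualStory design may differ substantially.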