🤖 AI Summary
Text-to-image generation models achieve high single-image fidelity but suffer from significant limitations in cross-image thematic consistency—particularly character and object coherence—required for visual storytelling. This paper proposes a training-free diffusion sampling framework centered on a Zigzag sampling strategy: it alternates asymmetric text prompts across sampling steps and enforces inter-image visual feature sharing in the latent space, modeling consistency dynamically during generation. Crucially, the method avoids fine-tuning or auxiliary networks, introducing structured consistency constraints solely at inference time. Evaluated on multiple story visualization benchmarks, it improves the Consistency Score by 23.6% over state-of-the-art methods, and qualitative analysis confirms substantial gains in narrative coherence. The approach establishes a zero-shot paradigm for cross-image consistent generation, offering a lightweight, plug-and-play solution for visual storytelling without architectural or training overhead.
📝 Abstract
Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this either by fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates between asymmetric prompts to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.
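To make the two ideas in the abstract concrete, here is a minimal toy sketch of the sampling loop they describe: alternating between two (asymmetric) prompt conditionings across steps, and blending each frame's latent toward the cross-frame mean as a stand-in for visual sharing. All names (`zigzag_sample`, `share_w`, the embedding arguments) and the simplified denoiser update are hypothetical assumptions for illustration; the paper's actual method operates inside a diffusion model and differs in its details.

```python
import numpy as np

def zigzag_sample(latents, subj_emb, plain_emb, steps=10, share_w=0.1, seed=0):
    """Toy sketch of zigzag sampling with asymmetric prompts and visual sharing.

    latents:   (n_frames, d) initial noise latents, one per story frame.
    subj_emb:  subject-rich prompt embedding (used on even steps).
    plain_emb: plain scene prompt embedding (used on odd steps).
    share_w:   strength of cross-frame visual sharing.
    """
    rng = np.random.default_rng(seed)
    x = latents.copy()
    for t in range(steps):
        # Asymmetric prompting: alternate which embedding conditions this step.
        cond = subj_emb if t % 2 == 0 else plain_emb
        # Stand-in for one denoising update: pull latents toward the conditioning.
        x = x + 0.1 * (cond - x) + 0.01 * rng.standard_normal(x.shape)
        # Visual sharing: blend each frame's latent toward the cross-frame mean,
        # nudging all frames toward a consistent subject appearance.
        x = (1.0 - share_w) * x + share_w * x.mean(axis=0, keepdims=True)
    return x
```

In this toy setting the sharing step contracts per-frame deviations from the mean at every iteration, so frames become measurably more similar as sampling proceeds, which is the qualitative effect the abstract attributes to the visual sharing module.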