🤖 AI Summary
Text-to-image diffusion models struggle with cross-scene subject consistency when generating coherent visual narratives, and existing fine-tuning approaches incur high computational costs while often degrading pretrained generative capabilities. To address this, we propose a training-free consistency control framework built entirely upon frozen, pretrained diffusion models. Our method introduces masked cross-image attention sharing to align features across corresponding subject regions, and employs Regional Feature Harmonization to dynamically coordinate representations of the same subject across multiple generated images. Both components operate solely during forward inference, requiring no optimization or parameter updates. Experiments demonstrate substantial improvements in inter-image consistency for characters and objects across diverse narrative scenarios, while preserving the model's inherent generation diversity, fine-grained detail fidelity, and creative flexibility. This work establishes an efficient, lightweight, plug-and-play paradigm for zero-shot visual storytelling.
📝 Abstract
Generating a coherent sequence of images that tells a visual story with text-to-image diffusion models poses the critical challenge of maintaining subject consistency across all story scenes. Existing approaches typically rely on fine-tuning or retraining, which is computationally expensive, time-consuming, and often interferes with the model's pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. It works seamlessly with pre-trained diffusion models by introducing Masked Cross-Image Attention Sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach generates visually consistent subjects across a variety of scenarios while preserving the creative abilities of the underlying diffusion model.
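To make the core idea concrete, the following is a minimal NumPy sketch of masked cross-image attention sharing as described above: each image's queries attend to its own tokens plus the subject-masked tokens of the other images in the batch, so subject features are exchanged batch-wide without any parameter updates. The function name, tensor shapes, and mask convention are illustrative assumptions, not the paper's actual implementation, which would operate inside the attention layers of a frozen diffusion U-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_image_attention(Q, K, V, subject_masks):
    """Sketch of masked cross-image attention sharing (shapes are assumptions).

    Q, K, V:        (B, N, d) per-image queries/keys/values from one layer.
    subject_masks:  (B, N) boolean; True where a token lies on the subject.

    Each image attends to all of its own tokens, plus only the subject
    tokens contributed by every other image in the batch.
    """
    B, N, d = Q.shape
    out = np.empty_like(Q)
    for i in range(B):
        keys, vals = [K[i]], [V[i]]          # own tokens are always visible
        for j in range(B):
            if j == i:
                continue
            m = subject_masks[j]
            keys.append(K[j][m])             # share only subject tokens
            vals.append(V[j][m])
        Kc = np.concatenate(keys, axis=0)
        Vc = np.concatenate(vals, axis=0)
        attn = softmax(Q[i] @ Kc.T / np.sqrt(d), axis=-1)
        out[i] = attn @ Vc
    return out
```

With all-False masks this reduces to ordinary per-image self-attention, which is one way such a training-free modification can preserve the frozen model's original behavior outside the shared subject regions.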