StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

📅 2025-07-31
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Text-to-image diffusion models struggle with cross-scene subject consistency when generating coherent visual narratives; existing fine-tuning approaches incur high computational costs and often degrade pretrained generative capabilities. To address this, we propose a training-free consistency control framework built entirely upon frozen, pretrained diffusion models. Our method introduces mask-guided cross-image attention sharing to align features across corresponding regions and employs region-wise feature harmonization to dynamically coordinate representations of the same subject across multiple generated images. Both components operate solely during forward inference—requiring no optimization or parameter updates. Experiments demonstrate substantial improvements in inter-image consistency for characters and objects across diverse narrative scenarios, while fully preserving the model’s inherent generation diversity, fine-grained detail fidelity, and creative flexibility. This work establishes an efficient, lightweight, plug-and-play paradigm for zero-shot visual storytelling.
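The core mechanism lends itself to a short illustration. Below is a minimal PyTorch sketch of the masked cross-image attention-sharing idea, in which each image of the batch attends to its own tokens plus the subject-region tokens of every other image. The function name `shared_attention`, the `subject_masks` argument, and the single-head layout are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def shared_attention(q, k, v, subject_masks):
    """Masked cross-image attention sharing (illustrative sketch).

    q, k, v: (B, L, D) query/key/value projections from one attention
    layer, one row per image in the generated batch (assumes B >= 2).
    subject_masks: (B, L) boolean masks marking each image's subject tokens.
    Each image attends to its own tokens plus the subject tokens of all
    other images, which pulls subject features into alignment."""
    B, L, D = q.shape
    scale = D ** -0.5
    outputs = []
    for i in range(B):
        others = [j for j in range(B) if j != i]
        # Borrow subject-region keys/values from the other images.
        k_ext = torch.cat([k[j][subject_masks[j]] for j in others], dim=0)
        v_ext = torch.cat([v[j][subject_masks[j]] for j in others], dim=0)
        k_all = torch.cat([k[i], k_ext], dim=0)  # (L + L_ext, D)
        v_all = torch.cat([v[i], v_ext], dim=0)
        attn = F.softmax((q[i] @ k_all.T) * scale, dim=-1)
        outputs.append(attn @ v_all)             # (L, D)
    return torch.stack(outputs)                  # (B, L, D)
```

In practice such a function would stand in for the self-attention computation in selected layers of the frozen model during inference only, consistent with the training-free, forward-pass-only claim.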

📝 Abstract
Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.
Problem

Research questions and friction points this paper is trying to address.

Maintaining subject consistency in text-to-image story generation
Avoiding computationally expensive fine-tuning for coherence
Aligning subject features dynamically across multiple images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free subject consistency method
Masked cross-image attention sharing
Regional Feature Harmonization refinement (sketched after this list)
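To make the second component concrete, here is a minimal sketch of the Regional Feature Harmonization idea under a simplifying assumption: subject-region features are blended toward their cross-image mean. The paper refines visually similar details, which a plain mean blend only approximates, and all names here (`harmonize_regions`, `strength`) are hypothetical.

```python
import torch

def harmonize_regions(features, subject_masks, strength=0.5):
    """Regional feature harmonization (illustrative approximation).

    features: (B, L, D) intermediate features from one layer of the
    frozen diffusion model, one row per image in the batch.
    subject_masks: (B, L) boolean masks for each image's subject tokens.
    strength: 0 keeps per-image features, 1 snaps them to the shared mean.
    Blending every subject region toward a common anchor nudges the
    same subject's appearance to agree across images."""
    B = features.shape[0]
    # Anchor: mean feature pooled over all subject regions in the batch.
    anchor = torch.cat(
        [features[i][subject_masks[i]] for i in range(B)], dim=0
    ).mean(dim=0)                                # (D,)
    harmonized = features.clone()
    for i in range(B):
        region = harmonized[i][subject_masks[i]]
        harmonized[i][subject_masks[i]] = (1 - strength) * region + strength * anchor
    return harmonized
```

Varying `strength` across denoising steps, for example stronger early and weaker late, would be one plausible way to trade cross-image consistency against per-image detail.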
Gopalji Gaur, University of Freiburg
Mohammadreza Zolfaghari, Zebracat AI
Thomas Brox, University of Freiburg
Computer Vision · Machine Learning · Artificial Intelligence · Robotics