🤖 AI Summary
Text-to-image diffusion models struggle to simultaneously maintain subject consistency and text alignment in multi-image generation; existing approaches rely on fine-tuning or image conditioning, incurring high computational costs and poor generalization. This paper proposes a training-free geometric disentanglement method: for the first time, it leverages the geometric structure of the text embedding space, explicitly decoupling shared subject representations from scene descriptions via token-level embedding rescaling and semantic suppression—thereby mitigating cross-frame semantic leakage. The method is plug-and-play and requires only a single text prompt. Experiments demonstrate substantial improvements across multiple benchmarks: subject consistency increases by 32.7% (ID preservation rate) and text alignment accuracy improves by +0.18 in CLIP-Score, surpassing state-of-the-art methods including 1Prompt1Story.
📝 Abstract
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.