🤖 AI Summary
This paper targets two key challenges in multi-turn image generation: semantic consistency (text-image alignment) and contextual consistency (stability of the same character across turns). It proposes TheaterGen, a training-free three-stage framework organized as “Screenwriter, Rehearsal, Final Performance.” A large language model (LLM) generates and maintains a structured prompt book containing a prompt and a spatial layout for each character. The framework then generates per-character images and extracts guidance information from them (the rehearsal), and injects the prompt book and guidance into the reverse denoising process of a text-to-image diffusion model to synthesize the final image. The paper also introduces CMIGBench, a benchmark of 8000 multi-turn instructions that does not define characters in advance and covers both story generation and multi-turn editing. The method requires no model fine-tuning. On CMIGBench, it improves on Mini DALL·E 3 by 21% in average character-character similarity and 19% in average text-image similarity, outperforming existing state-of-the-art methods.
📝 Abstract
Thanks to recent advances, diffusion models can generate high-quality, visually striking images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to enable multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that contains prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to a "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, akin to the "Final Performance". With effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen significantly outperforms state-of-the-art methods. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
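To make the "prompt book" idea more concrete, here is a minimal Python sketch of how per-character prompts and layouts might be kept consistent across turns. It is an illustration only, not TheaterGen's actual data format: the class names (`PromptBook`, `CharacterEntry`), the update dictionary keys (`background`, `upsert`, `remove`), and the normalized `(x, y, w, h)` box convention are all assumptions introduced here. In the real framework, the per-turn updates would come from the LLM acting as the Screenwriter.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical layout convention: (x, y, width, height), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class CharacterEntry:
    prompt: str  # text description of one character
    box: Box     # where that character should appear in the image


@dataclass
class PromptBook:
    background: str = ""
    characters: Dict[str, CharacterEntry] = field(default_factory=dict)

    def apply_turn(self, update: dict) -> None:
        """Apply one turn's edits; characters not mentioned stay unchanged."""
        if "background" in update:
            self.background = update["background"]
        for char_id in update.get("remove", []):
            self.characters.pop(char_id, None)
        for char_id, spec in update.get("upsert", {}).items():
            self.characters[char_id] = CharacterEntry(
                prompt=spec["prompt"], box=tuple(spec["box"])
            )


# Turn 1 introduces a cat; turn 2 adds a dog without touching the cat entry.
book = PromptBook()
book.apply_turn({
    "background": "a sunny park",
    "upsert": {"cat": {"prompt": "a white cat", "box": [0.1, 0.5, 0.3, 0.3]}},
})
book.apply_turn({
    "upsert": {"dog": {"prompt": "a brown dog", "box": [0.6, 0.5, 0.3, 0.3]}},
})
```

Keeping one persistent entry per character is what would let later stages reuse the same character prompt and reference image in every turn, rather than re-describing the character from scratch.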