TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

📅 2024-04-29

🏛️ arXiv.org

📈 Citations: 12

✨ Influential: 2

🤖 AI Summary

This paper addresses two key challenges in multi-turn text-to-image interaction: semantic consistency (alignment between each image and its text) and contextual consistency (keeping the same character stable across turns). It proposes TheaterGen, a training-free framework with three stages: "Screenwriter", "Rehearsal", and "Final Performance". A large language model (LLM) maintains a structured prompt book containing a prompt and a spatial layout for each character; a diffusion model then generates per-character rehearsal images, and the guidance extracted from them is injected into the reverse denoising process to synthesize the final image. The paper also introduces CMIGBench, a benchmark of 8000 multi-turn instructions that, unlike earlier multi-turn benchmarks, does not define characters in advance and covers both story generation and multi-turn editing. The method requires no model fine-tuning. On CMIGBench, it improves on Mini DALL·E 3 by 21% in average character-character similarity and 19% in average text-image similarity, and outperforms the other methods compared.

📝 Abstract

Recent advances in diffusion models make it possible to generate high-quality images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to a "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, akin to conducting the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Different from previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALL·E 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
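To make the "prompt book" idea more concrete, the sketch below shows one way such a structure could be kept across turns: each character has a stable ID, a prompt, and a layout box, and later turns update entries in place rather than redefining them. This is only an illustration under stated assumptions, not the paper's implementation. The names CharacterEntry, PromptBook, upsert, and global_prompt, the normalized (x, y, w, h) box format, and the example prompts are all hypothetical; in TheaterGen the entries would be produced by the LLM rather than written by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CharacterEntry:
    """One character in the prompt book (hypothetical schema)."""
    char_id: str                             # stable ID reused across turns
    prompt: str                              # text description of the character
    bbox: Tuple[float, float, float, float]  # assumed normalized (x, y, w, h)


@dataclass
class PromptBook:
    """Per-conversation state: a background plus one entry per character."""
    background: str = ""
    characters: Dict[str, CharacterEntry] = field(default_factory=dict)

    def upsert(self, entry: CharacterEntry) -> None:
        # Adding a new ID introduces a character; reusing an ID edits it,
        # so the same identity persists from turn to turn.
        self.characters[entry.char_id] = entry

    def remove(self, char_id: str) -> None:
        self.characters.pop(char_id, None)

    def global_prompt(self) -> str:
        # Character prompts in insertion order, followed by the background.
        parts = [c.prompt for c in self.characters.values()]
        if self.background:
            parts.append(self.background)
        return ", ".join(parts)


# Turn 1: a single character in a scene.
book = PromptBook(background="a sunny park")
book.upsert(CharacterEntry("dog_1", "a small white dog", (0.1, 0.5, 0.3, 0.4)))

# Turn 2: a second character is added; dog_1 is left untouched.
book.upsert(CharacterEntry("cat_1", "an orange cat", (0.6, 0.5, 0.3, 0.4)))

# Turn 3: an edit moves the dog; its ID and prompt stay the same.
book.upsert(CharacterEntry("dog_1", "a small white dog", (0.4, 0.5, 0.3, 0.4)))

print(book.global_prompt())
```

Keying entries by a stable ID is the design choice that matters here: an edit in a later turn changes only the fields that the instruction mentions, which is what lets a downstream generator treat "dog_1" as the same subject in every turn.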

Problem

Research questions and friction points this paper is trying to address.

Maintaining semantic consistency in multi-turn image generation

Ensuring contextual consistency across interactive image turns

Managing character consistency without predefined character definitions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLMs and T2I models for multi-turn generation

Uses prompt book and guidance for consistency

Introduces CMIGBench for comprehensive evaluation
