TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

📅 2024-04-29
🏛️ arXiv.org
📈 Citations: 12
Influential: 2
🤖 AI Summary
Addressing two key challenges in multi-turn text-image interaction—semantic consistency (text-image alignment) and contextual consistency (cross-turn character stability)—this paper proposes a training-free three-stage framework: "Scriptwriting–Rehearsal–Final Performance." It leverages large language models (LLMs) to autonomously generate structured prompt books and spatial layouts, which guide diffusion models through character rehearsal and final image synthesis. The paper also introduces CMIGBench, the first multi-turn interactive generation benchmark that does not predefine characters, supporting both story generation and multi-turn editing tasks. The method requires no model fine-tuning; instead, it combines prompt-book management, layout-guided generation, and guidance injection into the reverse denoising process. On CMIGBench, TheaterGen outperforms Mini DALL·E 3 by 21% in average character-character similarity and 19% in average text-image similarity, establishing new state-of-the-art performance over existing methods.

📝 Abstract
Recent advances in diffusion models can generate high-quality and stunning images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to a "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, as if conducting the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALL·E 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
Problem

Research questions and friction points this paper is trying to address.

Maintaining semantic consistency in multi-turn image generation
Ensuring contextual consistency across interactive image turns
Managing character consistency without predefined character definitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLMs and T2I models for multi-turn generation
Uses prompt book and guidance for consistency
Introduces CMIGBench for comprehensive evaluation
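The per-turn flow described in the abstract (LLM "Screenwriter" updates a prompt book, "Rehearsal" produces per-character guidance, "Final Performance" injects both into denoising) can be sketched as an orchestration loop. This is a hypothetical illustration with stubbed stages, not the authors' code; every function, field, and data-structure name here is an assumption.

```python
# Hedged sketch of TheaterGen's three-stage, multi-turn loop.
# Stage bodies are placeholders standing in for LLM and diffusion calls.
from dataclasses import dataclass, field

@dataclass
class CharacterEntry:
    prompt: str            # per-character text prompt from the prompt book
    layout: tuple          # spatial layout, e.g. a normalized box (x, y, w, h)

@dataclass
class PromptBook:
    turn: int
    characters: dict = field(default_factory=dict)  # name -> CharacterEntry

def screenwriter(history, instruction, turn):
    """Stage 1 ("Scriptwriting"): an LLM would parse the user instruction
    against prior turns and update the prompt book. Stubbed here."""
    book = PromptBook(turn=turn)
    book.characters["cat"] = CharacterEntry(prompt=instruction,
                                            layout=(0.1, 0.2, 0.3, 0.4))
    return book

def rehearsal(book):
    """Stage 2 ("Rehearsal"): generate one reference image per character
    and extract guidance from it. Stubbed with placeholder strings."""
    return {name: f"guidance_for_{name}" for name in book.characters}

def final_performance(book, guidance):
    """Stage 3 ("Final Performance"): inject the prompt book and guidance
    into the T2I model's reverse denoising process. Stubbed."""
    return {"turn": book.turn, "characters": list(book.characters)}

history = []
for turn, instruction in enumerate(["a cat on a sofa",
                                    "the cat now wears a hat"]):
    book = screenwriter(history, instruction, turn)
    guidance = rehearsal(book)
    image = final_performance(book, guidance)
    history.append(book)   # the carried history is what keeps identity stable
```

Because each turn reads the accumulated prompt books rather than regenerating characters from scratch, the same character identity can persist across edits, which is the contextual-consistency property the paper targets.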
Junhao Cheng
Shenzhen Campus of Sun Yat-sen University
Baiqiao Yin
Shenzhen Campus of Sun Yat-sen University
Kaixin Cai
Shenzhen Campus of Sun Yat-sen University
Minbin Huang
The Chinese University of Hong Kong
Hanhui Li
Sun Yat-sen University
Deep Learning, Computer Vision
Yuxin He
Shenzhen Campus of Sun Yat-sen University
Xi Lu
Tsinghua University
Energy, Wind Power, Solar Power, Environmental Science, Climate Change
Yue Li
Shenzhen Campus of Sun Yat-sen University
Yifei Li
Shenzhen Campus of Sun Yat-sen University
Yuhao Cheng
Lenovo Research
Yiqiang Yan
Lenovo
Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer Vision, Embodied AI, Machine Learning