🤖 AI Summary
To address the challenge of cross-turn subject consistency in multi-turn interactive image generation, where users may switch subjects frequently, this paper proposes AutoStudio, a training-free multi-agent framework. Methodologically, it introduces (1) a Parallel-UNet architecture coupled with subject-initialized generation, which improves small-subject preservation and inter-turn consistency; and (2) an LLM-driven three-agent collaboration (a subject manager, a layout generator, and a supervisor) paired with a Stable Diffusion-based drawer, enabling multi-subject coordination, controllable layout synthesis, and high-fidelity image generation. On the CMIGBench benchmark, AutoStudio improves average FID by 13.65% and average character-character similarity by 2.83%, establishing a new state of the art in multi-turn subject consistency.
📝 Abstract
As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation, has begun to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a Stable Diffusion (SD)-based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to produce fine-grained bounding boxes that control subject locations, (iii) a supervisor to provide suggestions for layout refinement, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer; it employs two parallel cross-attention modules to exploit subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. AutoStudio can thereby generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple turns, and it also raises the state-of-the-art performance by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
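The four-agent pipeline described above can be sketched as a simple per-turn loop. This is a minimal illustrative mock, not the paper's implementation: all class names, the naive left-to-right layout heuristic, and the stub renderer are assumptions for exposition; the real system uses LLM agents and an SD-based drawer.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical data structures; names are illustrative, not from the paper's code.
@dataclass
class Subject:
    name: str
    description: str

@dataclass
class Layout:
    boxes: dict  # subject name -> (x0, y0, x1, y1) in normalized coordinates

class SubjectManager:
    """(i) Parses each dialogue turn and maintains per-subject context across turns."""
    def __init__(self):
        self.subjects: dict[str, Subject] = {}

    def update(self, turn: dict) -> list[Subject]:
        for name, desc in turn["subjects"].items():
            self.subjects[name] = Subject(name, desc)  # persist subject context
        return [self.subjects[n] for n in turn["subjects"]]

class LayoutGenerator:
    """(ii) Proposes one bounding box per subject (here: a naive left-to-right split)."""
    def propose(self, subjects: list[Subject]) -> Layout:
        n = len(subjects)
        return Layout({s.name: (i / n, 0.2, (i + 1) / n, 0.9)
                       for i, s in enumerate(subjects)})

class Supervisor:
    """(iii) Reviews the layout and suggests refinements (here: clamps boxes to [0, 1])."""
    def refine(self, layout: Layout) -> Layout:
        clamped = {k: tuple(min(max(v, 0.0), 1.0) for v in box)
                   for k, box in layout.boxes.items()}
        return Layout(clamped)

class Drawer:
    """(iv) Stand-in for the SD-based renderer; returns a description of the render call."""
    def render(self, prompt: str, layout: Layout) -> dict:
        return {"prompt": prompt, "layout": layout.boxes}

def autostudio_turn(manager: SubjectManager, turn: dict) -> dict:
    """One interaction turn: manage subjects, propose and refine a layout, then draw."""
    subjects = manager.update(turn)
    layout = Supervisor().refine(LayoutGenerator().propose(subjects))
    return Drawer().render(turn["prompt"], layout)
```

Because the subject manager outlives individual turns, a subject introduced in turn 1 (e.g. "a corgi") keeps its stored description when it reappears in later turns, which is the mechanism behind cross-turn consistency.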