SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

📅 2025-01-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-image generation methods struggle to simultaneously preserve subject fidelity and ensure semantic consistency of the background. This paper proposes a novel "fixed-subject + background-generation" paradigm that keeps the input subject image unchanged while synthesizing a semantically coherent and compositionally harmonious background conditioned on the text prompt and object phrases. The key contributions are: (1) a multimodal layout generation module that jointly models textual descriptions, object phrases, and subject visual features to achieve scene-level semantic alignment; and (2) a background painting module that augments a latent diffusion model with ControlNet and a gated self-attention adapter to synthesize a background that harmonizes with the fixed subject. Because the subject pixels are kept as-is, the subject's appearance is preserved without degradation. Extensive quantitative and qualitative evaluations demonstrate that the method significantly outperforms state-of-the-art approaches in subject fidelity, background harmony, and overall image quality.
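At the compositing level, the "fixed-subject + background-generation" paradigm reduces to copying the subject pixels verbatim and diffusing only the background. Below is a minimal numpy sketch of that final step; `composite_fixed_subject` and the toy inputs are hypothetical stand-ins, with the real mask and background coming from the paper's layout generation and background painting modules.

```python
import numpy as np

def composite_fixed_subject(subject_rgb: np.ndarray,
                            subject_mask: np.ndarray,
                            background_rgb: np.ndarray) -> np.ndarray:
    """Paste the unchanged subject over a generated background.

    subject_rgb:    (H, W, 3) float array, the input subject image.
    subject_mask:   (H, W, 1) float array in [0, 1], 1 where the subject is.
    background_rgb: (H, W, 3) float array from the background painting module.
    """
    # The subject pixels are copied verbatim, so subject fidelity cannot
    # degrade -- this is the core of the fixed-subject paradigm.
    return subject_mask * subject_rgb + (1.0 - subject_mask) * background_rgb

# Toy usage with random stand-in data.
h, w = 64, 64
subject = np.random.rand(h, w, 3)
mask = np.zeros((h, w, 1))
mask[16:48, 16:48] = 1.0
background = np.random.rand(h, w, 3)
out = composite_fixed_subject(subject, mask, background)
assert np.allclose(out[16:48, 16:48], subject[16:48, 16:48])
```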

๐Ÿ“ Abstract
Due to the demand for personalized image generation, subject-driven text-to-image generation, which creates novel renditions of an input subject based on text prompts, has received growing research interest. Existing methods often learn a subject representation and incorporate it into the prompt embedding to guide image generation, but they struggle with preserving subject fidelity. To solve this issue, this paper proposes a novel framework named SceneBooth for subject-preserved text-to-image generation, which takes a subject image, object phrases and text prompts as inputs. Instead of learning a subject representation and generating the subject, our SceneBooth fixes the given subject image and generates its background image guided by the text prompts. To this end, our SceneBooth introduces two key components, i.e., a multimodal layout generation module and a background painting module. The former determines the position and scale of the subject by generating appropriate scene layouts that align with text captions, object phrases, and subject visual information. The latter integrates two adapters (ControlNet and Gated Self-Attention) into the latent diffusion model to generate a background that harmonizes with the subject, guided by scene layouts and text descriptions. In this manner, our SceneBooth ensures accurate preservation of the subject's appearance in the output. Quantitative and qualitative experimental results demonstrate that SceneBooth significantly outperforms baseline methods in terms of subject preservation, image harmonization and overall quality.
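The abstract names a Gated Self-Attention adapter as one of the two adapters in the background painting module. The sketch below shows a GLIGEN-style gated self-attention layer, a plausible reading of that component; the dimensions, token construction, and gating details are assumptions, not the paper's confirmed configuration.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """GLIGEN-style gated self-attention adapter (illustrative sketch).

    Visual tokens attend jointly to themselves and to grounding tokens
    (e.g. layout boxes plus phrase embeddings). A learnable gate,
    initialized so tanh(gamma) = 0, lets training blend the adapter in
    gradually without disturbing the pretrained diffusion backbone.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gamma = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual:    (B, N_v, dim) latent-image tokens
        # grounding: (B, N_g, dim) layout/phrase tokens
        x = self.norm(torch.cat([visual, grounding], dim=1))
        attn_out, _ = self.attn(x, x, x)
        # Only the visual tokens receive the gated residual update.
        return visual + torch.tanh(self.gamma) * attn_out[:, : visual.size(1)]

# Toy usage with assumed shapes.
layer = GatedSelfAttention(dim=320)
v = torch.randn(2, 64, 320)   # 8x8 latent grid, flattened
g = torch.randn(2, 4, 320)    # four grounded object tokens
print(layer(v, g).shape)      # torch.Size([2, 64, 320])
```

The zero-initialized gate mirrors how such adapters are typically attached to a frozen backbone: at initialization the layer is an identity mapping on the visual tokens, so the pretrained model's behavior is unchanged until the gate opens during fine-tuning.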
Problem

Research questions and friction points this paper is trying to address.

Text-to-Image Generation
Semantic Similarity
Visual Representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SceneBooth
Text-to-Image Synthesis
Subject Preservation