Be Decisive: Noise-Induced Layouts for Multi-Subject Generation

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text-to-image diffusion models suffer from subject leakage in multi-subject generation, leading to inaccuracies in subject count, attributes, and visual features. This stems from the misalignment between externally imposed layout constraints and the model’s intrinsic noise prior. To address this, we propose Noise-Induced Layout (NIL), a mechanism that dynamically predicts and iteratively refines semantically grounded spatial structures during denoising via a lightweight neural network. NIL jointly models spatial boundaries and layout consistency constraints, enabling adaptive alignment between layout specifications and the diffusion prior—without requiring predefined layouts. Our method significantly improves text–image alignment and multi-subject generation stability, outperforming existing layout-guided approaches in subject count fidelity, attribute accuracy, and visual correctness, while preserving generation diversity.

📝 Abstract

Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.
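
The predict-then-refine loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `predict_layout` is a hypothetical stand-in for the small layout network (here a per-pixel linear map plus softmax), `toy_denoise_step` is a placeholder for a real diffusion denoiser, and the `momentum` blending rule is an assumption chosen only to show how a layout derived from the initial noise might be updated at every step.

```python
import numpy as np


def predict_layout(latent, weights):
    """Hypothetical layout predictor: maps a (C, H, W) latent to a soft
    (K, H, W) layout, one probability map per subject."""
    logits = np.einsum("chw,kc->khw", latent, weights)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=0, keepdims=True)


def toy_denoise_step(latent, step, num_steps):
    """Placeholder for a real denoiser: only rescales the latent."""
    return latent * (1.0 - 1.0 / (num_steps - step + 1))


def run(num_subjects=2, channels=4, size=8, num_steps=5, momentum=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # The layout starts from the sampled initial noise, not an external box.
    latent = rng.standard_normal((channels, size, size))
    weights = rng.standard_normal((num_subjects, channels))  # assumed, untrained

    layout = predict_layout(latent, weights)
    history = [layout]
    for step in range(num_steps):
        latent = toy_denoise_step(latent, step, num_steps)
        new_layout = predict_layout(latent, weights)
        # Assumed refinement rule: blend the previous and the new prediction.
        layout = momentum * layout + (1.0 - momentum) * new_layout
        history.append(layout)
    return layout, history


if __name__ == "__main__":
    final_layout, history = run()
    print(final_layout.shape, len(history))
```

In a real system the predicted layout would presumably feed back into the denoiser (for example by constraining attention between subjects); the sketch omits that coupling and only tracks how the layout evolves.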

Problem

Research questions and friction points this paper is trying to address.

Preventing subject leakage in multi-subject image generation

Aligning spatial layouts with initial noise to avoid conflicts

Improving text-image alignment and generation stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts spatial layout from initial noise

Refines layout during denoising process

Employs a small neural network to keep subject boundaries clear and the layout consistent
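
One simple way to make "clear boundaries" measurable is the mean per-pixel entropy of a soft layout: it is close to zero when every pixel belongs to exactly one subject and reaches log K when all K subjects overlap uniformly. The sketch below is an illustrative assumption, not a metric or loss taken from the paper, and `layout_entropy` is a hypothetical helper name.

```python
import numpy as np


def layout_entropy(layout, eps=1e-12):
    """Mean per-pixel entropy of a soft layout of shape (K, H, W), where the
    K values at each pixel are probabilities that sum to 1."""
    p = np.clip(layout, eps, 1.0)  # avoid log(0)
    per_pixel = -(p * np.log(p)).sum(axis=0)
    return float(per_pixel.mean())


if __name__ == "__main__":
    # Hard layout: left half belongs to subject 0, right half to subject 1.
    hard = np.zeros((2, 4, 4))
    hard[0, :, :2] = 1.0
    hard[1, :, 2:] = 1.0

    # Fully ambiguous layout: both subjects equally likely everywhere.
    uniform = np.full((2, 4, 4), 0.5)

    print(layout_entropy(hard), layout_entropy(uniform))
```

Lower values indicate sharper separation between subjects, so a quantity of this kind could in principle be tracked across denoising steps to check whether the layout is becoming more decisive.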

🔎 Similar Papers

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance