Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In text-to-image generation, pretrained diffusion models suffer from semantic misalignment due to a fundamental mismatch: during inference, the initial noise is sampled from a prompt-agnostic Gaussian prior, whereas training constrains noise to a prompt-conditioned latent subspace. This work is the first to identify this training-inference noise distribution discrepancy as the root cause of text–image misalignment. To address it, we propose a lightweight, text-conditioned noise projection framework that corrects the initial noise distribution in a single forward pass—without fine-tuning the pretrained model. Our method leverages fine-grained feedback from vision-language models to construct a reward model and optimizes the noise projector via a DPO-inspired algorithm. Experiments demonstrate substantial improvements in text alignment across diverse prompt categories, while preserving both image fidelity and diversity.

📝 Abstract
In text-to-image generation, different initial noises induce distinct denoising paths in a pretrained Stable Diffusion (SD) model. While this yields diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and performing post-hoc selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework proceeds in three steps: we first sample noises and obtain token-level feedback on their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector with a quasi-direct preference optimization algorithm. This design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs little inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
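The text-conditioned refinement described in the abstract can be sketched as a small residual network that maps an initial noise and a prompt embedding to a prompt-aware noise. This is a hypothetical minimal sketch, assuming a simple MLP; the paper's actual architecture and dimensions are not stated here, and the names `NoiseProjector` and `hidden` are illustrative:

```python
import torch
import torch.nn as nn


class NoiseProjector(nn.Module):
    """Illustrative sketch: refine initial noise conditioned on a prompt embedding."""

    def __init__(self, noise_dim: int, text_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, noise_dim),
        )

    def forward(self, noise: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Predict a text-conditioned correction and apply it residually,
        # so the projected noise stays close to the Gaussian prior.
        delta = self.net(torch.cat([noise, text_emb], dim=-1))
        return noise + delta


# Usage: project a batch of Gaussian noises before handing them to the
# (unmodified) SD denoiser.
proj = NoiseProjector(noise_dim=64, text_dim=32)
refined = proj(torch.randn(2, 64), torch.randn(2, 32))
```

The residual form reflects the paper's framing that the projector corrects, rather than replaces, the sampled noise, keeping the single-forward-pass inference cost.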
Problem

Research questions and friction points this paper is trying to address.

Addressing text-image misalignment caused by prompt-agnostic noise sampling in diffusion models
Closing training-inference mismatch through prompt-conditioned noise refinement
Improving text-to-image alignment without modifying pretrained Stable Diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise projector refines the initial noise using the prompt embedding
Distills token-level VLM feedback into a reward model for optimization
Enhances alignment via a single forward pass without modifying SD
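The DPO-inspired optimization step above can be illustrated with a Bradley-Terry-style preference loss over reward-model scores for preferred versus rejected projected noises. This is a hedged sketch of the general technique, not the paper's exact objective; `beta` and the function name are assumptions:

```python
import torch
import torch.nn.functional as F


def preference_loss(reward_preferred: torch.Tensor,
                    reward_rejected: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Bradley-Terry style preference loss (DPO-inspired sketch).

    Pushes the reward margin between the preferred and rejected
    sample positive; gradients flow back through the reward scores
    into whatever produced them (here, the noise projector).
    """
    margin = beta * (reward_preferred - reward_rejected)
    return -F.logsigmoid(margin).mean()
```

In the paper's pipeline, the reward model is itself distilled from token-level VLM feedback, so this loss would be computed on pairs of projected noises whose images the reward model scores.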
Yunze Tong
Zhejiang University, Hangzhou, Zhejiang, China
Didi Zhu
Imperial College London
Multi-Modal LLMs · Out-of-Distribution Generalization
Zijing Hu
Zhejiang University, Hangzhou, Zhejiang, China
Jinluan Yang
Zhejiang University, Hangzhou, Zhejiang, China
Ziyu Zhao
University of South Carolina
Computer Vision · 2D/3D Segmentation · Generative 3D Reconstruction