AI Summary
Traditional diffusion models rely on static negative prompts, which fail to adapt to the dynamic semantic evolution of images during the denoising process. To address this limitation, we propose a vision-language model (VLM)-based dynamic negative prompting method: at critical denoising steps, intermediate denoised images are fed into a VLM to generate context-aware, image-conditioned negative prompts; additionally, a learnable negative guidance strength controller enables fine-grained semantic constraint. This approach breaks the fixed-prompt paradigm and is the first to embed a VLM directly into the diffusion sampling loop for real-time, image-driven negative prompt generation. Extensive experiments across multiple benchmark datasets demonstrate significant improvements in text-image alignment quality, achieving a superior trade-off between fidelity and semantic consistency. Our method establishes a novel paradigm for controllable image generation, advancing the state of the art in conditional diffusion modeling.
Abstract
We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional negative prompting methods that use a fixed negative prompt throughout sampling, our method computes intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
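The sampling loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `unet`, `text_encoder`, and `vlm_negative_prompt` are hypothetical stubs standing in for a real denoiser, text encoder, and VLM, and `neg_scale` plays the role of the learnable negative guidance strength controller.

```python
# Hedged sketch of VLM-driven dynamic negative prompting inside a
# classifier-free-guidance sampling loop. All components are toy stubs.

def text_encoder(prompt):
    # Stub: map a prompt string to a toy scalar "embedding".
    return float(len(prompt))

def unet(x, t, cond):
    # Stub denoiser: predicts noise nudging the sample toward the condition.
    return (x - cond) * 0.1

def vlm_negative_prompt(x0_estimate):
    # Stub VLM: inspects the intermediate image estimate and returns a
    # context-aware negative prompt (a fixed toy answer here).
    return "blurry, extra limbs" if x0_estimate > 0 else "oversaturated"

def sample(prompt, steps=10, vlm_steps=(7, 4), guidance=7.5, neg_scale=1.0):
    pos = text_encoder(prompt)
    neg = text_encoder("")           # start from an empty negative prompt
    x = 1.0                          # toy initial "latent"
    for t in range(steps, 0, -1):
        if t in vlm_steps:           # critical denoising steps
            x0_est = x - unet(x, t, pos)               # crude x0 prediction
            neg = text_encoder(vlm_negative_prompt(x0_est))
        eps_pos = unet(x, t, pos)
        eps_neg = unet(x, t, neg)
        # Classifier-free guidance; neg_scale modulates how strongly the
        # dynamically generated negative prompt constrains the update.
        eps = neg_scale * eps_neg + guidance * (eps_pos - eps_neg)
        x = x - eps
    return x
```

In a real system, `x0_est` would come from the scheduler's x0 prediction, the VLM call would be batched or cached to amortize its latency, and `neg_scale` would be learned rather than fixed.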