🤖 AI Summary
This work addresses the instability in output quality of existing text-to-image diffusion models, which stems from their reliance on random Gaussian noise and often necessitates repeated sampling to obtain a satisfactory result for the same prompt. To mitigate this, the authors propose a lightweight prompt evaluation mechanism that leverages a text-to-image preference dataset to predict the quality of a generated image directly from the input prompt and the initial noise, without altering the underlying diffusion model architecture. By integrating a noise-quality prediction module with an efficient ensemble strategy, the method significantly outperforms existing approaches on multiple prompt-corpus benchmarks, improving both the consistency of image quality and user satisfaction.
📝 Abstract
Text-to-Image (T2I) generation is primarily driven by Diffusion Models (DMs), which rely on random Gaussian noise. Thus, like a slot machine at a casino, a DM will produce different results given the same user-defined inputs. This imposes a gambler's burden: performing multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the quality distribution of the generated content depends heavily on the prompt and on the DM's generative ability with respect to it.
To account for this, we propose Naïve PAINE, a method for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and the given prompt. Naïve PAINE then selects a handful of high-quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM's generative quality for a given prompt and is lightweight enough to fit seamlessly into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt-corpus benchmarks.
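The core idea above — score candidate initial noises against the prompt, keep only the best, and hand those to the diffusion model — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dot-product `score_noise` stand-in for the learned quality predictor, and all shapes and parameters are assumptions for the sake of the example. The real predictor would be trained on T2I preference data.

```python
import numpy as np

def score_noise(prompt_embedding, noise):
    # Hypothetical stand-in for the learned quality predictor:
    # a simple dot product between the flattened noise and the
    # prompt embedding. The actual model is trained on
    # text-to-image preference data.
    return float(noise.flatten() @ prompt_embedding)

def select_noises(prompt_embedding, num_candidates=64, top_k=4,
                  shape=(4, 8, 8), seed=0):
    """Sample candidate Gaussian noises, score each against the
    prompt, and return the top_k highest-scoring candidates."""
    rng = np.random.default_rng(seed)
    candidates = rng.standard_normal((num_candidates, *shape))
    scores = np.array([score_noise(prompt_embedding, n)
                       for n in candidates])
    best = np.argsort(scores)[::-1][:top_k]
    # These selected noises would be forwarded to the DM sampler.
    return candidates[best]

# Usage: with an (assumed) prompt embedding, pick 4 of 64 candidates.
prompt_embedding = np.ones(4 * 8 * 8)
chosen = select_noises(prompt_embedding)
print(chosen.shape)  # (4, 4, 8, 8)
```

Because only a cheap predictor runs over the candidate pool, the expensive diffusion sampling happens just `top_k` times rather than once per retry, which is what makes this kind of selection lightweight enough to sit in front of an existing DM pipeline.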