AI Summary
This work addresses the high computational cost of the "generate-then-filter" pipeline in current text-to-image generation models and the absence of real-time quality assessment during synthesis. We propose the Probe-Select module, which leverages intermediate activation signals from early denoising steps in diffusion or flow-matching models to predict final image quality, enabling early termination of low-potential generation paths. We demonstrate for the first time that activations from early denoising stages already encode structural and layout information strongly correlated with final image fidelity. Building on this insight, we introduce a selective generation mechanism that requires no modification to the original model. Experiments show that our approach accurately ranks candidate samples using only 20% of the full denoising steps, reducing sampling cost by over 60% while improving the overall quality of retained images.
Abstract
Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practice, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. This pipeline is resource-intensive because each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post hoc: they can be computed only after generation has finished. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
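To make the cost argument concrete, the following is a minimal sketch of the selective-continuation bookkeeping, not the authors' implementation. It assumes every candidate is denoised up to a probe point, scored, and only the top-ranked candidates are run to completion. The function names, the example scores, and the settings (4 seeds, 1 kept, 50 steps) are hypothetical choices for illustration. The scores stand in for whatever an early-activation quality predictor would return.

```python
from typing import List, Sequence, Tuple


def select_seeds(early_scores: Sequence[float], num_keep: int) -> List[int]:
    """Return indices of the num_keep highest-scoring candidates, best first."""
    order = sorted(range(len(early_scores)),
                   key=lambda i: early_scores[i], reverse=True)
    return order[:num_keep]


def selective_generation_cost(num_seeds: int, num_keep: int,
                              total_steps: int,
                              probe_fraction: float) -> Tuple[int, int]:
    """Return (selective_cost, baseline_cost) counted in denoiser evaluations.

    Every seed pays for the probe steps; only kept seeds pay for the rest.
    The cost of running the quality predictor itself is ignored here.
    """
    probe_steps = round(total_steps * probe_fraction)
    selective = num_seeds * probe_steps + num_keep * (total_steps - probe_steps)
    baseline = num_seeds * total_steps
    return selective, baseline


if __name__ == "__main__":
    # Hypothetical predicted quality scores for four seeds at the probe point.
    scores = [0.31, 0.72, 0.18, 0.55]
    kept = select_seeds(scores, num_keep=1)

    selective, baseline = selective_generation_cost(
        num_seeds=4, num_keep=1, total_steps=50, probe_fraction=0.2)
    saving = 1.0 - selective / baseline
    print(f"kept seeds: {kept}, cost: {selective}/{baseline}, saving: {saving:.0%}")
```

Under these assumptions the relative cost is p + (k/N)(1 - p), where p is the probe fraction, N the number of seeds, and k the number kept. The saving therefore grows as more seeds are sampled per retained image. The actual savings reported in the abstract depend on the paper's own experimental settings.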