🤖 AI Summary
This work addresses the lack of efficient early-stage quality assessment in existing text-to-image diffusion models, which often leads to wasted computational resources during iterative generation. The study reveals, for the first time, a strong correlation between the cross-attention distributions in early denoising steps and the final image quality. Building on this insight, the authors propose a lightweight, model-agnostic, and generalizable framework for early quality prediction: by extracting statistical features from cross-attention maps and feeding them into a compact CNN probe, the method accurately forecasts the eventual image fidelity. Evaluated across multiple text-to-image models and quality metrics, the approach achieves consistently strong performance (PCC > 0.7, AUC-ROC > 0.9), significantly enhancing the efficiency and output quality of downstream tasks such as prompt refinement and seed selection.
📝 Abstract
Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and Flow-GRPO-style RL fine-tuning. We reveal a strong correlation between early diffusion cross-attention distributions and final image quality. Based on this finding, we introduce Diffusion Probe, a framework that leverages internal cross-attention maps as predictive signals. We design a lightweight predictor that maps statistical properties of cross-attention, extracted from the initial denoising steps, to the final image's overall quality. This enables accurate forecasting of image quality across diverse evaluation metrics long before full synthesis is complete. We validate Diffusion Probe across a wide range of settings. On multiple T2I models, across early denoising windows, resolutions, and quality metrics, it achieves strong correlation (PCC > 0.7) and high classification performance (AUC-ROC > 0.9). Its reliability translates into practical gains. By enabling early quality-aware decisions in workflows such as prompt optimization, seed selection, and accelerated RL training, the probe supports more targeted sampling and avoids computation on low-potential generations. This reduces computational overhead while improving final output quality. Diffusion Probe is model-agnostic, efficient, and broadly applicable, offering a practical solution for improving T2I generation efficiency through early quality prediction.
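The probe idea described above (statistics of early-step cross-attention maps fed to a compact CNN predictor) can be sketched roughly as follows. This is a minimal illustration under assumed shapes and feature choices, not the paper's actual architecture: the map size, the choice of statistics (mean, std, max, spatial entropy), and the probe layers are all placeholders.

```python
# Hedged sketch of an early-quality probe over cross-attention statistics.
# All shapes, statistics, and layer sizes are illustrative assumptions,
# not the Diffusion Probe paper's exact design.
import torch
import torch.nn as nn

def attention_stats(attn_maps: torch.Tensor) -> torch.Tensor:
    """Reduce early-step cross-attention maps of shape
    (steps, tokens, H, W) to per-token spatial statistics:
    mean, std, max, and entropy -> (steps, 4, tokens)."""
    s, t, h, w = attn_maps.shape
    flat = attn_maps.reshape(s, t, h * w)
    probs = flat / flat.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    return torch.stack(
        [flat.mean(-1), flat.std(-1), flat.amax(-1), entropy], dim=1
    )

class QualityProbe(nn.Module):
    """Compact 1-D CNN mapping token-wise statistics to a scalar
    quality score; per-step predictions are averaged."""
    def __init__(self, in_ch: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the token axis
            nn.Flatten(),
            nn.Linear(16, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (steps, 4, tokens); steps act as the batch dimension
        return self.net(feats).mean()

# Stand-in for maps captured during the first denoising steps:
# 5 early steps, 77 prompt tokens, 16x16 attention maps.
attn = torch.rand(5, 77, 16, 16)
score = QualityProbe()(attention_stats(attn))
```

In a real pipeline the `attn` tensor would come from hooks on the model's cross-attention layers during the first few denoising steps, and the probe would be trained against a chosen quality metric (e.g. an aesthetic or fidelity score) so that generations scoring below a threshold can be aborted early.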