🤖 AI Summary
Existing text-to-image evaluation metrics (e.g., FID, CLIPScore) assess visual fidelity or text-image alignment only in isolation and correlate poorly with human preferences. To address this limitation, the authors propose cFreD, an unsupervised metric that jointly models visual fidelity and text alignment. Its core idea is the *conditional Fréchet distance*: image features are modeled conditionally on text embeddings of the prompts, and the divergence between the real and generated conditional distributions is measured via their means and covariances, in the spirit of FID. cFreD requires no human annotations and generalizes across models and domains. Extensive experiments across multiple models and prompt datasets show that cFreD achieves higher rank correlation with human judgments than FID, CLIPScore, and preference-trained metrics. The evaluation toolkit and benchmark are publicly released.
📝 Abstract
Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric based on the notion of Conditional Fréchet Distance that explicitly accounts for both visual fidelity and text-prompt alignment. Existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID), and CLIPScore assess either image quality or image-text alignment but not both, which limits their correlation with human preferences. Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments than existing statistical metrics, as well as metrics trained on human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark in the appendix.
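To make the idea concrete, here is a minimal sketch of a conditional Fréchet-distance-style computation. This is an illustrative assumption, not the paper's exact formulation: features are modeled as Gaussians (as in standard FID), and conditioning on the prompt is approximated by regressing image features on text embeddings and comparing the residual distributions of real versus generated images. Feature extractors (e.g., Inception for images, CLIP for text) are assumed to have run beforehand.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two feature sets (as in FID):
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))


def conditional_frechet_distance(real_img: np.ndarray,
                                 gen_img: np.ndarray,
                                 txt: np.ndarray) -> float:
    """Illustrative conditional variant (an assumption, not the authors' exact
    method): remove the least-squares prediction of image features from text
    embeddings, then compare the residual distributions with a Fréchet distance."""
    W, *_ = np.linalg.lstsq(txt, real_img, rcond=None)
    return frechet_distance(real_img - txt @ W, gen_img - txt @ W)
```

Under this sketch, a generator whose outputs match the real conditional feature distribution yields a small distance, while one that ignores the prompt (or degrades image quality) yields a larger one.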