🤖 AI Summary
Existing text-to-image evaluation metrics (e.g., FID, CLIPScore) assess visual fidelity or text-image alignment only in isolation and correlate poorly with human preferences. To address this limitation, the authors propose cFreD, an unsupervised metric that jointly models visual fidelity and text alignment. Its core idea is the *conditional Fréchet distance*: image features are modeled conditionally on text embeddings of the prompts, and the divergence between the real and generated conditional distributions is measured via their means and covariances, in the spirit of FID. cFreD requires no human annotations and generalizes across models and domains. Extensive experiments across multiple models and prompt datasets show that cFreD achieves higher rank correlation with human judgments than FID, CLIPScore, and preference-trained metrics. The evaluation toolkit and benchmark are publicly released.
📝 Abstract
Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric based on the notion of Conditional Fréchet Distance that explicitly accounts for both visual fidelity and text-prompt alignment. Existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID), and CLIPScore assess either image quality or image-text alignment but not both, which limits their correlation with human preferences. Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments than existing statistical metrics, as well as metrics trained on human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark in the appendix.
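To make the idea concrete, here is a minimal sketch of a conditional Fréchet-distance-style computation. This is an illustrative assumption, not the paper's exact formulation: features are modeled as Gaussians (as in standard FID), and conditioning on the prompt is approximated by regressing image features on text embeddings and comparing the residual distributions of real versus generated images. Feature extractors (e.g., Inception for images, CLIP for text) are assumed to have run beforehand.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two feature sets (as in FID):
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))


def conditional_frechet_distance(real_img: np.ndarray,
                                 gen_img: np.ndarray,
                                 txt: np.ndarray) -> float:
    """Illustrative conditional variant (an assumption, not the authors' exact
    method): remove the least-squares prediction of image features from text
    embeddings, then compare the residual distributions with a Fréchet distance."""
    W, *_ = np.linalg.lstsq(txt, real_img, rcond=None)
    return frechet_distance(real_img - txt @ W, gen_img - txt @ W)
```

Under this sketch, a generator whose outputs match the real conditional feature distribution yields a small distance, while one that ignores the prompt (or degrades image quality) yields a larger one.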