Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation and underutilization of metadata in text-to-image (T2I) model assessment and recommendation. Methodologically, we propose the first open-source unified evaluation framework, built upon DeepFashion-MultiModal, integrating multimodal metrics—including CLIP similarity, LPIPS, FID, and retrieval-based scores—and introducing metadata-driven prompt enhancement. We present the first systematic analysis revealing how structured metadata improves visual realism, semantic fidelity, and model robustness. Building on these insights, we design a multi-objective, metric-balanced model-prompt co-recommendation strategy. Experiments demonstrate that our framework significantly enhances state-of-the-art T2I models across perceptual realism, semantic consistency, and cross-architecture stability, enabling fine-grained, task-adaptive model selection and prompt optimization.
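The metadata-driven prompt enhancement described above can be sketched as a simple merge of a base caption with structured annotations. This is an illustrative reconstruction, not the authors' code: the field names (`category`, `fabric`, `color`, `pattern`) and the joining format are assumptions loosely modeled on DeepFashion-MultiModal-style attributes.

```python
# Hypothetical sketch of metadata-driven prompt enhancement: structured
# attributes are appended to a base caption to form an enriched prompt.
# Field names and formatting are illustrative assumptions.

def enhance_prompt(caption: str, metadata: dict) -> str:
    """Append available structured metadata fields to a base caption."""
    fragments = [caption.strip().rstrip(".")]
    for field in ("category", "fabric", "color", "pattern"):
        value = metadata.get(field)
        if value:
            fragments.append(f"{field}: {value}")
    return ", ".join(fragments)

prompt = enhance_prompt(
    "A woman wearing a long dress",
    {"category": "dress", "fabric": "chiffon", "color": "navy"},
)
print(prompt)
# -> A woman wearing a long dress, category: dress, fabric: chiffon, color: navy
```

Missing fields are simply skipped, so the same function handles sparsely annotated samples without special-casing.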

📝 Abstract
This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including a Weighted Score, CLIP (Contrastive Language-Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichments greatly enhance visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
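The Weighted Score mentioned in the abstract combines metrics that point in opposite directions (CLIP similarity is higher-is-better; LPIPS and FID are lower-is-better). A minimal sketch of such an aggregation, assuming min-max normalization across the compared models and illustrative metric values and weights (none of which are taken from the paper):

```python
# Illustrative weighted-score aggregation over per-model metric values.
# Normalization scheme, weights, and numbers are assumptions, not the
# paper's reported configuration.

def rank_models(results: dict, weights: dict,
                lower_is_better=("lpips", "fid")) -> list:
    """Rank models by a weighted sum of min-max-normalized metrics,
    inverting metrics where lower values are better."""
    names = list(results)
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        values = [results[n][metric] for n in names]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero on ties
        for n in names:
            norm = (results[n][metric] - lo) / span
            if metric in lower_is_better:
                norm = 1.0 - norm
            scores[n] += weight * norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

results = {
    "model_a": {"clip": 0.31, "lpips": 0.42, "fid": 18.5},
    "model_b": {"clip": 0.28, "lpips": 0.35, "fid": 22.0},
}
weights = {"clip": 0.5, "lpips": 0.25, "fid": 0.25}
print(rank_models(results, weights))
# -> [('model_a', 0.75), ('model_b', 0.25)]
```

Because each metric is normalized to [0, 1] before weighting, no single metric's scale dominates, which is the "metric-balanced" idea the summary alludes to.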
Problem

Research questions and friction points this paper is trying to address.

Evaluating text-to-image models using multimodal metrics
Assessing impact of metadata-augmented prompts on outputs
Providing task-specific model and prompt recommendations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmarking framework for text-to-image models
Metadata-augmented prompts enhance visual realism
Task-specific recommendations via evaluation metrics
Kapil Wanaskar
Computer Engineering Dept., San José State University, San Jose, CA
Gaytri Jena
Independent Researcher, San Jose, CA
Magdalini Eirinaki
Professor of Computer Engineering, San Jose State University
Recommender systems, social network analysis and mining, social recommender systems, personalization, machine learning