🤖 AI Summary
This work addresses the lack of systematic evaluation and the underutilization of metadata in text-to-image (T2I) model assessment and recommendation. Methodologically, we propose the first open-source unified evaluation framework, built upon DeepFashion-MultiModal, that integrates multimodal metrics (CLIP similarity, LPIPS, FID, and retrieval-based scores) and introduces metadata-driven prompt enhancement. We present the first systematic analysis of how structured metadata improves visual realism, semantic fidelity, and model robustness. Building on these insights, we design a multi-objective, metric-balanced strategy that jointly recommends models and prompts. Experiments demonstrate that our framework's metadata-driven prompt enhancement significantly improves state-of-the-art T2I models in perceptual realism, semantic consistency, and cross-architecture stability, enabling fine-grained, task-adaptive model selection and prompt optimization.
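The summary does not spell out how metadata is injected into prompts; a minimal sketch, assuming DeepFashion-MultiModal-style attribute annotations and hypothetical field names (`fabric`, `color`, `sleeve_length`), could look like the following:

```python
# Minimal sketch of metadata-driven prompt enhancement (hypothetical field
# names; the paper's actual enhancement logic may differ).
def enhance_prompt(base_prompt: str, metadata: dict) -> str:
    """Append structured metadata attributes to a base caption."""
    # "fabric", "color", "sleeve_length" are assumed examples of
    # DeepFashion-MultiModal-style garment annotations.
    attributes = [f"{key}: {value}" for key, value in metadata.items() if value]
    if not attributes:
        return base_prompt
    return f"{base_prompt}, {', '.join(attributes)}"


if __name__ == "__main__":
    prompt = enhance_prompt(
        "a woman wearing a dress",
        {"fabric": "denim", "color": "light blue", "sleeve_length": "sleeveless"},
    )
    print(prompt)
    # -> "a woman wearing a dress, fabric: denim, color: light blue, sleeve_length: sleeveless"
```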
📝 Abstract
This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including a Weighted Score, CLIP (Contrastive Language-Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichment substantially improves visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on these evaluation metrics.
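The exact metric pipeline and weighting scheme are defined in the paper rather than in this abstract; the sketch below shows one plausible way to compute a CLIP-based prompt-image similarity (using the Hugging Face `transformers` CLIP implementation) and an illustrative weighted aggregate. The checkpoint name and the weighting convention are assumptions, not the paper's reported configuration.

```python
# Sketch of a CLIP-based prompt-image similarity and an illustrative weighted
# aggregate; the paper's actual Weighted Score definition is not given here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant would work the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return float((text_emb @ image_emb.T).item())


def weighted_score(metrics: dict, weights: dict) -> float:
    """Illustrative weighted aggregate of per-metric values.

    Lower-is-better metrics (FID, LPIPS) should be negated or inverted
    before combining so that a higher aggregate is always better.
    """
    return sum(weights[name] * value for name, value in metrics.items())
```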