🤖 AI Summary
This work addresses the pervasive attribute misalignment problem in fashion-domain text-to-image (T2I) generation—where attributes are correctly generated but erroneously associated with non-target entities. We propose the first fine-grained cross-modal alignment evaluation framework, centered on Localized VQAScore (L-VQAScore), a novel metric that jointly leverages visual question answering and spatial localization to quantify two distinct outcomes: “reflection” (the attribute correctly depicted on the target entity) and “leakage” (the attribute spilling onto non-target entities). Our method integrates pretrained multimodal models, VQA-guided entity-attribute localization, human evaluation protocols, and an automated scoring pipeline. Evaluated on a newly constructed fashion fine-grained challenge dataset, L-VQAScore achieves significantly higher correlation with human judgments than existing metrics (e.g., CLIPScore, PickScore), demonstrating superior effectiveness, interpretability, and scalability.
📝 Abstract
Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, which involve complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges with attribute confusion, i.e., when attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting a single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), which combines visual localization with VQA probing of both correct (reflection) and mis-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can serve as a reliable and scalable alternative to subjective evaluations.
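To make the reflection/leakage idea concrete, the following is a minimal sketch of how an L-VQAScore-style computation could be wired up. Everything here is an illustrative assumption rather than the paper's actual implementation: the function names, the mock VQA model, and the simple difference used to combine the two terms are all hypothetical.

```python
# Illustrative sketch (assumption, not the paper's method): score an image by
# asking a VQA model whether the attribute appears on the localized target
# entity (reflection) and whether it also appears on other entities (leakage).

def l_vqascore(vqa_yes_prob, crops, target_entity, attribute):
    """Combine reflection (attribute on the target crop) and leakage
    (attribute on non-target crops) into one alignment score."""
    reflection = vqa_yes_prob(crops[target_entity],
                              f"Is the {target_entity} {attribute}?")
    others = [e for e in crops if e != target_entity]
    leakage = max(
        (vqa_yes_prob(crops[e], f"Is the {e} {attribute}?") for e in others),
        default=0.0,
    )
    # Reward correct localization, penalize attribute spill-over.
    # A simple difference is used here purely for illustration.
    return reflection - leakage

# Toy mock: crops are just entity names, and the "VQA model" answers
# from a lookup table instead of a real vision-language model.
mock_answers = {
    ("shirt", "Is the shirt striped?"): 0.9,
    ("skirt", "Is the skirt striped?"): 0.7,  # attribute leaked onto the skirt
}
mock_vqa = lambda crop, q: mock_answers.get((crop, q), 0.05)
crops = {"shirt": "shirt", "skirt": "skirt"}

score = l_vqascore(mock_vqa, crops, "shirt", "striped")  # high reflection, some leakage
```

In this toy example the stripes are correctly on the shirt (reflection 0.9) but also partly leak onto the skirt (leakage 0.7), so the combined score is penalized, which is the qualitative behavior the metric is designed to capture.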