🤖 AI Summary
This work addresses the pervasive attribute misalignment problem in fashion-domain text-to-image (T2I) generation—where attributes are correctly generated but erroneously associated with non-target entities. We propose the first fine-grained cross-modal alignment evaluation framework, centered on Localized VQAScore (L-VQAScore), a novel metric that jointly leverages visual question answering and spatial localization to quantify two distinct outcomes: “reflection” (the attribute correctly depicted on the target entity) and “leakage” (the attribute spilling onto non-target entities). Our method integrates pretrained multimodal models, VQA-guided entity-attribute localization, human evaluation protocols, and an automated scoring pipeline. Evaluated on a newly constructed fashion fine-grained challenge dataset, L-VQAScore achieves significantly higher correlation with human judgments than existing metrics (e.g., CLIPScore, PickScore), demonstrating superior effectiveness, interpretability, and scalability.
📝 Abstract
Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, which involve complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges with attribute confusion, i.e., when attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting a single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), which combines visual localization with VQA probing of both correct (reflection) and mis-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can serve as a reliable and scalable alternative to subjective evaluations.
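To make the reflection/leakage idea concrete, the following is a minimal sketch of how an L-VQAScore-style computation could be wired up. Everything here is an illustrative assumption rather than the paper's actual implementation: the function names, the mock VQA model, and the simple difference used to combine the two terms are all hypothetical.

```python
# Illustrative sketch (assumption, not the paper's method): score an image by
# asking a VQA model whether the attribute appears on the localized target
# entity (reflection) and whether it also appears on other entities (leakage).

def l_vqascore(vqa_yes_prob, crops, target_entity, attribute):
    """Combine reflection (attribute on the target crop) and leakage
    (attribute on non-target crops) into one alignment score."""
    reflection = vqa_yes_prob(crops[target_entity],
                              f"Is the {target_entity} {attribute}?")
    others = [e for e in crops if e != target_entity]
    leakage = max(
        (vqa_yes_prob(crops[e], f"Is the {e} {attribute}?") for e in others),
        default=0.0,
    )
    # Reward correct localization, penalize attribute spill-over.
    # A simple difference is used here purely for illustration.
    return reflection - leakage

# Toy mock: crops are just entity names, and the "VQA model" answers
# from a lookup table instead of a real vision-language model.
mock_answers = {
    ("shirt", "Is the shirt striped?"): 0.9,
    ("skirt", "Is the skirt striped?"): 0.7,  # attribute leaked onto the skirt
}
mock_vqa = lambda crop, q: mock_answers.get((crop, q), 0.05)
crops = {"shirt": "shirt", "skirt": "skirt"}

score = l_vqascore(mock_vqa, crops, "shirt", "striped")  # high reflection, some leakage
```

In this toy example the stripes are correctly on the shirt (reflection 0.9) but also partly leak onto the skirt (leakage 0.7), so the combined score is penalized, which is the qualitative behavior the metric is designed to capture.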