🤖 AI Summary
Existing text-to-image (T2I) evaluation methods, such as CLIP similarity and face classifiers, rely on superficial visual cues and therefore fail to reliably assess alignment between prompts and generated images along socially sensitive attributes (e.g., religion, culture, disability); they also lack calibrated abstention mechanisms. This work proposes FairJudge, a lightweight evaluation framework that uses instruction-following multimodal large language models (MLLMs) to produce interpretable, evidence-grounded judgments of social-attribute alignment through visual evidence grounding, constrained label sets, and mandatory abstention. Its core contribution is embedding fairness directly into the evaluation paradigm, not merely into generation. Experiments show that FairJudge substantially outperforms baselines on profession accuracy and demographic prediction, and its robustness and auditability are validated on the 469-image DIVERSIFY dataset.
📝 Abstract
Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies -- face classifiers and contrastive similarity -- reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
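The protocol constraints the abstract describes (a closed label set, judgments grounded in visible evidence, mandatory abstention when cues are insufficient, and an alignment score in [-1, 1]) can be sketched as a small output validator. This is a minimal illustrative sketch under assumed names; it is not the paper's actual interface or code.

```python
# Hypothetical sketch of validating a FairJudge-style judgment.
# All names (validate_judgment, ALLOWED_GENDER, the dict keys) are
# illustrative assumptions, not the paper's real API.

ALLOWED_GENDER = {"male", "female", "abstain"}  # closed label set + abstention


def validate_judgment(judgment: dict, allowed_labels: set) -> dict:
    """Enforce the protocol: closed labels, visible evidence, bounded score."""
    label = judgment.get("label")
    if label not in allowed_labels:
        raise ValueError(f"label {label!r} is outside the closed label set")
    # Non-abstaining judgments must cite evidence from the visible content.
    if label != "abstain" and not judgment.get("evidence"):
        raise ValueError("non-abstain judgments must include grounded evidence")
    # Alignment is scored on a rubric mapped to [-1, 1].
    score = judgment.get("alignment", 0.0)
    if not -1.0 <= score <= 1.0:
        raise ValueError("alignment score must lie in [-1, 1]")
    return judgment


ok = validate_judgment(
    {"label": "female", "evidence": "visible face and attire", "alignment": 0.8},
    ALLOWED_GENDER,
)
```

A judgment that abstains carries no demographic claim, so the evidence requirement is waived for it; any label outside the closed set, or any ungrounded non-abstain answer, is rejected rather than silently accepted.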