🤖 AI Summary
Existing text-to-image (T2I) evaluation methods, such as CLIP similarity and face classifiers, rely on superficial visual cues and therefore fail to reliably assess alignment between prompts and generated images along socially sensitive attributes (e.g., religion, culture, disability); they also lack calibrated abstention mechanisms. This work proposes FairJudge, a lightweight evaluation framework that uses instruction-following multimodal large language models (MLLMs) to produce interpretable, evidence-grounded judgments of social-attribute alignment through visual evidence grounding, constrained label sets, and mandatory abstention. Its core contribution is embedding fairness directly into the evaluation paradigm, not merely into generation. Experiments show that FairJudge substantially outperforms baselines on profession accuracy and demographic prediction, and its robustness and auditability are validated on the 469-image DIVERSIFY dataset.
📝 Abstract
Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies -- face classifiers and contrastive similarity -- reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
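The protocol constraints the abstract describes (a closed label set, judgments grounded in visible evidence, mandatory abstention when cues are insufficient, and an alignment score in [-1, 1]) can be sketched as a small output validator. This is a minimal illustrative sketch under assumed names; it is not the paper's actual interface or code.

```python
# Hypothetical sketch of validating a FairJudge-style judgment.
# All names (validate_judgment, ALLOWED_GENDER, the dict keys) are
# illustrative assumptions, not the paper's real API.

ALLOWED_GENDER = {"male", "female", "abstain"}  # closed label set + abstention


def validate_judgment(judgment: dict, allowed_labels: set) -> dict:
    """Enforce the protocol: closed labels, visible evidence, bounded score."""
    label = judgment.get("label")
    if label not in allowed_labels:
        raise ValueError(f"label {label!r} is outside the closed label set")
    # Non-abstaining judgments must cite evidence from the visible content.
    if label != "abstain" and not judgment.get("evidence"):
        raise ValueError("non-abstain judgments must include grounded evidence")
    # Alignment is scored on a rubric mapped to [-1, 1].
    score = judgment.get("alignment", 0.0)
    if not -1.0 <= score <= 1.0:
        raise ValueError("alignment score must lie in [-1, 1]")
    return judgment


ok = validate_judgment(
    {"label": "female", "evidence": "visible face and attire", "alignment": 0.8},
    ALLOWED_GENDER,
)
```

A judgment that abstains carries no demographic claim, so the evidence requirement is waived for it; any label outside the closed set, or any ungrounded non-abstain answer, is rejected rather than silently accepted.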