🤖 AI Summary
Evaluating concept customization for image generation remains challenging because it must jointly measure fidelity to the prompt and to the reference concepts while aligning with human preferences, especially in multi-concept compositional settings, where fine-grained, interpretable evaluation metrics are lacking. To address this, we propose D-GPTScore, a decomposition-based automatic evaluation framework built on multimodal large language models (MLLMs). It disentangles overall image quality into interpretable dimensions, including semantic consistency, concept fidelity, and visual plausibility, enabling staged, holistic assessment for both single- and multi-concept tasks. We also introduce CC-AlignBench, the first benchmark designed specifically for complex concept customization scenarios. Extensive experiments demonstrate that D-GPTScore significantly outperforms existing metrics on CC-AlignBench, achieving markedly higher correlation with human preferences (Spearman’s ρ = 0.82) and establishing a new standard for concept customization evaluation.
📝 Abstract
Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to both the generative prompt and the concept images. Moreover, evaluating multiple concepts is considerably harder than evaluating a single concept, since it demands detailed assessment not only of each individual concept but also of the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preferences. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes the evaluation criteria into finer aspects and aggregates aspect-wise assessments produced by a Multimodal Large Language Model (MLLM). Additionally, we release the Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide range of difficulty -- from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at https://github.com/ReinaIshikawa/D-GPTScore.
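The decomposition-and-aggregation idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual implementation: the aspect questions, the `query_mllm` placeholder, and the uniform averaging rule are all assumptions introduced here for clarity.

```python
# Sketch of a decomposition-based MLLM scorer in the spirit of D-GPTScore.
# Hypothetical details (NOT from the paper): the aspect wording, the
# query_mllm stub, and the uniform average over aspect scores.

ASPECTS = [
    "Does the image follow the text prompt?",        # prompt fidelity
    "Does each subject match its concept image?",    # concept fidelity
    "Are interactions among subjects plausible?",    # multi-concept interaction
]

def query_mllm(image_path: str, question: str) -> float:
    """Placeholder for an MLLM call returning an aspect score in [0, 10].

    A real implementation would send the generated image (and the concept
    reference images) together with the question to a multimodal model
    and parse a numeric rating from its response.
    """
    raise NotImplementedError("plug in an MLLM client here")

def d_gptscore(image_path: str, scorer=query_mllm) -> float:
    """Aggregate aspect-wise scores into a single quality score."""
    scores = [scorer(image_path, q) for q in ASPECTS]
    return sum(scores) / len(scores)

# Usage with a deterministic stub standing in for a live MLLM:
stub = lambda img, q: {0: 8.0, 1: 6.0, 2: 7.0}[ASPECTS.index(q)]
print(d_gptscore("generated.png", scorer=stub))  # 7.0
```

In practice, the per-aspect design is what makes the evaluation interpretable: each score can be reported separately before aggregation, so failures in concept fidelity versus interaction plausibility remain distinguishable.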