🤖 AI Summary
Automated evaluation of personalized image generation currently faces a fundamental dilemma: existing metrics correlate poorly with human preferences, while human evaluation remains costly and slow. To address this, we propose the first multimodal GPT-driven benchmark that achieves high alignment with human judgments. Our method introduces a task-reinforced, self-aligned prompting mechanism for GPT, in which systematic prompt engineering and explicit human preference modeling jointly yield evaluation outcomes highly consistent with manual scoring. We further construct a high-quality, multi-scenario evaluation dataset. Extensive validation across seven state-of-the-art generative models demonstrates that our benchmark improves Spearman correlation with human ratings by over 42%, overcoming both the inaccuracy of automated metrics and the inefficiency of human evaluation. This work advances the evaluation paradigm for generative AI.
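To make the alignment measurement concrete, below is a minimal sketch (in Python, with placeholder scores rather than the paper's data) of computing the Spearman rank correlation between an automated metric and human ratings over the same set of generated images; this is the statistic behind the 42% improvement claim.

```python
# Minimal sketch of measuring human alignment: Spearman rank correlation
# between an automated metric's scores and human ratings for the same images.
# The score arrays below are hypothetical placeholders, not benchmark data.
from scipy.stats import spearmanr

human_ratings = [4, 2, 5, 3, 1, 4, 5, 2]                        # e.g., 1-5 Likert scores
metric_scores = [0.81, 0.42, 0.90, 0.55, 0.30, 0.77, 0.88, 0.47]  # automated scores

# rho near 1.0 means the metric ranks images the way humans do.
rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```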
📝 Abstract
Personalized image generation holds great promise for assisting humans in everyday work and life, thanks to its impressive ability to creatively generate personalized content across diverse contexts. However, current evaluations are either automated but misaligned with human judgment, or human-based but time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design prompts that make GPT both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ yields significantly more human-aligned evaluation, surfacing innovative findings that benefit the community.
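As a rough illustration of GPT-automated scoring, here is a minimal sketch of prompting a multimodal GPT model to rate one personalized generation result. The rubric text, the model name (`gpt-4o`), and the `gpt_score` helper are assumptions for illustration only, not DreamBench++'s actual prompts or code.

```python
# Minimal sketch (assumes OpenAI Python SDK >= 1.0 and a multimodal model
# such as gpt-4o; the rubric is illustrative, not the paper's actual prompt)
# of asking a GPT model to score a personalized image generation result.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are evaluating a personalized image generation result. "
    "Score from 0 (worst) to 4 (best) how well the generated image "
    "preserves the subject in the reference image while following the "
    "text prompt. Reply with the integer score only."
)

def gpt_score(reference_url: str, generated_url: str, text_prompt: str) -> int:
    """Ask a multimodal GPT model to rate one generated image (hypothetical helper)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{RUBRIC}\nText prompt: {text_prompt}"},
                {"type": "image_url", "image_url": {"url": reference_url}},
                {"type": "image_url", "image_url": {"url": generated_url}},
            ],
        }],
    )
    return int(response.choices[0].message.content.strip())
```

In this sketch, human alignment would then be assessed by correlating such per-image GPT scores against human ratings, as in the Spearman computation above.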