AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in text-to-image (T2I) evaluation: existing benchmarks measure model performance under fixed prompts and overlook how well the upstream prompter, whether a human or a multimodal large language model (MLLM), writes those prompts. The authors propose AtelierEval, the first unified benchmark for assessing prompting proficiency across both humans and MLLMs, comprising 360 expert-crafted tasks grounded in a cognitive view. They also introduce AtelierJudge, a skill-based, memory-augmented agentic evaluator that produces both subjective and objective scores for prompt-image pairs. AtelierJudge reaches a Spearman correlation of 0.79 with human experts. Experiments across four T2I backends, eight MLLMs, and 48 human participants support the benchmark's use as a diagnostic tool and indicate that mimicry-based prompting outperforms planning-oriented prompting.

📝 Abstract

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
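
The 0.79 figure above is a Spearman rank correlation between judge scores and human expert scores. As a rough illustration of what that statistic measures, the sketch below computes it from scratch using only the Python standard library. The function names and the score lists are hypothetical and invented for illustration; they are not taken from the paper, and the abstract does not describe how the authors computed the correlation.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six prompt-image pairs (made-up numbers).
judge_scores = [4.5, 3.0, 2.0, 5.0, 1.5, 3.5]
human_scores = [4.0, 3.5, 2.5, 5.0, 1.0, 3.0]

rho = spearman(judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because the statistic depends only on rank order, an automated judge can score on a different scale than human raters and still correlate highly, provided it orders the prompt-image pairs in a similar way.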

Problem

Research questions and friction points this paper is trying to address.

text-to-image

prompting proficiency

evaluation benchmark

multimodal LLMs

human-AI collaboration

Innovation

Methods, ideas, or system contributions that make the work stand out.

AtelierEval

prompting proficiency

agentic evaluation