CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation frameworks lack systematicity and align poorly with human judgment, making it difficult to assess model performance on complex creative image editing tasks. To address this, the work proposes CREval, a fully automated, interpretable, question-answering-based evaluation method, and introduces CREval-Bench, a benchmark spanning three major categories and nine creative dimensions. By combining multimodal large language model (MLLM) scoring, a structured taxonomy of creative dimensions, and large-scale human annotation, the framework substantially improves evaluation transparency and agreement with human judgments. Experiments show that closed-source models generally outperform open-source ones, yet all models exhibit clear limitations on complex creative tasks. A user study further confirms the strong correlation between CREval scores and human assessments.
📝 Abstract
Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Alongside it, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open- and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. CREval therefore provides a reliable foundation for evaluating image editing models on complex and creative manipulation tasks, and highlights key challenges and opportunities for future research.
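The abstract describes a QA-based pipeline in which each edit is checked against a set of evaluation queries answered by an MLLM judge and the answers are aggregated into interpretable scores. A minimal sketch of such an aggregation step, assuming yes/no queries grouped by creative dimension (the function name, dimension labels, and macro-averaging scheme are illustrative assumptions, not the paper's exact protocol):

```python
# Hypothetical QA-score aggregation for a CREval-style pipeline.
# Assumption: each query yields a boolean verdict from an MLLM judge,
# tagged with the creative dimension it probes.
from collections import defaultdict
from statistics import mean

def aggregate_qa_scores(qa_results):
    """qa_results: list of (dimension, passed) pairs, where `passed` is True
    if the judge answered the evaluation query affirmatively.
    Returns per-dimension pass rates and their macro-average."""
    by_dim = defaultdict(list)
    for dimension, passed in qa_results:
        by_dim[dimension].append(1.0 if passed else 0.0)
    per_dim = {d: mean(votes) for d, votes in by_dim.items()}
    overall = mean(per_dim.values())  # macro-average across dimensions
    return per_dim, overall

# Toy example with two hypothetical dimensions:
per_dim, overall = aggregate_qa_scores([
    ("style_transformation", True),
    ("style_transformation", False),
    ("object_replacement", True),
])
# per_dim -> {"style_transformation": 0.5, "object_replacement": 1.0}
# overall -> 0.75
```

Reporting per-dimension pass rates rather than a single opaque score is what makes such a QA-based scheme interpretable: a low score can be traced back to the specific queries and dimensions a model failed.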
Problem

Research questions and friction points this paper is trying to address.

creative image manipulation
complex instructions
evaluation framework
multimodal image editing
human-aligned assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

CREval
creative image manipulation
automated evaluation
interpretable QA-based assessment
CREval-Bench