Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

📅 2026-02-13

📝 Abstract
Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.
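The abstract's central claim, that MLLM judges align with human evaluations factor by factor, can be illustrated with a minimal sketch. It assumes agreement is measured as a Spearman rank correlation computed separately for each factor; the abstract does not state which statistic the authors use. The factor names and scores below are hypothetical placeholders, not the paper's twelve factors or its data.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def per_factor_agreement(human, judge):
    """Map each factor name to the human-vs-judge Spearman correlation."""
    return {factor: spearman(human[factor], judge[factor]) for factor in human}


# Hypothetical scores for five edited images on two made-up factors.
human_scores = {
    "background_preservation": [5, 4, 2, 1, 3],
    "instruction_following": [1, 2, 3, 4, 5],
}
judge_scores = {
    "background_preservation": [5, 4, 2, 1, 3],
    "instruction_following": [5, 4, 3, 2, 1],
}

agreement = per_factor_agreement(human_scores, judge_scores)
for factor, rho in agreement.items():
    print(f"{factor}: rho = {rho:.2f}")
```

Reporting one correlation per factor, rather than a single pooled score, is what makes the comparison fine-grained: a judge could track humans well on preservation while disagreeing on instruction fidelity, and a pooled number would hide that.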
Problem

Research questions and friction points this paper is trying to address.

image editing evaluation
fine-grained assessment
human alignment
evaluation metrics
instruction fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model
Fine-Grained Evaluation
Image Editing Benchmark
Human-Aligned Judgment
Instruction Fidelity
Runzhou Liu
University of Virginia
Hailey Weingord
Columbia University
Sejal Mittal
Columbia University
Prakhar Dungarwal
Columbia University
Anusha Nandula
Columbia University
Bo Ni
Vanderbilt University
Machine Learning, Graph Machine Learning, Natural Language Processing
Samyadeep Basu
Research Scientist at Adobe Research | Prev: UMD, MSR
Machine Learning, Influence Functions, Interpretability, Few-shot learning
Hongjie Chen
Dolby Labs.
Graph, Time series, Visualization
Nesreen K. Ahmed
Senior Principal Scientist, Cisco AI Research, Intel Labs, Purdue University
Geometric Deep Learning, Graph Representation Learning, ML for Systems, ML4code
Li Li
Pennsylvania State University; Southern University of Science and Technology
Materials Science, Dielectrics, Ferroelectrics, Polymers and composites
Jiayi Zhang
University of Wisconsin-Madison
Koustava Goswami
Research Scientist 2 @ Adobe Research
Natural Language Processing, Language Model, Multimodal Learning
Subhojyoti Mukherjee
Adobe Research
Multi-armed Bandits, Reinforcement Learning, Large Language Models, RLHF
Branislav Kveton
Adobe Research
Artificial Intelligence, Machine Learning
Puneet Mathur
Adobe Research
Franck Dernoncourt
NLP/ML Researcher. MIT PhD.
Machine Learning, Neural Networks, Natural Language Processing
Yue Zhao
Assistant Professor of Computer Science, University of Southern California
Anomaly Detection, Out-of-Distribution Detection, Trustworthy AI, AI for Science, ML Systems
Yu Wang
Department of Computer Science, University of Oregon
Data Mining, Machine Learning, Neural-Symbolic Learning, Graph and Network, Structured Knowledge
Ryan A. Rossi
Adobe Research
Machine Learning, Personalization, Graph Representation Learning, Graph ML, Graph Theory
Zhengzhong Tu
Texas A&M University, Google Research, University of Texas at Austin
Agentic AI, Trustworthy AI, Embodied AI
Hongru Du
Assistant Professor, University of Virginia
Data-Driven Decision-Making, Infectious Diseases Modeling, AI for Public Health, Systems Engineering