Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a reliable evaluation framework for text-guided image editing (TIE) that aligns with human perception. Existing automatic metrics perform poorly in assessing perceptual quality, instruction adherence, and content preservation. To bridge this gap, the authors introduce TIEdit, a comprehensive benchmark comprising 512 source images, eight editing categories, and 5,120 edited images generated by ten state-of-the-art models, accompanied by 15,360 mean opinion scores (MOSs) from 20 expert annotators. Furthermore, they propose EditProbe, a method that probes intermediate-layer representations of multimodal large language models to capture semantic and perceptual relationships. Experiments demonstrate that EditProbe significantly outperforms existing metrics across all three dimensions and correlates strongly with human judgments, establishing the first large-scale, multidimensional, human-annotation-driven automatic evaluation system for TIE.
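For orientation, the reported dataset figures are mutually consistent; using only the counts above (plus the 307,200 raw ratings mentioned in the abstract below), the scale decomposes as:

```latex
\begin{aligned}
512 \text{ source images} \times 10 \text{ models} &= 5{,}120 \text{ edited images} \\
5{,}120 \text{ images} \times 3 \text{ dimensions} &= 15{,}360 \text{ MOSs} \\
15{,}360 \text{ MOSs} \times 20 \text{ annotators} &= 307{,}200 \text{ raw ratings}
\end{aligned}
```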

📝 Abstract
Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which are aggregated into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics show limited correlation with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception. Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
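The abstract only sketches EditProbe at a high level. As a non-authoritative illustration of what intermediate-layer probing typically looks like in practice, below is a minimal PyTorch sketch assuming a Hugging Face-style multimodal LLM that exposes per-layer hidden states via `output_hidden_states=True`; the linear probe, the layer index, and the mean pooling are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Hypothetical linear probe: maps a pooled intermediate-layer
    representation to a scalar quality score (one probe per dimension)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

@torch.no_grad()
def extract_intermediate_features(model, inputs, layer_idx: int = -8):
    """Run the frozen multimodal LLM and pool hidden states from an
    intermediate layer instead of relying only on the final output.
    `inputs` is assumed to encode the source image, the editing
    instruction, and the edited image in one multimodal prompt."""
    outputs = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (batch, seq_len, hidden_dim), one per layer
    hidden = outputs.hidden_states[layer_idx]
    return hidden.mean(dim=1)  # mean-pool over the token sequence

def train_probe(model, probe, loader, epochs: int = 5, lr: float = 1e-3):
    """Fit the probe to human mean opinion scores (MOSs) while the
    backbone stays frozen; only the probe's parameters are updated."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for inputs, mos in loader:  # mos: (batch,) human MOS targets
            feats = extract_intermediate_features(model, inputs)
            loss = loss_fn(probe(feats), mos)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In a setup like this, one probe per evaluation dimension (perceptual quality, editing alignment, content preservation) would be trained against the corresponding MOSs, which matches the paper's claim that intermediate layers carry more perceptually informative signal than the final outputs alone.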
Problem

Research questions and friction points this paper is trying to address.

text-guided image editing
evaluation benchmark
perceptual quality
editing alignment
content preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided image editing
evaluation benchmark
intermediate-layer probing
multimodal LLM
perceptual alignment
👥 Authors
Shiqi Gao
Beihang University
Zitong Xu
Shanghai Jiao Tong University
Image Quality Assessment, Image Editing
Kang Fu
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Huiyu Duan
Shanghai Jiao Tong University
Multimedia Signal Processing
Xiongkuo Min
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Jia Wang
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays