EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current image editing evaluation methods suffer from two key limitations: paired reference images offer narrow coverage and inherit biases from prior generative models, while zero-shot vision-language model (VLM) evaluation is often imprecise on instruction following, content consistency, and visual quality. To address these issues, the paper proposes EdiVal-Agent, an object-centric, interpretable, and scalable evaluation framework for multi-turn image editing. Its core contributions are: (1) semantic decomposition of a source image into objects, from which diverse, context-aware editing instructions are synthesized; (2) modular integration of VLMs, open-vocabulary detectors, semantic feature extractors, and human preference models, enabling dynamic tool composition; and (3) the EdiVal-Bench benchmark, covering nine instruction types and eleven editing models. Experiments demonstrate strong agreement between EdiVal-Agent scores and human judgments (Spearman's ρ > 0.92), more reliable detection of editing failure modes, and actionable guidance for developing next-generation editing models.

📝 Abstract
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images, resulting in limited coverage and inheriting biases from prior generative models, or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline's modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.
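The abstract describes three independent scoring tools per editing turn: a VLM-plus-detector signal for instruction following, a semantic-feature signal for content consistency, and a human-preference-model signal for visual quality. A minimal sketch of how such per-turn signals might be aggregated over a multi-turn session; `TurnScores`, `evaluate_multi_turn`, and the equal weighting are illustrative assumptions, not the paper's actual scoring rule:

```python
from dataclasses import dataclass

@dataclass
class TurnScores:
    instruction_following: float  # VLM + open-vocabulary detector agreement
    content_consistency: float    # semantic-feature similarity of unedited objects
    visual_quality: float         # human-preference-model score

def evaluate_multi_turn(turns, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Aggregate per-turn tool scores into one trajectory score.

    Each turn is scored independently, so the per-turn list also
    exposes degradation across successive edits.
    """
    w_if, w_cc, w_vq = weights
    per_turn = [
        w_if * t.instruction_following
        + w_cc * t.content_consistency
        + w_vq * t.visual_quality
        for t in turns
    ]
    return sum(per_turn) / len(per_turn), per_turn

# Toy 3-turn editing session with made-up tool outputs.
session = [
    TurnScores(0.9, 0.95, 0.8),
    TurnScores(0.7, 0.85, 0.75),
    TurnScores(0.5, 0.6, 0.7),   # quality drifts in later turns
]
overall, per_turn = evaluate_multi_turn(session)
print(round(overall, 2))  # → 0.75
```

Keeping the three signals separate until the final aggregation is what makes the framework interpretable: a low trajectory score can be traced back to the specific turn and the specific tool (following, consistency, or quality) that flagged it.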
Problem

Research questions and friction points this paper is trying to address.

Automated evaluation of instruction-based image editing
Overcoming limitations of reference images and VLM assessments
Providing fine-grained object-centric multi-turn editing analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric framework for automated multi-turn editing evaluation
Integrates VLMs with object detectors for instruction assessment
Modular design with expert tools for scalable fine-grained evaluation