🤖 AI Summary
Instruction-guided image editing lacks efficient, reproducible automated evaluation. Existing open-source vision-language models (VLMs) align poorly with human judgments, proprietary models are black-box and costly, and the field lacks both large-scale public training data and a unified benchmark.
Method: We propose ADIEE, which combines (1) an automated dataset-creation pipeline yielding the first open-source training set for edit evaluation, with over 100K samples; (2) a scorer built on LLaVA-NeXT-8B that regresses a numeric score from the hidden state of a custom token (sketched below); and (3) fine-tuning on the generated data to align scores with human preferences.
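A minimal sketch of how such a regression readout can work, assuming the custom token behaves like the special-token readouts of prior VLM work: the decoder hidden state at the score token's position is passed through a small MLP head trained with an MSE loss. `ScoreHead`, the hidden size, and the regression targets are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Hypothetical regression head: maps the decoder hidden state of a
    custom [SCORE] token to a scalar edit-quality score."""
    def __init__(self, hidden_dim: int = 4096):  # LLaVA-NeXT-8B-sized width (assumed)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor, score_pos: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the VLM decoder
        # score_pos:     (batch,) index of the [SCORE] token in each sequence
        token_h = hidden_states[torch.arange(hidden_states.size(0)), score_pos]
        return self.mlp(token_h).squeeze(-1)  # (batch,) scalar scores

# Toy usage with random activations standing in for real VLM hidden states.
h = torch.randn(2, 16, 4096)
pos = torch.tensor([5, 9])                        # [SCORE] token positions
scores = ScoreHead()(h, pos)
loss = nn.functional.mse_loss(scores, torch.tensor([7.5, 3.0]))  # regression targets
```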
Results: ADIEE improves score correlation with human ratings on AURORA-Bench by 0.0696 (+17.24% relative); pairwise comparison accuracy rises by 4.03 points (+7.21%) on GenAI-Bench and 4.75 points (+9.35%) on AURORA-Bench over the prior state of the art. As a general-purpose reward model, ADIEE enables automated best-edit selection (see the sketch below) and editing-model fine-tuning, lifting MagicBrush's average ImagenHub score from 5.90 to 6.43 (+8.98%).
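The most direct reward-model use is best-of-N edit selection. Below is a minimal sketch assuming a `scorer(source, edit, instruction) -> float` interface; the function names and the placeholder heuristic are hypothetical, not ADIEE's API.

```python
def select_best_edit(scorer, source_image, instruction, candidates):
    """Hypothetical best-of-N selection: score every candidate edit of the
    same source/instruction pair and keep the highest-rated one."""
    scores = [scorer(source_image, cand, instruction) for cand in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]

# Toy stand-in scorer; a real one would run the fine-tuned VLM and read
# out the regression score from the custom token (see sketch above).
def dummy_scorer(src, edit, instr):
    return float(len(edit))  # placeholder heuristic, not a real metric

best, score = select_best_edit(dummy_scorer, "src.png", "make the sky purple",
                               ["edit_a.png", "edit_b.png", "edit_c_long.png"])
```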
📝 Abstract
Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs for this task; only small benchmarks with divergent evaluation schemes are available. To address this, we introduce ADIEE, an automated dataset-creation approach whose output we use to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24% relative) gain in score correlation with human ratings on AURORA-Bench, and improving pairwise comparison accuracy by 4.03 points (+7.21%) on GenAI-Bench and 4.75 points (+9.35%) on AURORA-Bench, respectively, over the state of the art. The scorer can act as a reward model, enabling automated best-edit selection and model fine-tuning. Notably, the proposed scorer boosts the MagicBrush model's average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%).
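The abstract states only that the scorer can serve as a reward for model fine-tuning; the exact training objective is not specified here. One plausible pattern is reward-weighted training, sketched below with a toy linear "editor" and random tensors standing in for the MagicBrush setup; the loss form and all names are assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

# Hedged sketch of reward-weighted fine-tuning: weight each sample's
# reconstruction loss by a softmax over (detached) scorer rewards, so
# edits the scorer rates highly dominate the gradient.
editor = nn.Linear(8, 8)                           # toy stand-in for an editing model
opt = torch.optim.AdamW(editor.parameters(), lr=1e-4)

src, tgt = torch.randn(4, 8), torch.randn(4, 8)    # toy source/target "images"
out = editor(src)
with torch.no_grad():
    rewards = -((out - tgt) ** 2).mean(dim=1)      # stand-in for ADIEE scores
    weights = torch.softmax(rewards, dim=0)        # normalized per-sample weights

loss = (weights * ((out - tgt) ** 2).mean(dim=1)).sum()
opt.zero_grad(); loss.backward(); opt.step()
```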