🤖 AI Summary
Existing image editing benchmarks suffer from narrow task coverage, coarse-grained evaluation metrics, and heavy reliance on manual annotation, which limits their scalability and practical utility. To address these limitations, we propose the first comprehensive, fully automated benchmark for image-to-image editing, encompassing ten single- and multi-image editing tasks and thirty disentangled, fine-grained evaluation dimensions. Our framework integrates domain-specific evaluation tools with large multimodal models (LMMs) into a multi-task, multi-dimensional, fully automated hybrid assessment pipeline, empirically validated to align closely with human preferences (Spearman's ρ > 0.85). We systematically evaluate leading image editing models, uncovering inherent trade-offs among fidelity, consistency, and semantic controllability, and publicly release all code, data, and evaluation tools to foster reproducible research and community advancement.
📝 Abstract
Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scope, insufficient evaluation dimensions, and heavy reliance on manual annotation, which significantly constrains their scalability and practical applicability. To address this, we propose **I2I-Bench**, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories spanning both single-image and multi-image editing, (ii) comprehensive evaluation dimensions, including 30 decoupled, fine-grained evaluation dimensions scored by automated hybrid evaluation methods that combine specialized tools with large multimodal models (LMMs), and (iii) rigorous alignment validation, demonstrating the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.
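The alignment validation above hinges on rank correlation between automated benchmark scores and human preference ratings. A minimal sketch of how such a check might be computed is below; the per-model score lists are illustrative placeholders, not data from the paper, and the helper names (`rankdata`, `spearman_rho`) are our own.

```python
# Hypothetical sketch: checking that automated benchmark scores agree
# with human preference ratings via Spearman's rank correlation.
# Score lists are illustrative, not taken from the paper.

def rankdata(values):
    """Assign 1-based average ranks to values, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative per-model scores: automated metric vs. human preference.
automated = [0.82, 0.61, 0.74, 0.55, 0.90]
human = [0.80, 0.58, 0.70, 0.60, 0.88]
print(f"Spearman's rho = {spearman_rho(automated, human):.3f}")
```

In practice a library routine such as `scipy.stats.spearmanr` would be used, which also reports a p-value; the pure-Python version here just makes the rank-then-correlate computation explicit.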