UniREditBench: A Unified Reasoning-based Image Editing Benchmark

📅 2025-11-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image editing benchmarks focus narrowly on single-object attribute manipulation, neglecting multi-object interactions and rule-driven scenarios, and rely solely on text-based references for evaluation, leading to unreliable assessments. Method: We propose UniREditBench, the first multimodal image editing benchmark explicitly designed for implicit reasoning, covering both real-world and game-world scenes. It systematically formalizes complex reasoning tasks across eight primary and eighteen fine-grained dimensions, and introduces a multimodal dual-reference evaluation framework that leverages both image and text references to improve assessment robustness. Contribution/Results: We construct UniREdit-Data-100K, a large-scale dataset of 100K samples with chain-of-thought annotations. Fine-tuning the Bagel model on this dataset yields UniREdit-Bagel, which significantly outperforms state-of-the-art methods both in-domain and under cross-distribution settings, and our benchmarking comprehensively exposes current models' limitations in complex, reasoning-intensive image editing.

📝 Abstract
Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.
Problem

Research questions and friction points this paper is trying to address.

Evaluating image editing models in complex reasoning scenarios
Addressing limitations of single-object and text-only evaluations
Providing multimodal benchmarks for real and game-world interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dual-reference evaluation for reliable assessment
Automated multi-scenario data synthesis pipeline
Fine-tuned Bagel model with chain-of-thought annotations
Authors
Feng Han (Fudan University, Shanghai Innovation Institute)
Yibin Wang (Intern at UIUC)
Chenglin Li (Shanghai Innovation Institute, Zhejiang University)
Zheming Liang (Shanghai Innovation Institute)
Dianyi Wang (Fudan University, Shanghai Innovation Institute)
Yang Jiao (Fudan University)
Zhipeng Wei (ICSI, UC Berkeley)
Chao Gong (Fudan University)
Cheng Jin (Fudan University, Shanghai Innovation Institute)
Jingjing Chen (Fudan University)
Jiaqi Wang (Shanghai Innovation Institute)