UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-source multimodal image-editing models significantly underperform their proprietary counterparts due to scarce high-quality training data and the absence of comprehensive evaluation benchmarks. To address this, we propose an end-to-end data-construction paradigm: a unified post-hoc verification mechanism built on a 7B dual-task expert model, Qwen-Verify, performs automated failure detection and instruction re-description, and is combined with fine-grained human annotation and controllable synthetic data generation to overcome the scale-quality trade-off. With this pipeline we construct UnicEdit-10M, a 10M-scale, high-fidelity dataset, and UnicBench, the first benchmark targeting spatial and knowledge reasoning in image editing, introducing novel metrics including non-edit consistency and reasoning accuracy. Empirical analysis reveals systematic deficiencies of mainstream models on reasoning-intensive editing tasks. This work establishes critical infrastructure for model diagnosis, evaluation, and iterative improvement.
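The post-hoc verification stage described above could be orchestrated roughly as follows. This is a minimal sketch: `detect_success` and `recaption` are hypothetical method names standing in for the two tasks of the Qwen-Verify model (failure detection and instruction re-description), whose actual interface the page does not specify.

```python
from dataclasses import dataclass

@dataclass
class EditSample:
    source_image: str   # path to the input image
    edited_image: str   # path to the model-edited image
    instruction: str    # editing instruction used to produce the edit

def verify_and_recaption(sample, verifier):
    """Run the verifier's two tasks in sequence: failure detection,
    then instruction re-description for samples that pass."""
    # Task 1: failure detection -- drop edits the verifier rejects.
    if not verifier.detect_success(sample.source_image,
                                   sample.edited_image,
                                   sample.instruction):
        return None
    # Task 2: recaption the instruction so it matches the actual edit.
    sample.instruction = verifier.recaption(sample.source_image,
                                            sample.edited_image)
    return sample

def build_dataset(raw_samples, verifier):
    """Keep only verified samples, with refreshed instructions."""
    verified = (verify_and_recaption(s, verifier) for s in raw_samples)
    return [s for s in verified if s is not None]
```

The single unified verification pass is what lets the pipeline avoid per-tool error propagation: every candidate edit is checked once, after generation, regardless of which source produced it.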

📝 Abstract
With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and of comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data-construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-stage toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.
Problem

Research questions and friction points this paper is trying to address.

Addresses the scale-quality trade-off in multimodal editing datasets
Provides a benchmark for assessing spatial and knowledge reasoning in edits
Diagnoses model weaknesses across diverse editing behaviors and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight pipeline with end-to-end model and verification
7B dual-task expert model for quality control
Novel metrics for fine-grained editing diagnosis
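The page names but does not define the Non-edit Consistency metric. As an illustrative sketch only, one plausible formulation scores pixel agreement between the source and edited images outside the intended edit region; here `edit_mask` is a hypothetical boolean mask of the targeted area, and mean absolute difference is an assumed similarity measure, not the paper's actual formula.

```python
import numpy as np

def non_edit_consistency(source, edited, edit_mask):
    """Mean per-pixel agreement between source and edited images,
    computed only over the region the instruction did NOT target.
    `source`/`edited` are float arrays in [0, 1] with matching shape;
    `edit_mask` is True inside the intended edit region."""
    keep = ~edit_mask
    if not keep.any():          # degenerate case: whole image edited
        return 1.0
    diff = np.abs(source[keep] - edited[keep])
    return float(1.0 - diff.mean())
```

A score near 1.0 means the model left untargeted regions untouched; unintended global changes (color shifts, background drift) pull the score down, which is exactly the failure mode a fine-grained diagnostic metric needs to expose.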
Keming Ye (Zhejiang University)
Zhipeng Huang (Microsoft Research Asia & University of Science and Technology of China)
Canmiao Fu (WeChat Vision, Tencent Inc.)
Qingyang Liu (Shanghai Jiao Tong University)
Jiani Cai (Xinjiang University)
Zheqi Lv (Zhejiang University)
Chen Li (WeChat Vision, Tencent Inc.)
Jing Lyu (Shanghai Jiao Tong University)
Zhou Zhao (Zhejiang University)
Shengyu Zhang (Zhejiang University)