🤖 AI Summary
Open-source multimodal image editing models significantly underperform their proprietary counterparts, largely due to scarce high-quality training data and the absence of comprehensive evaluation benchmarks. To address this, we propose an end-to-end data construction paradigm: a unified post-hoc verification mechanism built on a 7B dual-task expert model, Qwen-Verify, performs automated failure detection and instruction re-description, and is combined with fine-grained human annotation and controllable synthetic data generation to overcome the scale–quality trade-off. On this foundation, we construct UnicEdit-10M, a 10M-scale, high-fidelity dataset, and UnicBench, the first benchmark targeting spatial and knowledge reasoning in image editing, which introduces novel metrics including non-edit consistency and reasoning accuracy. Empirical analysis reveals systematic deficiencies of mainstream models on reasoning-intensive editing tasks. This work establishes critical infrastructure for model diagnosis, evaluation, and iterative improvement.
📝 Abstract
With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and of comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale–quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces chained multi-tool pipelines with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, **Qwen-Verify**, for efficient failure detection and instruction recaptioning. This pipeline yields **UnicEdit-10M**, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose **UnicBench**, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including *Non-edit Consistency* and *Reasoning Accuracy*. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.
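The unified post-verification stage described above can be sketched as a single filtering pass in which one dual-task verifier both rejects failed edits and rewrites instructions to match the actual change. This is a minimal illustration only; the class, function, and callback names below are hypothetical stand-ins, not the paper's actual API.

```python
# Hedged sketch of a unified post-verification stage: a dual-task verifier
# (failure detection + instruction recaptioning) cleans an edit dataset in
# one pass. All names are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditSample:
    source_image: str   # path/ID of the input image
    edited_image: str   # path/ID of the model-edited output
    instruction: str    # the editing instruction used to produce it


def post_verify(
    samples: List[EditSample],
    detect_failure: Callable[[EditSample], bool],
    recaption: Callable[[EditSample], str],
) -> List[EditSample]:
    """Keep only samples the verifier accepts, with refreshed instructions.

    `detect_failure` and `recaption` stand in for the two heads of a
    dual-task expert model (e.g. Qwen-Verify in the paper).
    """
    kept: List[EditSample] = []
    for s in samples:
        if detect_failure(s):
            continue  # drop edits the verifier judges as failed
        # Re-describe the edit so the instruction matches what changed.
        kept.append(EditSample(s.source_image, s.edited_image, recaption(s)))
    return kept


# Toy usage with rule-based stand-ins for the two model heads.
data = [
    EditSample("img0.png", "img0_edit.png", "Make the sky blue"),
    EditSample("img1.png", "img1_edit.png", "FAILED: blur background"),
]
clean = post_verify(
    data,
    detect_failure=lambda s: s.instruction.startswith("FAILED"),
    recaption=lambda s: s.instruction.lower(),
)
print(len(clean), clean[0].instruction)  # → 1 make the sky blue
```

The point of the sketch is the architectural choice in the abstract: rather than per-tool checks scattered through a multi-stage pipeline, a single verification stage at the end handles both quality filtering and instruction alignment.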