AI Summary
Existing image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, making them inadequate for evaluating multimodal models' reasoning and editing capabilities under structured, domain-specific knowledge constraints. To address this gap, this work proposes GRADE, the first discipline-knowledge-driven image editing benchmark, comprising 520 samples across 10 academic domains. We introduce a multidimensional evaluation protocol that integrates both human and automated metrics. A systematic evaluation of 20 state-of-the-art multimodal models on GRADE reveals significant deficiencies in handling implicit, knowledge-intensive editing tasks. The dataset and code are publicly released, establishing a new benchmark and research pathway for advancing multimodal models in domain-specific reasoning and generation.
Abstract
Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning the natural and social sciences. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations under implicit, knowledge-intensive editing settings, with large performance gaps across models. Beyond quantitative scores, we conduct detailed analyses and ablations to expose model shortcomings and identify the constraints specific to disciplinary editing. Together, these results position GRADE as a pointer to key directions for the future development of unified multimodal models, advancing research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
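To make the multi-dimensional protocol concrete, the sketch below shows one plausible way per-sample scores along the three axes (Discipline Reasoning, Visual Consistency, Logical Readability) could be combined into a benchmark score. The class and function names, the [0, 1] score range, and the equal weights are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class SampleScores:
    # Hypothetical per-sample ratings, each assumed to lie in [0, 1].
    discipline_reasoning: float  # correctness of the knowledge-driven edit
    visual_consistency: float    # preservation of regions not targeted by the edit
    logical_readability: float   # clarity and legibility of the edited result

def overall_score(s: SampleScores, weights=(1/3, 1/3, 1/3)) -> float:
    """Weighted average over the three axes (equal weights assumed here)."""
    dims = (s.discipline_reasoning, s.visual_consistency, s.logical_readability)
    return sum(w * d for w, d in zip(weights, dims))

def benchmark_score(samples: list[SampleScores]) -> float:
    """Mean of per-sample overall scores across a benchmark split."""
    return sum(overall_score(s) for s in samples) / len(samples)
```

Reporting each axis separately, alongside such an aggregate, is what lets a benchmark distinguish a model that edits faithfully but breaks the scene from one that preserves the scene but misses the domain knowledge.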