🤖 AI Summary
Existing image editing models struggle to interpret implicit instructions because of their limited capacity for real-world causal reasoning and commonsense knowledge. To address this, this work introduces the WorldEdit dataset and the WorldEdit-Test benchmark, which are the first to formalize image editing tasks grounded in real-world causal relationships. The proposed approach applies a two-stage fine-tuning strategy to models such as Bagel, augmented with a causal verification reward mechanism and knowledge-enhanced training. Experimental results demonstrate that this method significantly improves causal reasoning and knowledge consistency in open-world image editing, achieving state-of-the-art performance in instruction following and factual plausibility, and substantially narrowing the performance gap with advanced systems such as GPT-4o and Nano-Banana.
📝 Abstract
Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning that implicit instructions require. To address this gap, we introduce **WorldEdit**, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide **WorldEdit-Test** for evaluating existing models' performance in causal editing scenarios. With WorldEdit, we use a two-stage training framework to fine-tune models such as Bagel, integrating a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.