🤖 AI Summary
This work addresses a limitation in image editing: object removal, insertion, and relocation are typically modeled separately, which hinders coherent handling of physically grounded effects such as shadows and reflections. We propose CrimEdit, a diffusion-based framework that jointly trains task embeddings for removal and insertion within a single model. These embeddings are used in a classifier-free guidance scheme to strengthen the removal of objects together with their effects and to control how object effects are synthesized during insertion. The two task prompts are also applied to spatially distinct regions, so that object relocation is achieved within a single denoising step. Contributions include: (1) a single model that covers removal, insertion, and relocation; (2) an effect-aware, controllable generation mechanism; and (3) superior object removal, controllable effect insertion, and efficient object movement, with no additional training and no separate removal and insertion stages.
📝 Abstract
Recent work on object removal and insertion has improved performance by handling object effects such as shadows and reflections, using diffusion models trained on counterfactual datasets. However, the performance impact of applying classifier-free guidance to handle object effects across removal and insertion tasks within a unified model remains largely unexplored. To address this gap and improve efficiency in composite editing, we propose CrimEdit, which jointly trains the task embeddings for removal and insertion within a single model and leverages them in a classifier-free guidance scheme, enhancing the removal of both objects and their effects and enabling controllable synthesis of object effects during insertion. CrimEdit also extends these two task prompts so that they can be applied to spatially distinct regions, enabling object movement (repositioning) within a single denoising step. Extensive experiments show that, by employing both guidance techniques, CrimEdit achieves superior object removal, controllable effect insertion, and efficient object movement without requiring additional training or separate removal and insertion stages.
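As a rough illustration of the two ingredients the abstract names, the sketch below combines the standard classifier-free guidance formula with a per-pixel task map that places a removal embedding on one region and an insertion embedding on another. It is a minimal sketch under stated assumptions, not CrimEdit's implementation: `cfg_combine`, `region_task_map`, `toy_denoiser`, the 2-dimensional embeddings, and the choice of a null embedding as the unconditional branch are all hypothetical, and the paper's actual guidance terms may differ.

```python
import numpy as np


def cfg_combine(eps_pos, eps_neg, scale):
    """Standard classifier-free guidance: move from eps_neg toward eps_pos."""
    return eps_neg + scale * (eps_pos - eps_neg)


def region_task_map(src_mask, dst_mask, e_remove, e_insert, e_null):
    """Build an (H, W, D) map of task embeddings (hypothetical layout).

    Pixels in src_mask get the removal embedding, pixels in dst_mask get
    the insertion embedding, and all other pixels get the null embedding.
    Where the masks overlap, the insertion embedding wins.
    """
    h, w = src_mask.shape
    emb = np.broadcast_to(e_null, (h, w, e_null.shape[0])).copy()
    emb[src_mask.astype(bool)] = e_remove
    emb[dst_mask.astype(bool)] = e_insert
    return emb


def toy_denoiser(x, emb_map):
    """Hypothetical stand-in for a diffusion network conditioned per pixel."""
    return 0.1 * x + emb_map.mean(axis=-1)


# Toy setup: remove from the top-left block, insert into the bottom-right block.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
src = np.zeros((4, 4))
src[0:2, 0:2] = 1
dst = np.zeros((4, 4))
dst[2:4, 2:4] = 1
e_remove = np.array([1.0, 0.0])
e_insert = np.array([0.0, 1.0])
e_null = np.zeros(2)

# Conditional branch uses both task regions; unconditional branch uses none.
cond_map = region_task_map(src, dst, e_remove, e_insert, e_null)
null_map = region_task_map(np.zeros_like(src), np.zeros_like(dst),
                           e_remove, e_insert, e_null)

# One guided noise prediction covering both regions in the same pass.
eps = cfg_combine(toy_denoiser(x, cond_map), toy_denoiser(x, null_map), scale=3.0)
```

Because both regions are encoded in one conditioning map, a single guided prediction covers the source and destination of a move, which is the intuition behind avoiding separate removal and insertion stages.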