🤖 AI Summary
Existing text-guided image editing methods exhibit strong semantic understanding but lack fine-grained modeling of local object structures (e.g., parts), limiting precise, controllable localized editing. To address this, we propose a part-aware text-guided image editing framework built on a learnable part-level text token optimization mechanism, combined with inference-time dynamic masking for part localization, latent-space feature fusion, and an adaptive-threshold editing strategy, enabling part-level semantic alignment and granular manipulation. For systematic evaluation, we introduce PartEditBench, the first benchmark dedicated to part-level editing. Experiments demonstrate that our method consistently outperforms state-of-the-art approaches across all metrics on PartEditBench, and user studies show that participants prefer our method over alternatives 77–90% of the time, confirming its effectiveness and perceptual quality.
📝 Abstract
We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Diffusion-based image editing approaches capitalize on diffusion models' deep understanding of image semantics to perform a variety of edits. However, existing diffusion models lack sufficient understanding of many object parts, hindering the fine-grained edits requested by users. To address this, we propose to expand the knowledge of pre-trained diffusion models so that they understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step, localizing the editing region. Leveraging these masks, we design feature-blending and adaptive-thresholding strategies to execute the edits seamlessly. To evaluate our approach, we establish a benchmark and an evaluation protocol for part editing. Experiments show that our approach outperforms existing editing methods on all metrics and is preferred by users 77–90% of the time in the conducted user studies.
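The core inference loop sketched in the abstract, obtaining a part mask from a learned token's attention map via an adaptive threshold, then blending edited and source latents inside that mask, can be illustrated as follows. This is a minimal NumPy sketch under our own assumptions (a simple mean-plus-std threshold and hypothetical function names), not the paper's actual implementation:

```python
import numpy as np

def adaptive_mask(attn_map, k=0.5):
    """Binarize a part token's attention map with a per-step adaptive
    threshold (illustrative rule: mean + k * std of the map)."""
    thresh = attn_map.mean() + k * attn_map.std()
    return (attn_map > thresh).astype(np.float32)

def blend_latents(source_latent, edited_latent, mask):
    """Feature blending: keep edited features inside the part mask,
    source features everywhere else."""
    mask = mask[None, ...]  # broadcast the HxW mask over the channel dim
    return mask * edited_latent + (1.0 - mask) * source_latent

# Toy example: a 4-channel 8x8 latent and an attention map with one
# bright blob standing in for the region the part token attends to.
rng = np.random.default_rng(0)
attn = rng.random((8, 8)) * 0.2      # low background attention
attn[2:5, 2:5] = 0.9                 # strong response on the "part"
mask = adaptive_mask(attn)           # selects only the 3x3 blob
src = np.zeros((4, 8, 8), dtype=np.float32)
edit = np.ones((4, 8, 8), dtype=np.float32)
out = blend_latents(src, edit, mask)  # edit applied only inside the mask
```

In the actual method this blending would happen in the diffusion model's latent space at every denoising step, with the mask recomputed from the learned token's cross-attention at that step.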