🤖 AI Summary
Existing diffusion-based editing methods struggle to achieve smooth, continuous control over editing intensity under text guidance. This work proposes an Adaptive-Origin Guidance (AdaOr) mechanism that modulates editing strength at inference time by interpolating between identity-conditioned and unconditional predictions. AdaOr addresses the discontinuous transitions that arise when conventional Classifier-Free Guidance is applied to editing tasks. Notably, the method requires neither specialized training datasets nor per-edit optimization, yet enables fine-grained, consistent, and fluid control in both image and video editing. Experimental results demonstrate that AdaOr outperforms existing slider-based intensity control strategies in visual quality and controllability.
📝 Abstract
Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) modulates prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that replaces this standard guidance origin with an identity-conditioned adaptive origin, obtained from an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control than current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without per-edit procedures or reliance on specialized datasets.
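The guidance mechanism described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction from the abstract alone, not the authors' code: the function names (`cfg`, `adaor_guidance`), the exact linear interpolation schedule for the origin, and the convention that edit strength `s = 0` yields the identity origin and `s = 1` the standard unconditional origin are all assumptions for illustration.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    # Standard classifier-free guidance: extrapolate from the
    # unconditional prediction (the guidance origin) toward the
    # text-conditioned prediction with guidance scale w.
    return eps_uncond + w * (eps_cond - eps_uncond)

def adaor_guidance(eps_uncond, eps_identity, eps_edit, w, s):
    # Sketch of Adaptive-Origin Guidance (interpolation schedule
    # assumed): the guidance origin is a linear blend between the
    # identity-conditioned prediction (s = 0) and the standard
    # unconditional prediction (s = 1). s is the edit strength.
    origin = (1.0 - s) * eps_identity + s * eps_uncond
    # Guide from the adaptive origin toward the edit prediction.
    return origin + w * (eps_edit - origin)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u, i, e = rng.standard_normal((3, 4, 4))  # toy noise predictions
    # At s = 1 the adaptive origin reduces to the unconditional
    # prediction, recovering standard CFG.
    assert np.allclose(adaor_guidance(u, i, e, 7.5, 1.0), cfg(u, e, 7.5))
    # The output varies continuously with the edit strength s.
    a = adaor_guidance(u, i, e, 7.5, 0.50)
    b = adaor_guidance(u, i, e, 7.5, 0.51)
    assert np.abs(a - b).max() < 0.1 * np.abs(u - i).max()
```

Because the origin varies smoothly with `s`, the guided prediction transitions continuously between an identity-anchored output and the standard edited output, which is the behavior the abstract attributes to AdaOr.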