🤖 AI Summary
To address the challenge of preserving visual consistency during cross-image editing of in-the-wild photographs—where variations in pose, illumination, and environment severely hinder alignment—this paper proposes a training-free editing framework built on diffusion models. The method introduces an inference-time editing paradigm grounded in explicit inter-image correspondence, featuring an attention manipulation module that injects correspondence information into attention computation, together with a refined classifier-free guidance (CFG) denoising strategy that also accounts for the pre-estimated correspondence. The framework is plug-and-play and compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive evaluations on real-world images with diverse poses, lighting conditions, and photography environments demonstrate high-fidelity, visually consistent edits. The code will be publicly released.
📝 Abstract
Consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, such as object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take the pre-estimated correspondence into account. This inference-time algorithm is plug-and-play and compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.
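The core idea of correspondence-guided attention can be sketched in a few lines: queries of the image being edited attend to keys and values of a reference image that have been warped through a pre-estimated token-level correspondence, so that edits land on matching regions. The sketch below is a minimal NumPy illustration of that principle, not the paper's implementation; the function name, tensor shapes, and the plain softmax-attention form are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def corr_guided_attention(q_tgt, k_src, v_src, corr):
    """Hypothetical sketch of correspondence-guided cross-image attention.

    q_tgt        : (N, d) queries from the image being edited
    k_src, v_src : (N, d) keys/values from the reference image
    corr         : (N,) index map; target token i corresponds to
                   source token corr[i] (pre-estimated, e.g. by a
                   correspondence model, before denoising starts)
    """
    # Warp source keys/values into the target's token layout via the
    # explicit correspondence, so attention is spatially aligned.
    k_warp = k_src[corr]                               # (N, d)
    v_warp = v_src[corr]                               # (N, d)
    d = q_tgt.shape[-1]
    attn = softmax(q_tgt @ k_warp.T / np.sqrt(d))      # (N, N)
    return attn @ v_warp                               # (N, d)
```

With an identity `corr`, this reduces to ordinary cross-image attention; a non-trivial `corr` biases the edit toward content at corresponding locations, which is the mechanism the abstract refers to as "using explicit image correspondence to direct editing."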