🤖 AI Summary
This work proposes SourceSwap, a framework for zero-shot object swapping that addresses key limitations of existing methods, namely per-object fine-tuning, slow inference, and reliance on additional paired data. It learns cross-object alignment without videos, multi-view data, or additional images. SourceSwap is self-supervised and source-aware: it synthesizes high-quality pseudo pairs from any single image by applying a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout. The authors then train a dual U-Net with full-source conditioning and a noise-free reference encoder, which enables zero-shot inference and lightweight iterative refinement. In the reported experiments, the method outperforms existing approaches in object fidelity, scene preservation, and object-scene harmony, and it transfers to subject-driven refinement and face swapping. The authors also introduce SourceBench, a high-quality object-swapping benchmark with higher resolution, more categories, and richer interactions.
📝 Abstract
Object swapping aims to replace a source object in a scene with a reference object while preserving object fidelity, scene fidelity, and object-scene harmony. Existing methods either require per-object finetuning and slow inference or rely on extra paired data that mostly depict the same object across contexts, forcing models to rely on background cues rather than learning cross-object alignment. We propose SourceSwap, a self-supervised and source-aware framework that learns cross-object alignment. Our key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images. We then train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment, zero-shot inference without per-object finetuning, and lightweight iterative refinement. We further introduce SourceBench, a high-quality benchmark with higher resolution, more categories, and richer interactions. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony, and it transfers well to edits such as subject-driven refinement and face swapping.
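The abstract does not spell out how the frequency-separated perturbation is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the low-frequency band of the initial noise carries layout, pose, and coarse shape, while the high-frequency band is resampled to change appearance. The function names, the radial `cutoff` parameter, and the choice of a hard FFT mask are all hypothetical.

```python
import numpy as np


def lowpass_mask(height, width, cutoff):
    """Boolean mask selecting spatial frequencies with radius <= cutoff.

    Frequencies are in cycles per sample, as returned by np.fft.fftfreq,
    so the representable range along each axis is [-0.5, 0.5).
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx**2 + fy**2) <= cutoff


def frequency_separated_perturb(noise, cutoff=0.1, seed=0):
    """Keep the low-frequency band of `noise`, resample the rest.

    Assumption (not from the paper): `noise` is a (channels, H, W) array of
    Gaussian noise, and replacing its high-frequency band with fresh Gaussian
    noise stands in for the "frequency-separated perturbation".
    """
    rng = np.random.default_rng(seed)
    fresh = rng.standard_normal(noise.shape)

    mask = lowpass_mask(noise.shape[-2], noise.shape[-1], cutoff)
    spec_old = np.fft.fft2(noise, axes=(-2, -1))
    spec_new = np.fft.fft2(fresh, axes=(-2, -1))

    # Low frequencies come from the original noise, high ones from the fresh draw.
    mixed = np.where(mask, spec_old, spec_new)

    # The mask is symmetric under frequency negation, so the mixed spectrum is
    # Hermitian and the inverse transform is real up to rounding error.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real


if __name__ == "__main__":
    base = np.random.default_rng(1).standard_normal((4, 64, 64))
    perturbed = frequency_separated_perturb(base, cutoff=0.1, seed=2)
    print(perturbed.shape, float(perturbed.std()))
```

In this sketch, denoising `base` and `perturbed` with the same diffusion model would be the step that yields a pseudo pair; that step, and the right value of `cutoff`, depend on details the abstract leaves out.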