🤖 AI Summary
This work addresses the challenge of maintaining spatial consistency in multi-turn image synthesis with diffusion models, where subsequent edits often disrupt previously generated content and physical plausibility. To this end, the authors propose a self-supervised parallel synthesis framework that explicitly models spatial interactions among objects and between objects and the background, enabling high-fidelity paired image generation. The approach centers on an Interaction Transformer to capture spatial dependencies, a mask-guided mixture-of-experts mechanism for localized semantic processing, and an adaptive α-blending strategy to preserve boundary details. Additionally, geometry-aware data augmentation enhances robustness to pose variations. Extensive experiments on virtual try-on, indoor scenes, and street-view synthesis demonstrate that the method significantly outperforms existing techniques, achieving superior generation quality and editing stability.
📝 Abstract
Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive α-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines. Code and data are available at https://github.com/RyanHangZhou/PICS.
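The abstract's region routing and α-blending can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the PICS implementation: it assumes binary object masks and a precomputed per-pixel α map, and composes pixels rather than routing features through transformer experts. The function name `route_and_blend` and all arguments are illustrative.

```python
import numpy as np

def route_and_blend(bg, obj_a, obj_b, mask_a, mask_b, alpha):
    """Compose two object layers over a background by region.

    bg, obj_a, obj_b: (H, W, 3) float images in [0, 1]
    mask_a, mask_b:   (H, W) binary masks marking where each object is present
    alpha:            (H, W) per-pixel blend weight for object A over B
                      in overlap regions (stand-in for PICS's inferred,
                      compatibility-aware alpha)
    """
    # Partition the canvas into the three region types the paper routes
    # to dedicated experts: overlap, exclusive (per object), and background.
    overlap = mask_a * mask_b
    excl_a = mask_a * (1.0 - mask_b)
    excl_b = mask_b * (1.0 - mask_a)
    bg_region = (1.0 - mask_a) * (1.0 - mask_b)

    # Adaptive alpha-blending applies only where the two objects overlap.
    blended = alpha[..., None] * obj_a + (1.0 - alpha[..., None]) * obj_b

    return (bg_region[..., None] * bg
            + excl_a[..., None] * obj_a
            + excl_b[..., None] * obj_b
            + overlap[..., None] * blended)
```

In the actual method the routing operates on features inside the Interaction Transformer and α is predicted rather than supplied, but the region decomposition above mirrors the background/exclusive/overlap partition described in the abstract.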