🤖 AI Summary
This work addresses the challenge of maintaining spatial consistency in multi-turn image synthesis with diffusion models, where subsequent edits often disrupt previously generated content and physical plausibility. To this end, the authors propose a self-supervised parallel synthesis framework that explicitly models spatial interactions among objects and between objects and the background, enabling high-fidelity paired image generation. The approach centers on an Interaction Transformer to capture spatial dependencies, a mask-guided mixture-of-experts mechanism for localized semantic processing, and an adaptive α-blending strategy to preserve boundary details. Additionally, geometry-aware data augmentation enhances robustness to pose variations. Extensive experiments on virtual try-on, indoor scenes, and street-view synthesis demonstrate that the method significantly outperforms existing techniques, achieving superior generation quality and editing stability.
📝 Abstract
Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive α-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines. Code and data are available at https://github.com/RyanHangZhou/PICS.
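The abstract's region routing and α-blending can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the PICS implementation: it assumes binary object masks and a precomputed per-pixel α map, and composes pixels rather than routing features through transformer experts. The function name `route_and_blend` and all arguments are illustrative.

```python
import numpy as np

def route_and_blend(bg, obj_a, obj_b, mask_a, mask_b, alpha):
    """Compose two object layers over a background by region.

    bg, obj_a, obj_b: (H, W, 3) float images in [0, 1]
    mask_a, mask_b:   (H, W) binary masks marking where each object is present
    alpha:            (H, W) per-pixel blend weight for object A over B
                      in overlap regions (stand-in for PICS's inferred,
                      compatibility-aware alpha)
    """
    # Partition the canvas into the three region types the paper routes
    # to dedicated experts: overlap, exclusive (per object), and background.
    overlap = mask_a * mask_b
    excl_a = mask_a * (1.0 - mask_b)
    excl_b = mask_b * (1.0 - mask_a)
    bg_region = (1.0 - mask_a) * (1.0 - mask_b)

    # Adaptive alpha-blending applies only where the two objects overlap.
    blended = alpha[..., None] * obj_a + (1.0 - alpha[..., None]) * obj_b

    return (bg_region[..., None] * bg
            + excl_a[..., None] * obj_a
            + excl_b[..., None] * obj_b
            + overlap[..., None] * blended)
```

In the actual method the routing operates on features inside the Interaction Transformer and α is predicted rather than supplied, but the region decomposition above mirrors the background/exclusive/overlap partition described in the abstract.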