🤖 AI Summary
Existing text-to-image generation methods often suffer from geometric distortions under 3D layout control because they rely on 2D cues or iterative deformation, and they struggle to preserve semantic consistency and editing coherence in multi-object scenes. This work proposes a diffusion-based framework that integrates 3D bounding box guidance with instance-level semantic binding in a unified generation process. By combining Blended Latent Diffusion with IP-Adapter-based reference-image conditioning, the method achieves high-fidelity, layout-accurate multi-object synthesis in a single forward pass. It supports distortion-free object insertion, deletion, and transformation, outperforming current approaches in visual quality, layout fidelity, and interactive controllability.
📝 Abstract
We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff), a novel formulation for jointly enforcing 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity and consistency across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.
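The core idea of binding each object's latent to its projected 3D bounding box region can be sketched as a masked latent composition, in the spirit of Blended Latent Diffusion. The sketch below is a conceptual illustration only, not the paper's implementation: `blend_latents` and all variable names are hypothetical, numpy arrays stand in for diffusion latents, and the projection of 3D boxes to 2D latent-space masks is assumed to happen elsewhere.

```python
import numpy as np

def blend_latents(background, object_latents, masks):
    """Compose per-object latents into a scene latent via masked blending.

    Conceptual stand-in for one blended-diffusion composition step:
    inside each (projected 3D bounding box) mask, the latent generated
    from that object's own text prompt overwrites the background latent.

    background: (C, H, W) scene latent
    object_latents: list of (C, H, W) latents, one per object prompt
    masks: list of (H, W) binary masks from projected 3D bounding boxes
    """
    out = background.copy()
    for lat, m in zip(object_latents, masks):
        # Broadcast the (H, W) mask over channels; keep `lat` inside the box.
        out = np.where(m[None, :, :] > 0, lat, out)
    return out

# Toy usage: one object occupying the center of a 4x4 latent grid.
bg = np.zeros((1, 4, 4))
obj = np.ones((1, 4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
scene = blend_latents(bg, [obj], [mask])
```

In the actual method this blending would occur repeatedly during denoising, with each object's latent conditioned on its own prompt (and, during editing, on an IP-Adapter reference image), so that identity is preserved while the background remains coherent.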