🤖 AI Summary
This paper addresses the challenge of precise object placement and controllable editing in 3D scenes. Methodologically, it introduces a scene-aware diffusion model framework featuring depth/normal-guided visual conditioning and coarse-grained mask-driven local generation to decouple object editing from background preservation; cross-modal text–vision alignment enables joint control over position, pose, and non-rigid deformation. Key contributions include: (i) the first lightweight, geometry-aware visual conditioning signal explicitly designed for 3D object placement—requiring neither fine-grained masks nor complex prompt engineering; and (ii) a multi-dimensional evaluation benchmark for editing quality, specifically constructed for automotive scenes. Experiments demonstrate substantial improvements over baselines: 42% reduction in positional error and 37% reduction in pose error, alongside significant gains in geometric plausibility and background consistency.
📝 Abstract
Image editing approaches have become more powerful and flexible with the advent of powerful text-conditioned generative models. However, placing objects in an environment with a precise location and orientation still remains a challenge, as this typically requires carefully crafted inpainting masks or prompts. In this work, we show that a carefully designed visual map, combined with coarse object masks, is sufficient for high quality object placement. We design a conditioning signal that resolves ambiguities, while being flexible enough to allow for changing of shapes or object orientations. By building on an inpainting model, we leave the background intact by design, in contrast to methods that model objects and background jointly. We demonstrate the effectiveness of our method in the automotive setting, where we compare different conditioning signals in novel object placement tasks. These tasks are designed to measure edit quality not only in terms of appearance, but also in terms of pose and location accuracy, including cases that require non-trivial shape changes. Lastly, we show that fine location control can be combined with appearance control to place existing objects in precise locations in a scene.