AI Summary
This paper addresses the challenge of interactive geometric editing in generated images. We propose a variable-granularity scene representation based on convex 3D primitives, enabling users to manipulate simple geometric elements (e.g., cubes, prisms) to modify scene structure. By integrating depth estimation with geometry-aware texture prompting, our method drives flow-based image generation toward geometrically consistent reconstruction. Unifying differentiable rendering and 3D assembly, the framework supports flexible editing, from global layout to local details. Compared with prior approaches, our method achieves significant improvements in visual fidelity, editing controllability, and compositional generalization. Notably, it excels at preserving object identity, maintaining material consistency, and accurately modeling camera and object motion.
Abstract
We describe Generative Blocks World, a method for interacting with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method conditioned on depth and a texture hint. Our texture hint takes the modified 3D primitives into account, exceeding the texture consistency provided by existing key-value caching techniques. These texture hints (a) allow accurate object and camera moves and (b) largely preserve the identity of depicted objects. Quantitative and qualitative experiments demonstrate that our approach outperforms prior work in visual fidelity, editability, and compositional generalization.
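The core representation (scenes as assemblies of convex 3D primitives that an editor can move) can be illustrated with a minimal sketch. This is not the paper's actual data structure or API; it only shows the standard half-space view of a convex solid, in which containment is a set of linear inequalities and a rigid translation simply shifts each plane offset:

```python
import numpy as np

# Illustrative sketch only: a convex primitive as an intersection of
# half-spaces n . x <= d. Class and function names are hypothetical.

class ConvexPrimitive:
    def __init__(self, normals, offsets):
        self.normals = np.asarray(normals, dtype=float)  # (K, 3) outward plane normals
        self.offsets = np.asarray(offsets, dtype=float)  # (K,) plane offsets

    def contains(self, point):
        # A point lies inside iff it satisfies every half-space constraint.
        return bool(np.all(self.normals @ np.asarray(point, dtype=float) <= self.offsets))

    def translate(self, delta):
        # Translating the solid by delta shifts each offset by n . delta.
        new_offsets = self.offsets + self.normals @ np.asarray(delta, dtype=float)
        return ConvexPrimitive(self.normals, new_offsets)

def axis_aligned_cube(center, half_size):
    # A cube is the intersection of six axis-aligned half-spaces.
    normals = np.vstack([np.eye(3), -np.eye(3)])
    offsets = normals @ np.asarray(center, dtype=float) + half_size
    return ConvexPrimitive(normals, offsets)

# A "scene" is just a list of such primitives; an edit moves one of them.
scene = [axis_aligned_cube((0.0, 0.0, 0.0), 1.0)]
moved = scene[0].translate((2.0, 0.0, 0.0))
print(scene[0].contains((0.0, 0.0, 0.0)))  # True
print(moved.contains((0.0, 0.0, 0.0)))     # False
print(moved.contains((2.0, 0.0, 0.0)))     # True
```

In this view, editing granularity corresponds to how many primitives the scene is decomposed into: a coarse fit uses a few large convex solids, while a fine fit uses many small ones.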